
Argon2: Reconsider MKPC, OMP_SCALE, benchmark hashes #5567

Open
solardiz opened this issue Nov 7, 2024 · 10 comments

@solardiz (Member) commented Nov 7, 2024

I notice that in 6b9f850 we got:

-#define MAX_KEYS_PER_CRYPT      1
+#define MAX_KEYS_PER_CRYPT      2
 
-#define OMP_SCALE               16
+#define OMP_SCALE               8 // tuned w/ MKPC for core i7m

These values puzzle me. For something as slow as Argon2 normally is, I'd expect us to be able to set all of these to 1 without performance loss. Like we have them in the scrypt format.

We should re-test this with different values on different systems.

Also, our default benchmark hashes for the Argon2 formats are rather unusual - they use 4 MiB. In scrypt, we use 16 MiB. We could want to upgrade these to 16 MiB as well, which would contribute to lower MKPC and OMP_SCALE being optimal.

@solardiz solardiz added this to the Potentially 2.0.0 milestone Nov 7, 2024
@magnumripper (Member)

I was actually going to complain about just that (not realizing I was the one who put those figures there). My 8-core laptop and a 4-core workstation I happen to have nearby show speeds just as good with them at 1-1-1, and that's obviously better for heavier costs.

As for test vectors, FWIW KeePassXC defaults to Argon2d, 64 MiB, p=2, and then aims for a t that ends up taking about a second (I think), which can mean t of roughly 10-35 on vanilla hardware. During my work with the keepass formats, I created a test vector matching our current ones for Argon2 in order to see where I'm at (on GPU I only get 10% of Alain's speed). I think I will do the reverse as well: add a keepass-matching test vector to the argon2 formats that can be benchmarked with --cost.

magnumripper added a commit to magnumripper/john that referenced this issue Nov 7, 2024
@solardiz (Member, author) commented Nov 7, 2024

> My 8-core laptop and some 4-core workstation I happen to have nearby show just as fine speeds with them at 1-1-1 and that's obviously better for heavier costs.

Well, I've just tested in a VM on my laptop under other load, and 1-1-1 is usually significantly slower at max threads. But when I limit threads to core count minus 1, it's fine. Indeed, having more work to do per crypt_all lets any fluctuation in threads' performance even out better before the sync point at the end of loop is reached. I think it's similar for other formats we have, so we probably shouldn't tune default OMP_SCALE for best performance under load. Besides, things will be better with real Argon2 hashes, which tend to use way more than 4 MiB.

Edit: I ran these tests on code without the pending memory allocation regression fix, so the (de)allocation was still in the loop. I don't know whether fixing that would affect this.

> add a keepass-matching test vector to the argon2 formats that can be benchmarked with --cost.

I wonder/worry how much it would slow down self-test. Even if not included in benchmarks by default, it would be in tests.

Also, we could need the equivalent of my 1d7397a that we currently only have in the Armory format. At large thread counts, having e.g. 256x 64 MiB still allocated during cracking when actual loaded hashes are e.g. 16 MiB can be quite wasteful. (It was even worse in the Armory format because it uses SIMD to handle multiple "hashes" per thread. So it was easily tens of gigabytes during self-test on a huge AWS instance.)

solardiz pushed a commit that referenced this issue Nov 7, 2024
@magnumripper (Member) commented Nov 7, 2024

We should add a flag field to format_tests. In this case flagging "test vectors that are to be ignored for normal self-test".

@solardiz (Member, author)

I was wondering why, on a system under other load where I get a speedup with --test --format=argon2 --tune=16, I am only getting an OMP_SCALE of 1 (same as our new default, and slower on this system) with --tune=auto. Looks like that's because for auto-tuning we pick a salt with the highest first tunable cost:

        // Find most expensive salt, for auto-tune
        {
                struct db_salt *s = db->salts;

                tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);

                while (s->next && s->cost[0] < tune_cost)
                        s = s->next;
                salt = s->salt;
        }

Why is that? Perhaps so that we don't exceed the maximum crypt_all duration even with the highest cost? But anyway, it's only the highest among test vectors, and the actual hashes can be higher yet.

We could want to reconsider and instead auto-tune using one of the first two test vectors that we use for benchmarking.

@solardiz (Member, author)

Suggested auto-tune patch:

+++ b/src/omp_autotune.c
@@ -126,7 +126,7 @@ void omp_autotune_run(struct db_main *db)
        {
                struct db_salt *s = db->salts;
 
-               tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
+               tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);
 
                while (s->next && s->cost[0] < tune_cost)
                        s = s->next;

which I think makes us auto-tune to first test vector when benchmarking, but to highest first cost hash when cracking.

@solardiz (Member, author)

-               tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
+               tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);

No, this doesn't always work right - it seemed to work for Argon2, but took a 32 MiB test vector for Armory, whereas its first one is 8 MiB. I think the problem is that the salts are not in the same order as the test vectors. Looks like we'll need an extra loop to find the salt corresponding to the first test vector.

@solardiz (Member, author)

We've reset MKPC and OMP_SCALE to 1.

As to benchmark hashes, there doesn't appear to be one common set of Argon2 parameters in use for password hashing. 4 MiB is certainly too low, but at least it's what we've been using so far.

It would be relevant to see what NetBSD uses. But per these blog posts, they may have gone with something ridiculous for that use case (p > 1):

https://blog.netbsd.org/tnf/entry/gsoc_2019_report_incorporating_the
https://blog.netbsd.org/tnf/entry/gsoc_2019_report_update_incorporating

@solardiz (Member, author)

Testing on https://3v4l.org I see that recent PHP uses:

    <?php
    echo PASSWORD_ARGON2_DEFAULT_MEMORY_COST . "\n";
    echo PASSWORD_ARGON2_DEFAULT_TIME_COST . "\n";
    echo PASSWORD_ARGON2_DEFAULT_THREADS . "\n";
    echo 'Argon2i hash: ' . password_hash('password', PASSWORD_ARGON2I) . "\n";
    echo 'Argon2i hash: ' . password_hash('php', PASSWORD_ARGON2I) . "\n";
    echo 'Argon2id hash: ' . password_hash('password', PASSWORD_ARGON2ID) . "\n";
    echo 'Argon2id hash: ' . password_hash('php', PASSWORD_ARGON2ID) . "\n";
Output for 8.3.13 | released 2024-10-24 | took 461 ms, 82.45 MiB
    65536
    4
    1
    Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$PXk4GTSVyqSRp6hbK9JWEw$+W0OjHkdgsT1WGjxnz1Je0myvLqlirRW3N/Y8ZCX3mc
    Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$Z8iaCGW08RJ/cqtrnRAByA$3KDSPE/cJF0Olc+nI7lYVHv/9K94RDOud/o9ztYb7fc
    Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$7Au984bRf8QQjhKA7lwJeA$blEnuab1lTEYzRnTtOup73BM1Ca65sbMY8my0NLed/0
    Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$6wNfsywSbL+ohVxnOlPIpA$rG8ThMfszdwTe9Zprff0UFj3haeIk30oEPtcMedv27Y

(but many of their builds of PHP don't seem to support Argon2 at all - perhaps it's compile-time optional)

There doesn't appear to be a default between 2i and 2id. PASSWORD_DEFAULT still uses bcrypt.

Since Argon2id is newer, I guess if they ever want to make Argon2 the default, this is a more likely choice. So we may want to standardize on that for benchmarks.

@magnumripper (Member)

BTW isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on p) our usual one?

@solardiz (Member, author)

> BTW isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on p) our usual one?

I was thinking of this too, and we could want to have a separate issue for that. Some notes and thoughts:

  1. It's easier to parallelize at JtR level rather than inside a hash computation. We need parallelization at JtR level anyway for efficient computation of hashes with insufficient p.
  2. It's difficult to support arbitrary mixes of p and total thread count efficiently - e.g., it's not always divisible by p.
  3. Yet there are memory savings and maybe speedup to be had if we do. (In OpenCL, we do already benefit from higher p.)
  4. As I recall, the original Argon2 code as submitted to PHC had multi-threading support in it, but this is not in the currently maintained code that we reuse. Also, we would probably want to unify on consistent usage of OpenMP in both Argon2 and our crypt_all, rather than use pre-existing upstream code with whatever threading model they use.
  5. We have similar concerns and potential task for (ye)scrypt, and for the various formats using scrypt and Argon2 as a primitive.
  6. My upstream yescrypt code has OpenMP support in it, which in JtR is currently disabled in favor of usage of OpenMP in crypt_all.
