Full database size #46
Misread this; I was thinking "appearances > 1", not ">= 1".
The sum of appearances was overflowing awk's 32-bit integer, so I'll have to get creative there. I will put these numbers together, but there is a serious caveat to be aware of. Values that appeared fewer than 5 times are far more likely to be garbage data that somehow got into a wordlist as everything was passed around the internet before it ended up in my hands. There is a very high chance that values that appear only once are not passwords but are, for example, entries from a dictionary. Here we have 10 entries that appear about 1000 lines above the bottom of the list.
These are all dictionary words. Yes, they may be used as passwords, but it is highly likely that one of the large, encyclopedic wordlists contained entire dictionaries. The goal of this project is to include passwords that are common, not to build a large encyclopedic list. This is why I set the cutoff for inclusion to 5 appearances or higher. Understand that analysis of "passwords" becomes VERY dubious at low appearance counts.
Thanks for the detailed follow-up! My intention definitely isn't to use the passwords with fewer than 5 occurrences in any sort of analysis; it's more to characterise the size of the database that my analyses are derived from. I hope that makes more sense now :).
I'll figure out how to put those files together for you soon.
I'm doing some analyses based on the appearances data now added, but two specific numbers would be helpful in characterizing the full dataset that these top X appearances are then extracted from.
(1) How many unique passwords (i.e., >=1 appearance) were present in the full database? I.e., the "nearly 13 billion" value, but I would appreciate the specific number.
(2) What is the total number of password appearances in the full database, i.e., the sum of the appearances column across all nearly 13 billion passwords?
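For illustration, here is a minimal sketch of how both numbers could be computed in one pass without any overflow concern. It is not the maintainer's actual tooling. It assumes a hypothetical tab-separated layout (`password<TAB>appearances`, one entry per line), and the function name `summarize_counts` and the `min_appearances` parameter are made up for this example. Python integers are arbitrary precision, so the running sum can grow past any fixed-width limit.

```python
from typing import Iterable, Tuple


def summarize_counts(lines: Iterable[str], min_appearances: int = 1) -> Tuple[int, int]:
    """Return (unique_passwords, total_appearances) for rows meeting the cutoff.

    Assumes each line looks like "password<TAB>appearances" (hypothetical format).
    """
    unique = 0
    total = 0  # Python ints are arbitrary precision, so this sum cannot overflow
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the LAST tab so a password that itself contains a tab stays intact.
        password, sep, count_text = line.rpartition("\t")
        if not sep:
            continue  # no tab found: malformed row, skip it
        count = int(count_text)
        if count >= min_appearances:
            unique += 1
            total += count
    return unique, total


# Made-up sample rows, chosen so the sum exceeds the 32-bit range.
sample = ["password\t3000000000", "123456\t2000000000", "zymurgy\t1"]
print(summarize_counts(sample))     # (3, 5000000001)
print(summarize_counts(sample, 5))  # (2, 5000000000)
```

Passing an open file handle instead of `sample` streams the data line by line, so memory use stays flat even for billions of rows. Setting `min_appearances=5` gives the same two numbers restricted to the project's inclusion cutoff.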