1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of cleartext passwords.

Objectives:

Train a generative model.
Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.

Disclaimer: for research purposes only.

In the press

Get the data

Download any Torrent client.
Here is a magnet link you can find on Reddit:
- magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
A checksum list is available here: checklist.chk
Running ./count_total.sh in BreachCompilation should display something like 1,400,553,870 rows.

Get started (processing + deep learning)

Process the data and run the first deep learning model:

# Make sure to install the Python dependencies first. A virtual environment is recommended.
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (this takes a few hours and requires 50GB of free disk space).
./process_and_train.sh <BreachCompilation path>

Data (explanation)

INPUT:   BreachCompilation/
         BreachCompilation is organized as:

         - a/          - folder of emails starting with a
         - a/a         - file of emails starting with aa
         - a/b
         - a/d
         - ...
         - z/
         - ...
         - z/y
         - z/z
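The layout above can be read as a two-level lookup on the first two characters of an email address. The sketch below illustrates that reading only; shard_path is a hypothetical helper, not part of the repository, and it assumes that prefixes are lowercase and that every address has a two-character prefix matching a folder and a file (addresses starting with other symbols may be stored differently).

```python
from pathlib import Path


def shard_path(root, email):
    """Return the file expected to hold `email` under the two-level layout.

    Assumption: the first character selects the folder and the second
    character selects the file inside it, both lowercased.
    """
    prefix = email[:2].lower()
    if len(prefix) < 2:
        raise ValueError("email must have at least two characters")
    return Path(root) / prefix[0] / prefix[1]


if __name__ == "__main__":
    # example.com is a reserved example domain, not data from the corpus.
    print(shard_path("BreachCompilation", "aaron@example.com"))
```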

OUTPUT: - BreachCompilationAnalysis/edit-distance/1.csv
        - BreachCompilationAnalysis/edit-distance/2.csv
        - BreachCompilationAnalysis/edit-distance/3.csv
        [...]
        > cat 1.csv
            1 ||| samsung94 ||| samsung94@
            1 ||| 040384alexej ||| 040384alexey
            1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
            1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
            1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
            1 ||| 8znachnuu ||| 7znachnuu
        EXPLANATION: edit-distance/ contains the password pairs grouped by edit distance.
        1.csv contains all pairs with edit distance = 1 (exactly one insertion, substitution or deletion).
        2.csv => edit distance = 2, and so on.
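To make "edit distance = 1" concrete, here is a minimal sketch of a Levenshtein distance computed with plain dynamic programming, plus a parser for one line of the "|||"-separated format shown above. It is an illustration, not the repository's implementation: the names edit_distance and parse_pair_line are hypothetical, and the parser assumes fields are separated by " ||| " and that passwords have no leading or trailing whitespace.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def parse_pair_line(line):
    """Split 'distance ||| password_a ||| password_b' into its fields."""
    distance, first, second = line.strip().split(" ||| ", 2)
    return int(distance), first, second


if __name__ == "__main__":
    distance, first, second = parse_pair_line("1 ||| samsung94 ||| samsung94@")
    # The stored distance should agree with the recomputed one.
    print(distance == edit_distance(first, second))
```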

        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
        [...]
        > cat 96_per_user.json
        {
            "1.0": [
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "090698d",
                        "090698D"
                    ]
                },
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "5555555555q",
                        "5555555555Q"
                    ]
                }
                [...]
        EXPLANATION: reduce-passwords-on-similar-emails/ contains files named after the first 2 characters of
        the email address. For example, [email protected] will be located in 96_per_user.json.
        Each file lists all the passwords grouped by user and by edit distance.
        For example, [email protected] had 2 passwords: 090698d and 090698D. The edit distance between them is 1.
        The edit_distance and password arrays have the same length: each entry is the distance to the previous
        password, so the first entry is always 0.
        Those files are useful to model how users change passwords over time.
        We can't recover which one was the first password, but a shortest Hamiltonian path algorithm is run
        to detect the most probable password ordering for a user. For example:
        hello => hello1 => hell@1 => hell@11 is the shortest path.
        We assume that users are lazy by nature and that they prefer to change as few characters as possible
        when they update a password.
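The ordering step can be sketched as a brute-force shortest Hamiltonian path over one user's passwords, where the cost of a step is the edit distance between two consecutive passwords. This is only an illustration under stated assumptions, not the code in shp.py: the names shortest_password_path and distances_along are hypothetical, and trying every permutation is practical only for the handful of passwords a single user typically has. Because a path and its reverse have the same cost, the sketch cannot tell which end came first.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def shortest_password_path(passwords):
    """Return (ordering, total_cost) minimising the summed edit distance
    between consecutive passwords. Exhaustive search: O(n!) orderings."""
    best_order, best_cost = list(passwords), float("inf")
    for order in permutations(passwords):
        cost = sum(edit_distance(x, y) for x, y in zip(order, order[1:]))
        if cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost


def distances_along(order):
    """Distance to the previous password, with a leading 0 for the first one."""
    return [0] + [edit_distance(x, y) for x, y in zip(order, order[1:])]


if __name__ == "__main__":
    order, cost = shortest_password_path(["hell@1", "hello", "hell@11", "hello1"])
    print(order, cost, distances_along(order))
```

The distances_along helper mirrors the shape of the edit_distance array in the JSON excerpt, where the first entry is 0 because the first password has no predecessor.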

Run the data processing alone:

python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis

If the dataset is too big for you, you can set --max_num_files to a value between 0 and 2000.

Make sure you have enough free memory (8GB should be enough).
It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
The uncompressed output is around 45GB.
