This database is 🚧️ under development! 🚧️
It will eventually be added to 🥑️ Dinosaur Datasets. 🥑️
The data files are organized by repository under data. The instructions below describe how to generate them. Create a Python environment and install dependencies:
pip install -r requirements.txt
You'll need to create a "drivers" directory inside scripts and download the chromedriver (matching your browser version) into it. Then run the parsing script, customizing the matrix of search terms as needed. You should have the chromedriver installed, all browsers closed, and be prepared to log in to GitHub.
cd scripts/
python search.py
Then, from the repository root, download files, targeting the output file of interest:
python scripts/get_jobspecs.py ./scripts/data/raw-links-may-23.json --outdir ./data
Note that the current data is just from early runs! The first (trial) run yielded 11k+ unique results. The second run brought that up to 19,544. When I added more applications, the count was 25k halfway through the run. The current total is 31,932 scripts. I didn't add the last run of flux because I saw what I thought were false positives.
Also try to get the associated GitHub files:
python scripts/get_jobspec_configs.py
Word2Vec is a little old, and I think one flaw here is that it combines jobspecs. But if we size the window correctly, we can make associations between nearby terms. The space I'm worried about is the boundary between the end of one script and the beginning of the next, and maybe a different approach or strategy could help with that. To generate the word2vec embeddings, run:
python scripts/word2vec.py --input ./data
Updates to the above on June 9th (sketched in code after the list):
- Better parsing for tokenization
- Tokens are joined by a space instead of an empty string, so words at line ends are no longer merged (this was a bug)
- Punctuation that should be replaced by a space instead of an empty string (dashes, underscores, etc.) is now honored
- Shell hash-bangs are parsed out
- Better tokenization and recreation of content
- Each script is on one line (akin to how it is done for word2vec)
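The sketch, where the function name and exact punctuation set are illustrative rather than the script's actual logic:

```python
import re

def tokenize(script_text):
    # Parse out shell hash-bangs
    lines = [x for x in script_text.splitlines() if not x.startswith("#!")]
    # Join with a space (not an empty string) so words at line ends are not merged
    text = " ".join(lines)
    # Replace separator punctuation (dashes, underscores, etc.) with a space
    text = re.sub(r"[-_/=.,:;()\[\]{}]", " ", text)
    return text.split()
```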
I think it would be reasonable to create a similarity matrix, specifically of cosine distances between the vectors. This will read in the metadata.tsv and vectors.tsv we just generated:
python scripts/vector_matrix.py --vectors ./scripts/data/combined/vectors.tsv --metadata ./scripts/data/combined/metadata.tsv
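As a minimal sketch of that computation (assuming vectors.tsv holds one embedding per row and metadata.tsv one term per row, with no headers):

```python
import numpy as np

vectors = np.loadtxt("scripts/data/combined/vectors.tsv", delimiter="\t")
with open("scripts/data/combined/metadata.tsv") as fd:
    terms = [line.strip() for line in fd]

# Normalize rows so the full similarity matrix is a single dot product
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T  # similarity[i, j] = cosine similarity of terms i and j
```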
The above does the following (a rough sketch follows the list):
- We start with our jobspecs, tokenized according to the above.
- We further remove anything that is purely numerical.
- We use TF-IDF to reduce the feature space to 300 terms.
- We cluster these terms to generate the resulting plot.
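The sketch, where the input path and cluster count are assumptions and the actual script may differ in its details:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one tokenized jobspec per line, numeric tokens already removed
with open("scripts/data/combined/jobspecs.txt") as fd:
    documents = [line.strip() for line in fd]

# Reduce the feature space to 300 terms with TF-IDF
matrix = TfidfVectorizer(max_features=300).fit_transform(documents)

# Cluster in the reduced space (the cluster count here is an assumption)
labels = KMeans(n_clusters=20, n_init=10).fit_predict(matrix)
```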
The hardest thing is just seeing all the terms. I messed with JavaScript for a while but gave up for the time being; the data is too big for the browser, and we likely need to use canvas.
I thought it would be interesting to explicitly parse the directives. That's a bit hard, but I took a first shot:
python scripts/parse_directives.py --input ./data
Assessing 33851 contender jobscripts...
Found (and skipped) 535 duplicates.
You can find the tokenized lines (one jobspec per line), the directive counts, and the dictionary and skips in scripts/data/combined/
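As a rough sketch of the directive-parsing idea (not the parse_directives.py internals; the directive set here is illustrative):

```python
import re

# Collect scheduler directives such as #SBATCH, #PBS, or #BSUB from a jobscript
DIRECTIVE = re.compile(r"^#(SBATCH|PBS|BSUB)\s+(\S.*)$")

def parse_directives(path):
    found = []
    with open(path, errors="ignore") as fd:
        for line in fd:
            match = DIRECTIVE.match(line.strip())
            if match:
                found.append((match.group(1), match.group(2)))
    return found
```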
I was thinking about adding doc2vec, because word2vec is likely making associations between terms in different documents. But I don't think anyone is using doc2vec anymore: the examples I'm finding use a deprecated version of TensorFlow with functions that have long since been removed. We could use the old gensim version, but I think it might be better to consider a more modern approach. I decided to try top2vec.
# Using a pretrained model (not great, since it lacks jobscript terms)
python scripts/run_top2vec.py
# Build with doc2vec - be careful: we set workers and learn mode (slower) here
python3 scripts/run_top2vec_with_doc2vec.py --speed learn
python3 scripts/run_top2vec_with_doc2vec.py --speed deep-learn
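For reference, here is roughly what those scripts wrap with the Top2Vec API (the input path, worker count, and other parameters are assumptions):

```python
from top2vec import Top2Vec

# Hypothetical input: one jobscript per line
with open("scripts/data/combined/jobspecs.txt") as fd:
    documents = [line.strip() for line in fd]

model = Top2Vec(documents, embedding_model="doc2vec", speed="learn", workers=8)
model.save("scripts/data/combined/wordclouds/top2vec-with-doc2vec-learn.model")
```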
And then to explore (finding matches for a subset of words):
python3 scripts/explore_top2vec.py
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-learn.model
# Deep learn (highest quality vectors); takes about 6-7 hours on a 128 GB RAM CPU instance
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-deep-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-deep-learn.model
For word2vec, the two training modes are (sketched below):
- continuous bag of words (CBOW): we create a window around the word and predict the word from its context
- skip-gram: we create the same window but predict the context from the word (supposedly slower, but with better results)
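As a minimal sketch of the two modes in gensim (the sg flag selects skip-gram; the corpus and parameters here are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: each jobscript is a list of tokens
sentences = [["sbatch", "gpu", "python", "train"], ["srun", "mpi", "hello"]]

# sg=0 is continuous bag of words (the default); sg=1 is skip-gram
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
```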
I had to run this on a large VM for it to work. See the topics in scripts/data/combined/wordclouds. We can likely tweak everything, but I like how this tool approaches the problem (see the docs in ddangelov/Top2Vec).
We can run Gemini across our 33K jobspecs to generate a templatized output for each one:
python scripts/classify-gemini.py
That takes a little over a day to run, and it costs about $25-$30 per run; I did two runs for about $55. Then we can check the model, normalize and visualize the resources we parsed, and compare them to what Gemini says.
python scripts/process-gemini.py
You can then see the data output in scripts/data/gemini-with-template-processed, or use this script to visualize results filtered to those with all, missing, or some wrong values:
# pip install rich
python scripts/inspect-gemini.py
# How to customize
python scripts/inspect-gemini.py --type missing
python scripts/inspect-gemini.py --type wrong
# Print more than 1
python scripts/inspect-gemini.py --type all --number 3
Next, we want to calculate cyclomatic complexity. Since these are akin to bash scripts, we can use shellmetrics. It's not perfect, but I did a few spot checks and the results were what I'd want or expect: the more complex scripts (with arrays, etc.) got higher scores. Since we know our database on LC is now in S3, let's instead write this to a SQLite database with a table that can be queried by path, sha1, or sha256. First, make sure the binary is on your PATH:
mkdir -p ./bin
curl -fsSL https://git.io/shellmetrics > ./bin/shellmetrics
chmod +x ./bin/shellmetrics
export PATH=$PWD/bin:$PATH
Here is example output, when run manually. Note that I think we want the first section, which has the CCN (cyclomatic complexity number) for <main>, the main chunk. In the CSV output, that is the middle row ("<main>"), in the ccn column (value 1 here):
$ shellmetrics data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
==============================================================================
LLOC CCN Location
------------------------------------------------------------------------------
5 1 <main> data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
------------------------------------------------------------------------------
1 file(s), 1 function(s) analyzed. [bash 5.1.16(1)-release]
==============================================================================
NLOC NLOC LLOC LLOC CCN Func File (lines:comment:blank)
total avg total avg avg cnt
------------------------------------------------------------------------------
5 5.00 5 5.00 1.00 1 data/abdullahrkw/FAU-FAPS/ViT/run-job.sh (20:14:1)
------------------------------------------------------------------------------
==============================================================================
NLOC NLOC LLOC LLOC CCN Func File lines comment blank
total avg total avg avg cnt cnt total total total
------------------------------------------------------------------------------
5 5.00 5 5.00 1.00 1 1 20 14 1
------------------------------------------------------------------------------
$ shellmetrics --csv data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
file,func,lineno,lloc,ccn,lines,comment,blank
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","<begin>",0,0,0,20,14,1
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","<main>",0,5,1,0,0,0
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","<end>",0,0,0,20,14,1
Next, generate a database for the files in data:
python scripts/cyclomatic-complexity.py --input ./data --db ./scripts/data/cyclomatic-complexity-github.db
IMPORTANT For the above and the complexity calculation below, duplicates are not removed. We store the sha256 and sha1 so you can de-duplicate yourself!
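For example, a minimal sketch of de-duplicating by sha256 when reading the database (SQLite returns one arbitrary row per group here):

```python
import sqlite3

conn = sqlite3.connect("scripts/data/cyclomatic-complexity-github.db")
# One row per unique script content, using the stored sha256
unique = conn.execute("SELECT name, sha256, ccn FROM jobspecs GROUP BY sha256").fetchall()
conn.close()
```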
The LC data (below) is kind of messy - not sure I like it as much as the database I generated. Someone else can deal with it :)
- Total unique jobspec jsons: 210351
- Total with BatchScript: 116117
cd ./lc
python scripts/cyclomatic-complexity.py --input ./raw/jobdata_json --db ./data/cyclomatic-complexity-lc.db
IMPORTANT Since this is a combination of JSON and .tar files (for which we extract members), the database has an extra column for the jobid, and the original filename path corresponds to the file here. The content we actually read is parsed from the BatchScript directive of the JSON file, which is only the batch portion of the data, to match what we use for GitHub.
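As a minimal sketch of that extraction (the filename and JSON layout here are assumptions based on the description above):

```python
import json

# Hypothetical example file; real data may be a member of a tar archive
with open("raw/jobdata_json/example-job.json") as fd:
    job = json.load(fd)

# Only the batch portion, to match what we use for GitHub
script = job.get("BatchScript")
```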
Examples of reading the two databases:
import sqlite3
conn = sqlite3.connect("scripts/data/cyclomatic-complexity-github.db")
cursor = conn.cursor()
# This gets the field names and metadata
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
[(0, 'id', 'INTEGER', 0, None, 1),
(1, 'name', 'TEXT', 0, None, 0),
(2, 'sha256', 'TEXT', 0, None, 0),
(3, 'sha1', 'TEXT', 0, None, 0),
(4, 'ccn', 'NUMBER', 0, None, 0)]
And this gets the jobspecs (fetching one as an example):
query = cursor.execute("SELECT * from jobspecs;")
query.fetchone()
# rows = query.fetchall()
(1,
'./data/ZIYU-DEEP/reprover-test/2gpu.sh',
'48ef130f0700b606c3b5d4b2a784cd78f97439b935bd7d7df4673d9683d420e1',
'c88fdc41ae091734565c63ed67c51a45de66a07f',
1)
Don't forget to close the connection:
conn.close()
And don't forget that the LC database has an extra field, the jobid (alongside the file name), since some scripts are members of a tar archive:
conn = sqlite3.connect("lc/data/cyclomatic-complexity-lc.db")
cursor = conn.cursor()
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
[(0, 'id', 'INTEGER', 0, None, 1),
(1, 'name', 'TEXT', 0, None, 0),
(2, 'jobid', 'TEXT', 0, None, 0),
(3, 'sha256', 'TEXT', 0, None, 0),
(4, 'sha1', 'TEXT', 0, None, 0),
(5, 'ccn', 'NUMBER', 0, None, 0)]
conn.close()
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614