Merge pull request #131 from nishant-sachdeva/trainCompile
Training IR2Vec with the compile dataset
svkeerthy authored Nov 20, 2024
2 parents b35627c + 8da78c6 commit f0d17bc
Showing 135 changed files with 20,932 additions and 26,928 deletions.
42 changes: 42 additions & 0 deletions docs/comPile.md
@@ -0,0 +1,42 @@
The following guide details the steps followed in training IR2Vec with the ComPile dataset.

# Generating .ll files for re-training
- The git repo `IR2Vec-Version-Upgrade-Checks` contains the scripts required for this process.
- The repo is available [here](https://github.com/IITH-Compilers/IR2Vec-Version-Upgrade-Checks/)
- The relevant scripts and files are in the `ComPile` folder (an example invocation sequence is sketched below).
- `ComPile/collect_dataset_info.py` - Generates the list of all the unique C/C++ files in the dataset.
- `ComPile/save_ir.py` - Downloads the bitcode for all the C/C++ files in the dataset, generates the IR files, saves them to the specified location, and then writes the file names to the `ir_paths.txt` file.
- `ComPile/prep_ir_list.py` - Takes the `ir_paths.txt` file and generates the list of paths of all the IR files in the downloaded dataset.
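
A minimal sketch of running these scripts in order; the invocations below are assumptions (the actual scripts may require arguments such as the dataset location and output paths):

```bash
# Hypothetical invocation order; consult each script for its actual arguments.
cd IR2Vec-Version-Upgrade-Checks/ComPile
python collect_dataset_info.py   # list the unique C/C++ modules in ComPile
python save_ir.py                # download the bitcode, generate .ll files, write ir_paths.txt
python prep_ir_list.py           # turn ir_paths.txt into the final list of .ll paths
```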

- Once the list of all `.ll` file paths has been generated, move to the `seed_embeddings` folder in the main IR2Vec repository. From here, the process involves the following tasks:
- Generating Training Triplets
- Preprocessing the data
- Training on the data and generating a final embedding file.
- Using the embedding file to generate the test oracle.
- Running the testing to verify the validity of the entire upgrade process.

# Generating Training Triplets
- Run the `triplets.sh` script with the relevant changes to update the LLVM version.
- Instructions for running it are available in `seed_embeddings/README.md`.
- One change is required here: since the dataset is large, instead of creating the temp files at `/tmp`, specify a custom path to store them.
- To assist with this change and with running the script, a `gen_triplets.sh` helper script is available in the `ComPile` folder. Copy `gen_triplets.sh` to the `seed_embeddings` folder, make the relevant changes, and run it (a sketch is shown below).
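
A minimal sketch of this setup, assuming `gen_triplets.sh` wraps `triplets.sh` and that the custom temp location is the `${PATH_VAR}/tmp` path used by the modified `triplets.sh` (see the `seed_embeddings/triplets.sh` diff later in this commit); the exact variables and arguments may differ:

```bash
# Hypothetical setup; check gen_triplets.sh for its actual arguments.
cp /path/to/IR2Vec-Version-Upgrade-Checks/ComPile/gen_triplets.sh seed_embeddings/
cd seed_embeddings
export PATH_VAR=/scratch/ir2vec   # assumed scratch location with enough free space
mkdir -p ${PATH_VAR}/tmp          # the modified triplets.sh creates its temp files here
bash gen_triplets.sh
```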

# Breaking triplets.txt
- The generated `triplets.txt` file is extremely large; attempting to open it, or read from it directly, is likely to exhaust the available RAM and crash the system.
- To mitigate this, the following approach is used. Go to the location where your `triplets.txt` is stored and create a new folder, say `split_files`.
- Use the command `split -C 500M triplets.txt split_files/triplets_part -d -a 2 --numeric-suffixes=11 --additional-suffix=.txt`
- This command splits `triplets.txt` into multiple files of at most 500 MB each (without splitting lines) and stores them in the folder, labelled by number.
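
With the options above (two-digit numeric suffixes starting at 11 and an added `.txt` suffix), the resulting files are named like this:

```bash
$ ls split_files/
triplets_part11.txt  triplets_part12.txt  triplets_part13.txt  ...
```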

# Preprocessing the data
- In the previous version, the script `IR2Vec/seed_embeddings/OpenKE/preprocess.py` in the `OpenKE` folder was used to preprocess the data from `triplets.txt`.
- A new script is provided at `IR2Vec/seed_embeddings/OpenKE/preprocess_hybrid.py`. It takes as input the folder of split triplet files created in the previous step, and iterates over all the files in that folder to generate the entity, relation, and training sets in a safe manner, without exhausting the RAM.
- This script can also be used as-is with the previous approach, where a single `triplets.txt` file was sufficient to generate the requisite preprocessed information: just place the file in a folder and pass the folder path to the script.
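
Based on the arguments defined in `preprocess_hybrid.py` (shown later in this commit), an invocation looks like the following; the paths are placeholders:

```bash
# --preprocessed-dir is optional; by default a "preprocessed" folder is created
# next to the triplet folder (with a numeric suffix if it already exists).
python seed_embeddings/OpenKE/preprocess_hybrid.py \
    --tripletFolder /path/to/split_files \
    --preprocessed-dir /path/to/preprocessed
```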

# Train IDs
- Like `triplets.txt`, the `train2id.txt` file is also extremely large, and attempting to open it will likely exhaust the RAM and crash the system.
- To avoid this, the following workaround is used.
- Studying the training code shows that `train2id.txt` is read and all the unique training triplets are extracted from it. With a large `train2id.txt`, this is a likely site of RAM exhaustion and a subsequent system crash.
- To solve this, the script `ComPile/get_uniq_train.sh` is supplied. Copy it to the location of `train2id.txt` and run it with the appropriate path changes. It produces an output of much reduced size containing only the unique training triplets.
- This output can then be used directly with the regular training path.
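
The contents of `get_uniq_train.sh` are not shown in this commit; a minimal sketch of such out-of-core deduplication, assuming it is based on `sort -u` and that the first line of `train2id.txt` holds the triplet count (as written by `preprocess_hybrid.py`), could look like this:

```bash
# Hypothetical sketch; ComPile/get_uniq_train.sh may differ.
mkdir -p ./sort_tmp
# Deduplicate the triplet body without loading the whole file into RAM.
tail -n +2 train2id.txt | sort -u -T ./sort_tmp > train2id_uniq_body.txt
# Re-emit the count header followed by the unique triplets.
{ wc -l < train2id_uniq_body.txt; cat train2id_uniq_body.txt; } > train2id_uniq.txt
rm -rf ./sort_tmp train2id_uniq_body.txt
```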

Once this step is reached, the user can resume training from the original process documented [here](https://github.com/IITH-Compilers/IR2Vec/wiki/version_upgrade_process#training). A helper script, `ComPile/run_training_ray.sh`, is provided to help the user specify log paths and properly formatted parameters, and run the training accordingly.
4 changes: 4 additions & 0 deletions docs/version_upgrade_process.md
@@ -35,6 +35,10 @@ The following guide details the steps followed in upgrading the LLVM version sup
- Once the file has been run, we should have a `preprocessed` folder. Inside this folder, we should have the relevant preprocessed data generated.
- Go ahead and create an empty `embeddings` folder here. This will be relevant for the next step.

We have recently retrained the IR2Vec embeddings with a larger dataset, the [ComPile](https://huggingface.co/datasets/llvm-ml/ComPile) dataset. This dataset is a collection of LLVM IR files from open-source projects and is considerably larger than the dataset used in the original IR2Vec paper. Further details about the retraining process can be found [here](https://github.com/IITH-Compilers/IR2Vec/wiki/comPile.md).

Once the trainIDs, relations, and entities files are generated, we can use them as-is in the training process described before.

# Training
- The next file to run is the `generate_embedding_ray.py` file in the `OpenKE` folder.
- Use the `openKE.yaml` file to create the conda environment.
2 changes: 1 addition & 1 deletion seed_embeddings/OpenKE/generate_embedding.py
@@ -188,7 +188,7 @@ def findRep(src, dest, index_dir):
),
)

-findRep(outfilejson, seedfile, arg_conf.index_dir)
+findRep(outfile, seedfile, arg_conf.index_dir)

print("Training finished...")
print("seed file : ", seedfile)
8 changes: 4 additions & 4 deletions seed_embeddings/OpenKE/generate_embedding_ray.py
@@ -36,11 +36,11 @@ def test_files(index_dir):

print(entities, relations, train)
if not os.path.exists(entities):
-raise Exception("entity2id.txt not found")
+raise Exception(f"{entities} not found")
if not os.path.exists(relations):
-raise Exception("relation2id.txt not found")
+raise Exception(f"{relations} not found")
if not os.path.exists(train):
-raise Exception("train2id.txt not found")
+raise Exception(f"{train} not found")


def train(config, args=None):
@@ -286,7 +286,7 @@ def reformat_embeddings(input_str):
param_space=search_space,
tune_config=TuneConfig(
search_alg=optuna,
-max_concurrent_trials=12,
+max_concurrent_trials=3,
scheduler=scheduler,
num_samples=128,
),
174 changes: 174 additions & 0 deletions seed_embeddings/OpenKE/preprocess_hybrid.py
@@ -0,0 +1,174 @@
# Part of the IR2Vec Project, under the Apache License v2.0 with LLVM
# Exceptions. See the LICENSE file for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
""" This script generates entity2id.txt, train2id.txt and relation2id.txt """

# --tripletFolder : path of the directory containing the triplet files generated by the collectIR pass

import argparse
import os
import shutil


def getEntityDict(config):
uniqueWords = set()
# Iterate over all files in the specified folder
for filename in sorted(os.listdir(config.tripletFolder)):
filepath = os.path.join(config.tripletFolder, filename)
if os.path.isfile(filepath):
print(f"Reading from file {filepath}")
with open(filepath, "r") as file:
for line in file:
words = line.strip().split()
uniqueWords.update(words)

uniqueWords = sorted(uniqueWords)
print(f"Unique entities found {len(uniqueWords)}")

op = open(os.path.join(config.preprocessed_dir, "entity2id.txt"), "w")
entityDict = {}
op.write(str(len(uniqueWords)) + "\n")
for i, word in enumerate(uniqueWords):
op.write(str(word) + "\t" + str(i) + "\n")
entityDict[str(word)] = str(i)
op.close()
return entityDict


def getRelationDict(config):
max_len = 0
for filename in sorted(os.listdir(config.tripletFolder)):
filepath = os.path.join(config.tripletFolder, filename)
if os.path.isfile(filepath):
print(f"Reading from file {filepath}")
with open(filepath, "r") as file:
for line in file:
length = len(line.strip().split(" "))
max_len = max(max_len, length)

maxArgs = max_len - 2
relationDict = {}

op = open(os.path.join(config.preprocessed_dir, "relation2id.txt"), "w")
print(f"Relations - {maxArgs+3}")
op.write(str(maxArgs + 3) + "\n")
relationDict["Type"] = "0"
relationDict["Next"] = "1"

op.write("Type 0\n")
op.write("Next 1\n")
for i in range(maxArgs):
op.write("Arg" + str(i) + "\t" + str(i + 2) + "\n")
relationDict["Arg" + str(i)] = str(i + 2)
op.close()

return relationDict


def create_write_str(a, b, c):
return f"{a}\t{b}\t{c}\n"


def createTrain2ID(entityDict, relationDict, config):
print("Generating train set")
opc = ""
nol = 0
temp_file_path = os.path.join(config.preprocessed_dir, "train2id_temp.txt")

for filename in sorted(os.listdir(config.tripletFolder)):
filepath = os.path.join(config.tripletFolder, filename)
if os.path.isfile(filepath):
print(f"Reading from file {filepath}")
temp_file = os.path.join(config.tempDir, filename)
with open(filepath, "r") as file, open(temp_file, "w") as temp_file:
for sentence in file:
s = sentence.strip().split(" ")
s_len = len(s)
if s and s[0] != "":
if opc != "":
if s[0] not in entityDict:
print(sentence, s, s_len)
print(s[0] + " not found in entityDict")
if "Next" not in relationDict:
print("Next not found in relationDict")
temp_file.write(
create_write_str(
entityDict[opc],
entityDict[s[0]],
relationDict["Next"],
)
)
nol += 1
opc = s[0]
temp_file.write(
create_write_str(
entityDict[opc], entityDict[s[1]], relationDict["Type"]
)
)
nol += 1
for i, arg in enumerate(range(2, s_len)):
temp_file.write(
create_write_str(
entityDict[opc],
entityDict[s[arg]],
relationDict[f"Arg{i}"],
)
)
nol += 1

final_file_path = os.path.join(config.preprocessed_dir, "train2id.txt")
with open(final_file_path, "w") as final_file:
final_file.write(f"{nol}\n")
for filename in sorted(os.listdir(config.tempDir)):
temp_file_path = os.path.join(config.tempDir, filename)
if os.path.isfile(temp_file_path):
with open(temp_file_path, "r") as temp_file:
for line in temp_file:
final_file.write(line)
# Remove the temporary file to clean up
os.remove(temp_file_path)

shutil.rmtree(config.tempDir)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--tripletFolder",
dest="tripletFolder",
metavar="FILE",
help="Path of the directory containing the triplet files generated by the collectIR pass.",
required=True,
)
parser.add_argument(
"--preprocessed-dir",
dest="preprocessed_dir",
metavar="DIRECTORY",
help="Path of the directory where the preprocessed files will be written.",
default=None,
)
config = parser.parse_args()
if config.preprocessed_dir is None:
config.preprocessed_dir = os.path.join(
os.path.dirname(config.tripletFolder), "preprocessed"
)
i = 0
while os.path.exists(config.preprocessed_dir):
i += 1
config.preprocessed_dir = config.preprocessed_dir + str(i)
os.makedirs(config.preprocessed_dir)

# create a temp folder to store train-temp-ids
config.tempDir = os.path.join(os.path.dirname(config.tripletFolder), "temp_train")
i = 0
while os.path.exists(config.tempDir):
i += 1
config.tempDir = config.tempDir + str(i)
os.makedirs(config.tempDir)

ed = getEntityDict(config)
rd = getRelationDict(config)
createTrain2ID(ed, rd, config)

print("Files are generated at the path ", config.preprocessed_dir)
10 changes: 6 additions & 4 deletions seed_embeddings/triplets.sh
@@ -53,7 +53,7 @@ while read p; do
a=0
USED_OPT=()
while [ "$a" -lt "$NO_OF_OPT_FILES" ]; do # this is loop1
-tmpfile=$(mktemp /tmp/IR2Vec-CollectIR.XXXXXXXXXX)
+tmpfile=$(mktemp ${PATH_VAR}/tmp/IR2Vec-CollectIR.XXXXXXXXXX)
opt_index=$((RANDOM % 6))
DEBUG echo "opt_index from $opt_index"
opt=${OPT_LEVELS[$opt_index]}
@@ -70,13 +70,15 @@
fi
USED_OPT[$a]=$opt
DEBUG echo "opt from $opt"
-${LLVM_BUILD}/bin/opt-14 -S -$opt $p -o $tmpfile
+${LLVM_BUILD}/bin/opt-17 -S -$opt $p -o $tmpfile
$COLLECT_BUILD/bin/ir2vec -collectIR -o $4 $tmpfile &>/dev/null
let "a++"
-rm "$tmpfile"
+# rm -rf "$tmpfile"
done &
if [ $counter == 100 ]; then
-sleep 20
+echo "========= PAUSE ========="
+rm -rf ${PATH_VAR}/tmp/IR2Vec-CollectIR*
+sleep 3
counter=0
fi
