This project is an analysis of LLM detection, based on this Kaggle Challenge.
The methodology, analysis, and implementation details, along with explanations and supporting graphs, can be found in the main notebook. The report contains only the supporting graphs for the in-class project presentation.
```
│   README.md - this file
│   report.pdf - generated report file exported for the presentation
│   report.tex - source code for the report
│
├───data - input data
│       cars_generated.csv
│       chatgpt_cars.md
│       chatgpt_electoral.md
│       cluster_augmentation_cars.md
│       cluster_augmentation_election.md
│       elections_generated.csv
│       LLM_generated_essay_PaLM.csv
│       sample_submission.csv
│       test_essays.csv
│       train_essays.csv
│       train_prompts.csv
├───notebooks
│       llm_detection.ipynb - the main notebook the project is built upon
│       loov.ipynb
│       notebook_config.py
│
├───output - output data and figures
│       attribution.png
│       augmentation.csv - formatted generated data
│       augmentation_stats.png
│       clusters.png
│       dataset_size.png
│       diversity_plot.png
│       similarity.png
│       similarity_mean_max.png
├───intermediate - data shared between notebooks
│       best_model.skops
│       loov_input_data.csv
│       loov_res.csv
└───src - personal library
        crawling.py
        data.py
        ml.py
```
The dataset comprises original data produced by the author for the purposes of the project, data provided by the competition, and two datasets contributed by Konstantina Liagkou and Muhammad Rizqi.
The human essays were downloaded from the competition and can be found at data/train_essays.csv.
The raw files containing the downloaded ChatGPT conversations can be found at data/chatgpt_cars.md, data/chatgpt_electoral.md, data/cluster_augmentation_election.md and data/cluster_augmentation_cars.md.
The rest of the input datasets can be found in the data subdirectory.
The final, formatted augmentation dataset can be found at output/augmentation.csv. The file is structured as follows:
| Column | Type | Description |
|---|---|---|
| id | string | A unique identifier for the essay |
| text | string | The text of the essay |
| prompt_id | integer | The prompt the essay was generated from; relates to data/train_prompts.csv |
| generated | integer | Whether the essay was generated by an LLM; always 1 |
| cluster | integer | The assigned cluster of the text, as described in the notebook |
| LLM | string | The LLM that generated the essay |
| source | string | Dataset attribution |
| gold | integer | Whether the essay was used in the final training phase of the LLM detector; these essays are considered the highest quality in the dataset, as explained in the notebook |
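For quick inspection, the file can be loaded with pandas. The snippet below is an illustrative sketch: it builds a tiny in-memory sample following the schema above (the row values are invented for demonstration); for the real file, use `pd.read_csv("output/augmentation.csv")`.

```python
import io

import pandas as pd

# Minimal in-memory sample mirroring the schema of output/augmentation.csv.
# Row contents are placeholders, not real project data.
sample = io.StringIO(
    "id,text,prompt_id,generated,cluster,LLM,source,gold\n"
    'aug_0001,"Example generated essay text.",0,1,2,ChatGPT,author,1\n'
    'aug_0002,"Another generated essay.",1,1,0,PaLM,contributed,0\n'
)
df = pd.read_csv(sample)

# All rows are LLM-generated, so `generated` is always 1.
assert (df["generated"] == 1).all()

# The gold subset is what the final detector training phase used.
gold = df[df["gold"] == 1]
print(len(df), len(gold))  # → 2 1
```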
The two basic prompts, as well as the source texts accompanying them, can be found at data/train_prompts.csv; they will be referred to as original prompts from this point forward. The source texts contained in the same file will be referred to as sources.
Several prompting strategies were used to maximize the variation and quality of the LLM-generated essays. Below we list the more prominent prompt structures:
```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
```

```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Sources:
<sources>
```

```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Use the sources only as inspiration, do not just cite them.
Sources:
<random sample of sources>
```

```
<original prompt>
You are a US high school student. Use language fitting an average US teenager (more informal at places). Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Examples:
<random samples of student essays>
```

```
<original prompt>
You are a US high school student. Use language fitting an average US teenager, imitating the structure and tone of the examples below. Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Use the sources only as inspiration, do not just cite them.
Examples:
<random samples of student essays>
Sources:
<random sample of sources>
```
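The structures above can be sketched as a small helper that assembles a prompt from its components. `build_prompt` and its parameters are hypothetical illustrations, not the project's actual code; the fixed instruction strings are taken from the templates above.

```python
import random


def build_prompt(original_prompt, sources=None, examples=None,
                 persona=False, seed=None):
    """Assemble a generation prompt from the components described above.

    Illustrative sketch only: names, parameters, and sampling sizes are
    assumptions, not the exact code used for the project.
    """
    rng = random.Random(seed)
    parts = [
        original_prompt,
        "Do not include a greeting/title at all. "
        "Remove placeholder tags such as [Your Name] entirely.",
    ]
    if persona:
        # Persona instruction used in the student-imitation variants.
        parts.insert(1, "You are a US high school student. "
                        "Use language fitting an average US teenager.")
    if examples:
        picked = rng.sample(examples, min(2, len(examples)))
        parts.append("Examples:\n" + "\n\n".join(picked))
    if sources:
        parts.append("Use the sources only as inspiration, "
                     "do not just cite them.")
        picked = rng.sample(sources, min(2, len(sources)))
        parts.append("Sources:\n" + "\n\n".join(picked))
    return "\n".join(parts)


prompt = build_prompt(
    "Write a letter arguing for keeping the Electoral College.",
    sources=["Source text A", "Source text B"],
    seed=0,
)
print(prompt)
```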