This project is an analysis of LLM detection, based on this Kaggle Challenge.
The methodology, analysis, and implementation details, along with explanations and supporting graphs, can be found in the main notebook. The report contains only the supporting graphs for the in-class project presentation.
```
│   README.md - this file
│   report.pdf - generated report file exported for the presentation
│   report.tex - source code for the report
│
├───data - input data
│       cars_generated.csv
│       chatgpt_cars.md
│       chatgpt_electoral.md
│       cluster_augmentation_cars.md
│       cluster_augmentation_election.md
│       elections_generated.csv
│       LLM_generated_essay_PaLM.csv
│       sample_submission.csv
│       test_essays.csv
│       train_essays.csv
│       train_prompts.csv
├───notebooks
│       llm_detection.ipynb - the main notebook the project is built upon
│       loov.ipynb
│       notebook_config.py
│
├───output - output data and figures
│       attribution.png
│       augmentation.csv - formatted generated data
│       augmentation_stats.png
│       clusters.png
│       dataset_size.png
│       diversity_plot.png
│       similarity.png
│       similarity_mean_max.png
├───intermediate - data shared between notebooks
│       best_model.skops
│       loov_input_data.csv
│       loov_res.csv
└───src - personal library
        crawling.py
        data.py
        ml.py
```
The dataset comprises original data produced by the author for the purposes of the project, data provided by the competition, and two datasets contributed by Konstantina Liagkou and Muhammad Rizqi.
The human essays were downloaded from the competition and can be found at data/train_essays.csv.
The raw files containing the downloaded ChatGPT conversations can be found at data/chatgpt_cars.md, data/chatgpt_electoral.md, data/cluster_augmentation_election.md and data/cluster_augmentation_cars.md.
The rest of the input datasets can be found in the data subdirectory.
The final, formatted augmentation dataset can be found at output/augmentation.csv. The file is structured as follows:
| Column | Type | Description |
|---|---|---|
| id | string | A unique identifier for the essay |
| text | string | The text of the essay |
| prompt_id | integer | The prompt the essay was generated from; relates to data/train_prompts.csv |
| generated | integer | Whether the essay was generated by an LLM; always 1 |
| cluster | integer | The assigned cluster of the text, as described in the notebook |
| LLM | string | The LLM that generated the essay |
| source | string | Dataset attribution |
| gold | integer | Whether the essay was used in the final training phase of the LLM detector; these essays are considered the highest quality in the dataset, as explained in the notebook |
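For quick inspection, the file can be loaded with pandas. The snippet below is an illustrative sketch: it builds a tiny in-memory sample following the schema above (the row values are invented for demonstration); for the real file, use `pd.read_csv("output/augmentation.csv")`.

```python
import io

import pandas as pd

# Minimal in-memory sample mirroring the schema of output/augmentation.csv.
# Row contents are placeholders, not real project data.
sample = io.StringIO(
    "id,text,prompt_id,generated,cluster,LLM,source,gold\n"
    'aug_0001,"Example generated essay text.",0,1,2,ChatGPT,author,1\n'
    'aug_0002,"Another generated essay.",1,1,0,PaLM,contributed,0\n'
)
df = pd.read_csv(sample)

# All rows are LLM-generated, so `generated` is always 1.
assert (df["generated"] == 1).all()

# The gold subset is what the final detector training phase used.
gold = df[df["gold"] == 1]
print(len(df), len(gold))  # → 2 1
```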
The two basic prompts, as well as the source texts accompanying them, can be found at data/train_prompts.csv; they will be referred to as original prompts from this point forward. The source texts contained in the same file will be referred to as sources.
Several prompting strategies were used to maximize the variation and quality of the LLM-generated essays. Below we list the more prominent prompt structures:
```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
```

```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Sources:
<sources>
```

```
<original prompt>
Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Use the sources only as inspiration, do not just cite them.
Sources:
<random sample of sources>
```

```
<original prompt>
You are a US high school student. Use language fitting an average US teenager (more informal at places). Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Examples:
<random samples of student essays>
```

```
<original prompt>
You are a US high school student. Use language fitting an average US teenager, imitating the structure and tone of the examples below. Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely.
Use the sources only as inspiration, do not just cite them.
Examples:
<random samples of student essays>
Sources:
<random sample of sources>
```
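The structures above can be sketched as a small helper that assembles a prompt from its components. `build_prompt` and its parameters are hypothetical illustrations, not the project's actual code; the fixed instruction strings are taken from the templates above.

```python
import random


def build_prompt(original_prompt, sources=None, examples=None,
                 persona=False, seed=None):
    """Assemble a generation prompt from the components described above.

    Illustrative sketch only: names, parameters, and sampling sizes are
    assumptions, not the exact code used for the project.
    """
    rng = random.Random(seed)
    parts = [
        original_prompt,
        "Do not include a greeting/title at all. "
        "Remove placeholder tags such as [Your Name] entirely.",
    ]
    if persona:
        # Persona instruction used in the student-imitation variants.
        parts.insert(1, "You are a US high school student. "
                        "Use language fitting an average US teenager.")
    if examples:
        picked = rng.sample(examples, min(2, len(examples)))
        parts.append("Examples:\n" + "\n\n".join(picked))
    if sources:
        parts.append("Use the sources only as inspiration, "
                     "do not just cite them.")
        picked = rng.sample(sources, min(2, len(sources)))
        parts.append("Sources:\n" + "\n\n".join(picked))
    return "\n".join(parts)


prompt = build_prompt(
    "Write a letter arguing for keeping the Electoral College.",
    sources=["Source text A", "Source text B"],
    seed=0,
)
print(prompt)
```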