LLM Detection

This project is an analysis of LLM detection, based on this Kaggle Challenge.

The methodology, analysis, and implementation details, along with explanations and supporting graphs, can be found in the main notebook. The report contains only the supporting graphs for the in-class project presentation.

Project Structure

│   README.md - this file
│   report.pdf - generated report file exported for presentation
│   report.tex - source code for report file
│
├───data - input data 
│       cars_generated.csv
│       chatgpt_cars.md
│       chatgpt_electoral.md
│       cluster_augmentation_cars.md
│       cluster_augmentation_election.md
│       elections_generated.csv
│       LLM_generated_essay_PaLM.csv
│       sample_submission.csv
│       test_essays.csv
│       train_essays.csv
│       train_prompts.csv
├───notebooks
│       llm_detection.ipynb - The main notebook the project is built upon
│       loov.ipynb
│       notebook_config.py
│
├───output - output data and figures
│       attribution.png
│       augmentation.csv - Formatted generated data
│       augmentation_stats.png
│       clusters.png
│       dataset_size.png
│       diversity_plot.png
│       similarity.png
│       similarity_mean_max.png
├───intermediate - between-notebook shared data
│       best_model.skops
│       loov_input_data.csv
│       loov_res.csv
└───src - Personal library
        crawling.py
        data.py
        ml.py

The Dataset

The dataset combines original data produced by the author for this project, data provided by the competition, and two datasets provided by Konstantina Liagkou and Muhammad Rizqi.

Human Essays

The human essays were downloaded from the competition and can be found at data/train_essays.csv.

Generated Essays

The raw files containing the downloaded ChatGPT conversations can be found at data/chatgpt_cars.md, data/chatgpt_electoral.md, data/cluster_augmentation_election.md and data/cluster_augmentation_cars.md.

The rest of the input datasets can be found in the data subdirectory.

The final, formatted augmentation dataset can be found at output/augmentation.csv. The file is structured as follows:

| Column | Type | Description |
|---|---|---|
| id | string | A unique identifier for the essay |
| text | string | The text of the essay |
| prompt_id | integer | Which prompt the essay was generated from, relates to data/train_prompts.csv |
| generated | integer | Whether the essay was generated by an LLM, all 1 |
| cluster | integer | The assigned cluster of the text, as described in the notebook |
| LLM | string | The LLM which generated the essay |
| source | string | Dataset attribution |
| gold | integer | Whether the dataset was used in the final training phase of the LLM detector. These essays are considered the highest quality amongst the dataset, as explained in the notebook |
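
As a quick sanity check on the schema above, the following sketch reads the file with Python's standard `csv` module and converts the integer columns. It is not taken from the project code: the names `parse_rows` and `load_augmentation` are hypothetical, and it assumes that every integer column is populated and stored as a plain integer literal (for example `1`, not `1.0`).

```python
import csv
from typing import Dict, Iterable, List, Union

# Columns as documented in the README; integer columns are converted on load.
STRING_COLUMNS = ("id", "text", "LLM", "source")
INTEGER_COLUMNS = ("prompt_id", "generated", "cluster", "gold")

Row = Dict[str, Union[str, int]]


def parse_rows(lines: Iterable[str]) -> List[Row]:
    """Parse CSV lines into typed rows, checking the documented schema."""
    reader = csv.DictReader(lines)
    expected = set(STRING_COLUMNS) | set(INTEGER_COLUMNS)
    missing = expected - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    rows: List[Row] = []
    for raw in reader:
        row: Row = {name: raw[name] for name in STRING_COLUMNS}
        for name in INTEGER_COLUMNS:
            # Assumption: values are plain integer literals.
            row[name] = int(raw[name])
        # The README states that every essay in this file is LLM-generated.
        if row["generated"] != 1:
            raise ValueError(f"essay {row['id']} has generated != 1")
        rows.append(row)
    return rows


def load_augmentation(path: str = "output/augmentation.csv") -> List[Row]:
    """Read the augmentation dataset (path relative to the repository root)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return parse_rows(handle)
```

With the file in place, the essays used in the final training phase could then be selected with `[row for row in load_augmentation() if row["gold"] == 1]`.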

Prompts used

The two basic prompts, as well as the source texts accompanying them, can be found at data/train_prompts.csv and will be referred to as original prompts from this point forward. The source texts contained in the same file will be referred to as sources.

Many prompting strategies were used to maximize the variation and quality of the LLM-generated essays. The more prominent prompt structures are listed below:

Simple prompt

<original prompt>

Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely. 

Simple prompt with sources

<original prompt>

Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely. 

Sources:
<sources>

Simple prompt with less reliance on sources

<original prompt>

Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely. 

Use the sources only as inspiration, do not just cite them.

Sources:
<random sample of sources>

Role prompt

<original prompt>

You are a US high school student. Use language fitting an average US teenager (more informal at places). Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely. 

Examples:
<random samples of student essays>

Role prompt with sources

<original prompt>

You are a US high school student. Use language fitting an average US teenager, imitating the structure and tone of the examples below. Do not include a greeting/title at all. Remove placeholder tags such as [Your Name] entirely. 

Use the sources only as inspiration, do not just cite them.

Examples:
<random samples of student essays>

Sources:
<random sample of sources>
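
The five structures above share the same building blocks, so they can be assembled from one template function. The sketch below is illustrative only and is not the project's actual implementation: `build_prompt`, its parameters, and the constant names are hypothetical, and the way sources and examples are sampled and joined is an assumption.

```python
import random
from typing import List, Optional, Sequence

CLEANUP = (
    "Do not include a greeting/title at all. "
    "Remove placeholder tags such as [Your Name] entirely."
)
ROLE_INFORMAL = (
    "You are a US high school student. Use language fitting an average US "
    "teenager (more informal at places)."
)
ROLE_IMITATE = (
    "You are a US high school student. Use language fitting an average US "
    "teenager, imitating the structure and tone of the examples below."
)
INSPIRATION = "Use the sources only as inspiration, do not just cite them."


def _sample(items: Sequence[str], k: Optional[int], rng: random.Random) -> List[str]:
    """Return all items, or a random sample of k items when k is smaller."""
    if k is None or k >= len(items):
        return list(items)
    return rng.sample(list(items), k)


def build_prompt(
    original_prompt: str,
    sources: Optional[Sequence[str]] = None,
    examples: Optional[Sequence[str]] = None,
    inspiration_only: bool = False,
    n_sources: Optional[int] = None,
    n_examples: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Assemble one of the prompt structures listed in the README."""
    rng = rng or random.Random()
    sections = [original_prompt]

    # Role prompts prepend the student persona to the cleanup instruction.
    if examples:
        role = ROLE_IMITATE if sources else ROLE_INFORMAL
        sections.append(f"{role} {CLEANUP}")
    else:
        sections.append(CLEANUP)

    if sources and inspiration_only:
        sections.append(INSPIRATION)
    if examples:
        sections.append("Examples:\n" + "\n\n".join(_sample(examples, n_examples, rng)))
    if sources:
        sections.append("Sources:\n" + "\n\n".join(_sample(sources, n_sources, rng)))
    return "\n\n".join(sections)
```

Under these assumptions, `build_prompt(p)` yields the simple prompt, `build_prompt(p, sources=s)` adds the sources, and `build_prompt(p, sources=s, examples=e, inspiration_only=True, n_sources=2)` yields the role prompt with a random sample of sources.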