🔥 TENOR is a user interface for speeding up the document labeling process and reducing the number of documents that need to be labeled. See the paper for details:
- Zongxia Li et al. (2024). Beyond Automated Evaluation Metrics: Evaluating Topic Models On Practical Social Science Content Analysis Tasks. EACL. https://arxiv.org/abs/2401.16348
If you find this tool helpful, you can cite the following paper:
@misc{li2024automated,
title={Beyond Automated Evaluation Metrics: Evaluating Topic Models On Practical Social Science Content Analysis Tasks},
author={Zongxia Li and Andrew Mao and Daniel Stephens and Pranav Goel and Emily Walpole and Alden Dima and Juan Fung and Jordan Boyd-Graber},
year={2024},
eprint={2401.16348},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This tool's frontend interface is adapted from this repository. To run the app interface locally with the default Bills dataset, follow these steps:
git clone https://github.com/Pinafore/2023-document-annotation.git
cd 2023-document-annotation
pip install -r requirements.txt
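After installing the requirements, an optional sanity check is to confirm that Flask imports cleanly in the environment you plan to run the app from. This is only a sketch; it assumes Flask is among the installed requirements (the flask_app directory suggests it is).

```python
from importlib.metadata import version

import flask  # the web framework the app server appears to be built on

print(f"Flask {version('flask')} is installed and importable")
```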
Preprocess the data for topic model training. The processed data will be saved to the directory specified by the --new_json_path argument:
./01_data_process.sh
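Before training, it can help to confirm that preprocessing produced what you expect. The sketch below is only an illustration: the path and the record structure are assumptions, so substitute whatever you passed to --new_json_path and whatever fields the script actually writes.

```python
import json

# ASSUMPTION: replace this with the path you passed to --new_json_path.
processed_path = "./flask_app/Topic_Models/processed_data.json"

with open(processed_path) as f:
    data = json.load(f)

# Report how many records were produced and show one example record so you can
# confirm the fields (e.g., raw text, tokens) look right before training.
records = data if isinstance(data, list) else list(data.values())
print(f"Loaded {len(records)} records")
print(json.dumps(records[0], indent=2)[:500])
```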
You can obtain trained topic models in two ways:
- If you're looking to get started quickly without training your own models, we've got you covered: pre-trained models on the Bills dataset, configured with 35 topics, are ready for use and can be downloaded directly from Google Drive.
After downloading, place the model files in the following directory of your local repository:
./flask_app/Topic_Models/trained_Models/
- Train Your Own Models
For those who need customized topic models, we provide a convenient script to facilitate the training process:
./02_train_model.sh
This script will lead you through the steps necessary to train new models, which will then be saved to the location specified by the bash script argument --save_trained_model_path (a quick way to inspect the saved models is sketched below).
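Whichever route you take, it is useful to confirm what ended up in the trained-models directory before launching the app. The sketch below assumes the models are saved as pickle files and expose a topic-count attribute; both are assumptions, so adapt the glob pattern and attribute names to the repository's Topic_Models classes.

```python
import pickle
from pathlib import Path

# Directory where downloaded or freshly trained models are expected to live.
model_dir = Path("./flask_app/Topic_Models/trained_Models/")

# ASSUMPTION: the saved models are pickle files; adjust the glob pattern if the
# repository uses a different serialization format or file extension.
for model_file in sorted(model_dir.glob("*.pkl")):
    with open(model_file, "rb") as f:
        model = pickle.load(f)  # unpickling custom classes requires them to be importable
    # ASSUMPTION: probe a couple of plausible attribute names for the topic count.
    num_topics = getattr(model, "num_topics", None) or getattr(model, "n_topics", "unknown")
    print(f"{model_file.name}: {num_topics} topics ({type(model).__name__})")
```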
Configuration Note: The application defaults to using 35 topics for model training. If your requirements differ, please follow these steps to ensure compatibility:
- Execute the model training script with the number of topics set to your preference.
- Once training is complete, adjust the code in app.py to align with your model's topic count. Specifically, change the value on line 128 to correspond with the number of topics in your newly trained model (see the illustration below).
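As a hypothetical illustration of that edit (the exact identifier around line 128 of app.py is an assumption; change whatever constant the file actually defines):

```python
# app.py, around line 128 -- the variable name below is illustrative, not the
# repository's actual identifier; edit whatever constant the file really defines.
num_topics = 50   # was 35 (the default the app ships with); set to your trained topic count
```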
To launch the web application on your local machine, execute the provided shell script by running the following command in your terminal:
./03_run_app.sh
Upon successful execution of the script, the web application will be available. You can access it by opening your web browser and navigating to:
http://localhost:5001
If you wish to use a different port than the default 5001, ensure that the 03_run_app.sh script is configured accordingly before starting the application.
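For reference, a Flask entry point usually selects its port in one place; a minimal sketch of that pattern is shown below. The actual app.py and 03_run_app.sh may wire the port differently (for example through a command-line flag), so treat the PORT variable here as an assumption.

```python
import os

from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Default to 5001, but allow a launcher script such as 03_run_app.sh to override it.
    # The PORT environment variable here is an assumption, not the repository's actual mechanism.
    port = int(os.environ.get("PORT", 5001))
    app.run(host="0.0.0.0", port=port)
```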
This app supports two datasets:
- 20newsgroup: A collection of newsgroup documents.
- congressional_bill_project_dataset: A compilation of Congressional Bill documents.
For the Congressional Bill dataset, the app uses data from these sources:
The original Bills dataset uses numbers as labels. To correlate topics with labels, consult the Codebook.
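Once you have the Codebook, a small helper can translate the numeric labels into readable names. The two entries below are illustrative only; populate the full mapping from the Codebook itself.

```python
# Hypothetical helper for turning the Bills dataset's numeric labels into readable names.
# The entries below are examples only -- fill in the full mapping from the Codebook.
MAJOR_TOPIC_LABELS = {
    1: "Macroeconomics",  # example entry; verify against the Codebook
    3: "Health",          # example entry; verify against the Codebook
}

def label_name(code: int) -> str:
    """Return a human-readable topic name for a numeric label, or flag it as unmapped."""
    return MAJOR_TOPIC_LABELS.get(code, f"unmapped label {code}")

print(label_name(3))  # -> Health
```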
The List Interface provides a streamlined overview of documents grouped by topics.
The Document Interface offers an immersive reading experience, providing a ranked list of relevant topics and automatically highlighting keywords in the document.
This project was built upon the work of the following repositories:
We extend our gratitude to the authors of these original repositories for their valuable contributions and inspiration.
This project is licensed under the MIT License - see the LICENSE file for details.
For any additional questions or comments, please contact [[email protected]].
Thank you for using or contributing to the Document Annotation NLP Tool!