Legal NLP with Topic Models

How to download the two relevant csv files locally?

1.a. Court decision data: Get case_scraping_Aug_01_2022.csv via the Makefile on GitHub, or download it from the Google Drive folder:

Legal NLP Project (with MPI Coll) -> Updated Data -> case_scraping_Aug_01_2022.csv .

Some variables (columns) of current interest are 'participating_judges' and 'full_text'.
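As a minimal sketch of how the two columns named above might be read once the csv file is downloaded, the helper below uses only Python's standard csv module. The function name load_cases, the UTF-8 encoding, and the raised field size limit are assumptions for illustration, not part of the repository's code.

```python
import csv

# Court decisions are long; raise the csv module's per-field size limit
# (default 131072 characters) so 'full_text' does not trigger an error.
csv.field_size_limit(2**31 - 1)


def load_cases(csv_path, columns=("participating_judges", "full_text")):
    """Return one dict per row, keeping only the requested columns.

    Hypothetical helper: assumes the file is UTF-8 encoded and has a header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"columns not found in {csv_path}: {missing}")
        return [{c: row[c] for c in columns} for row in reader]


# Example usage (path assumed to be the local download location):
# cases = load_cases("case_scraping_Aug_01_2022.csv")
# print(len(cases), cases[0]["participating_judges"])
```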

1.b. Ground-truth domains of each author (1998-2022): https://docs.google.com/spreadsheets/d/1xf3cCwArTWHxHNR_7T9D5vjafiw3P18L/edit#gid=1305781617

2.1. How to run LDA?

Make sure you have downloaded both Data_Preprocessing_for_Topic_Models.py and LDA_Model.py, then run LDA_Model.py. You can change the number of topics (default = 37) by passing the --num_topics flag. For example, run this command to get results with 10 topics: python3 LDA_Model.py --num_topics 10
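For readers unfamiliar with command-line flags, a flag like --num_topics with a default of 37 could be defined roughly as follows. This is only a sketch using Python's argparse; the actual parser in LDA_Model.py is not shown here and may differ, and parse_flags is a hypothetical name.

```python
import argparse


def parse_flags(argv=None):
    """Parse command-line flags; argv=None means 'read from sys.argv'."""
    parser = argparse.ArgumentParser(description="Fit a topic model.")
    parser.add_argument(
        "--num_topics",
        type=int,
        default=37,  # default stated in this README
        help="number of topics to fit",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    flags = parse_flags()
    print(f"Fitting with {flags.num_topics} topics")
```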

-> To reuse a trained and saved model instead of training again, comment out the following line (instructions are in the .py file; don't forget to download the model file too):

model = fit_model(dictionary, cases, flags.model_save, num_topics=flags.num_topics)
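An alternative to commenting code in and out is to train only when no saved model file exists. The sketch below is library-agnostic: load_or_fit, fit_fn, and load_fn are hypothetical names, and it assumes that flags.model_save is a file path, which the line above suggests but does not confirm.

```python
import os


def load_or_fit(model_path, fit_fn, load_fn):
    """Reuse a saved model if model_path exists; otherwise train a new one.

    fit_fn:  zero-argument callable that trains (and saves) a model.
    load_fn: one-argument callable that loads a model from a path.
    """
    if os.path.exists(model_path):
        return load_fn(model_path)
    return fit_fn()


# Possible usage inside the script, ASSUMING the model is a gensim LdaModel
# (gensim provides the LdaModel.load classmethod):
# model = load_or_fit(
#     flags.model_save,
#     fit_fn=lambda: fit_model(dictionary, cases, flags.model_save,
#                              num_topics=flags.num_topics),
#     load_fn=LdaModel.load,
# )
```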

2.2. Relevant distributions returned by running LDA?

Words (i.e. tokens) per topic: Legal NLP Project (with MPI Coll) -> Results -> LDA Model -> lda_model_topics.txt

(Most likely) Topic(s) per document: Legal NLP Project (with MPI Coll) -> Results -> LDA Model -> lda_model_most_likely_topic_per_doc.txt
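To make "most likely topic per document" concrete: given a document's topic distribution, the most likely topic is the one with the highest probability. The sketch below assumes the distribution is a list of (topic_id, probability) pairs, which is the shape gensim returns for a document; the numbers are made up for illustration and are not results from this project.

```python
def most_likely_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability.

    doc_topics: list of (topic_id, probability) pairs for one document.
    Returns None for an empty distribution.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


# Illustrative distribution for a single document (invented values).
example = [(0, 0.10), (3, 0.65), (7, 0.25)]
print(most_likely_topic(example))  # (3, 0.65)
```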

3.1. How to run the Author-Topic (AT) model (any dependencies)?

Make sure you have downloaded Data_Preprocessing_for_Topic_Models.py, Author_Topic_Model.py, and the dependency author2doc.json, then run Author_Topic_Model.py. You can change the number of topics (default = 37) by passing the --num_topics flag. For example, run this command to get results with 10 topics: python3 Author_Topic_Model.py --num_topics 10

-> To reuse a trained and saved model instead of training again, follow the instructions in the .py file on which section to comment out. Don't forget to download the model file too!
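The author2doc.json dependency presumably maps each author (judge) to the documents they took part in. The sketch below shows one way such a mapping could be built from the participating_judges column. It assumes the format is author name -> list of document indices (the convention gensim's AuthorTopicModel uses) and that judges are separated by commas; neither is confirmed here, and the repository's own Generate_author2doc.py may work differently.

```python
import json
from collections import defaultdict


def build_author2doc(judges_per_case, sep=","):
    """Map each judge name to the indices of the cases they participated in.

    judges_per_case: one string per case, e.g. "Papier, Haas".
    sep: ASSUMED separator between judge names.
    """
    author2doc = defaultdict(list)
    for doc_id, judges in enumerate(judges_per_case):
        for name in (judges or "").split(sep):
            name = name.strip()
            if name:  # skip empty entries (cases without listed judges)
                author2doc[name].append(doc_id)
    return dict(author2doc)


# Illustrative input only; not taken from the dataset.
cases = ["Papier, Haas", "Haas", ""]
print(json.dumps(build_author2doc(cases), ensure_ascii=False))
# {"Papier": [0], "Haas": [0, 1]}
```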

3.2. Relevant distributions returned by running the AT model?

Words (i.e. tokens) per topic: Legal NLP Project (with MPI Coll) -> Results -> AT model with varying number of topics -> at_model_topics_num_topics=[a number].txt

Topics per author: Legal NLP Project (with MPI Coll) -> Results -> AT model with varying number of topics -> at_model_author_vecs_num_topics=[a number].txt

Resources to double check the authors (judges)?

Wikipedia page listing all judges of the court (the participating_judges variable in the csv file only shows their last names): https://de.wikipedia.org/wiki/Liste_der_Richter_des_Bundesverfassungsgerichts

Link to the raw data (before scraping), useful for comparing the approximate case id with the year (note: a smaller id means an older case; cases with id 10 or above were probably decided after the 1990s): https://www.bundesverfassungsgericht.de/SiteGlobals/Forms/Suche/Entscheidungensuche_Formular.html?gts=5403124_list%253Ddate_dt%252Basc&language_=de

AT Model Code Pipeline: remove_irrelevant_cases.py -> Data_Preprocessing_for_Topic_Models.py -> Generate_author2doc.py -> Clean_author2doc.py -> Convert_author2doc_to_lol.py -> AT_Model_Gibbs_WardNJU.py

Evaluation Pipeline: Calculate_coherence.py -> automatic_topic_to_domain_map.py -> save_full_topics_per_doc_dist.py -> get_features_domain_and_author_probs_per_doc.py -> augment_clean_judges_to_csv.py -> get_time_aware_features.py -> get_time_aware_judge_specific_features.py
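The two pipelines above are ordered lists of scripts, so they could be run in sequence with a small driver like the one below. This is a sketch under a strong assumption: that each script runs without extra command-line arguments, which is not confirmed here. The names build_commands and run_pipeline are hypothetical, and by default the driver only prints the commands instead of executing them.

```python
import subprocess
import sys

AT_PIPELINE = [
    "remove_irrelevant_cases.py",
    "Data_Preprocessing_for_Topic_Models.py",
    "Generate_author2doc.py",
    "Clean_author2doc.py",
    "Convert_author2doc_to_lol.py",
    "AT_Model_Gibbs_WardNJU.py",
]

EVAL_PIPELINE = [
    "Calculate_coherence.py",  # capitalized as in the repository file listing
    "automatic_topic_to_domain_map.py",
    "save_full_topics_per_doc_dist.py",
    "get_features_domain_and_author_probs_per_doc.py",
    "augment_clean_judges_to_csv.py",
    "get_time_aware_features.py",
    "get_time_aware_judge_specific_features.py",
]


def build_commands(scripts, python=sys.executable):
    """Turn script names into argument lists for subprocess.run."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cmd in build_commands(scripts):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(AT_PIPELINE)  # dry run: prints the six commands in order
```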

Files in the repository:

AT_Model_Gibbs_WardNJU.py
Author_Topic_Model.py
Calculate_coherence.py
Case_Scraping_with_BeautifulSoup_Aug_01_2022.py
Case_Scraping_with_BeautifulSoup_Dec_04_2022.py
Clean_author2doc.py
Construct_stop_words.py
Convert_author2doc_to_lol.py
Data_Preprocessing_for_Topic_Models.py
Download_Raw_Data.py
Download_Updated_Data.py
Generate_author2doc.py
Generate_domain2doc.py
HTML_Printout.py
LDA_Model.py
LDA_calculate_recall_and_precision_per_dm.py
Makefile
Match_dataset.py
README.md
Vocab_sorted_by_freq.py
Vocab_stats.py
augment_clean_judges_to_csv.py
author2doc.json
automatic_topic_to_domain_map.py
automatic_topic_to_domain_map_LDA.py
calculate_precision_2ref_cases.py
calculate_recall_and_precision_per_dm.py
calculate_recall_precision_2ref_cases.py
calculate_total_accuracy.py
change_format_LDA_to_AT.py
clean_author2doc.json
clean_author2doc_01_1998_to_07_2022.json
clean_author2doc_01_1998_to_07_2022_noNaN.json
domain2doc.json
error_analysis.py
get_features_domain_and_author_probs_per_doc.py
get_features_domain_probs_per_doc.py
get_time_aware_features.py
get_time_aware_judge_specific_features.py
plot_at_author_vecs.py
plot_at_words_per_topic.py
plot_pygraphviz_author_topic_word.py
plot_pygraphviz_author_topic_word_manualATM.py
plot_pygraphviz_union_of_domains_topic_word.py
plot_referee_precision.py
remove_cases_before_1998.py
remove_irrelevant_cases.py
requirements.txt
save_full_topics_per_doc_dist.py
show_top_docs_per_topic.py
stop_words_Nov_07_2022.txt
tests.py
topic_to_domain_dict.json
topic_to_domain_map.py

Pinafore/Constitutional_NLP_Summer_2022

Folders and files

Latest commit

History

Repository files navigation

Legal NLP with Topic Models

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages