1 parent 2edbb22 · commit 22a120f
Showing 1 changed file with 278 additions and 0 deletions.
@@ -0,0 +1,278 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNAhFQUbk8KKp8mUyRVbN8K",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Large_Language_Models_Article_Separation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"#Large Language Models and Article Extraction\n",
"\n",
"\n",
"Created by Sarah Oberbichler [![ORCID](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyMCIgaGVpZ2h0PSIyMCIgdmlld0JveD0iMCAwIDIwIDIwIj4KICA8cmVjdCB3aWR0aD0iMjAiIGhlaWdodD0iMjAiIGZpbGw9IiNGRkZGRkYiLz4KICA8Y2lyY2xlIGN4PSIxMCIgY3k9IjEwIiByPSI5IiBmaWxsPSIjQThDRTNDIi8+CiAgPHRleHQgeD0iMTAiIHk9IjE1IiBmb250LWZhbWlseT0iQXJpYWwsIHNhbnMtc2VyaWYiIGZvbnQtc2l6ZT0iMTEiIGZvbnQtd2VpZ2h0PSJib2xkIiB0ZXh0LWFuY2hvcj0ibWlkZGxlIiBmaWxsPSIjRkZGRkZGIj5pRDwvdGV4dD4KPC9zdmc+)](https://orcid.org/0000-0002-1031-2759)\n",
"\n",
"###Using LLMs via APIs\n",
"\n",
"For this course, we use the NVIDIA API, which provides up to 4,000 free credits to access the open-source model llama-3.1-nemotron-70b-instruct via NVIDIA's GPU infrastructure. Running larger models outside of chatbot applications demands significant computational resources.\n",
"While APIs offer a way to access models and GPU power through third parties when no local computing power is available, they typically:\n",
"\n",
"* Require payment beyond free trial credits\n",
"* Should not be used with sensitive data\n",
"* Should not be used with copyright-restricted data\n",
"\n",
"\n",
"### Using LLMs via APIs for the Analysis of Historical Newspapers\n",
"Historical newspapers published before 1940 are generally free from copyright protection and, when accessed through public newspaper platforms, are not classified as sensitive data. However, important considerations include:\n",
"\n",
"* Library licensing agreements may restrict usage\n",
"* Cultural heritage institutions might have specific terms of use\n",
"* Access and processing policies may vary by institution\n",
"\n",
"When using APIs provided by third parties, make sure to check the licensing agreements of the data provider (e.g. a library). For example, newspapers marked with **Public Domain Mark 1.0 Universell** have no usage restrictions."
],
"metadata": {
"id": "LX69IhPEGqhv"
}
},
{
"cell_type": "markdown",
"source": [
"#Setting up the Large Language Model\n",
"\n",
"To use the large language model via the API, you need an API key: https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct. Add your private key to your Colab notebook under *Secrets* as NVIDIA_TOKEN. Then run the check below to make sure everything works as intended."
],
"metadata": {
"id": "vzUjqdnmXvg1"
}
},
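{
"cell_type": "markdown",
"source": [
"Before running the extraction cells, it can help to confirm that the key and the data are in place. The next cell is a small sketch: it sends a minimal test request to the NVIDIA endpoint and loads the newspaper pages into the DataFrame `df` that the cells below expect. The file name `newspaper_pages.xlsx` is only a placeholder; adjust it to your own export, which needs a `plainpagefulltext` column."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Minimal sketch to check the setup before processing the whole DataFrame\n",
"import pandas as pd\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Send a tiny test request; if the key is valid, this prints a short answer\n",
"test = client.chat.completions.create(\n",
"    model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"    messages=[{'role': 'user', 'content': 'Reply with the single word OK.'}],\n",
"    max_tokens=10\n",
")\n",
"print(test.choices[0].message.content)\n",
"\n",
"# Load the newspaper pages; the file name is a placeholder for your own export\n",
"df = pd.read_excel('newspaper_pages.xlsx')\n",
"print(df.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},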
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"# Initialize OpenAI client with NVIDIA API settings\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Process the DataFrame (df holds one newspaper page per row, full text in 'plainpagefulltext')\n",
"all_articles = []\n",
"for index, row in df.iterrows():\n",
"    try:\n",
"        # Make API call\n",
"        completion = client.chat.completions.create(\n",
"            model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"            messages=[\n",
"                {\n",
"                    'role': 'system',\n",
"                    'content': \"\"\" System Instructions: \"\"\"\n",
"                },\n",
"                {\n",
"                    'role': 'user',\n",
"                    'content': f\"\"\"# Task Instructions:\n",
"Text to analyze:\n",
"{row['plainpagefulltext']}\"\"\"\n",
"                }\n",
"            ],\n",
"            temperature=0.0,\n",
"            max_tokens=20000\n",
"        )\n",
"\n",
"        content = completion.choices[0].message.content\n",
"\n",
"        # Process articles\n",
"        if content and \"Keine Artikel mit dem angegebenen Thema gefunden.\" not in content:\n",
"            new_row = row.to_dict()\n",
"            new_row['extracted_article'] = content.strip()\n",
"            all_articles.append(new_row)\n",
"\n",
"    except Exception as e:\n",
"        print(f\"Error processing row {index}: {str(e)}\")\n",
"        continue\n",
"\n",
"# Create final DataFrame\n",
"result_df = pd.DataFrame(all_articles)\n",
"\n",
"# Save to Excel\n",
"result_df.to_excel('test_1.xlsx', index=False)\n",
"\n",
"# Display results\n",
"print(result_df.head())"
],
"metadata": {
"id": "-6S_983uGveC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Now try the code with the provided prompts\n",
"\n",
"How well did your prompt perform in comparison to this prompt? A short sketch for comparing the two result files follows the next cell."
],
"metadata": {
"id": "DEXdZNstH5ba"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from typing import List, Dict\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"# Initialize OpenAI client with NVIDIA API settings\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Placeholder for optional few-shot examples that show the desired output structure\n",
"examples = \"\"\n",
"\n",
"def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:\n",
"    def analyze_text(text: str) -> List[Dict[str, str]]:\n",
"        system_prompt = f\"\"\"\n",
"# System Instructions\n",
"You are an expert text analyst and information retrieval specialist who avoids summarization as well as enumeration. Use {examples} for structuring your answer.\n",
"Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.\n",
"\n",
"Classify as relevant if the text contains:\n",
"- Primary earthquake terminology from the 19th and 20th century\n",
"- Official earthquake reports\n",
"- Geology and seismology\n",
"- Impact descriptions\n",
"- Solution descriptions\n",
"- Technical descriptions\n",
"- Aid\n",
"- Honors and tributes\n",
"- Political discussions and opinions on earthquakes\n",
"- Stories from victims and refugees\n",
"- Reports on refugees and victims\n",
"- Lives of victims\n",
"- Historical references\n",
"- Comparisons\n",
"\n",
"Your output should consist of the extracted articles and the verification.\n",
"\n",
"Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.\n",
"\"\"\"\n",
"        user_prompt = f\"\"\"\n",
"# Task Instructions\n",
"Bitte führe die folgenden Schritte aus:\n",
"1. Lies jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu verweisen.\n",
"2. Identifiziere alle Artikel zum Thema Erdbeben und Erdstoß.\n",
"3. Für jedes Vorkommen des Themas:\n",
"   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommt.\n",
"   b. Kontrolliere Satz für Satz, ob diese zusammengehören; beende den Artikel, wenn die Sätze nicht mehr zusammengehören.\n",
"   c. Markiere den vollständigen Artikel von Anfang bis Ende.\n",
"   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf \"article too long, human addition needed\".\n",
"   e. Berücksichtige auch sehr kurze und sehr lange Artikel.\n",
"4. Überprüfe jeden markierten Artikel:\n",
"   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Erdbeben geht.\n",
"   b. Vergewissere dich, dass er eines der genannten Themen enthält.\n",
"   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist.\n",
"5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält.\n",
"6. Korrigiere OCR-Fehler.\n",
"7. Wenn keine Artikel gefunden wurden, gib \"Keine Artikel mit dem angegebenen Thema gefunden.\" aus.\n",
"\n",
"Führe nun diese Schritte für den folgenden Text aus:\n",
"{text}\n",
"\"\"\"\n",
"        try:\n",
"            messages = [\n",
"                {\n",
"                    'role': 'system',\n",
"                    'content': system_prompt\n",
"                },\n",
"                {\n",
"                    'role': 'user',\n",
"                    'content': user_prompt\n",
"                }\n",
"            ]\n",
"\n",
"            completion = client.chat.completions.create(\n",
"                model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"                messages=messages,\n",
"                temperature=0.0,\n",
"                max_tokens=20000\n",
"            )\n",
"\n",
"            content = completion.choices[0].message.content\n",
"\n",
"            # Split the content into individual articles\n",
"            articles = []\n",
"            if \"Keine Artikel mit dem angegebenen Thema gefunden.\" in content:\n",
"                return []\n",
"\n",
"            # Split by \"**END OF ARTICLE**\" if present, otherwise treat as single article\n",
"            if \"**END OF ARTICLE**\" in content:\n",
"                parts = content.split(\"**END OF ARTICLE**\")\n",
"                articles = [{\"article\": part.strip()} for part in parts if part.strip()]\n",
"            else:\n",
"                articles = [{\"article\": content.strip()}]\n",
"\n",
"            return articles\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Error in AI processing: {str(e)}\")\n",
"            return []\n",
"\n",
"    # Apply the analysis to each row in the DataFrame\n",
"    all_articles = []\n",
"    for index, row in df.iterrows():\n",
"        articles = analyze_text(row[text_column])\n",
"        for i, article in enumerate(articles, 1):\n",
"            new_row = row.to_dict()\n",
"            new_row['extracted_article'] = article['article']\n",
"            new_row['article_part'] = i\n",
"            new_row['total_parts'] = len(articles)\n",
"            all_articles.append(new_row)\n",
"\n",
"    # Create a new DataFrame with individual rows for each article\n",
"    result_df = pd.DataFrame(all_articles)\n",
"\n",
"    return result_df\n",
"\n",
"# Usage example\n",
"text_column = 'plainpagefulltext'\n",
"result_df = analyze_dataframe(df, text_column)\n",
"\n",
"# Save the results to an Excel file\n",
"result_df.to_excel('test_2.xlsx', index=False)\n",
"\n",
"# Display the first few rows of the result\n",
"print(result_df.head())"
],
"metadata": {
"id": "G6GHbkcUb0hR"
},
"execution_count": null,
"outputs": []
},
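{
"cell_type": "markdown",
"source": [
"To judge how well your own prompt performed against the provided one, a rough comparison sketch is shown below. It assumes that both cells above have been run and that `test_1.xlsx` and `test_2.xlsx` are in the working directory."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Load the two result files written by the cells above\n",
"own_prompt_df = pd.read_excel('test_1.xlsx')\n",
"provided_prompt_df = pd.read_excel('test_2.xlsx')\n",
"\n",
"print(f\"Own prompt: {len(own_prompt_df)} extracted articles\")\n",
"print(f\"Provided prompt: {len(provided_prompt_df)} extracted articles\")\n",
"\n",
"# Look at one extraction from each run side by side\n",
"if not own_prompt_df.empty:\n",
"    print(own_prompt_df['extracted_article'].iloc[0][:500])\n",
"if not provided_prompt_df.empty:\n",
"    print(provided_prompt_df['extracted_article'].iloc[0][:500])"
],
"metadata": {},
"execution_count": null,
"outputs": []
}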
]
}