1 parent 2edbb22 · commit 22a120f
Showing 1 changed file with 278 additions and 0 deletions.
@@ -0,0 +1,278 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNAhFQUbk8KKp8mUyRVbN8K",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Large_Language_Models_Article_Separation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"#Large Language Models and Article Extraction\n",
"\n",
"\n",
"Created by Sarah Oberbichler [![ORCID](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyMCIgaGVpZ2h0PSIyMCIgdmlld0JveD0iMCAwIDIwIDIwIj4KICA8cmVjdCB3aWR0aD0iMjAiIGhlaWdodD0iMjAiIGZpbGw9IiNGRkZGRkYiLz4KICA8Y2lyY2xlIGN4PSIxMCIgY3k9IjEwIiByPSI5IiBmaWxsPSIjQThDRTNDIi8+CiAgPHRleHQgeD0iMTAiIHk9IjE1IiBmb250LWZhbWlseT0iQXJpYWwsIHNhbnMtc2VyaWYiIGZvbnQtc2l6ZT0iMTEiIGZvbnQtd2VpZ2h0PSJib2xkIiB0ZXh0LWFuY2hvcj0ibWlkZGxlIiBmaWxsPSIjRkZGRkZGIj5pRDwvdGV4dD4KPC9zdmc+)](https://orcid.org/0000-0002-1031-2759)\n",
"\n",
"###Using LLMs via APIs\n",
"\n",
"For this course, we use the NVIDIA API, which provides up to 4,000 free credits to access the open-source model llama-3.1-nemotron-70b-instruct via NVIDIA's GPU infrastructure. Running larger models outside of chatbot applications demands significant computational resources.\n",
"While APIs offer a way to access models and GPU power through third parties when no local computing power is available, they typically:\n",
"\n",
"* Require payment beyond free trial credits\n",
"* Should not be used with sensitive data\n",
"* Should not be used with copyright-restricted data\n",
"\n",
"\n",
"### Using LLMs via APIs for the Analysis of Historical Newspapers\n",
"Historical newspapers published before 1940 are generally free from copyright protection and, when accessed through public newspaper platforms, are not classified as sensitive data. However, important considerations include:\n",
"\n",
"* Library licensing agreements may restrict usage\n",
"* Cultural heritage institutions might have specific terms of use\n",
"* Access and processing policies may vary by institution\n",
"\n",
"When using APIs provided by third parties, make sure to check the licensing agreements of the data provider (e.g. a library). For example, newspapers marked with **Public Domain Mark 1.0 Universell** have no usage restrictions."
],
"metadata": {
"id": "LX69IhPEGqhv"
}
},
{
"cell_type": "markdown",
"source": [
"#Setting up the Large Language Model\n",
"\n",
"To use the large language model via the API, you need an API key: https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct. Add your private key to your Colab notebook under *Secrets* as NVIDIA_TOKEN. Then run the check below to make sure everything works as intended."
],
"metadata": {
"id": "vzUjqdnmXvg1"
}
},
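{
"cell_type": "markdown",
"source": [
"Before running the extraction cells, it can help to confirm that the key and the data are in place. The next cell is a small sketch: it sends a minimal test request to the NVIDIA endpoint and loads the newspaper pages into the DataFrame `df` that the cells below expect. The file name `newspaper_pages.xlsx` is only a placeholder; adjust it to your own export, which needs a `plainpagefulltext` column."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Minimal sketch to check the setup before processing the whole DataFrame\n",
"import pandas as pd\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Send a tiny test request; if the key is valid, this prints a short answer\n",
"test = client.chat.completions.create(\n",
"    model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"    messages=[{'role': 'user', 'content': 'Reply with the single word OK.'}],\n",
"    max_tokens=10\n",
")\n",
"print(test.choices[0].message.content)\n",
"\n",
"# Load the newspaper pages; the file name is a placeholder for your own export\n",
"df = pd.read_excel('newspaper_pages.xlsx')\n",
"print(df.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},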
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"# Initialize OpenAI client with NVIDIA API settings\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Process the DataFrame (df holds one newspaper page per row, full text in 'plainpagefulltext')\n",
"all_articles = []\n",
"for index, row in df.iterrows():\n",
"    try:\n",
"        # Make API call\n",
"        completion = client.chat.completions.create(\n",
"            model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"            messages=[\n",
"                {\n",
"                    'role': 'system',\n",
"                    'content': \"\"\" System Instructions: \"\"\"\n",
"                },\n",
"                {\n",
"                    'role': 'user',\n",
"                    'content': f\"\"\"# Task Instructions:\n",
"Text to analyze:\n",
"{row['plainpagefulltext']}\"\"\"\n",
"                }\n",
"            ],\n",
"            temperature=0.0,\n",
"            max_tokens=20000\n",
"        )\n",
"\n",
"        content = completion.choices[0].message.content\n",
"\n",
"        # Process articles\n",
"        if content and \"Keine Artikel mit dem angegebenen Thema gefunden.\" not in content:\n",
"            new_row = row.to_dict()\n",
"            new_row['extracted_article'] = content.strip()\n",
"            all_articles.append(new_row)\n",
"\n",
"    except Exception as e:\n",
"        print(f\"Error processing row {index}: {str(e)}\")\n",
"        continue\n",
"\n",
"# Create final DataFrame\n",
"result_df = pd.DataFrame(all_articles)\n",
"\n",
"# Save to Excel\n",
"result_df.to_excel('test_1.xlsx', index=False)\n",
"\n",
"# Display results\n",
"print(result_df.head())"
],
"metadata": {
"id": "-6S_983uGveC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Now try the code with the provided prompts\n",
"\n",
"How well did your prompt perform in comparison to this prompt? A short sketch for comparing the two result files follows the next cell."
],
"metadata": {
"id": "DEXdZNstH5ba"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from typing import List, Dict\n",
"from openai import OpenAI\n",
"from google.colab import userdata  # reads the NVIDIA_TOKEN secret stored in Colab\n",
"\n",
"# Initialize OpenAI client with NVIDIA API settings\n",
"client = OpenAI(\n",
"    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
"    api_key=userdata.get('NVIDIA_TOKEN')\n",
")\n",
"\n",
"# Placeholder for optional few-shot examples that show the desired output structure\n",
"examples = \"\"\n",
"\n",
"def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:\n",
"    def analyze_text(text: str) -> List[Dict[str, str]]:\n",
"        system_prompt = f\"\"\"\n",
"# System Instructions\n",
"You are an expert text analyst and information retrieval specialist who avoids summarization as well as enumeration. Use {examples} for structuring your answer.\n",
"Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.\n",
"\n",
"Classify as relevant if the text contains:\n",
"- Primary earthquake terminology from the 19th and 20th century\n",
"- Official earthquake reports\n",
"- Geology and seismology\n",
"- Impact descriptions\n",
"- Solution descriptions\n",
"- Technical descriptions\n",
"- Aid\n",
"- Honors and tributes\n",
"- Political discussions and opinions on earthquakes\n",
"- Stories from victims and refugees\n",
"- Reports on refugees and victims\n",
"- Lives of victims\n",
"- Historical references\n",
"- Comparisons\n",
"\n",
"Your output should consist of the extracted articles and the verification.\n",
"\n",
"Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.\n",
"\"\"\"\n",
"        user_prompt = f\"\"\"\n",
"# Task Instructions\n",
"Bitte führe die folgenden Schritte aus:\n",
"1. Lies jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu verweisen.\n",
"2. Identifiziere alle Artikel zum Thema Erdbeben und Erdstoß.\n",
"3. Für jedes Vorkommen des Themas:\n",
"   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommt.\n",
"   b. Kontrolliere Satz für Satz, ob diese zusammengehören; beende den Artikel, wenn die Sätze nicht mehr zusammengehören.\n",
"   c. Markiere den vollständigen Artikel von Anfang bis Ende.\n",
"   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf \"article too long, human addition needed\".\n",
"   e. Berücksichtige auch sehr kurze und sehr lange Artikel.\n",
"4. Überprüfe jeden markierten Artikel:\n",
"   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Erdbeben geht.\n",
"   b. Vergewissere dich, dass er eines der genannten Themen enthält.\n",
"   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist.\n",
"5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält.\n",
"6. Korrigiere OCR-Fehler.\n",
"7. Wenn keine Artikel gefunden wurden, gib \"Keine Artikel mit dem angegebenen Thema gefunden.\" aus.\n",
"\n",
"Führe nun diese Schritte für den folgenden Text aus:\n",
"{text}\n",
"\"\"\"\n",
"        try:\n",
"            messages = [\n",
"                {\n",
"                    'role': 'system',\n",
"                    'content': system_prompt\n",
"                },\n",
"                {\n",
"                    'role': 'user',\n",
"                    'content': user_prompt\n",
"                }\n",
"            ]\n",
"\n",
"            completion = client.chat.completions.create(\n",
"                model=\"nvidia/llama-3.1-nemotron-70b-instruct\",\n",
"                messages=messages,\n",
"                temperature=0.0,\n",
"                max_tokens=20000\n",
"            )\n",
"\n",
"            content = completion.choices[0].message.content\n",
"\n",
"            # Split the content into individual articles\n",
"            articles = []\n",
"            if \"Keine Artikel mit dem angegebenen Thema gefunden.\" in content:\n",
"                return []\n",
"\n",
"            # Split by \"**END OF ARTICLE**\" if present, otherwise treat as single article\n",
"            if \"**END OF ARTICLE**\" in content:\n",
"                parts = content.split(\"**END OF ARTICLE**\")\n",
"                articles = [{\"article\": part.strip()} for part in parts if part.strip()]\n",
"            else:\n",
"                articles = [{\"article\": content.strip()}]\n",
"\n",
"            return articles\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Error in AI processing: {str(e)}\")\n",
"            return []\n",
"\n",
"    # Apply the analysis to each row in the DataFrame\n",
"    all_articles = []\n",
"    for index, row in df.iterrows():\n",
"        articles = analyze_text(row[text_column])\n",
"        for i, article in enumerate(articles, 1):\n",
"            new_row = row.to_dict()\n",
"            new_row['extracted_article'] = article['article']\n",
"            new_row['article_part'] = i\n",
"            new_row['total_parts'] = len(articles)\n",
"            all_articles.append(new_row)\n",
"\n",
"    # Create a new DataFrame with individual rows for each article\n",
"    result_df = pd.DataFrame(all_articles)\n",
"\n",
"    return result_df\n",
"\n",
"# Usage example\n",
"text_column = 'plainpagefulltext'\n",
"result_df = analyze_dataframe(df, text_column)\n",
"\n",
"# Save the results to an Excel file\n",
"result_df.to_excel('test_2.xlsx', index=False)\n",
"\n",
"# Display the first few rows of the result\n",
"print(result_df.head())"
],
"metadata": {
"id": "G6GHbkcUb0hR"
},
"execution_count": null,
"outputs": []
},
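{
"cell_type": "markdown",
"source": [
"To judge how well your own prompt performed against the provided one, a rough comparison sketch is shown below. It assumes that both cells above have been run and that `test_1.xlsx` and `test_2.xlsx` are in the working directory."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Load the two result files written by the cells above\n",
"own_prompt_df = pd.read_excel('test_1.xlsx')\n",
"provided_prompt_df = pd.read_excel('test_2.xlsx')\n",
"\n",
"print(f\"Own prompt: {len(own_prompt_df)} extracted articles\")\n",
"print(f\"Provided prompt: {len(provided_prompt_df)} extracted articles\")\n",
"\n",
"# Look at one extraction from each run side by side\n",
"if not own_prompt_df.empty:\n",
"    print(own_prompt_df['extracted_article'].iloc[0][:500])\n",
"if not provided_prompt_df.empty:\n",
"    print(provided_prompt_df['extracted_article'].iloc[0][:500])"
],
"metadata": {},
"execution_count": null,
"outputs": []
}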
]
}