Merge pull request #1 from BU-Spark/dev_akshat: Project Outline Update
# Boston Public School Policy Document Retrieval Chatbot

## Project Overview

This project aims to streamline access to policy documents within educational and public institutions. The chatbot system will help users quickly locate specific information from disorganized or complex file structures by reorganizing documents intelligently. This system will enhance accessibility, save time, and improve user experience.

Key Features:

1. Simplified document navigation: The chatbot will make it easy for users to search for and retrieve relevant policy documents.
2. AI/ML-powered: Machine learning models will be used to interpret user queries and rank policy documents by relevance.
3. Document reorganization: The system will suggest improvements to document categorization based on user behavior and queries.

## B. Problem Statement

1. Natural Language Understanding for Query Interpretation:

* Task: Develop an NLP model to interpret user queries and map them to policy categories.
* Objective: Classify user input into predefined categories and extract key details.
* Approach: Use text classification and named entity recognition (NER), leveraging fine-tuned large language models (LLMs).

2. Document Retrieval and Ranking:

* Task: Build a document retrieval system that searches the repository and returns the most relevant documents.
* Objective: Use techniques like vector embeddings to rank documents by relevance.
* Approach: Implement semantic search with LLM-based embeddings or retrieval models like BM25 or dense retrieval.

## C. Project Checklist

1. Data Preparation and Preprocessing:

* Data scraping: Collect policy documents from the repository.
* Data cleaning: Remove duplicates, incomplete entries, and irrelevant data.
* Tokenization: Break text into tokens for processing.
* Metadata tagging: Assign metadata such as policy numbers and names.

2. Query Interpretation:

* Text classification model: Implement a model for interpreting and categorizing user queries.
* Intent recognition: Define logic to map user queries to relevant document categories.

3. Document Retrieval System:

* Vector embeddings: Develop a system to retrieve documents based on vector embeddings.
* Semantic search engine: Implement a search engine to rank documents by relevance.

4. Fine-tuning and Evaluation:

* Refine models: Continuously improve model performance based on feedback and evaluation metrics (e.g., accuracy, precision, recall).

5. User Interface and Chatbot Integration:

* Chatbot deployment: Integrate the chatbot with a web interface for user interaction and document retrieval.

## D. Operationalization Path

The chatbot will be deployed on a web interface where users can ask questions in natural language (e.g., "How do I report an incident?"). The system will retrieve the most relevant policy documents and provide direct answers, helping users navigate the policy repository efficiently.

## Resources

https://www.bostonpublicschools.org/domain/1884

## Contributors

* Akshat Gurbuxani
* Abhaya Shukla
* Akuraju Mounika Chowdary
* Duoduo Xu
# Technical Project Document

## *Akshat Gurbuxani, Abhaya Shukla, Akuraju Mounika Chowdary, Duoduo Xu, 2024-Sep-29 v1.0.0-dev*

## Project Overview

In educational and public institutions, locating the right policy documents can be difficult due to disorganized and complex file structures. This project aims to streamline access to these documents by developing a chatbot system that smartly reorganizes them and helps users quickly find specific information. By simplifying navigation through the policy database, the system will enhance document accessibility, save time, and improve the overall user experience.

### **A. Human to AI Process**

* **Time-Consuming Searches**: Users often waste time sifting through irrelevant documents.
* **Inefficient Document Management**: Organizing files becomes increasingly difficult as the volume of documents grows.
* **Dependence on Human Assistance**: Users frequently need help from administrators or colleagues to navigate the repository.
* **Manual Data Analysis for Reorganization**: Tracking user behavior and identifying reorganization needs is slow and inefficient because it depends on manual reporting.
* **Inconsistent Naming & Categorization**: Non-intuitive naming and categorization make it difficult for users to locate necessary documents.

The goal of implementing AI/ML is to automate and streamline the processes above, improving the efficiency of document management, query resolution, and the user experience through a chatbot interface. Machine learning models can be trained to understand user queries, locate relevant documents quickly, and suggest organizational improvements to the document repository, minimizing human effort and speeding up the process.
### **B. Problem Statement**

1. *Natural Language Understanding for Query Interpretation*
   * **Task**: Develop an NLP model that accurately interprets user queries, mapping them to relevant policy categories such as Superintendent's Circulars or Policies & Procedures.
   * **Objective**: Classify user input into predefined categories and extract the key details needed to retrieve the correct document.
   * **Approach**: Implement text classification and named entity recognition (NER) models, potentially leveraging fine-tuned large language models (LLMs) to improve query interpretation.
2. *Document Retrieval and Ranking*
   * **Task**: Create a robust document retrieval system that searches the policy repository and returns the most relevant documents for the interpreted user query.
   * **Objective**: Employ techniques like vector embeddings (e.g., TF-IDF, BERT) to rank documents by relevance, ensuring the user receives the most appropriate policy document.
   * **Approach**: Use semantic search with LLM-based embeddings, or specialized retrieval models such as BM25 or dense retrievers, for optimal ranking.
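As a minimal sketch of the query-interpretation step, a keyword-scoring classifier can stand in for the fine-tuned model described above. The category names match the document; the keyword lists are illustrative assumptions, not the real taxonomy:

```python
import re
from collections import Counter

# Hypothetical category -> keyword lists. A production system would learn
# these from labeled queries or use a fine-tuned LLM classifier instead.
CATEGORIES = {
    "Superintendent's Circulars": ["circular", "superintendent", "notice"],
    "Policies & Procedures": ["policy", "procedure", "report", "incident"],
}

def tokenize(text: str) -> Counter:
    """Lowercase word counts; Counter returns 0 for absent words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def classify_query(query: str) -> str:
    """Score each category by keyword overlap and return the best match."""
    tokens = tokenize(query)
    scores = {cat: sum(tokens[kw] for kw in kws)
              for cat, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)

print(classify_query("How do I report an incident?"))
# → Policies & Procedures
```

In practice the NER component would additionally pull out entities (circular numbers, school names) from the same token stream to narrow the search.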

### **C. Project Checklist**

1. *Data Preparation and Preprocessing*
   1. Data scraping: Collect policy documents and relevant data from the repository.
   2. Data cleaning: Remove duplicates, incomplete data, and irrelevant content.
   3. Tokenization: Break down text into manageable tokens for processing.
   4. Metadata tagging: Assign descriptive metadata to documents, such as policy numbers and names.
2. *Query Interpretation*
   1. Text classification model: Implement a model to interpret user queries and classify them into predefined categories.
   2. Intent recognition: Define logic for recognizing user intents and mapping queries to relevant document categories.
3. *Document Retrieval System*
   1. Vector embeddings: Develop a document retrieval system using vector embeddings.
   2. Semantic search engine: Implement a semantic search engine to retrieve the most relevant policy documents.
4. *Fine-tuning and Evaluating the Model*
   1. Refine models based on user feedback and performance metrics (e.g., accuracy, precision, recall).
5. *User Interface and Chatbot Integration*
   1. Chatbot deployment: Integrate the chatbot into the website so users can interact with it and retrieve the documents behind applicable policies.
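The retrieval step in the checklist can be sketched with plain bag-of-words cosine similarity, a deliberately simple stand-in for the learned embeddings; the two policy snippets and their ids are hypothetical:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. A real system would
    use TF-IDF weights or a neural sentence encoder here."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-repository of policy snippets.
docs = {
    "SC-01": "Procedures for reporting a student incident to the superintendent.",
    "PP-07": "Guidelines for field trip transportation and parent consent forms.",
}

def retrieve(query: str) -> str:
    """Return the id of the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(docs[d])))

print(retrieve("how do I report an incident"))
# → SC-01
```

Swapping `embed` for a dense encoder and `docs` for a vector store changes nothing in the calling code, which is why the checklist treats embeddings and the search engine as separable steps.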

### **D. Operationalization Path**

**Web-Based Chatbot for Policy Assistance**: Teachers and staff will interact with the chatbot through a user-friendly web interface, where they can ask policy-related questions in natural language (e.g., "How do I report an incident?"). The chatbot will respond by retrieving the most relevant policy documents and offering direct answers based on a comprehensive policy repository. This system will simplify access to crucial information, making it faster and easier for users to find what they need.

## Resources

### Data Sets

* The dataset is accessible on the website linked below.

Link: https://www.bostonpublicschools.org/domain/1884

### References

1. BPS dataset. [Policies and Procedures: Superintendent's Circulars](https://www.bostonpublicschools.org/domain/1884)

## Weekly Meeting Updates

Link to Minutes of Meeting (MoM): [MoM](https://docs.google.com/document/d/1wGxGDV2dEWZpbn51u630e4QiFFTHfj-zjwMHe8bGOHs/edit?usp=sharing)

| Week | What is done? | What is next? |
| :---- | :---- | :---- |
| 1 | Team Agreement | Project Outline |
| 2 | Project Outline | Data Scraping |
| 3 | | |
| 4 | | |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |

**Research phase**

*Paper 1: LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ*
[*https://arxiv.org/pdf/2409.16779*](https://arxiv.org/pdf/2409.16779)

* **Fine-Tuning for Accuracy**: Just as the LLaMa-SciQ project fine-tuned its language model for specific STEM topics, we can fine-tune our chatbot's natural language processing (NLP) model to better understand education-related queries. Training the model on actual policy documents will make it more accurate at interpreting user questions and matching them to the right policies, so teachers and staff get more relevant results when they ask questions like, "What's the policy on student safety?"
* **Optimizing Responses Based on User Preferences**: The **Direct Preference Optimization (DPO)** technique highlighted in the paper aligns the model's responses with what users expect. We can take a similar approach by training our chatbot on which types of responses users prefer, based on past interactions. This will help it provide more helpful and context-aware answers, especially when multiple policies could apply to a single query.
* **Enhanced Document Retrieval**: Although the paper shows that Retrieval-Augmented Generation (RAG) did not always work perfectly, we can still apply the idea: our chatbot can pull relevant information from multiple policy documents when users ask complex questions. For example, if someone asks, "What's the procedure for reporting an incident and following up?", the chatbot can retrieve information from different sections of the policies and summarize it in one response.
* **Efficient Processing**: The research also shows how quantization makes the model faster without losing much accuracy. We can apply this to ensure the chatbot responds quickly, even on lower-end devices or with limited server resources. This will be especially important if the system is to scale across multiple institutions or school districts.
* **Step-by-Step Explanations**: Finally, the **Chain-of-Thought (CoT)** reasoning used in LLaMa-SciQ can improve how our chatbot handles more detailed questions. For example, if a user asks, "How do I report a student incident, and what steps follow?", the chatbot can provide a clear, step-by-step explanation based on the relevant policy documents, making sure nothing important is left out.
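The RAG-plus-CoT answering described above boils down to assembling a grounded prompt from retrieved passages before calling the model. The passage texts and instruction wording below are hypothetical, and the generation call itself is omitted:

```python
# Assemble a grounded, step-by-step prompt from retrieved policy passages.
# Passages and instruction wording are illustrative; an LLM call would
# consume the returned string.
def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(
        f"[Source {i + 1}] {p}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Think step by step and cite the sources you used.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How do I report an incident and follow up?",
    ["Incidents must be reported to the school leader within 24 hours.",
     "A written follow-up report is filed with the district office."],
)
print(prompt.count("[Source"))
# → 2
```

Keeping the instruction explicit about "ONLY the sources below" is also the first line of defense against the hallucination risk discussed under Paper 2.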

*Paper 2: Developing a Llama-Based Chatbot for CI/CD Question Answering: A Case Study at Ericsson*
[*https://arxiv.org/pdf/2408.09277*](https://arxiv.org/pdf/2408.09277)

* **Domain-Specific Corpus Creation**: Just as the CI/CD chatbot uses a domain-specific corpus built from internal documents and conversations, we can build a similar corpus from school policy documents. By extracting and preprocessing the documents, the chatbot will have a solid base for retrieving accurate information. This will ensure that when higher management or attorneys ask questions like "What laws require reporting or filing bullying incidents?", the chatbot retrieves the relevant sections from multiple documents and provides a precise answer.
* **Data Preprocessing and Contextual Compression for Improved Accuracy**: We can follow a similar approach by preprocessing policy documents: removing unnecessary formatting, linking related sections, and organizing documents into smaller, manageable chunks. This will allow the chatbot to focus on the sections of the policies relevant to the user's query. By applying techniques like contextual compression, we can minimize irrelevant information and make sure users receive precise, policy-driven answers.
* **Retrieval-Augmented Generation (RAG) for Enhanced Document Retrieval**: Similar to the RAG model used in the CI/CD chatbot, we can implement a retrieval-augmented approach to pull relevant policies for complex user queries. In situations where multiple policies or regulations may apply, such as when a user asks, "What are the steps for handling a student safety violation?", the chatbot will retrieve and consolidate information from different policy sections, ensuring a comprehensive response. Even though RAG has some limitations, it remains a powerful tool for generating accurate and contextually relevant answers in complex domains like school policies.
* **Query Rewriting for Better User Experience**: The chatbot described in the paper uses query rewriting to clarify user queries, which improves retrieval accuracy. For our chatbot, this means rephrasing vague or unclear questions posed by school staff so that they retrieve the most relevant policies. For instance, if a user asks, "How do I handle a student misconduct report?", the chatbot could rewrite the query into a more specific form, like "What are the reporting steps for student misconduct as per district guidelines?" This will improve the accuracy of the retrieved policies.
* **Error Handling and Hallucination Prevention**: The error analysis in the paper highlights the risk of language models hallucinating or providing inaccurate responses. For our policy chatbot, implementing safeguards against hallucination is essential, particularly in a legal or educational context where inaccurate answers could have significant consequences. Ensuring that the chatbot relies strictly on retrieved policy documents rather than generating unsupported content will improve trust and accuracy.
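The query-rewriting idea can be sketched with a rule-based lookup; the rewrite table below is purely illustrative, and the paper's approach would use an LLM rewriter in its place:

```python
# Map vague phrasings onto the repository's more specific policy vocabulary.
# The REWRITES table is a hypothetical stand-in for an LLM-based rewriter.
REWRITES = {
    "misconduct report": "reporting steps for student misconduct per district guidelines",
    "incident": "incident reporting procedure",
}

def rewrite_query(query: str) -> str:
    """Return a more specific query when a vague phrase is recognized,
    otherwise pass the (normalized) query through unchanged."""
    q = query.lower().rstrip("?")
    for vague, specific in REWRITES.items():
        if vague in q:
            return specific
    return q

print(rewrite_query("How do I handle a student misconduct report?"))
# → reporting steps for student misconduct per district guidelines
```

Even this crude version shows the payoff: the rewritten query shares far more vocabulary with the target policy text than the user's original wording did.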

*Paper 3: FACTS About Building Retrieval Augmented Generation-based Chatbots*
[*https://arxiv.org/pdf/2407.07858*](https://arxiv.org/pdf/2407.07858)

* **Hybrid Search Implementation**: The paper highlights the advantages of combining lexical and vector-based search methods. Integrating this hybrid search into our chatbot could significantly improve the accuracy and relevance of retrieved school policy documents, ensuring users quickly find the information they need.
* **Agentic Architectures for Complex Queries**: The paper emphasizes the need for agentic architectures capable of decomposing complex queries. This will be valuable for our chatbot when users pose multifaceted questions about policy documents, enabling it to provide precise and context-aware answers.
* **Fine-Tuning Techniques**: The paper presents several fine-tuning strategies for large language models (LLMs), such as prompt engineering and parameter-efficient training. Applying these techniques will let us tailor the chatbot's responses to the specific needs and language of school policy documents.
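Hybrid search ultimately comes down to score fusion. A minimal sketch, assuming per-document lexical and vector scores are already available (the scores, document ids, and the 50/50 weighting are illustrative values):

```python
# Fuse a lexical (keyword) score and a vector-similarity score per document
# via min-max normalization and a weighted sum. Weights are assumptions.
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(lexical: dict[str, float], vector: dict[str, float],
                w_lex: float = 0.5) -> list[str]:
    """Rank document ids by a weighted sum of normalized scores."""
    lex, vec = min_max_normalize(lexical), min_max_normalize(vector)
    fused = {d: w_lex * lex[d] + (1 - w_lex) * vec[d] for d in lex}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores: BM25-style on the left, cosine-style on the right.
lexical = {"SC-01": 3.2, "PP-07": 1.1, "PP-09": 0.0}
vector = {"SC-01": 0.41, "PP-07": 0.78, "PP-09": 0.10}
print(hybrid_rank(lexical, vector))
# → ['SC-01', 'PP-07', 'PP-09']
```

Normalizing before fusing matters because BM25-style scores and cosine similarities live on different scales; without it one signal silently dominates.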

*Paper 4: RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation*
[*https://arxiv.org/pdf/2404.00610*](https://arxiv.org/pdf/2404.00610)

**Query Refinement Mechanism**: RQ-RAG introduces a method for refining user queries before they are sent to the retrieval system. By incorporating a query refinement layer in our chatbot, we can ensure that ambiguous or complex user queries about public policy documents are clarified and simplified. This refinement can lead to more effective retrieval of relevant documents, as the system will better understand the user's intent.

**Multi-Step Query Processing**: The multi-step processing framework described allows the model to iteratively improve the quality of queries before document retrieval. We can apply this technique to create a multi-turn interaction model in which the chatbot asks clarifying questions, ensuring it retrieves the most relevant policy documents. For example, if a user inquires about procedures for a meeting, the chatbot can ask for specifics, such as the type of meeting (e.g., parent-teacher), before retrieving documents.

**Dual-Retriever Framework**: RQ-RAG uses two retrievers: one for coarse retrieval and one for refined, context-aware retrieval. The coarse retriever initially fetches a broad set of documents based on the user's input. The system then passes these documents and the original query to a fine retriever, which uses the additional context to refine the search and retrieve more relevant documents. This involves training both retrievers, one to handle general searches and one to handle refined, context-sensitive queries. We can integrate this approach to enhance the chatbot's ability to fetch accurate documents: first retrieve a broader set, then refine the query based on the initial results.
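The coarse-then-fine pattern can be sketched with two toy scorers standing in for the trained retrievers described in RQ-RAG; the corpus and scoring heuristics are hypothetical:

```python
# Two-stage retrieval: a cheap coarse pass narrows the candidate pool, then
# a finer scorer re-ranks only the survivors. Both scorers are toy stand-ins
# for trained retrievers; the document corpus is hypothetical.
import re

docs = {
    "SC-01": "incident reporting procedure for student safety",
    "SC-02": "superintendent circular on transportation schedules",
    "PP-07": "student incident follow-up and parent notification",
    "PP-08": "cafeteria vendor contract renewal policy",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def coarse_retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 1: keep the k docs sharing the most words with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(docs[d])), reverse=True)[:k]

def fine_rerank(query: str, candidates: list[str]) -> str:
    """Stage 2: re-score candidates by overlap *ratio*, rewarding docs that
    are about the query rather than merely mentioning it."""
    q = tokens(query)
    return max(candidates,
               key=lambda d: len(q & tokens(docs[d])) / len(tokens(docs[d])))

best = fine_rerank("student incident report",
                   coarse_retrieve("student incident report"))
print(best)
# → SC-01
```

The division of labor mirrors the paper's design: the coarse stage only has to be recall-oriented, so the expensive context-aware scoring runs on a handful of candidates instead of the whole repository.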

*Open-source project:*

1. RAG Chatbot:
[https://github.com/umbertogriffo/rag-chatbot](https://github.com/umbertogriffo/rag-chatbot)
The rag-chatbot repository offers several components that can significantly enhance our chatbot project for BPS:
* **Retrieval-Augmented Generation (RAG)**: This approach retrieves relevant document sections based on user queries and improves response accuracy. Using it in our project will enhance how the chatbot pulls policy documents.
* **Memory Indexing**: We can enable faster, more accurate document retrieval by chunking documents and storing their embeddings in a vector database like **Chroma**.
* **Context Handling**: Methods like **Hierarchical Summarization** and **Create and Refine** can help our chatbot handle large or complex queries across multiple documents.
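The chunking step behind memory indexing can be sketched as follows; the chunk size and overlap are arbitrary illustrative values, and the embedding and Chroma storage steps are omitted:

```python
# Split a document into overlapping word-window chunks ready for embedding.
# chunk_size and overlap are illustrative; tune them for real policy text.
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # requires chunk_size > overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the final window already covers the end of the text
    return chunks

# A synthetic 120-word document: yields windows 0-49, 40-89, and 80-119.
doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0].split()))
# → 3 50
```

The 10-word overlap keeps sentences that straddle a chunk boundary retrievable from both sides, which matters for policies whose key clause sits mid-paragraph.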

The research papers above and one open-source project reference have been read thoroughly, with a clear explanation of how some of their methods and techniques can be used in our chatbot implementation.

**Contributors:**

1. Akshat Gurbuxani
2. Abhaya Shukla
3. Akuraju Mounika Chowdary
4. Duoduo Xu