Skip to content

prasannakotyal/IntelliDocs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

IntelliDocs: Intelligent Document Search and Summarization

IntelliDocs is a powerful tool that leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to perform intelligent document search and summarization. Given a user query, IntelliDocs retrieves relevant documents from a corpus, extracts pertinent information, and generates concise summaries, enabling efficient knowledge discovery.

Features

  • Intelligent Search: Utilizes semantic search powered by Sentence Transformers and FAISS for accurate document retrieval.
  • Contextual Summarization: Generates abstractive summaries using pre-trained transformer models (e.g., BART).
  • Handles Various Document Formats: Supports text files (.txt) and provides a framework for extending to PDF and image inputs through added preprocessing steps.
  • Modular Design: Clear separation of concerns with modules for data processing, indexing, retrieval, and summarization.
  • Easy to Use: Simple command-line interface for interacting with the system.

Project Structure

Project structure

Installation

  1. Clone the repository:

    git clone https://github.com/prasannakotyal/IntelliDocs.git
    cd intelligent-doc-search-summarization
  2. Create and activate a virtual environment (recommended):

    python3 -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

    Make sure to install faiss-cpu if you do not have a supported GPU.

Usage

  1. Place your documents: Add the documents you want to search within the data/sample_documents/ directory. Currently, .txt files are supported.

  2. Run the main script:

    python src/main.py
  3. Enter your query: When prompted, enter your search query.

Sample Output

Scenario: Searching through sample documents about climate change.

Query: What are the impacts of rising sea levels?

Output: Relevant documents: ['doc2.txt', 'doc1.txt']

Generated summary: Rising sea levels are a major consequence of climate change, threatening coastal communities with flooding, erosion, and saltwater intrusion. Low-lying areas face displacement risks.

Retrieval time: 0.0123 seconds Summarization time: 1.4567 seconds

Note: The output will vary depending on the documents in your data/sample_documents directory. The provided example is for illustrative purposes.

Extending to PDF and Image Input

  • PDF Handling:

    • You'll need to use libraries like:
    • PyPDF2: For basic PDF manipulation and text extraction (if the PDF has selectable text).
    • PyMuPDF (fitz): A more powerful library for PDF handling, including text extraction and even image extraction from PDFs.
    • textract: A library that can handle various document formats, including PDFs, using other libraries under the hood.
  • Image Handling:

    • Libraries:
    • Tesseract OCR: A popular open-source OCR engine.
    • pytesseract: A Python wrapper for Tesseract OCR.
    • OpenCV (cv2): Often used in conjunction with OCR for image pre-processing (resizing, thresholding, noise reduction) to improve OCR accuracy.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.

About

Intelligent Document Sumarization Using LLMs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages