IntelliDocs is a powerful tool that leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to perform intelligent document search and summarization. Given a user query, IntelliDocs retrieves relevant documents from a corpus, extracts pertinent information, and generates concise summaries, enabling efficient knowledge discovery.
- Intelligent Search: Utilizes semantic search powered by Sentence Transformers and FAISS for accurate document retrieval.
- Contextual Summarization: Generates abstractive summaries using pre-trained transformer models (e.g., BART).
- Handles Various Document Formats: Supports text files (
.txt
) and provides a framework for extending to PDF and image inputs through added preprocessing steps. - Modular Design: Clear separation of concerns with modules for data processing, indexing, retrieval, and summarization.
- Easy to Use: Simple command-line interface for interacting with the system.
-
Clone the repository:
git clone https://github.com/prasannakotyal/IntelliDocs.git cd intelligent-doc-search-summarization
-
Create and activate a virtual environment (recommended):
python3 -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate
-
Install dependencies:
pip install -r requirements.txt
Make sure to install
faiss-cpu
if you do not have a supported GPU.
-
Place your documents: Add the documents you want to search within the
data/sample_documents/
directory. Currently,.txt
files are supported. -
Run the main script:
python src/main.py
-
Enter your query: When prompted, enter your search query.
Scenario: Searching through sample documents about climate change.
Query: What are the impacts of rising sea levels?
Output: Relevant documents: ['doc2.txt', 'doc1.txt']
Generated summary: Rising sea levels are a major consequence of climate change, threatening coastal communities with flooding, erosion, and saltwater intrusion. Low-lying areas face displacement risks.
Retrieval time: 0.0123 seconds Summarization time: 1.4567 seconds
Note: The output will vary depending on the documents in your data/sample_documents
directory. The provided example is for illustrative purposes.
-
PDF Handling:
- You'll need to use libraries like:
PyPDF2
: For basic PDF manipulation and text extraction (if the PDF has selectable text).PyMuPDF (fitz)
: A more powerful library for PDF handling, including text extraction and even image extraction from PDFs.textract
: A library that can handle various document formats, including PDFs, using other libraries under the hood.
-
Image Handling:
- Libraries:
Tesseract OCR
: A popular open-source OCR engine.pytesseract
: A Python wrapper for Tesseract OCR.OpenCV
(cv2): Often used in conjunction with OCR for image pre-processing (resizing, thresholding, noise reduction) to improve OCR accuracy.
Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.