IntelliDocs: Intelligent Document Search and Summarization

IntelliDocs uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to search and summarize documents. Given a user query, it retrieves the relevant documents from a corpus, extracts the pertinent information, and generates a concise summary, so you can find what you need without reading every file.

Features

Intelligent Search: Utilizes semantic search powered by Sentence Transformers and FAISS for accurate document retrieval.
Contextual Summarization: Generates abstractive summaries using pre-trained transformer models (e.g., BART).
Handles Various Document Formats: Supports text files (.txt) and provides a framework for extending to PDF and image inputs through added preprocessing steps.
Modular Design: Clear separation of concerns with modules for data processing, indexing, retrieval, and summarization.
Easy to Use: Simple command-line interface for interacting with the system.
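
To make the retrieval idea concrete, here is a minimal sketch of "embed the query, embed the documents, rank by similarity". It is not the project's code: a toy bag-of-words vector stands in for Sentence Transformers embeddings, and a brute-force cosine comparison stands in for a FAISS index. The function names (`embed`, `cosine`, `retrieve`) and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the names of the k documents most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]


# Hypothetical sample corpus, keyed by file name.
docs = {
    "doc1.txt": "Climate change raises sea temperatures and melts glaciers.",
    "doc2.txt": "Rising sea levels threaten coastal communities with flooding.",
    "doc3.txt": "Solar panels convert sunlight into electricity.",
}
top = retrieve("impacts of rising sea levels", docs, k=2)
print("Relevant documents:", top)
```

Word counts only match exact terms, which is the limitation that dense sentence embeddings are meant to address; the ranking logic around them stays the same.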

Project Structure

Installation

Clone the repository:

git clone https://github.com/prasannakotyal/IntelliDocs.git
cd IntelliDocs

Create and activate a virtual environment (recommended):

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```
Make sure to install faiss-cpu if you do not have a supported GPU.

Usage

Place your documents: Add the documents you want to search to the data/sample_documents/ directory. Currently, only .txt files are supported.
Run the main script:
```
python src/main.py
```
Enter your query: When prompted, enter your search query.
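
As a rough illustration of the first step, this is one way the .txt files in a directory could be collected before indexing. The helper name `load_documents` is hypothetical and is not taken from the repository; only the data/sample_documents/ path comes from the instructions above.

```python
from pathlib import Path


def load_documents(directory):
    """Read every .txt file in `directory` into a {filename: text} dict."""
    docs = {}
    # sorted() keeps the order stable across runs and platforms.
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" avoids crashing on files that are not valid UTF-8.
        docs[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

Calling `load_documents("data/sample_documents")` would return one entry per text file and silently skip any other file type.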

Sample Output

Scenario: Searching through sample documents about climate change.

Query: What are the impacts of rising sea levels?

Output: Relevant documents: ['doc2.txt', 'doc1.txt']

Generated summary: Rising sea levels are a major consequence of climate change, threatening coastal communities with flooding, erosion, and saltwater intrusion. Low-lying areas face displacement risks.

Retrieval time: 0.0123 seconds
Summarization time: 1.4567 seconds

Note: The output will vary depending on the documents in your data/sample_documents directory. The provided example is for illustrative purposes.
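
Timings like the ones in the sample output can be measured with a small wrapper around each stage. This is only a sketch of the idea: the `timed` helper is hypothetical, and `sorted` is used here as a stand-in for the real retrieval call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for the real retrieval step; any callable works.
result, elapsed = timed(sorted, ["doc2.txt", "doc1.txt"])
print(f"Retrieval time: {elapsed:.4f} seconds")
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it is monotonic and has higher resolution.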

Extending to PDF and Image Input

PDF Handling: You'll need to use libraries like:
- PyPDF2: For basic PDF manipulation and text extraction (if the PDF has selectable text).
- PyMuPDF (fitz): A more powerful library for PDF handling, including text extraction and even image extraction from PDFs.
- textract: A library that can handle various document formats, including PDFs, using other libraries under the hood.

Image Handling: You'll need an OCR engine and supporting libraries:
- Tesseract OCR: A popular open-source OCR engine.
- pytesseract: A Python wrapper for Tesseract OCR.
- OpenCV (cv2): Often used in conjunction with OCR for image pre-processing (resizing, thresholding, noise reduction) to improve OCR accuracy.
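
One possible shape for such a preprocessing step is a dispatcher that picks an extractor by file extension. The sketch below is an assumption about how this could be wired up, not code from the repository: the function names and the `EXTRACTORS` table are hypothetical. PyMuPDF, OpenCV, and pytesseract are imported lazily, so a .txt-only setup does not need them installed; the image path also requires the Tesseract binary on the system.

```python
from pathlib import Path


def extract_txt(path):
    """Plain text: read the file as-is."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def extract_pdf(path):
    """PDF with selectable text: concatenate the text of every page."""
    import fitz  # PyMuPDF, imported only when a PDF is actually processed

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_image(path):
    """Image: grayscale + Otsu threshold, then OCR with Tesseract."""
    import cv2
    import pytesseract

    image = cv2.imread(str(path))
    if image is None:
        raise ValueError(f"Could not read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)


# Hypothetical mapping from file extension to extractor.
EXTRACTORS = {
    ".txt": extract_txt,
    ".pdf": extract_pdf,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
}


def extract_text(path):
    """Return the text content of a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return extractor(path)
```

With this layout, supporting another format means adding one function and one table entry, while the indexing and summarization stages keep receiving plain strings.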

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.
