GitHub - aryanndwi123/Data-Table-Extraction

PDF Data Extraction and Rapid Prototyping

This project aims to extract key-value pairs from PDF files and images (such as JPG, JPEG, PNG, GIF, BMP, and WebP) using Optical Character Recognition (OCR). The extracted data is then saved to a CSV file for further analysis and processing. The project provides a web interface for users to upload files and view the extracted data.

Prerequisites

Before running the project, make sure you have the following dependencies installed:

Python (version 3.x)
Flask
pytesseract
pdf2image
pdfplumber
Pillow

You also need Tesseract OCR installed on your system. Tesseract is an open-source OCR engine used for text recognition. You can download and install Tesseract from the following link:
Tesseract OCR

Make sure to install the appropriate version of Tesseract for your operating system.

Once you have installed Tesseract, you may need to set the Tesseract path in the script.py file if it's not in the default system path. Locate the following line in script.py:

pytesseract.pytesseract.tesseract_cmd = 'tesseract'

Replace 'tesseract' with the full path to the Tesseract executable on your system.
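If you are unsure whether Tesseract is on the system path, a small check can tell you before you edit script.py. The sketch below is not part of the project; resolve_tesseract_cmd is a hypothetical helper name, and it relies only on the standard library's shutil.which.

```python
import shutil


def resolve_tesseract_cmd(candidates=("tesseract",)):
    """Return the first candidate that resolves to an executable, else None.

    Each candidate may be a bare command name (looked up on PATH) or a
    full path to the executable. This helper is hypothetical and is not
    defined in script.py.
    """
    for candidate in candidates:
        path = shutil.which(candidate)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = resolve_tesseract_cmd()
    if found is None:
        print("Tesseract not found on PATH; set the full path in script.py.")
    else:
        print(f"Tesseract found at: {found}")
```

The returned value is what you would assign to pytesseract.pytesseract.tesseract_cmd in script.py.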

You can install the required Python packages by running the following command:

pip install flask pytesseract pdf2image pdfplumber Pillow

Project Structure

The project consists of the following files:

app.py: This file contains the Flask application code, including the routes for handling file uploads and displaying the extracted data.
script.py: This file contains the functions for extracting key-value pairs from PDF files and images using OCR. It also includes a function to save the extracted data to a CSV file.
index.html: This HTML template defines the file upload form for users to submit their files.
display.html: This HTML template displays the extracted data in a table format.
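The README does not show how script.py turns OCR text into key-value pairs, so the following is only a minimal sketch of one plausible approach. It assumes each pair appears on its own line as "Key: Value"; the function names parse_key_value_pairs and write_pairs_csv and the "Key"/"Value" column headers are hypothetical and may differ from the actual code.

```python
import csv
import io


def parse_key_value_pairs(text, separator=":"):
    """Split OCR text into (key, value) pairs, one pair per line.

    Assumption: pairs look like "Key: Value". Lines without the
    separator, or with an empty key or value, are skipped.
    """
    pairs = []
    for line in text.splitlines():
        key, sep, value = line.partition(separator)
        key, value = key.strip(), value.strip()
        if sep and key and value:
            pairs.append((key, value))
    return pairs


def write_pairs_csv(pairs, handle):
    """Write the pairs to an open text handle as a two-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["Key", "Value"])  # hypothetical header names
    writer.writerows(pairs)


if __name__ == "__main__":
    sample = "Invoice No: 1042\nstray OCR noise\nTotal: 99.50"
    buffer = io.StringIO()
    write_pairs_csv(parse_key_value_pairs(sample), buffer)
    print(buffer.getvalue())
```

To write to a real file instead of an in-memory buffer, open it with newline="" as the csv module documentation recommends.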

Usage

Run the Flask application by executing the following command:

python app.py

Open a web browser and navigate to http://localhost:5000 to access the file upload page.
Click on the "Choose File" button and select the PDF file or image file you want to extract data from.
Click the "Upload" button to submit the file. The application will extract the key-value pairs and save them to a CSV file.
After the extraction is complete, you will be redirected to the data display page, where you can view the extracted data in a table format.
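Because the application accepts both PDFs and several image types, the upload handler presumably has to decide which extraction path to use. The sketch below shows one way to do that from the file extension alone; classify_upload and IMAGE_EXTENSIONS are hypothetical names, and app.py may handle this differently.

```python
from pathlib import Path

# Image types listed in the project description.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def classify_upload(filename):
    """Return "pdf", "image", or None for an unsupported file name.

    The check is case-insensitive and looks only at the extension,
    not at the file contents.
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


if __name__ == "__main__":
    for name in ("Invoice.PDF", "scan.webp", "notes.txt"):
        print(name, "->", classify_upload(name))
```

An extension check is a convenience, not a security measure; a file can carry any extension regardless of what it contains.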

Note: The extracted data will be saved in the extracted_data.csv file in the project directory.
