PYOSTIE (Python Open Source Text Information Extractor)

Pavan-Bellam/Pyostie

 
 

About The Project

PYOSTIE is short for Python Open Source Text Information Extractor.

A very elegant and simple library to extract text from many file formats.

This module can extract text from PDFs, Office files, text files, and image files. It can also generate an Excel file that gives you deeper insights into the text. Insights are currently available only for image and PDF formats. (More to come soon.)

Installation

  1. Clone the repo
git clone https://github.com/anirudhpnbb/Pyostie.git
  2. Install using pip or pip3

pip3 install Pyostie

(or)

pip install Pyostie
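
To confirm that the install worked before running any extraction, a quick check along these lines can help. This is a minimal sketch: `is_installed` is a hypothetical helper name, and the module name `pyostie` is taken from the `import pyostie` line in the Usage section.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the package is visible to this Python interpreter.
    print("pyostie installed:", is_installed("pyostie"))
```

If this reports False right after installing, pip and the interpreter you are running are probably tied to different Python environments.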

Usage

import pyostie

For image files with insights:

output = pyostie.extract(filename, insights=True, extension="jpg")  # extension can also be "tif" or "png"
df, text = output.start()

For image files without insights:

output = pyostie.extract(filename, insights=False, extension="jpg")
text = output.start()
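
In the image examples, `start()` yields a `(df, text)` pair when insights are requested and only the text otherwise. If the calling code should not care which mode was used, the result can be normalized first. The sketch below assumes exactly those two return shapes; `split_result` is a hypothetical helper and not part of Pyostie.

```python
from typing import Any, Optional, Tuple


def split_result(result: Any) -> Tuple[Optional[Any], Any]:
    """Normalize the value returned by start() into a (df, text) pair.

    Assumption: the result is either a 2-tuple (df, text) or the text alone.
    When there are no insights, df is returned as None.
    """
    if isinstance(result, tuple) and len(result) == 2:
        df, text = result
        return df, text
    return None, result
```

With this in place, `df, text = split_result(output.start())` works for both calls, and `df is None` indicates that no insights were produced.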

For PDF files:

output = pyostie.extract(filename, extension="pdf")
text = output.start()

For PDF files with insights:

output = pyostie.extract(filename, insights=True, extension="pdf")
text = output.start()

For Excel files:

output = pyostie.extract(filename, extension="xlsx")
text = output.start()

For Word files:

image_folder (optional): path of the folder where extracted images should be written

output = pyostie.extract(filename, image_folder, extension="docx")
text = output.start()

For audio files:

output = pyostie.extract(filename, extension="mp3")
text = output.start()
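
Every call above passes `extension` explicitly. When processing many files, it may be convenient to derive it from the file name instead. The following is a minimal sketch, not part of Pyostie: `guess_extension` and `extract_text` are hypothetical helpers, and it assumes that `pyostie.extract` accepts the lowercase suffix without the dot, as in the examples above.

```python
from pathlib import Path


def guess_extension(filename: str) -> str:
    """Return the lowercase file suffix without the leading dot, e.g. 'jpg'."""
    suffix = Path(filename).suffix
    if not suffix:
        raise ValueError(f"cannot infer extension from {filename!r}")
    return suffix[1:].lower()


def extract_text(filename: str, **kwargs):
    """Hypothetical wrapper: infer the extension, then call pyostie as shown above.

    Extra keyword arguments (for example insights=True) are passed through
    unchanged to pyostie.extract.
    """
    import pyostie  # imported lazily so guess_extension works without the package

    extension = guess_extension(filename)
    return pyostie.extract(filename, extension=extension, **kwargs).start()
```

For example, `extract_text("scan.JPG", insights=True)` would call `pyostie.extract` with `extension="jpg"`. Whether a given suffix is actually supported is up to the library, so unsupported formats will still fail inside Pyostie.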

Future Works

In this version, we can only extract text from PDFs, Excel, TXT, CSV and MP3 formats. Soon, we will be adding doc, ppt, pptx, and many more. Watch this space for more updates.

Contact

Anirudh Palaparthi - @anirudh8889 - pnbbanirudh - [email protected]

Project Link: PYOSTIE
