PYOSTIE is short for Python Open Source Text Information Extractor.
An elegant, simple library for extracting text from many file formats.
This module can extract text from PDFs, Office files, text files, and image files. We also generate an Excel file that gives you deeper insights into the text. Insights are currently extracted only for image and PDF formats. (More to come soon.)
- Clone the repo
git clone https://github.com/anirudhpnbb/Pyostie.git
- Install using pip or pip3
pip3 install Pyostie
(or)
pip install Pyostie
import pyostie
output = pyostie.extract(filename, insights=True, extension="jpg")  # extension can also be "tif" or "png"
df, text = output.start()
output = pyostie.extract(filename, insights=False, extension="jpg")
text = output.start()
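The two image examples above show `start()` returning a `(df, text)` pair when `insights=True` and the text alone otherwise. A small helper can normalize both shapes so calling code does not need to branch. This is only a sketch: `unpack_result` is a hypothetical name, not part of Pyostie, and it assumes the plain-text result is never itself a 2-item tuple.

```python
def unpack_result(result):
    """Return (insights, text) for either return shape of output.start().

    Assumption: with insights=True the result is a 2-item tuple (df, text);
    otherwise it is the extracted text alone.
    """
    if isinstance(result, tuple) and len(result) == 2:
        insights, text = result
        return insights, text
    # No insights were requested, so only the text came back.
    return None, result
```

With this in place, `df, text = unpack_result(output.start())` works for both calls, and `df` is `None` when insights were not requested.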
output = pyostie.extract(filename, extension="pdf")
text = output.start()
output = pyostie.extract(filename, insights=True, extension="pdf")
text = output.start()
output = pyostie.extract(filename, extension="xlsx")
text = output.start()
image_folder (optional): Path of the folder where extracted images are written
output = pyostie.extract(filename, image_folder, extension="docx")
text = output.start()
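It is not stated whether Pyostie creates `image_folder` when it does not exist, so it may be safer to create it before calling `extract`. The helper below is a sketch under that assumption; `ensure_folder` is a hypothetical name and is not part of Pyostie.

```python
from pathlib import Path


def ensure_folder(image_folder):
    """Create image_folder (and any missing parents) and return it as a string."""
    path = Path(image_folder)
    # exist_ok=True makes the call safe when the folder is already there.
    path.mkdir(parents=True, exist_ok=True)
    return str(path)
```

The returned path can then be passed as the second argument, for example `pyostie.extract(filename, ensure_folder("extracted_images"), extension="docx")`, where `"extracted_images"` is just an illustrative folder name.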
output = pyostie.extract(filename, extension="mp3")
text = output.start()
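Every call above passes the `extension` explicitly. When file names carry a reliable suffix, the argument could be derived instead of typed by hand. The sketch below assumes that the lowercase suffix without the dot (such as "pdf" or "jpg") is the value Pyostie expects; `guess_extension` and `extract_text` are hypothetical helpers, not part of the library.

```python
from pathlib import Path


def guess_extension(filename):
    """Derive the `extension` argument from a file name, e.g. 'scan.JPG' -> 'jpg'."""
    return Path(filename).suffix.lstrip(".").lower()


def extract_text(filename, insights=False):
    """Call pyostie.extract with the extension inferred from the file name."""
    # Imported lazily so guess_extension can be used without Pyostie installed.
    import pyostie

    extension = guess_extension(filename)
    if not extension:
        raise ValueError(f"cannot infer extension from {filename!r}")
    # Assumption: the inferred suffix is one of the formats Pyostie supports.
    return pyostie.extract(filename, insights=insights, extension=extension).start()
```

The helper does not cover the `image_folder` argument used for "docx" files; that case still needs the explicit call shown earlier.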
In this version, we can only extract text from PDFs, Excel, TXT, CSV and MP3 formats. Soon, we will be adding doc, ppt, pptx, and many more. Watch this space for more updates.
Anirudh Palaparthi - @anirudh8889 - pnbbanirudh
Balaram Guddanti
Project Link: PYOSTIE