FASoC Datasheet-Scrubber

The FASoC Datasheet Scrubbet is a utility that scrubs through large sets of PDF datasheets/documents in order to extract key circuit information. The information gathered is used to build a database of commercial off-the-shelf (COTS) IP that can be used to build larger SoC in the FASoC design. More information here You can do Datasheet Scrubbing by running Datasheet_Scrubbing.py, which you can input a datasheet (between one of ADC, CDC, DCDC, PLL, LDO, SRAM, Temperature Sensor, BDRT, Counters, DAC, Delay_Line, Digital Potentiometers, DSP, IO, Opamp categories) and observe the extracted specs and pins. Instruction steps of each of these would be as follows:

Environment

We need python 3.7 for this part. Moreover we need these libraries that you can use pip to install in PowerShell/terminal.

install and upgrade pip: python -m pip install --upgrade pip
pip install pandas
pip install -U scipy
pip install matplot
pip install matplotlib
pip install pdfminer.six
pip install pypdf2
pip install request
pip install lxml
pip install tabula-py
pip install sklearn
pip install regex
pip install keras
pip install tensorflow
pip install pdf2image
pip install pillow
pip install pytesseract
pip install -U numpy
pip install opencv-python

Here we propose two different approaches:

Categorition using Bag of words, text extraction using regular expression, and table extraction using tabula (please see here for more information)
Categorition, text extraction, and table extraction using Convolutional neural network (CNN) (please see here for more information)
- If you want to test the CNN part, we need more softwares:
  - Poppler
  - Tesseract
  - Visual Studio 2017

Getting Started

These are steps for compiling codes:

Clone the datasheet-scrubber repository

git clone

Go to here and download (All_pdf.zip should be downloaded)
Run make init which runs Initializer.py. It will ask you to type All_pdf.zip directory, your work directory, and your code dirrectory (for datasheet scrubbing) that you have just cloned. After running initializer.py you should see something like this.

All_pdf, All_text, cropped_pdf, and cropped_text are training directories. For addign more files to training set put your labeled pdf files in All_pdf directory (it means put ADC datasheets in ADC folder inside All_pdf, CDC datasheet in CDC folder inside All_pdf and so on).
Run make categorizer which runs test_confusion_matrix.py and shows the confusion matrix of our categorizer on whole dataset.
Put pdf files of datasheets that you want to test in Test_pdf folder and please email them to [email protected] in order to have a better repository.
Run make extraction which runs Datasheet_Scrubbing.py which you can input a datasheet (between one of ADC, BDRT, CDC, counters, DAC, DCDC, Delay_Line, Digital_Potentiometers, DSP, IO, LDO, Opamp, PLL, SRAM, Temperature Sensor categories) and observe the extracted specs and pins.

Contributing

Extracted datasheets can be emailed to [email protected] in order build a bigger repository.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FASoC Datasheet-Scrubber

Environment

Getting Started

Contributing

About

Releases

Packages

Languages

License

nimbis/datasheet-scrubber

Folders and files

Latest commit

History

Repository files navigation

FASoC Datasheet-Scrubber

Environment

Getting Started

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages