Skip to content

nimbis/datasheet-scrubber

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FASoC Datasheet-Scrubber

The FASoC Datasheet Scrubbet is a utility that scrubs through large sets of PDF datasheets/documents in order to extract key circuit information. The information gathered is used to build a database of commercial off-the-shelf (COTS) IP that can be used to build larger SoC in the FASoC design. More information here You can do Datasheet Scrubbing by running Datasheet_Scrubbing.py, which you can input a datasheet (between one of ADC, CDC, DCDC, PLL, LDO, SRAM, Temperature Sensor, BDRT, Counters, DAC, Delay_Line, Digital Potentiometers, DSP, IO, Opamp categories) and observe the extracted specs and pins. Instruction steps of each of these would be as follows:

Environment

We need python 3.7 for this part. Moreover we need these libraries that you can use pip to install in PowerShell/terminal.

  • install and upgrade pip: python -m pip install --upgrade pip
  • pip install pandas
  • pip install -U scipy
  • pip install matplot
  • pip install matplotlib
  • pip install pdfminer.six
  • pip install pypdf2
  • pip install request
  • pip install lxml
  • pip install tabula-py
  • pip install sklearn
  • pip install regex
  • pip install keras
  • pip install tensorflow
  • pip install pdf2image
  • pip install pillow
  • pip install pytesseract
  • pip install -U numpy
  • pip install opencv-python

Here we propose two different approaches:

  • Categorition using Bag of words, text extraction using regular expression, and table extraction using tabula (please see here for more information)
  • Categorition, text extraction, and table extraction using Convolutional neural network (CNN) (please see here for more information)
    • If you want to test the CNN part, we need more softwares:
      • Poppler
      • Tesseract
      • Visual Studio 2017

Getting Started

These are steps for compiling codes:

  1. Clone the datasheet-scrubber repository
git clone 
  1. Go to here and download (All_pdf.zip should be downloaded)

  2. Run make init which runs Initializer.py. It will ask you to type All_pdf.zip directory, your work directory, and your code dirrectory (for datasheet scrubbing) that you have just cloned. After running initializer.py you should see something like this.

  1. All_pdf, All_text, cropped_pdf, and cropped_text are training directories. For addign more files to training set put your labeled pdf files in All_pdf directory (it means put ADC datasheets in ADC folder inside All_pdf, CDC datasheet in CDC folder inside All_pdf and so on).

  2. Run make categorizer which runs test_confusion_matrix.py and shows the confusion matrix of our categorizer on whole dataset.

  3. Put pdf files of datasheets that you want to test in Test_pdf folder and please email them to [email protected] in order to have a better repository.

  4. Run make extraction which runs Datasheet_Scrubbing.py which you can input a datasheet (between one of ADC, BDRT, CDC, counters, DAC, DCDC, Delay_Line, Digital_Potentiometers, DSP, IO, LDO, Opamp, PLL, SRAM, Temperature Sensor categories) and observe the extracted specs and pins.

Contributing

Extracted datasheets can be emailed to [email protected] in order build a bigger repository.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%