Add company dataset abstraction #218
base: master
Conversation
This commit adds an abstraction to build the company dataset. It includes:
- Interface to download data from Brasil.IO (Receita Federal CNPJ dataset)
- Interface to get the CNAE (economic activity) description from the IBGE website
- Interface to get geo-coordinates from Open Street Maps's Nominatim API
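As a reference point, the Nominatim part of that abstraction could be sketched roughly as below. The function names and the single-result parsing are illustrative assumptions, not the PR's actual code:

```python
from urllib.parse import urlencode

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def nominatim_query(address):
    """Build a Nominatim search URL for a company's address."""
    params = {"q": address, "format": "json", "limit": 1}
    return f"{NOMINATIM_URL}?{urlencode(params)}"


def parse_coordinates(results):
    """Extract (lat, lon) floats from a decoded Nominatim JSON response."""
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Note that the public Nominatim instance rate-limits requests, which is relevant to the throughput discussion further down.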
Force-pushed from 26dacda to a13471d (compare)
Just generated a new companies file with 125k different companies (the one currently in use has 60k). This will not work on Jarbas just yet, but I'll open a PR over there. Meanwhile, can you upload this file to our DigitalOcean Spaces? I don't have access anymore ; ) cc @sergiomario
@cuducos I can't access the file through the link you suggested. I'm being directed to a branch comparison page on GitHub.
Sorry, my bad! Fixed the link over there, but just in case: https://www.dropbox.com/s/8kxt3szujr3tksb/2019-11-19-companies.csv.xz?dl=0
Uploaded to the project storage at DigitalOcean Spaces.
@cuducos are there any drawbacks to 'converting' the SQLite db into the 'companies' files?
I opted for setting up a DB and querying it in order to have a smaller file. Postgres is already a bottleneck for Jarbas's performance, and updating Jarbas's database is another bottleneck. Thus I don't think using a 30+ GB database dump (that could be filtered down to a few MB) would be the better choice… and I'm not sure we have enough disk space for that in production. Also, Rosie already has a memory bottleneck, and loading the full dump would make this memory bottleneck even tighter. Does that make sense or am I over-engineering that?
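The "query a local DB instead of loading the full dump" idea can be sketched with the standard library. The `company(cnpj, name)` schema here is hypothetical, just to show why the output stays small:

```python
import sqlite3


def fetch_companies(conn, cnpjs):
    """Select only the companies whose CNPJ appears in reimbursements,
    so the generated file stays a few MB (hypothetical schema)."""
    placeholders = ", ".join("?" for _ in cnpjs)
    sql = f"SELECT cnpj, name FROM company WHERE cnpj IN ({placeholders})"
    return conn.execute(sql, list(cnpjs)).fetchall()
```

Filtering at query time means neither the 30+ GB dump nor the full table ever has to fit in memory.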
It does make sense. I felt I was missing the bigger picture and your answer clarified that.
@luizfzs do you think you can handle the PR on Jarbas? I can help with that. Basically what is needed is to adapt the
@cuducos well, I'd like to give it a try.
I just suggested a roadmap at okfn-brasil/serenata-de-amor#509 (comment) ; )
I couldn't finish this contribution, but I believe it doesn't make sense to finish it anymore — at least not in the direction I started. Now we can use the minhareceita.org API and maybe keep only the Open Street Maps's Nominatim API drafted here. Gonna try to substantially change this PR, gonna
What is the purpose of this Pull Request?
This initial commit adds an abstraction to build the company dataset. It doesn't include tests yet because of the exploratory way I approached the problem, but tests are listed in the TODO list below.
What was done to achieve this purpose?
How to test if it really works?
This will look for the datasets in `data/` with `cnpj_cpf` columns, take the unique values of `cnpj_cpf` in these datasets, and generate a `YYYY-MM-DD-companies.csv.xz`.
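The collection step described above can be sketched with the standard library. The file layout and column handling here are assumptions for illustration, not the PR's actual implementation:

```python
import csv
from pathlib import Path


def unique_cnpj_cpf(data_dir="data"):
    """Collect unique cnpj_cpf values from every CSV in data_dir
    that has a cnpj_cpf column; skip datasets without it."""
    values = set()
    for path in Path(data_dir).glob("*.csv"):
        with open(path, newline="") as fobj:
            reader = csv.DictReader(fobj)
            if "cnpj_cpf" not in (reader.fieldnames or []):
                continue
            values.update(row["cnpj_cpf"] for row in reader)
    return values
```

The resulting set would then drive the lookups that produce `YYYY-MM-DD-companies.csv.xz`.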
Who can help review it?
@sergiomario @g4brielvs @turicas
TODO
- ~~Use `asyncio` to query the SQLite (enhance performance)~~ (not a good idea at the moment)
- Fix CI broken in Support to Python 3.7 & Python 3.8 #217 ([#219] Use tox tests in TravisCI #220)
- Tests for `serenata_toolbox.companies`
- Add `companies` command to the CLI (6fe22c0)
- Add `companies` command to the `README.rst` (3e2e378)
- Generate `YYYY-MM-DD-companies.csv` (link in the next message below)
- Upload `YYYY-MM-DD-companies.csv` to the project storage at DigitalOcean Spaces
- Update `serenata_toolbox.datasets.downloader.Downloader.LATEST` with new `YYYY-MM-DD-companies.csv`
Note on performance:
My initial tests gave sub-optimal performance:
However, nowadays we have ~95k different `cnpj_cpf` values with 14 digits in Jarbas's reimbursements. At a pace of 6 companies per second, we can cover that dataset in roughly 4h.
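The back-of-the-envelope estimate above checks out:

```python
def eta_hours(companies, per_second=6):
    """Hours needed to process all companies at a fixed pace."""
    return companies / per_second / 3600


# 95,000 companies at 6/s is about 4.4 hours — "roughly 4h" as stated.
estimate = eta_hours(95_000)
```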