Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add company dataset abstraction #218

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

cuducos
Copy link
Collaborator

@cuducos cuducos commented Nov 16, 2019

What is the purpose of this Pull Request?

This initial commit adds an abstraction to build the company dataset. It doesn't include tests yet because of the exploitative way I approached the problem. But tests are listed in the TODO list below.

What was done to achieve this purpose?

How to test if it really works?

$ serenata-toolbx --modules companies

This will look for the datasets in data/ with cnpj_cpf columns, take unique values of cnpj_cpf in these datasets and generate a YYYY-MM-DD-companies.csv.xz.

Who can help reviewing it?

@sergiomario @g4brielvs @turicas

TODO

Note on performance:

My initial tests gave sub-optimal performance:

2019-11-15 20:06:23,908 - root - INFO - 200 companies fetched (4.12 companies/s)
2019-11-15 20:07:12,282 - root - INFO - 300 companies fetched (6.20 companies/s)
2019-11-15 20:07:58,779 - root - INFO - 400 companies fetched (8.60 companies/s)

However, nowadays we have ~95k different cnpj_cpf values with 14 digits in Jarbas's reimbursements. In a pace of 6 companies per second, we can cover that dataset in roughly 4h.

This commit adds an abstraction to build the company dataset. It
includes:

- Interface to download data from Brasil.IO (Receita Federal CNPJ
  dataset)
- Interface to get CNAE (economic activity) description from IBGE
  website
- Interface to get geo-coordinates from Open Street Maps's Nominatim API
@cuducos
Copy link
Collaborator Author

cuducos commented Nov 20, 2019

Just generated a new companies file with 125k different companies (the one current in use has 60k). This will not work on Jarbas just yet, but I'll open a PR over there. Meanwhile, can upload this file to our DigitalOcean Spaces? I don't have access anymore ; ) cc @sergiomario

@sergiomario
Copy link
Collaborator

@cuducos I can't access the file through the link you suggested. I'm being directed to a branch comparison page on github.

@cuducos
Copy link
Collaborator Author

cuducos commented Dec 2, 2019

Sorry, my bad! Fixed the link over there, but just in case: https://www.dropbox.com/s/8kxt3szujr3tksb/2019-11-19-companies.csv.xz?dl=0

@sergiomario
Copy link
Collaborator

Uploaded to the project storage at DigitalOcean Spaces:
https://serenata-de-amor-data.nyc3.digitaloceanspaces.com/2019-11-19-companies.csv.xz

README.rst Outdated Show resolved Hide resolved
@luizfzs
Copy link

luizfzs commented Jan 14, 2020

@cuducos are there any drawbacks for 'converting' the SQLite db into the 'companies' files?
I see that the approach uses the DB as a way to query CNPJs found on the reimbursement files, right?
Maybe it could 'export' the database into a 'companies' file, so that Rosie could reuse it.

@cuducos
Copy link
Collaborator Author

cuducos commented Jan 14, 2020

I opted for setting a DB and querying it in order to have a smaller file.

Postgres is already a bottleneck for Jarbas's performance. Updating Jarbas database is another bottleneck.

Thus I don't think using a +30Gb database dump (that could be filtered to a few Mb) would be the better choice… and I'm not sure we have enough disk space for that in production.

Also Rosie already has a memory bottleneck and loading the full dump would make this memory bottleneck even tighter.

Does that make sense or am I over engineering that?

@luizfzs
Copy link

luizfzs commented Jan 14, 2020

It does make sense. I felt I was missing the bigger picture and your answer clarified that.

@cuducos
Copy link
Collaborator Author

cuducos commented Jan 14, 2020

@luizfzs do you think you can handle the PR on Jarbas? I can help with that.

Basically what is need is to adapt the Company to this new and updated CSV. I was wondering: maybe we can use JsonField for activities and partners and simplify the architecture. What do you think?

@luizfzs
Copy link

luizfzs commented Jan 15, 2020

@cuducos well, I'd like to give it a try.
Is there an issue created that I can refer to?

@cuducos
Copy link
Collaborator Author

cuducos commented Jan 15, 2020

I just suggested a road map at okfn-brasil/serenata-de-amor#509 (comment) ; )

README.rst Outdated Show resolved Hide resolved
@cuducos
Copy link
Collaborator Author

cuducos commented Jan 15, 2021

I couldn't finish this contribution, but I believe it doesn't make sense to finish it anymore — at least not in the direction I started.

Now we can use minhaeceita.org API and maybe keep only the Open Street Maps's Nominatim API drafted here.

Gonna try to change substantially this PR, gonna --force some pushes to avoid keeping code that never hit the main branch, ok?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants