Add company dataset abstraction #218
base: master
Conversation
This commit adds an abstraction to build the company dataset. It includes:
- Interface to download data from Brasil.IO (Receita Federal CNPJ dataset)
- Interface to get the CNAE (economic activity) description from the IBGE website
- Interface to get geo-coordinates from Open Street Maps's Nominatim API
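As a reference point, the Nominatim part of that abstraction could be sketched roughly as below. The function names and the single-result parsing are illustrative assumptions, not the PR's actual code:

```python
from urllib.parse import urlencode

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def nominatim_query(address):
    """Build a Nominatim search URL for a company's address."""
    params = {"q": address, "format": "json", "limit": 1}
    return f"{NOMINATIM_URL}?{urlencode(params)}"


def parse_coordinates(results):
    """Extract (lat, lon) floats from a decoded Nominatim JSON response."""
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Note that the public Nominatim instance rate-limits requests, which is relevant to the throughput discussion further down.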
Force-pushed from 26dacda to a13471d (compare)
Just generated a new companies file with 125k different companies (the one currently in use has 60k). This will not work on Jarbas just yet, but I'll open a PR over there. Meanwhile, can you upload this file to our DigitalOcean Spaces? I don't have access anymore ; ) cc @sergiomario
@cuducos I can't access the file through the link you suggested. I'm being directed to a branch comparison page on GitHub.
Sorry, my bad! Fixed the link over there, but just in case: https://www.dropbox.com/s/8kxt3szujr3tksb/2019-11-19-companies.csv.xz?dl=0
Uploaded to the project storage at DigitalOcean Spaces.
@cuducos are there any drawbacks to 'converting' the SQLite db into the 'companies' files?
I opted for setting up a DB and querying it in order to have a smaller file. Postgres is already a bottleneck for Jarbas's performance, and updating Jarbas's database is another bottleneck. Thus I don't think using a 30+ GB database dump (that could be filtered down to a few MB) would be the better choice… and I'm not sure we have enough disk space for that in production. Also, Rosie already has a memory bottleneck, and loading the full dump would make this memory bottleneck even tighter. Does that make sense or am I over-engineering that?
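The "query a local DB instead of loading the full dump" idea can be sketched with the standard library. The `company(cnpj, name)` schema here is hypothetical, just to show why the output stays small:

```python
import sqlite3


def fetch_companies(conn, cnpjs):
    """Select only the companies whose CNPJ appears in reimbursements,
    so the generated file stays a few MB (hypothetical schema)."""
    placeholders = ", ".join("?" for _ in cnpjs)
    sql = f"SELECT cnpj, name FROM company WHERE cnpj IN ({placeholders})"
    return conn.execute(sql, list(cnpjs)).fetchall()
```

Filtering at query time means neither the 30+ GB dump nor the full table ever has to fit in memory.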
It does make sense. I felt I was missing the bigger picture and your answer clarified that.
@luizfzs do you think you can handle the PR on Jarbas? I can help with that. Basically what is needed is to adapt the
@cuducos well, I'd like to give it a try.
I just suggested a roadmap at okfn-brasil/serenata-de-amor#509 (comment) ; )
I couldn't finish this contribution, but I believe it doesn't make sense to finish it anymore — at least not in the direction I started. Now we can use the minhareceita.org API and maybe keep only the Open Street Maps's Nominatim API drafted here. Gonna try to substantially change this PR, gonna
What is the purpose of this Pull Request?
This initial commit adds an abstraction to build the company dataset. It doesn't include tests yet because of the exploratory way I approached the problem, but tests are listed in the TODO list below.
What was done to achieve this purpose?
How to test if it really works?
This will look for the datasets in `data/` with `cnpj_cpf` columns, take the unique values of `cnpj_cpf` in these datasets, and generate a `YYYY-MM-DD-companies.csv.xz`.
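The collection step described above can be sketched with the standard library. The file layout and column handling here are assumptions for illustration, not the PR's actual implementation:

```python
import csv
from pathlib import Path


def unique_cnpj_cpf(data_dir="data"):
    """Collect unique cnpj_cpf values from every CSV in data_dir
    that has a cnpj_cpf column; skip datasets without it."""
    values = set()
    for path in Path(data_dir).glob("*.csv"):
        with open(path, newline="") as fobj:
            reader = csv.DictReader(fobj)
            if "cnpj_cpf" not in (reader.fieldnames or []):
                continue
            values.update(row["cnpj_cpf"] for row in reader)
    return values
```

The resulting set would then drive the lookups that produce `YYYY-MM-DD-companies.csv.xz`.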
Who can help review it?
@sergiomario @g4brielvs @turicas
TODO
- ~~Use `asyncio` to query the SQLite (enhance performance)~~ (not a good idea at the moment)
- Fix CI broken in Support to Python 3.7 & Python 3.8 #217 ([#219] Use tox tests in TravisCI #220)
- Tests for `serenata_toolbox.companies`
- Add `companies` command to the CLI (6fe22c0)
- Add `companies` command to the `README.rst` (3e2e378)
- Generate `YYYY-MM-DD-companies.csv` (link in the next message below)
- Upload `YYYY-MM-DD-companies.csv` to the project storage at DigitalOcean Spaces
- Update `serenata_toolbox.datasets.downloader.Downloader.LATEST` with new `YYYY-MM-DD-companies.csv`
Note on performance:
My initial tests gave sub-optimal performance:
However, nowadays we have ~95k different `cnpj_cpf` values with 14 digits in Jarbas's reimbursements. At a pace of 6 companies per second, we can cover that dataset in roughly 4h.
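The back-of-the-envelope estimate above checks out:

```python
def eta_hours(companies, per_second=6):
    """Hours needed to process all companies at a fixed pace."""
    return companies / per_second / 3600


# 95,000 companies at 6/s is about 4.4 hours — "roughly 4h" as stated.
estimate = eta_hours(95_000)
```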