Census 2020 example #430

Draft
wants to merge 31 commits into
base: main


Azaya89
Collaborator

@Azaya89 Azaya89 commented Oct 17, 2024

Created a new example using the 2020 US census dataset. The file exists locally as a large .parq file that will be uploaded to S3 at a later time.

NOTES:

  1. The URL added in the downloads section of the anaconda-project.yml file is a placeholder rather than a real link, which is what is causing the CI build failure.

@Azaya89 Azaya89 requested a review from hoxbro October 17, 2024 00:34
@maximlt
Contributor

maximlt commented Oct 17, 2024

> I suspect it is due to #429 but I'm not sure how to resolve it.

You need to re-create the conda environment locally following the contributing guide.

> The test file added is a 0.1% sample of the full dataset but it is still about 8MB in size. I don't know if that is too large and should be reduced further.

It's still way too large. You should aim for the minimum dataset size possible; it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then that category should be present in the sample dataset so that the notebook can run in its entirety.
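
As a rough illustration of that advice, the sketch below keeps a handful of rows from every category so that a very small sample still exercises each code path. It assumes the data fits in a pandas DataFrame; the function name `representative_sample` and the column name `race` are hypothetical and not taken from this PR.

```python
import pandas as pd


def representative_sample(df, category_col, rows_per_category=5, seed=0):
    """Return a small sample with up to `rows_per_category` rows per category."""
    # Shuffle first so the rows kept for each category are a random pick.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) on a groupby keeps the first n rows of every group,
    # so rare categories survive even in a tiny sample.
    sample = shuffled.groupby(category_col, sort=False).head(rows_per_category)
    return sample.reset_index(drop=True)


# Hypothetical toy data: one dominant category and two rare ones.
toy = pd.DataFrame(
    {
        "race": ["w"] * 100 + ["b"] * 3 + ["a"],
        "x": range(104),
    }
)
small = representative_sample(toy, "race", rows_per_category=5)
print(small["race"].value_counts())
```

A plain random 0.1% sample could easily drop the rare categories altogether, which is why the sketch samples per category instead.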

Contributor

@maximlt maximlt left a comment

Is there an absolute need to rename the original census project to census_one? Without doing anything else, this is going to break all the links to its web page and deployment.

I would also not call the new one census_two but census2020.

@Azaya89
Collaborator Author

Azaya89 commented Oct 17, 2024

> Is there an absolute need to rename the original census project to census_one? Without doing anything else, this is going to break all the links to its web page and deployment.
>
> I would also not call the new one census_two but census2020.

I imagine renaming the original from census to something else makes sense, seeing as there is now more than one census notebook in the examples gallery (and possibly more in the future). However, I tried renaming both to census2010 and census2020, but the doit validate step emits a warning that only lowercase characters and underscores are allowed in the naming. I wasn't sure ignoring that warning was ideal, which is why I renamed both to the current names.

@maximlt
Contributor

maximlt commented Oct 17, 2024

> However, I tried renaming both to census2010 and census2020, but the doit validate step emits a warning that only lowercase characters and underscores are allowed in the naming

Sounds like a bug in the validation code, something like census2020 should be allowed.
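
For illustration only, a name check that also accepts digits could look like the sketch below. The actual validation code is not shown in this thread, so the pattern, the constant `NAME_PATTERN`, and the function `is_valid_project_name` are all assumptions rather than the project's real implementation.

```python
import re

# Hypothetical rule: start with a lowercase letter, then allow
# lowercase letters, digits, and underscores.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_project_name(name: str) -> bool:
    """Return True if `name` matches the assumed naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["census", "census2020", "census_one", "Census2020", "census-2020"]:
    print(candidate, is_valid_project_name(candidate))
```

Under this assumed rule, census2020 passes while names with uppercase letters or hyphens are still rejected.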

@Azaya89
Collaborator Author

Azaya89 commented Oct 17, 2024

> You need to re-create the conda environment locally following the contributing guide.

Done. Thanks

> It's still way too large. You should aim for the minimum dataset size possible; it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely.

Reduced it to <1MB now.

@maximlt
Contributor

maximlt commented Oct 21, 2024

Replying to your comment elsewhere:

> Thank you. I'm still in favor of renaming the first one to census2010 though.

If you intend to rename it, then redirect links have to be set up:

Alternatively, we could just:

  • Change the title property in the project YAML to Census 2010
  • Change the notebook top-level heading to Census 2010

@Azaya89
Collaborator Author

Azaya89 commented Oct 21, 2024

> Alternatively, we could just:
>
>   • Change the title property in the project YAML to Census 2010
>   • Change the notebook top-level heading to Census 2010

I already did these in this PR. Would that be enough to differentiate both examples eventually?

@maximlt
Contributor

maximlt commented Oct 21, 2024

> Would that be enough to differentiate both examples eventually?

I think so?

@Azaya89
Collaborator Author

Azaya89 commented Oct 21, 2024

> I think so?

OK. I will revert the other renaming then

@hoxbro
Contributor

hoxbro commented Nov 7, 2024

My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.

@Azaya89
Collaborator Author

Azaya89 commented Nov 7, 2024

> My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.

Oh? Alright then. Will do...

@maximlt
Contributor

maximlt commented Nov 15, 2024

@Azaya89 you will need to re-lock the project as the solve is failing:

Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are not available from current channels:

  - libcurl==8.11.0=hbbe4b11_0

Not your fault: conda-forge sometimes marks packages as broken (by adding the broken label), which means these packages are no longer available on the conda-forge channel but only on conda-forge/label/broken.

conda-forge/admin-requests#1147

@jbednar
Contributor

jbednar commented Dec 2, 2024

It was hard to follow the discussion above, but it looks like the original one is still called census rather than census2010, and if so, I agree -- let's preserve those links. We'll put a link to census2020 within census so that wherever someone lands they will find both.

Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to write_parquet to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. opensky uses a single parquet file.
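
A minimal sketch of the single-file idea, assuming the test sample is small enough to hold in a pandas DataFrame (a Dask DataFrame would need `.compute()` first). The helper name `write_single_parquet` and the output path are hypothetical, and this is not necessarily the option meant above.

```python
from pathlib import Path

import pandas as pd


def write_single_parquet(df: pd.DataFrame, path: str) -> Path:
    """Write a small in-memory sample to one flat Parquet file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Without partition_cols, pandas writes a single file rather than a
    # directory of part files. A Parquet engine (pyarrow or fastparquet)
    # must be installed for this call to work.
    df.to_parquet(target, index=False)
    return target


# Usage (hypothetical path):
# write_single_parquet(sample_df, "test_data/census2020/sample.parq")
```

Keeping the sample in one file also makes its size easy to check before committing it.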

@Azaya89 Azaya89 self-assigned this Dec 2, 2024
@Azaya89
Collaborator Author

Azaya89 commented Dec 2, 2024

> It was hard to follow the discussion above, but it looks like the original one is still called census rather than census2010, and if so, I agree -- let's preserve those links.

Correct.

> We'll put a link to census2020 within census so that wherever someone lands they will find both.

OK. That will require a separate PR then.

> Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to write_parquet to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. opensky uses a single parquet file.

OK. I will do that.

@maximlt
Contributor

maximlt commented Dec 2, 2024

> OK. That will require a separate PR then.

It'd make sense to do it in this PR.

4 participants