MNT: reduce repo size #727

Open
Gui-FernandesBR opened this issue Nov 9, 2024 · 7 comments
Assignees
Labels
Git housekeeping Clean and organize our github

Comments

@Gui-FernandesBR
Member

Is your feature request related to a problem? Please describe.

As discussed here by @aureliobarbosa, cloning the RocketPy repo currently consumes more than 1 GB.
This is probably due to large files being stored in the repository.

Describe the solution you'd like

There are a few options that we would like to explore in order to tackle this issue. For instance:

  1. Delete old, unused files from the git history. This could include .nc and other binary files that were initially committed to this repo but were deleted at some point.
  2. Use Git Large File Storage (Git LFS) to store files that are too heavy (>10 MB), especially those in the data folder.
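
To see which files would cross the 10 MB threshold mentioned in option 2, a minimal Python sketch could walk a folder and report the files above a size limit. The function name `find_large_files` and its parameters are illustrative assumptions, not part of RocketPy or Git LFS.

```python
from pathlib import Path


def find_large_files(root, threshold_mb=10.0):
    """Return (path, size_in_mb) pairs for files under root larger than threshold_mb."""
    limit = threshold_mb * 1024 * 1024
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                hits.append((path, size / (1024 * 1024)))
    # Largest files first, since they are the most likely Git LFS candidates.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "data" is the folder named in the issue; adjust the path as needed.
    for path, size_mb in find_large_files("data"):
        print(f"{size_mb:8.1f} MB  {path}")
```

This only inspects the current working tree; files that were deleted but still live in the git history need a history-aware tool.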

Additional context

I don't have much experience with this, but I will try to list a few links that may help us.

@aureliobarbosa

Thanks for keeping the issue alive. I am still interested in applying Git Large File Storage to this repo, but before diving in I decided to do my homework: I spent the day reading your documentation and trying different install procedures. You can assign me the task if you wish!

Regards

@Gui-FernandesBR
Member Author

Nice job!

I'm also trying to read more about git filter-branch and git bfg.
I don't know if I will find the time to actually use them on this repo; for now I'm studying the tools more than using them.
But if I make any progress I will let you know.

I have used git LFS at this repo, maybe there's something in that repo that could help us.

@aureliobarbosa

aureliobarbosa commented Nov 9, 2024

EDIT: migrating to git-lfs on GitHub is better described by GitHub here.

Currently the 'data' folder has the following distribution of files:

# All Files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ du -h --max-depth=0 data
162M	data

# .csv files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ find -type f -name "*.csv" -exec du -ch {} + | grep 'total' | awk '{print $1}'
20M

# .nc files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ find -type f -name "*.nc" -exec du -ch {} + | grep 'total' | awk '{print $1}'
142M

There is also about 1 MB of .csv files and some tiny .nc files in both the tests (fixtures) and docs folders. These look like good candidates for tracking with git-lfs.
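
The same per-extension breakdown can be reproduced in a platform-independent way. This is a minimal Python sketch; the function name `size_by_extension` is an illustrative assumption. Note that it sums apparent file sizes (`st_size`), so the totals can differ slightly from `du`, which reports disk usage.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(root):
    """Return a dict mapping lowercase file extension to total size in bytes."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            totals[path.suffix.lower()] += path.stat().st_size
    return dict(totals)


if __name__ == "__main__":
    # "data" is the folder measured in the comment above; adjust as needed.
    totals = size_by_extension("data")
    for suffix, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e6:10.1f} MB  {suffix or '(no extension)'}")
```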

Git LFS official repo has a tutorial on how to migrate a repository.

Problems I foresee in implementing this:

  • The migration must be done by someone with administrative privileges, because it requires rewriting the repository history.
  • There may be problems for people installing previous versions, particularly regarding tests and documentation.
  • This is effectively a reset of the repository. Anyone working on a feature must finish their PR first.
  • Everybody working on the repository code must install git-lfs (and maybe this is a blocker!). If we decide to proceed in this direction, it will need to be properly documented in the development guide.

Alternative:

  • Just erase all .csv and .nc files from the git history and upload again only the files being used right now. This could serve as an intermediate step before implementing git-lfs. The disadvantage of this alternative is that it is not a permanent solution: over time, it will have to be done all over again.

Those were the investigations for today. As soon as I implement git-lfs on my repo I will bring you the numbers and more discussion, if needed.

@aureliobarbosa

I only just read your comment. I will look into the tools you mentioned. The paper repo seems to be a different case (did you migrate?). I think the main problem is doing the migration and coordinating with everyone else to use git-lfs.

@Gui-FernandesBR
Member Author

@aureliobarbosa I have to say I like the idea of trying the alternative solution first and then moving on to git LFS.
Everything in the data folder is now "definitive", and we will probably never need to restore the files we deleted.

@aureliobarbosa

Hey @Gui-FernandesBR,

I agree with you about trying to clean the git history. With that in mind, I evaluated tools for cleaning the git history and found that git-filter-repo seems to be the better solution. It has an option to analyze the sizes of previously deleted files and folders, and of files by extension (inside the git history). Since I am assuming that you are going to keep versions of data files inside the repository, I opted to investigate the sizes by file, sorted from largest to smallest. Below is a snapshot of the CSV file I generated (I will send it to the team via Discord).

Contrary to my initial expectation, only a few big CSV files are stored in the git history. The villains include .nc files, as expected, and notebooks (which store data inside them, of course...). The first big .py file is number 22 on the list, at about 9 MB (the first version of RocketPy?). The second .py file on the list is number 51. By excluding 49 files on this list you would reduce the size of your git repo by 660 MB. It is important to remember that these files have already been deleted from the tip of the main branch.

Considering this, my recommendation would be to delete only those 49 files, since this seems to be the simplest action. After installing git-filter-repo, this operation can easily be done by listing the undesired files in a single file and running:

git filter-repo --invert-paths --paths-from-file files-i-dont-want-anymore.txt

Note that this DOES rewrite the git history, and all developers would need to clone the repository again. I recommend doing this 'surgery' once you have finished all the PRs you expect to merge before the next minor version.


mycode/projects/rocketpy-dev
❯ more path-deleted-sizes.csv 
1, 46978562, 16949319, 2019-02-07, 'docs/sampleDispersionDataReader.ipynb'
2, 46978562, 16949319, 2019-02-07, 'disp/sampleDispersionDataReader.ipynb'
3, 46902996, 40188650, 2023-01-01, 'data/weather/Alcantara_2016_ERA-5.nc'
4, 46774852, 40263054, 2023-01-01, 'data/weather/Alcantara_2017_ERA-5.nc'
5, 46774852, 40126497, 2023-01-01, 'data/weather/Alcantara_2015_ERA-5.nc'
6, 46774852, 40116024, 2023-01-01, 'data/weather/Alcantara_2018_ERA-5.nc'
7, 43664596, 36926254, 2022-09-24, 'data/weather/EuroC_single_level_reanalysis_2000_2021.nc'
8, 21323940, 18974499, 2023-01-01, 'data/weather/CLBI_2016_ERA-5.nc'
9, 21323936, 19420900, 2023-01-01, 'data/weather/SpaceportAmerica_2016_ERA-5.nc'
10, 21265684, 18923920, 2023-01-01, 'data/weather/CLBI_2018_ERA-5.nc'
11, 21265684, 18904131, 2023-01-01, 'data/weather/CLBI_2017_ERA-5.nc'
12, 21265680, 19476628, 2023-01-01, 'data/weather/SpaceportAmerica_2017_ERA-5.nc'
13, 21265680, 19373762, 2023-01-01, 'data/weather/SpaceportAmerica_2015_ERA-5.nc'
14, 21265680, 18918037, 2023-01-01, 'data/weather/CLBI_2015_ERA-5.nc'
15, 13753819, 4050298, 2024-09-21, 'docs/notebooks/fins_roll.csv'
16, 12908689, 2940044, 2024-09-21, 'docs/notebooks/coeff_testing.ipynb'
17, 12894004, 3692363, 2020-03-22, 'nbks/Dispersion Sample.disp_input'
18, 12355867, 3862899, 2021-04-07, 'docs/notebooks/valetudo_dispersion/valetudo_dispersion.ipynb'
19, 11866021, 4080483, 2024-08-04, 'docs/notebooks/airbrakes_example.ipynb'
20, 10830351, 3809773, 2020-03-22, 'nbks/Getting Started - Examples.ipynb'
21, 9860124, 3218580, 2021-04-07, 'docs/notebooks/dispersion_analysis.ipynb'
22, 9802609, 123010, 2020-03-22, 'nbks/rocketpyAlpha.py'
23, 9005216, 8966518, 2022-09-24, 'data/weather/EuroC_pressure_levels_reanalysis_2002-2021.nc'
24, 8589588, 4696361, 2024-08-04, 'docs/notebooks/air_brakes_example.ipynb'
25, 8232142, 2572448, 2023-08-10, 'docs/notebooks/example_hybrid.ipynb'
26, 7849765, 2155194, 2021-04-07, 'docs/notebooks/valetudo_dispersion/Monte_carlo_valetudo.valetudo_disp_out.txt'
27, 6574758, 4701393, 2024-08-03, 'docs/notebooks/environment/environment_class_usage.ipynb'
28, 6313322, 3514941, 2023-06-28, 'docs/notebooks/example_solid.ipynb'
29, 6054970, 1531514, 2020-03-22, 'nbks/Dispersion Sample.disp_output'
30, 5635005, 3817378, 2020-03-22, 'nbks/Environment - Examples.ipynb'
31, 5080208, 4834400, 2022-04-09, 'data/weather/spaceport_america_pressure_level_reanalysis_2015_2021.nc'
32, 4976750, 2880500, 2022-06-07, 'docs/notebooks/SolidMotor_class_usage.ipynb'
33, 4929275, 3217437, 2020-03-22, 'nbks/Dispersion Analysis - Monte Carlo - Example.ipynb'
34, 4782068, 2288881, 2023-08-10, 'docs/notebooks/tank_class_usage.ipynb'
35, 4712149, 3155118, 2019-02-07, 'nbks/Environment Examples.ipynb'
36, 4299000, 215, 2024-09-21, 'docs/notebooks/tail_cL.csv'
37, 4298998, 1038596, 2024-09-21, 'docs/notebooks/tail_cQ.csv'
38, 4273149, 215, 2024-09-21, 'docs/notebooks/nose_cL.csv'
39, 4273147, 1036268, 2024-09-21, 'docs/notebooks/nose_cQ.csv'
40, 4223009, 213, 2024-09-21, 'docs/notebooks/fins_cL.csv'
41, 4223007, 1030363, 2024-09-21, 'docs/notebooks/fins_cQ.csv'
42, 4082870, 4044416, 2022-09-22, 'data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc'
43, 3418375, 1328400, 2023-08-10, 'docs/notebooks/example_liquid.ipynb'
44, 3274830, 1609382, 2020-03-22, 'nbks/Euporia.ipynb'
45, 3211198, 2261130, 2022-10-10, 'getting_started Dispersion.ipynb'
46, 2695115, 2695950, 2023-09-25, 'docs/static/trajectory-earth.png'
47, 2174231, 911552, 2022-05-19, 'getting_started.ipynb'
48, 2138702, 771662, 2023-01-01, 'data/calisto/CD Test.CSV'
49, 2109965, 836873, 2023-01-01, 'data/euporia/euporiaIDrag.csv'
50, 1933436, 965847, 2018-12-11, 'nbks/Calisto.ipynb'
51, 1525754, 51968, 2024-07-03, 'tests/test_rocket.py'
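
A listing in this shape can be totalled with a short script. The sketch below is only an illustration: it assumes each row is `rank, unpacked bytes, packed bytes, date deleted, 'path'` (the column meanings are not stated in the listing and are inferred from the layout of git-filter-repo's analysis reports), and the names `parse_deleted_sizes` and `SAMPLE` are hypothetical. Rows wrapped across two lines by the terminal need to be rejoined before parsing.

```python
def parse_deleted_sizes(text):
    """Parse rows shaped like: rank, unpacked, packed, date, 'path'."""
    rows = []
    for line in text.splitlines():
        # Split on the first four commas only, so commas inside a path survive.
        parts = [part.strip() for part in line.split(",", 4)]
        if len(parts) == 5 and parts[0].isdigit():
            rows.append({
                "rank": int(parts[0]),
                "unpacked": int(parts[1]),  # assumed meaning of column 2
                "packed": int(parts[2]),    # assumed meaning of column 3
                "deleted": parts[3],
                "path": parts[4].strip("'"),
            })
    return rows


# Three rows copied from the listing above, used here as sample input.
SAMPLE = """\
1, 46978562, 16949319, 2019-02-07, 'docs/sampleDispersionDataReader.ipynb'
3, 46902996, 40188650, 2023-01-01, 'data/weather/Alcantara_2016_ERA-5.nc'
22, 9802609, 123010, 2020-03-22, 'nbks/rocketpyAlpha.py'
"""

if __name__ == "__main__":
    rows = parse_deleted_sizes(SAMPLE)
    for column in ("unpacked", "packed"):
        total = sum(row[column] for row in rows)
        print(f"{column}: {total / 1e6:.1f} MB over {len(rows)} files")
```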

@Gui-FernandesBR
Member Author

Amazing work, @aureliobarbosa !
I believe we can move forward with the git-filter-repo in order to significantly reduce the repo size (probably by half!).

I did not expect .ipynb files to be part of the "villains list", but it makes total sense! When we save notebooks with images, each image is embedded in the .ipynb file (which is just a fancy .json) as a base64-encoded string, and this can consume a lot of disk space. Found another reason to migrate .ipynb to .rst files @MateusStano @phmbressan @Lucas-Prates !
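
To get a feel for how much of a notebook's size comes from stored outputs, the notebook JSON can be compared with and without them. This is a minimal sketch using only the standard library and assuming the nbformat 4 layout (a top-level `cells` list where code cells carry `outputs` and `execution_count`); the function names are illustrative.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict (nbformat 4) with code-cell outputs removed."""
    stripped = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def output_share(notebook):
    """Fraction of the serialized notebook size that is taken up by outputs."""
    full = len(json.dumps(notebook))
    bare = len(json.dumps(strip_outputs(notebook)))
    return 1 - bare / full


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some_notebook.ipynb
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8") as handle:
            print(f"{output_share(json.load(handle)):6.1%}  {name}")
```

Committing notebooks with outputs stripped keeps new history small, but it does not shrink what is already stored in past commits.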

@RocketPy-Team/code-owners can you read this thread and let us know whether you agree with this operation?

The only concern is that a few files are still being used, therefore cannot be deleted:

  • "data/weather/EuroC_single_level_reanalysis_2000_2021.nc"
  • "data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc"
  • "docs/static/trajectory-earth.png" (honestly this is here just for README purposes)
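
One way to honor this exclusion is to filter the candidate list before writing the paths file that is passed to `git filter-repo --invert-paths --paths-from-file`. The sketch below is an assumption-laden illustration: `write_paths_file` is a hypothetical helper, and the candidate list shown is just a few entries from the listing in this thread. It relies on git-filter-repo reading one literal path per line from that file.

```python
from pathlib import Path

# Files named in this comment as still being used; they must stay in the history.
STILL_IN_USE = {
    "data/weather/EuroC_single_level_reanalysis_2000_2021.nc",
    "data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc",
    "docs/static/trajectory-earth.png",
}


def write_paths_file(candidates, keep, destination):
    """Write one path per line, skipping anything listed in keep."""
    selected = [path for path in candidates if path not in keep]
    Path(destination).write_text(
        "".join(f"{path}\n" for path in selected), encoding="utf-8"
    )
    return selected


if __name__ == "__main__":
    # A few candidates taken from the listing earlier in the thread.
    candidates = [
        "docs/sampleDispersionDataReader.ipynb",
        "data/weather/EuroC_single_level_reanalysis_2000_2021.nc",
        "data/weather/Alcantara_2016_ERA-5.nc",
    ]
    kept_out = write_paths_file(candidates, STILL_IN_USE, "files-i-dont-want-anymore.txt")
    print(f"{len(kept_out)} paths written")
```

Reviewing the generated file by hand before running the rewrite is still advisable, since the operation cannot be undone once the rewritten history is pushed.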

Something we should definitely try is compressing the .nc files! In my experience, there are free tools that compress these files, usually reducing the file size by about 30%.


With all that being said, I guess a good summary of next steps would be:

  1. Finish all PRs that are currently open. I think we can "pause" new developments for a few weeks and aim to finish what we started.
  2. [optional] -> I think we should check for stashes and local branches that each developer may currently have.
  3. Run the git-filter-repo command to recreate the git history.
  4. Start using git LFS to store .nc and other large files.

As of now, I think your contribution is already quite beneficial for us, @aureliobarbosa !
I will discuss this with the other code owners during our next weekly meetings to coordinate the best time to make step 3 happen. Feel free to work on another issue in the meantime (let me know if you need any suggestions).

Projects
Status: Backlog
Development

No branches or pull requests

2 participants