🚀 [Feature] Implement github workflow to publish data daily #16

Open
4 tasks
ulfgebhardt opened this issue Mar 27, 2021 · 13 comments

Comments

@ulfgebhardt
Member

🚀 Feature

Implement github workflow to publish data daily

  • GitHub workflow to push the generated data to https://github.com/bundestag/gesetze automatically
  • run it daily/weekly/regularly
  • tag GitHub releases (?)
  • optional Python build/syntax check on the code in this repo (this is sorta another issue)
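The tasks above could be sketched as a scheduled workflow. This is only a sketch under assumptions: the schedule, the secret name `GESETZE_TOKEN`, the output directory `out/`, and the script invocations are all hypothetical, not the actual setup.

```yaml
name: publish-gesetze
on:
  schedule:
    - cron: "0 3 * * *"     # daily at 03:00 UTC (assumed cadence)
  workflow_dispatch: {}     # allow manual runs

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: pip install -r requirements.txt
      # Hypothetical invocation; see the gesetze-tools readme for actual arguments
      - run: python lawde.py && python lawdown.py
      # Push generated Markdown to bundestag/gesetze (token name is an assumption)
      - run: |
          git clone "https://x-access-token:${{ secrets.GESETZE_TOKEN }}@github.com/bundestag/gesetze" target
          cp -r out/* target/
          cd target
          git add -A
          git commit -m "Automated update $(date -u +%F)" || echo "no changes"
          git push
```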

Please help implement it - if you have the free time to do it, we would solve a three-year-old problem which pops up every election year. Pinging capable and potentially interested people out of the blue: @Muehe @jbbgameich <3

User Problem

We would have plain-text data here on GitHub

bundestag/gesetze#55

Implementation

Use github workflows. See examples:

https://github.com/Ocelot-Social-Community/Ocelot-Social/blob/master/.github/workflows/publish.yml
https://github.com/gradido/gradido/blob/master/.github/workflows/publish.yml
https://github.com/mattia-lerario/Mentor-Application-Bachelor-Project/blob/master/.github/workflows/test.yml#L23

Additional context

Also, it would be ideal if we could somehow create a "binary", or run the interpreter over it. Unfortunately I have never done anything serious with Python; I believe that means creating packages or something like that.

I have no objection at all to merging this. The repo is currently pretty dead.

However, with the upcoming Bundestagswahl (federal election) I am getting more requests about https://github.com/bundestag/gesetze, and as far as I understand, this repo is responsible for generating its content.

See: bundestag/gesetze#59 and bundestag/gesetze#55
Today I received a request from @Muehe regarding the repo.

I would find it great if we could collaboratively manage to write a script for the GitHub workflows, so that we get regular updates similar to the repos we crawl for the democracy project:

https://github.com/bundestag/NamedPolls
https://github.com/bundestag/NamedPollDeputies
https://github.com/bundestag/ConferenceWeekDetails
https://github.com/bundestag/dip21-daten
Unfortunately, the people from https://github.com/bundestag/offenegesetze.de have not yet stepped in to take over this task.

So @JBBgameich: are you up for doing something like that? Should I merge this PR now?


@darkdragon-001
Collaborator

Who maintains bundestag/gesetze? Who has pull/merge rights?

There are lots of open pull requests which haven't been merged yet. One should first get the manual workflow running before trying to automate things.

@jbruechert
Contributor

jbruechert commented Mar 27, 2021

Most of the pull requests are either jokes, drafts or too large to review. Generating an up to date version from source is probably a better course of action.

@ulfgebhardt
Member Author

I've sorta started taking responsibility, since people come to me and ask about the repo, though I have nothing to do with it. My course of action is finding people who want to do it. I have all the rights needed and can also propagate those rights.
I invite people to the org if they have a commit on a repo in the org or a featured fork. This should give you more rights - not sure the merge right is included, though.

So if you wanna do the automatic push thingy, we can certainly make that happen rights-wise.

@darkdragon-001
Collaborator

Does anyone have an idea how to efficiently determine the changed laws since the last run?

While this is easy for the scrapers (BGBl, BAnz, ...), since their sources are ordered by date, it is not so easy for the laws.
There is the Aktualitätendienst, whose entries can be mapped to the corresponding entries in the scraped data based on page number, but I don't see how this can determine which laws (name or slug) actually changed. Any ideas?
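The page-number mapping mentioned above could look roughly like this. All field names here are hypothetical, assuming each scraped entry and each Aktualitätendienst item carries a (year, page) key; note that this only finds which *publications* are referenced, not which law names or slugs changed, which is exactly the open question.

```python
# Sketch: map Aktualitätendienst items to scraped BGBl entries by page number.
# Field names ("year", "page", "law") are assumptions, not the real schema.

def index_by_page(scraped_entries):
    """Index scraped entries by (year, page) for O(1) lookup."""
    return {(e["year"], e["page"]): e for e in scraped_entries}

def changed_entries(aktualitaeten, scraped_entries):
    """Return the scraped entries referenced by an Aktualitätendienst item."""
    index = index_by_page(scraped_entries)
    hits = []
    for item in aktualitaeten:
        entry = index.get((item["year"], item["page"]))
        if entry is not None:
            hits.append(entry)
    return hits

scraped = [
    {"year": 2021, "page": 123, "law": "some-law"},
    {"year": 2021, "page": 456, "law": "other-law"},
]
updates = [{"year": 2021, "page": 123}]
print([e["law"] for e in changed_entries(updates, scraped)])  # prints ['some-law']
```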

@darkdragon-001
Collaborator

I am wondering if it makes sense to use https://github.com/actions/cache for storing the json data instead of committing it to some repo as it is fully generated. @ulfgebhardt do you have an opinion here?

@ulfgebhardt
Member Author

I believe that it is worthwhile to store all data in a repo - that way we would make the changes of laws transparent and searchable.

Why would we hide the actual content in some volatile cache? I do not really understand the benefits.
Furthermore the actual content we provide is the scraped data - we should ensure maximum visibility and transparency.

But that's all just an opinion ;)

@darkdragon-001
Collaborator

I don't like the fact that tooling and data is mixed in this repository. Also using and updating the cache seems just easier. I also don't see any added benefit by storing this data as it is fully reproducible and verifiable by anyone. No strong objection, just my personal opinion.

@ulfgebhardt
Member Author

Tooling happens here: https://github.com/bundestag/gesetze-tools
Data happens here: https://github.com/bundestag/gesetze

The data is not reproducible, since the official websites do not provide a history, do they?

@darkdragon-001
Collaborator

darkdragon-001 commented Nov 14, 2023

I am talking about the intermediate JSON files stored in https://github.com/bundestag/gesetze-tools/tree/master/data. I agree that the final Markdown files should be committed via Git to the other repository.

@ulfgebhardt
Member Author

Ok, then I misunderstood.

@mk-pmb
Contributor

mk-pmb commented Nov 14, 2023

Hi! Sorry for being late to the party.

I am wondering if it makes sense to use https://github.com/actions/cache for storing the json data

Don't cache, always publish.
If the data helps our next automated run, it will usually also help humans with their next manually-invoked run.
For data where git can make meaningful, useful diffs, pushing it to a repo is a good idea. For all other stuff, let's instead make it part of a "release", i.e. a GitHub-hosted blob download.
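The release-blob option could be a workflow step like the following. Only a sketch: the tag scheme, the `data/*.json` path, and the step placement are assumptions (`gh release create` and `gh release upload` are real GitHub CLI commands).

```yaml
# Sketch: publish non-diffable artifacts as release assets instead of caching.
- name: Publish data as release assets
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    TAG="data-$(date -u +%F)"
    gh release create "$TAG" --title "Data $TAG" --notes "Automated data export"
    gh release upload "$TAG" data/*.json
```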

I don't like the fact that tooling and data is mixed in this repository.

Yes, we should strictly separate both.

I had a quick look at gesetze-tools and see several python scripts.
I assume they need to run in a temporary clone of the gesetze repo, right?
From the readme I see lawde.py and lawdown.py have to run chained. Can the others run in parallel, each in their own gesetze clone (probably with working directory set to repo root?), or do some of them depend on another's results?
Will some of them conflict when run in parallel but using the same (shared) gesetze clone?
What files do I need to collect and publish from which of the tools?

@mk-pmb
Contributor

mk-pmb commented Nov 14, 2023

Also it would be nice to have a small dummy version of the data repo, with all important structures at the latest version but much faster to clone. Or can I just pick an ancient commit? My hope is to make quick test runs for debugging that will probably produce wrong results but can give a preview of whether it would have worked when using the real data repo.
