Add live demos #30

Closed
simonw opened this issue Dec 4, 2021 · 32 comments
Labels
documentation (Improvements or additions to documentation), research

Comments

simonw commented Dec 4, 2021

I'm tempted to pull a bunch of different example repos on a schedule and bundle them into the same demo instance.

Could have a recipes.md documentation page that shares the same demos and shows how they were built, using cog somehow.

simonw added the documentation and research labels Dec 4, 2021
simonw commented Dec 4, 2021

Maybe cog generates the workflow YAML and the recipes markdown from the same source somehow?
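cog works by executing Python embedded in special comments and splicing the output back into the file, so one shared data structure could drive both the workflow YAML and the recipes markdown. A rough standalone sketch of the idea (the DEMOS list, helper names, and YAML fragments here are all hypothetical, not the eventual implementation):

```python
# Hypothetical shared source of truth; cog would call these helpers
# from inside [[[cog ... ]]] markers in the two target files.
DEMOS = [
    {"name": "ca-fires", "repo": "simonw/ca-fires-history", "db": "fires.db"},
    {"name": "pge-outages", "repo": "simonw/pge-outages", "db": "pge.db"},
]

def render_workflow_step(demo):
    # A fragment of GitHub Actions workflow YAML for one demo
    return (
        f"      - name: Build {demo['name']}\n"
        f"        run: git-history file {demo['db']} ...  # per-repo recipe\n"
    )

def render_recipe_md(demo):
    # The matching recipes.md section for the same demo
    return f"## {demo['name']}\n\nBuilt from https://github.com/{demo['repo']}\n"

workflow = "".join(render_workflow_step(d) for d in DEMOS)
recipes = "\n".join(render_recipe_md(d) for d in DEMOS)
```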

simonw commented Dec 5, 2021

I'm going to put the demos in the README itself, so that they end up on https://datasette.io/ and PyPI.

simonw commented Dec 5, 2021

I have 19 git scraping repos of my own: https://github.com/search?q=topic%3Agit-scraping+user%3Asimonw&type=repositories

Options for demos:

simonw commented Dec 5, 2021

And there are repos by other people:

simonw commented Dec 6, 2021

Here's the 511 demo recipe:

git-history file 511.db events.json --id id --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
' --wal
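Since the --convert body is just Python that receives the file's content, the snippet above can be exercised standalone outside git-history, which makes it easy to test (the sample payload below is made up to match the shape the code expects):

```python
import json

def convert(content):
    # Same body as the --convert argument above, wrapped as a generator
    data = json.loads(content)
    if data.get("error"):
        # {"code": 500, "error": "Error accessing remote data..."}
        return
    for event in data["Events"]:
        event["id"] = event["extension"]["event-reference"]["event-identifier"]
        # Remove noisy updated timestamp
        del event["updated"]
        # Drop extension block entirely
        del event["extension"]
        # "schedule" block is noisy but not interesting
        del event["schedule"]
        # Flatten nested subtypes
        event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
        if not isinstance(event["event_subtypes"], list):
            event["event_subtypes"] = [event["event_subtypes"]]
        yield event

# Made-up sample resembling one 511.org event
sample = json.dumps({"Events": [{
    "extension": {"event-reference": {"event-identifier": "abc"}},
    "updated": "2021-12-06T00:00:00Z",
    "schedule": {},
    "event_subtypes": {"event_subtype": "CONSTRUCTION"},
}]})
events = list(convert(sample))
```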

simonw commented Dec 6, 2021

Here's the CA fires demo:

ca-fires-history % git-history file fires.db incidents.json --id UniqueId --convert 'json.loads(content)["Incidents"]'

Can graph the acres burned for a specific fire using this:

/fires/item_version_detail?_item__exact=197&AcresBurned__notblank=1#g.mark=bar&g.x_column=_commit_at&g.x_type=temporal&g.y_column=AcresBurned&g.y_type=quantitative

simonw commented Dec 6, 2021

It's a shame both of these demos need a --convert - what's a good JSON one that doesn't?

simonw commented Dec 6, 2021

PGE is a good demo for that:

git-history file pge.db pge-outages.json --id outageNumber --branch master --wal --ignore lastUpdateTime

Shows --branch and --ignore as well.

Does take about an hour to generate though!

simonw commented Dec 6, 2021

Maybe FARA is good - I can link to the blog entry that explains it, and it's CSV, and the DB is probably quite small. Can maybe find Manafort in it as a demo?

simonw commented Dec 6, 2021

FARA is no good because the FARA_All_Registrants.csv has weird errors that need to be worked around.

simonw commented Dec 6, 2021

https://github.com/simonw/neededge-history/blob/main/v1.xml is interesting. It showed me a bug in the --import option - after applying a fix this worked:

git-history file neededge.db v1.xml --id url --convert '
tree = xml.etree.ElementTree.fromstring(content)
return [site.attrib for site in tree.iter("site")]
' --import xml.etree.ElementTree

Actually it's not interesting, because the data in it never changes - the only column you get is url and there is never a version 2 of anything.
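For reference, --import just makes the named module available to the convert code; the equivalent standalone Python, run against a made-up two-site sample, looks like this:

```python
import xml.etree.ElementTree

def convert(content):
    # Same body as the --convert argument above; --import supplies the module
    tree = xml.etree.ElementTree.fromstring(content)
    return [site.attrib for site in tree.iter("site")]

# Hypothetical sample matching the v1.xml shape
sample = '<sites><site url="https://example.com/a"/><site url="https://example.com/b"/></sites>'
rows = convert(sample)
```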

simonw commented Dec 6, 2021

I'm going to try for sf-tree-history since it's the Git scraping demo I enjoy showing people the most, and it might work for a CSV example.

It's only 247 commits, but the first one takes a LONG time because it has ~185,000 records in it.

simonw commented Dec 6, 2021

Yeah I'm going to do the trees one for the CSV example.

So the examples are:

  • PG&E because it's interesting and doesn't need convert
  • ca-fires because it's quick
  • sf-tree-history because it's CSV
  • 511 because it's data someone else collected and it's an interesting complex convert example

simonw commented Dec 6, 2021

... but the trees export breaks at simonw/sf-tree-history@3fb63a9 - which is why I added this feature:

  [#####-------------------------------]  40/247   16%  01:12:33
Error: Commit: 3fb63a99dfab8a75c83d341c67afc9abf484e0c4 - every item must have the --id keys. These items did not:
[
    {
        "1": "9",
        "DPW Maintained": "DPW Maintained",
        "Pinus radiata :: Monterey Pine": "Palm (unknown Genus) :: Palm Spp",
        "10 10th Ave": "97 12th St",
        "Median : Cutout": "Median : Cutout",
        "Tree": "Tree",
        "DPW": "DPW",
        "": "",
        "3": "20",
        "16": "3X3",
        "5992719.40718": "6007166.5822125",
        "2114869.71245": "2109668.7679164",
        "37.7866097734767": "37.7731533608049",
        "-122.468982710496": "-122.418629924651",
        "(37.7866097734767, -122.468982710496)": "(37.7731533608049, -122.418629924651)",
        "11": "19",
        "6": "2",
        "54": "28853"
    },

simonw added a commit that referenced this issue Dec 6, 2021
simonw commented Dec 6, 2021

I'm going to stash the intermediary .db files in an S3 bucket and re-fetch them when each action runs.

simonw commented Dec 7, 2021

Had to use --dialect excel with the trees example because of this strange issue: https://gist.github.com/simonw/f715433af8f8bf699d016ff19d7f542a
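Without an explicit dialect the CSV parser presumably has to guess one from the file contents, which is where that issue bit; --dialect excel forces Python's standard comma-separated dialect. A standalone sketch of what that option amounts to (the sample row here is made up in the shape of the trees CSV):

```python
import csv
import io

# Made-up sample resembling Street_Tree_List.csv
data = 'TreeID,qSpecies\n1,"Pinus radiata :: Monterey Pine"\n'

# dialect="excel" pins the standard comma-separated dialect
# instead of relying on any sniffing of the file contents
rows = list(csv.DictReader(io.StringIO(data), dialect="excel"))
```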

simonw commented Dec 7, 2021

Here's the tree recipe that finally worked:

git-history file tree-history.db Street_Tree_List.csv \
  --id TreeID \
  --csv \
  --start-after 3fb63a99dfab8a75c83d341c67afc9abf484e0c4 \
  --dialect excel

tree-history.db was 591M after running sqlite-utils vacuum tree-history.db.

simonw commented Dec 7, 2021

Most exciting demo is still likely to be this one: https://github.com/adolph/getIncidentsGit

git-history file houston.db incidents.json \
  --convert 'return json.loads(content)["ActiveIncidentDataTable"]' \
  --id CallTimeOpened --id Address --id CrossStreet \
  --ignore-duplicate-ids --wal

Could do with a couple of extra conversions - in particular the XCoord and YCoord look like this and would work better as a latitude/longitude pair:

-95373150,29752989
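Those values look like fixed-point numbers with six implied decimal places (Houston sits at roughly 29.75, -95.37), so a convert step could probably recover latitude/longitude by dividing by 1,000,000 - an assumption that would need verifying against the source data:

```python
def to_lat_lon(xcoord, ycoord):
    # Assumes six implied decimal places in the raw values -
    # an unverified guess based on Houston's actual coordinates
    return int(ycoord) / 1_000_000, int(xcoord) / 1_000_000

lat, lon = to_lat_lon("-95373150", "29752989")
```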

Also would be neat if columns like IncidentType could be extracted, see:

simonw commented Dec 7, 2021

My big concern at this point is size: three of my four preferred demos are multiple hundreds of MB. I think they're going to work on Cloud Run but it's towards the upper limit of what I'm happy to host there.

simonw added a commit that referenced this issue Dec 7, 2021
simonw commented Dec 7, 2021

Deploying the first version of this with:

datasette publish cloudrun \
  --service git-history-demos \
  /tmp/pge-outages/pge.db \
  /tmp/511-events-history/511.db \
  /tmp/ca-fires-history/fires.db \
  --install datasette-block-robots \
  --install datasette-remote-metadata \
  -m demos/metadata.yml \
  --memory 4Gi

Creating temporary tarball archive of 5 file(s) totalling 656.2 MiB before compression.

So big but maybe not too big?

simonw commented Dec 7, 2021

That deployed fine and it seems snappy enough: https://git-history-demos-j7hipcg4aq-uc.a.run.app/

simonw commented Dec 7, 2021

That change to the metadata.yml was picked up on https://git-history-demos-j7hipcg4aq-uc.a.run.app/pge without needing to re-deploy thanks to datasette-remote-metadata!

simonw commented Dec 7, 2021

I'll configure git-history-demos.datasette.io for this.

simonw commented Dec 7, 2021

https://git-history-demos.datasette.io/ is working now.

simonw commented Dec 7, 2021

I added --public to s3-credentials in https://github.com/simonw/s3-credentials/releases/tag/0.8 so now I can create an S3 bucket to store the built SQLite database files in between GitHub Actions runs.

simonw commented Dec 7, 2021

~ % s3-credentials create git-history-demos --public --create-bucket
Created bucket: git-history-demos
Attached bucket policy allowing public access
Created  user: 's3.read-write.git-history-demos' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.git-history-demos to user s3.read-write.git-history-demos
Created access key for user: s3.read-write.git-history-demos
{
    "UserName": "s3.read-write.git-history-demos",
    "AccessKeyId": "AKIAWXFXAIOZOLWKY4FP",
    "Status": "Active",
    "SecretAccessKey": "…",
    "CreateDate": "2021-12-07 07:11:48+00:00"
}

Got the bucket and the credentials now. Adding those credentials to this GitHub repository as secrets AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

simonw added a commit that referenced this issue Dec 7, 2021
simonw added a commit that referenced this issue Dec 7, 2021
simonw commented Dec 7, 2021

https://git-history-demos.datasette.io/ was just successfully deployed using a manual trigger of the new GitHub Actions workflow! https://github.com/simonw/git-history/runs/4441234271?check_suite_focus=true

simonw added a commit that referenced this issue Dec 7, 2021
simonw commented Dec 7, 2021

I'm not convinced datasette-remote-metadata is working correctly: https://git-history-demos.datasette.io/-/metadata

UPDATE: no it was working fine, see notes on simonw/datasette-remote-metadata#3

simonw commented Dec 7, 2021

Add these to the README and I can close the issue.

simonw commented Dec 7, 2021

git-history-demos.datasette.io hosts three example databases created using this tool:

simonw closed this as completed in 32b8b81 Dec 7, 2021
simonw added a commit that referenced this issue Dec 7, 2021
simonw added a commit that referenced this issue Dec 7, 2021
Refs #30. I want to use the fix for this:

simonw/datasette#1544
simonw commented Dec 7, 2021

Wrote up part of this as a TIL: https://til.simonwillison.net/github-actions/s3-bucket-github-actions

simonw commented Dec 8, 2021

Also this blog entry: https://simonwillison.net/2021/Dec/7/git-history/

simonw added a commit that referenced this issue Dec 8, 2021