Brainstorm how example data usage repos will look like, e.g. folders and files #24

Closed
lwjohnst86 opened this issue Nov 11, 2024 · 12 comments · Fixed by seedcase-project/examples#1

@lwjohnst86
Member

No description provided.

@lwjohnst86 lwjohnst86 converted this from a draft issue Nov 11, 2024
@K-Beicher
Contributor

K-Beicher commented Nov 25, 2024

DRYAD
Take DRYAD, for instance, a tool for publishing and disseminating data. The main elements of a dataset page on their website seem to be a description of the relevant project, tags that make the data easy to discover, and a download section that gives direct access to the data and possibly a readme file. In contrast, Seedcase isn't meant to expose the research data directly; it is supposed to make the metadata easy to access, though whether that is at the project level or the data level is unclear to me. Looking at their page, I imagine something like the top part, but with a section on metadata showing the available fields, their data types, and their descriptions. A later version could replicate some of the information on the right (keywords and funding info), but I don't think we are collecting that at present.
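
A rough sketch of what such a metadata section could be generated from, assuming the metadata lives in a `datapackage.json` that follows the Frictionless Data Package layout (`resources`, each with a `schema` holding `fields`). The function name `summarise_fields` and the `"unspecified"` fallback label are made up for illustration; this is not Seedcase code.

```python
import json
from pathlib import Path


def load_descriptor(path: Path) -> dict:
    """Read a datapackage.json file into a plain dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))


def summarise_fields(descriptor: dict) -> list[dict]:
    """Return one row per field: resource name, field name, type, description.

    Assumes a Data Package style descriptor. Missing values are filled with
    an empty string, or "unspecified" for the type (a hypothetical label).
    """
    rows = []
    for resource in descriptor.get("resources", []):
        fields = resource.get("schema", {}).get("fields", [])
        for field in fields:
            rows.append(
                {
                    "resource": resource.get("name", ""),
                    "field": field.get("name", ""),
                    "type": field.get("type", "unspecified"),
                    "description": field.get("description", ""),
                }
            )
    return rows
```

Each returned row holds exactly the three things mentioned above (field, data type, description), so a page could render the list as a table without touching the research data itself.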

ZENODO
Like Dryad, it is more project based, so there is a minimum set of information about the project and who the authors/project members are. They do have a nice feature in that they offer a preview of the data; in Seedcase, that would probably be the metadata we display. With Dryad, it got a bit annoying that you had to download all the files, even the readme.

GitHub (Robert Koch Institut)
The German institute has made some of its COVID data available on GitHub. The main issue with reviewing it is that the text is in German, so it isn't completely clear what it is all about. What is clear is that there is an extensive readme file with information about the project and the data files. There isn't much (as far as I can see) about the metadata for the data fields, but there is some general metadata organised according to the zenodo.json format.

@K-Beicher
Contributor

The pages I've looked at above all showcase the actual data. I can see how we could substitute the metadata for the data, but I'm not entirely clear on how much project information we want to collect.

I think there are a few things I need to have clarified.

  1. Which format are we thinking of using for our sample pages? I'm thinking website pages, but maybe we also want a GitHub repo alongside them?
  2. I keep thinking about the difference between Sprout and other parts of Seedcase. Sprout will only hold the metadata for the project and the metadata for the research data; it won't have any kind of 'display it back to the user' functionality, or will it?
  3. Do we want to include information about the project itself? If so, what kind of info do we want to collect, and where?

@lwjohnst86
Member Author

Hmm, almost, but not quite. You are right that Sprout does not have a UI to display things back to the user, or at least these example repos won't.

The purpose of these examples is for users to see how they could write code that uses Seedcase software with their own data. So it is an informational/reference repo, way more detailed than a how-to guide.

What it should look like at the end is something like how Quarto has their gallery: https://quarto.org/docs/gallery/#dashboards

For instance, check out the "Gapminder Dashboard" in the Dashboard section. You'll notice that when you click the "source" section, it takes you to a GitHub repo: https://github.com/jjallaire/gapminder-dashboard/tree/main Notice who the owner of the repo is: jjallaire. He is the founder and CEO of Posit (formerly RStudio) and the co-creator of Quarto. He made this repo to show other people what Quarto can do, using a real-world example built on the Gapminder dataset. So you as a user can look through this repo and see exactly which code was used to create the dashboard, which gives you inspiration for how to apply Quarto to your own use case. That is what we want for Seedcase: a set of repos that take real data and convert it into tidier formats using Seedcase.

So everything in the example repos will be code to convert a messy real dataset into a tidier Seedcase organized dataset.

So, likely at the root of the repo there will be a file called something like convert.py or convert.sh that contains the Python code (or shell commands) to take the real dataset, do some basic processing, and then generate a data package using Seedcase (starting with Sprout).
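
To make the three steps concrete (take the dataset, do basic processing, generate a data package), here is a minimal sketch of what a `convert.py` could look like. It deliberately uses only the Python standard library and does not call Sprout, since Sprout's functions aren't shown in this thread; a real script would swap in the Seedcase calls. The function names, the `data.csv` output file, and the choice to type every column as `string` are assumptions for illustration.

```python
import csv
import json
from pathlib import Path


def tidy_rows(raw_csv: Path) -> tuple[list[str], list[list[str]]]:
    """Read a CSV, trim whitespace, and drop rows that are entirely empty."""
    with raw_csv.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = [name.strip() for name in next(reader)]
        rows = [[cell.strip() for cell in row] for row in reader]
    return header, [row for row in rows if any(row)]


def convert(raw_csv: Path, package_dir: Path, name: str) -> Path:
    """Write a tidied CSV plus a minimal datapackage.json into package_dir."""
    header, rows = tidy_rows(raw_csv)
    package_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical output name; the thread's layout uses data.parquet instead.
    tidy_path = package_dir / "data.csv"
    with tidy_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # Minimal Data Package style descriptor; every column is typed as a
    # string here because this sketch does no type inference.
    descriptor = {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": tidy_path.name,
                "schema": {
                    "fields": [{"name": col, "type": "string"} for col in header]
                },
            }
        ],
    }
    descriptor_path = package_dir / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor_path
```

Calling `convert(Path("raw.csv"), Path("package"), "example")` (all three values are placeholders) would leave a tidied CSV and a `datapackage.json` side by side in `package/`.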

@K-Beicher
Contributor

Ah - would our repos be more informative, as in not just a single file with the uncommented code, but also something that says 'first you do this, then you do this, and finally you do this, and here is the result'? Because I admit that looking at that page of code doesn't really tell me a lot about how to go about getting started.

@lwjohnst86
Member Author

Hmm, no, the purpose would be entirely as a reference. The how-to guides are what users would go to in order to learn how to do things and get started. Plus, it would make our work just a bit easier.

@K-Beicher
Contributor

So use the data set downloaded with the how-to guide and put the result on the repo?

@lwjohnst86
Member Author

Yea, exactly! We could do it two ways:

  1. One repo per dataset, with each repo being a data package.
  2. One repo for all datasets, with each dataset in its own folder as a data package.

Maybe a folder and file structure for the repos for 1) would be:

example-male-beetles/
├── convert-with-core.py
├── convert-with-lib.py
├── convert-with-cli.sh
├── datapackage.json
├── README.md
└── resources/
    └── 1/
        ├── raw/
        │   └── <timestamp>-<uuid>.csv.gz
        └── data.parquet

And for 2):

examples/
├── scripts/
│   ├── datasetname1/
│   │   ├── convert-with-core.py
│   │   ├── convert-with-lib.py
│   │   └── convert-with-cli.sh
│   └── datasetname2/
│       ├── convert-with-core.py
│       ├── convert-with-lib.py
│       └── convert-with-cli.sh
└── packages/
    ├── 1/
    │   ├── README.md
    │   ├── datapackage.json
    │   └── resources/
    │       └── 1/
    │           ├── raw/
    │           │   └── <timestamp>-<uuid>.csv.gz
    │           └── data.parquet
    └── 2/
        ├── README.md
        ├── datapackage.json
        └── resources/
            └── 1/
                ├── raw/
                │   └── <timestamp>-<uuid>.csv.gz
                └── data.parquet

And the code to create the packages using core, lib, or cli would be in those convert- files.

We should probably do both 1) and 2) at some point, but which do we want to start with? The advantage of having all data in one repo is there is less to organize. The disadvantage is that it gets tricky to know which dataset links to which data package just by the folder structure. I personally am leaning towards 1).

@K-Beicher
Contributor

K-Beicher commented Nov 25, 2024

We could use 1, and then make a top-level folder called something like data-examples. I really like the readability of the first one; it is so simple and easy (at least for me) to understand. It made perfect sense!

I'd like to set up the repo once we get to that, then you can check and let me know what I'm missing.

@lwjohnst86
Member Author

Just to be clear, you mean we could use 1? And the top level folder will always be seedcase-project since that is the GitHub organization/account and all repos would be under that.

As for making the repo, anyone on the team has permissions to create a repo 😌 You go ahead and set it up and I can add the general infrastructure around it 🤩 😁

@lwjohnst86
Member Author

Just a small edit to the file structure of 1):

examples/
├── scripts/
│   ├── convert-with-core.py
│   ├── convert-with-lib.py
│   ├── convert-with-cli.sh
│   └── README.md
├── README.md
├── datapackage.json
└── resources/
    └── 1/
        ├── raw/
        │   └── <timestamp>-<uuid>.csv.gz
        └── data.parquet

So that the processing scripts can be in one location and so that we can include a README to describe what we are doing with them and why, without having to pollute the root README.
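
In case it helps when setting up the repo, here is a small sketch that creates the empty skeleton of the structure above. It only makes the folders and placeholder files; the raw file and `data.parquet` are left out because they would be produced by the conversion itself. The function name `scaffold_example` is made up for this example.

```python
from pathlib import Path

# File names taken from the structure above.
SCRIPT_NAMES = (
    "convert-with-core.py",
    "convert-with-lib.py",
    "convert-with-cli.sh",
    "README.md",
)


def scaffold_example(root: Path, resource_id: int = 1) -> list[Path]:
    """Create the folders and empty placeholder files for one example package."""
    raw_dir = root / "resources" / str(resource_id) / "raw"
    scripts_dir = root / "scripts"
    raw_dir.mkdir(parents=True, exist_ok=True)
    scripts_dir.mkdir(parents=True, exist_ok=True)

    files = [scripts_dir / name for name in SCRIPT_NAMES]
    files += [root / "README.md", root / "datapackage.json"]
    for path in files:
        # touch() leaves existing files untouched, so re-running is safe.
        path.touch(exist_ok=True)
    return files
```

Running `scaffold_example(Path("examples"))` twice gives the same result as running it once, which makes it safe to use on a repo that already has some of the files.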

@K-Beicher
Contributor

K-Beicher commented Nov 26, 2024

Yeah - I've edited my comment.

What I was thinking was one example repo, with a folder per dataset between the main folder and the scripts folder (see below), so that we don't end up with five repos clogging up our overview.

examples/
├── male-seed-beetle/
│   ├── scripts/
│   │   ├── convert-with-core.py
│   │   ├── convert-with-lib.py
│   │   ├── convert-with-cli.sh
│   │   └── README.md
│   ├── README.md
│   ├── datapackage.json
│   └── resources/
│       └── 1/
│           ├── raw/
│           │   └── <timestamp>-<uuid>.csv.gz
│           └── data.parquet
├── living-birds/
│   ├── scripts/
│   │   ├── convert-with-core.py
│   │   ├── convert-with-lib.py
│   │   ├── convert-with-cli.sh
│   │   └── README.md
│   ├── README.md
...
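
With several datasets in one repo, a quick check that every dataset folder has the same top-level entries might be useful. This is only a sketch: the list of required entries is read off the tree above, and the function name `missing_entries` is made up.

```python
from pathlib import Path

# Top-level entries each dataset folder has in the tree above.
REQUIRED = ("README.md", "datapackage.json", "scripts", "resources")


def missing_entries(examples_root: Path) -> dict[str, list[str]]:
    """Map each dataset folder name to the required entries it lacks."""
    report = {}
    for dataset_dir in sorted(examples_root.iterdir()):
        # Skip plain files and hidden folders such as .git.
        if not dataset_dir.is_dir() or dataset_dir.name.startswith("."):
            continue
        report[dataset_dir.name] = [
            name for name in REQUIRED if not (dataset_dir / name).exists()
        ]
    return report
```

A dataset folder that is complete maps to an empty list, so anything non-empty in the result points at a folder that still needs work.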

@lwjohnst86
Member Author

I have a feeling we will end up needing to go with the "one repo, one data package" approach in the long term, but we can split things up once we get there.

@K-Beicher K-Beicher moved this from In Progress to In Review in Team project planning Nov 26, 2024
@K-Beicher K-Beicher linked a pull request Nov 26, 2024 that will close this issue
2 tasks
@github-project-automation github-project-automation bot moved this from In Review to Done in Team project planning Nov 26, 2024

2 participants