Data Coordinating Center - Dataset Ingress

The Data Coordinating Center (DCC) dataset ingress process consists of three main stages:

  1. Dataset transfer to DCC: Transfer your experimental data files to a cloud storage bucket. Depending on dataset size, this step may take you anywhere from a few minutes up to multiple hours.

  2. Metadata upload: Upload a spreadsheet of your metadata annotations for your data files. Depending on the number and diversity of dataset files, this step could take you from 10 minutes to a couple of hours.

  3. Metadata validation and dataset submission confirmation: Verify that your metadata meets requirements. This step should take you less than 30 seconds on a typical internet connection, and completes your submission to the DCC.

A dataset is a set of experimental data files derived from a single type of experimental platform, such as single-cell RNA sequencing. The chart below provides a high-level overview of the steps an HTAN Center needs to complete in each stage. Software tools that streamline the process are linked and documented, along with contacts for DCC liaisons who can provide additional information and help facilitate data submission.

Dataset ingress flow

Dataset transfer to DCC

Selecting storage platform

The DCC provides dataset storage on the cloud, hosted by Amazon Web Services (AWS) or Google Cloud (GC). Your center may decide where to store datasets depending on existing contracts, dataset location, or other preferences.

The DCC supports dataset transfers to both clouds via the Synapse platform. If you do not already have a Synapse account, please create one.

Next, please provide your Synapse username and indicate your cloud platform preference to your DCC liaison. You may choose either cloud platform; however, the DCC recommends the following options:

  • if your center's data is already stored on premises/local machines, select AWS as your storage option

  • if your center's data is already stored on AWS, select AWS as your storage option and also provide your AWS storage region

  • if your center's data is already stored on GC, select GC as your storage option

Once you determine your dataset storage platform and provide your Synapse username(s), your DCC liaison will set up the required cloud infrastructure and authorize you to transfer data into a private storage location.

Dataset upload

Centers do not need to follow a particular folder hierarchy in the provided cloud storage location.

To upload data to your DCC-designated storage location, please use the Synapse platform tools.

Depending on dataset size and other preferences, you may use web-based or programmatic data upload interfaces. Some of the more typical options are described below, along with the typical use case for each and links to relevant documentation for more detail.

Synapse data upload via web interface
This option is typically useful for uploading files that reside on your local machine to a Synapse cloud storage location. To complete a data upload, follow the steps below.

Navigate to your project, following the Synapse link provided by your DCC liaison.

If prompted, please log in with your Synapse account (or an associated Google account).

synapseLogin

Create a folder to store your first dataset.
  • Go to the Files tab

htax FilesTab

  • Create a folder (click on Files Tools -> Add New folder)

htax CreateFolder

Go to your folder and upload the files from your dataset (click on Folder tools -> Upload or Link to a File)

htax Folder

htax FileUpload

  • Once uploaded you can preview your files:

htaxFilesUploaded

Synapse data upload via a programmatic client
This option is typically most suitable for uploading files that reside in the cloud or on your local machine, particularly when you have many files or very large files.

You can modify the Python code vignette below for your particular dataset upload. For equivalent functionality in R or the CLI, please refer to the Synapse documentation.

To get started, first install the Synapse Python client:

pip install synapseclient
  • To upload a dataset from a local folder to a Synapse storage location, you can modify the script below
# the python Synapse client module
import synapseclient

# Synapse will organize your data files in a folder within project
# these are the corresponding Synapse modules
from synapseclient import Project, Folder, File

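# Log in to Synapse
syn = synapseclient.Synapse()

# note: recent releases of the client may no longer accept a password here;
# in that case, log in with a personal access token: syn.login(authToken='my_token')
syn.login('my_username', 'my_password')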

# Name and create the folder that will store your dataset; 
# you can use a name representative for your particular dataset, e.g. hta-x-dataset
# for the parent parameter, please enter the synapse project ID provided by your DCC liaison
data_folder = Folder('hta-x-dataset', parent='syn123')

# create the folder on Synapse
data_folder = syn.store(data_folder)

# point to the files you'd like to upload in your dataset; note that the description field is optional
# the code below uploads two files to your folder; feel free to write a loop for more files
file_1 = File('/path/to/data/file1.txt', description='file 1', parent=data_folder)
file_1 = syn.store(file_1)

file_2 = File('/path/to/data/file2.txt', description='file 2', parent=data_folder)
file_2 = syn.store(file_2)
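
For datasets with more than a handful of files, the upload can be wrapped in a loop. The following is a minimal sketch, not an official DCC tool: the helper names `list_dataset_files` and `upload_dataset` are hypothetical, and the sketch assumes all files sit directly inside one local folder and that you have already logged in and created the Synapse folder as shown above.

```python
from pathlib import Path


def list_dataset_files(local_dir):
    """Return all regular files directly inside local_dir, sorted by name."""
    return sorted(p for p in Path(local_dir).iterdir() if p.is_file())


def upload_dataset(syn, local_dir, parent):
    """Upload every file in local_dir to the Synapse folder `parent`.

    `syn` is a logged-in synapseclient.Synapse object; `parent` is the
    Folder entity (or its Synapse ID) that will hold the dataset.
    """
    # imported here so the listing helper can be used without the client installed
    from synapseclient import File

    uploaded = []
    for path in list_dataset_files(local_dir):
        entity = File(str(path), parent=parent)
        uploaded.append(syn.store(entity))
    return uploaded
```

With the objects from the script above, a call would look like `upload_dataset(syn, '/path/to/data', data_folder)`. Subfolders are skipped in this sketch; since no particular folder hierarchy is required, you can extend it if your dataset is nested.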

Metadata upload

At present, the DCC supports a web-based metadata upload via the Data Curator web app in Synapse.

We are working on providing

  1. a Python package for programmatic metadata upload and management; and
  2. an API for programmatic metadata upload and management.

These will be available in the next release of the DCC data pipeline. Please check with your DCC liaison for details.

Use the Data Curator app to curate a dataset for the first time

This section assumes you have already transferred your dataset to the DCC (congratulations!). If you have not, please follow the instructions under "Dataset transfer to DCC" above.

Please provide the metadata for your dataset using the Data Curator app. Here we assume your dataset is named hta-x-dataset.

Access the Data Curator app

If you are prompted to log in to Synapse, please use your Synapse account (or associated Google account).

In the app, from the first tab, select your project. The project name corresponds to the bucket name (here `hta-x`). Then select your dataset, which corresponds to the folder name in your bucket (`hta-x-dataset`). Then select the metadata template you would like to use (e.g. scRNASeq if providing metadata for a scRNASeq dataset). If you don't see the correct template for your dataset, you can select the "Minimal Metadata" template and contact your DCC liaison.

DataCurator project selection

Once you have selected your dataset and metadata template, navigate to the second tab "Get Metadata Template" and click on "Link to Google Sheets Template". This will generate a link to a Google spreadsheet containing an empty template for you to complete with metadata, for each of the files in your dataset.

dataCurator MetadataTab

dataCurator LinktoTemp

You can fill out the sheet on the web, using dropdowns with allowed values and other standard Google Sheet features.

gtemplate Empty

gtemplate Filled

Note that you can also save the spreadsheet as a CSV file and use a method of your choice to fill it out. The metadata CSV will be validated by the Data Curator app before submission in any case.
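
As one possible "method of your choice", a downloaded template CSV can be filled in with a short script. This is only a sketch under stated assumptions: the function names are hypothetical, and the column names you pass in must match the headers of your own template (the ones used in the usage note are made up for illustration).

```python
import csv


def fill_empty_cells(rows, column, value):
    """Return copies of rows with `value` written into empty cells of `column`."""
    filled = []
    for row in rows:
        row = dict(row)
        if not (row.get(column) or "").strip():
            row[column] = value
        filled.append(row)
    return filled


def fill_csv(in_path, out_path, column, value):
    """Read a metadata CSV, fill empty cells in one column, and write the result."""
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        if column not in fieldnames:
            raise ValueError(f"column {column!r} not found in {in_path}")
        rows = fill_empty_cells(reader, column, value)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `fill_csv('template.csv', 'filled.csv', 'Assay', 'scRNA-seq')` would set a hypothetical `Assay` column wherever it is blank. The script does not check allowed values; the Data Curator app still performs the validation.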

Once filled in, you can save your spreadsheet as a CSV (File -> Download -> Comma-separated Value...)

gtemplateDLCSV

Next: navigate to the third tab "Submit & Validate Metadata"

dataCurator SubmitTab

Upload your saved CSV.

dataCurator UploadCSV

  • If upload was successful, you will see your metadata entries in the Metadata Preview

dataCurator MetadataPrev

Click "Validate Metadata"
  • If your metadata is valid, you will see a corresponding message and a "Submit" button will become available.

dataCurator ValidateSuccess

  • Clicking the "Submit" button confirms that this dataset has been curated according to the relevant DCC data model. You will receive a link to your metadata in the Synapse system.

dataCurator SubmitSuccess

If your metadata has been validated and submitted successfully, your metadata will appear in the "Files and Metadata" Table in your Synapse Project.

Fileview NewAnno

If you receive an error upon pressing the "Validate Metadata" button, the metadata template cells causing the error will be highlighted, and a corresponding list of error details will be shown.

dataCurator ValidateError

  • You can edit your file in a Google spreadsheet (click the link following the errors) and re-download it as a CSV, or edit your CSV locally, as shown here in Excel.

excel TemplateFixed

  • Upload your file and see your metadata updates reflected

dataCurator UploadFixedFile

  • Press the "Validate Metadata" button again

dataCurator ValidateFixedFile

  • If all errors have been resolved, you can submit your validated metadata

dataCurator SubmitFixedFile

  • Please contact your DCC liaison if you cannot resolve a metadata error or have questions regarding metadata submission.
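
When editing a CSV locally, a quick scan for empty cells before re-uploading can save a validation round trip. The sketch below is an illustrative helper (the name `find_empty_cells` is hypothetical) and is not a substitute for the Data Curator validation: it knows nothing about the DCC data model, and an empty cell is not necessarily an error.

```python
import csv


def find_empty_cells(csv_path):
    """Return (row_number, column_name) for every empty cell in the CSV.

    Row numbers count the header as row 1, as a spreadsheet program does,
    assuming the file has no blank lines or multi-line cells.
    """
    empty = []
    with open(csv_path, newline="") as f:
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            for column, cell in row.items():
                # a None key holds surplus cells beyond the header; skip those
                if column is not None and not (cell or "").strip():
                    empty.append((row_number, column))
    return empty
```

Calling `find_empty_cells('metadata.csv')` returns a list you can compare against the cells highlighted by the app.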

Use the Data Curator app to update existing metadata

You have already transferred your dataset to the DCC, and have provided metadata successfully - congratulations!

Now you'd like to update your metadata in order to:

  • correct mistake(s)
  • provide further or changed metadata to comply with a new iteration of the DCC data model affecting your dataset's metadata
  • provide metadata for files that have been added to your dataset

Access the Data Curator app

If you are prompted to log in to Synapse, please use your Synapse account (or associated Google account).

In the app, from the first tab, select your project (e.g. hta-x; this corresponds to your bucket name if you uploaded your dataset directly to an AWS or GC bucket). Then select your dataset (e.g. hta-x-dataset, which corresponds to a folder name in your bucket) and the metadata template you would like to use (e.g. scRNASeq if providing metadata for a scRNASeq dataset). If you don't see the correct template for your dataset, you can select the "Minimal Metadata" template and contact your DCC liaison.

DataCurator project selection

Once you have selected your dataset and metadata template, navigate to the second tab "Get Metadata Template" and under "Have Previously Submitted Metadata?" click on "Link to Google Sheets". This will generate a link to a Google spreadsheet containing the metadata available for each of the files in your dataset.

dataCurator MetadataTab

Data Curator metadata update google sheets link

You can fill out the sheet on the web, using dropdowns with allowed values and other standard Google Sheet features.

gtemplate Filled

Note that you can also save the spreadsheet as a CSV file and use a method of your choice to fill it out. The metadata CSV will be validated by the Data Curator app before submission in any case.
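
If you edit the CSV outside of Google Sheets, it can help to review exactly which cells changed before re-uploading. The following sketch compares the previously submitted CSV with your updated copy. It is an assumption-laden illustration: the function names are hypothetical, and it assumes both files share a column that uniquely identifies each file (called `Filename` in the usage note, which may differ in your template).

```python
import csv


def load_by_key(csv_path, key):
    """Load a CSV into a dict that maps each row's `key` value to the row."""
    with open(csv_path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}


def changed_cells(old_rows, new_rows):
    """Return (key, column, old_value, new_value) for every cell that differs.

    Rows present only in new_rows are reported with old_value set to None.
    """
    changes = []
    for key, new_row in new_rows.items():
        old_row = old_rows.get(key, {})
        for column, new_value in new_row.items():
            old_value = old_row.get(column)
            if old_value != new_value:
                changes.append((key, column, old_value, new_value))
    return changes
```

A typical call would be `changed_cells(load_by_key('old.csv', 'Filename'), load_by_key('new.csv', 'Filename'))`.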

Once updated, you can save your spreadsheet as a CSV (File -> Download -> Comma-separated Value...)

gtemplate dlCSV

Next: navigate to the third tab "Submit & Validate Metadata"

dataCurator SubmitTab

Upload your saved CSV.

dataCurator UploadCSV

  • If upload was successful, you will see your metadata entries in the Metadata Preview

dataCurator MetadataPreview

Click "Validate Metadata"
  • If your metadata is valid, you will see a corresponding message and a "Submit" button will become available.

dataCurator ValidateSuccess

  • Clicking the "Submit" button confirms that this dataset has been curated according to the latest DCC data model. You will receive a link to your metadata in the Synapse system.

dataCurator SubmitSuccess

If your metadata has been validated and submitted successfully, your metadata will appear in the "Files and Metadata" Table in your Synapse Project.

Fileview NewAnno

If you receive an error upon pressing the "Validate Metadata" button, the metadata template cells causing the error will be highlighted, and a corresponding list of error details will be shown.

dataCurator ValidateError

  • You can edit your file in a Google spreadsheet (click the link following the errors) and re-download it as a CSV, or edit your CSV locally, as shown here in Excel.

excel TemplateFixed

  • Upload your file and see your metadata updates reflected

dataCurator UploadFixedFile

  • Press the "Validate Metadata" button again

dataCurator ValidateFixedFile

  • If all errors have been resolved, you can submit your validated metadata

dataCurator SubmitFixedFile

  • Please contact your DCC liaison if you cannot resolve a metadata error or have questions regarding metadata updates and submission.

Metadata and dataset submission confirmation

You can verify that both your dataset and metadata have been successfully submitted to the DCC by navigating to the Synapse project containing your dataset. The link to the project was provided by your DCC liaison in stage 1; the link is also generated by the Data Curator app in stage 2, if your metadata submission is successful.

If your dataset has been successfully submitted, under the Table tab of your project, there will be a table named 'hta-x-dataset', containing the list of files in your dataset and their metadata.
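
In addition to checking the web interface, you can confirm programmatically that every local file reached your Synapse folder. This is a minimal sketch rather than a DCC-provided check: the helper names are hypothetical, `folder_id` stands for the Synapse ID of your dataset folder, and the comparison is by file name only. It uses the Python client's `getChildren` call to list the files in the folder.

```python
def missing_from_synapse(local_names, synapse_children):
    """Return local file names that have no same-named entity in Synapse."""
    remote_names = {child["name"] for child in synapse_children}
    return sorted(set(local_names) - remote_names)


def check_upload(syn, folder_id, local_names):
    """List files in the Synapse folder and report any local names not found.

    `syn` is a logged-in synapseclient.Synapse object.
    """
    children = syn.getChildren(folder_id, includeTypes=["file"])
    return missing_from_synapse(local_names, children)
```

An empty list from `check_upload(syn, folder_id, ['file1.txt', 'file2.txt'])` indicates that all of the named files are present. This does not verify the metadata table itself.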