TARGET classification workflow (using the GDC Data Portal)

Set up the directory structure:

project_dir="/data/BIDS-HPC/private/projects/dmi2"
working_dir="/home/weismanal/notebook/2020-06-10/dmi"
mkdir -p "$project_dir" "$working_dir"
cd "$working_dir"
git clone git@github.com:andrew-weisman/target_classification.git "$project_dir/checkout"
mkdir "$project_dir/data"

Note: The effort using the data directly from the TARGET data website (as opposed to the GDC Data Portal) is in the target_data_website branch of this repository.

Download the manifest for all the gene expression quantification files in the TARGET program from the GDC Data Portal (click on the blue "Manifest" button).

Save the downloaded manifest file as $project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt.

In addition, click on the blue "Add All Files to Cart" button, go to the cart (top right of page), click on the two blue buttons "Sample Sheet" and "Metadata", and save the resulting two files to $project_dir/data. The two files will be named, e.g., gdc_sample_sheet.2020-07-02.tsv and metadata.cart.2020-07-02.json.

Note that these 5,149 files correspond to 1,192 cases (that is, individual patients).
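
The file and case counts can be cross-checked against the downloaded sample sheet. The following is a minimal sketch, not part of the repository: the function name count_unique_column is hypothetical, and it assumes the sample sheet is tab-separated with a header row containing columns named "File ID" and "Case ID". If a row lists several cases in one field, it is counted as a single distinct value, so the result may differ slightly from the portal's figure.

```shell
# Count distinct values in a named column of a tab-separated file.
# Usage: count_unique_column FILE "Column Name"
# (Hypothetical helper; the column names are assumptions about the sample sheet.)
count_unique_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; missing = 1; exit 1 }
            next
        }
        { seen[$idx] = 1 }
        END {
            if (missing) exit 1
            n = 0
            for (k in seen) n++
            print n
        }
    ' "$1"
}
```

For example, count_unique_column "$project_dir/data/gdc_sample_sheet.2020-07-02.tsv" "Case ID" prints the number of distinct case entries, and the same call with "File ID" prints the number of distinct files.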

Download the expression files from the manifest on Helix:

module load gdc-client
mkdir "$project_dir/data/all_gene_expression_files_in_target"
cd "$project_dir/data/all_gene_expression_files_in_target"
gdc-client download -m "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt"
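
Before extracting anything, it may be worth confirming that the download is complete. The sketch below is an assumption-laden illustration rather than part of the workflow: verify_manifest is a hypothetical name, the manifest is assumed to be tab-separated with a header row and columns id, filename, md5 in that order, each file is assumed to sit at <download dir>/<id>/<filename>, and md5sum is assumed to be available. It must be run before the gunzip step, since the checksums refer to the compressed files.

```shell
# Check that every file listed in a manifest exists under DIR/<id>/<filename>
# and has the expected MD5 checksum. Prints one line per problem and returns
# non-zero if any problem was found.
# Usage: verify_manifest MANIFEST DIR
verify_manifest() {
    manifest="$1"
    dir="$2"
    status=0
    # Read the first three tab-separated columns of each entry; the header
    # row is skipped by "tail -n +2" in the here-document below.
    while IFS="$(printf '\t')" read -r id filename md5 _rest; do
        [ -n "$id" ] || continue
        path="$dir/$id/$filename"
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        elif [ "$(md5sum < "$path" | awk '{print $1}')" != "$md5" ]; then
            echo "checksum mismatch: $path"
            status=1
        fi
    done <<EOF
$(tail -n +2 "$manifest")
EOF
    return "$status"
}
```

A call such as verify_manifest "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" . from inside the download directory prints nothing when every file is present and intact.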

Extract the resulting compressed files and link to them from a single folder $project_dir/data/all_gene_expression_files_in_target/links:

mkdir links
cd links
find ../ -iname "*.gz" -exec gunzip {} +
find ../ -type f | grep -v "/logs/\|/annotations.txt" | while IFS= read -r file; do ln -s "$file"; done
ln -s "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" MANIFEST.txt
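
Because the links use relative paths, they break if the links folder or its targets are moved. As a quick sanity check, the following sketch lists any links whose targets no longer exist. The name find_broken_links is hypothetical, and -maxdepth is assumed to be supported by the installed find (it is in GNU and BSD find).

```shell
# List symbolic links directly inside DIR whose targets do not exist.
# Prints nothing when every link resolves.
# Usage: find_broken_links DIR
find_broken_links() {
    # "test -e" follows the link, so it fails only for dangling links.
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running find_broken_links . from inside the links folder should print nothing if the step above succeeded.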

Note that

for file in $(ls | grep -v MANIFEST.txt); do echo $file | awk -v FS="." '{print $1}'; done | sort -u | wc -l

shows that, ostensibly, there are 2,481 unique expression files (independent of normalization). This is just based on the filenames, and is not actually correct.
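
Since the command above looks only at the part of each filename before the first ".", a complementary view is to tally the remaining suffixes, which shows how many files of each kind are present. This is only a sketch under the same filename-based assumption, and count_by_suffix is a hypothetical name; it does not by itself explain why the count above is wrong.

```shell
# For every entry in DIR except MANIFEST.txt, strip everything up to and
# including the first "." and count how many names share each remaining suffix.
# Usage: count_by_suffix DIR
count_by_suffix() {
    ls "$1" | grep -v '^MANIFEST\.txt$' | sed 's/^[^.]*\.//' | sort | uniq -c
}
```

Running count_by_suffix . from inside the links folder prints one line per suffix with its count.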

Start an interactive allocation, using, e.g.,

sinteractive --mem=40g # --mem=20g may be fine

Go through the Python Jupyter notebook /data/BIDS-HPC/private/projects/dmi2/checkout/main.ipynb. Use the conda environment /data/BIDS-HPC/public/software/conda/envs/r_env. (Note this environment contains pandas version 1.1.0, whereas Biowulf's default python module has pandas version 0.24.2, which is insufficient.) See here for more notes on the environment.
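
To confirm that the active environment has a recent enough pandas before opening the notebook, a version comparison along these lines may help. This is a sketch: version_ge is a hypothetical helper, it relies on "sort -V" from GNU coreutils, and 1.1.0 is used as the threshold only because that is the version the notebook was run with; the true minimum version is not stated here.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (run inside the activated environment):
#   installed=$(python -c 'import pandas; print(pandas.__version__)')
#   version_ge "$installed" 1.1.0 || echo "pandas $installed is older than 1.1.0"
```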

