This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

Add Lekha subset busco, kmermaid workflows, lemur unaligned #3

Open
wants to merge 209 commits into
base: main
ea3e706
Add workflows with Makefile and nextflow config'
olgabot Nov 4, 2019
4bc58f4
remove nextflow extra files
olgabot Nov 4, 2019
b53c851
Increase cpus and runtime
olgabot Nov 5, 2019
7f1a116
Update workflows/kmermaid/10x/Makefile
olgabot Mar 4, 2020
8fdc4b1
About to change branch
olgabot May 23, 2020
f72d192
Change to branch that propagates ksize (remove ribo merged into dev)
olgabot May 23, 2020
be81820
Commit final, usable Makefiles
olgabot Jun 3, 2020
1da69dd
Add more files for kmermaid minitest
olgabot Jun 3, 2020
8285dba
Whitespace to trigger changes
olgabot Jun 3, 2020
90ba0b9
Add makefile with soft link folder
olgabot Apr 10, 2020
6dc863c
Make separate run of only normal lung
olgabot Apr 13, 2020
994030e
Add normal lung only with dev
olgabot May 6, 2020
53cef96
Add nextflow config to limit processors
olgabot May 6, 2020
35ff388
remove Makefile~
pranathivemuri Jun 14, 2020
32204a2
Add makefiles, nextflow configs for running lemur and hlca lung data
olgabot Oct 6, 2020
367e189
Add makefile for hlca
olgabot Oct 8, 2020
2a97a66
Delete Makefile~
pranathivemuri Oct 8, 2020
7ca29a0
Delete nextflow.config~
pranathivemuri Oct 8, 2020
0faa2e7
Delete nextflow.config~
pranathivemuri Oct 8, 2020
e33aefd
Delete Makefile~
pranathivemuri Oct 8, 2020
e2470fd
hulk changes makefile
pranathivemuri Oct 8, 2020
c9af46f
add remove ribo rna rule
pranathivemuri Oct 12, 2020
f68e4a3
makefile for hlca remove ribo rna on lrrr as sortmerna was not workin…
pranathivemuri Oct 14, 2020
bad256c
lung patient 3 added rule
pranathivemuri Oct 16, 2020
2319175
Separate out busco human+mouse from rest of mammal subsetting"
olgabot Oct 22, 2021
ef4a06c
Initial commit
bluegenes Apr 1, 2020
d45a69e
init
bluegenes Apr 12, 2020
06cc5c8
add nbs
bluegenes May 19, 2020
ef69290
upd
bluegenes Apr 14, 2020
3a9d5cf
plts
bluegenes May 19, 2020
a87923c
Separate reading/writing, then plotting
olgabot May 19, 2020
b89a119
Re-add plotting notebook with full outputs
olgabot May 19, 2020
879de3b
Add notebook to visualize observed vs theoretical k-mers for each dat…
olgabot Jun 9, 2020
546dfdc
No more uniref
olgabot Jun 9, 2020
fc8cde1
Remove uniref, update molecular order
olgabot Jun 9, 2020
13cfdd0
Add human qfo notebooks so far
olgabot Oct 1, 2020
762d3a1
Clean up notebooks with Lekha
olgabot Oct 2, 2020
982215b
Add python gitignore
olgabot Oct 2, 2020
6a52234
Update notebooks using BUSCO mammalia only reads
olgabot Oct 5, 2020
a51a580
Add more notebooks for qfo analyses
olgabot Oct 6, 2020
e77ebe0
Finalize more qfo analysesA
olgabot Oct 29, 2020
667c7b6
Update figure plotting
olgabot Apr 26, 2021
075265f
Update ROC AUC figure
olgabot Apr 26, 2021
dbbdb01
Add notebooks updated for paper on 2021-10-07
olgabot Oct 7, 2021
b39d7ac
Remove old README
olgabot Oct 7, 2021
a749525
Add notebooks for analyzing sketches so far
olgabot Sep 30, 2020
83db3d5
Add more notebooks for searching interferon
olgabot Oct 5, 2020
ad4f9b7
Add orthogroup signatures, searching for interferons
olgabot Oct 21, 2020
0bd191f
Update interferon search
olgabot Jul 19, 2021
4d17b9d
Add olga's hackycode for signatures
olgabot Oct 16, 2020
36af4e3
updates to removing ribosomal reads
phoenixAja Nov 13, 2020
6e73a3b
fixing remove ribosomal, added some basic logging to see which signat…
phoenixAja Nov 19, 2020
9d2c02d
remove ribosomal working, but not with some signatures
phoenixAja Nov 20, 2020
9a8e5ff
all commands should be good here up to searching human in mouse sbt n…
phoenixAja Nov 20, 2020
32cc1aa
Add all updated notebooks for cell lookups
olgabot Dec 2, 2020
3c90b2b
Add workflows for predictorthologs
olgabot Dec 2, 2020
0be5eb6
Add intersection comparison notebooks
olgabot Jul 19, 2021
1a1765f
Add more notebooks for translate resutls so far
olgabot Oct 21, 2020
fa13808
batlas makefile updated
phoenixAja Sep 21, 2020
3d99009
Update workflows/kmermaid/bat/Makefile
phoenixAja Sep 21, 2020
3c81eab
Update workflows/kmermaid/bat/Makefile
phoenixAja Sep 21, 2020
85924f3
small workflow changes
phoenixAja Sep 30, 2020
20efcc0
subsetted to lung tissues
phoenixAja Oct 5, 2020
617e8c5
Update Makefile
phoenixAja Oct 5, 2020
43957be
moved time to sencha translate
phoenixAja Oct 5, 2020
b56b642
ignore mltiqc
phoenixAja Oct 6, 2020
eebf6f2
fixed multiqc
phoenixAja Oct 6, 2020
b33cdab
small changes
phoenixAja Oct 8, 2020
2a15eba
updates params for jaccard threshold and set ksize as 8
phoenixAja Oct 12, 2020
7be4b81
Add separate rule for ksize 8
olgabot Oct 14, 2020
1dceaa6
Add COMMON_FLAGS variable to reduce repeating oneself
olgabot Oct 14, 2020
d1e2569
Add updated create_sourmash_command_utils.py and sig_utils.py
olgabot Dec 2, 2020
ba2c03a
modified join ontologies to not have to subsample tissues types
phoenixAja Dec 4, 2020
42593c1
Add removing common hashes
olgabot Dec 12, 2020
9c83f79
Add all notebooks for analyzing mouse2mouse, human2human so far
olgabot Dec 20, 2020
6cc0adf
Add pareto analysis
olgabot Dec 24, 2020
2199abd
Add bat lookup
olgabot Dec 26, 2020
4e1fa33
Add removing common hashes
olgabot Dec 12, 2020
adae8c3
Update making celltype sbts
olgabot Dec 20, 2020
c2c1e48
Add nextflow runs so far
olgabot Jan 6, 2021
95e6dfd
Reduce translate cpus'
olgabot Jan 6, 2021
09bcfe8
Remove unused makefile
olgabot Jan 6, 2021
50fb32f
Add initial makefile for mouse
olgabot Jan 6, 2021
a146d37
Update mouse bladder makefile
olgabot Jan 6, 2021
9ef4140
Update mouse makefils from Hulk
olgabot Jan 7, 2021
a1c7fec
Add with tower
olgabot Jan 11, 2021
82d31a5
Add all notebooks from lrrr on Tue Jan 19 09:44:18 PST 2021
olgabot Jan 19, 2021
bf88da7
Add lung and diagnostic kmer analyses
olgabot Jan 25, 2021
d3e8196
Add more dissociation/ribosomal genes
olgabot Jan 25, 2021
7b2dbe5
Add new rule for ksize v8
olgabot Jan 25, 2021
8240aa1
Add comment about ksize_8 v2
olgabot Jan 25, 2021
10ca235
Add all notebooks from lrrr on Tue Mar 30 09:03:12 PDT 2021
olgabot Mar 30, 2021
d92836f
Add all notebooks from lrrr on Wed Apr 14 11:31:39 PDT 2021
olgabot Apr 14, 2021
2ed8e02
Add updated sig2kmer
olgabot Apr 16, 2021
edf25f7
Add python files for aggregating sig2kmer
olgabot Apr 22, 2021
a0d5f78
Update sig2kmer aggregation
olgabot Apr 22, 2021
cebea73
Update sencha --> orpheum
olgabot Apr 22, 2021
faa94e8
Add error message if no objects to contatenate
olgabot Apr 28, 2021
68b7ad7
Add continue
olgabot Apr 28, 2021
affef6b
Update sig2kmer aggregation
olgabot Apr 28, 2021
250b7ee
Add option to skip aligned/unaligned subdirs
olgabot Apr 28, 2021
496c490
Only glob for sketch directories
olgabot Apr 29, 2021
ceb1789
Add script to get unique k-mers per celltype
olgabot Apr 29, 2021
f5604b8
Update uniquify code
olgabot Apr 29, 2021
86d26f2
Get unique k-mers per cell id
olgabot May 3, 2021
04bfb29
Skip pandas tokenization/parser errors
olgabot May 4, 2021
d794a94
read hashval as string
olgabot May 4, 2021
45df976
Add notebooks analyzing lung data
olgabot May 12, 2021
4ffebe3
Add reruning of singlecell fastas
olgabot Jan 30, 2021
6e78a54
Makefile and config update
Oct 26, 2020
8652d52
batlas subsetting and metadata prep
phoenixAja Sep 30, 2020
1aa156a
cleaned notebooks
phoenixAja Oct 8, 2020
9cebad3
cleaned notebook
phoenixAja Oct 8, 2020
01193e4
Add everything from lrrr
olgabot Jul 18, 2021
0cc4293
Use incoming changes (--theirs) for sourmash search commands
olgabot Jul 19, 2021
10b7398
Update human makefile
olgabot Jan 7, 2021
dff83de
Skip compute for makefile
olgabot Jan 7, 2021
a8c7c21
Don't use local tower
olgabot Jan 15, 2021
1b9fe71
Add mini makefile
olgabot Dec 12, 2020
5f3b509
Rename file paths: data_sm --> data_lg/data_sm_copy
olgabot Oct 15, 2021
77c597a
Initial commit
bluegenes Apr 1, 2020
a121fd1
init
bluegenes Apr 12, 2020
2acb80b
add nbs
bluegenes May 19, 2020
c7ec62a
upd
bluegenes Apr 14, 2020
d8cfdc1
plts
bluegenes May 19, 2020
c24250d
Separate reading/writing, then plotting
olgabot May 19, 2020
5d99785
Re-add plotting notebook with full outputs
olgabot May 19, 2020
975fc5c
Add notebook to visualize observed vs theoretical k-mers for each dat…
olgabot Jun 9, 2020
99c3c7f
No more uniref
olgabot Jun 9, 2020
0f5e70d
Remove uniref, update molecular order
olgabot Jun 9, 2020
269e933
Add human qfo notebooks so far
olgabot Oct 1, 2020
e351123
Clean up notebooks with Lekha
olgabot Oct 2, 2020
92fc65c
Add python gitignore
olgabot Oct 2, 2020
8dd3237
Update notebooks using BUSCO mammalia only reads
olgabot Oct 5, 2020
2f45a08
Add more notebooks for qfo analyses
olgabot Oct 6, 2020
9bdfa1b
Finalize more qfo analysesA
olgabot Oct 29, 2020
9d0ee59
Update figure plotting
olgabot Apr 26, 2021
6985920
Update ROC AUC figure
olgabot Apr 26, 2021
6eb2a65
Add notebooks updated for paper on 2021-10-07
olgabot Oct 7, 2021
3e4bed3
Remove old README
olgabot Oct 7, 2021
77f04ec
Add figure 3B-C notebooks, plus data aggregatoin
olgabot Oct 20, 2021
29511ec
Separate analyses to per-figure notebooks
olgabot Oct 20, 2021
d6f3bed
Rename old notebooks
olgabot Oct 20, 2021
0f6cb91
Add Figure 3 and SFig 2, 6 notebooks
olgabot Oct 21, 2021
1a691ea
Update Supplementary Figure 6 notebook
olgabot Oct 22, 2021
5640fc5
Initial commit
bluegenes Apr 1, 2020
80997ac
init
bluegenes Apr 12, 2020
c3ca317
add nbs
bluegenes May 19, 2020
81d0d8c
upd
bluegenes Apr 14, 2020
5f528a7
plts
bluegenes May 19, 2020
6b88941
Separate reading/writing, then plotting
olgabot May 19, 2020
c66c8c8
Re-add plotting notebook with full outputs
olgabot May 19, 2020
6eeff59
Add notebook to visualize observed vs theoretical k-mers for each dat…
olgabot Jun 9, 2020
aa0c347
No more uniref
olgabot Jun 9, 2020
b5e9a54
Remove uniref, update molecular order
olgabot Jun 9, 2020
bb375f2
Add human qfo notebooks so far
olgabot Oct 1, 2020
f03ebb3
Clean up notebooks with Lekha
olgabot Oct 2, 2020
906e22b
Add python gitignore
olgabot Oct 2, 2020
38f278d
Update notebooks using BUSCO mammalia only reads
olgabot Oct 5, 2020
71d5550
Add more notebooks for qfo analyses
olgabot Oct 6, 2020
0035de2
Finalize more qfo analysesA
olgabot Oct 29, 2020
5e981fa
Update figure plotting
olgabot Apr 26, 2021
5674534
Update ROC AUC figure
olgabot Apr 26, 2021
7d1d8d5
Add notebooks updated for paper on 2021-10-07
olgabot Oct 7, 2021
a7acd52
Remove old README
olgabot Oct 7, 2021
dbb9710
Initial commit
bluegenes Apr 1, 2020
bcae423
init
bluegenes Apr 12, 2020
512d05f
add nbs
bluegenes May 19, 2020
dbfbf00
upd
bluegenes Apr 14, 2020
aff2dfe
plts
bluegenes May 19, 2020
df53113
Separate reading/writing, then plotting
olgabot May 19, 2020
dcc5fea
Re-add plotting notebook with full outputs
olgabot May 19, 2020
8f73544
Add notebook to visualize observed vs theoretical k-mers for each dat…
olgabot Jun 9, 2020
4fc2376
No more uniref
olgabot Jun 9, 2020
822fff9
Remove uniref, update molecular order
olgabot Jun 9, 2020
52d8e67
Add human qfo notebooks so far
olgabot Oct 1, 2020
7fc1b08
Clean up notebooks with Lekha
olgabot Oct 2, 2020
9fbd52b
Add python gitignore
olgabot Oct 2, 2020
7e729e5
Update notebooks using BUSCO mammalia only reads
olgabot Oct 5, 2020
1090d2d
Add more notebooks for qfo analyses
olgabot Oct 6, 2020
7450b8f
Finalize more qfo analysesA
olgabot Oct 29, 2020
77418ec
Update figure plotting
olgabot Apr 26, 2021
393513f
Update ROC AUC figure
olgabot Apr 26, 2021
352ed7f
Add notebooks updated for paper on 2021-10-07
olgabot Oct 7, 2021
2fdd98f
Remove old README
olgabot Oct 7, 2021
ba94e1d
Initial commit
bluegenes Apr 1, 2020
f4a39de
init
bluegenes Apr 12, 2020
f9cf1ec
add nbs
bluegenes May 19, 2020
c7a9b24
upd
bluegenes Apr 14, 2020
2a07c3e
plts
bluegenes May 19, 2020
85d37a8
Separate reading/writing, then plotting
olgabot May 19, 2020
0f270c7
Re-add plotting notebook with full outputs
olgabot May 19, 2020
6a5d2cb
Add notebook to visualize observed vs theoretical k-mers for each dat…
olgabot Jun 9, 2020
de4ca9f
No more uniref
olgabot Jun 9, 2020
268217f
Remove uniref, update molecular order
olgabot Jun 9, 2020
2d9e2f9
Add human qfo notebooks so far
olgabot Oct 1, 2020
775d2c7
Clean up notebooks with Lekha
olgabot Oct 2, 2020
e54414d
Add python gitignore
olgabot Oct 2, 2020
7d7bc04
Update notebooks using BUSCO mammalia only reads
olgabot Oct 5, 2020
1898016
Add more notebooks for qfo analyses
olgabot Oct 6, 2020
aa70807
Finalize more qfo analysesA
olgabot Oct 29, 2020
8956760
Update figure plotting
olgabot Apr 26, 2021
7ec3fdd
Update ROC AUC figure
olgabot Apr 26, 2021
4d33f4c
Add notebooks updated for paper on 2021-10-07
olgabot Oct 7, 2021
20400b4
Remove old README
olgabot Oct 7, 2021
166e57e
Add notebook for lemur unaligned genes
olgabot Oct 22, 2021
1f347d6
Use both mouse and human aligned kmers to annotate lemur kmers
olgabot Oct 22, 2021
d5adbcc
Merge branch 'main' into olgabot/add-lekha-subset-busco
olgabot Oct 22, 2021
a096d61
Add lemur unaligned annotation notebook
olgabot Oct 22, 2021
142 changes: 142 additions & 0 deletions notebooks/110_subset_busco_mouse_human.ipynb
@@ -0,0 +1,142 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import gzip\n",
"import json\n",
"import os\n",
"import glob\n",
"\n",
"\n",
"import pandas as pd\n",
"import screed\n",
"from tqdm import tqdm\n",
"# import seaborn as sns\n",
"\n",
"# %matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Subset protein fasta files for only BUSCO mammalia ids"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform subsetting of fasta files from uniprot IDs"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def subset_by_uniprot_ids(fasta, uniprot_ids):\n",
"    \"\"\"Return the records in `fasta` whose UniProt accession is in `uniprot_ids`.\"\"\"\n",
"    records_subset = []\n",
"    with screed.open(fasta) as records:\n",
"        for record in records:\n",
"            # UniProt headers look like 'db|ACCESSION|ENTRY_NAME description'\n",
"            name = record['name']\n",
"            record_id = name.split()[0]\n",
"            uniprot_id = record_id.split('|')[1]\n",
"            if uniprot_id in uniprot_ids:\n",
"                records_subset.append(record)\n",
"    return records_subset\n",
"\n",
"\n",
"def write_fasta(output_fasta, records):\n",
"    with open(output_fasta, 'w') as f:\n",
"        for record in records:\n",
"            f.write(\">{name}\\n{sequence}\\n\".format(**record))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(64199, 6)\n",
"{'Capra hircus', 'Camelus bactrianus', 'Ornithorhynchus anatinus', 'Rhinolophus sinicus', 'Homo sapiens', 'Phascolarctos cinereus', 'Macaca mulatta', 'Aotus nancymaae', 'Oryctolagus cuniculus', 'Nannospalax galili', 'Ceratotherium simum simum', 'Tupaia chinensis', 'Erinaceus europaeus', 'Peromyscus maniculatus bairdii', 'Chinchilla lanigera', 'Lipotes vexillifer', 'Sorex araneus', 'Mus musculus'}\n"
]
}
],
"source": [
"# NOTE: busco_mammalia_folder is not defined in this notebook; set it before running\n",
"orthodb_busco_mammalia = pd.read_csv(\n",
"    f'{busco_mammalia_folder}/busco_mammalia__orthodb__to__uniprot__with_species.csv'\n",
")\n",
"print(orthodb_busco_mammalia.shape)\n",
"\n",
"print(set(orthodb_busco_mammalia['species_name']))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Each call below re-reads the full gzipped FASTA, once per species\n",
"uniprot_fasta = \"/home/olga/data_lg/czbiohub-reference/uniprot/releases/2019_11/manually_downloaded/uniprot__taxonomy_Mammalia_9MAMM_40674.fasta.gz\"\n",
"output_folder = \"/home/olga/data_lg/czbiohub-reference/uniprot/releases/2019_11/manually_downloaded/mammalia_busco_subsets/\"\n",
"\n",
"for (species_id, species_name), df in orthodb_busco_mammalia.groupby([\"species\", \"species_name\"]):\n",
"    species_uniprot_ids = set(df.external_db_gene_id)\n",
"    species_records = subset_by_uniprot_ids(uniprot_fasta, species_uniprot_ids)\n",
"    new_fasta = f\"{output_folder}{species_id}__{species_name.lower().replace(' ', '_')}.fasta\"\n",
"    write_fasta(new_fasta, species_records)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:immune-evolution]",
"language": "python",
"name": "conda-env-immune-evolution-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 4
}
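Review note on notebooks/110_subset_busco_mouse_human.ipynb: the last code cell calls subset_by_uniprot_ids once per species, so the full Mammalia FASTA is scanned once for each species in the groupby. A possible alternative, sketched below, is to build one accession-to-species lookup and bucket records in a single pass. This is only a sketch under stated assumptions: it swaps screed for a small standard-library FASTA reader so that it runs on its own, the names read_fasta, split_by_species and id_to_species are hypothetical, and the toy records are invented for illustration.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


def split_by_species(fasta, id_to_species):
    """Bucket records by species in a single pass over the FASTA.

    id_to_species (hypothetical) maps a UniProt accession to a species label.
    """
    buckets = defaultdict(list)
    for header, sequence in read_fasta(fasta):
        fields = header.split()[0].split("|")
        if len(fields) < 2:
            continue  # header is not in 'db|ACCESSION|ENTRY_NAME' form
        species = id_to_species.get(fields[1])
        if species is not None:
            buckets[species].append((header, sequence))
    return dict(buckets)


# Toy input, invented purely for illustration (not real UniProt entries).
toy_fasta = (
    ">sp|P00001|AAA_HUMAN toy protein\nMKT\nLLV\n"
    ">sp|P00002|BBB_MOUSE toy protein\nMAA\n"
    ">sp|P00003|CCC_CAPHI toy protein\nMGG\n"
)
toy_ids = {"P00001": "homo_sapiens", "P00002": "mus_musculus"}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.fasta.gz")
    with gzip.open(toy_path, "wt") as handle:
        handle.write(toy_fasta)
    result = split_by_species(toy_path, toy_ids)

for species, records in sorted(result.items()):
    print(species, len(records))
```

In the notebook, the lookup could presumably be built from the external_db_gene_id and species_name columns of orthodb_busco_mammalia, with each bucket then passed to write_fasta. Whether the saving matters depends on the size of the FASTA and the number of species.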
441 changes: 441 additions & 0 deletions notebooks/111_subset_busco_all_mammals.ipynb

Large diffs are not rendered by default.
