Dataset10x Failing to Load #51

Closed
drneavin opened this issue Nov 16, 2020 · 3 comments

drneavin commented Nov 16, 2020

Hello,

I'm not sure of the best place to report this issue. It arises when using the solo package, but I'm fairly certain that the problem lies in the scvi package. I am getting the following output, ending in the error:


[2020-11-16 15:18:18,195] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2020-11-16 15:18:18,202] DEBUG - scvi.dataset.dataset10X | Loading extracted local 10X dataset with custom filename
[2020-11-16 15:18:18,202] INFO - scvi.dataset.dataset10X | Preprocessing dataset
/opt/conda/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729021865/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py36/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/opt/solo/solo/solo.py", line 123, in main
    dense=True)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 156, in __init__
    delayed_populating=delayed_populating,
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset.py", line 2026, in __init__
    self.populate()
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 196, in populate
    gene_names = measurements_info[self.measurement_names_column].astype(np.str)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 1

I can reproduce this error in Python by importing scvi and loading the dataset with Dataset10X, but I can't identify why it occurs.
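For readers hitting the same traceback: the final frames show a pandas column lookup with the integer key 1 failing, which is what happens when the table being indexed has only a column 0. A minimal sketch of that failure mode follows. It assumes a gene/feature file with a single column, read without a header; that is only one possible cause consistent with the traceback and is not confirmed anywhere in this thread. The gene names and the `pick_name_column` helper are hypothetical and are not part of scvi or solo.

```python
import io

import pandas as pd

# Hypothetical one-column gene file (an assumption, not the reporter's data).
one_column_tsv = "GeneA\nGeneB\nGeneC\n"

# Reading without a header gives integer column labels starting at 0.
genes = pd.read_csv(io.StringIO(one_column_tsv), sep="\t", header=None)
print(list(genes.columns))

# Asking for column 1 on a one-column table raises KeyError with key 1,
# the same kind of error shown at the end of the traceback.
try:
    genes[1]
    error = None
except KeyError as err:
    error = err
print(type(error).__name__, error.args)


def pick_name_column(frame, preferred=1):
    """Hypothetical helper: use the preferred column if present, else the first."""
    if preferred in frame.columns:
        return frame[preferred]
    return frame[frame.columns[0]]


print(pick_name_column(genes).tolist())
```

Checking how many tab-separated columns the gene or feature file actually has is a quick way to confirm or rule out this explanation.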

The barcodes in this dataset are letters followed by a dash, a number, and sometimes another letter, e.g.:

AACCGCGGTTGGTTTG-16
AACTCAGCACGGTAAG-16
ACCTTTACAACAACCT-5
ACGCAGCCAATGAAAC-9
ACGGAGAGTCAGATAA-9
ACGGGCTGTTTACTCT-14
ACTGATGTCTTGCAAG-4
ACTTTCAGTCTCTTTA-9
AGGTCATCAAACAACA-4D

and the files were produced with umitools_to_mtx from the R scrunchy package. I have a feeling the main problem is somehow related to the fact that these files were not produced directly by the 10x cellranger pipeline. Here's the top of the matrix.mtx file:

%%MatrixMarket matrix coordinate integer general
%
20469 15266 4765018
1 1 1
90 1 1
129 1 1
169 1 1
170 1 13
245 1 1
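As a reading aid for the excerpt above: in a MatrixMarket coordinate file, the first line that is not a `%` comment gives the number of rows, the number of columns, and the number of stored entries. The sketch below parses that line using only the standard library. The function name `read_mtx_shape` is hypothetical, and the interpretation of rows as genes and columns as barcodes is an assumption based on the usual 10x-style layout, not something stated in the thread.

```python
def read_mtx_shape(lines):
    """Return (rows, cols, stored_entries) from MatrixMarket coordinate text."""
    for line in lines:
        # Skip the banner and any comment lines, which start with '%'.
        if line.startswith("%") or not line.strip():
            continue
        rows, cols, entries = (int(field) for field in line.split())
        return rows, cols, entries
    raise ValueError("no size line found")


# The first lines of the matrix.mtx file quoted in the issue.
header = """%%MatrixMarket matrix coordinate integer general
%
20469 15266 4765018
1 1 1
90 1 1
"""

rows, cols, entries = read_mtx_shape(header.splitlines())
print(rows, cols, entries)
```

Under the assumed layout, the gene or feature file would be expected to have 20469 lines and the barcode file 15266 lines, so comparing those line counts against the size line is a cheap consistency check.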

I'll do some more digging, but I would love any recommendations or input you have.

Thanks!

@njbernstein
Contributor

njbernstein commented Nov 16, 2020

Hi @drneavin ,

I'm not sure what the issue is exactly, and you are right that it seems to be an scvi issue. However, they are currently updating their code, so if it's a true bug on their end, things might get tough. They might still have some suggestions about things to try.

Another option is to read your files into Python with scanpy, write an AnnData file, and then run solo on that file. Sorry I'm not of more help.

Taking a quick look at some files I have, the ones you posted seem normal.

nicholas@sci-pvm-nicholas:~$ head matrix.mtx 
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.1.0"}
33646 4745 4759983
33574 1 30
33567 1 69
33566 1 28
33559 1 522
33558 1 50
33551 1 788
33509 1 45
nicholas@sci-pvm-nicholas:~$ head barcodes.tsv 
AAACCCAGTAAGATCA-1
AAACCCATCAGAGCAG-1
AAACGAACACAAATAG-1
AAACGAACAGATTAAG-1
AAACGAAGTTGCCATA-1
AAACGAATCAGGTGTT-1
AAACGCTAGAGTCTTC-1
AAACGCTAGATGTAGT-1
AAACGCTGTCAAGTTC-1
AAACGCTGTGACTCGC-1
nicholas@sci-pvm-nicholas:~$ head features.tsv 
ENSG00000243485	MIR1302-2HG	Gene Expression
ENSG00000237613	FAM138A	Gene Expression
ENSG00000186092	OR4F5	Gene Expression
ENSG00000238009	AL627309.1	Gene Expression
ENSG00000239945	AL627309.3	Gene Expression
ENSG00000239906	AL627309.2	Gene Expression
ENSG00000241599	AL627309.4	Gene Expression
ENSG00000236601	AL732372.1	Gene Expression
ENSG00000284733	OR4F29	Gene Expression
ENSG00000235146	AC114498.1	Gene Expression

@drneavin
Author

Thanks for the recommendation @njbernstein!

I just finished checking scanpy loading, and that seems to fail as well. However, some of my other scripts that use only scipy to read the matrix work fine, so I think the failure probably comes from assumptions about the 10x file structure that are built into these loading functions.

I'll let you know if I find a good solution.

@drneavin
Author

Solved per your recommendation: I created an AnnData object, saved it as a standard h5ad file, and used that as input for solo. It seems to be working now. Thanks for the help!
