GENCODE

GENCODE: massively expanding the lncRNA catalog through capture long-read RNA sequencing

Gazaldeep Kaur1,*, Tamara Perteghella1,2,*, Sílvia Carbonell-Sala1,*, Jose Gonzalez-Martinez3,*, Toby Hunt3,*, Tomasz Mądry4, Irwin Jungreis5,6, Carme Arnan1, Julien Lagarde1,7, Beatrice Borsari8,9, Cristina Sisu10, Yunzhe Jiang8,9, Ruth Bennett3, Andrew Berry3, Daniel Cerdán-Vélez11, Kelly Cochran12, Covadonga Vara13, Claire Davidson3, Sarah Donaldson3, Cagatay Dursun8,9, Silvia González-López1,2, Sasti Gopal Das4, Matthew Hardy3, Zoe Hollis3, Mike Kay3, José Carlos Montañés13, Pengyu Ni8,9, Ramil N. Nurtdinov1, Emilio Palumbo1, Carlos Pulido-Quetglas14,15, Marie-Marthe Suner3, Xuezhu Yu8,9, Dingyao Zhang8,9, Jane E. Loveland3, M. Mar Albà13,16, Mark Diekhans17, Andrea Tanzer18,19, Jonathan M. Mudge3, Paul Flicek3, Fergal J Martin3, Mark Gerstein8,9, Manolis Kellis5,6, Anshul Kundaje12, Benedict Paten17, Michael L. Tress11, Rory Johnson14,15, Barbara Uszczynska-Ratajczak4, Adam Frankish3, Roderic Guigó1,2

1. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Catalonia, Spain.
2. Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra (UPF).
3. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
4. Department of Computational Biology of Noncoding RNA, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
5. Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, USA.
6. The Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA.
7. Flomics Biotech, SL, Carrer de Roc Boronat 31, 08005 Barcelona, Catalonia, Spain.
8. Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.
9. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
10. Department of Life Sciences, Brunel University London, Uxbridge, London, UB8 3PH, UK.
11. Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernandez Almagro, 3, 28029 Madrid, Spain.
12. Department of Computer Science, Stanford University, Stanford, CA, USA.
13. Hospital del Mar Research Institute, Dr. Aiguader 88, Barcelona 08003, Spain.
14. Department of Medical Oncology, Bern University Hospital, Murtenstrasse 35, 3008 Bern, Switzerland.
15. School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, D04 V1W8, Ireland.
16. Catalan Institute for Research and Advanced Studies (ICREA), Barcelona, Spain.
17. UC Santa Cruz Genomics Institute, 2300 Delaware Avenue, University of California, Santa Cruz, CA 95060, USA.
18. University of Vienna, Research Network Data Science, Kolingasse 14-16, 1090 Vienna, Austria.
19. University of Vienna, Faculty of Computer Science, Research Group Visualization and Data Analysis, Waehringerstrasse 29, 1090 Vienna, Austria.

* Equal contribution
Correspondence should be addressed to R.G. ([email protected])

GENCODE is a 20-year international project focused on producing high-quality annotations of the human and mouse genomes, which are crucial for understanding gene function. While the catalog of human protein-coding genes is nearly complete, long non-coding RNA (lncRNA) annotations have remained inconsistent across different catalogs. To address this, GENCODE used targeted RNA sequencing to unify and expand lncRNA annotations in human and mouse, employing full-length sequencing across diverse tissues. This effort resulted in 16,817 new human genes and 22,210 new mouse genes, significantly expanding the lncRNA catalog and improving orthology mapping between the two species. These new annotations enhance the functional interpretation of genome data by linking previously unannotated regions to biological functions.

In this repository:

- Summary of the steps taken to process the long-read data after sequencing but before LyRic, including the measures used to assess data quality prior to downstream processing.

- List of the files used in this work, with descriptions of the steps taken to generate them, direct download links, and detailed information about formats and tags.

- Datasets used in this work, with useful information about the files and their processing prior to analysis.

- Code used in the various downstream analyses.