USING THE 1000 GENOMES DATA

This note describes some simple steps for using the data from the 1000 Genomes Project http://www.1000genomes.org/.

Bam files

Download an index file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20130502.low_coverage.alignment.index

Obtain a list of bam files (the first column)

cat 20130502.low_coverage.alignment.index |awk '{print $1}'|grep '\.mapped' > listbam.txt

There are 2535 bam files on the page ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/. Therefore, we can make a list of these bam files and their full links.

awk '{print "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/"$0}' < listbam.txt > listbamAndFullLinks.txt

We can choose a population (or multiple populations) to find tagSNPs. For example, here we choose the Mexican Ancestry in Los Angeles (MXL) population and find tagSNPs for FCGR3B gene.

cat listbamAndFullLinks.txt|grep "MXL" > listMXL.txt

The gene is at chr1:161592986-161601753 http://www.ncbi.nlm.nih.gov/gene/2215, so we will use samtools (Li et al. 2009 ) http://samtools.sourceforge.net/ to download a 1Mb region around the gene: chr1:161100000-162100000.

while read line
do
tempName=$(echo $line|awk -F"/" '{print $NF}')
samtools view -hb $line 1:161100000-162100000 > $tempName
done < listMXL.txt

After downloading, we can use samtools to keep only reads mapped:

for file in $(ls *bam)
do

##Index bam file
samtools index $file

##Keep only read mapped
sammtols view -F 4 $file -b > temp.bam

mv temp.bam $file
done

A good website to understand SAM/BAM flags is http://picard.sourceforge.net/explain-flags.html

VCF files

We can use samtools to obtain a vcf file (Danecek et al. 2011 ) http://vcftools.sourceforge.net/ for all MXL samples above or we can use the SNP data of the 1000 Genomes Project.

Here, we use the data from the 1000 Genomes Project.

All VCF files can be obtained from this page http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/

We choose a region, for example chr1:161400000-161700000 flanking the FCGR3 gene to identify tagSNPs.

Download and index the vcf file by using "tabix" and "bgzip" command lines in the tabix tool (Li, 2011 ) http://samtools.sourceforge.net/tabix.shtml:

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz 1:161400000-161700000  > chr1.161400000.161700000.vcf

Index the file:

bgzip chr1.161400000.161700000.vcf -c > chr1.161400000.161700000.vcf.gz

tabix -p vcf chr1.161400000.161700000.vcf.gz

References

P. Danecek, A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R.E. Handsaker, G. Lunter, G.T. Marth, S.T. Sherry, others, (2011) The variant call format and VCFtools. Bioinformatics 27 (15) 2156-2158
H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, others, (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25 (16) 2078-2079
Heng Li, (2011) Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27 (5) 718-719

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using1000Genome.md

using1000Genome.md

USING THE 1000 GENOMES DATA

Bam files

VCF files

References

Files

using1000Genome.md

Latest commit

History

using1000Genome.md

File metadata and controls

USING THE 1000 GENOMES DATA

Bam files

VCF files

References