HTSeq

GTF 文件，是一个压缩后的 gff文件。 htseq 提供了 GFF_Reader 来读取 GFF 文件。

 gtf_file = HTSeq.GFF_Reader( "Saccharomyces_cerevisiae.SGD1.01.56.gtf.gz",
    end_included=True )

我们通过下面来进行迭代这个文件：

>>> for feature in itertools.islice( gtf_file, 10 ):
...    print(feature)
...
<GenomicFeature: exon 'R0010W' at 2-micron: 251 -> 1523 (strand '+')>
<GenomicFeature: CDS 'R0010W' at 2-micron: 251 -> 1520 (strand '+')>
<GenomicFeature: start_codon 'R0010W' at 2-micron: 251 -> 254 (strand '+')>
<GenomicFeature: stop_codon 'R0010W' at 2-micron: 1520 -> 1523 (strand '+')>
<GenomicFeature: exon 'R0020C' at 2-micron: 3007 -> 1885 (strand '-')>
<GenomicFeature: CDS 'R0020C' at 2-micron: 3007 -> 1888 (strand '-')>
<GenomicFeature: start_codon 'R0020C' at 2-micron: 3007 -> 3004 (strand '-')>
<GenomicFeature: stop_codon 'R0020C' at 2-micron: 1888 -> 1885 (strand '-')>
<GenomicFeature: exon 'R0030W' at 2-micron: 3270 -> 3816 (strand '+')>
<GenomicFeature: CDS 'R0030W' at 2-micron: 3270 -> 3813 (strand '+')>ci

那些 feature 有下面的一些 object GenomicFeature. 如果你比对原始文件中的坐标，你会注意到 GFF_Reader 已经全部减去1.这是因为所以到文件需要通过HTSeq 调整来符合 Python 的习惯。python 习惯从0开始和末尾一般不会保存。因此，你可以直接比较坐标，来之不同的数据类型。也不用考虑细微的区别比如：gff 是1开始， sam 是0开始的。

>>> dir( feature )   
['__class__', ..., '__weakref__', 'attr', 'frame', 'get_gff_line',
'iv', 'name', 'score', 'source', 'type']

忽略那些特征开始于哪里的底线，我们可以看看如何来查看储存于 gff 文件中的数据。

>>> feature.iv
<GenomicInterval object '2-micron', [3270,3813), strand '+'>
>>> feature.source
'protein_coding'
>>> feature.type
'CDS'
>>> feature.score
'.'

最后一列（那列特征）被解析并被展示为一个字典：

>>> sorted(feature.attr.items())    
[('exon_number', '1'),
 ('gene_id', 'R0030W'),
 ('gene_name', 'RAF1'),
 ('protein_id', 'R0030W'),
 ('transcript_id', 'R0030W'),
 ('transcript_name', 'RAF1')]

第一列特征通常为 ID 因此它通常被存在 slot name:

>>> feature.name
'R0030W'

为了解决这些数据，我们会使用 GenomicArrayOfSets (之前介绍过)

>>> exons = HTSeq.GenomicArrayOfSets( "auto", stranded=False )

然而，我们的 RNA-seq 实验不是链特异性的，因此我们确实不知道序列来自于正链还是负链。这就是我们什么定义 GenomicArrayOfSet 为一个非链特异（stranded=False）来让它忽略所有的链信息。我们有很多相互交叉的基因，但是 GenomicArray 可以解决这些问题。

>>> for feature in gtf_file:
...    if feature.type == "exon":
...       exons[ feature.iv ] += feature.name

Note: 我们仅仅存储了基因的名字，因为这样可以很方便我们的后续处理。假如读取一个区间

>>> iv = HTSeq.GenomicInterval( "III", 23850, 23950, "." )

它的左边覆盖了2个基因 (YCL058C, YCL058W-A), 但是它的右侧仅仅覆盖 YCL058C 因为 YCL058-A 一端在 read 的中间。

>>> [(st[0], sorted(st[1])) for st in exons[iv].steps()]   
[(<GenomicInterval object 'III', [23850,23925), strand '.'>,
     ['YCL058C', 'YCL058W-A']),
 (<GenomicInterval object 'III', [23925,23950), strand '.'>,
     ['YCL058C'])]

假如那个转录序列的边界在 GTF 中是正确的，我们可能得出的结论是这条 read 来自基因，显示来自于2个基因（可以比对到2个基因），而不是来自于一个基因（只比对到一个基因）。更普遍的是，一个 read 可以覆盖多个 step, 我们有对于每个 feature 有一系列的 feature 名字。我们必须找到这些的交叉点。我们通过下面的代码来实现：

>>> iset = None
>>> for iv2, step_set in exons[iv].steps():
...    if iset is None:
...       iset = step_set.copy()
...    else:
...       iset.intersection_update( step_set )
...
>>> print(iset)
{'YCL058C'}

当我们看第一步，我们先拷贝了一份 (为了不破坏原来存储在 exon中的值)。接下来，我们使用 intersection_update python 方法，这样为了执行交叉的地方。然后我们得到只有一个的。得到这些一个，单个需要写下来：

>>> list(iset)[0]
'YCL058C'

通过这种方法，我们可以处理一遍所有的比对过的 reads，计算交叉的那些。如果它仅仅包括一个基因的名字，在这个基因的计数加一。对于每个基因，我们通过字典来计数，每个基因都从0开始计数。

>>> counts = {}
>>> for feature in gtf_file:
...    if feature.type == "exon":
...       counts[ feature.name ] = 0

现在，我们终于可以开始计数了：

>>> sam_file = HTSeq.SAM_Reader( "yeast_RNASeq_excerpt.sam" )
>>> for alnmt in sam_file:
...    if alnmt.aligned:
...       iset = None
...       for iv2, step_set in exons[ alnmt.iv ].steps():
...           if iset is None:
...              iset = step_set.copy()
...           else:
...              iset.intersection_update( step_set )
...       if len( iset ) == 1:
...          counts[ list(iset)[0] ] += 1

我们现在可以很方便的打印出结果：

>>> for name in sorted( counts.keys() ):
...    print(name, counts[name])   
15S_rRNA 0
21S_rRNA 0
HRA1 0
...
YPR048W 2
YPR049C 3
YPR050C 0
YPR051W 1
YPR052C 1
YPR053C 5
YPR054W 0
...
tY(GUA)M2 0
tY(GUA)O 0
tY(GUA)Q 0

一些比对软件可以输出有缺口的或者剪切过的比对结果。在 sam 文件中，这个通过 CIGAR 来进行编码。 HTSeq 已经可以处理这些。需要加参数 CigarOperation. Some aligners can output gapped or spliced alignments. In a SAM file, this in encoded in the CIGAR string. HTSeq has facilities to handle this conveniently, too, with the class CigarOperation. Chapter Counting reads in features with htseq-count describes a script which offers some further counting schemes.

The following figure illustrates the effect of these three modes and the --nonunique option:

Usage After you have installed HTSeq (see Prequisites and installation), you can run htseq-count from the command line:

htseq-count [options] <alignment_files> <gff_file>

If the file htseq-count is not in your path, you can, alternatively, call the script with

python -m HTSeq.scripts.count [options] <alignment_files> <gff_file>

The <alignment_files> are one or more files containing the aligned reads in SAM format. (SAMtools contain Perl scripts to convert most alignment formats to SAM.) Make sure to use a splicing-aware aligner such as STAR. HTSeq-count makes full use of the information in the CIGAR field.

To read from standard input, use - as <alignment_files>.

If you have paired-end data, pay attention to the -r option described below.

The <gff_file> contains the features in the GFF format.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

a4.HTSeq.md

a4.HTSeq.md

HTSeq

The following figure illustrates the effect of these three modes and the --nonunique option:

Files

a4.HTSeq.md

Latest commit

History

a4.HTSeq.md

File metadata and controls

HTSeq

The following figure illustrates the effect of these three modes and the --nonunique option: