
Definition of TLEN

Bioinformatics/SAM/BAM 2013. 7. 23. 14:47

SAM spec: 

SAM1.pdf


9. TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.


: TLEN is the value Picard uses when it computes insert size, so it is often referred to simply as the insert size.
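The rule quoted from the spec can be sketched in a few lines of Python. This is only an illustration for a two-segment template: the function name `signed_tlen` and its arguments are hypothetical, and the coordinates are assumed to be 1-based, inclusive positions of the mapped bases on the same reference.

```python
def signed_tlen(seg_start, seg_end, mate_start, mate_end):
    """Signed TLEN for the segment at (seg_start, seg_end).

    All four values are 1-based, inclusive reference coordinates of the
    mapped bases; both segments are assumed to be on the same reference.
    """
    leftmost = min(seg_start, mate_start)
    rightmost = max(seg_end, mate_end)
    # Number of bases from the leftmost to the rightmost mapped base.
    length = rightmost - leftmost + 1
    # Plus sign for the leftmost segment, minus sign for the other one.
    # Assumption: the two start positions differ. The quoted text does not
    # say how to break a tie, and this sketch does not try to.
    return length if seg_start <= mate_start else -length


print(signed_tlen(100, 149, 300, 349))  # 250
print(signed_tlen(300, 349, 100, 149))  # -250
```

With a read at 100-149 and its mate at 300-349, the template spans 100-349, i.e. 250 bases, so the left read gets +250 and the right read gets -250.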


TLEN case.pptx




Posted by halloRa

Definition of FLAG

Bioinformatics/SAM/BAM 2013. 7. 23. 13:43

SAM spec: 

SAM1.pdf


Flag calculator: http://picard.sourceforge.net/explain-flags.html


FLAG: bitwise FLAG. Each bit is explained in the following table:

Bit Description

0x0 read aligned in the forward direction (0)

Source: http://seqanswers.com/forums/showthread.php?t=15280


0x1 template having multiple segments in sequencing (1)

: read paired


0x2 each segment properly aligned according to the aligner (2)

: read mapped in proper pair

: In other words, the paired reads are positioned in a proper-pair configuration. The flag can appear even when not both ends are mapped. However, even if the mate or the read itself carries the unmapped flag, it must still carry a strand flag; in that case the proper pair flag appears.

proper_pair_1.pdf

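=> In the current SAM spec the wording has changed to "each segment properly aligned according to the aligner".

Read literally, the flag is set whenever the aligner considers the segments aligned (whether they were actually mapped or not), so it seems to be added to the mapping result of every read. In practice the criteria are up to each aligner.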


0x4 segment unmapped (4)

: read unmapped


0x8 next segment in the template unmapped (8)

: mate unmapped


0x10 SEQ being reverse complemented (16)

: read aligned in the reverse direction


0x20 SEQ of the next segment in the template being reversed (32)

: mate reverse strand


0x40 the first segment in the template (64)

: first in pair

: read taken from the read1 file


0x80 the last segment in the template (128)

: second in pair

: read taken from the read2 file


0x100 secondary alignment (256)

: a read having split hits may have multiple primary alignment records

: can be seen when using tools that detect splice junctions

Source: http://bioinformatics.bc.edu/chuanglab/wiki/index.php/SAM_pairwise_flag_translation_table


0x200 not passing quality controls (512)

0x400 PCR or optical duplicate (1024)
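Since FLAG is a sum of the bits listed above, a value can be decoded by testing each bit in turn. The following Python sketch does roughly what the flag calculator linked above does; the names `FLAG_BITS` and `explain_flag` are made up for this example, and the short labels are the ones used in the notes above.

```python
# (bit, short label) pairs taken from the table above.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 0x40 + 0x20 + 0x2 + 0x1
for label in explain_flag(99):
    print(label)
```

A FLAG of 0 sets no bits at all, which is why it reads as "unpaired, mapped, forward strand" rather than being a bit of its own.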


-------------------------------------------------------------------------------------------------

[Another story]


Source: http://onetipperday.blogspot.kr/2012/04/understand-flag-code-of-sam-format.html

Source: http://seqanswers.com/forums/showthread.php?p=69643#post69643


Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAG field. So, for multiple mapping reads, SAM alignments also contain their strand information. 

By the way, for strand-specific RNAseq, the Tophat output SAM (converted from BAM) does not contain the XS:A:+/- tag (which is required by cufflinks) for non-spliced reads. To get the proper strand info for the assembly, you need to add the tags manually. Here is some example code:



samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if($2==16 || $2==272) print $0"\tXS:A:-"; if($2==0 || $2==256) print $0"\tXS:A:+";}}' > accepted_hits.sam

FLAG=256 and 272 are the counterparts of 0 and 16 for multiple-mapped hits (the same values with 0x100 added).

UPDATE: This only works for single-end libs. For paired-end libs the FLAG is an odd number (0x1 is set), but in any case reads on the minus strand always have 1 in the 5th bit of the binary code (e.g. 0x10 = 10000). Thanks to Wei for the suggestion. Here is the updated code:

samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x10)==16) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' > accepted_hits.sam
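For readers who prefer to see the bit test spelled out, here is a rough Python sketch of the same idea, applied to one SAM text line at a time. The function name `add_xs_tag` is hypothetical, and the sketch assumes the line is tab-separated with the FLAG in the second column, as in the SAM spec.

```python
def add_xs_tag(sam_line):
    """Append XS:A:+ or XS:A:- to a SAM line based on the 0x10 bit.

    Header lines (starting with '@') and lines that already carry an
    XS:A: tag are returned unchanged.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@") or "XS:A:" in line:
        return line
    flag = int(line.split("\t")[1])  # FLAG is the second column
    # Testing the bit (instead of comparing against 16 or 272) also works
    # for paired-end flags such as 83 or 147.
    strand = "-" if flag & 0x10 else "+"
    return f"{line}\tXS:A:{strand}"


record = "r1\t272\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
print(add_xs_tag(record))
```

As with the awk version, this only reflects the alignment strand of the read itself, so it is meaningful for single-end strand-specific data.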


UPDATE2 (from the final version of the RNAseq lecture slides):

- For a non-strand-specific lib, you're actually sequencing cDNA from both strands. So the 'strand' info in the alignment is meaningless (except for spliced reads).
- For a strand-specific lib, if FLAG contains 0x10 (e.g. 0x53 = 0x40 + 0x10 + 0x2 + 0x1), the read maps to the 'minus' strand; otherwise, to the 'plus' strand.
- Cufflinks requires the XS:A:+/- tag (which Tophat doesn't have for some reads). You need to add it manually with a command, e.g. for a dUTP library (where /2 is from the transcript strand and /1 from the opposite strand):
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x50)==0x40 || and($2,0x90)==0x90) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' > accepted_hits.sam
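The dUTP rule stated above (/2 from the transcript strand, /1 from the opposite strand) can be written out as a small truth table. The Python sketch below is only an illustration of that rule; the function name `dutp_transcript_strand` is made up, and returning `None` for unpaired reads is a choice of this sketch, not something the slides specify.

```python
def dutp_transcript_strand(flag):
    """Infer the transcript strand for a dUTP paired-end read from its FLAG.

    Read 2 (0x80) is taken to come from the transcript strand and read 1
    (0x40) from the opposite strand, as stated above.
    """
    reverse = bool(flag & 0x10)  # this read aligned to the reverse strand
    if flag & 0x40:  # first in pair: strand is flipped
        return "+" if reverse else "-"
    if flag & 0x80:  # second in pair: strand is kept
        return "-" if reverse else "+"
    return None  # not marked as read 1 or read 2; left undecided here


# Both reads of a pair should agree: 99/147 and 83/163 are typical pairs.
for flag in (99, 147, 83, 163):
    print(flag, dutp_transcript_strand(flag))
```

Checking that both mates of a pair give the same answer (99 with 147, 83 with 163) is a quick sanity test for any variant of this rule.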



 See my post here:

http://seqanswers.com/forums/showthread.php?p=69643#post69643

If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information about which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junction, because splice sites are highly conserved at the first 2 bases and last 2 bases of an intron. (Thomas Doktor)


# So, the strand info in the SAM alignment for non-strand-specific RNAseq lib cannot be used as evidence of transcript strand.

Posted by halloRa