SAM spec:
SAM1.pdf
Flag calculator: http://picard.sourceforge.net/explain-flags.html
FLAG: bitwise FLAG. Each bit is explained in the following table:
Bit Description
0x0 read aligned in the forward direction (0)
Source: http://seqanswers.com/forums/showthread.php?t=15280
0x1 template having multiple segments in sequencing (1)
: read paired
0x2 each segment properly aligned according to the aligner (2)
: read mapped in proper pair
: That is, the paired reads lie in a proper-pair arrangement. The flag can appear even when the two ends are not both mapped. However, even if the mate or the read itself carries the unmapped flag, it must still carry a strand flag; in that case the proper pair flag is set.
proper_pair_1.pdf
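=> The SAM spec has since changed, and the meaning is now "each segment properly aligned according to the aligner".
Read this way, the flag is set simply when the aligner has aligned the read (whether it ended up mapped or not).
It then shows up as part of the FLAG in the mapping result of every read. Since the criterion is left to the aligner, this behaviour can differ between aligners.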
0x4 segment unmapped (4)
: read unmapped
0x8 next segment in the template unmapped (8)
: mate unmapped
0x10 SEQ being reverse complemented (16)
: read aligned in the reverse direction
0x20 SEQ of the next segment in the template being reversed (32)
: mate reverse strand
0x40 the first segment in the template (64)
: first in pair
: read taken from the read1 file
0x80 the last segment in the template (128)
: second in pair
: read taken from the read2 file
0x100 secondary alignment (256)
: a read having split hits may have multiple primary alignment records
: can be seen when using tools that detect splice junctions
Source: http://bioinformatics.bc.edu/chuanglab/wiki/index.php/SAM_pairwise_flag_translation_table
0x200 not passing quality controls (512)
0x400 PCR or optical duplicate (1024)
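The table above can be turned into a small decoder, similar in spirit to the flag calculator linked at the top. This is only a minimal Python sketch: the names `FLAG_BITS` and `explain_flag` are made up for illustration and do not come from any SAM library.

```python
# Bit values and short descriptions, copied from the table above.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Return the descriptions of every bit that is set in a SAM FLAG value."""
    return [desc for bit, desc in FLAG_BITS if flag & bit]


# 99 = 0x40 + 0x20 + 0x2 + 0x1
print(explain_flag(99))
```

A FLAG of 0 returns an empty list, which corresponds to the "0x0" row: an unpaired, mapped read on the forward strand.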
-------------------------------------------------------------------------------------------------
[Another topic]
Source: http://onetipperday.blogspot.kr/2012/04/understand-flag-code-of-sam-format.html
Source: http://seqanswers.com/forums/showthread.php?p=69643#post69643
Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAG field. So, for multiple mapping reads, SAM alignments also contain their strand information.
By the way, for strand-specific RNAseq, the TopHat output SAM (converted from BAM) does not contain the XS:A:+/- tag (which Cufflinks requires) for non-spliced reads. To get the proper strand info for the assembly, you need to add the tags manually. Here is an example command:
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if($2==16 || $2==272) print $0"\tXS:A:-"; if($2==0 || $2==256) print $0"\tXS:A:+";}}' > accepted_hits.sam
FLAG values 256 and 272 are the counterparts of 0 and 16 for multiply mapped hits.
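The relationship between these four values is just the 0x100 (secondary) bit being set or cleared. A short Python sketch makes this visible; the function name `primary_equivalent` is made up for illustration.

```python
SECONDARY = 0x100  # 256, the 'secondary alignment' bit


def primary_equivalent(flag):
    """Clear the secondary bit and leave every other FLAG bit untouched."""
    return flag & ~SECONDARY


for flag in (0, 16, 256, 272):
    print(flag, "->", primary_equivalent(flag))
```

With the secondary bit cleared, 256 becomes 0 and 272 becomes 16, while 0 and 16 stay as they are.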
UPDATE: Matching on exact FLAG values only works for single-end libraries. For paired-end libraries the FLAG is an odd number (0x1 is set), but in any case reads on the minus strand always have 1 at the 5th bit of the binary code (0x10 = binary 10000). Thanks to Wei for the suggestion. Here is the updated code (and() is a gawk function, so gawk is needed):
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x10)==16) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' > accepted_hits.sam
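For readers without gawk, the same 0x10 bit test can be sketched in Python. This is an illustration, not the blog author's code: `add_xs_tag` is a made-up helper name and the two records are invented. Like the awk command, it does not treat unmapped reads specially.

```python
REVERSE = 0x10  # "SEQ being reverse complemented"


def add_xs_tag(sam_line):
    """Append XS:A:- or XS:A:+ to one SAM line, mirroring the awk logic."""
    line = sam_line.rstrip("\n")
    # Header lines and records that already carry an XS tag pass through.
    if line.startswith("@") or "XS:A:" in line:
        return line
    flag = int(line.split("\t")[1])  # FLAG is the second column
    strand = "-" if flag & REVERSE else "+"
    return line + "\tXS:A:" + strand


# Made-up records for illustration only.
fwd = "r1\t99\tchr1\t100\t50\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII"
rev = "r1\t147\tchr1\t200\t50\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII"
print(add_xs_tag(fwd).split("\t")[-1])
print(add_xs_tag(rev).split("\t")[-1])
```

FLAG 99 has no 0x10 bit and gets XS:A:+, while FLAG 147 (128+16+2+1) has it and gets XS:A:-.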
UPDATE2: (from the final version of the RNAseq lecture slides)
- For a non-strand-specific lib, you’re actually sequencing cDNA from both strands. So the ‘strand’ info in the alignment is meaningless (except for spliced reads).
- For a strand-specific lib, if FLAG contains 0x10 (e.g. 0x53 = 0x40+0x10+0x2+0x1), the read maps to the ‘minus’ strand; otherwise, to the ‘plus’ strand.
- Cufflinks requires the XS:A:+/- tag (which TopHat doesn’t add for some reads). You need to add it manually with a command, e.g. for a dUTP library (where /2 is from the transcript strand, /1 from the opposite strand), so that XS:A:- is assigned when /1 aligns forward (0x40 set, 0x10 clear) or /2 aligns in reverse (0x80 and 0x10 both set):
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x50)==0x40 || and($2,0x90)==0x90) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' > accepted_hits.sam
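The dUTP rule stated in the parenthesis (/2 from the transcript strand, /1 from the opposite strand) can be written out explicitly. The following Python sketch is one reading of that rule, not code from the slides; `transcript_strand_dutp` is a made-up name, and returning None for reads without a /1 or /2 bit is an assumption of this sketch.

```python
REVERSE = 0x10  # SEQ being reverse complemented
READ1 = 0x40    # first segment in the template (/1)
READ2 = 0x80    # last segment in the template (/2)


def transcript_strand_dutp(flag):
    """Infer the transcript strand of a dUTP paired-end read from its FLAG.

    Assumes /2 has the orientation of the transcript and /1 the opposite.
    Returns None when neither the /1 nor the /2 bit is set.
    """
    is_reverse = bool(flag & REVERSE)
    if flag & READ1:
        # /1 is antisense: the transcript is on the other strand.
        return "+" if is_reverse else "-"
    if flag & READ2:
        # /2 is sense: the transcript is on the strand the read aligns to.
        return "-" if is_reverse else "+"
    return None


for flag in (99, 147, 83, 163):
    print(flag, transcript_strand_dutp(flag))
```

Both mates of a proper pair come out on the same transcript strand: 99 and 147 give "-", 83 and 163 give "+".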
See my post here:
http://seqanswers.com/forums/showthread.php?p=69643#post69643
If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA, and you don't have any information about which of the two strands corresponds to the original mRNA strand. The strand can be inferred when a read spans a splice junction, because splice sites are highly conserved at the first 2 bases and last 2 bases of an intron. (Thomas Doktor)
# So, the strand info in the SAM alignment for non-strand-specific RNAseq lib cannot be used as evidence of transcript strand.