'Bioinformatics'에 해당되는 글 9건

  1. 2014.03.13 Vecscreen
  2. 2014.01.23 vector trim을 위한 NCBI vecscreen 서버에 설치하기
  3. 2014.01.20 BLAST local db setting
  4. 2013.11.01 GWAS? TCGA? ENCODE?
  5. 2013.10.11 Fold Change
  6. 2013.09.09 FastX toolkit local install
  7. 2013.07.29 미토콘드리아 모계 유전
  8. 2013.07.23 TLEN 정의
  9. 2013.07.23 FLAG 정의

Vecscreen

Bioinformatics/New Tech 2014. 3. 13. 09:50

출처: http://www.ncbi.nlm.nih.gov/tools/vecscreen/about/


About VecScreen

VecScreen is a system that quickly finds segments of a nucleic acid sequence that may be of vector origin. It helps researchers identify and remove any segments of vector origin before they analyze or submit sequences. Researchers are encouraged to screen their sequences for vector contamination using the form on the VecScreen search page.

Failure to recognize foreign segments in a sequence can:

  • lead to erroneous conclusions about the biological significance of the sequence
  • waste time and effort in analysis of contaminated sequence
  • delay the release of the sequence in a public database
  • pollute public databases with contaminated sequence

GenBank Annotation Staff use VecScreen to verify that sequences submitted for inclusion in the database are free of vector contamination

VecScreen searches a query sequence for segments that match any sequence in UniVec, a specialized non-redundant vector database. The search uses BLAST with parameters preset for optimal detection of vector contamination. Those segments of the query that match vector sequences are categorized according to the strength of the match, and their locations are displayed (see an example of a positive result).

Although a VecScreen search against UniVec will not identify the vector that is the most likely source of the contamination (see UniVec Limitations), this can usually be deduced from the cloning history of the sequenced DNA (see Identifying the Foreign Sequence for more details).

Guidance on how to interpret positive VecScreen results and also on how to remove the foreign segment(s) from a contaminated sequence is available in Interpretation of VecScreen Results.

VecScreen Search Parameters

The sequence of any vector contamination should theoretically be identical to the known sequence of the vector. In practice, occasional differences are expected to arise from sequencing errors, and less frequently, from engineered variants or spontaneous mutations. The search parameters used for VecScreen have, therefore, been chosen to find sequence segments that are identical to known vector sequences or which deviate only slightly from the known sequence.

The blastn parameters used for VecScreen are significantly more stringent than the default blastn parameters. The principal differences are:

  • Increased penalty for mismatches
    • This severely limits the frequency of mismatches in alignments.
  • Gap penalties more tolerant of single base insertions or deletions
    • This accommodates the type of sequencing error that adds or omits a base.
  • Low complexity filtering only for initial hits
    • This prevents an alignment from being initiated in a low complexity region while allowing alignments that extend across regions of low complexity to be scored appropriately.

The VecScreen parameters are pre-set using blastn options-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12

VecScreen Match Categories

Vector contamination usually occurs at the beginning or end of a sequence; therefore, different criteria are applied for terminal and internal matches. VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence. Matches are categorized according to the expected frequency of an alignment with the same score occurring between random sequences.

Strong Match to Vector
(Expect 1 random match in 1,000,000 queries of length 350 kb.)
Terminal match with Score ≥ 24.
Internal match with Score ≥ 30.
Moderate Match to Vector
(Expect 1 random match in 1,000 queries of length 350 kb.)
Terminal match with Score 19 to 23.
Internal match with Score 25 to 29.
Weak Match to Vector
(Expect 1 random match in 40 queries of length 350 kb.)
Terminal match with Score 16 to 18.
Internal match with Score 23 to 24.
Segment of Suspect Origin
Any segment of fewer than 50 bases between two vector matches or between a match and an end.


-------------------------------------------------------------------------------------------------------------------

출처: https://gist.github.com/brantfaircloth/4325589


local에서 직접 blastn으로 univec db에 돌려볼 경우에는 아래와 같이 option 값들을 설정하여 돌리면 가능

--> 실제 실행 결과 vecscreen을 사용한 결과와는 다르게 나타난다.

--> But, 실제 위의 vecscreen 페이지의 설명대로면 아래와 같이 옵션 설정되는 것이 틀린 것은 아님.

blastn -task blastn -db UniVec_core -query test.fsa \
    -evalue 1 -gapopen 3 -gapextend 3 -word_size 11 \
    -reward 1 -penalty -5 -out blast.out -num_threads 4 \
    -dust yes -searchsp 1750000000000 -soft_masking true \
    -outfmt 6


'Bioinformatics > New Tech' 카테고리의 다른 글

vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
BLAST local db setting  (0) 2014.01.20
GWAS? TCGA? ENCODE?  (0) 2013.11.01
Fold Change  (0) 2013.10.11
FastX toolkit local install  (0) 2013.09.09
Posted by halloRa
,

출처: http://www.biostars.org/p/69584/


Question: How to automatically screen thousands of sequences using VecScreen

I looked at the NCBI Vecscreen website (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html), and you can put multiple sequences in fasta format in at a time. It allows you to download results, but there are several blast "hits" in the results. The downloaded results are not in the same format as displayed on the website where the website indicates a section of the sequence that has strong, medium, etc. similarity to a vector. Is there a way to download the results? Particularly the sections of the sequences that are possibly contaminated.

I'm really looking for a way to automate screening hundreds of thousands of sequences for vector contamination (and then cutting the sequences to remove the contamination.)

Any help is appreciated.


1 answer

Okay, I got vecscreen to work. The problem was that the app wasn't included in the FTP files that I downloaded from NCBI. I used subversion to get all the code and was able to find and build vecscreen. The text output can be used to clean the sequences. This could be done with Python and BioPython.

Here is what I finally did to get vecscreen to compile. As I mentioned, for some reason it wasn't in the tarball from the FTP site, so I had to check out with subversion (svn) NCBI toolbox users manual for building: http://www.ncbi.nlm.nih.gov/books/NBK7167/

With Linux ... Make sure G++ is installed (could be different for different platforms), etc. Use the following command to get the source: svn co http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/c++ From the compilers directory, do ./GCC.sh (this is different for different platforms) This step could be unnecessary From the top-level directory of the checked out files, do ./configure --with-flat-makefile cd GCC444-Debug/build make -f Makefile.flat $PROJECT_NAME (i.e. app/vecscreen/) Also need app/blast/ and app/blastdb/ Downloaded UniVec_Core fasta file from ftp://ftp.ncbi.nih.gov/pub/UniVec/ (This has only non-mammalian vectors) Make local copy of UniVec_Core database in the GCC444-Debug/bin directory with the command: ./makeblastdb -in UniVec_Core -dbtype nucl -out UniVec_Core.db Use the vecscreen command (found in GCC444-Debug/bin/) ./vecscreen -db UniVec_Core.db -query $fasta_file -out $vecscreen_outfile -outfmt 0 -text_output


--------------------------------------------------------------------------------------------------------------------

실제 실행 ]


1. 서버 단에 svn 설치

> yum install svn


2. svn을 통하여 NCBI FTP로부터 파일을 다운로드

> svn co http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/c++


3. 해당하는 플랫폼 디렉토리로 들어가서 컴파일

> cd ./c++/compiler/unix

> ./GCC.sh


4. 다음 다시 top level directory로 돌아가서 아래와 같이 configure

> cd ../..

./configure --with-flat-makefile

 cd GCC444-Debug/build

make -f Makefile.flat ./app/vecscreen/



5. 다음 NCBI로부터 Univec_core 디비 설정 (http://hallora.tistory.com/304)



6. 마지막으로 실행

> cd ../GCC444-Debug/bin/

> ./vecscreen -db [path]/UniVec_core -query [fastafile] -out [output] -outfmt 0 -text_output


'Bioinformatics > New Tech' 카테고리의 다른 글

Vecscreen  (0) 2014.03.13
BLAST local db setting  (0) 2014.01.20
GWAS? TCGA? ENCODE?  (0) 2013.11.01
Fold Change  (0) 2013.10.11
FastX toolkit local install  (0) 2013.09.09
Posted by halloRa
,

출처: http://seqanswers.com/forums/showthread.php?t=9452


vector trim을 위한 vector search를 위해 UniVec_Core 정보에 

blast를 돌리기 위하여 UniVec_Core 정보를 db화하여 blast에서 참조할 수 있도록 해야 하는데

이 때


makeblastdb -in UniVec_Core -dbtype nucl -out UniVec_core


와 같이 작성해서 돌리면 끝!


이 때 dbtype은 protein 정보일 경우 prot 이라고 적어주면 된다.


'Bioinformatics > New Tech' 카테고리의 다른 글

Vecscreen  (0) 2014.03.13
vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
GWAS? TCGA? ENCODE?  (0) 2013.11.01
Fold Change  (0) 2013.10.11
FastX toolkit local install  (0) 2013.09.09
Posted by halloRa
,

GWAS: Genome-Wide Association Study --> about SNP

http://gds.nih.gov/

GWAS 이야기: http://blog.daum.net/_blog/BlogTypeView.do?blogid=0MSQg&articleno=7530895&admin=


TCGA: The Cancer Genome Atlas --> about Cancer

http://cancergenome.nih.gov/


ENOCDE: the ENCyclopedia Of DNA Elements --> about epigenetic

http://genome.ucsc.edu/encode/

'Bioinformatics > New Tech' 카테고리의 다른 글

vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
BLAST local db setting  (0) 2014.01.20
Fold Change  (0) 2013.10.11
FastX toolkit local install  (0) 2013.09.09
미토콘드리아 모계 유전  (0) 2013.07.29
Posted by halloRa
,

Fold Change

Bioinformatics/New Tech 2013. 10. 11. 09:59

출처: http://en.wikipedia.org/wiki/Fold_change


Fold change is a measure describing how much a quantity changes going from an initial to a final value. For example, an initial value of 30 and a final value of 60 corresponds to a fold change of 2, or in common terms, a two-fold increase. Fold change is calculated simply as the ratio of the final value to the initial value[citation needed], i.e. if the initial value is A and final value is B, the fold change is B/A. As another example, a change from 80 to 20 would be a fold change of 0.25, while a change from 20 to 80 would be a fold change of 4. Some practitioners replace a fold-change value that is less than 1 by the negative of its inverse[citation needed], e.g. a change from 80 to 20 would be a fold change of -4 (or in common terms, a four-fold decrease).

A benefit of expressing a change as the ratio between an initial value and a final value -- a fold change -- is that the change itself is emphasized rather than the absolute values. This property makes the fold change suitable for statistical tests that need to normalize data to eliminate systematic error. The distributional fold change test is based upon this idea.

Fold change is often used in analysis of gene expression data in microarray and RNA-Seq experiments, for measuring change in the expression level of a gene. [1]

'Bioinformatics > New Tech' 카테고리의 다른 글

vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
BLAST local db setting  (0) 2014.01.20
GWAS? TCGA? ENCODE?  (0) 2013.11.01
FastX toolkit local install  (0) 2013.09.09
미토콘드리아 모계 유전  (0) 2013.07.29
Posted by halloRa
,

download: http://hannonlab.cshl.edu/fastx_toolkit/download.html


1. libgtextutils-0.6.1.tar.bz2 설치

> wget -r [tools download page link]

> bunzip2 libgtextutils-0.6.1.tar.bz2

> tar -xvf libgtextutils-0.6.1.tar



2. 해당 폴더로 가서 설치

> ./configure --prefix=/my/local/page/libgtextutils-0.6.1

> make

> make install



출처: http://hannonlab.cshl.edu/fastx_toolkit/pkg_config_email.txt

3. path 설정 

> pkg-config --cflags libgtextutils-0.6.1

If your file was installed into "/usr/local/lib..." but pkg-config shows
the following error:
  Package gtextutils-0.1 was not found in the pkg-config search path.
  Perhaps you should add the directory containing `gtextutils-0.1.pc'
  to the PKG_CONFIG_PATH environment variable
  No package 'gtextutils-0.1' found
export PKG_CONFIG_PATH=/my/local/page/libgtextutils-0.6.1:$PKG_CONFIG_PATH
> cp gtextutils.pc libgtextutils-0.6.1.pc
> pkg-config --cflags libgtextutils-0.6.1
-I /my/local/page/libgtextutils-0.6.1/include/gtextutils


4. fastx_toolkit-0.0.13.2.tar.bz2 설치

> wget -r [tools download page link]

> bunzip2 fastx_toolkit-0.0.13.2.tar.bz2

> tar -xvf fastx_toolkit-0.0.13.2.tar



5. 해당 폴더로 가서 설치

> ./configure --prefix=/my/local/page/fastx_toolkit-0.0.13.2

> make

> make install



6. /my/local/page/fastx_toolkit-0.0.13.2/bin 폴더에 fastx toolkit들이 설치된 것을 확인



---------------------------------------------------------------------------------------------------


fastx toolkit 의 fastq_quality_boxplot_graph.sh 명령어 실행 시

fastq_quality_boxplot_graph.sh: line 162: gnuplot: command not found 와 같은 오류 발생.


gnuplot 프로그램을 설치해야 함.


homepage: http://www.gnuplot.info/

download: http://sourceforge.net/projects/gnuplot/files/gnuplot/4.6.3/


1. 프로그램 다운로드

> wget -r [program]



2. 압축을 풀고 해당 폴더로 가서 설치

> tar -xvzf gnuplot-4.6.3.tar.gz

> cd gnuplot-4.6.3

> ./configure

> make

> sudo make install




'Bioinformatics > New Tech' 카테고리의 다른 글

vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
BLAST local db setting  (0) 2014.01.20
GWAS? TCGA? ENCODE?  (0) 2013.11.01
Fold Change  (0) 2013.10.11
미토콘드리아 모계 유전  (0) 2013.07.29
Posted by halloRa
,

출처: http://blog.naver.com/yhk1975?Redirect=Log&logNo=138680860



......핵에 있는 DNA 양에 비하면 미토콘드리아 DNA의 양은 약 1퍼센트 정도 밖에 되지 않지만 다른 고양이의 난자에 들어있던 미토콘드리아의 유전물질 때문에 두 고양이의 털 색깔이 달라진 것으로 유추해 볼 수 있다. 즉 CC (Copy Cat) 가 레인보우와 다른 유전 형질을 가진 것은 난자를 제공한 고양이의 형질을 받았다는 것을 의미한다. 이와 같이 난자에 DNA를 포함한 미토콘드리아가 있다는 것을 알 수 있다.


미토콘드리아가 오로지 모계 유전이 되는 것은 미토콘드리아가 핵이 아닌 세포질 속에 있기 때문이다


......난자도, 정자도 모두 미토콘드리아를 가지고 있다. 정자는 세포질이 없고 중편에 미토콘드리아가 밀집되어 있어 세포 호흡을 통해 정자가 운동하는데 필요한 에너지를 생산한다............난자의 세포막과 정자의 막이 융합하면서 정자의 머리 부분이 난자 속으로 침입하는데, 수정을 하면 정자가 가지고 들어온 미토콘드리아는 난자에 의해 모두 파괴되어 버린다.........정자에 있는 미토콘드리아는 파괴되고 난자의 세포질에 있는 미토콘드리아만이 남게 되기 때문에 수정란에 존재하는 것은 난자의 미토콘드리아가 된다. 즉 모든 생명체는 모계 미토콘드리아를 가지게 된다는 의미이다. 

'Bioinformatics > New Tech' 카테고리의 다른 글

vector trim을 위한 NCBI vecscreen 서버에 설치하기  (0) 2014.01.23
BLAST local db setting  (0) 2014.01.20
GWAS? TCGA? ENCODE?  (0) 2013.11.01
Fold Change  (0) 2013.10.11
FastX toolkit local install  (0) 2013.09.09
Posted by halloRa
,

TLEN 정의

Bioinformatics/SAM/BAM 2013. 7. 23. 14:47

SAM spec: 

SAM1.pdf


9. TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the

unsigned observed template length equals the number of bases from the leftmost mapped base

to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a

minus sign. The sign of segments in the middle is undened. It is set as 0 for single-segment

template or when the information is unavailable.


: TLEN은 picard에서 insert size를 계산할 때 사용하는 값으로 insert size로도 대표된다.


TLEN case.pptx




'Bioinformatics > SAM/BAM' 카테고리의 다른 글

FLAG 정의  (0) 2013.07.23
Posted by halloRa
,

FLAG 정의

Bioinformatics/SAM/BAM 2013. 7. 23. 13:43

SAM spec: 

SAM1.pdf


flag 계산기: http://picard.sourceforge.net/explain-flags.html


FLAG: bitwise FLAG. Each bit is explained in the following table:

Bit Description

0x0 read aligned in the forward direction (0)

출처: http://seqanswers.com/forums/showthread.php?t=15280


0x1 template having multiple segments in sequencing (1)

: read paired


0x2 each segment properly aligned according to the aligner (2)

: read mapped in proper pair

: 즉, proper pair의 형태로 pair read가 위치하고 있다는 것임. 굳이 양쪽이 모두 mapped되지 않아도 나타날 수 있는 flag. 하지만 mate나 혹은 자기자신이 unmapped flag를 가지고 있더라도 마찬가지로 strand flag를 가지고 있어야 한다. 이럴 경우에는 proper pair flag가 나타나게 된다.

proper_pair_1.pdf

=> 현재 SAM spec이 바뀌면서 each segment properly aligned according to the aligner 로 의미가 바뀜

이를 보면 단순히 aligner에 의해 align(맵핑이 되었든 안되었든)이 되면 해당 flag가 발생. 

무조건 모든 read들의 맵평 결과 flag로 추가되어 나타남을 알 수 있다.


0x4 segment unmapped (4)

: read unmapped


0x8 next segment in the template unmapped (8)

: mate unmapped


0x10 SEQ being reverse complemented (16)

: read aligned in the reverse direction


0x20 SEQ of the next segment in the template being reversed (32)

: mate reverse strand


0x40 the first segment in the template (64)

: first in pair

: read1 파일에서 추출된 read


0x80 the last segment in the template (128)

: second in pair

: read2 파일에서 추출된 read


0x100 secondary alignment (256)

: a read having split hits may have multiple primary alignment records

: splice junction을 찾아주는 도구들을 사용할 경우 볼 수 있음

출처: http://bioinformatics.bc.edu/chuanglab/wiki/index.php/SAM_pairwise_flag_translation_table


0x200 not passing quality controls (512)

0x400 PCR or optical duplicate (1024)


-------------------------------------------------------------------------------------------------

[또다른 이야기]


출처: http://onetipperday.blogspot.kr/2012/04/understand-flag-code-of-sam-format.html

출처: http://seqanswers.com/forums/showthread.php?p=69643#post69643


Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAG field. So, for multiple mapping reads, SAM alignments also contain their strand information. 

btw, For strand-specific RNAseq, the Tophat output SAM (converted from BAM) does not contain XS:A:+/- tag (which is required by cufflinks) for non-spliced reads. In order to get the proper strand info for the assembly, you need to manually add the tags. Here is an example code:



samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if($2==16 || $2==272) print $0"\tXS:A:-"; if($2==0 || $2==256) print $0"\tXS:A:+";}}' accepted_hits.sam

FLAG=256 and 272 is the corresponding version of 0 and 16 for multiple mapped hits.

UPDATE: This only works for single-end lib. For pair-end lib, the FLAG should be odd number, but in any case, reads on minus strand always have 1 on the 5th bit of binary code (e.g. 0x10 =10000). Thanks to Wei's suggestion on this. Here is the updated code:

samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x10)==16) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' accepted_hits.sam


UPDATE2: (from the  final version of RNAseq lecture slides)ØFor non-strand-specific lib, you’re actually sequencing cDNA from both strands. So, the ‘strand’ info in the alignment is senseless (except for the spliced-reads).

ØFor strand-specific lib, if FLAG contains 0x10 (e.g. 0x53=0x40+0x10+0x2+0x1), reads map to ‘minus’ strand,otherwise, ‘plus’ strand.
ØCufflinks requires XS:A:+/- tag (which Tophat doesnt have for some reads). You need to manually add it by command, e.g. for dUTP library (where /2 is from transcript strand, /1 from the opposite strand):
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x40) || and($2,0x90)) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' accepted_hits.sam



 See my post here:

http://seqanswers.com/forums/showthread.php?p=69643#post69643

If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junctions because splice site are highly conserved at the first 2 bases and last 2 bases of an intron. (Thomas Doktor)


# So, the strand info in the SAM alignment for non-strand-specific RNAseq lib cannot be used as evidence of transcript strand.

'Bioinformatics > SAM/BAM' 카테고리의 다른 글

TLEN 정의  (0) 2013.07.23
Posted by halloRa
,