VecScreen

Bioinformatics/New Tech 2014. 3. 13. 09:50

Source: http://www.ncbi.nlm.nih.gov/tools/vecscreen/about/


About VecScreen

VecScreen is a system that quickly finds segments of a nucleic acid sequence that may be of vector origin. It helps researchers identify and remove any segments of vector origin before they analyze or submit sequences. Researchers are encouraged to screen their sequences for vector contamination using the form on the VecScreen search page.

Failure to recognize foreign segments in a sequence can:

  • lead to erroneous conclusions about the biological significance of the sequence
  • waste time and effort in analysis of contaminated sequence
  • delay the release of the sequence in a public database
  • pollute public databases with contaminated sequence

GenBank Annotation Staff use VecScreen to verify that sequences submitted for inclusion in the database are free of vector contamination.

VecScreen searches a query sequence for segments that match any sequence in UniVec, a specialized non-redundant vector database. The search uses BLAST with parameters preset for optimal detection of vector contamination. Those segments of the query that match vector sequences are categorized according to the strength of the match, and their locations are displayed (see an example of a positive result).

Although a VecScreen search against UniVec will not identify the vector that is the most likely source of the contamination (see UniVec Limitations), this can usually be deduced from the cloning history of the sequenced DNA (see Identifying the Foreign Sequence for more details).

Guidance on how to interpret positive VecScreen results and also on how to remove the foreign segment(s) from a contaminated sequence is available in Interpretation of VecScreen Results.

VecScreen Search Parameters

The sequence of any vector contamination should theoretically be identical to the known sequence of the vector. In practice, occasional differences are expected to arise from sequencing errors, and less frequently, from engineered variants or spontaneous mutations. The search parameters used for VecScreen have, therefore, been chosen to find sequence segments that are identical to known vector sequences or which deviate only slightly from the known sequence.

The blastn parameters used for VecScreen are significantly more stringent than the default blastn parameters. The principal differences are:

  • Increased penalty for mismatches
    • This severely limits the frequency of mismatches in alignments.
  • Gap penalties more tolerant of single base insertions or deletions
    • This accommodates the type of sequencing error that adds or omits a base.
  • Low complexity filtering only for initial hits
    • This prevents an alignment from being initiated in a low complexity region while allowing alignments that extend across regions of low complexity to be scored appropriately.
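
To make the effect of these penalties concrete, here is a minimal Python sketch of the raw alignment score under the preset values quoted in this section (-q -5, -G 3, -E 3). The function name is hypothetical, the match reward of 1 is an assumption (the option string does not set it, though the local blastn command later in this post passes -reward 1), and the gap cost follows the usual affine convention of open + extend × length.

```python
def raw_score(matches, mismatches, gap_lengths=(),
              reward=1, penalty=-5, gap_open=3, gap_extend=3):
    """Raw score of an alignment under VecScreen-style penalties.

    Assumptions: reward=1 is not stated in the VecScreen option string,
    and each gap of length n costs gap_open + gap_extend * n.
    """
    gap_cost = sum(gap_open + gap_extend * n for n in gap_lengths)
    return matches * reward + mismatches * penalty - gap_cost


# A perfect 30-base match, the same region with one mismatch,
# and the same region with one single-base gap instead.
perfect = raw_score(30, 0)
one_mismatch = raw_score(29, 1)
one_gap = raw_score(29, 0, gap_lengths=(1,))
```

Under these assumptions a single mismatch costs its column plus 5 points, and a single-base insertion or deletion costs 6 points, so neither can occur often in an alignment that still scores well.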

The VecScreen parameters are preset using the blastn options: -q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12
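
The option string above uses legacy blastall-style flags. As a rough sketch, the Python below translates it into BLAST+ style options. The flag correspondence reflects my understanding of the two command-line interfaces rather than anything stated on the VecScreen page, and the function name is hypothetical.

```python
import shlex

# Assumed one-to-one correspondences between legacy and BLAST+ flags.
SIMPLE = {
    "-q": "-penalty",    # mismatch penalty
    "-G": "-gapopen",    # gap opening cost
    "-E": "-gapextend",  # gap extension cost
    "-e": "-evalue",     # expectation value threshold
}


def to_blast_plus(legacy_options):
    """Translate a legacy option string into a list of BLAST+ arguments."""
    tokens = shlex.split(legacy_options)
    args = []
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if flag in SIMPLE:
            args += [SIMPLE[flag], value]
        elif flag == "-Y":
            # Effective search space; BLAST+ expects a plain integer.
            args += ["-searchsp", str(int(float(value)))]
        elif flag == "-F" and value == "m D":
            # DUST filtering applied only when seeding hits (soft masking).
            args += ["-dust", "yes", "-soft_masking", "true"]
        else:
            raise ValueError(f"no known translation for {flag} {value!r}")
    return args


args = to_blast_plus('-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12')
```

Unknown flags raise an error instead of being passed through, since a silently dropped option would change the search.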

VecScreen Match Categories

Vector contamination usually occurs at the beginning or end of a sequence; therefore, different criteria are applied for terminal and internal matches. VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence. Matches are categorized according to the expected frequency of an alignment with the same score occurring between random sequences.
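
The terminal/internal rule can be sketched in a few lines of Python. The function name is hypothetical, coordinates are assumed to be 1-based and inclusive, and the exact boundary handling (whether base 25 itself counts) is my reading of "within 25 bases", not something the page spells out.

```python
TERMINAL_WINDOW = 25  # bases from either end of the query


def is_terminal(q_start, q_end, query_len, window=TERMINAL_WINDOW):
    """Return True if a match touches either terminal window of the query.

    Assumes 1-based inclusive coordinates with q_start <= q_end.
    """
    starts_near_beginning = q_start <= window
    stops_near_end = q_end > query_len - window
    return starts_near_beginning or stops_near_end


# A match covering bases 450-480 of a 500-base query stops
# within the last 25 bases, so it is treated as terminal.
terminal = is_terminal(450, 480, 500)
```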

Strong Match to Vector
(Expect 1 random match in 1,000,000 queries of length 350 kb.)
  • Terminal match with Score ≥ 24.
  • Internal match with Score ≥ 30.

Moderate Match to Vector
(Expect 1 random match in 1,000 queries of length 350 kb.)
  • Terminal match with Score 19 to 23.
  • Internal match with Score 25 to 29.

Weak Match to Vector
(Expect 1 random match in 40 queries of length 350 kb.)
  • Terminal match with Score 16 to 18.
  • Internal match with Score 23 to 24.

Segment of Suspect Origin
  • Any segment of fewer than 50 bases between two vector matches or between a match and an end.
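
The category thresholds above translate directly into code. The following Python sketch is not VecScreen's own implementation. The function names are hypothetical, match coordinates are assumed to be 1-based and inclusive, and a score below the weakest threshold is reported as no match.

```python
def classify(score, terminal):
    """Map an alignment score to a VecScreen match category."""
    strong, moderate, weak = (24, 19, 16) if terminal else (30, 25, 23)
    if score >= strong:
        return "Strong"
    if score >= moderate:
        return "Moderate"
    if score >= weak:
        return "Weak"
    return None  # below every threshold


def suspect_segments(matches, query_len, max_len=50):
    """Find short unmatched stretches next to vector matches.

    matches: list of (start, end) query coordinates of vector matches.
    Returns segments of fewer than max_len bases that lie between two
    matches or between a match and either end of the query.
    """
    segments = []
    prev_end = 0  # end of the region covered so far
    for start, end in sorted(matches):
        gap = start - prev_end - 1
        if 0 < gap < max_len:
            segments.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    tail = query_len - prev_end
    if matches and 0 < tail < max_len:
        segments.append((prev_end + 1, query_len))
    return segments


category = classify(24, terminal=True)       # meets the terminal Strong cutoff
segments = suspect_segments([(10, 60), (80, 120)], 500)
```

Note that the same score can land in different categories depending on position: 24 is Strong for a terminal match but only Weak for an internal one.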


-------------------------------------------------------------------------------------------------------------------

Source: https://gist.github.com/brantfaircloth/4325589


To run blastn directly against the UniVec database locally, set the options as shown below.

--> In practice, the results differ from those produced by VecScreen itself.

--> However, judging by the description on the VecScreen page above, these option settings are not wrong.

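# Local approximation of the VecScreen search against UniVec_core.
# Note: -evalue 1 is stricter than the -e 700 quoted on the VecScreen page,
# so lower-scoring hits may not be reported.
blastn -task blastn -db UniVec_core -query test.fsa \
    -evalue 1 -gapopen 3 -gapextend 3 -word_size 11 \
    -reward 1 -penalty -5 -out blast.out -num_threads 4 \
    -dust yes -searchsp 1750000000000 -soft_masking true \
    -outfmt 6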
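
The command above writes tabular output (-outfmt 6) to blast.out. A minimal Python sketch for reading that file follows. It assumes the default twelve-column layout of this format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the `Hit` type and function name are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    q_start: int
    q_end: int
    evalue: float
    bitscore: float


def parse_outfmt6(text):
    """Parse default-column BLAST tabular output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], int(f[6]), int(f[7]),
                        float(f[10]), float(f[11])))
    return hits


# Illustrative line only, not real BLAST output.
example = "read1\tvectorA\t100.00\t30\t0\t0\t1\t30\t120\t149\t1e-10\t60.0"
hits = parse_outfmt6(example)
```

The last default column is the bit score. If the VecScreen category thresholds refer to raw scores, which I believe but have not confirmed, the raw score can be requested by running blastn with -outfmt "6 std score" and reading the extra thirteenth column.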


Posted by halloRa