Source: http://www.biostars.org/p/69584/
Question: How to automatically screen thousands of sequences using VecScreen
I looked at the NCBI VecScreen website (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html), and you can submit multiple sequences in FASTA format at a time. It allows you to download results, but there are several BLAST "hits" in the results. The downloaded results are not in the same format as displayed on the website, where each section of a sequence is marked as having strong, moderate, etc. similarity to a vector. Is there a way to download the results in that form, particularly the sections of the sequences that are possibly contaminated?
I'm really looking for a way to automate screening hundreds of thousands of sequences for vector contamination (and then cutting the sequences to remove the contamination.)
Any help is appreciated.
1 answer
Okay, I got vecscreen to work. The problem was that the app wasn't included in the FTP files that I downloaded from NCBI. I used subversion to get all the code and was able to find and build vecscreen. The text output can be used to clean the sequences. This could be done with Python and BioPython.
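As a rough illustration of the cleaning step mentioned above, here is a minimal Python sketch. It assumes the contaminated ranges have already been read from the vecscreen text output as 1-based, inclusive (start, end) pairs; parsing that output is not shown, because its exact layout is not given in this post. The function name `longest_clean_segment` is hypothetical, and plain strings are used instead of BioPython objects to keep the example self-contained.

```python
def longest_clean_segment(seq, hits):
    """Return the longest part of seq that no hit covers.

    hits: iterable of (start, end) pairs, assumed 1-based and inclusive.
    Returns an empty string if the whole sequence is covered.
    """
    n = len(seq)

    # Clamp every hit to the sequence and drop hits that fall outside it.
    spans = []
    for start, end in hits:
        start, end = max(1, start), min(n, end)
        if start <= end:
            spans.append((start, end))
    spans.sort()

    best_start, best_end = 0, 0  # 0-based, half-open
    cursor = 0                   # first position not yet covered (0-based)
    for start, end in spans:
        gap_end = start - 1      # the clean gap is seq[cursor:gap_end]
        if gap_end - cursor > best_end - best_start:
            best_start, best_end = cursor, gap_end
        cursor = max(cursor, end)

    # Clean tail after the last hit.
    if n - cursor > best_end - best_start:
        best_start, best_end = cursor, n

    return seq[best_start:best_end]
```

For example, `longest_clean_segment("AAAACCCCCCGG", [(1, 4), (11, 12)])` keeps only the middle `CCCCCC`. Keeping the single longest clean stretch is one possible policy; masking the hits or keeping every clean piece would be equally reasonable, depending on the data.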
Here is what I finally did to get vecscreen to compile. As I mentioned, for some reason it wasn't in the tarball from the FTP site, so I had to check it out with Subversion (svn). The NCBI toolbox user's manual for building is at http://www.ncbi.nlm.nih.gov/books/NBK7167/
With Linux:
1. Make sure g++ is installed (this could be different for different platforms), etc.
2. Use the following command to get the source:
> svn co http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/c++
3. From the compilers directory, run the compiler script (this is different for different platforms). This step could be unnecessary.
> ./GCC.sh
4. From the top-level directory of the checked-out files, configure and build:
> ./configure --with-flat-makefile
> cd GCC444-Debug/build
> make -f Makefile.flat $PROJECT_NAME
Here $PROJECT_NAME is app/vecscreen/. You also need app/blast/ and app/blastdb/.
5. Download the UniVec_Core FASTA file from ftp://ftp.ncbi.nih.gov/pub/UniVec/ (this has only non-mammalian vectors).
6. Make a local copy of the UniVec_Core database in the GCC444-Debug/bin directory with the command:
> ./makeblastdb -in UniVec_Core -dbtype nucl -out UniVec_Core.db
7. Use the vecscreen command (found in GCC444-Debug/bin/):
> ./vecscreen -db UniVec_Core.db -query $fasta_file -out $vecscreen_outfile -outfmt 0 -text_output
--------------------------------------------------------------------------------------------------------------------
[ Actual run ]
1. Install svn on the server
> yum install svn
2. Download the source files from the NCBI svn repository
> svn co http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/c++
3. Go into the directory for your platform and run the compiler script
> cd ./c++/compilers/unix
> ./GCC.sh
4. Then go back to the top-level directory, configure as shown below, and build
> cd ../..
> ./configure --with-flat-makefile
> cd GCC444-Debug/build
> make -f Makefile.flat ./app/vecscreen/
5. Next, set up the UniVec_Core database from NCBI (http://hallora.tistory.com/304)
6. Finally, run vecscreen
> cd ../bin/
> ./vecscreen -db [path]/UniVec_core -query [fastafile] -out [output] -outfmt 0 -text_output
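To screen many FASTA files with the final vecscreen command above, a small wrapper can loop over them. The following is a sketch, not a tested pipeline: the flags are exactly the ones used in this post, while the function names (`vecscreen_command`, `screen_directory`), the `*.fasta` file pattern, and the `.vecscreen.txt` output suffix are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def vecscreen_command(vecscreen, db, query, out):
    """Build the vecscreen call used in this post as an argument list."""
    return [
        str(vecscreen),
        "-db", str(db),
        "-query", str(query),
        "-out", str(out),
        "-outfmt", "0",
        "-text_output",
    ]


def screen_directory(fasta_dir, out_dir, db, vecscreen="./vecscreen"):
    """Run vecscreen once per *.fasta file in fasta_dir.

    Returns the list of output paths. Raises CalledProcessError
    if any vecscreen run exits with a non-zero status.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for query in sorted(Path(fasta_dir).glob("*.fasta")):
        out = out_dir / (query.stem + ".vecscreen.txt")
        subprocess.run(vecscreen_command(vecscreen, db, query, out), check=True)
        outputs.append(out)
    return outputs
```

A call such as `screen_directory("chunks", "results", "[path]/UniVec_core")`, run from the GCC444-Debug/bin directory, would write one text report per input file. Splitting a very large FASTA file into smaller chunks first keeps each run, and each report, at a manageable size.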