BLAST SNP Extractor
About
When you want to check the genotype of particular genome at particular dbSNP site,
the easiest way is to BLAST the dbSNP sequence against the genome.
The resulting search output, however, it bulky and difficult to read.
The script presented here allows to quickly extract the SNP site from the BLAST output.
This tool was made by Kirill Kryukov, following the discussion with Drs. Tadashi Imanishi and Yoko Nagai.
It is shared with the hope that it can be useful, but without any warranties.
News
2013-09-05 – This page is created, version 0.1.0 of the script is uploaded.
Download
(Distributed under the zlib/libpng license, see the source file for details).
Usage
perl blast_snp_extractor.pl <input >output
Example
A typical dbSNP sequence may look like this:
Note that the SNP position is specified in the sequence name: "allelePos=501"
— this is the position that will be extracted by the script.
When you BLAST-search this sequence against a genome, you may see something like this:
BLASTN 2.2.28+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: GRCh37.fa
24 sequences; 3,095,677,412 total letters
Query= gnl|dbSNP|rs1815739|allelePos=501|totalLen=1001|taxid=9606|snpclass=
1|alleles='C/T'|mol=Genomic|build=138
Length=1001
Score E
Sequences producing significant alignments: (Bits) Value
gi|224384758|gb|CM000673.1| Homo sapiens chromosome 11, GRCh37 ... 1845 0.0
gi|224384764|gb|CM000667.1| Homo sapiens chromosome 5, GRCh37 p... 56.5 3e-05
gi|224384755|gb|CM000676.1| Homo sapiens chromosome 14, GRCh37 ... 52.8 4e-04
gi|224384763|gb|CM000668.1| Homo sapiens chromosome 6, GRCh37 p... 52.8 4e-04
gi|224384767|gb|CM000664.1| Homo sapiens chromosome 2, GRCh37 p... 52.8 4e-04
> gi|224384758|gb|CM000673.1| Homo sapiens chromosome 11, GRCh37
primary reference assembly
Length=135006516
Score = 1845 bits (999), Expect = 0.0
Identities = 1000/1001 (99%), Gaps = 0/1001 (0%)
Strand=Plus/Plus
Query 1 CAGGTCTGCAGGGCAGGCTTCTGACCCACTACGCCTCCCACCTCTAGCGGATGGAGAAGC 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327595 CAGGTCTGCAGGGCAGGCTTCTGACCCACTACGCCTCCCACCTCTAGCGGATGGAGAAGC 66327654
Query 61 TCCTGGAGACCATTGACCGGCTGCAACTGGAGTTTGCCCGGCGGGCCGCGCCCTTCAACA 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327655 TCCTGGAGACCATTGACCGGCTGCAACTGGAGTTTGCCCGGCGGGCCGCGCCCTTCAACA 66327714
Query 121 ACTGGCTGGATGGTGCCGTGGAGGACCTGCAGGACGTGTGGCTGGTACACTCTGTGGAGG 180
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327715 ACTGGCTGGATGGTGCCGTGGAGGACCTGCAGGACGTGTGGCTGGTACACTCTGTGGAGG 66327774
Query 181 AGACCCAGGTGGGTGCCAGGGTTGCAGGGGATGGATAGGATGACAGGAAAGCTGGCCCCA 240
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327775 AGACCCAGGTGGGTGCCAGGGTTGCAGGGGATGGATAGGATGACAGGAAAGCTGGCCCCA 66327834
Query 241 AATTCTGCCACCCACAACTTTAGGCTCCTGGGGCATAGGGATGGGAGGAAAACCCCAGTT 300
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327835 AATTCTGCCACCCACAACTTTAGGCTCCTGGGGCATAGGGATGGGAGGAAAACCCCAGTT 66327894
Query 301 CCCGAGTGCTGGGCTGGAAGACAGGAGGCCGGGGTTCTTGTGTCAGGACTGCCCAGGACT 360
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327895 CCCGAGTGCTGGGCTGGAAGACAGGAGGCCGGGGTTCTTGTGTCAGGACTGCCCAGGACT 66327954
Query 361 GGTGGGTGGCCTGGGGCACACTGCTGCCCTTTCTGTTGCCTGTGGTAAGTGGGGGACACC 420
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66327955 GGTGGGTGGCCTGGGGCACACTGCTGCCCTTTCTGTTGCCTGTGGTAAGTGGGGGACACC 66328014
Query 421 AGCTGACACTTCCTGCCTGTCGTCCCCAGAGCCTGCTGACAGCGCACGATCAGTTCAAGG 480
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328015 AGCTGACACTTCCTGCCTGTCGTCCCCAGAGCCTGCTGACAGCGCACGATCAGTTCAAGG 66328074
Query 481 CAACACTGCCCGAGGCTGACYGAGAGCGAGGTGCCATCATGGGCATCCAGGGTGAGATCC 540
|||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct 66328075 CAACACTGCCCGAGGCTGACTGAGAGCGAGGTGCCATCATGGGCATCCAGGGTGAGATCC 66328134
Query 541 AGAAGATCTGCCAGACGTATGGGCTGCGGCCCTGCTCCACCAATCCCTACATCACCCTCA 600
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328135 AGAAGATCTGCCAGACGTATGGGCTGCGGCCCTGCTCCACCAATCCCTACATCACCCTCA 66328194
Query 601 GCCCGCAGGACATCAACACCAAGTGGGATATGGTCAGTGCCACCTGCAGCCTTCCTCCCA 660
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328195 GCCCGCAGGACATCAACACCAAGTGGGATATGGTCAGTGCCACCTGCAGCCTTCCTCCCA 66328254
Query 661 CCCCCTCCTGCATACTGTGACCACCCTGAAATCTCGGGTGGCCCAAGATATGGAGAATAA 720
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328255 CCCCCTCCTGCATACTGTGACCACCCTGAAATCTCGGGTGGCCCAAGATATGGAGAATAA 66328314
Query 721 AGTCCATCTTCAGATGTGGGGTCTGCCACAGACTCCACGTGGGATTGGATAAATCGCCTT 780
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328315 AGTCCATCTTCAGATGTGGGGTCTGCCACAGACTCCACGTGGGATTGGATAAATCGCCTT 66328374
Query 781 GCCTGTCTCGGGCTCATCTGTATATGATGAGCTGTAACAGCAGCATTCTCCTGTCAGGAC 840
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328375 GCCTGTCTCGGGCTCATCTGTATATGATGAGCTGTAACAGCAGCATTCTCCTGTCAGGAC 66328434
Query 841 TGATGTGAAGTAGAATTAAGCCTTGTGTGTGGACATCTTTTATAAATCCACCCATATTAG 900
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328435 TGATGTGAAGTAGAATTAAGCCTTGTGTGTGGACATCTTTTATAAATCCACCCATATTAG 66328494
Query 901 TAAATGATGCCATCAGCCCCATTTTGCAGATGAGAAAACTGAAGCCCACATGGGTTAGTG 960
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 66328495 TAAATGATGCCATCAGCCCCATTTTGCAGATGAGAAAACTGAAGCCCACATGGGTTAGTG 66328554
Query 961 ACTTGCCCAAGAGCTTTCTCGTATATCCCAGACAAGTTGGC 1001
|||||||||||||||||||||||||||||||||||||||||
Sbjct 66328555 ACTTGCCCAAGAGCTTTCTCGTATATCCCAGACAAGTTGGC 66328595
> gi|224384764|gb|CM000667.1| Homo sapiens chromosome 5, GRCh37
primary reference assembly
Length=180915260
Score = 56.5 bits (30), Expect = 3e-05
Identities = 53/63 (84%), Gaps = 5/63 (8%)
Strand=Plus/Plus
Query 912 ATCAGCCCCATTTTGCAGATGAGAAAACTGAAGCC-CACATGG-G-TTAG-T-GACTTGC 966
|||| |||||||||||||||||||||||||||| || | || | |||| | |||||||
Sbjct 140890207 ATCATCCCCATTTTGCAGATGAGAAAACTGAAGTGGCATAAGGAGGTTAGGTAGACTTGC 140890266
Query 967 CCA 969
|||
Sbjct 140890267 CCA 140890269
> gi|224384755|gb|CM000676.1| Homo sapiens chromosome 14, GRCh37
primary reference assembly
Length=107349540
Score = 52.8 bits (28), Expect = 4e-04
Identities = 28/28 (100%), Gaps = 0/28 (0%)
Strand=Plus/Minus
Query 915 AGCCCCATTTTGCAGATGAGAAAACTGA 942
||||||||||||||||||||||||||||
Sbjct 97481672 AGCCCCATTTTGCAGATGAGAAAACTGA 97481645
> gi|224384763|gb|CM000668.1| Homo sapiens chromosome 6, GRCh37
primary reference assembly
Length=171115067
Score = 52.8 bits (28), Expect = 4e-04
Identities = 28/28 (100%), Gaps = 0/28 (0%)
Strand=Plus/Minus
Query 917 CCCCATTTTGCAGATGAGAAAACTGAAG 944
||||||||||||||||||||||||||||
Sbjct 15781064 CCCCATTTTGCAGATGAGAAAACTGAAG 15781037
> gi|224384767|gb|CM000664.1| Homo sapiens chromosome 2, GRCh37
primary reference assembly
Length=243199373
Score = 52.8 bits (28), Expect = 4e-04
Identities = 28/28 (100%), Gaps = 0/28 (0%)
Strand=Plus/Minus
Query 918 CCCATTTTGCAGATGAGAAAACTGAAGC 945
||||||||||||||||||||||||||||
Sbjct 195875306 CCCATTTTGCAGATGAGAAAACTGAAGC 195875279
Lambda K H
1.33 0.621 1.13
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 3005902067932
Database: GRCh37.fa
Posted date: Sep 3, 2013 11:57 AM
Number of letters in database: 3,095,677,412
Number of sequences in database: 24
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
Reading this raw BLAST output can take time, especially if you are trying to check multiple SNPs or genomes.
While all you need is just the nucleotide aligned to 'Y' in the query ('T' in this case).
Therefore you can simply run the script, obtaining the following output:
gnl|dbSNP|rs1815739|allelePos=501|totalLen=1001|taxid=9606|snpclass= 1|alleles='C/T'|mol=Genomic|build=138
gi|224384758|gb|CM000673.1| Homo sapiens chromosome 11, GRCh37 primary reference assembly
Score: 1845, E: 0.0, Q: 1..1001, Nuc: Y - T
The output is organized as following. All lines with no indentaion are query names.
All lines indented with 4 spaces are database sequence names.
Each line indented with 8 spaces represents a BLAST hit,
and contains the hit summary, including score, e-value, query coverage and the nucleotide at the SNP site in both sequences.
Contact
Please contact Kirill
if you have any questions, comments or suggestions.