BLAST Compressor

About

The output of a BLAST homology search can consume a lot of storage space. This script compresses the BLAST output while retaining all important information, including the placement of gaps in alignments.

This tool was made by Kirill Kryukov in the Saitou lab, NIG. I share it in the hope that it will be useful, but without any warranty.

The compressor was tested with the default pairwise output format (-outfmt 0) of blastn, blastx and tblastx from the BLAST+ 2.2.25 package. It will probably work with other versions too, but this has not been verified. Please report any issues or incompatibilities.

The space saving from using this compressor varies depending on the nature of the search (average alignment length, number of hits per database sequence, number of gaps, lengths of the database sequence names, etc.). Typically I see about a 20-fold reduction in output size with my searches.
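
To check the saving on your own searches, the two file sizes can simply be divided. The following is a minimal Python sketch, not part of the distributed scripts; the function name size_reduction is made up for this illustration.

```python
import os


def size_reduction(original_path, compressed_path):
    """Return how many times smaller the compressed file is than the original."""
    original_size = os.path.getsize(original_path)
    compressed_size = os.path.getsize(compressed_path)
    if compressed_size == 0:
        raise ValueError("compressed file is empty, ratio is undefined")
    return original_size / compressed_size
```

For example, size_reduction("blastout.txt", "compact.txt") returns the reduction factor for the file names used in the commands further down this page.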

News

2012-06-22 – Decompressor v.0.2.2: Now comment lines are dropped during decompression.

2012-04-13 – Added "-silent" option to the compressor.

2012-04-09 – Decompressor is updated to version 0.2.1. Its output is now more verbose and closer to BLAST format.

2012-02-14 – Version 0.2.1. Improved handling of multiple files.

2012-02-10 – Version 0.2.0. Compression strength improved, added proof-of-concept decompressor. Added ability to process multiple files without reloading the names.

2012-02-02 – Version 0.1.3 - compressed output always has unix-style end of line.

2012-01-31 – Version 0.1.2 - loading of database sequence names is improved.

2012-01-30 – Version 0.1.1 - loading of query sequence names is improved.

2012-01-27 – This page was created, version 0.1 was uploaded.

Download

(Distributed under the zlib/libpng license, see the source file for details)

How does it work?

Query and database sequence names are replaced by their indices, written as base-94 numbers (prefixed with a double and a single space, respectively). The properties of each hit are encoded and written as a single line, in the format:

<Frame><Aln-Length> <Identities> <Positives> <Gaps> <Bit-score> <Expect> <N-Parts> <Q-Start> <S-Start>[ <Q-Gaps> <S-Gaps>]

This format is easy to parse and process, and remains a text format. The included decompressor script demonstrates parsing and decompressing.
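
To illustrate the idea of base-94 indices and space-prefixed name lines, here is a small Python sketch. It rests on assumptions that this page does not confirm: that the 94 digits are the printable ASCII characters '!' (33) to '~' (126), and that the most significant digit comes first. The function names are made up for this illustration. For the exact encoding, consult the compressor and decompressor source.

```python
# ASSUMPTION: digits are the printable ASCII characters '!'..'~' in code order,
# most significant digit first. The real scripts may use another alphabet or order.
ALPHABET = "".join(chr(code) for code in range(33, 127))
BASE = len(ALPHABET)  # 94


def encode_base94(number):
    """Encode a non-negative integer as a base-94 string."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base94(text):
    """Decode a base-94 string back to an integer (ValueError on a bad character)."""
    value = 0
    for char in text:
        value = value * BASE + ALPHABET.index(char)
    return value


def classify_line(line):
    """Tell name-index lines from hit lines by their leading spaces.

    Double space: query index; single space: database index; otherwise a hit.
    """
    if line.startswith("  "):
        return "query", line[2:]
    if line.startswith(" "):
        return "database", line[1:]
    return "hit", line
```

Because no digit in this assumed alphabet is a space, the space character stays free to act as the field separator and as the line-type prefix, which is what keeps the format a plain text format.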

Note that you need to keep the names of the query and database sequences in order to restore them from the indices stored in the compressed form. If you need to reconstruct the alignments, you also have to keep the original query file and the BLAST database.
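
The reason the names must be kept is that an index only has meaning as a position in the list of names. The Python sketch below shows the lookup once an index has been decoded to an integer. It assumes that names are numbered in file order, and whether the first name has index 0 or 1 is not stated on this page, so that is left as the hypothetical parameter first_index. The function names are also made up for this illustration.

```python
def load_names(lines):
    """Collect FASTA-style header lines, in file order, without the leading '>'."""
    names = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            names.append(line[1:])
    return names


def name_for_index(names, index, first_index=0):
    """Return the name stored at a decoded index.

    ASSUMPTION: indices follow file order; first_index is a guess, not a documented value.
    """
    position = index - first_index
    if not 0 <= position < len(names):
        raise IndexError("index %d has no matching name" % index)
    return names[position]
```

load_names accepts any iterable of lines, so an open names file or an open FASTA query file can be passed directly.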

Preparing the database

Both the compressor and the decompressor need an additional file with the database sequence names. If you have a FASTA file for the database, you can produce such a file with this command:

grep ">" "db.fa" >db.names

If you only have the database in the form of a formatted BLAST database, you can use the following command instead:

blastdbcmd -db "db-path" -entry all -outfmt %f | grep ">" >db.names

The decompressor also needs a file with the database sequence lengths, which you can produce with the FASTA-Get-Seq-Length script.
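
The layout of the lengths file is defined by the FASTA-Get-Seq-Length script and is not described on this page, so use that script to produce the actual file. To cross-check the lengths themselves, they can be computed independently, as in the Python sketch below. The function name fasta_lengths is made up for this illustration, and the sketch makes no attempt to reproduce the file layout.

```python
def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            name = line[1:]
            length = 0
        elif name is not None:
            # Sequence may span several lines; blank lines add nothing.
            length += len(line)
    if name is not None:
        yield name, length
```

For example, list(fasta_lengths(open("db.fa"))) gives the lengths in the same order as the names in db.names.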

Using the compressor

perl blast_compressor.pl -query "query-path" -dbnames "db-names-path" <blastout.txt >compact.txt

You can also compress the BLAST output on the fly, as it is generated. To do so, simply append

| perl blast_compressor.pl -query "query-path" -dbnames "db-names-path" >compact.txt

to the end of your search command instead of specifying the output file.
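
If the search is launched from a script rather than typed in a shell, the same pipe can be set up with Python's subprocess module. This is a sketch under assumptions: blastn is used as the example search program, both blastn and perl are assumed to be on the PATH, blast_compressor.pl is assumed to be in the current directory, and the function names are made up for this illustration.

```python
import subprocess


def build_commands(query, db, dbnames, blast_program="blastn"):
    """Build argument lists for the search and for the compressor."""
    search = [blast_program, "-query", query, "-db", db, "-outfmt", "0"]
    compress = ["perl", "blast_compressor.pl", "-query", query, "-dbnames", dbnames]
    return search, compress


def search_and_compress(query, db, dbnames, out_path, blast_program="blastn"):
    """Run the search and pipe its output straight into the compressor."""
    search_cmd, compress_cmd = build_commands(query, db, dbnames, blast_program)
    with open(out_path, "wb") as out:
        search = subprocess.Popen(search_cmd, stdout=subprocess.PIPE)
        compress = subprocess.Popen(compress_cmd, stdin=search.stdout, stdout=out)
        # Drop our copy of the pipe so the search notices if the compressor exits early.
        search.stdout.close()
        compress_status = compress.wait()
        search_status = search.wait()
    if search_status != 0 or compress_status != 0:
        raise RuntimeError(
            "pipeline failed: search=%d, compressor=%d" % (search_status, compress_status)
        )
```

Calling search_and_compress("query.fa", "db-path", "db.names", "compact.txt") then never writes the uncompressed BLAST output to disk.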

To compress multiple files without reloading the database names every time:

perl blast_compressor.pl -query "query-path" -dbnames "db-names-path" -in blastout*.txt

To compress multiple files into a single compressed file:

perl blast_compressor.pl -query "query-path" -dbnames "db-names-path" -in blastout*.txt -out compact.txt

Using the decompressor

perl blast_decompressor.pl -query "query-path" -db "db-path" -dbnames "db-names-path" -dblengths "db-lengths-path" -blastdbcmd "blastdbcmd-path" <blastcompact.txt >blastout.decompressed.txt

Example output

" '~@ lA 3 = n.6 0.84 "h $W3 # $ ]p n#@ "f "p 5x 8-"& %- u5 K#? "g "n 5m 2-"" %1 u4 g#< "b "m 5I 1-w %: u4 c#B "\ "h 57 2-t %& u6 R#> "b "k 4r 2-o %5 u3 D#@ "[ "e 4o 1-n %- u5 g> ; ; 13.2 2-( $2 |< g> ; ; 13.2 2-( n |< R> 9 ; y.2 7-' o |; R> 9 ; y.2 7-' $3 |; K< : < w.3 3-& t |< K< : < w.3 3-& $8 |< nC 8 ? v.4 5-& $4 |+ nC 8 ? v.4 5-& p |+ D> 9 9 r.3 8-% $@ |. D7 6 6 o.6 0.9g | |C

If you have any questions, comments or suggestions, please contact me.


  © 2012 Kirill Kryukov
This page is available under the CC BY 3.0 License