FASTA Splitter

About

When sequence data is large, it often makes sense to analyze it in smaller chunks. This script divides a large FASTA file into a set of smaller files of approximately equal size. It works with whole sequences, never dividing a sequence in the middle.

News

2017-08-01 – Version 0.2.6:

Removed dependency on File::Util.

2016-08-31 – Version 0.2.5:

Added option '--nopad' to disable zero-padding of the part number in output file names.
Added option '--part-num-prefix TEXT' to improve configurability of output file names. The default TEXT is ".part-".


Download

Current version:

Version 0.2.6 (2017-08-01) (4 kB)

Old versions:

Version 0.1.1 (2012-03-02) (2 kB)

(Distributed under the zlib/libpng license, see the source file for details)

Example

fasta-splitter.pl --part-size 100 a.fa --nopad --measure seq --line-length 100 --out-dir out

Usage

Usage: fasta-splitter.pl [options] <file>...

Options:
    --n-parts <N>              - Divide into <N> parts
    --part-size <N>            - Divide into parts of size <N>
    --measure (all|seq|count)  - Specify whether all data, sequence length, or
                                 number of sequences is used for determining
                                 part sizes ('all' by default).
    --line-length              - Set output sequence line length, 0 for single
                                 line (default: 60).
    --eol (dos|mac|unix)       - Choose end-of-line character ('unix' by default).
    --part-num-prefix T        - Put T before part number in file names
                                 (def.: .part-)
    --out-dir                  - Specify output directory.
    --nopad                    - Don't pad part numbers with 0.
    --version                  - Show version.
    --help                     - Show help.

The script supports two strategies: dividing the input into a given number of parts (--n-parts <N>) and dividing it into parts of a given size (--part-size <N>).

It is possible to specify both --n-parts <N> and --part-size <M>. In that case the size of each part will not exceed <M>, and at most <N> parts will be written. This can be useful for extracting a few parts from the beginning of a large FASTA file without processing the whole file.
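
The combined behaviour can be pictured with a small Python sketch. This is an illustration only, not the script's actual Perl code: the function name split_by_size and its parameters are hypothetical, sequences are represented just by their sizes, and the handling of a single sequence larger than the limit (it gets a part of its own) is an assumption.

```python
def split_by_size(sizes, part_size, max_parts=None):
    """Group consecutive sequences into parts of at most part_size.

    sizes     -- iterable of per-sequence sizes (in whatever measure is used)
    max_parts -- optional cap on the number of parts (hypothetical stand-in
                 for combining --part-size with --n-parts)
    Returns a list of parts, each a list of sequence indices.
    """
    parts = []
    current, current_total = [], 0
    for index, size in enumerate(sizes):
        # Close the current part if this sequence would push it over the limit.
        if current and current_total + size > part_size:
            parts.append(current)
            if max_parts is not None and len(parts) == max_parts:
                # Stop early: the rest of the input is never looked at.
                return parts
            current, current_total = [], 0
        # Assumption: a sequence larger than part_size still goes into a
        # part by itself, because sequences are never cut.
        current.append(index)
        current_total += size
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    print(split_by_size([40, 50, 30, 60, 20], part_size=100))
    print(split_by_size([40, 50, 30, 60, 20], part_size=100, max_parts=2))
```

Because the function returns as soon as the requested number of parts is complete, the remaining sequences are never consumed, which mirrors the idea of extracting only the first few parts of a large file.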

The --measure option controls what is used to determine part sizes. With --measure count, the number of sequences delimits parts. With --measure seq, sequence length in base pairs is used. With --measure all, the total size in bytes is used (including sequence names and end-of-line characters).
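
To make the three measures concrete, here is a minimal Python sketch of how the size of one FASTA record could be computed under each of them. The function record_size and its arguments are hypothetical, and the exact byte accounting for 'all' (header line including '>', plus one end-of-line per line) is an assumption about what "total size" covers, not a description of the script's code.

```python
def record_size(header, sequence_lines, measure="all", eol_len=1):
    """Size of one FASTA record under a given measure (illustrative only).

    header         -- the name line, including the leading '>'
    sequence_lines -- list of sequence lines without line terminators
    eol_len        -- bytes per end-of-line (1 for unix/mac, 2 for dos)
    """
    if measure == "count":
        # Every sequence counts as one unit.
        return 1
    seq_len = sum(len(line) for line in sequence_lines)
    if measure == "seq":
        # Only the sequence letters count.
        return seq_len
    if measure == "all":
        # Header, sequence letters, and one end-of-line per line.
        n_lines = 1 + len(sequence_lines)
        return len(header) + seq_len + eol_len * n_lines
    raise ValueError("unknown measure: " + measure)


if __name__ == "__main__":
    for m in ("count", "seq", "all"):
        print(m, record_size(">seq1", ["ACGT", "AC"], m))
```

For the record above, 'count' gives 1, 'seq' gives 6, and 'all' gives 14 (5 header bytes, 6 sequence bytes, 3 line ends).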

Limitations

This script reads the input twice. This avoids using a lot of memory, which would be prohibitive with some huge input datasets.
When splitting into a given number of parts, the script loads sequences into memory one by one. This might be a problem if you have enormous sequences and a small amount of RAM.
This script does not cut sequences into parts. If you have a single huge sequence, this script won't help you partition it.
This script does not try to partition optimally (with part sizes as close as possible). In fact, it never reorders the sequences, so concatenating the parts in order should reproduce the original input file (with possible line length and line break differences).
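
The two-pass idea and the order-preserving property can be sketched in Python as follows. This is one plausible way to do it under stated assumptions, not the script's actual algorithm: the function split_into_n_parts is hypothetical, sequences are represented by their sizes, and the rule for where one part ends (when the running total reaches that part's share of the overall total) is an assumption.

```python
def split_into_n_parts(sizes, n_parts):
    """Assign consecutive sequences to n_parts parts in two passes.

    Pass 1 computes the total size; pass 2 walks the sequences in order
    and moves to the next part once the running total has reached the
    current part's share. Sequences are never reordered or cut.
    """
    total = sum(sizes)                      # pass 1: total size only
    parts = [[] for _ in range(n_parts)]
    part, done = 0, 0
    for index, size in enumerate(sizes):    # pass 2: assign in input order
        # Integer comparison of done / total against (part + 1) / n_parts.
        while part < n_parts - 1 and done * n_parts >= total * (part + 1):
            part += 1
        parts[part].append(index)
        done += size
    return parts


if __name__ == "__main__":
    parts = split_into_n_parts([50, 10, 10, 10], 2)
    print(parts)
    # Concatenating the parts gives back the original order.
    print([i for p in parts for i in p])
```

With sizes 50, 10, 10, 10 and two parts, the sketch produces parts of total size 50 and 30: not the closest possible split in general, but the original order is kept, so joining the parts restores the input sequence order.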

If you have any questions, comments or suggestions, please contact me.


© 2012-2017 Kirill Kryukov. This page is available under the CC BY 3.0 License.