FASTA Splitter
About
When sequence data is large it often makes sense to analyze it in smaller chunks. This script divides a large FASTA file into a set of smaller, approximately equally sized files. It works with whole sequences, never dividing a sequence in the middle.
News
2024-09-13 – Version 0.2.7:
- Changed default line length to 100.
2017-08-01 – Version 0.2.6:
- Removed dependency on File::Util.
Download
Current version:
(Distributed under the zlib/libpng license, see the source file for details)
Usage
Usage: fasta-splitter.pl [options] <file>...
Options:
    --n-parts <N>              - Divide into <N> parts
    --part-size <N>            - Divide into parts of size <N>
    --measure (all|seq|count)  - Specify whether all data, sequence length, or
                                 number of sequences is used for determining part
                                 sizes ('all' by default).
    --line-length              - Set output sequence line length, 0 for single line
                                 (default: 100).
    --eol (dos|mac|unix)       - Choose end-of-line character ('unix' by default).
    --part-num-prefix T        - Put T before part number in file names (def.: .part-)
    --out-dir                  - Specify output directory.
    --nopad                    - Don't pad part numbers with 0.
    --version                  - Show version.
    --help                     - Show help.
The script supports two strategies: dividing into a given number of parts (--n-parts <N>) and dividing into parts of a given size (--part-size <N>).
It's possible to specify both --n-parts <N> and --part-size <M>. In that case the size of each part will not exceed <M>, and at most <N> parts will be written. This can be useful for extracting a few parts from the beginning of a large FASTA file without processing the whole file.
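The combination of a size limit and a cap on the number of parts can be pictured with a short Python sketch. This is only an illustration, not the script's actual code: the function name split_records and its arguments are hypothetical, records are represented by their sizes alone, and it assumes that a single record larger than the limit is placed in a part of its own (since sequences are never cut).

```python
def split_records(sizes, part_size, n_parts=None):
    """Group consecutive records into parts of at most part_size.

    sizes     -- iterable of record sizes, in input order
    part_size -- upper limit for the total size of one part
    n_parts   -- optional cap on the number of parts returned
    Returns a list of parts, each a list of record indices.
    """
    parts = []
    current, current_size = [], 0
    for index, size in enumerate(sizes):
        # Close the current part if this record would push it over the limit.
        if current and current_size + size > part_size:
            parts.append(current)
            if n_parts is not None and len(parts) == n_parts:
                # Enough parts collected: stop without reading the rest.
                return parts
            current, current_size = [], 0
        # Assumption: an oversized record still goes into a part by itself.
        current.append(index)
        current_size += size
    if current:
        parts.append(current)
    return parts


sizes = [40, 50, 30, 80, 20, 60]
print(split_records(sizes, 100))             # [[0, 1], [2], [3, 4], [5]]
print(split_records(sizes, 100, n_parts=2))  # [[0, 1], [2]]
```

Because the loop returns as soon as the requested number of parts is complete, the remaining records are never examined, which mirrors the "beginning of a large file" use case described above.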
The --measure option controls what is used to determine part sizes. With --measure count, only the number of sequences is used to delimit parts. With --measure seq, sequence length in base pairs is used. With --measure all, the total size in bytes is used (including sequence names and end-of-line characters).
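As a rough illustration of the three modes, the following Python sketch computes the size of a single FASTA record under each of them. The function record_size is hypothetical and not part of the script; it assumes single-byte characters and counts every line (name line and sequence lines) together with its end-of-line sequence for the 'all' mode.

```python
def record_size(header, seq_lines, measure="all", eol="\n"):
    """Size of one FASTA record under a given measure (illustrative only).

    header    -- the name line, including the leading '>'
    seq_lines -- list of sequence lines without line terminators
    """
    if measure == "count":
        # Every record counts as one, whatever its length.
        return 1
    if measure == "seq":
        # Only sequence characters are counted.
        return sum(len(line) for line in seq_lines)
    if measure == "all":
        # Name line plus sequence lines, each followed by an end-of-line.
        return sum(len(line) + len(eol) for line in [header, *seq_lines])
    raise ValueError(f"unknown measure: {measure}")


header, seq_lines = ">seq1 test", ["ACGT", "AC"]
print(record_size(header, seq_lines, "count"))  # 1
print(record_size(header, seq_lines, "seq"))    # 6
print(record_size(header, seq_lines, "all"))    # 19
```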
Limitations
- This script reads the input twice. This avoids using a lot of memory, which would be prohibitive with some huge input datasets.
- When splitting into a given number of parts, the script loads sequences into memory one by one. This might be a problem if you have enormous sequences and a small amount of RAM.
- This script does not cut sequences into parts. If you have a single huge sequence, this script won't help you to partition it.
- This script does not try to partition optimally (with part sizes as close to each other as possible). In fact it never reorders the sequences, so concatenating the parts in order should reproduce the original input file (with possible line length and line break differences).
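The two-pass, order-preserving idea from the list above can be sketched in Python. This is one possible rule chosen for illustration, not necessarily the one the script uses: the hypothetical function assign_parts first computes only the total size, then streams through the records again and assigns each one to a part according to where it starts within the total.

```python
def assign_parts(sizes, n_parts):
    """Assign each record to one of n_parts without reordering.

    sizes -- list of record sizes in input order. With a real file the
             sizes would be recomputed by reading the file a second time
             instead of being kept in a list.
    Returns a list with the part index (0-based) of each record.
    """
    total = sum(sizes)  # pass 1: only the total is kept in memory
    if total == 0:
        return [0] * len(sizes)
    part_of = []
    written = 0
    for size in sizes:  # pass 2: records are visited in original order
        # The part is chosen by the record's starting offset, so a record
        # is never cut and part indices never decrease.
        part = min(n_parts - 1, written * n_parts // total)
        part_of.append(part)
        written += size
    return part_of


print(assign_parts([40, 50, 30, 80, 20, 60], 3))  # [0, 0, 0, 1, 2, 2]
```

In this example the parts end up with sizes 120, 80 and 80: approximately equal, but not optimal, and the records stay in their original order.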
Please email any questions or comments to kkryukov@gmail.com.