FASTA Char Counter

About

FASTA Char Counter is a tool that counts characters in a FASTA file.

News

2014-05-20 – Page created, version 0.1.0 uploaded.

Download

Current version: 0.1.0, 2014-05-20

(Distributed under the zlib/libpng license, see the source file for details)

How to use it

fasta-char-counter --in <FASTA file> --out <charcount file>

Details

Character counts are often informative. For example, they show the GC-content and reveal the presence of 'N' or other ambiguous IUPAC codes. This tool counts the characters in all sequences of a FASTA file and prints the counts in a text format, such as:

> 198
A 862714308
C 599132272
G 599499760
M 1
N 242793800
R 2
T 863804600
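As an illustration of how such counts can be used, here is a minimal Python sketch that derives GC-content from the sample above. The function name `gc_content` and the choice to divide by unambiguous bases only (A, C, G, T, in either case) are assumptions made for this example; they are not part of the tool.

```python
def gc_content(counts):
    """Return the GC fraction among unambiguous bases, or None if there are none.

    `counts` maps a single character to its count. Ambiguity codes such as
    N, M and R are deliberately left out of the denominator (an assumption
    of this sketch, not something the tool decides for you).
    """
    def n(base):
        # Sum upper- and lower-case counts, since FASTA files may be soft-masked.
        return counts.get(base, 0) + counts.get(base.lower(), 0)

    gc = n("C") + n("G")
    total = gc + n("A") + n("T")
    return gc / total if total else None


# Counts copied from the sample output above.
sample_counts = {
    ">": 198,
    "A": 862714308,
    "C": 599132272,
    "G": 599499760,
    "M": 1,
    "N": 242793800,
    "R": 2,
    "T": 863804600,
}

if __name__ == "__main__":
    print(f"GC-content: {gc_content(sample_counts):.4f}")
```

Including N and the other ambiguity codes in the denominator would give a lower figure, so it is worth stating which definition is used when reporting the result.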

Naturally, only the sequences are scanned; the names are ignored. However, the number of sequences is reported as the count of '>' characters.

This program is not aware of Unicode. It simply counts printable ASCII (and extended ASCII) characters, that is, those with codes from 32 to 255. It ignores characters with codes below 32, including tabs and end-of-line characters.
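The counting rules described above can be sketched in Python as follows. This is only an illustration of the stated behaviour, not the tool's actual source: the names `count_fasta_chars` and `format_counts` are hypothetical, and treating every line that starts with '>' as a name line is an assumption.

```python
from collections import Counter


def count_fasta_chars(lines):
    """Count bytes with codes 32..255 in sequence lines of a FASTA file.

    `lines` is an iterable of bytes objects, e.g. a file opened in "rb" mode.
    Name lines (assumed to start with '>') are not scanned; each one only
    adds 1 to the '>' entry, which therefore equals the number of sequences.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(b">"):
            counts[ord(">")] += 1
            continue
        for byte in line:
            # Codes below 32 (tab, CR, LF, ...) are ignored.
            if byte >= 32:
                counts[byte] += 1
    return counts


def format_counts(counts):
    """Return one 'character count' line per byte value, ordered by code."""
    return [f"{chr(code)} {counts[code]}" for code in sorted(counts)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fasta:
        for row in format_counts(count_fasta_chars(fasta)):
            print(row)
```

Reading the file in binary mode keeps the sketch byte-oriented, matching the "not aware of Unicode" behaviour. Python integers do not overflow, so no special handling is needed for large counts here.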

Space (code 32) is included, and the corresponding output line begins with two spaces. Therefore, when parsing the output, don't expect every line to match /^\S\s+\d+/. Instead extract the first character and then parse the rest.
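The parsing advice above might look like this in Python. It is a minimal sketch that assumes each output line consists of one character, then whitespace, then a decimal count; the function names are hypothetical.

```python
def parse_charcount_line(line):
    """Split one output line into (character, count).

    The first character is taken as-is, even if it is a space, and only the
    remainder of the line is parsed as a number.
    """
    line = line.rstrip("\r\n")
    if len(line) < 2:
        raise ValueError(f"malformed charcount line: {line!r}")
    char, rest = line[0], line[1:]
    return char, int(rest.strip())


def parse_charcounts(lines):
    """Parse an iterable of output lines into a {character: count} dict."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip completely blank lines
        char, count = parse_charcount_line(line)
        counts[char] = count
    return counts


if __name__ == "__main__":
    example = ["  5\n", "> 198\n", "A 862714308\n"]
    print(parse_charcounts(example))
```

When reading a real output file that may contain extended ASCII characters, opening it with `encoding="latin-1"` maps each byte to exactly one character, so the first-character rule still holds.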

This program uses 64-bit numbers for all counting, so it should have no problems with very large data: a FASTA file can be larger than 4 GB, an individual sequence can be larger than 4 GB, and the count of any character can exceed 2^32.


  © 2014 Kirill Kryukov
This page is available under the CC BY 3.0 License