Nucleotide Sequence Clusterizer


Nucleotide Sequence Clusterizer is a tool for clustering DNA sequences using only their initial fixed-length fragment.


2015-04-17 – Version 0.0.7: Cosmetic fixes, silenced compiler warnings, Linux binary.

2014-09-10 – Version 0.0.6: Previously, sequences with duplicate prefixes were removed (only one sequence was kept from each set of sequences sharing the same prefix). Now the clusterizer keeps all sequences by default. Removing duplicates is optional and is enabled by the "-p" or "--remove-duplicate-barcodes" switch.

2014-07-22 – Version 0.0.4 adds the output of all clusters (instead of just counting them).

2014-04-16 – Version 0.0.2 improves performance.

2014-04-06 – This page is created, version 0.0.1 is uploaded.


Current version: 0.0.7, 2015-04-17

(Distributed under the zlib/libpng license, see LICENSE.txt for details)

What does it do?

This tool implements a form of bounded single-linkage clustering. Initially each sequence is in a cluster of its own. Whenever any two sequences differ from each other by no more than D substitutions, their clusters are combined. The process stops when there are no more clusters to combine. The final number of clusters is then reported, and all clusters are written to the output file (specified with the '--out' option).
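The clustering rule described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: it compares every pair of sequences by brute force and merges clusters with a union-find structure, and it assumes all sequences have the same length. The names `hamming_within` and `cluster` are hypothetical and chosen only for this example.

```python
from itertools import combinations


def hamming_within(a, b, d):
    """Return True if equal-length strings a and b differ at no more than d positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > d:
                return False
    return True


def cluster(sequences, d):
    """Group sequences so that any two within d substitutions share a cluster.

    Brute-force illustration (all pairs are compared); assumes equal-length input.
    """
    parent = list(range(len(sequences)))  # each sequence starts in its own cluster

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        if hamming_within(sequences[i], sequences[j], d):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # combine the two clusters

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


if __name__ == "__main__":
    groups = cluster(["AAAA", "AAAT", "AATT", "GGGG"], 1)
    print(len(groups), groups)
```

Note that "AAAA" and "AATT" differ by two substitutions, yet with D = 1 they end up in the same cluster because each is within one substitution of "AAAT". This chaining is what makes the procedure single-linkage.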

How to use it

Usage: nucleotide-sequence-clusterizer [<options>]

Options:
    -i, --in FILE                     - Specify input FASTA file.
    -o, --out FILE                    - Specify output file.
    -d, --distance D                  - Cluster sequences separated by D or fewer substitutions.
    -t, --template T                  - Use only sequences matching template T.
    -p, --remove-duplicate-barcodes   - Use only one sequence out of all sharing same barcode.
    -v, --version                     - Show version.
    -h, --help                        - Show help.

The input is a set of sequences in FASTA format. The input can be read from a file or from standard input. Each sequence should contain only [ATCG]; sequences containing any other characters are ignored.
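The input filtering described above could look roughly like the following Python sketch. It is only an illustration, not the tool's own parser: the function names `read_fasta` and `valid_records` are hypothetical, and the sketch assumes that only uppercase A, T, C and G count as valid (how the tool treats lowercase letters is not stated here).

```python
import io

VALID = set("ATCG")  # assumption: uppercase letters only


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


def valid_records(handle):
    """Yield only records whose sequence consists entirely of A, T, C, G."""
    for header, seq in read_fasta(handle):
        if seq and set(seq) <= VALID:
            yield header, seq


if __name__ == "__main__":
    data = io.StringIO(">a\nACGT\n>b\nACNT\n>c\nAC\nGT\n")
    for header, seq in valid_records(data):
        print(header, seq)
```

Because `valid_records` accepts any iterable of text lines, the same function works with an open file or with `sys.stdin`.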

The template is a string of [ATCGN.] ('.' and 'N' are equivalent, and each represents any nucleotide). Only sequences matching the template are used for clustering. Note that the template also determines the sequence length. To use all sequences of a certain length, simply provide a stretch of 'N' of that length as the template.
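As a rough illustration of template matching, here is a small Python sketch. The function name `matches_template` is hypothetical, and the sketch makes one assumption that is not spelled out on this page: the template is compared against the first len(template) characters of each sequence, and sequences shorter than the template are rejected.

```python
def matches_template(sequence, template):
    """Return True if the start of `sequence` matches `template`.

    'N' and '.' in the template match any nucleotide; every other template
    character must match exactly. Assumption: only the initial fragment of
    the sequence (as long as the template) is compared.
    """
    if len(sequence) < len(template):
        return False
    for s, t in zip(sequence, template):
        if t not in "N." and s != t:
            return False
    return True


if __name__ == "__main__":
    for seq in ["ACGTAC", "ACGAAC", "ACG"]:
        print(seq, matches_template(seq, "ANNT"))
```

With a template such as "NNNN", every position is a wildcard, so the check reduces to a length test.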

With distance 0, the tool simply counts unique sequences.


Development started in March 2014. Older versions are provided here for reference (please don't use in production).


Other clusterizing software:


This tool was made by Kirill Kryukov, based on discussion with Katsuyuki Shiroguchi. It is shared with the hope that it can be useful, but without any warranties.

If you have any questions, comments or suggestions, please email kkryukov at gmail dot com.

  © 2014-2015 Kirill Kryukov
This page is available under the CC BY 3.0 License