Alignment Mismatch Finder

About

Alignment Mismatch Finder is a tool that scans a set of pairwise DNA alignments and extracts all regions with a high density of mismatches and gap blocks.

This tool was made by Kirill Kryukov, based on discussion with Yuichiro Hara and Tadashi Imanishi. It is shared with the hope that it can be useful, but without any warranties.

News

2014-03-23 – Page created, version 0.0.2 uploaded.

Purpose of this tool

This tool processes a pairwise DNA alignment (or a set of alignments). It locates and prints all alignment regions that are densely covered by mismatches and (optionally) gaps, in other words, all poorly aligned regions. Our motivation is to re-analyze such regions, because poor alignment might be an indication of something interesting.
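
As a rough illustration of what is being counted, the sketch below tallies mismatches and gap blocks in one aligned pair. It is not the tool's own code: the function name count_variants, the use of '-' as the gap character, and the rule that a run of gaps on one sequence counts as a single block are assumptions made for this example.

```python
def count_variants(seq_a, seq_b, use_gaps=False):
    """Count mismatches and, optionally, gap blocks in a pairwise alignment.

    Assumption: '-' marks a gap, and a run of consecutive gap columns on
    the same sequence is counted as one gap block.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = 0
    gap_blocks = 0
    prev_gap_side = None  # sequence that carried the gap in the previous column
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != prev_gap_side:
                gap_blocks += 1  # a new gap block starts here
            prev_gap_side = side
        else:
            prev_gap_side = None
            if a != b:
                mismatches += 1
    return mismatches + (gap_blocks if use_gaps else 0)


mismatches_only = count_variants("ACGT--ACGT", "ACCTTTAC-T")
with_gaps = count_variants("ACGT--ACGT", "ACCTTTAC-T", use_gaps=True)
```

With use_gaps=False only the single substitution is counted; with use_gaps=True the two gap blocks are added to it.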

What is a mismatch-rich region?

A region of length L containing N mismatches (or gap blocks) is considered sufficiently rich in mismatches if the expected probability of finding N or more mismatches in a random region of the same length is at or below a certain threshold E. This assumes a certain evolutionary distance D between the sequences and uniformly randomly distributed mismatches. The assumed distance and E-value are specified using the --distance and --evalue options.
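
One way to make this criterion concrete is a binomial tail: if every position mismatches independently with probability D, the chance of seeing at least N mismatches in L positions is a sum of binomial terms. The sketch below follows that reading. The binomial model, the function names, and the direct comparison of the tail probability with E are assumptions for illustration; the tool's exact formula is not documented here and may differ, for example by correcting for the number of regions tested.

```python
from math import exp, lgamma, log


def tail_probability(length, n_variants, distance):
    """P(X >= n_variants) for X ~ Binomial(length, distance).

    Assumed model: each of `length` positions is a mismatch
    independently with probability `distance` (0 < distance < 1).
    """
    if not 0.0 < distance < 1.0:
        raise ValueError("distance must be strictly between 0 and 1")
    total = 0.0
    for k in range(max(n_variants, 0), length + 1):
        # Log of the binomial probability mass, to avoid overflow for long regions.
        log_pmf = (lgamma(length + 1) - lgamma(k + 1) - lgamma(length - k + 1)
                   + k * log(distance) + (length - k) * log(1.0 - distance))
        total += exp(log_pmf)
    return total


def is_mismatch_rich(length, n_variants, distance, evalue=0.01):
    """True if the tail probability is at or below the threshold."""
    return tail_probability(length, n_variants, distance) <= evalue


# Three mismatches within 20 bp at an assumed distance of 0.01.
rich = is_mismatch_rich(20, 3, 0.01)
```

Under these assumptions, three mismatches in 20 bp at D = 0.01 have a tail probability of roughly 0.001, so the region passes the default threshold of 0.01.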

Supported formats

Format	Input or output?
FASTA	IN & OUT
G-compass	IN
axt	IN
NSS	IN & OUT
CMEH	OUT

In the case of FASTA, each alignment is represented as two consecutive sequences.
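
The pairing of consecutive FASTA records can be sketched as follows. This is a minimal reader written for this page, not the tool's parser; the function name read_fasta_pairs and the sequence names in the example are made up, and rejecting an odd number of records is an assumption about sensible behaviour.

```python
def read_fasta_pairs(lines):
    """Return a list of ((name_a, seq_a), (name_b, seq_b)) alignments.

    Records 1 and 2 form the first alignment, records 3 and 4 the second,
    and so on. An odd number of records is treated as an error (assumption).
    """
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    if len(records) % 2:
        raise ValueError("odd number of sequences; expected pairs")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]


example = """>seqA_1
ACGT-A
>seqB_1
ACCTTA
>seqA_2
GGGT
TTAA
>seqB_2
GGCTTTAA
"""
pairs = read_fasta_pairs(example.splitlines())
```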

CMEH is an alignment format supported for compatibility with Dr. Hara's programs.

Download

Current version: 0.0.2, 2014-03-23

source (22 kB)

(Distributed under the zlib/libpng license, see LICENSE.txt for details)

How to use it

Usage: alignment-mismatch-finder [<options>] [<file>...]
Options:
  -i, --in-format FORMAT  - Set input format to one of: fasta, nss, axt, g-compass.
  -o, --out-format FORMAT - Set output format to one of: fasta, nss, cmeh.
  -d, --distance D        - Assume that distance between sequences is D.
  -e, --evalue E          - Use E as expect value for variant-rich regions (default: 0.01).
  -L, --min-aln-length N  - Only scan alignments that are at least N bp long (default: 1).
  -n, --min-vars N        - Scan for clusters of at least N variants (substitutions and gap blocks) (default: 3).
  -m, --max-vars M        - Scan for clusters of at most M variants (default: same as --min-vars).
  -x, --extend N          - Extend each region by N bases on both ends before printing (default: 0).
  -j, --join              - Extended regions may overlap; with '--join' such regions are joined together.
  -t, --trim N            - Ignore mismatches/gaps within first and last N bp of each alignment (default: 0).
  -g, --use-gaps          - Use gaps together with mismatches (by default only mismatches are used).
  -c, --ignore-CpG        - Ignore mismatches/gaps at CpG sites.
  -p, --palindrome        - Search for palindromes surrounded by gaps.
  -v, --version           - Show version.
  -h, --help              - Show help.
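
The effect of --extend and --join can be pictured with a small interval sketch. It assumes half-open (start, end) coordinates within one alignment, clips extended regions at the alignment ends, and merges only regions that actually overlap; these details and the function names are illustrative assumptions, not a description of the tool's internals.

```python
def extend_regions(regions, extend, aln_length):
    """Grow each half-open (start, end) region by `extend` bases on both
    sides, clipped to the alignment bounds [0, aln_length)."""
    return [(max(0, start - extend), min(aln_length, end + extend))
            for start, end in regions]


def join_regions(regions):
    """Merge overlapping regions into single regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: stretch it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


found = [(10, 20), (25, 40), (100, 110)]
extended = extend_regions(found, extend=5, aln_length=112)
joined = join_regions(extended)
```

Here the first two regions overlap once extended by 5 bases and are reported as one region after joining, while the third is only clipped at the alignment end.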

The --palindrome option enables a search for palindromic sequences enclosed between two gaps that are not on the same sequence. Both text palindromes (e.g., 'ATTA') and genomic palindromes (e.g., 'AATT') are detected. --palindrome requires the --use-gaps option (it has no effect without it). --palindrome simply adds palindrome detection to the usual mismatch-rich region detection, so it is not affected by the --min-vars and --max-vars options. Palindrome detection is also not affected by the --distance and --evalue options: all palindromes are reported, no matter how short.
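
The two kinds of palindrome can be told apart with a short check: a text palindrome reads the same backwards, while a genomic palindrome equals its own reverse complement. The sketch below covers only that distinction, not the search for flanking gaps; the function names are hypothetical and only the bases A, C, G and T are handled.

```python
# Complement table for unambiguous DNA bases only (assumption).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_text_palindrome(seq):
    """True if the sequence reads the same in reverse, e.g. 'ATTA'."""
    seq = seq.upper()
    return seq == seq[::-1]


def is_genomic_palindrome(seq):
    """True if the sequence equals its reverse complement, e.g. 'AATT'."""
    seq = seq.upper()
    return seq == seq.translate(COMPLEMENT)[::-1]


checks = {
    "ATTA": (is_text_palindrome("ATTA"), is_genomic_palindrome("ATTA")),
    "AATT": (is_text_palindrome("AATT"), is_genomic_palindrome("AATT")),
}
```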

If you have any questions, comments or suggestions, please contact me.


© 2014 Kirill Kryukov. This page is available under the CC BY 3.0 License.