Alignment Programs (dotplot, global & local alignment)

Table of Contents

  1. iNquiry alignment background
  2. Dotplot programs
    1. how to use dotplot programs
    2. dotplot alignment example: Dotmatcher

           Figure one: Sample dotplot using Dotmatcher

           Table one: Other dotplot programs

  3. Global alignment programs
    1. how to use global alignment programs
    2. global alignment example: Needle

           Figure two: Sample global alignment

           Table two: Other global alignment programs

  4. Local alignment programs
    1. how to use local alignment programs
    2. local alignment example: Water

           Figure three: Sample output from a local alignment

           Table three: Other local alignment programs

  5. Research using alignment programs
  6. Sources

iNquiry Alignment

        iNquiry tools for alignment find commonalities between nucleic acid or protein sequences. Some of these tools, such as EMMA and Needle, simply create alignments. Other programs go one step further: Prettyplot displays an alignment graphically, and Cons builds a consensus sequence from an alignment. Thus, a variety of tools are available to align and analyze sequences by different methods.

        Sequence alignment begins with the selection of an algorithm that generates a score for each alignment. This algorithm must account for gaps, which hypothetically result from insertions and deletions in the sequences. The score produced by the algorithm represents the degree of similarity between the sequences. Global alignment programs align sequences over their entire length, while local alignment programs align only the regions of the sequences that are similar enough to maximize the alignment score. Other programs, such as dotplot programs, display regions of similarity graphically.

Dot plot Programs

        Dot plots graphically display regions of similarity between two sequences. One sequence is placed on the y-axis of a rectangular graph, and the other sequence is placed on the x-axis. A dot is plotted at every coordinate where there is similarity between the sequences. Where two sequences are very similar, many dots align to form diagonal lines. These diagonals make it possible to visualize local regions of similarity. Diagonals also make it easy to see other features such as repeats (which form parallel diagonal lines) and insertions or deletions (which form breaks or discontinuities in the diagonal lines) (2). Dot plot programs include Dotmatcher, Dotpath, Dottup, and Polydot. Each program takes a different approach to comparing sequences with a dot plot.

How to use dotplot programs

         Two nucleic acid or protein sequences must be entered as sequence a and sequence b. Because dotplot programs look for matches within a sliding window, both a window size and a threshold must be chosen. The window size is the number of positions over which the score is summed and compared with the threshold. In general, the smaller the window, the more likely it is that a match will be reported. The threshold is the score a window must exceed for a match to appear in the dotplot, so a higher threshold demands stricter similarity. The user must also specify how the output is to be displayed (e.g., text, windows graph).

Dotmatcher

         Dotmatcher uses a threshold, applied to scores calculated from a substitution matrix such as BLOSUM, to decide whether a match is plotted. A window of specified length moves along all possible diagonals, and a score is calculated within the window at each position. The score is the sum of the comparisons of the two sequences along the window, using the given similarity matrix. If the score is above the threshold, a line is plotted on the image over the position of the window (2).

Figure one: An example of the output created by a dotplot program. Human HBB and HBA proteins are compared using a window size of 3 and a threshold of 23.00. (From the EMBOSS user guide found at http://egg.isu.edu/inquiry.html)
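The windowed scoring that Dotmatcher is described as using can be illustrated with a minimal Python sketch. This is not Dotmatcher's code: the function names are hypothetical, and the scoring is a stand-in (+1 for identical residues, 0 otherwise) rather than a BLOSUM matrix, so the threshold values here are not comparable to the 23.00 used in figure one.

```python
def window_dotplot(seq_a, seq_b, window=3, threshold=2, score=None):
    """Return (i, j) window start positions whose summed score is above the threshold."""
    if score is None:
        # Stand-in scoring (an assumption): +1 for identical residues, 0 otherwise.
        score = lambda x, y: 1 if x == y else 0
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Sum the pairwise scores along the diagonal covered by the window.
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(window))
            if total > threshold:
                hits.append((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw a text dot plot with seq_a along the x-axis and seq_b down the y-axis."""
    marked = set(hits)
    rows = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        cells = "".join("*" if (i, j) in marked else "." for i in range(len(seq_a)))
        rows.append(residue + " " + cells)
    return "\n".join(rows)


if __name__ == "__main__":
    a, b = "GATTACAGATTACA", "TTACAGAT"
    print(render(a, b, window_dotplot(a, b)))
```

Each asterisk marks the start of a window that passed the threshold; runs of asterisks along a diagonal correspond to the diagonal lines described above. Passing a different `score` function (for example, a lookup into a substitution matrix) changes the scoring without changing the loop.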

Table One: other dotplot programs

Dotpath

  • displays a non-overlapping wordmatch dotplot of two sequences.
  • very similar to another alignment program called Dottup.
  • both programs look for places where words (tuples) of a chosen length have an exact match, which is represented by a diagonal line over the position of these words
  • a longer word size shows less random noise and runs extremely quickly, but is less sensitive

Dottup

  • nearly identical to Dotpath in its capabilities.
  • looks for places where words (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words
  • unlike Dotpath, overlapping matches are not excluded
  • a fast, but not especially sensitive, way of creating dotplots
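The exact word (tuple) matching that Dotpath and Dottup are described as using can be sketched as follows. This is an illustration under assumptions, not either program's implementation: the function name `word_matches` is hypothetical, and overlapping matches are all reported, as described for Dottup.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, word_size=4):
    """Return sorted (i, j) positions where a word of word_size matches exactly."""
    # Index every word in seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_size + 1):
        index[seq_a[i:i + word_size]].append(i)
    # Look up every word of seq_b in the index.
    hits = []
    for j in range(len(seq_b) - word_size + 1):
        for i in index.get(seq_b[j:j + word_size], []):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    print(word_matches("GATTACA", "TTACAGA", word_size=3))
    print(word_matches("GATTACA", "TTACAGA", word_size=6))
```

With a word size of 3 the two toy sequences share three consecutive words on one diagonal; with a word size of 6 nothing is reported, which illustrates why longer words are faster and quieter but less sensitive.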

Polydot

  • compares all sequences in a set of sequences, and draws a dot plot for each pair of sequences by marking where tuples of a certain length have an exact match in both sequences.
  • should be used when sequences are to be compared exactly one to one

Global Alignment Programs

         Global alignment programs align entire sequences. Each program compares a whole sequence to either another complete sequence or a sequence fragmented by cloning or changed by natural divergence. As with all alignments, a global pairwise alignment assumes that the two sequences have diverged from a common ancestor. Each program uses a different means to compensate for regions that the two sequences do not share. The best alignment over the whole length of the two sequences then best illustrates their similarities. Global alignment programs include Est2genome, Stretcher, and Needle.

How to use Global alignment programs

        Two sequences must be entered. The user must then specify a gap penalty (default 10), the score taken away for every gap created. This penalty discourages alignments that contain more gaps than necessary. A gap extension penalty (default 0.5) must also be specified; it is charged for each base or residue by which a gap is extended. The extension penalty should be lower than the gap penalty because a single long gap is more likely than several separate short gaps. Typically, the cost of extending a gap is set 5-10 times lower than the cost of opening one (2). An output file must also be designated.
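The effect of the two penalties can be seen with a small worked calculation. The sketch below assumes one common convention, in which the first position of a gap is charged the gap penalty and each additional position is charged the extension penalty; some tools instead charge the extension penalty for every position in the gap. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of the given length (assumed convention: open + extend * (length - 1))."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    one_long_gap = gap_cost(5)           # a single gap spanning five positions
    five_short_gaps = 5 * gap_cost(1)    # five separate one-position gaps
    print(one_long_gap, five_short_gaps)
```

Under these defaults a single five-position gap costs 12.0, while five separate one-position gaps cost 50.0, so the scoring favors one long gap over many short ones.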

Needle

         Needle finds the best alignment for two whole sequences using the Needleman-Wunsch global alignment algorithm. This algorithm chooses the optimal global alignment by considering all possible alignments and selecting the one with the best score. It does this by reading from a scoring matrix that contains values for every possible residue or nucleotide match. The score of an alignment is the sum of the match values taken from the scoring matrix, less any gap penalties, and Needle reports the alignment with the maximum possible score (2).

########################################
# Program:  needle
# Rundate:  Tue Apr 06 14:30:14 2004
# Align_format: srspair
# Report_file: hba_human.needle
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 148
# Identity:      63/148 (42.6%)
# Similarity:    88/148 (59.5%)
# Gaps:           9/148 ( 6.1%)
# Score: 290.5
# 
#
#=======================================

HBA_HUMAN          1 -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL     48
                      .|:|.:|:.|.|.||||  :..|.|.|||.|:.:.:|.|:.:|..| ||
HBB_HUMAN          1 VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDL     48

HBA_HUMAN         49 S-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV     93
                     |     .|:.:||.|||||..|.::.:||:|::....:.||:||..||.|
HBB_HUMAN         49 STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV     98

HBA_HUMAN         94 DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR    141
                     ||.||:||.:.|:..||.|...||||.|.|:..|.:|.|:..|..||.
HBB_HUMAN         99 DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH    146

Figure two: Example output from Needle. In this alignment of the HBA and HBB proteins, a vertical bar (|) marks an identical pair of residues, a colon (:) marks a pair with stronger similarity, and a period (.) marks a pair with some similarity. From http://egg.isu.edu
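The Needleman-Wunsch approach can be sketched in a few lines of Python. This is a simplified illustration, not Needle's implementation: it assumes a flat match/mismatch score in place of a matrix such as EBLOSUM62 and a single linear gap penalty in place of separate gap and extension penalties, and the function name is hypothetical. The table is filled so that every cell holds the best score over all alignments of the two prefixes, which is how all possible alignments are considered without listing them one by one.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because the alignment is global, every residue of both sequences appears in the result, with "-" marking gap positions.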

Table Two: other global alignment programs

Stretcher

  • calculates a global alignment of two sequences.
  • takes the entire length of both sequences into account when making an alignment.

Est2genome

  • aligns spliced nucleotide sequences (EST, cDNA, or mRNA) with unspliced genomic DNA sequences.
  • aids in the prediction of the spliced nucleotide sequences via sequence homology.
  • where needed, introns are allowed in the genomic sequence to improve the alignment; a list of the resulting exons and introns is included in the output file
  • uses three alignments to make predictions
  • should be used to piece together a predicted sequence for spliced nucleotide sequences based on sequence homology

Local Alignment Programs

   Local alignment programs find small, local parts of two sequences where there is some similarity and make no assumption that the sequences must be similar over their whole length. Local alignment programs include Matcher, Water, Wordmatch, Seqmatchall, and Supermatcher.

How to use local alignment programs

   Two sequences must be entered by the user, along with a gap penalty and a gap extension penalty. With some local alignment programs, such as Matcher, the user can request that more than just the best alignment be shown. Other programs require that the user choose the scoring matrix. Most of the programs, however, use EDNAMAT and eBLOSUM62 as default scoring matrices for nucleic acid and protein sequences, respectively.

Water

  Water uses the Smith-Waterman local alignment algorithm to compare sequences. Regions of similarity between two sequences are found, and local regions from one sequence can even be used to scan a sequence database. This program is best used to look for similarities between small parts of two sequences. In addition to the actual alignment, Water reports an alignment score, percent identity, and percent similarity, as seen in figure three.

Figure three: Sample data using Water to compare HBB and HBA proteins

########################################
# Program:  water
# Rundate:  Tue Apr 06 14:34:38 2004
# Align_format: srspair
# Report_file: hba_human.water
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: HBA_HUMAN
# 2: HBB_HUMAN
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 145
# Identity:      63/145 (43.4%)
# Similarity:    88/145 (60.7%)
# Gaps:           8/145 ( 5.5%)
# Score: 293.5
# 
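The Smith-Waterman approach used by Water can be sketched in the same style. This is a simplified illustration rather than Water's implementation: it assumes a flat match/mismatch score and a single linear gap penalty, and the function name is hypothetical. The difference from a global alignment is that cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero, so only the best-matching region is reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                            # start a new local alignment
                              score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In the toy example only the shared GATTACA core is returned; the dissimilar flanking residues are left out of the alignment entirely.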

Table Three: other local alignment programs

Wordmatch

  • finds all exact matches of a specified size between two sequences.
  • identical regions are reported as the output.
  • only matches of the specified size and larger will be shown

Supermatcher

  • looks for all word matches between one sequence and other sequences.
  • This rough alignment tool is best used with large sequences that would otherwise take an appreciable amount of time and memory to align

Matcher

  • uses a very specific algorithm to find the best local alignment between two sequences
  • similar to Water but uses less memory
  • can be used to align large sequences.
  • can report more than just the best alignment, in contrast to Water, which reports only the best alignment

Seqmatchall

  • compares tuples, or word fragments of the sequences, from a set of sequences to do an "all against all" pairwise alignment
  • best used to align a group of sequences
  • program runs slowly due to its "all against all" analysis, so the user must choose a word size large enough for the run to finish in a reasonable amount of time yet small enough to find regions of similarity
  • with a larger word size the program runs more quickly; however, any matches shorter than the specified size will be overlooked
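The "all against all" idea behind Seqmatchall can be sketched as follows. This is only an illustration of why the work grows quickly with the number of sequences, not Seqmatchall's algorithm: the function names are hypothetical, and the sketch simply reports which exact words each pair of sequences shares.

```python
from itertools import combinations


def shared_words(seq_a, seq_b, word_size):
    """Return the set of words of word_size that occur in both sequences."""
    words_a = {seq_a[i:i + word_size] for i in range(len(seq_a) - word_size + 1)}
    words_b = {seq_b[i:i + word_size] for i in range(len(seq_b) - word_size + 1)}
    return words_a & words_b


def all_against_all(sequences, word_size=4):
    """Compare every pair in a {name: sequence} dict; n sequences give n*(n-1)/2 pairs."""
    return {
        (name_a, name_b): shared_words(sequences[name_a], sequences[name_b], word_size)
        for name_a, name_b in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    toy = {"s1": "GATTACA", "s2": "TTACAGA", "s3": "CCCCCCC"}
    for pair, words in all_against_all(toy, word_size=3).items():
        print(pair, sorted(words))
```

Three sequences already require three comparisons, and ten would require forty-five, which is why the choice of word size matters for run time.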

Research using alignment programs

Use of Dotplot programs

  • Coenye, T. Vandamme, P. Extracting phylogenetic information from whole genome sequencing projects: the lactic acid bacteria as a test case. Microbiol. Dec 2003; 149:3507-3517.
  • Huang, Y. Zhang, L. Rapid and sensitive dot matrix methods for genome analysis. Bioinformatics. Mar 2004; 20(4):460-6.
  • Jancovich, J. Mao, J. Chinchar, V. Wyatt, C. et al. Genomic sequence of a ranavirus associated with salamander mortalities in North America. Virology. Nov 2003; 316:90-103.
  • Warren, M. Smith, A. Partridge, N. Masabanda, J. Structural analysis of the chicken BRCA2 gene facilitates identification of functional domains and disease causing mutations. Human Molecular Genetics. Apr 2002; 11:841-851.
  • Wicker, T. Guyot, R. Yahiaoui, N. Keller, B. CACTA transposons in Triticeae. A diverse family of high-copy repetitive elements. Plant Physiol. May 2003; 132(1):52-63.

Use of Global Alignment Programs

  • Cheung, A. Bayer, A. Zhang, G. Gresham, H. Xiong, Y. Regulation of virulence determinants in vitro and in vivo in Staphylococcus aureus. FEMS Immunol Med Microbiol. Jan 2004; 40(1):1-9.

Use of local alignment programs

  • Goddard, J. Sumner, J. Nicholson, W. et al. Survey of ticks collected in Mississippi for Rickettsia, Ehrlichia, and Borrelia species. J Vector Ecol. Dec 2003; 28:184-9.
  • Yamada, Y. Makimura, K. et al. Phylogenetic relationships among medically important yeasts based on sequences of mitochondrial large subunit ribosomal RNA gene. Mycoses. Feb 2004; 47:24-8.
  • Zhang, W. Jayarao, B. Knabel, S. Multi-virulence locus sequence typing of Listeria monocytogenes. Appl Environ Microbiol. Feb 2004; 70:913-20.

Sources

  1. Thomas, M. "Sequence alignments" Power point presentation 2/10/2004
  2. EMBOSS instructional package. http://egg.isu.edu/inquiry
  3. www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps
  4. www.sanger.ac.uk/Software/EMBOSS
  5. http://egg.isu.edu/emboss/dotmatcher.html