

A User's Guide to Prokaryotic Gene Prediction Using Glimmer on EGG

I. Introduction

This guide is an introduction to Glimmer, a prokaryotic gene prediction program. Glimmer uses a sequence of programs to locate putative coding regions in microbial DNA, both eubacterial and archaeal. Glimmer (Gene Locator and Interpolated Markov ModelER) is reported to distinguish coding regions of an unknown genome from noncoding regions with around 99% accuracy. Much of the information for this guide was taken from www.tigr.org as well as the two published papers on Glimmer. (1, 2)

The traditional way to identify microbial genes has been to sequence an unknown region and then use BLAST or FASTA to search entries in a database. Although this method is accurate for homologous genes that have been identified, novel genes will not be easily found. Since microbial genomes typically contain around 90% coding sequence, we can rely on computational methods to differentiate the putative genes in each reading frame.

The genome of Haemophilus influenzae was characterized using a computational method called GeneMark (3), which uses a Markov chain model to find putative genes. Glimmer improves gene prediction by using an interpolated Markov model (IMM), which produces more accurate results in finding microbial genes. IMMs and the other computations Glimmer uses are explained in detail in the next section.

II. Glimmer Explained

Glimmer identifies potential genes using a combination of Markov models from 1st through 8th order, weighting each model according to its predictive power. Markov chain models alone have been used reliably in the past for gene prediction. A fixed-order Markov model predicts the next base in a sequence from a fixed number of preceding bases. For example, a fifth-order Markov model uses the five previous bases to predict the next base. A kth-order Markov model for DNA sequences has 4^(k+1) probabilities (4096 for a 5th-order model). To estimate these probabilities reliably, the training data must contain many occurrences of every possible k-mer, which becomes harder as k grows.
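The fixed-order model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Glimmer's own code; the function names are made up for this example.

```python
from collections import Counter, defaultdict


def num_parameters(k):
    """Number of probabilities in a k-th order DNA model: 4^k contexts x 4 bases."""
    return 4 ** (k + 1)


def train_fixed_order(seq, k):
    """Count how often each base follows each k-base context in seq."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return counts


def probability(counts, context, base):
    """Estimate P(base | context) from the counts; None if the context was never seen."""
    seen = counts.get(context)
    if not seen:
        return None
    return seen[base] / sum(seen.values())


counts = train_fixed_order("acgacgacg", 2)
print(num_parameters(5))               # size of a 5th-order model
print(probability(counts, "ac", "g"))  # "ac" is always followed by "g" here
print(probability(counts, "tt", "a"))  # unseen context: no estimate possible
```

The last call shows the weakness of a fixed-order model: a context that never occurs in the training data gives no estimate at all.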

The IMM in Glimmer addresses this data requirement by combining probabilities from contexts (oligomers) of varying lengths and using only those contexts for which there is sufficient data. Oligomers that occur frequently are given higher weights. This means that when the statistics on longer oligomers are inadequate to make good estimates, an IMM can fall back on the shorter oligomers to make predictions. After creating IMMs for all six reading frames, Glimmer uses these models to score open reading frames (orfs).
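A simplified sketch of the interpolation idea follows. It is not Glimmer's actual weighting scheme, which is described in the papers (1, 2). Here the weight of a context simply grows linearly with the number of times it was seen, and the cutoff `full_weight_count` is an illustrative assumption.

```python
from collections import Counter, defaultdict


def train_counts(seq, max_order):
    """Build counts[k][context][base] for every order k from 0 to max_order."""
    counts = [defaultdict(Counter) for _ in range(max_order + 1)]
    for k in range(max_order + 1):
        for i in range(k, len(seq)):
            counts[k][seq[i - k:i]][seq[i]] += 1
    return counts


def imm_probability(counts, context, base, full_weight_count=400):
    """Interpolated P(base | context), built up from the shortest context.

    Assumption for this sketch: a context seen full_weight_count times or
    more is trusted completely; rarer contexts are blended with the
    estimate from the next shorter context.
    """
    prob = 0.25  # uniform guess before any data is considered
    max_k = min(len(context), len(counts) - 1)
    for k in range(max_k + 1):
        ctx = context[len(context) - k:]  # the last k bases of the context
        seen = counts[k].get(ctx)
        if not seen:
            continue  # no data at this order: keep the shorter-context estimate
        total = sum(seen.values())
        weight = min(1.0, total / full_weight_count)
        prob = weight * seen[base] / total + (1.0 - weight) * prob
    return prob


counts = train_counts("acg" * 200, max_order=2)
p_seen = imm_probability(counts, "ac", "g")    # well-supported context
p_unseen = imm_probability(counts, "tt", "a")  # falls back to base frequency
```

Because every step is a weighted average of two probability distributions, the four base probabilities for any context still sum to 1.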

Glimmer identifies all orfs longer than a certain length and scores each one in all six reading frames. Those that score higher than a certain threshold are processed further and examined for overlaps of a specified length. If an overlap is found, the overlap region is scored separately in all six reading frames and the highest score of the overlap is combined with the highest score for the putative gene. The output includes notations of these overlapping genes as well as 'suspect' genes that should be examined manually. Examples of Glimmer output can be obtained from the TIGR site above.
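The first step, finding orfs above a minimum length in all six reading frames, might look roughly like the sketch below. It is a simplified illustration and not Glimmer's implementation: it reports only stop-free stretches that end at a stop codon, uses 0-based coordinates, and gives reverse-strand coordinates relative to the reverse complement. The default minimum length of 90 is an arbitrary placeholder.

```python
STOPS = {"taa", "tag", "tga"}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_length):
    """Yield (frame, start, end) for stop-free stretches that end at a stop codon.

    Coordinates are 0-based, end is exclusive and excludes the stop codon.
    """
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if pos - start >= min_length:
                    yield frame + 1, start, pos
                start = pos + 3  # the next orf begins right after this stop


def find_orfs(seq, min_length=90):
    """Find orfs in the three forward (F1-F3) and three reverse (R1-R3) frames."""
    seq = seq.lower()
    found = [("F%d" % f, s, e) for f, s, e in orfs_in_strand(seq, min_length)]
    rc = reverse_complement(seq)
    found += [("R%d" % f, s, e) for f, s, e in orfs_in_strand(rc, min_length)]
    return found


example = "atg" + "gct" * 10 + "taa"
print(find_orfs(example, min_length=30))
```

In this sketch each orf would then be handed to the scoring model; the scoring and overlap handling are not shown.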

Examples of research completed with the help of Glimmer include the complete genome sequences of Vibrio cholerae and Porphyromonas gingivalis. (4, 5)

III. How to use Glimmer on EGG

On the EGG server, Glimmer is broken up so that each step requires input. This gives the user the opportunity to adjust parameters between steps. Please note that it is recommended to save the output files as you go.

Glimmer works in four steps (programs):

  1. “long-orfs” - The first program finds long open reading frames (ORFs) from a FASTA formatted sequence. This FASTA file should contain the most closely related sequence to the organism on which you are working. For instance, if you have a 20,000 bp piece of Salmonella DNA, use the closest Salmonella species for your input. If you have 2-3 million bases, you can use your uncharacterized DNA as the input to build the IMM that will be used to find potential genes in your input sequence. The output of this program is then used in the glimmer_extract program. The input parameters of this program allow the user to adjust the gene size, overlap length and overlap percentage.
    1. Circular Genome – This allows the user to specify whether the sequence is linear (e.g. phage DNA) or circular. If it is circular, the program looks for genes that wrap around the end of the sequence.
    2. Minimum Gene Length – This parameter allows the user to specify the length of the smallest fragment to be considered a gene. This length is measured from the first base of the start codon to the base immediately preceding the stop codon. NOTE: By default the program will compute an optimal length for this parameter. Optimal is considered the value that produces the greatest number of long orfs, which increases the amount of data used for training the IMM.

    3. Overlap Length – This parameter sets the minimum number of bases by which two genes must overlap for the overlap to be considered a problem. Overlaps shorter than this are ignored.

    4. Overlap Percentage – This is a second lower limit on problem overlaps, expressed as a percentage of gene length. Overlaps shorter than this percentage of both genes are ignored.

    5. Additional Notes – The start codons used are atg and gtg. The stop codons used are taa, tag and tga. Changing these values requires modifying the file gene.h. It is highly recommended that you get the help of the Bioinformatics team before you do this. More information on start/stop codon changes can be found in the long_orfs readme file.

  2. “glimmer_extract” – This second program takes the coordinates of the long orfs found above, then extracts and outputs the orf sequences in a format that can be read by the next program, build_icm. The input parameters allow the user to skip the start codon and to omit output sequences shorter than a given length.
    1. Omit Start Codon – Omitting the start codon was the default behavior in the previous version of Glimmer. It is now an option that, when selected, omits the first three bases of each long orf.
    2. Minimum Length of Sequences – This allows the user to take out sequences shorter than the specified number.
    3. Output File Name – This allows the user to specify the output file name.
  3. “build_icm” – This portion of the process takes the extracted output file above and creates and outputs an interpolated Markov model that will be used in the final step, Glimmer. The only input parameter lets the user choose whether the output is written in text format instead of binary format.

  4. “Glimmer” – This final step requires two input files: a sequence file in FASTA format and the collection of Markov models produced by the third step above. Glimmer produces an output file that lists potential genes and is discussed at length in section IV. Input parameters here are:

    1. Minimum Gene Length – This sets the length of the smallest fragment to be considered a gene.
    2. Regard Genome as Circular – This allows the user to specify whether the sequence is linear (e.g. phage DNA) or circular. If it is circular, the program looks for genes that wrap around the end of the sequence.
    3. Overlap Length – This parameter sets the minimum number of bases by which two genes must overlap for the overlap to be considered a problem. Overlaps shorter than this are ignored.
    4. Overlap Percentage – This is a second lower limit on problem overlaps, expressed as a percentage of gene length. Overlaps shorter than this percentage of both genes are ignored.
    5. Threshold Score – This is the minimum in-frame score for a sequence fragment to be considered a gene.
    6. First Codon or Ribosomal Binding Site – This parameter indicates whether the first possible start codon is used. If it isn't, a subroutine (Choose_Start) calculates the hybridization energy of the ribosome pattern (the string is Ribosome_Pattern), and if it is above the threshold, that start site is chosen. The user can input a specific ribosome-binding pattern in the next parameter if it is known. This entire portion of the program still has bugs and is not recommended for use.
    7. Use Independent Probability - This dictates whether fragments are additionally scored using an independent base probability.
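The interplay of the Overlap Length and Overlap Percentage parameters can be sketched as follows. This is one reading of the parameter descriptions above, not code taken from Glimmer, and the default values of 30 bases and 10% are placeholders chosen for illustration only.

```python
def gene_length(gene):
    """Length of a gene given as (start, end), using 1 + |start - end|."""
    return 1 + abs(gene[0] - gene[1])


def overlap_length(gene_a, gene_b):
    """Number of bases shared by two genes; coordinates are inclusive."""
    a_lo, a_hi = min(gene_a), max(gene_a)
    b_lo, b_hi = min(gene_b), max(gene_b)
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)


def is_problem_overlap(gene_a, gene_b, min_overlap_len=30, min_overlap_pct=10.0):
    """Return True when an overlap is large enough to need resolving.

    Assumed rule: an overlap is ignored if it is shorter than the minimum
    length, or shorter than the minimum percentage of BOTH genes.
    """
    olap = overlap_length(gene_a, gene_b)
    if olap < min_overlap_len:
        return False
    fraction = min_overlap_pct / 100.0
    small_for_a = olap < fraction * gene_length(gene_a)
    small_for_b = olap < fraction * gene_length(gene_b)
    return not (small_for_a and small_for_b)


print(is_problem_overlap((100, 500), (480, 900)))    # 21 bases: too short
print(is_problem_overlap((100, 500), (400, 900)))    # 101 bases: a problem
print(is_problem_overlap((1, 2000), (1950, 4000)))   # 51 bases, under 10% of both
```

Raising either parameter makes the final step more tolerant of overlapping genes; lowering them flags more pairs for resolution.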

IV. Glimmer Output and Interpretation

The output file that glimmer produces starts by showing what the user's input parameters were. After this, the report is arranged in columns that are explained below. Click output.pdf for an example output file for Glimmer. The example is from E. coli K12 genomic sequence.

  1. ID# - This column has the sequential number assigned to any region that scored higher than the minimum score entered by the user.
  2. Frame - This column has the reading frame of the orf.   Forward and reverse reading frames are represented as (F1, F2, F3) and (R1, R2, R3), respectively.
  3. Orf Start – This is the position of the first base of the orf.   In other words, the first base after the previous stop codon.
  4. Gene Start – This is the position of the first base of the start codon of the potential gene.
  5. Gene End – This is the position of the last base before the stop codon.
  6. Orf Length – This is the length of the orf, calculated as 1 + |OrfStart – End|.
  7. Gene Length – This is the length of the gene, calculated as 1 + |GeneStart – End|.
  8. Gene Score – This is the score that Glimmer calculates for the potential gene region. It is the probability, in percent, that the Markov model in the correct frame generated this sequence. This value matches the highest value in the Frame Scores column.
  9. Frame Scores – This shows the probability in percent for each reading frame that the Markov model generated this sequence.
  10. Independent Score – This column is the percent probability that a model of independent probabilities for each base generated the gene sequence. This is the chance that the sequence was generated randomly.
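As a worked example of the two length columns, the short sketch below applies the formula 1 + |Start – End| to one made-up row. The coordinates are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
def span_length(start, end):
    """Length of a region given its first and last base: 1 + |start - end|."""
    return 1 + abs(start - end)


# Hypothetical values for one forward-strand row of the report.
orf_start, gene_start, gene_end = 337, 400, 2796

orf_length = span_length(orf_start, gene_end)    # 1 + |337 - 2796| = 2460
gene_length = span_length(gene_start, gene_end)  # 1 + |400 - 2796| = 2397

# Both regions stay in one reading frame, so each is a whole number of codons.
assert orf_length % 3 == 0 and gene_length % 3 == 0
print(orf_length, gene_length)
```

Because the formula uses an absolute value, it gives the same result when the start coordinate is larger than the end coordinate.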

When two genes overlap by the amount specified by the user, a line starting with *** is shown and followed by the scores for the overlap region. If the frame of the high score matches the reading frame of the longer gene, then the shorter gene is rejected.

At the end of the output there is a list of potential gene positions. Several comments in [] can be seen in this part of the output that note suspect genes.

  1. [Bad Olap A B C] - This means that gene A overlapped this and was shorter but scored higher on the overlap region. B is the length of the overlap and C is the score of this overlap region.
  2. [Shorter A B C] – This means that gene A overlapped this and was longer but scored lower on the overlap region. B is the length of the overlap and C is the score of this overlap region.
  3. [Shadowed by A] – This means that this gene lies completely inside gene A's region, but in another frame.
  4. [Delay by A B C D] – This means that this gene was potentially rejected because of an overlap with gene B, but if the start codon is postponed by A positions, it would be a valid gene. C is the length of the overlap that caused the rejection and D is the score in this gene's frame of that overlap region.
  5. [Weak] – This means that this sequence did not meet the scoring threshold, but if the independent model were ignored, it would score high enough. This will only be reported if the –w option is checked.
  6. [Vote] – This means that this sequence did not meet the scoring threshold, but many of its subranges had high enough scores to indicate it might be a gene.
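When reviewing a long gene list, it can help to pull these bracketed comments out automatically. The sketch below assumes only that each comment is enclosed in square brackets and made up of words followed by numbers, as in the list above. The example line is invented and does not reproduce the exact layout of a real Glimmer report.

```python
import re


def is_number(token):
    """True if the token can be read as a number (e.g. '45' or '2.1')."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_notes(line):
    """Return a list of (comment_name, [values]) for each [...] comment in line."""
    notes = []
    for body in re.findall(r"\[([^\]]+)\]", line):
        tokens = body.split()
        words = [t for t in tokens if not is_number(t)]
        values = [t for t in tokens if is_number(t)]
        notes.append((" ".join(words), values))
    return notes


# Invented example line: gene 12 with two comments attached.
line = "12 1000 2500 [Bad Olap 11 45 2.1] [Weak]"
print(parse_notes(line))
```

The values are kept as strings so that gene IDs, lengths and scores can be converted separately once the real column layout is known.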

V. Verification and Further Study of Glimmer Results

The potential genes that are produced by Glimmer can be input into BLAST or FASTA to search for homologous genes in other organisms. This is a good first step in characterizing hypothetical proteins. There is a program that can be obtained from TIGR that performs this task for you (http://www.tigr.org/~salzberg/glimmer/blastfasta.pl and http://www.tigr.org/~salzberg/glimmer/README). This program requires some skill with Perl scripts, and support from the bioinformatics team is recommended.
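If you prefer to prepare the BLAST or FASTA input by hand, the predicted coordinates can be turned into FASTA records with a few lines of Python. This sketch is independent of the TIGR Perl script. It assumes 1-based coordinates and assumes that reverse-strand genes are reported with a start position larger than the end position; check both against your own output before relying on it.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def extract_gene(genome, start, end):
    """Return the gene sequence for 1-based, inclusive coordinates.

    Assumption: start > end marks a gene on the reverse strand, which is
    returned as the reverse complement of the genomic sequence.
    """
    if start <= end:
        return genome[start - 1:end]
    return genome[end - 1:start].translate(COMPLEMENT)[::-1]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with lines of at most width bases."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


genome = "ccatggcttaagg"  # toy sequence for illustration
forward = extract_gene(genome, 3, 8)
reverse = extract_gene(genome, 8, 3)
print(to_fasta("gene1", forward), end="")
print(to_fasta("gene2_rev", reverse), end="")
```

The resulting records can be pasted into a BLAST or FASTA search form or saved to a file for batch searches.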

There are several drawbacks to using Glimmer for characterizing genomes. First, if the start codons of the organism are not well understood and entered appropriately into Glimmer, detection of putative genes will be less successful. Second, if the genome has phage DNA inserted, that DNA should be removed from the genome and characterized using the most similar phage sequence, rather than the bacterial genome or the closest known species. Finally, the program only finds the genes that conform to the rules established by your IMM. Exceptions or unusual genes, such as polycistronic genes, will not be well characterized by Glimmer.


VI. Bibliography

1. Salzberg, S. L., A. L. Delcher, S. Kasif, and O. White. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26: 544-548

2. Delcher, A. L., D. Harmon, S. Kasif, O. White, and S. L. Salzberg. 1999. Improved microbial gene identification with Glimmer. Nucleic Acids Res. 27: 4636-4641

3. Borodovsky, M., and D. McIninch. 1993. Parallel gene recognition for both DNA strands. Comp. Chem. 17: 123-133

4. Heidelberg, J. F., et al. 2000. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406: 477-483

5. Nelson, K. E., et al. 2003. Complete genome sequence of the oral pathogenic bacterium Porphyromonas gingivalis strain W83. J. Bacteriology 185: 5591-5601