**********************************************************************
**********************************************************************
**                                                                  **
**                     GENOMESCAN DOCUMENTATION                     **
**                                                                  **
**                                                                  **
**                        Christopher Burge                         **
**                                                                  **
**                               MIT                                **
**                      Department of Biology                       **
**                  77 Massachusetts Ave., 68-222                   **
**                       Cambridge, MA 02139                        **
**                                                                  **
**                          cburge@mit.edu                          **
**                                                                  **
**********************************************************************
**********************************************************************

______________________________________________________________________

ORGANIZATION OF THIS FILE

  1. OVERVIEW OF GENOMESCAN
  2. GENOMESCAN INPUT
  3. GENOMESCAN OUTPUT
  4. RUNNING GENOMESCAN WITH GENOMESCRIPT
  5. GENOA FORMAT AND BLASTX2GENOA
  6. TECHNICAL DETAILS
  7. WEB PAGES
  8. REFERENCES

______________________________________________________________________

1. OVERVIEW OF GENOMESCAN

GenomeScan is a program for identifying the exon-intron structures of genes in genomic DNA sequences from a variety of organisms, with a focus on human and other vertebrates. The algorithm combines two principal sources of information: 1) models of exon-intron and splice signal composition; and 2) sequence similarity information such as BLASTX hits. The input to the program consists of a genomic sequence previously masked with RepeatMasker, a parameter file for the appropriate organism and a 'Genoa file' containing a summary of available sequence similarity information. The program determines the most likely "parse" (gene structure) conditional on the given similarity information under a probabilistic model of the gene structural and compositional properties of genomic DNA for the given organism. The locations of all predicted exons and genes are printed to an output file (the text output) together with the corresponding predicted CDS (coding DNA) and peptide sequences and a summary of the similarity information used in the predictions.
A graphical (PostScript) output is also created displaying the location of each predicted exon and of each BLASTX hit. Like Genscan, the model treats the general case in which the sequence may contain no genes, one gene, or multiple genes on either or both DNA strands, and partial genes as well as complete genes are considered. The most important restrictions are that only protein coding genes are considered (and not tRNA or rRNA genes, for example), that transcription units are assumed to be non-overlapping, and that all predicted genes must have at least modest similarity to a known protein. The probabilistic model used by GenomeScan is based on that used by Genscan and accounts for many essential features of gene structure such as gene density, the typical number of exons per gene, the distribution of exon sizes for different types of exons; and also many of the important compositional properties of genes, e.g., the reading frame-specific hexamer composition of coding regions, the (reading frame-independent) hexamer composition of introns and intergenic regions, and the position-specific composition of the translation initiation (Kozak) and termination signals, and of the TATA box, cap site and polyadenylation signals. Models of the donor and acceptor splice sites are used which capture potentially important dependencies (interactions) between positions in these signals. For human and vertebrate sequences, separate sets of model parameters are used which account for the many substantial differences in gene density and structure observed in distinct C+G% compositional regions of the human genome and the genomes of other vertebrates. The Genscan model is described in Burge and Karlin, 1997 (see REFERENCES) and in greater detail in my thesis (http://genes.mit.edu/chris). In addition to this documentation, GenomeScan is described in the following paper: R.-F. Yeh, L. P. Lim and C. B. Burge, 2001. 
Computational Inference of Homologous Gene Structures in the Human Genome. Genome Research (in press).

______________________________________________________________________

2. GENOMESCAN INPUT

After installing the GenomeScan package on a Unix system (not described in this document), typing "genomescan" at the prompt lists command line arguments:

% genomescan
usage: genomescan paramfile seqfile -g genoafile [optional arguments]

The three essential arguments (the parameter file, sequence file and Genoa file) are described later. The optional arguments are as follows:

OPTIONAL ARGUMENTS
__________________

-v      Add extra explanatory information to the text output. This
        information is helpful the first few times the program is run
        but soon becomes unnecessary (that's why it's optional).

-GC     Consider potential 5' splice sites which have /GC as well as
        /GT as the first two bases of the intron (default: GT only)

-F      Consider potential genes on forward DNA strand only

-R      Consider potential genes on reverse DNA strand only

        By default, potential genes on either or both DNA strands are
        considered, the only restriction being that the gene
        structures in a parse must not overlap.

-r      Sets value of r parameter used in root-r heuristic (see
        Methods section of GenomeScan manuscript)

-start  Sets value of START_FACTOR parameter (see below)

-stop   Sets value of STOP_FACTOR parameter (see below)

-cds    By default, only predicted peptides are printed - these are
        useful for searching against protein databases, aligning to
        known proteins, running through motif finding programs such
        as Pfam (Sonnhammer et al., 2000), etc. If the -cds flag is
        set, predicted CDS (coding sequences) are also printed. These
        are useful for searching predicted genes against cDNA or EST
        databases, design of RT-PCR primers, etc.

-ps     Create PostScript format graphical output, diagramming the
        locations and DNA strand of all predicted exons/genes. The
        locations of BLASTX hits are also displayed. Exons on the
        "forward" (input) strand of the sequence are displayed above
        the sequence line; exons on the reverse strand are displayed
        below this line.

        The argument "psfname" is the name of the file for the
        PostScript output (should end in ".ps"). This argument is
        required whenever the "-ps" flag is used.

        The "scale" argument tells the program what scale to make the
        PostScript image - how many base pairs to represent per line.
        This number must be no greater than one fourth the length of
        the sequence because at most four lines fit on a page. If
        this argument is omitted, the program chooses a reasonable
        scale for the image.

        The PostScript output can be printed on a PostScript printer.
        It can be viewed using any of several PostScript interpreters
        such as ghostscript/ghostview, pageview, xpsview, etc.
        PostScript files can be converted to other formats such as
        EPSI, PDF, gif, jpeg, etc. using ghostscript and/or other
        available utilities.

ESSENTIAL ARGUMENTS
___________________

*** The Parameter File ***

The parameter file must follow a very specific format to be read by GenomeScan (same format as used by Genscan) and should not be modified. Separate parameter files are needed for different organisms. Currently available parameter files are:

HumanIso.smat       human/vertebrate sequences (also Drosophila)
Arabidopsis.smat    Arabidopsis thaliana sequences
Maize.smat          Zea mays (corn) sequences

The full path of the parameter file is the first command line argument. You can type the full path every time you run the program or save some typing by using aliases.
For example, assuming that the parameter files have been installed in /usr/lib/GENOMESCAN, put the following aliases in your .cshrc file (normally located in your home directory):

alias genomevert genomescan /usr/lib/GENOMESCAN/HumanIso.smat
alias genomearab genomescan /usr/lib/GENOMESCAN/Arabidopsis.smat
alias genomemaiz genomescan /usr/lib/GENOMESCAN/Maize.smat

That way (after you source .cshrc) you can simply type

> genomevert SEQFILE ...

to run the program on SEQFILE with the human/vertebrate parameters.

*** The Sequence File ***

The sequence file may be in either FastA or minimal GenBank format. These formats are described below, with examples of each.

A sequence in FastA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The sequence data may be upper or lower case. All spaces, tabs, numbers or other non-alphabet characters are ignored by GenomeScan, with the exception of asterisks ("*"), which are treated as unknown nucleotides (N's). GenomeScan does not distinguish between special symbols indicating purine or pyrimidine nucleotides such as R, Y, etc., but treats all letters other than A, C, G, T as unknowns (N). It is usually an easy task to convert sequences stored in other formats such as Intelligenetics, EMBL, etc. to FastA format either by hand or using any of several standard utilities such as ReadSeq. A sample FastA format sequence is shown below:

>HC2667A BAC clone from human chromosome 5q22
GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGGCGGCGCCGCGCT
GCCGGCGCAGGGAGGGCCTCTGGTGCACCGGCACCGCTGAGTCGGGTTCTCTCGCCGGCC
TGTTCCCGGGAGAGCCCGGGGCCCTGCTCGGAGATGCCGCCCCGGGCCCCCAGACACCGG
......

GenomeScan may also be run on files in "minimal GenBank" format. This includes files in proper GenBank format as well as files which contain only partial GenBank annotation (see below).
The main reason why GenomeScan has been written so as to accept GenBank annotated as well as unannotated (FastA) files is so that the predictive accuracy of the program can be easily measured on sequences with known gene locations. When run on a GenBank file containing a feature table, the program automatically compares its predictions to the annotated CDS (coding sequence) features, displays a summary of the annotated as well as predicted exons, and calculates some standard measures of predictive accuracy such as nucleotide- and exon-level sensitivity, specificity and so on. (The conventions used to calculate these statistics are those described by Burset and Guigo, 1996 - see REFERENCES). This makes it relatively easy to check the program's accuracy for any particular set of sequences for which the annotated CDS features are considered reliable and complete. Of course, the program does not actually use the annotation in any way to make its predictions: that would be silly. Therefore the set of predicted genes/exons will be identical for a sequence in GenBank format as for the same sequence in FastA format.

GenBank format files must follow an elaborate set of formatting conventions devised by the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. However, GenomeScan needs only a fraction of the complete GenBank annotation (see below) for its purposes. Only these lines need be present (in the proper order) in an input file which is to be read by the program. This "minimal GenBank" format is as follows.

The LOCUS and ORIGIN lines must be present. The first line of the file must be the LOCUS line and this line must be in proper GenBank format. The ORIGIN line must begin with the word ORIGIN but has no other special restrictions. The sequence must begin on the line immediately after the ORIGIN line. All other lines normally present in GenBank files (e.g., ACCESSION, KEYWORDS, etc.) are optional.
However, if a feature table is present, it must begin with a line:

FEATURES             Location/Qualifiers

(with the same spacing/capitalization/etc. as in a GenBank file) and must end with a BASE COUNT line, followed by an ORIGIN line, and then the sequence. The feature table must be in correct GenBank feature table format (including spacing) - consult the NCBI GenBank format description or look at some real GenBank files if in doubt. All features other than those labeled "CDS" are ignored. All CDS features are read and the complete set of annotated coding exons are compared to the complete set of predicted coding exons.

The format for the LOCUS line is:

LOCUS       SEQNAME      seqlen bp seqtype          taxgroup  date

where SEQNAME is the name of the sequence (any string), seqlen is the length of the sequence in base pairs (which must match the actual number of sequence characters in the file), and the last three strings describe the type of sequence (e.g., ds-DNA, cDNA), the taxonomic group code, and the date.

The sequence may be upper or lower case and all spaces, tabs, numbers, etc. occurring after the ORIGIN line are ignored. A sequence in minimal GenBank format is given below:

LOCUS       HUMRASH      6453 bp ds-DNA             PRI       15-MAR-1988
DEFINITION  Human c-Ha-ras1 proto-oncogene, complete coding sequence.
ACCESSION   J00277 J00206 J00276 K00954
FEATURES             Location/Qualifiers
     prim_transcript <1664..3744
                     /note="c-Ha-ras1 mRNA"
     CDS             join(1664..1774,2042..2220,2374..2533,3231..3350)
                     /note="c-Ha-ras1 p21 protein; NCBI gi: 190891."
                     /codon_start=1
     source          1..6453
                     /organism="Homo sapiens"
BASE COUNT      946 a   2287 c   2113 g   1107 t
ORIGIN      1 bp upstream of BamHI site.
        1 ggatcccagc ctttccccag cccgtagccc cgggacctcc gcggtgggcg gcgccgcgct
       61 gccggcgcag ggagggcctc tggtgcaccg gcaccgctga gtcgggttct ctcgccggcc
      ... ..........

*** The Genoa File ***

The Genoa file lists sequence similarity information and must follow a precise format described in section 5.

______________________________________________________________________

3. GENOMESCAN OUTPUT

By default, the text output of the program is directed to stdout, which means that if you simply run the program without redirecting or piping the output, it will be printed to the screen. The reason for this is so that the user can redirect the output to a file:

> genomescan HumanIso.smat SEQFILE -g GENOAFILE > SEQFILE.out

or pipe the text output through some sort of filtering program to put it in a more convenient form:

> genomescan HumanIso.smat SEQFILE -g GENOAFILE | genomefilter

Of course, it is your responsibility to create the filtering program. The format of the text output file is the same as that produced by Genscan so that filtering programs developed for Genscan will also work with GenomeScan. This format is described at the bottom of the output file if the verbose (-v) flag is used.

The PostScript output is a diagram of the locations of predicted exons (red boxes) and BLASTX hits (green boxes) in the genomic sequence.

______________________________________________________________________

4. RUNNING GENOMESCAN WITH GENOMESCRIPT

GenomeScript is a Perl script which performs the following operations on an input genomic sequence:

1) Mask interspersed repetitive elements in the genomic sequence with RepeatMasker (A. Smit and P. Green).

2) Run Genscan on the masked genomic sequence and search the predicted peptides against the protein database specified as an argument to GenomeScript using BLASTP with default parameters and E-value cutoff 1e-5.

3) Create a restricted protein database consisting of all proteins hit in step 2), then run BLASTX of the masked genomic sequence against this restricted database with parameters -G 20, -E 3, -e 0.05 (increased gap penalties and E-value cutoff 0.05). Convert the resulting BLASTX output to Genoa format using 'blastx2genoa'.

4) Run GenomeScan on the masked genomic sequence using the Genoa file from the previous step as input.

Additional details can be found in the genomescript file itself.
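
The final step above, combined with the output redirection shown in section 3, can be sketched in Python roughly as follows. This is only an illustration and not part of the GenomeScan package: the function names and the file names in the usage example are hypothetical, and only the argument order and the -v and -cds flags are taken from the usage message in section 2.

```python
import subprocess


def build_genomescan_command(paramfile, seqfile, genoafile,
                             verbose=False, print_cds=False):
    """Assemble: genomescan paramfile seqfile -g genoafile [optional arguments]."""
    cmd = ["genomescan", str(paramfile), str(seqfile), "-g", str(genoafile)]
    if verbose:
        cmd.append("-v")    # extra explanatory information in the text output
    if print_cds:
        cmd.append("-cds")  # also print predicted coding sequences
    return cmd


def run_genomescan(cmd, outfile):
    """Run the command and save the text output (stdout) to outfile."""
    with open(outfile, "w") as handle:
        # check=True raises CalledProcessError if the program exits with an error
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_genomescan_command("HumanIso.smat", "SEQFILE",
                                       "GENOAFILE", verbose=True)
    print(" ".join(command))
```

Keeping the command construction separate from the call that runs it makes the argument list easy to inspect or log before GenomeScan is actually started.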
______________________________________________________________________

5. GENOA FORMAT AND BLASTX2GENOA

BLASTX output is converted to the 'Genoa' format required by GenomeScan using a Perl script 'blastx2genoa'. Genoa format is a simple representation of BLASTX (or other) similarity information in which each BLASTX hit is represented by a single line with 15 columns. A sample is shown below:

Column1 2    3  4         5     6 7  8    9    10   11   12    13    14 15
------- ---- -- --------- ----- - -- ---- ---- ---- ---- ----- ----- -- ------
Hit33.1 Pept nr AAC52323+ Mouse 1 26 1812 4639 4716 4690 1e-08 1e-08 0  BLASTX

In the description below, the genomic sequence is the query and protein sequences involved in BLASTX hits are referred to as subject.

Meanings of columns no. 1-15:

 1) A label for the hit (arbitrary but must be unique in Genoa file)

 2) Type of sequence hit (Pept = peptide for a BLASTX hit)

 3) Name of database searched (for reference)

 4) Locus name or accession number of subject protein (for reference)
    A + after the name indicates that the protein sequence begins
    with Methionine. A - after the name indicates that it does not.

 5) Name of organism from which protein derived (for reference)
    The default value of this field is 'Unknown'. This field is not
    currently used by GenomeScan

 6) Location of beginning of BLASTX hit in subject protein (aa)

 7) Location of end of BLASTX hit in subject protein (aa)

 8) Length of subject protein (aa)

 9) Location of beginning of hit in query genomic sequence
    (nucleotide position on forward DNA strand)

10) Location of end of hit in query genomic sequence
    (nucleotide position on forward DNA strand)
    For hits to reverse strand reading frames, column 9 exceeds
    column 10.

11) 'Centroid' of BLASTX hit in query genomic sequence (defined below)

12) BLASTX E-value of hit

13) BLASTX E-value of 'parent' hit (see below)
    Often, the hit and the parent hit are one and the same. These
    columns differ when a 'parent' BLASTX hit is very long or has a
    multimodal score distribution and spawns multiple Genoa hits
    with different centroids.

14) Reading frame of hit. Positions are always numbered on the
    forward strand starting from 1. The conventions used by
    GenomeScan are as follows.

    Forward strand reading frames:
      A codon at position [x-2,x-1,x] has reading frame x modulo 3
      For example, the codon at [59,60,61] has reading frame 1

    Reverse strand reading frames:
      A codon at position [x,x-1,x-2] (forward strand coordinates)
      has reading frame (x modulo 3) + 3
      For example, the codon at [20,19,18] has reading frame 5

15) Name of program used for alignments (for reference)

Columns marked 'for reference' do not affect the output of the program - this information is present as a record of the source of the similarity information.

______________________________________________________________________

6. TECHNICAL DETAILS

This section describes some technical details of the GenomeScan algorithm which are likely to be of interest only to bioinformatics aficionados.

1) Definition of centroid and treatment of long hits and multimodal hits

The centroid of a BLASTX hit is defined in terms of the cumulative score plot obtained by summing the raw BLOSUM match scores across the hit in question. Typically, human exons are represented by medium-sized BLASTX hits (between 15 and 100 amino acids/codons in length) which correspond reasonably well to the boundaries of the exon (human exons are typically between 15 and 100 codons in length). In this case, the goal is to position the centroid at the position within the hit which is most likely to be contained within the exon. Therefore, the centroid is defined using a steepest slope heuristic as the point C which has steepest slope in the cumulative score plot over a window of 15 codons centered at C.
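
The steepest slope rule can be sketched in Python as follows. This is a minimal illustration, not the code used by blastx2genoa: the function name and the per-codon `scores` input are hypothetical, and several details are assumptions (only windows that fit entirely inside the hit are considered, ties go to the first maximum, and hits shorter than the window fall back to the midpoint codon, following the rule the text gives for very short hits).

```python
from itertools import accumulate

WINDOW = 15  # window length in codons, as stated in the text


def centroid_codon(scores, window=WINDOW):
    """Return the 0-based codon index of the centroid of one hit.

    scores: raw substitution-matrix score of each aligned codon along
    the hit (hypothetical input, one number per codon).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("hit has no aligned codons")
    if n < window:
        # Assumption: slope cannot be measured, so use the midpoint codon.
        return (n - 1) // 2

    # cumulative[i] is the total score of the first i codons.
    cumulative = [0] + list(accumulate(scores))
    half = window // 2

    best_center = half
    best_rise = cumulative[window] - cumulative[0]
    for center in range(half + 1, n - half):
        # Rise of the cumulative plot across the window centered here;
        # dividing by the fixed window length would not change the argmax.
        rise = cumulative[center + half + 1] - cumulative[center - half]
        if rise > best_rise:
            best_center, best_rise = center, rise
    return best_center
```

Because the window length is fixed, comparing the rise of the cumulative score across each window is equivalent to comparing slopes.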
For very short hits (< 15 amino acids), differences in slope cannot be reliably measured, so the centroid is defined as the midpoint of the hit in the query genomic sequence.

In some cases, the cumulative score plot may be multimodal. For our purposes, a score distribution is multimodal if it has (at least) one valley whose score is at least 40 below the previous peak (raw BLOSUM score units), followed by a peak which is at least 40 above the low point of the previous valley. This situation has two typical causes: 1) the valley(s) correspond to poorly conserved regions of a large exon; or 2) the hit covers two (or more) closely spaced exons and the valleys represent introns. The heuristic used, which handles both of these situations gracefully, is to break the BLASTX hit (referred to as the 'parent' hit) into as many Genoa hits as there are modes of the distribution, where each mode corresponds to a peak of the score distribution which has a score at least 40 higher than the previous valley as described above. The E-values of these 'daughter' hits are derived from that of the parent hit in proportion to the heights of the modes as in the following example. Consider a BLASTX hit with E-value 1e-9 which has two modes, one with a peak height of 100 (relative to the previous valley), the other with a peak height of 50. In this case, the first subhit would be assigned an E-value of (1e-9)^(100/150) = 1e-6, while the second would have an E-value of (1e-9)^(50/150) = 1e-3. (The notation x^y indicates x to the y power.) The centroids of these subhits are defined as above, using the steepest slope or midpoint rule depending on size.

Very long BLASTX hits (> 100 codons/amino acids) generally correspond to (rare) long exons or intronless genes (or pseudogenes).
Using the steepest slope heuristic in this case would tend to concentrate the entire hit into a single point, which would not accurately represent the information contained in an extended region of BLASTX similarity. Therefore, very long hits are also broken into multiple daughter hits, using the following heuristic. First, the hit is 'trimmed' by moving the 5' boundary to the genomic position where the cumulative score plot first exceeds a BLOSUM score of 50, and similarly at the 3' end, to ensure that the new boundaries are internal to the coding region which is present. Next, this core region of the hit (of length L bases) is broken up into N subhits, where N is defined as the greatest integer less than L/75. The centroids of the subhits are then placed at N positions spaced equidistantly across this core region of the hit. The P-values of the daughter hits are set to p = P^(1/N), where P is the P-value of the parent hit and N is the number of daughter hits.

These heuristics are based on reasonable principles but are not necessarily 'optimal' in any way. However, they have withstood fairly extensive testing and appear to work well in practice. All of the processing of BLASTX hits into Genoa hits, determination of multimodality, calculation of centroids, etc. is carried out by the script blastx2genoa.

2) Using BLASTX hits to identify initiation and termination codons

Consider the example given in the GenomeScan paper in which residues 6-50 of a protein have a BLASTX match to genomic coordinates 116-250. In this case it stands to reason that an ATG located five codons upstream at position 101-103 in the genomic sequence (if one occurs there) has a higher likelihood of representing an initiation codon than some other randomly chosen ATG in the genomic sequence. This observation leads to the start codon heuristic used by GenomeScan which is applied to BLASTX hits which begin at positions X within 30 residues of the start of the target protein.
In this heuristic, the probability of any parse which involves an initiation codon exactly X-1 codons upstream of the genomic location of the beginning of the BLASTX hit (as in the example above) is increased by the factor START_FACTOR, which may be set as a command-line argument to GenomeScan (the default value of this parameter, 1e6, was determined empirically). Some peptides in available protein databases derive from partial cDNAs (often 5'-truncated) and therefore do not represent complete proteins. To filter out these incomplete proteins, the above heuristic is applied only when the subject protein begins with Methionine (a reasonable but imperfect marker for completeness). These proteins are marked with a '+' in the Genoa file as in the example given in the previous section.

Now suppose that residues 450-495 of a 500 amino acid protein match the genomic region 1198-1335. In this case it stands to reason that a stop codon triplet (TAA, TAG or TGA) located at position 1351-1353 in the genomic sequence (if one occurs there), immediately after the five codons which would encode residues 496-500, is more likely to represent a stop codon than some randomly chosen genomic stop codon triplet. This observation leads to the stop codon heuristic used by GenomeScan which is applied to BLASTX hits which end at positions X >= L - 30 in the subject protein (where L is the length of the protein). In this case, the probability of any parse which involves a stop codon exactly (L-X) codons downstream of the genomic location of the end of the BLASTX hit is increased by the factor STOP_FACTOR. This parameter can also be set as a command-line argument to GenomeScan, and the empirically determined default value is also 1e6.

These start and stop codon heuristics are only applied to relatively strong BLASTX hits (E < 1e-6) since only relatively well conserved proteins are likely to give reliable start/stop locations.
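
The start codon heuristic can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not GenomeScan's implementation: the function names are hypothetical, only forward-strand hits are handled, and "within 30 residues of the start" is read as X <= 30.

```python
STRONG_HIT_EVALUE = 1e-6     # heuristic applies only to hits with E below this
MAX_OFFSET_RESIDUES = 30     # assumption: hit must begin at residue X <= 30
DEFAULT_START_FACTOR = 1e6   # default START_FACTOR given in the text


def favored_start_position(hit_protein_start, hit_genomic_start,
                           evalue, subject_begins_with_met):
    """Return the genomic position of the first base of the favored ATG,
    or None if the heuristic does not apply (forward strand only)."""
    if evalue >= STRONG_HIT_EVALUE:
        return None                  # hit too weak to trust the spacing
    if not subject_begins_with_met:
        return None                  # subject protein may be 5'-truncated
    if hit_protein_start > MAX_OFFSET_RESIDUES:
        return None                  # hit begins too far into the protein
    codons_upstream = hit_protein_start - 1
    return hit_genomic_start - 3 * codons_upstream


def start_weight(candidate_atg_position, favored_position,
                 start_factor=DEFAULT_START_FACTOR):
    """Factor applied to a parse whose initiation codon is at the candidate position."""
    if favored_position is not None and candidate_atg_position == favored_position:
        return start_factor
    return 1.0


# Example from the text: residues 6-50 match genomic coordinates 116-250,
# so the favored ATG begins five codons upstream, at position 101.
# The E-value 1e-9 is a hypothetical value chosen for illustration.
example = favored_start_position(6, 116, 1e-9, True)
```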
(Distant homologs will often have acquired insertions/deletions near the ends of the protein causing the spacing relative to the start/stop codons to change.) Again, these heuristics are based on reasonable principles, but are not necessarily 'optimal'. However, they appear to work quite well in practice and help to overcome the well-known weakness of ab initio gene prediction methods such as Genscan in terms of prediction of the first and last exons of a gene (Burge and Karlin, 1998 - see REFERENCES).

3) Using BLASTX hits to identify intronic regions

Suppose that BLASTX hit B1 matches residues 101-150 of protein P to nucleotides 850-999 of the query genomic sequence, and BLASTX hit B2 matches residues 151-200 of protein P to nucleotides 2001-2150 of the genomic sequence. This sort of arrangement (adjacent hits in protein match nearby but distinct regions of genomic sequence) often indicates a pair of adjacent exons separated by an intron and provides the additional information that the region from approximately 1000-2000 in the genomic sequence is likely to be intronic (and therefore would not contain additional exons). This observation leads to the intron heuristic used by GenomeScan which is applied when certain conditions are met which make it very likely that the region between two BLASTX hits is intronic. Specifically, the conditions are that there must be two relatively strong (E < 1e-6) BLASTX hits in a region of genomic DNA which match adjacent or nearly adjacent regions of the same subject protein (the end of the upstream hit differs from the beginning of the downstream hit by 5 or fewer residues) but are not adjacent in the genomic sequence (> 60 bp apart, the minimum length of a human intron).
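
The conditions just listed can be sketched in Python as follows. This is a minimal illustration, not the test performed by blastx2genoa: the `BlastxHit` record and the function name are hypothetical, both hits are assumed to lie on the forward strand, and "> 60 bp apart" is read as more than 60 bases lying strictly between the two hits.

```python
from dataclasses import dataclass


@dataclass
class BlastxHit:
    """Hypothetical record for one BLASTX hit (forward strand)."""
    protein_id: str
    protein_start: int   # first matched residue in the subject protein
    protein_end: int     # last matched residue in the subject protein
    genomic_start: int   # first matched nucleotide in the query
    genomic_end: int     # last matched nucleotide in the query
    evalue: float


def suggests_intron(upstream, downstream):
    """Return True if the two hits meet the conditions described above."""
    both_strong = upstream.evalue < 1e-6 and downstream.evalue < 1e-6
    same_protein = upstream.protein_id == downstream.protein_id
    # End of the upstream hit vs. beginning of the downstream hit, in residues.
    protein_offset = abs(downstream.protein_start - upstream.protein_end)
    # Number of genomic bases lying strictly between the two hits.
    genomic_gap = downstream.genomic_start - upstream.genomic_end - 1
    return (both_strong and same_protein
            and protein_offset <= 5 and genomic_gap > 60)


# Hits B1 and B2 from the text; the E-values are hypothetical.
b1 = BlastxHit("P", 101, 150, 850, 999, 1e-10)
b2 = BlastxHit("P", 151, 200, 2001, 2150, 1e-10)
```

With these values, `suggests_intron(b1, b2)` evaluates to True: the protein coordinates differ by 1 residue and 1001 bases separate the two hits.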
In this case, a special type of Genoa hit (a Genoa 'intron hit') is generated by blastx2genoa which specifies that the region of the genomic sequence beginning 30 bp after the end of the upstream hit and ending 30 bp before the beginning of the downstream hit is likely to be intronic with P-value pI = 1 - (1 - pB1) x (1 - pB2), where pB1 and pB2 are the P-values of the BLASTX hits B1 and B2. The reasoning behind this formula is that the inference that the region is intronic is valid only if both flanking BLASTX hits are correct, an event which has probability (1 - pB1) x (1 - pB2) assuming independence between hits (and before application of the root-r heuristic). The P-values of Genoa intron hits are adjusted using the root-r heuristic as for normal Genoa hits. The 30 bp offset is used to ensure that the specified intronic region is very unlikely to overlap with either of the flanking exons.

Internally, the GenomeScan program reduces the probabilities of parses which involve an exon in the region specified by a Genoa intron hit in a manner analogous to the way that regular Genoa hits reduce the probabilities of parses which do NOT contain overlapping exons (described in the Methods section of the GenomeScan paper). A special notation in the Genoa file distinguishes Genoa intron hits from regular Genoa hits.

4) Treatment of multiple BLASTX / Genoa hits in a sequence

It is common that multiple BLASTX hits to different homologous proteins will overlap essentially the same region of a genomic sequence. Since these hits generally provide redundant information, in these situations only the strongest (lowest E-value) hit is retained by GenomeScan - the other overlapping hits are simply ignored. Once this initial pruning of hits has been performed, many non-overlapping BLASTX / Genoa hits will usually remain. Multiple non-overlapping hits are handled in the GenomeScan model as follows.
Consider the case where there are two non-overlapping Genoa hits, G1 and G2. Using the notation from the Methods section of the GenomeScan paper, P(phi,S|G1,G2) is defined as:

(2a)  (1-pG1)(1-pG2) P(phi,S)
      if the parse phi is inconsistent with both G1 and G2

(2b)  (1-pG1)[(pG2/P(PhiG2)) + (1-pG2)] P(phi,S)
      if phi is consistent with G2 but not with G1

(2c)  (1-pG2)[(pG1/P(PhiG1)) + (1-pG1)] P(phi,S)
      if phi is consistent with G1 but not with G2

(2d)  [(pG1/P(PhiG1)) + (1-pG1)] [(pG2/P(PhiG2)) + (1-pG2)] P(phi,S)
      if phi is consistent with both G1 and G2

Equations 2a-2d can be derived using similar reasoning to that used to derive eq. (1) in the Methods section of the GenomeScan paper, and the assumption that the events HG1, HG2 are independent and one more subtle assumption, essentially that the Genoa hits influence the probabilities of parses independently of each other. Although there are reasons to suspect that these independence assumptions are not strictly correct, there are good reasons to make them anyway: 1) they may be approximately correct; 2) they appear to work well in practice; and 3) without them the model becomes computationally intractable. These assumptions are in this sense analogous to the Markov assumption made in HMM models that state k+1 depends only on state k. In typical applications (such as HMM gene models or profile HMMs for protein sequence alignment), there are almost always reasons to suspect that some longer range dependence exists, but to account for this would destroy the computational tractability of the Viterbi, forward and backward algorithms, which is one of the principal virtues of HMM models in the first place.

______________________________________________________________________

7. WEB PAGES

GenomeScan web page     http://genes.mit.edu/genomescan
Burge lab home page     http://genes.mit.edu/burgelab

______________________________________________________________________

8. REFERENCES

R.-F. Yeh, L. P. Lim and C. B. Burge, 2001. Computational Inference of Homologous Gene Structures in the Human Genome. Genome Research (in press).
    The original GenomeScan paper.

Burge, C. and Karlin, S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354.
    A review of gene finding methods.

Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
    The original Genscan paper.

Burge, C. (1997) Identification of genes in human genomic DNA. PhD thesis, Stanford University, Stanford, CA.
    A detailed description of the models and algorithms underlying Genscan.

Burset, M. & Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics 34, 353-367.
    The classic comparative study of gene prediction methods.

______________________________________________________________________

Copyright (c) 2000-2001, Christopher Burge

______________________________________________________________________