**********************************************************************
**********************************************************************
**                                                                  **
**      GENOMESCAN DOCUMENTATION                                    **
**                                                                  **
**                                                                  **
**              Christopher Burge                                   **
**                                                                  **
**              MIT                                                 **
**              Department of Biology                               **
**              77 Massachusetts Ave., 68-222                       **
**              Cambridge, MA  02139                                **
**                                                                  **
**              cburge@mit.edu                                      **
**                                                                  **
**********************************************************************
**********************************************************************

______________________________________________________________________

ORGANIZATION OF THIS FILE

    1.  OVERVIEW OF GENOMESCAN
    2.  GENOMESCAN INPUT
    3.  GENOMESCAN OUTPUT
    4.  RUNNING GENOMESCAN WITH GENOMESCRIPT
    5.  GENOA FORMAT AND BLASTX2GENOA
    6.  TECHNICAL DETAILS
    7.  WEB PAGES
    8.  REFERENCES

______________________________________________________________________


1. OVERVIEW OF GENOMESCAN

	GenomeScan is a program for identifying the exon-intron
structures of genes in genomic DNA sequences from a variety of
organisms, with a focus on human and other vertebrates.  The algorithm
combines two principal sources of information: 1) models of
exon-intron and splice signal composition; and 2) sequence similarity
information such as BLASTX hits.  The input to the program consists of
a genomic sequence previously masked with RepeatMasker, a parameter
file for the appropriate organism and a 'Genoa file' containing a
summary of available sequence similarity information.  The program
determines the most likely "parse" (gene structure) conditional on the
given similarity information under a probabilistic model of the gene
structural and compositional properties of genomic DNA for the given
organism.  The locations of all predicted exons and genes are printed
to an output file (the text output) together with the corresponding
predicted CDS (coding DNA) and peptide sequences and a summary of the
similarity information used in the predictions.  A graphical
(PostScript) output is also created displaying the location of each
predicted exon and of each BLASTX hit.  Like Genscan, the model treats
the general case in which the sequence may contain no genes, one gene,
or multiple genes on either or both DNA strands, and partial genes as
well as complete genes are considered.  The most important
restrictions are that only protein coding genes are considered (and
not tRNA or rRNA genes, for example), that transcription units are
assumed to be non-overlapping, and that all predicted genes must have
at least modest similarity to a known protein.

	The probabilistic model used by GenomeScan is based on that
used by Genscan and accounts for many essential features of gene
structure such as gene density, the typical number of exons per gene,
and the distribution of exon sizes for different types of exons; and also
many of the important compositional properties of genes, e.g., the
reading frame-specific hexamer composition of coding regions, the
(reading frame-independent) hexamer composition of introns and
intergenic regions, and the position-specific composition of the
translation initiation (Kozak) and termination signals, and of the
TATA box, cap site and polyadenylation signals.  Models of the donor
and acceptor splice sites are used which capture potentially important
dependencies (interactions) between positions in these signals.  For
human and vertebrate sequences, separate sets of model parameters are
used which account for the many substantial differences in gene
density and structure observed in distinct C+G% compositional regions
of the human genome and the genomes of other vertebrates.  The Genscan
model is described in Burge and Karlin, 1997 (see REFERENCES) and in
greater detail in my thesis (http://genes.mit.edu/chris).  In addition
to this documentation, GenomeScan is described in the following paper:

R.-F. Yeh, L. P. Lim and C. B. Burge, 2001.  Computational Inference
of Homologous Gene Structures in the Human Genome.  Genome Research (in press).

______________________________________________________________________


2. GENOMESCAN INPUT

	 After installing the GenomeScan package on a Unix system (not
described in this document), typing "genomescan" at the prompt lists
command line arguments:

% genomescan

usage: genomescan paramfile seqfile -g genoafile [optional arguments]

The three essential arguments (the parameter file, sequence file and
Genoa file) are described later.  The optional arguments are as
follows:


OPTIONAL ARGUMENTS
__________________

-v      Add extra explanatory information to the text output.
        This information is helpful the first few times the program
        is run but soon becomes unnecessary (that's why it's optional).

-GC     Consider potential 5' splice sites which have /GC as well as
        /GT as the first two bases of the intron (default: GT only)

-F      Consider potential genes on forward DNA strand only

-R      Consider potential genes on reverse DNA strand only

        By default, potential genes on either or both DNA strands are
        considered, the only restriction being that the gene
        structures in a parse must not overlap.

-r      Sets value of r parameter used in root-r heuristic
        (see Methods section of GenomeScan manuscript)

-start  Sets value of START_FACTOR parameter (see below)

-stop   Sets value of STOP_FACTOR parameter (see below)

-cds    By default, only predicted peptides are printed - these are
        useful for searching against protein databases, aligning to
        known proteins, running through motif finding programs such as
        Pfam (Sonnhammer et al., 2000), etc.  If the -cds flag is set,       
        predicted CDS (coding sequences) are also printed.  These are
        useful for searching predicted genes against cDNA or EST
        databases, design of RT-PCR primers, etc.

-ps     Create PostScript format graphical output, diagramming the
        locations and DNA strand of all predicted exons/genes.  The
        locations of BLASTX hits are also displayed.  Exons on the   
        "forward" (input) strand of the sequence are displayed above
        the sequence line; exons on the reverse strand are displayed
        below this line.

        The argument "psfname", given after the "-ps" flag, is the
        name of the file for the PostScript output (should end in
        ".ps").  This argument is required whenever the "-ps" flag
        is used.

        The optional "scale" argument sets the scale of the
        PostScript image, i.e. how many base pairs to represent per line.
        This number must be no greater than one fourth the length of
        the sequence because at most four lines fit on a page.  If
        this argument is omitted, the program chooses a reasonable
        scale for the image.

        The PostScript output can be printed on a PostScript printer.
        It can be viewed using any of several PostScript interpreters
        such as ghostscript/ghostview, pageview, xpsview, etc.  
        PostScript files can be converted to other formats such as
        EPSI, PDF, gif, jpeg, etc. using ghostscript and/or other      
        available utilities.


ESSENTIAL ARGUMENTS
___________________


*** The Parameter File ***

The parameter file must follow a very specific format to be read
by GenomeScan (same format as used by Genscan) and should not be modified.

Separate parameter files are needed for different organisms.
Currently available parameter files are:

HumanIso.smat           human/vertebrate sequences (also Drosophila)
Arabidopsis.smat        Arabidopsis thaliana sequences
Maize.smat              Zea mays (corn) sequences

The full path of the parameter file is the first command line
argument.  You can type the full path every time you run the program
or save some typing by using aliases.  For example, assuming that the
parameter files have been installed in /usr/lib/GENOMESCAN, put the
following aliases in your .cshrc file (normally located in your home
directory):

alias genomevert genomescan /usr/lib/GENOMESCAN/HumanIso.smat
alias genomearab genomescan /usr/lib/GENOMESCAN/Arabidopsis.smat
alias genomemaiz genomescan /usr/lib/GENOMESCAN/Maize.smat

That way (after you source .cshrc) you can simply type

        % genomevert SEQFILE ...

to run the program on SEQFILE with the human/vertebrate parameters.

*** The Sequence File ***

The sequence file may be in either FastA or minimal GenBank format.
These formats are described below, with examples of each.

A sequence in FastA format begins with a single-line description,
followed by lines of sequence data.  The description line is
distinguished from the sequence data by a greater-than (">") symbol
in the first column.  The sequence data may be upper or lower case.
All spaces, tabs, numbers or other non-alphabet characters are
ignored by GenomeScan, with the exception of asterisks ("*"), which
are treated as unknown nucleotides (N's).  GenomeScan does not
distinguish between special symbols indicating purine or pyrimidine
nucleotides such as R, Y, etc., but treats all letters other than
A, C, G, T as unknowns (N).  It is usually an easy task to convert
sequences stored in other formats such as Intelligenetics, EMBL, etc.
to FastA format either by hand or using any of several standard
utilities such as ReadSeq.

A sample FastA format sequence is shown below:

>HC2667A BAC clone from human chromosome 5q22
GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGGCGGCGCCGCGCT
GCCGGCGCAGGGAGGGCCTCTGGTGCACCGGCACCGCTGAGTCGGGTTCTCTCGCCGGCC
TGTTCCCGGGAGAGCCCGGGGCCCTGCTCGGAGATGCCGCCCCGGGCCCCCAGACACCGG
......
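
The character handling described above can be sketched in a few lines
of Python.  This is only an illustration of the stated rules, not the
code GenomeScan itself uses; the function name read_fasta is
hypothetical and the sketch assumes a single sequence per file.

```python
def read_fasta(text):
    """Return (description, sequence) for a single-record FastA string.

    Follows the rules stated in this document: letters other than
    A, C, G, T become N, asterisks become N, and all other
    non-alphabet characters (spaces, tabs, digits, dots) are ignored.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FastA description line must start with '>'")
    description = lines[0][1:].strip()
    bases = []
    for line in lines[1:]:
        for ch in line:
            if ch == "*":
                bases.append("N")        # asterisk = unknown nucleotide
            elif ch.isalpha():
                upper = ch.upper()       # upper and lower case are equivalent
                bases.append(upper if upper in "ACGT" else "N")
            # anything else (spaces, tabs, digits, punctuation) is skipped
    return description, "".join(bases)


description, sequence = read_fasta(">toy example\nacgt RY*\n12 nnAC\n")
print(description, sequence)
```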

GenomeScan may also be run on files in "minimal GenBank" format.  This
includes files in proper GenBank format as well as files which contain
only partial GenBank annotation (see below). The main reason why
GenomeScan has been written so as to accept GenBank annotated as well
as unannotated (FastA) files is so that the predictive accuracy of the
program can be easily measured on sequences with known gene
locations. When run on a GenBank file containing a feature table, the
program automatically compares its predictions to the annotated CDS
(coding sequence) features, displays a summary of the annotated as
well as predicted exons, and calculates some standard measures of
predictive accuracy such as nucleotide- and exon-level sensitivity,
specificity and so on.  (The conventions used to calculate these
statistics are those described by Burset and Guigo, 1996 - see
REFERENCES).  This makes it relatively easy to check the program's
accuracy for any particular set of sequences for which the annotated
CDS features are considered reliable and complete. Of course, the
program does not actually use the annotation in any way to make its
predictions: that would be silly.  Therefore the set of predicted
genes/exons will be identical for a sequence in GenBank format as for
the same sequence in FastA format.

GenBank format files must follow an elaborate set of formatting
conventions devised by the National Center for Biotechnology
Information (NCBI) in Bethesda, Maryland.  However, GenomeScan needs only a
fraction of the complete GenBank annotation (see below) for its
purposes.  Only these lines need be present (in the proper order) in
an input file which is to be read by the program.  This "minimal
GenBank" format is as follows.

The LOCUS and ORIGIN lines must be present.  The first line of the
file must be the LOCUS line and this line must be in proper GenBank
format.  The ORIGIN line must begin with the word ORIGIN but has no
other special restrictions.  The sequence must begin on the line
immediately after the ORIGIN line.  All other lines normally present
in GenBank files (e.g., ACCESSION, KEYWORDS, etc.) are optional.
However, if a feature table is present, it must begin with a line:

FEATURES             Location/Qualifiers

(with the same spacing/capitalization/etc. as in a GenBank file) and
must end with a BASE COUNT line, followed by an ORIGIN line, and then
the sequence.  The feature table must be in correct GenBank feature
table format (including spacing) - consult the NCBI GenBank format
description or look at some real GenBank files if in doubt.  All
features other than those labeled "CDS" are ignored.  All CDS features
are read and the complete set of annotated coding exons are compared
to the complete set of predicted coding exons.

The format for the LOCUS line is:

LOCUS SEQNAME seqlen bp seqtype taxgroup date

where SEQNAME is the name of the sequence (any string), seqlen is the
length of the sequence in base pairs (which must match the actual
number of sequence characters in the file), and the last three strings
describe the type of sequence (e.g., ds-DNA, cDNA), the taxonomic
group code, and the date.  The sequence may be upper or lower case and
all spaces, tabs, numbers, etc. occurring after the ORIGIN line are
ignored.  A sequence in minimal GenBank format is given below:

LOCUS       HUMRASH      6453 bp ds-DNA          PRI     15-MAR-1988
DEFINITION  Human c-Ha-ras1 proto-oncogene, complete coding sequence.
ACCESSION   J00277 J00206 J00276 K00954
FEATURES             Location/Qualifiers
     prim_transcript <1664..3744
                     /note="c-Ha-ras1 mRNA"
     CDS             join(1664..1774,2042..2220,2374..2533,3231..3350)
                     /note="c-Ha-ras1 p21 protein;  NCBI gi: 190891."
                     /codon_start=1
     source          1..6453
                     /organism="Homo sapiens"
BASE COUNT      946 a   2287 c   2113 g   1107 t
ORIGIN      1 bp upstream of BamHI site.
  1 ggatcccagc ctttccccag cccgtagccc cgggacctcc gcggtgggcg gcgccgcgct
 61 gccggcgcag ggagggcctc tggtgcaccg gcaccgctga gtcgggttct ctcgccggcc
... ..........
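
As a rough illustration of the minimal GenBank rules above, the
following Python sketch reads the LOCUS line, collects the sequence
after the ORIGIN line and checks it against the declared length.  It
is not GenomeScan's own parser.  The function name is hypothetical,
and the CDS handling is deliberately simplified: it assumes the whole
location fits on one line and ignores complement() and multi-line
join() locations.

```python
import re


def read_minimal_genbank(text):
    """Return (name, sequence, cds_exons) from a minimal GenBank record."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("LOCUS"):
        raise ValueError("the first line must be the LOCUS line")
    fields = lines[0].split()
    name, seqlen = fields[1], int(fields[2])

    origin = None
    for i, line in enumerate(lines):
        if line.startswith("ORIGIN"):
            origin = i
            break
    if origin is None:
        raise ValueError("no ORIGIN line found")

    # The sequence starts on the line after ORIGIN; spaces, tabs and
    # numbers are ignored and case does not matter.
    sequence = "".join(
        ch.upper() for line in lines[origin + 1:] for ch in line if ch.isalpha()
    )
    if len(sequence) != seqlen:
        raise ValueError(
            "LOCUS line says %d bp but %d were read" % (seqlen, len(sequence))
        )

    # Only CDS features are used; every begin..end pair is one coding exon.
    cds_exons = []
    for line in lines[1:origin]:
        match = re.match(r"\s+CDS\s+(\S+)", line)
        if match:
            for begin, end in re.findall(r"(\d+)\.\.(\d+)", match.group(1)):
                cds_exons.append((int(begin), int(end)))
    return name, sequence, cds_exons
```

Applied to the HUMRASH example above (with its full sequence), the
exon list would contain the four intervals of the join() location.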


*** The Genoa File ***

The Genoa file lists sequence similarity information and must follow a
precise format described in section 5.


______________________________________________________________________


3. GENOMESCAN OUTPUT

By default, the text output of the program is directed to stdout, so
if you run the program without redirecting or piping the output, it
is printed to the screen.  This allows the user to redirect the
output to a file:

% genomescan HumanIso.smat SEQFILE -g GENOAFILE > SEQFILE.out

or pipe the text output through some sort of filtering program to
put it in a more convenient form:

% genomescan HumanIso.smat SEQFILE -g GENOAFILE | genomefilter

Of course, it is your responsibility to create the filtering
program. The format of the text output file is the same as that
produced by Genscan so that filtering programs developed for Genscan
will also work with GenomeScan.  This format is described at the
bottom of the output file if the verbose (-v) flag is used.  The
PostScript output is a diagram of the locations of predicted exons
(red boxes) and BLASTX hits (green boxes) in the genomic sequence.
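
For users who prefer to drive the program from a script rather than
the shell, the redirection shown above can be sketched in Python as
follows.  The helper names are hypothetical; the sketch assumes that
"genomescan" is on the PATH and uses only the argument order from the
usage line and the -v, -cds, -F and -R flags documented in section 2.

```python
import subprocess


def genomescan_command(paramfile, seqfile, genoafile,
                       verbose=False, cds=False, strand=None):
    """Build the argument list: genomescan paramfile seqfile -g genoafile [...]."""
    cmd = ["genomescan", paramfile, seqfile, "-g", genoafile]
    if verbose:
        cmd.append("-v")
    if cds:
        cmd.append("-cds")
    if strand == "forward":
        cmd.append("-F")
    elif strand == "reverse":
        cmd.append("-R")
    elif strand is not None:
        raise ValueError("strand must be 'forward', 'reverse' or None")
    return cmd


def run_genomescan(cmd, outfile):
    """Run the command and write its text output (stdout) to outfile."""
    with open(outfile, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = genomescan_command("HumanIso.smat", "SEQFILE", "GENOAFILE", cds=True)
print(" ".join(cmd))
# run_genomescan(cmd, "SEQFILE.out")   # requires an installed genomescan
```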


______________________________________________________________________


4. RUNNING GENOMESCAN WITH GENOMESCRIPT

GenomeScript is a Perl script which performs the following operations
on an input genomic sequence:

1) Mask interspersed repetitive elements in the genomic sequence with
RepeatMasker (A. Smit and P. Green).

2) Run Genscan on the masked genomic sequence and search the predicted
peptides against the protein database specified as an argument to
GenomeScript using BLASTP with default parameters and E-value cutoff
1e-5.

3) Create a restricted protein database consisting of all proteins hit
in step 2), then run BLASTX of the masked genomic sequence against
this restricted database with parameters -G 20, -E 3, -e 0.05
(increased gap penalties and E-value cutoff 0.05).  Convert the
resulting BLASTX output to Genoa format using 'blastx2genoa'.

4) Run GenomeScan on the masked genomic sequence using the Genoa file
from the previous step as input.

Additional details can be found in the genomescript file itself.

______________________________________________________________________


5.  GENOA FORMAT AND BLASTX2GENOA

BLASTX output is converted to the 'Genoa' format required by
GenomeScan using a Perl script 'blastx2genoa'.  Genoa format is a
simple representation of BLASTX (or other) similarity information in
which each BLASTX hit is represented by a single line with 15 columns.
A sample is shown below:

Column1    2  3         4     5 6  7    8    9   10   11    12   13  14     15
------- ---- -- --------- ----- - -- ---- ---- ---- ---- ----- ----- -- ------

Hit33.1 Pept nr AAC52323+ Mouse 1 26 1812 4639 4716 4690 1e-08 1e-08  0 BLASTX


In the description below, the genomic sequence is the query and
protein sequences involved in BLASTX hits are referred to as subject.
Meanings of columns no. 1-15:

1) A label for the hit (arbitrary but must be unique in Genoa file)

2) Type of sequence hit (Pept = peptide for a BLASTX hit)

3) Name of database searched (for reference)

4) Locus name or accession number of subject protein (for reference)

   A + after the name indicates that the protein sequence begins with
   Methionine.  A - after the name indicates that it does not.

5) Name of organism from which protein derived (for reference)

   The default value of this field is 'Unknown'.  This field is not currently 
   used by GenomeScan.

6) Location of beginning of BLASTX hit in subject protein (aa)

7) Location of end of BLASTX hit in subject protein (aa)

8) Length of subject protein (aa)

9) Location of beginning of hit in query genomic sequence (nucleotide position
   on forward DNA strand)

10) Location of end of hit in query genomic sequence (nucleotide position on 
    forward DNA strand)

    For hits to reverse strand reading frames, column 9 exceeds column 10.

11) 'Centroid' of BLASTX hit in query genomic sequence (defined below)   

12) BLASTX E-value of hit

13) BLASTX E-value of 'parent' hit (see below)

    Often, the hit and the parent hit are one and the same.  These columns
    differ when a 'parent' BLASTX hit is very long or has a multimodal score   
    distribution and spawns multiple Genoa hits with different centroids.

14) Reading frame of hit.  Positions are always numbered on the forward strand
    starting from 1.  The conventions used by GenomeScan are as follows.

    Forward strand reading frames:

       A codon at position [x-2,x-1,x] has reading frame x modulo 3

       For example, the codon at [59,60,61] has reading frame 1

    Reverse strand reading frames:

       A codon at position [x,x-1,x-2] (forward strand coordinates) has
       reading frame (x modulo 3) + 3

       For example, the codon at [20,19,18] has reading frame 5

15) Name of program used for alignments (for reference)

Columns marked 'for reference' do not affect the output of the program
- this information is present as a record of the source of the
similarity information.
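
The 15-column layout and the reading frame conventions above can be
sketched in Python as follows.  The field names are illustrative
labels chosen for this example, not identifiers used by GenomeScan or
blastx2genoa.  The sketch assumes that columns are separated by
whitespace, that no column contains internal whitespace, and that a
hit begins on a codon boundary; the special notation for Genoa intron
hits (section 6) is not handled.

```python
from collections import namedtuple

# One field per Genoa column, in column order 1-15.
GenoaHit = namedtuple("GenoaHit", [
    "label", "seqtype", "database", "subject", "organism",
    "subj_begin", "subj_end", "subj_len",
    "query_begin", "query_end", "centroid",
    "evalue", "parent_evalue", "frame", "program",
])


def parse_genoa_line(line):
    """Split one Genoa line into typed fields."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError("expected 15 columns, found %d" % len(fields))
    return GenoaHit(
        fields[0], fields[1], fields[2], fields[3], fields[4],
        int(fields[5]), int(fields[6]), int(fields[7]),
        int(fields[8]), int(fields[9]), int(fields[10]),
        float(fields[11]), float(fields[12]), int(fields[13]), fields[14],
    )


def subject_begins_with_met(hit):
    """A '+' after the subject name marks a protein beginning with Methionine."""
    return hit.subject.endswith("+")


def reading_frame(x, reverse=False):
    """Frame of the forward codon [x-2,x-1,x] or the reverse codon [x,x-1,x-2]."""
    return x % 3 + (3 if reverse else 0)


def frame_from_coordinates(hit):
    """Recompute column 14 from columns 9 and 10 (first codon of the hit)."""
    if hit.query_begin > hit.query_end:          # reverse strand hit
        return reading_frame(hit.query_begin, reverse=True)
    return reading_frame(hit.query_begin + 2)    # forward strand hit


sample = ("Hit33.1 Pept nr AAC52323+ Mouse 1 26 1812 "
          "4639 4716 4690 1e-08 1e-08  0 BLASTX")
hit = parse_genoa_line(sample)
print(hit.label, hit.frame, frame_from_coordinates(hit))
```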

______________________________________________________________________


6.  TECHNICAL DETAILS

This section describes some technical details of the GenomeScan
algorithm which are likely to be of interest only to bioinformatics
aficionados.

1) Definition of centroid and treatment of long hits and multimodal hits

The centroid of a BLASTX hit is defined in terms of the cumulative
score plot obtained by summing the raw BLOSUM match scores across the
hit in question. Typically, human exons are represented by
medium-sized BLASTX hits (between 15 and 100 amino acids/codons in
length) which correspond reasonably well to the boundaries of the exon
(human exons are typically between 15 and 100 codons in length).  In
this case, the goal is to position the centroid at the position within
the hit which is most likely to be contained within the exon.
Therefore, the centroid is defined using a steepest slope heuristic as
the point C which has steepest slope in the cumulative score plot over
a window of 15 codons centered at C.  For very short hits (< 15 amino
acids), differences in slope cannot be reliably measured, so the
centroid is defined as the midpoint of the hit in the query genomic
sequence.  In some cases, the cumulative score plot may be multimodal.
For our purposes, a score distribution is multimodal if it has (at
least) one valley whose score is at least 40 below the previous
peak (raw BLOSUM score units), followed by a peak which is at least 40
above the low point of that valley.  This situation has
two typical causes: 1) the valley(s) correspond to poorly conserved
regions of a large exon; or 2) the hit covers two (or more) closely
spaced exons and the valleys represent introns.  The heuristic used,
which handles both of these situations gracefully, is to break the
BLASTX hit (referred to as the 'parent' hit) into as many Genoa hits
as there are modes of the distribution, where each mode corresponds to
a peak of the score distribution which has a score at least 40 higher
than the previous valley as described above.  The E-values of these
'daughter' hits are derived from that of the parent hit in proportion
to the heights of the modes as in the following example.  Consider a
BLASTX hit with E-value 1e-9 which has two modes, one with a peak
height of 100 (relative to the previous valley), the other with a peak
height of 50.  In this case, the first subhit would be assigned an
E-value of (1e-9)^(100/150) = 1e-6, while the second would have an
E-value of (1e-9)^(50/150) = 1e-3.  (The notation x^y indicates x to
the y power.)  The centroids of these subhits are defined as above,
using the steepest slope or midpoint rule depending on size.
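
The centroid rule and the E-value arithmetic above can be sketched in
Python as follows.  This is an illustration only, not the code in
blastx2genoa.  The function names are hypothetical; the sketch
assumes the hit is given as a list of per-codon raw match scores,
considers only windows that lie entirely within the hit, returns a
0-based codon index rather than a genomic coordinate, and breaks ties
in favor of the first window.

```python
def centroid_index(scores, window=15):
    """Codon index (0-based) of the centroid of a hit.

    The slope of the cumulative score plot over a window equals the
    sum of the per-codon scores inside that window, so the steepest
    window is the one with the largest score sum.
    """
    n = len(scores)
    if n < window:
        return (n - 1) // 2              # very short hit: use the midpoint
    half = window // 2
    best_c, best_gain = half, float("-inf")
    for c in range(half, n - half):
        gain = sum(scores[c - half:c + half + 1])
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c


def daughter_evalues(parent_evalue, peak_heights):
    """Split a parent E-value among modes in proportion to their peak heights."""
    total = float(sum(peak_heights))
    return [parent_evalue ** (height / total) for height in peak_heights]


# The example from the text: parent E-value 1e-9, peak heights 100 and 50.
print(daughter_evalues(1e-9, [100, 50]))
```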

Very long BLASTX hits (> 100 codons/amino acids) generally correspond
to (rare) long exons or intronless genes (or pseudogenes).  Using the
steepest slope heuristic in this case would tend to concentrate the
entire hit into a single point, which would not accurately represent
the information contained in an extended region of BLASTX similarity.
Therefore, very long hits are also broken into multiple daughter hits,
using the following heuristic.  First, the hit is 'trimmed' by moving
the 5' boundary to the genomic position where the cumulative score plot
first exceeds a BLOSUM score of 50, and similarly at the 3' end to
ensure that the new boundaries are internal to the coding region which
is present.  Next, this core region of the hit (of length L bases) is
broken up into N subhits, where N is defined as the greatest integer
less than L/75.  The centroids of the subhits are then placed at N
positions spaced equidistantly across this core region of the hit.
The P-values of the daughter hits are set to p = P^(1/N), where P is
the P-value of the parent hit and N is the number of daughter hits.
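
A minimal Python sketch of the long-hit rule, assuming the hit has
already been trimmed to its core region.  The function name is
hypothetical, and the exact placement of the centroids is an
assumption: the text says only that they are spaced equidistantly, so
here each centroid is placed at the center of one of N equal segments
of the core region.  N is computed literally as the greatest integer
less than L/75, which differs from simple truncation only when L is
an exact multiple of 75.

```python
def split_long_hit(core_begin, core_end, parent_pvalue, spacing=75):
    """Return a list of (centroid, pvalue) daughter hits for a trimmed long hit.

    core_begin and core_end are forward strand nucleotide positions
    (inclusive) of the trimmed core region.
    """
    length = core_end - core_begin + 1           # L, in bases
    n = (length - 1) // spacing                  # greatest integer less than L/75
    if n < 1:
        return []                                # core too short to split
    daughter_p = parent_pvalue ** (1.0 / n)      # p = P^(1/N)
    daughters = []
    for i in range(n):
        # Assumed placement: center of the i-th of n equal segments.
        centroid = int(core_begin + (i + 0.5) * length / n)
        daughters.append((centroid, daughter_p))
    return daughters


for centroid, pvalue in split_long_hit(1001, 1450, 1e-12):
    print(centroid, pvalue)
```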

These heuristics are based on reasonable principles but are not
necessarily 'optimal' in any way.  However, they have withstood fairly
extensive testing and appear to work well in practice.  All of the
processing of BLASTX hits into Genoa hits, determination of
multimodality, calculation of centroids, etc. is carried out by the
script blastx2genoa.

2) Using BLASTX hits to identify initiation and termination codons

Consider the example given in the GenomeScan paper in which residues
6-50 of a protein have a BLASTX match to genomic coordinates 116-250.
In this case it stands to reason that an ATG located five codons
upstream at position 101-103 in the genomic sequence (if one occurs
there) has a higher likelihood of representing an initiation codon
than some other randomly chosen ATG in the genomic sequence.  This
observation leads to the start codon heuristic used by GenomeScan
which is applied to BLASTX hits which begin at positions X within 30
residues of the start of the target protein.  In this heuristic, the
probability of any parse which involves an initiation codon exactly
X-1 codons upstream of the genomic location of the beginning of the
BLASTX hit (as in the example above) is increased by the factor
START_FACTOR, which may be set as a command-line argument to
GenomeScan (the default value of this parameter, 1e6, was determined
empirically).  Some peptides in available protein databases derive
from partial cDNAs (often 5'-truncated) and therefore do not represent
complete proteins.  To filter out these incomplete proteins, the above
heuristic is applied only when the subject protein begins with
Methionine (a reasonable but imperfect marker for completeness).
These proteins are marked with a '+' in the Genoa file as in the
example given in the previous section.
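
The start codon arithmetic in the example above can be sketched in
Python as follows.  The function names are hypothetical and only
forward strand hits are handled.  The E-value condition used here
(E < 1e-6) is the one stated for the start and stop codon heuristics
later in this section.

```python
def expected_start_codon(hit_query_begin, hit_subject_begin):
    """Genomic (first, last) positions of an ATG X-1 codons upstream of the hit.

    hit_subject_begin is X, the residue at which the hit begins in the
    subject protein; hit_query_begin is the genomic position where the
    hit begins (forward strand).
    """
    first = hit_query_begin - 3 * (hit_subject_begin - 1)
    return first, first + 2


def start_heuristic_applies(hit_subject_begin, evalue, subject_begins_with_met,
                            max_residue=30, evalue_cutoff=1e-6):
    """Conditions stated in the text for applying START_FACTOR."""
    return (hit_subject_begin <= max_residue
            and evalue < evalue_cutoff
            and subject_begins_with_met)


# Residues 6-50 matching genomic coordinates 116-250, as in the text.
print(expected_start_codon(116, 6))
```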
  
Now suppose that residues 450-495 of a 500 amino acid protein match
the genomic region 1200-1335.  In this case it stands to reason that a
stop codon triplet (TAA, TAG or TGA) located at position 1348-1350 in
the genomic sequence (if one occurs there) is more likely to represent
a stop codon than some randomly chosen genomic stop codon triplet.
This observation leads to the stop codon heuristic used by GenomeScan
which is applied to BLASTX hits which end at positions X >= L - 30
in the subject protein (where L is the length of the protein).  In
this case, the probability of any parse which involves a stop codon
exactly (L-X) codons downstream of the genomic location of the end of
the BLASTX hit is increased by the factor STOP_FACTOR.  This parameter
can also be set as a command-line argument to GenomeScan, and the
empirically determined default value is also 1e6.  These start and
stop codon heuristics are only applied to relatively strong BLASTX
hits (E < 1e-6) since only relatively well conserved proteins are
likely to give reliable start/stop locations. (Distant homologs will
often have acquired insertions/deletions near the ends of the protein
causing the spacing relative to the start/stop codons to change.)
Again, these heuristics are based on reasonable principles, but are
not necessarily 'optimal'.  However, they appear to work quite well in
practice and help to overcome the well-known weakness of ab initio
gene prediction methods such as Genscan in terms of prediction of the
first and last exons of a gene (Burge and Karlin, 1998 - see
REFERENCES).
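
The stop codon arithmetic can be sketched in the same way.  This
Python fragment simply reproduces the example above as literally
stated (a stop codon triplet ending 3 x (L-X) bases after the end of
the hit, where X is the last matched residue); the function names are
hypothetical and only forward strand hits are handled.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def expected_stop_codon(hit_query_end, hit_subject_end, subject_length):
    """Genomic (first, last) positions of a stop codon (L-X) codons downstream."""
    last = hit_query_end + 3 * (subject_length - hit_subject_end)
    return last - 2, last


def has_stop_codon(sequence, first):
    """True if a stop codon triplet begins at 1-based position `first`."""
    return sequence[first - 1:first + 2].upper() in STOP_CODONS


# Residues 450-495 of a 500 residue protein ending at genomic position 1335.
print(expected_stop_codon(1335, 495, 500))
```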

3) Using BLASTX hits to identify intronic regions

Suppose that BLASTX hit B1 matches residues 101-150 of protein P to
nucleotides 850-999 of the query genomic sequence, and BLASTX hit B2
matches residues 151-200 of protein P to nucleotides 2001-2150 of the
genomic sequence.  This sort of arrangement (adjacent regions of the
protein matching nearby but distinct regions of the genomic sequence)
often indicates
a pair of adjacent exons separated by an intron and provides the
additional information that the region from approximately 1000-2000 in
the genomic sequence is likely to be intronic (and therefore would not
contain additional exons).  This observation leads to the intron
heuristic used by GenomeScan which is applied when certain conditions
are met which make it very likely that the region between two BLASTX
hits is intronic.  Specifically, the conditions are that there must be
two relatively strong (E < 1e-6) BLASTX hits in a region of genomic
DNA which match adjacent or nearly adjacent regions of the same
subject protein (the end of the upstream hit differs from the
beginning of the downstream hit by 5 or fewer residues) but are not
adjacent in the genomic sequence (> 60 bp apart, the minimum length of
a human intron).  In this case, a special type of Genoa hit (a Genoa
'intron hit') is generated by blastx2genoa which specifies that the
region of the genomic sequence beginning 30 bp after the end of the
upstream hit and ending 30 bp before the beginning of the downstream
hit is likely to be intronic with P-value pI = 1 - (1 - pB1) x (1 -
pB2), where pB1 and pB2 are the P-values of the BLASTX hits B1 and B2.
The reasoning behind this formula is that the
inference that the region is intronic is valid only if both flanking
BLASTX hits are correct, an event which has probability (1 - pB1) x (1
- pB2) assuming independence between hits (and before application of
the root-r heuristic).  The P-values of Genoa intron hits are adjusted
using the root-r heuristic as for normal Genoa hits.  The 30 bp offset
is used to ensure that the specified intronic region is very unlikely
to overlap with either of the flanking exons.  Internally, the
GenomeScan program reduces the probabilities of parses which involve
an exon in the region specified by a Genoa intron hit in a manner
analogous to the way that regular Genoa hits reduce the probabilities
of parses which do NOT contain overlapping exons (described in the
Methods section of GenomeScan paper).  A special notation in the Genoa
file distinguishes Genoa intron hits from regular Genoa hits.
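
The conditions and the P-value formula for Genoa intron hits can be
sketched in Python as follows.  This is an illustration of the rules
as stated above, not the code in blastx2genoa.  The record type and
function name are hypothetical, only forward strand hits are handled,
both the E-value and the P-value of each hit are taken as given, and
the root-r adjustment is not applied.

```python
from collections import namedtuple

# Hypothetical record for one BLASTX hit (forward strand coordinates).
BlastxHit = namedtuple(
    "BlastxHit", "subj_begin subj_end query_begin query_end evalue pvalue")


def genoa_intron_hit(upstream, downstream, evalue_cutoff=1e-6,
                     max_residue_gap=5, min_genomic_gap=60, offset=30):
    """Return (begin, end, p_intron) or None if the conditions are not met."""
    # Both flanking hits must be relatively strong.
    if upstream.evalue >= evalue_cutoff or downstream.evalue >= evalue_cutoff:
        return None
    # The hits must match adjacent or nearly adjacent regions of the protein.
    if abs(downstream.subj_begin - upstream.subj_end) > max_residue_gap:
        return None
    # The hits must not be adjacent in the genomic sequence.
    if downstream.query_begin - upstream.query_end <= min_genomic_gap:
        return None
    begin = upstream.query_end + offset
    end = downstream.query_begin - offset
    # pI = 1 - (1 - pB1) x (1 - pB2)
    p_intron = 1.0 - (1.0 - upstream.pvalue) * (1.0 - downstream.pvalue)
    return begin, end, p_intron


# Hits B1 and B2 from the text, with illustrative E-values and P-values.
b1 = BlastxHit(101, 150, 850, 999, 1e-10, 1e-10)
b2 = BlastxHit(151, 200, 2001, 2150, 1e-8, 1e-8)
print(genoa_intron_hit(b1, b2))
```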

4) Treatment of multiple BLASTX / Genoa hits in a sequence

	It is common that multiple BLASTX hits to different homologous
proteins will overlap essentially the same region of a genomic
sequence.  Since these hits generally provide redundant information,
in these situations only the strongest (lowest E-value) hit is
retained by GenomeScan - the other overlapping hits are simply
ignored.  Once this initial pruning of hits has been performed, many
non-overlapping BLASTX / Genoa hits will usually remain.  Multiple
non-overlapping hits are handled in the GenomeScan model as follows.
Consider the case where there are two non-overlapping Genoa hits, G1
and G2.  Using the notation from the Methods section of the GenomeScan
paper, P(phi,S|G1,G2) is defined as:

(2a)	(1-pG1)(1-pG2) P(phi,S)	

		if the parse phi is inconsistent with both G1 and G2

(2b)	(1-pG1)[(pG2/P(PhiG2)) + (1-pG2)] P(phi,S)

		if phi is consistent with G2 but not with G1

(2c)	(1-pG2)[(pG1/P(PhiG1)) + (1-pG1)] P(phi,S)

		if phi is consistent with G1 but not with G2

(2d)	[(pG1/P(PhiG1)) + (1-pG1)] [(pG2/P(PhiG2)) + (1-pG2)] P(phi,S)

		if phi is consistent with both G1 and G2

Equations 2a-2d can be derived using reasoning similar to that used to
derive eq. (1) in the Methods section of the GenomeScan paper, together
with two assumptions: that the events HG1 and HG2 are independent, and,
more subtly, that the Genoa hits influence the probabilities of parses
independently of each other.  Although there
are reasons to suspect that these independence assumptions are not
strictly correct, there are good reasons to make them anyway: 1) they
may be approximately correct; 2) they appear to work well in practice;
and 3) without them the model becomes computationally intractable.
These assumptions are in this sense analogous to the Markov assumption
made in HMM models that state k+1 depends only on state k.  In typical
applications (such as HMM gene models or profile HMMs for protein
sequence alignment), there are almost always reasons to suspect that
some longer range dependence exists, but to account for this would
destroy the computational tractability of the Viterbi, forward and
backward algorithms, which is one of the principal virtues of HMM
models in the first place.
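
Equations 2a-2d share a simple product form: each Genoa hit
contributes one factor, chosen according to whether the parse is
consistent with that hit.  The Python sketch below illustrates this,
together with the initial pruning of overlapping hits.  The function
names are hypothetical.  The pruning rule shown (visit hits from
strongest to weakest and drop any hit that overlaps one already kept)
is an assumption, since the text does not specify how much overlap is
required.

```python
def prune_overlapping(hits):
    """Keep only the strongest (lowest E-value) of any overlapping hits.

    hits is a list of (begin, end, evalue) tuples in forward strand
    coordinates; begin may exceed end for reverse strand hits.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h[2]):
        low, high = min(hit[0], hit[1]), max(hit[0], hit[1])
        if all(high < min(k[0], k[1]) or low > max(k[0], k[1]) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: min(h[0], h[1]))


def parse_probability(prior, hits):
    """P(phi,S|G1,...,Gn) in the product form of equations 2a-2d.

    prior is P(phi,S); hits is a list of (pG, P_PhiG, consistent)
    tuples, one per non-overlapping Genoa hit.
    """
    result = prior
    for p_g, p_phi_g, consistent in hits:
        if consistent:
            result *= p_g / p_phi_g + (1.0 - p_g)
        else:
            result *= 1.0 - p_g
    return result


# Case (2c): phi is consistent with G1 but not with G2 (illustrative numbers).
print(parse_probability(1e-3, [(0.9, 0.5, True), (0.8, 0.25, False)]))
```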


______________________________________________________________________


7. WEB PAGES


    GenomeScan web page

        http://genes.mit.edu/genomescan


    Burge lab home page

        http://genes.mit.edu/burgelab


______________________________________________________________________


8. REFERENCES

R.-F. Yeh, L. P. Lim and C. B. Burge, 2001.  Computational Inference
of Homologous Gene Structures in the Human Genome.  Genome Research (in press).

 The original GenomeScan paper.

Burge, C. and Karlin, S. (1998) Finding the genes in genomic DNA.
Curr. Opin. Struct. Biol.  8, 346-354.

 A review of gene finding methods.

Burge, C. and Karlin, S. (1997) Prediction of complete gene structures
in human genomic DNA. J. Mol. Biol. 268, 78-94.

 The original Genscan paper.

Burge, C. (1997) Identification of genes in human genomic DNA. PhD
thesis, Stanford University, Stanford, CA.

 A detailed description of the models and algorithms underlying Genscan. 

Burset, M. & Guigo, R. (1996) Evaluation of gene structure prediction
programs.  Genomics 34, 353-367.

 The classic comparative study of gene prediction methods.


______________________________________________________________________

Copyright (c) 2000-2001, Christopher Burge

______________________________________________________________________