HELP AND DOCUMENTATION

NAME 
  
  This field is required. The default name `query' is used if the user does
  not supply a name for the query.

Paste your sequence
  
  Paste the DNA sequence into the text area. If the sequence is in one of
  the standard formats (EMBL, FASTA, GENBANK, etc.), set `INPUT-FORMAT' to
  `FORMATTED'; if it is a bare nucleotide sequence, set `INPUT-FORMAT' to
  `NON-FORMATTED'.

UPLOAD-FILE
  
  Use this option to upload a file containing a nucleotide sequence, either
  in one of the standard formats or as a non-formatted DNA sequence. A pasted
  sequence and an uploaded file cannot be submitted together in the same
  analysis.

INPUT-FORMAT
  
  The program uses the ReadSeq program, developed by Dr. Don Gilbert, Biology
  Dept., Indiana University, to read the input sequence, and therefore accepts
  most commonly used standard sequence formats (EMBL, FASTA, GENBANK, etc.).
  Select `FORMATTED' if the input sequence is in any of these standard
  formats, or `NON-FORMATTED' if it is a bare nucleotide sequence.

OUTPUT-FORMAT
  
  The server reports results in one of two formats, TABULAR or GRAPHICAL.

  TABULAR OUTPUT lists the Fourier spectrum (Power) at different frequencies
  for GENESCAN and FTG, or the values m1, m2 and m3 for ZCurve.

  GRAPHICAL OUTPUT gives a plot of Power vs frequency for GENESCAN and FTG,
  or a plot of m1 vs m3 for ZCurve.

FROM_&_TO
  
  Use these fields to restrict the analysis to a particular region of the
  input sequence.

   Tips:

     To analyze the complete sequence without knowing the number of the last
     base, give `1' in the `FROM' field and leave the `TO' field empty.

     To analyze a region, for example 230-590, give `230' in the `FROM' field
     and `590' in the `TO' field.

     Do not leave the `FROM' field empty.
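
     The tips above can be sketched in a few lines of Python. This is only an
     illustration, not the server's code: the function name `select_region`
     is hypothetical, and it assumes that positions are 1-based and that the
     `TO' base is included in the region.

```python
def select_region(sequence, start, end=None):
    """Return bases start..end, counted from 1 and inclusive (an assumption).

    end=None mimics an empty `TO' field: the region runs to the last base.
    """
    if start is None or start < 1:
        raise ValueError("FROM must be given and must be 1 or greater")
    if end is None:
        end = len(sequence)
    if end < start or end > len(sequence):
        raise ValueError("TO must lie between FROM and the sequence length")
    return sequence[start - 1:end]


seq = "GGATACACTTTAGAG"
print(select_region(seq, 1))     # FROM=1, TO empty: the complete sequence
print(select_region(seq, 3, 6))  # FROM=3, TO=6: bases 3 to 6
```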

ALGORITHM

  There are three algorithms, GENESCAN, LENGTHEN-SHUFFLE and FTG, and a
  windowed variant of each: GENESCAN-WINDOW, LENGTHEN-SHUFFLE-WINDOW and
  FTG-WINDOW. The WINDOW variants are provided for convenience: they scan a
  long DNA sequence that may contain several protein-coding regions. Once
  candidate protein-coding regions have been identified, their periodicity
  can be confirmed using the original versions of the algorithms.

    GENESCAN
    
      This algorithm uses a Fourier technique based on a distinctive feature
      of protein-coding regions, the 3-base periodicity. The signature of
      this periodicity (and of others) is most directly observed through
      Fourier analysis.

      A sequence of N nucleotides may be formally viewed as a symbol string,
      {xj, j=1,2,...,N}, where xj is one of the four symbols A, T, G and C,
      and denotes the occurrence of that particular nucleotide at position j.
      One can define a binary indicator function, or projection operator, Ua
      which selects the elements of the sequence that are equal to the symbol
      a, namely Ua(xj)=1 if xj is a and 0 otherwise. Applying the operators
      UA, UT, UG and UC in turn to a DNA sequence yields four binary
      sequences, as illustrated in Figure 1:

      Sequence  GGATACACTTTAGAG
      Apply UA  001010100001010
      Apply UT  000100001110000
      Apply UG  110000000000101
      Apply UC  000001010000000

            Figure 1.
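
      The four binary sequences of Figure 1 can be reproduced with a short
      Python sketch. The function name `indicator_sequences` is hypothetical
      and is used here only for illustration.

```python
def indicator_sequences(sequence):
    """Map each base a in A, T, G, C to its binary sequence Ua(xj)."""
    seq = sequence.upper()
    return {base: [1 if x == base else 0 for x in seq] for base in "ATGC"}


for base, bits in indicator_sequences("GGATACACTTTAGAG").items():
    # Prints one row per operator, in the same order as Figure 1.
    print("Apply U" + base, "".join(str(b) for b in bits))
```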

      Thus, any DNA sequence can be converted to four binary sequences, which
      can then be Fourier analyzed in the normal manner to examine
      correlations between the symbols. The total Fourier spectrum of the DNA
      sequence is the sum of these individual spectra, namely:

             S(f) = SA(f) + ST(f) + SG(f) + SC(f),

             Sa(f) = (1/N^2) |sum(j=1..N) Ua(xj) exp(2 pi i f j)|^2   -----------(1)

      where the discrete frequency f=k/N, with k=1,2,...,N/2, and Sa(f) is the
      partial spectrum corresponding to the symbol a=A, T, G or C. The average
      of the total spectrum, S^, can be calculated from the frequency of
      occurrence, ρa, of each symbol (a=A, T, G, C) as:

             S^ = (1/N) (1 + 1/N - (ρA^2 + ρT^2 + ρG^2 + ρC^2))       -----------(2)

      For protein-coding sequences from a variety of organisms, the Fourier
      spectrum [equation (1)] reveals the characteristic periodicity of three
      as a distinct peak at frequency f=1/3. No such peak above noise level is
      apparent for non-protein-coding sequences such as rRNA, intergenic
      spacers and introns, which have a flat Fourier spectrum devoid of any
      periodicity. The signal-to-noise ratio of the peak at f=1/3 is given as:

             P = S(f=1/3) / S^                                        -----------(3)

      A threshold of P=4 is used to discriminate between coding and non-coding
      sequences.
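
      As a rough illustration of the measure described above, the sketch below
      computes the total spectrum at f=1/3, the average spectrum and their
      ratio P in plain Python. It is not the GENESCAN program. It assumes that
      each partial spectrum is the squared magnitude of the Fourier sum of the
      indicator sequence divided by N^2, and that the average spectrum is
      (1/N)(1 + 1/N - sum of squared base frequencies). The function names are
      hypothetical, and trailing bases are dropped so that N is a multiple of
      three (also an assumption).

```python
import cmath


def total_spectrum(sequence, k):
    """Total spectrum S(f) at f = k/N, summed over the four bases."""
    seq = sequence.upper()
    n = len(seq)
    total = 0.0
    for base in "ATGC":
        # Fourier sum over the positions where the indicator Ua(xj) is 1.
        amplitude = sum(cmath.exp(2j * cmath.pi * k * j / n)
                        for j, x in enumerate(seq, start=1) if x == base)
        total += abs(amplitude) ** 2 / n ** 2  # assumed 1/N^2 normalisation
    return total


def average_spectrum(sequence):
    """Average spectrum from the base frequencies (assumed form)."""
    seq = sequence.upper()
    n = len(seq)
    sum_sq = sum((seq.count(base) / n) ** 2 for base in "ATGC")
    return (1.0 + 1.0 / n - sum_sq) / n


def peak_to_noise(sequence):
    """Ratio P of the peak at f=1/3 to the average spectrum."""
    n = len(sequence) - len(sequence) % 3  # make f=1/3 an exact frequency
    trimmed = sequence[:n]
    return total_spectrum(trimmed, n // 3) / average_spectrum(trimmed)


print(peak_to_noise("ATG" * 50) > 4)  # perfectly 3-periodic toy sequence
print(peak_to_noise("A" * 150) > 4)   # no 3-base periodicity
```

      The two toy sequences are artificial and serve only to show a value of P
      above and below the threshold of 4.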

      For a detailed description of the algorithm please refer to the original
      paper (Tiwari et al., 1997).

         The academic version of the program is available for distribution and 
         can be accessed at http://202.41.10.146/GS.html

    LENGTHEN-SHUFFLE
    
      Because the window used in gene finding is short (usually 100 bp or so),
      the Fourier measure applied to a single window has limited success. For
      a longer sequence (>1024 bp) the periodicity is easier to detect with
      the FFT algorithm. The LENGTHEN-SHUFFLE algorithm addresses this problem
      by artificially lengthening the short sequence.

      FORMAT OF Z CURVES:

      Consider a DNA sequence with N bases read from the 5'-end to the 3'-end.
      Beginning from the first base, inspect the sequence one base at a time.
      Let the number of steps be denoted by n, i.e. n=1,2,...,N. In the nth
      step, count the cumulative numbers of the bases A, C, G and T occurring
      in the subsequence from the first to the nth base. Denote these four
      non-negative integers by An, Cn, Gn and Tn, respectively. The Z curve
      consists of a series of nodes Pn (n=1,2,...,N) whose coordinates are
      denoted by xn, yn and zn. It was shown that

             xn = 2(An + Gn) - n,
             yn = 2(An + Cn) - n,       n=0,1,2...........,N      -----------(4)
             zn = 2(An + Tn) - n,

      where A0=C0=G0=T0=0 and thus x0=y0=z0=0. The Z curve of the DNA sequence
      is defined by connecting the nodes P0 (i.e. the origin), P1, P2,...,PN
      one by one with lines. We then define:

             Δxn = xn - xn-1,
             Δyn = yn - yn-1,           n=1,2,...,N               ------------(5)
             Δzn = zn - zn-1,

      where Δxn, Δyn and Δzn can only take the values 1 or -1. Δxn is 1 when
      the nth base is A or G (purine) and -1 when it is C or T (pyrimidine);
      Δyn is 1 when the nth base is A or C (amino type) and -1 when it is G or
      T (keto type); Δzn is 1 when the nth base is A or T (weak hydrogen bond)
      and -1 when it is G or C (strong hydrogen bond). Therefore, a DNA
      sequence can be decomposed into three series of digital signals, each
      consisting of 1 and -1 and each with a clear biological meaning. The
      first series, Δxn, represents the distribution of purine/pyrimidine
      bases along the sequence. The second series, Δyn, represents the
      distribution of amino/keto bases along the sequence. Similarly, the
      third series, Δzn, represents the distribution of bases with weak/strong
      hydrogen bonds along the sequence.
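
      The decomposition described above can be sketched as follows. The
      function names are hypothetical, and the sketch assumes that the input
      contains only the letters A, C, G and T.

```python
def z_curve_steps(sequence):
    """Return the three +1/-1 step series (x, y, z) of the Z curve."""
    dx, dy, dz = [], [], []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError("unexpected symbol: " + base)
        dx.append(1 if base in "AG" else -1)  # purine vs pyrimidine
        dy.append(1 if base in "AC" else -1)  # amino vs keto
        dz.append(1 if base in "AT" else -1)  # weak vs strong hydrogen bond
    return dx, dy, dz


def z_curve(sequence):
    """Return the nodes P0..PN as (xn, yn, zn) tuples, starting at the origin."""
    nodes = [(0, 0, 0)]
    for sx, sy, sz in zip(*z_curve_steps(sequence)):
        x, y, z = nodes[-1]
        nodes.append((x + sx, y + sy, z + sz))
    return nodes


print(z_curve_steps("ACGT"))
print(z_curve("ACGT"))
```

      Accumulating the steps gives the same coordinates as equation (4); for
      example, after the four bases of ACGT the curve is back at the origin.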

      LENGTHEN-SHUFFLE FOURIER TRANSFORM:

      A relatively short DNA sequence of length D (<150 bp) is first
      lengthened by repeating it K times, where K=1200/D. A spurious
      periodicity of D will then appear in the power spectrum of the FFT. To
      eliminate this spurious periodicity while keeping the periodicity of 3
      unchanged, the lengthened sequence is shuffled M times with three
      consecutive bases as a unit. A typical value of M used here is 10,000.
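
      A minimal sketch of the lengthen-and-shuffle step is given below. It is
      not the original program. It assumes that K is the integer part of
      1200/D, that one shuffle step swaps two randomly chosen three-base
      units, and that trailing bases which do not fill a unit are dropped. The
      function name and its parameters are hypothetical.

```python
import random


def lengthen_shuffle(sequence, target=1200, swaps=10000, seed=None):
    """Repeat a short sequence, then shuffle it in units of three bases."""
    d = len(sequence)
    if d == 0:
        raise ValueError("empty sequence")
    k = max(1, target // d)  # assumed: integer part of 1200/D
    lengthened = sequence * k
    lengthened = lengthened[:len(lengthened) - len(lengthened) % 3]
    units = [lengthened[i:i + 3] for i in range(0, len(lengthened), 3)]
    rng = random.Random(seed)
    for _ in range(swaps):
        # Swapping whole units keeps every base at the same position modulo 3.
        i = rng.randrange(len(units))
        j = rng.randrange(len(units))
        units[i], units[j] = units[j], units[i]
    return "".join(units)


shuffled = lengthen_shuffle("ATGGCC" * 20, seed=1)
print(len(shuffled))
```

      Because only whole three-base units move, the periodicity of 3 is
      preserved while the order of the repeated copies is destroyed.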

      As mentioned above, based on the format of the Z curve, any DNA sequence
      can be transformed into three series of digital signals, Δxn, Δyn and
      Δzn, to which the FFT algorithm is applied. The power spectrum for each
      series is calculated as follows:

             PC(f) = |sum(n=1..N) ΔCn exp(-2 pi i n f / N)|^2      -------------(6)

      where PC(f) is the power spectrum associated with ΔCn, and C stands for
      x, y or z. Three values are thus obtained: m1=Px(N/3), m2=Py(N/3) and
      m3=Pz(N/3). The Fisher linear discriminant equation is used for making
      the coding/non-coding decision.
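
      The three values m1, m2 and m3 can be sketched as below. The sketch
      assumes that the power is the squared magnitude of the discrete Fourier
      sum without any normalisation factor, and it drops trailing bases so
      that N/3 is an integer. The function name is hypothetical. The Fisher
      discriminant itself is not shown because its coefficients are not given
      here.

```python
import cmath


def power_at_one_third(sequence):
    """Return (m1, m2, m3): the power of each step series at index N/3."""
    seq = sequence.upper()
    n = len(seq) - len(seq) % 3  # assumed: trim so that N/3 is an integer
    f = n // 3
    powers = []
    # Bases that give a step of +1 in the x, y and z series, respectively.
    for plus_bases in ("AG", "AC", "AT"):
        amplitude = sum((1 if base in plus_bases else -1)
                        * cmath.exp(-2j * cmath.pi * f * j / n)
                        for j, base in enumerate(seq[:n], start=1))
        powers.append(abs(amplitude) ** 2)
    return tuple(powers)


m1, m2, m3 = power_at_one_third("ATG" * 50)
print(m1, m2, m3)
```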

      For a detailed description of the algorithm please refer to the original
      paper (Yan et al., 1998).

         The program is available on request from C.-T. Zhang ctzhang@tju.edu.cn

    FTG

      The FTG algorithm combines the properties of the GENESCAN and
      LENGTHEN-SHUFFLE algorithms to improve the accuracy of gene prediction.
      The drawback of GENESCAN is that the spectrum of a short DNA sequence
      (say <150 bp, a commonly used window size) is not clear, so the
      periodicity is poorly resolved. LENGTHEN-SHUFFLE tries to address this
      problem, but it does not indicate the type of periodicity that a DNA
      sequence has. FTG tries to overcome both limitations by combining the
      essential parts of the two algorithms so as to amplify the three-base
      periodicity of the DNA sequence.

      FTG takes a short nucleotide sequence of length D (where D<=1200) and
      extends it by repeating it K times (where K=(1200/D)+1). The spurious
      periodicity of D is removed by shuffling the extended sequence M times
      (where M=10000). The extended DNA sequence is then considered as a
      symbol string, {xj, j=1,2,...,N}, where xj is one of the four symbols A,
      C, G and T, and denotes the occurrence of that particular nucleotide at
      position j. A binary indicator function, or projection operator, Ua is
      defined which selects the elements of the sequence that are equal to the
      symbol a, namely Ua(xj)=1 if xj is a and 0 otherwise. Applying the
      operators UA, UT, UG and UC in turn to the sequence yields four binary
      sequences, as illustrated in Figure 1. These four binary sequences are
      then Fourier analyzed in the normal manner. The Fourier spectrum of the
      DNA sequence is calculated using equation 1, and the average spectrum
      using equation 2. The peak at f=1/3 is obtained, and its signal-to-noise
      ratio is calculated using equation 3.
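
      The FTG steps described above can be strung together in a short sketch.
      This is an illustration under stated assumptions, not the FTG program:
      K is taken as the integer part of 1200/D plus one, a shuffle step swaps
      two random three-base units, the partial spectra are divided by N^2, and
      the average spectrum is taken as (1/N)(1 + 1/N - sum of squared base
      frequencies). The function name is hypothetical.

```python
import cmath
import random


def ftg_peak_to_noise(sequence, swaps=10000, seed=None):
    """Lengthen, shuffle in triplets, then return P at f=1/3."""
    seq = sequence.upper()
    d = len(seq)
    if not 0 < d <= 1200:
        raise ValueError("sequence length must be between 1 and 1200")
    extended = seq * (1200 // d + 1)  # K = (1200/D) + 1, integer division
    extended = extended[:len(extended) - len(extended) % 3]
    units = [extended[i:i + 3] for i in range(0, len(extended), 3)]
    rng = random.Random(seed)
    for _ in range(swaps):
        i, j = rng.randrange(len(units)), rng.randrange(len(units))
        units[i], units[j] = units[j], units[i]
    shuffled = "".join(units)
    n = len(shuffled)
    peak = 0.0
    for base in "ATGC":
        # At f=1/3 the phase of position j is exp(2*pi*i*j/3).
        amplitude = sum(cmath.exp(2j * cmath.pi * j / 3)
                        for j, x in enumerate(shuffled, start=1) if x == base)
        peak += abs(amplitude) ** 2 / n ** 2
    sum_sq = sum((shuffled.count(base) / n) ** 2 for base in "ATGC")
    average = (1.0 + 1.0 / n - sum_sq) / n
    return peak / average


print(ftg_peak_to_noise("ATGGCC" * 20, seed=0) > 4)
```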

    GENESCAN-WINDOW, LENGTHEN-SHUFFLE-WINDOW and FTG-WINDOW

      The WINDOW option of each algorithm lets the user analyze long DNA
      sequences. The user has to specify a Step-size and a Window length for
      these three algorithms. The program takes overlapping windows whose
      start points are separated by the Step-size, analyzes each window and
      outputs the values for that window. The advantage of this option is that
      the whole input sequence is analyzed window by window in a single run,
      which saves time.

    Step-Size_&_Window

      Consider a DNA sequence 10000 bp long. To analyze it with overlapping
      windows of 150 bp, each starting 5 bp after the previous one, give a
      Step-size of 5 and a Window of 150 in the submission form.


      1--ACGTGCTAGCTGATGCTAGTGC---100--CATCGACTAGCATCAGCTACAGCTACGATCAGCACTGATC----10000
      |<------------------ 1st Window (Window length) ------------------>|
          |<------------------------- 2nd Window ------------------------->|
              |<------------------------- 3rd Window ------------------------->|
      |-->|
      Step-size


      Keep in mind that a smaller Step-size gives finer resolution along the
      sequence, and therefore more reliable results, at the cost of more
      windows to analyze.
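
      The window positions implied by a given Step-size and Window can be
      listed with a short sketch. It assumes 1-based start positions and that
      a final window which would run past the end of the sequence is skipped.
      The function name is hypothetical.

```python
def window_starts(sequence_length, window, step):
    """Return the 1-based start position of every complete window."""
    if window < 1 or step < 1:
        raise ValueError("Window and Step-size must be positive")
    return list(range(1, sequence_length - window + 2, step))


starts = window_starts(10000, window=150, step=5)
print(starts[:3])   # first three start positions
print(starts[-1])   # start of the last complete window
print(len(starts))  # number of windows analyzed
```

      For the example above (10000 bp, Window 150, Step-size 5) this gives
      1971 windows, starting at bases 1, 6, 11 and so on up to 9851.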

    TABULAR_OUTPUT

         GENESCAN

            The tabular output option for GENESCAN outputs the Power at
            different frequencies. The average of the Peaks for a spectrum is
            also output, along with the Peak-to-noise ratio at f=1/3. The
            program also reports the nucleotide composition and dinucleotide
            content of the sequence.

            

         LENGTHEN-SHUFFLE

            The LENGTHEN-SHUFFLE algorithm outputs m1, m2 and m3 as a table,
            along with the sequence composition analysis report.

            

         FTG

            The tabular output option for FTG outputs the Peaks at different
            frequencies. The average of the Peaks for a spectrum is also
            output, along with the Peak-to-noise ratio at f=1/3. Nucleotide
            composition and dinucleotide content are also computed by the
            program.

            

         GENESCAN-WINDOW

            For each window analyzed, the GENESCAN-WINDOW option outputs the
            Peak (Power) at f=1/3. The start-point in the table refers to the
            position of the first base of the window. The normal sequence
            analysis of the DNA sequence is also given.

            

         LENGTHEN-SHUFFLE-WINDOW

            For each window of the DNA sequence m1, m2, and m3 are computed and
            output by the program for this option. Position in this table refers
            to the first base position of the window in the input DNA sequence.

            

         FTG-WINDOW

            The position in the table refers to the first base of the window and
            the corresponding Power refers to the Peak at f=1/3 for that window.
            Normal nucleotide compositional analysis is also given.

            

    GRAPHICAL_OUTPUT

         GENESCAN

            A plot of Power [S(f)] vs frequency (f) is output with this option.
            The periodicity of three is visible as a peak at f=0.33 (1/3).
            The average peak of the spectrum and the Peak at f=1/3 are given.

            

            A periodicity of 10 is visible as a peak at f=0.10 (1/10).
            Other periodicities can be observed in this plot in the same way.

         LENGTHEN-SHUFFLE

            Plots of m3 vs m1, m3 vs m2 and m2 vs m1 are output with this option.

            

         FTG

            This plot is quite similar to the plot obtained with the GENESCAN
            option, except that the FTG option works best for short nucleotide
            sequences.

            

         GENESCAN-WINDOW

            The Window version of the GENESCAN algorithm gives a plot of the
            Peak at f=1/3 for each window vs sequence length. The red
            horizontal line is the default threshold for the coding/non-coding
            decision. Where the curve rises above this threshold, the region
            is considered coding.
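
            Reading coding regions off such a plot can be mimicked with a
            short sketch. It assumes a threshold of 4 and merges windows that
            exceed it and overlap or touch into one region; both choices are
            illustrative assumptions, and the function name is hypothetical.

```python
def coding_regions(window_scores, window, threshold=4.0):
    """Merge windows scoring above the threshold into (from, to) regions.

    window_scores is a list of (start, score) pairs sorted by start position,
    with 1-based starts (an assumption).
    """
    regions = []
    for start, score in window_scores:
        if score <= threshold:
            continue
        end = start + window - 1
        if regions and start <= regions[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Made-up scores, for illustration only.
scores = [(1, 2.0), (6, 5.1), (11, 6.3), (16, 1.2), (400, 7.0)]
print(coding_regions(scores, window=150))
```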

            

         LENGTHEN-SHUFFLE-WINDOW

            Three graphs of sequence length vs m1, m2 and m3, respectively, are
            given. In addition, a plot of m3 vs m1 for the different windows is
            given.

            
            
            

         FTG-WINDOW

            This graph is quite similar to the GENESCAN-WINDOW graph, except
            that it gives a plot of Power at f=1/3 for overlapping windows of
            size less than 1200 (ideally less than 150).