# $Id: Humangenefinding.txt,v 1.4 1995/07/17 16:06:37 colin Exp colin $

Notes on finding H. sapiens Genes.

A number of parameters should be set on genefinder, since the default
values apply to C. elegans, and appropriate values for H. sapiens are
quite different. These parameters control the intron lengths, relative
weighting, and cutoff values.

    -maxintronlength 10500  (Human introns can be quite long; longer values are ok)
    -minintronlength 70     (Minimum intron length in humans is ~70 nucleotides)
    -penaltycluster 50      (Intron lengths seem to peak at ~120 = 70+50)

Genefinder does not work very well with amino acid or codon tables.
There does not appear to be enough information in them to distinguish
coding sequence reliably. (You can use the information program to
compute the expected score from a single codon in sequence with the
tables you choose to use.) If you decide to use amino acid tables
anyway, I would suggest changing the intron penalty parameters
suggested below to

    -penaltyfactor 1.0 -cpenalty 4.6

(This reduces the penalty for an intron somewhat, to scale with the
reduced score given to coding regions.)

Experimentally we determined that the optimal values of the intron
length penalty parameters differ between codon tables, amino acid
tables and n-mer tables. The best results with n-mer tables were
produced with values of:

    -penaltyfactor 0.85 -cpenalty 6.35

Unfortunately, genes which have short exons separated by short introns
will not be predicted by genefinder with these parameters, due to the
large constant penalty (cpenalty) assigned to each intron.

If a large tract of genomic sequence is being analyzed (>10kb) then
better predictions will be made using the -norm flag to compute the
likelihood relative to the genomic sequence instead of random
sequence. As the genomic sequence contains some coding region, a
correction factor is used to adjust the likelihood.
The suggested flags are:

    -norm -corrfactor 1.325

If hexamer tables are used, very large regions of sequence are
required to estimate the null values. In this case, unless the
sequence is very large, it may be appropriate to use random sequence
as the reference instead (that is, omit -norm). For this case a
correction factor of 0.8 actually works better in many cases than the
default of 1.0, by reducing the false positive rate.

It may be desirable to replace repeat regions with 'X' to prevent
predictions from occurring there. With our test set of genes there was
a small overall improvement in gene prediction after repeat regions
were eliminated from the input sequence.

Recommendations:

For very large sequence (>100kb) analysis, use hexamer tables

    genefinder -norm -corrfactor 1.325 -tablenamefile humtables.hex -cpenalty 6.35 -penaltyfactor 0.85 -maxintronlength 10500 -minintronlength 70 -penaltycluster 50

For smaller sequences (10kb-100kb), use pentamer tables

    genefinder -norm -corrfactor 1.325 -tablenamefile humtables.pent -cpenalty 6.35 -penaltyfactor 0.85 -maxintronlength 10500 -minintronlength 70 -penaltycluster 50

For small sequence (<10kb), use hexamer tables without normalizing

    genefinder -corrfactor 0.8 -tablenamefile humtables.hex -cpenalty 6.35 -penaltyfactor 0.85 -maxintronlength 10500 -minintronlength 70 -penaltycluster 50
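
The three recommendations differ only in the normalization flags and
the table file, so the choice can be made mechanically from the
sequence length. The following is a minimal sketch in Python, not part
of genefinder itself. The helper name genefinder_args is hypothetical,
and treating exactly 10kb and 100kb as the middle case is an
assumption, since the notes leave those boundaries open. All flag
names and values are copied from the recommendations. How the input
sequence is passed to genefinder is not covered in these notes, so the
sketch leaves it out.

```python
# Flags shared by all three recommendations in the notes.
COMMON_FLAGS = [
    "-cpenalty", "6.35",
    "-penaltyfactor", "0.85",
    "-maxintronlength", "10500",
    "-minintronlength", "70",
    "-penaltycluster", "50",
]


def genefinder_args(seq_length_bp):
    """Return the recommended genefinder argument list for a sequence length.

    Hypothetical helper. Sequences of exactly 10kb or 100kb are treated
    as the middle (pentamer) case, which is an assumption.
    """
    if seq_length_bp > 100_000:
        # Very large sequence: normalize against the genomic sequence, hexamers.
        specific = ["-norm", "-corrfactor", "1.325",
                    "-tablenamefile", "humtables.hex"]
    elif seq_length_bp >= 10_000:
        # Smaller sequence: still normalize, but use pentamer tables.
        specific = ["-norm", "-corrfactor", "1.325",
                    "-tablenamefile", "humtables.pent"]
    else:
        # Small sequence: no -norm, hexamers, reduced correction factor.
        specific = ["-corrfactor", "0.8",
                    "-tablenamefile", "humtables.hex"]
    return ["genefinder"] + specific + COMMON_FLAGS


if __name__ == "__main__":
    for length in (5_000, 50_000, 250_000):
        print(length, " ".join(genefinder_args(length)))
```

Returning a list of arguments rather than one string keeps the result
usable with subprocess.run without any shell quoting.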