Version: $Id: Making_Tables,v 1.2 1995/06/29 23:45:05 colin Exp colin $

How to Make Genefinder Tables:

1) Create .gene and sequence files from the initial GenBank files.

   Use the awk script "exhead" to produce a file containing each header,
   use the awk script "exseq" to produce the sequence file, and
   use "makegene" to convert the file of headers into .gene format.

   Note: For this and other examples "$ " is the prompt; don't type it.

   Ex:
      $ exhead gbpri.seq.z > human.hdr
      $ exseq gbpri.seq.z > human.seq
      $ makegene < human.hdr > human.g

   (suggestion - process records in stream? - integrate tools)

   Each gene record must be checked for validity of CDS and type of data.

   New Method:
      $ gzip -dc gbpri.seq.z | makegene -species sapiens \
            -basename human [-nosplice] [-mrnaonly]

   Double-check that only the correct species got selected with
      $ grep " ORGANISM " human.hdr | uniq

2) Using the .gene and sequence files, run "tables" with several dummy
   initial tables: atg, codon, intron3 and intron5.
   The header configurations of the dummy tables are listed below.

   Ex:
      $ for i in cod atg dinuc intron3 intron5
        do
           cp dum.$i humtab.$i
        done
      $ tables human.seq human.g humtab.cod humtab.atg humtab.intron3 \
            humtab.intron5

   Any erroneous coding sequences which "tables" complains about should
   either be corrected or removed from the gene file, and "tables" rerun
   until there are no error messages. Incomplete coding sequences produce
   POSSIBLE ERROR messages, which should be checked but can be ignored
   once the data is verified.

3) A dinucleotide frequency table should be created using "tables" with
   a large tract of genomic sequence, taken from the region of interest
   if possible. This table will be used with "randseq" to create a few
   random sequences with a total length of at least 1,000,000 and the
   same dinucleotide frequency as the genomic sequence, for normalizing.
   A corresponding rand.g gene file should be created as well, with
   entries for each random strand in each direction.
   Ex:
      $ tables genome.seq genome.g humtab.dinuc
      $ randseq humtab.dinuc 87467 500000 > rand.seq
      $ randseq humtab.dinuc 5723 500000 >> rand.seq
      $ echo "random87467 random87467 *
      random87467.C random87467.C *
      random5723 random5723 *
      random5723.C random5723 *" > rand.g

4) Use "tables" again with the random sequence file created in step 3
   and a different set of dummy tables to produce the normalizing tables
   (headers for the normalizing tables are given below).

   Ex:
      $ tables rand.seq rand.g rantab.trinuc rantab.atg rantab.intron3 \
            rantab.intron5

5) Create the tablenamefile (listing the table files by their paths
   relative to where genefinder will be run), then use genefinder on the
   random sequences to create a histogram file for normalizing scores.
   (Will need a dummy histogram file, given below.) Move the new
   histogram file to where the dummy histogram file was.

   Ex:
      $ echo "tables/humtab.cod
      tables/humtab.intron3
      tables/humtab.intron5
      tables/humtab.atg
      tables/rantab.trinuc
      tables/rantab.intron3
      tables/rantab.intron5
      tables/rantab.atg
      tables/ref.hist" > humtabfiles
      $ genefinder -histfile new.hist -genefile rand.g -tablenamefile \
            humtabfiles rand.seq
      $ mv new.hist tables/ref.hist

6) Check table validity by using genefinder to locate known genes.
   If the known gene was in the file used to create the tables, then
   self-sample correction should be used.

   Ex:
      $ genefinder -selfsample -genefile known.g -tablenamefile humtabfiles \
            known.seq

   Note: selfsample is a new flag; older versions should instead have
   the tablenamefile named "newnemfiles", i.e.
      $ mv humtabfiles newnemfiles
      $ genefinder -genefile known.g -tablenamefile newnemfiles known.seq

7) Tune genefinder?
   (Alter the intron penalty function and the parameters maxintronlen
   and minintronlen; the functional form is still hardcoded, as are
   cPenalty=2.0 and clustering length=15.)

8) Use genefinder on new data.
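
Since the tablenamefile in step 5 lists paths relative to the directory
where genefinder is run, it can be worth confirming that every listed
file exists before starting a run. The following is only a sketch in
plain sh: the function name check_tablenames is hypothetical (it is not
one of the genefinder tools), and it assumes the tablenamefile is a
whitespace-separated list of paths.

```sh
# Hypothetical helper, not part of the genefinder distribution.
# Prints every path listed in a tablenamefile that does not exist
# relative to the current directory; returns non-zero if any is missing.
check_tablenames() {
    listfile=$1
    missing=0
    # ASSUMPTION: entries are separated by whitespace, so one-per-line
    # and space-separated layouts are both accepted.
    for entry in $(cat "$listfile"); do
        if [ ! -f "$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return $missing
}
```

It would be run from the directory where genefinder will be started,
for example "check_tablenames humtabfiles", and genefinder invoked only
if nothing is reported missing.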

TABLE FILE HEADERS:

==> newnem.atg <==
siteType: atg
refSeqs: genes
freqType: within
classDef: unique
startOff: -9
endOff: 11
numSymbs: 1
maxSymb: 5
numForced: 3
forcedPos: 0 1 2
jump: 1
*

==> newnem.cod <==
siteType: codon
refSeqs: spliced
freqType: within
classDef: unique
startOff: 0
endOff: 2
numSymbs: 3
maxSymb: 5
numForced: 0
forcedPos:
jump: 3
class: 124 122
class: 121 123 74 72 71 73
class: 49 47 46
class: 48
class: 99 97 96 98
class: 114 112 111 113 44 42
class: 64 62 61 63
class: 39 37 36 38
class: 89 87 86 88
class: 109 107
class: 106 108 116
class: 59 57
class: 56 58
class: 34 32
class: 31 33
class: 84 82
class: 81 83
class: 119 117
class: 118
class: 69 67 66 68 41 43
class: 94 92 91 93
*

==> newnem.dinuc <==
siteType: dinuc
refSeqs: all
freqType: within
classDef: overlap
startOff: 0
endOff: 1
numSymbs: 2
maxSymb: 5
numForced: 0
forcedPos:
jump: 1
class: 6 7 8 9
class: 11 12 13 14
class: 16 17 18 19
class: 21 22 23 24
*

==> newnem.intron3 <==
siteType: intron3
refSeqs: genes
freqType: within
classDef: unique
startOff: -25
endOff: 5
numSymbs: 1
maxSymb: 5
numForced: 2
forcedPos: -2 -1
jump: 1
*

==> newnem.intron5 <==
siteType: intron5
refSeqs: genes
freqType: within
classDef: unique
startOff: -5
endOff: 25
numSymbs: 1
maxSymb: 5
numForced: 2
forcedPos: 1 2
jump: 1
*

RANDOM SEQUENCE (NORMALIZING TABLES) HEADERS

==> ranseq.atg <==
siteType: atg
refSeqs: all
freqType: within
classDef: unique
startOff: -9
endOff: 11
numSymbs: 1
maxSymb: 5
numForced: 3
forcedPos: 0 1 2
jump: 1
*

==> ranseq.intron3 <==
siteType: intron3
refSeqs: all
freqType: within
classDef: unique
startOff: -25
endOff: 5
numSymbs: 1
maxSymb: 5
numForced: 2
forcedPos: -2 -1
jump: 1
*

==> ranseq.intron5 <==
siteType: intron5
refSeqs: all
freqType: within
classDef: unique
startOff: -5
endOff: 25
numSymbs: 1
maxSymb: 5
numForced: 2
forcedPos: 1 2
jump: 1
*

==> ranseq.trinuc <==
siteType: intron
refSeqs: all
freqType: within
classDef: overlap
startOff: 0
endOff: 2
numSymbs: 3
maxSymb: 5
numForced: 0
forcedPos:
jump: 1
class: 124 122
class: 121 123 74 72 71 73
class: 49 47 46
class: 48
class: 99 97 96 98
class: 114 112 111 113 44 42
class: 64 62 61 63
class: 39 37 36 38
class: 89 87 86 88
class: 109 107
class: 106 108 116
class: 59 57
class: 56 58
class: 34 32
class: 31 33
class: 84 82
class: 81 83
class: 119 117
class: 118
class: 69 67 66 68 41 43
class: 94 92 91 93
*
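
The numbers on the "class:" lines look like base-5 indices (maxSymb: 5)
over numSymbs positions. This is an inference from the listings above,
not something these notes state: reading the digits as 1=A, 2=C, 3=G,
4=T with the first position most significant makes the single-member
classes 48 and 118 come out as ATG and TGG, and class "106 108 116"
come out as the three stop codons. The sketch below decodes an index
under that assumption; decode_index is a hypothetical helper, and
showing digit 0 as N is a guess.

```sh
# Hypothetical helper for reading "class:" lines; not a genefinder tool.
# ASSUMPTION (inferred, not documented): indices are base-5 numbers with
# digits 1=A, 2=C, 3=G, 4=T, first position most significant. Digit 0
# is printed as N, which is a guess.
decode_index() {
    index=$1
    width=$2                        # numSymbs from the table header
    result=""
    i=0
    while [ "$i" -lt "$width" ]; do
        digit=$((index % 5))        # 5 = maxSymb
        case $digit in
            1) base=A ;;
            2) base=C ;;
            3) base=G ;;
            4) base=T ;;
            *) base=N ;;
        esac
        result="$base$result"       # lowest digit is the last position
        index=$((index / 5))
        i=$((i + 1))
    done
    echo "$result"
}

decode_index 48 3       # ATG
decode_index 118 3      # TGG
for n in 106 108 116; do
    decode_index "$n" 3 # TAA, TAG, TGA
done
decode_index 6 2        # AA (first entry of the dinuc classes)
```

Under this reading each codon class groups the codons for one amino
acid (or the stops), and the four dinuc classes group dinucleotides by
their first base.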