Fundamentals of Sequence Analysis, 1998-1999
Problem set 8:  Formatting data for publication.

If you get stuck, refer to the OpenVMS and GCG resources on the 
class home page.

References:
 
  See the GCG and EGCG documentation.

Problem group 1.  Plasmid Maps

1A.  Produce a map of pBR322 (GB_SY:Synpbr322) showing all restriction
     sites that are present only once and all major features.


First, find the single-cutter restriction sites and write them to a
tick file, which will be named synpbr322.tick:

  $ mapsort/infile=gb_sy:synpbr322/plasmid/enzyme=*/once/default
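
To see what "present only once" means in practice, here is a minimal
Python sketch (not part of GCG) that counts recognition sites on a
circular sequence and keeps the enzymes that cut exactly once.  The
three recognition sites are the standard ones for those enzymes; the
39-base sequence is made up for illustration, and the function names
are hypothetical.  Real enzyme data also include ambiguous recognition
sites and cut positions, which this sketch ignores.

```python
# Toy illustration of selecting single cutters.
# The recognition sites are real; the sequence is invented, not pBR322.

def count_sites(seq, site, circular=True):
    """Count occurrences of site in seq, overlapping matches included."""
    seq = seq.upper()
    site = site.upper()
    # For a circular molecule, let matches run across the origin.
    text = seq + seq[:len(site) - 1] if circular else seq
    count = 0
    start = text.find(site)
    while start != -1:
        count += 1
        start = text.find(site, start + 1)
    return count


def single_cutters(seq, enzymes, circular=True):
    """Return the sorted names of enzymes whose site occurs exactly once."""
    return sorted(name for name, site in enzymes.items()
                  if count_sites(seq, site, circular) == 1)


ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

# Hypothetical 39-base circular sequence.
TOY = "TCAAGCTTACGGATCCTTAGGATCCATGCAAGTCCGAAT"

print(single_cutters(TOY, ENZYMES))
print(single_cutters(TOY, ENZYMES, circular=False))
```

In the toy sequence the EcoRI site spans the origin, so it is found
only when the sequence is searched as a circle; BamHI is dropped
because it cuts twice.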

Second, edit the documentary information into a file called
synpbr322.ranges.  Here are the contents of one such file: 

Cloning vector pBR322, features from comments
     Name     From       To   Strand  Color  FromSymbol  ToSymbol  Style ..

! big chunks, no direction implied

pSC101           1     1643      +     Blue       |          |     Block
Tn3           3148     4361      +     Blue       |          |     Block

! Protein coding regions

tet             86     1276      +     Black      |          >     Range
rop           1915     2106      +     Black      |          >     Range
bla           4153     3293      -     Black      |          >     Range

! little chunks, some have a direction, but are too small to put arrows on

! 3 promoters
P1              33       27      -     Green      |          |     Range
P2              43       49      +     Green      |          |     Range
P3            4194     4188      -     Green      |          |     Range
!  too small to see on this scale
!sig_peptide   4153     4085      -     Black     |          |     Range


! binding sites

echinomycin     24       27      +     Red        |          |     Tick
echinomycin     43       49      +     Red        |          |     Tick
echinomycin     53       56      +     Red        |          |     Tick
echinomycin     67       70      +     Red        |          |     Tick
echinomycin     80       83      +     Red        |          |     Tick
echinomycin    411      414      +     Red        |          |     Tick
echinomycin    469      472      +     Red        |          |     Tick
echinomycin   4268     4271      +     Red        |          |     Tick
echinomycin   4280     4283      +     Red        |          |     Tick
echinomycin   4285     4288      +     Red        |          |     Tick
echinomycin   4296     4299      +     Red        |          |     Tick
echinomycin   4311     4314      +     Red        |          |     Tick
echinomycin   4317     4320      +     Red        |          |     Tick
echinomycin   4331     4334      +     Red        |          |     Tick
dnaA          2439     2447      +     Red        |          |     Tick
rep_origin    2535     2535      +     Red        |          |     Tick
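
A hand-edited file like this is easy to get wrong, so a quick check
before plotting can save a run.  The following Python sketch (not part
of GCG) parses rows in the layout shown above and reports coordinates
outside the plasmid or a strand sign that disagrees with the order of
From and To.  The layout rules are inferred from the listing itself,
not from the GCG manual, and the function names are hypothetical.

```python
# Sanity check for a ranges file like the one above.
# Assumptions (inferred from the listing, not from the GCG manual):
#   - everything up to the header line ending in ".." is documentation
#   - "!" starts a comment line
#   - each data row has eight whitespace-separated fields
#   - "-" strand rows are written with From greater than To

FIELDS = ("name", "start", "end", "strand", "color",
          "from_symbol", "to_symbol", "style")


def parse_ranges(text):
    """Return one dict per feature row."""
    rows = []
    in_data = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_data:
            # Data rows begin after the header line that ends in "..".
            in_data = stripped.endswith("..")
            continue
        if not stripped or stripped.startswith("!"):
            continue
        parts = stripped.split()
        if len(parts) != len(FIELDS):
            raise ValueError("unexpected row: " + line)
        row = dict(zip(FIELDS, parts))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


def check_ranges(rows, length):
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    for row in rows:
        if not (1 <= row["start"] <= length and 1 <= row["end"] <= length):
            problems.append(row["name"] + ": coordinate outside 1.." + str(length))
        reverse = row["start"] > row["end"]
        if reverse != (row["strand"] == "-"):
            problems.append(row["name"] + ": strand disagrees with From/To order")
    return problems


SAMPLE = """\
Cloning vector pBR322, features from comments
     Name     From       To   Strand  Color  FromSymbol  ToSymbol  Style ..

! Protein coding regions
tet             86     1276      +     Black     |          >     Range
bla           4153     3293      -     Black     |          >     Range
!sig_peptide   4153     4085      -     Black     |          |     Range
rep_origin    2535     2535      +     Red        |          |     Tick
"""

# 4361 is the length of pBR322.
print(check_ranges(parse_ranges(SAMPLE), 4361))
```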

Third, list both files in synpbr322.fil:

..
synpbr322.tick
synpbr322.ranges

Lastly, plot it:

$  plasmidmap/infile=@synpbr322.fil/noboldranges/noboldblocks/shaded=0

and here is the result:

  [Figure: PlasmidMap plot of pBR322, not reproduced in this text version.]



Problem group 2.  Multiple sequence formatting

Align all Troponin C entries in SWISS-PROT.


Use LOOKUP with the word TROPONIN.  Then edit the resulting file to
leave only the Troponin C entries.  Run pileup: 

 $ pileup/infile=@lookup.list/out=tropc.msf

Here is part of the MSF file showing aligned residues 51 to 100:

            51                                                 100
Tpca_Homam  GVKISEKNLQ EVIAETDEDG SGELEFEEFV ELAAKFLIEE DEEAL....K 
Tpcb_Homam  GVKISEKNLQ EVISETDEDG SGELEFEEFV ELAAKFLIEE DEEAL....K 
Tpc2_Ponle  GVKISEKNLQ QVIAETDEDG SGELEFEEFV ELAAKFLIEE DEEAL....K 
Tpc1_Homam  GVKISDRHLQ EVISETDEDG SGEIEFEEFA ALAAKFLSEE DEEAL....K 
Tpc1_Ponle  GVKISERHLQ QVISETDEDG SGEIEFEEFA ELAAKFLSEE DEEAL....K 
Tpc1_Balnu  GIKVSSTSFK QIIEEIDEDG SGQIEFSEFL QLAAKFLIEE DEEAM....M 
Tpc2_Balnu  GQAYNAQTLK ELIDEVDADG SGMLEFEEFV TLAAKFIIDD DAEAM....A 
 Tpc_Tactr  GQTFEEKDLK DLIAEIDQDG SGELEFEEFM ALAARFLVEE DAEAM....Q 
Tpcc_Human  GQNPTPEELQ EMIDEVDEDG SGTVDFDEFL VMMVRCMKDD SKGK....SE 
Tpcc_Rabit  GQNPTPEELQ EMIDEVDEDG SGTVDFDEFL VMMVRCMKDD SKGK....SE 
Tpcc_Mouse  GQNPTPEELQ EMIDEVDEDG SGTVDFDEFL VMMVRCMKDD SKGK....SE 
Tpcc_Chick  GQNPTPEELQ EMIDEVDEDG SGTVDFDEFL VMMVRCMKDD SKGK....TE 
Tpcc_Cotja  GQNPTPEELQ EMIDEVDEDG SGTVDFDQFL VMMVRCMKDD SKGK....TE 
Tpcs_Chick  GQNPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
Tpcs_Melga  GQNPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
Tpcs_Mouse  GQTPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
Tpcs_Rabit  GQTPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
Tpcs_Human  GQTPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
  Tpcs_Pig  GQTPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AKGK....SE 
Tpcs_Ranes  GQTPTKEELD AIIEEVDEDG SGTIDFEEFL VMMVRQMKED AQGK....SE 
 Tpc_Halro  GQNPTEKDLQ EMIEEVDIDG SGTIDFEEFC LMMYRQMQAQ EEAKIPEREE 
 Tpc_Brala  GMSISREELQ QMIDEVDEDA SGTIDFEEFL EMMARAMQDS ERE....IPD 
 Tpc_Patye  GLLVKDDKIK DWSDEMDEEA TGRLNCDAWI QLFERKLKED LDER...... 



2A.  Format the .MSF file using Pretty showing the consensus and differences.

$ pretty/infile=tropc.msf{*}/outfile=tropc.pretty/consens -
   /differences="-"/plural=12.5/default

Here is part of the result, showing aligned residues 51 to 100:

                       51                                                 100
Tropc.Msf{Tpca_Homam}  -vkisekn-q e--a-t---- --el-----v e-aakf-i-- deeal....k 
Tropc.Msf{Tpcb_Homam}  -vkisekn-q e--s-t---- --el-----v e-aakf-i-- deeal....k 
Tropc.Msf{Tpc2_Ponle}  -vkisekn-q q--a-t---- --el-----v e-aakf-i-- deeal....k 
Tropc.Msf{Tpc1_Homam}  -vkisdrh-q e--s-t---- --e------a a-aakf-s-- deeal....k 
Tropc.Msf{Tpc1_Ponle}  -vkiserh-q q--s-t---- --e------a e-aakf-s-- deeal....k 
Tropc.Msf{Tpc1_Balnu}  -ikvssts-k q--------- --q---s--- q-aakf-i-- deeam....m 
Tropc.Msf{Tpc2_Balnu}  --aynaqt-k el-----a-- --ml-----v t-aakfii-- daeam....a 
 Tropc.Msf{Tpc_Tactr}  --tfeek--k dl-a---q-- --el------ a-aa-f-v-- daeam....q 
Tropc.Msf{Tpcc_Human}  --n--p---q em-------- ---------- v--v-c---- skg-....s- 
Tropc.Msf{Tpcc_Rabit}  --n--p---q em-------- ---------- v--v-c---- skg-....s- 
Tropc.Msf{Tpcc_Mouse}  --n--p---q em-------- ---------- v--v-c---- skg-....s- 
Tropc.Msf{Tpcc_Chick}  --n--p---q em-------- ---------- v--v-c---- skg-....t- 
Tropc.Msf{Tpcc_Cotja}  --n--p---q em-------- -------q-- v--v-c---- skg-....t- 
Tropc.Msf{Tpcs_Chick}  --n--k---d a--------- ---------- v--v-q---- akg-....s- 
Tropc.Msf{Tpcs_Melga}  --n--k---d a--------- ---------- v--v-q---- akg-....s- 
Tropc.Msf{Tpcs_Mouse}  --t--k---d a--------- ---------- v--v-q---- akg-....s- 
Tropc.Msf{Tpcs_Rabit}  --t--k---d a--------- ---------- v--v-q---- akg-....s- 
Tropc.Msf{Tpcs_Human}  --t--k---d a--------- ---------- v--v-q---- akg-....s- 
  Tropc.Msf{Tpcs_Pig}  --t--k---d a--------- ---------- v--v-q---- akg-....s- 
Tropc.Msf{Tpcs_Ranes}  --t--k---d a--------- ---------- v--v-q---- aqg-....s- 
 Tropc.Msf{Tpc_Halro}  --n--ek--q em-----i-- ---------c l--y-q-qaq eea-ipere- 
 Tropc.Msf{Tpc_Brala}  -msisr---q qm-------a ---------- e--a-a-q-s ere....ip- 
 Tropc.Msf{Tpc_Patye}  -llvkd-kik dws--m---a t-rlnc-a-i q-fe-k---- lder...... 
            Consensus  GQ-PT-EEL- -IIEEVDEDG SGTIDFEEFL -MM-R-MKED ---K-----E 
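
The display above can be pictured with a small Python sketch (not part
of GCG).  It takes the most common residue in each column as the
consensus when it has at least "plurality" votes, then shows each
sequence with a "-" where it agrees and a lowercase letter where it
differs.  This is a simplification under stated assumptions: votes are
plain identity counts and gap characters (".") do not vote.  Pretty's
own voting rule may be more elaborate; see the GCG documentation.  The
five short rows are invented, and the function names are hypothetical.

```python
from collections import Counter

GAP = "."


def consensus(rows, plurality):
    """Most common non-gap symbol per column, or '-' below the threshold."""
    out = []
    for column in zip(*rows):
        votes = Counter(symbol for symbol in column if symbol != GAP)
        if votes:
            symbol, count = votes.most_common(1)[0]
        else:
            symbol, count = "-", 0
        out.append(symbol if count >= plurality else "-")
    return "".join(out)


def differences(row, cons, same="-"):
    """Show agreement as `same`, disagreement in lowercase, gaps unchanged."""
    shown = []
    for symbol, agreed in zip(row, cons):
        if symbol == GAP:
            shown.append(GAP)
        elif symbol == agreed:
            shown.append(same)
        else:
            shown.append(symbol.lower())
    return "".join(shown)


# Invented toy alignment, five sequences of five columns.
ROWS = ["GVKIS", "GVKIS", "GQKPT", "GQNPT", "GQN.T"]

cons = consensus(ROWS, plurality=3)
for row in ROWS:
    print(differences(row, cons))
print(cons)
```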


2B.  Format the .MSF file using PrettyPlot.  Make the consensus Black, 
     identity Green, similarity Blue, and differences Red.  Also, turn
     off the boxes around similar sequences.

$! set up your GCG graphics first!!!
$ prettyplot/infile=tropc.msf{*}/consens -
  /CCOLOR/CIDE=green/CSIM=blue/COTher=Red/CCONS=black -
  /nobox/default

Here is part of the result, showing aligned residues 51 to 100:

  [Figure: PrettyPlot output, not reproduced in this text version.]



2C.  Format the .MSF file using PrettyBOX.  Show a consensus, using
     output lines of 50 characters with no "block" spacing on the
     line.  Otherwise, use the default settings.

$! set up GCG graphics to put the result into a Postscript file
$ postscript laserwriter tropc_box.ps
$ prettybox/infile=tropc.msf{*}/consens/block=50/line=50/default
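
One way to picture why setting the block size equal to the line size
removes the "block" spacing is the following Python sketch (not part
of GCG).  It assumes that the line size is the number of symbols per
output line and the block size is the number of symbols between
spaces; that reading follows the wording of the problem, not the
PrettyBox manual.  The function name is hypothetical, and the row is
Tpca_Homam residues 51 to 100 from the MSF excerpt above.

```python
def layout(seq, line=50, block=10):
    """Split seq into lines of `line` symbols with a space every `block`."""
    lines = []
    for i in range(0, len(seq), line):
        chunk = seq[i:i + line]
        blocks = [chunk[j:j + block] for j in range(0, len(chunk), block)]
        lines.append(" ".join(blocks))
    return lines


ROW = "GVKISEKNLQEVIAETDEDGSGELEFEEFVELAAKFLIEEDEEAL....K"

print(layout(ROW, line=50, block=10)[0])  # five blocks of ten
print(layout(ROW, line=50, block=50)[0])  # one unbroken run of fifty
```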

Here is part of the result, showing aligned residues 51 to 100:

  [Figure: PrettyBox output, not reproduced in this text version.]



Problem group 3.  Single sequence formatting

3A.  Format GB_IN:Dmhish1 for publication showing:

     1.  Protein translation under the DNA (3 letter form)
     2.  Forward sequence only (no reverse sequence)
     3.  Dots every 10 bases, above the DNA
     4.  Number at the ends of the dot line
     5.  Two blank lines between each group of lines
     6.  50 bases per line


Begin by rearranging the requirements into a "top down" order
corresponding to what is desired in the final file.  Then translate
each requirement to the corresponding menu letter (being careful of
case!). 

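     3.  Dots every 10 bases, above the DNA                      -> b
     4.  Number at the ends of the dot line                      -> b -> B
     2.  Forward sequence only (no reverse sequence)             -> c
     1.  Protein translation under the DNA (3 letter form)       -> f
     5.  Two blank lines between each group of lines             -> ii

The remaining requirement, 50 bases per line, is not a menu letter; it
is met by answering 50 at the "symbols per block" prompt.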
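
The translation from requirements to a selection string can be
sketched in Python (not part of GCG).  The letters are copied from the
Publish menu in the log that follows; the dictionary keys are
paraphrased labels, and the function name is hypothetical.  Uppercase
marks a line that is numbered at its ends.

```python
# Letters from the Publish menu; the key names are paraphrased labels.
MENU = {
    "number line": "a",
    "dot scale line": "b",
    "sequence": "c",
    "dash scale line": "d",
    "complement": "e",
    "translation (3 letter)": "f",
    "translation (1 letter)": "g",
    "tagged blank line": "h",
    "blank line": "i",
    "2nd sequence (diff)": "j",
    "2nd sequence (all)": "k",
    "match line": "l",
}


def selection(lines):
    """Build the reply from (label, numbered_at_ends) pairs, top down."""
    letters = []
    for label, numbered in lines:
        letter = MENU[label]
        letters.append(letter.upper() if numbered else letter)
    return "".join(letters)


WANTED = [
    ("dot scale line", True),           # requirements 3 and 4
    ("sequence", False),                # requirement 2
    ("translation (3 letter)", False),  # requirement 1
    ("blank line", False),              # requirement 5, first blank
    ("blank line", False),              # requirement 5, second blank
]

print(selection(WANTED))
```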

The header of the file shows the protein coding region running from 106
to 876.  Here is the log of the run.  The responses follow each prompt,
and {return} means just press the "return" (or possibly "enter") key. 

$ publish/infile=gb_in:dmhish1

Publish arranges sequences for publication.  It creates a text file
that you can modify to your own needs with a text editor. 

                  Begin (* 1 *) ?  {return}
                End (*   944 *) ?  {return}

  Please select the lines in the order in which
  you want them to appear in the figure.
      a) number line        :                10
      b) dot scale line     :                 .         .
      c) the sequence itself:        GAATTCACGATCGATCGTAG
      D) dash scale line    :     1  ---------+---------+  20
      e) the complement     :        CTTAAGTGCTAGCTAGCATC
      f) translation        :        GluPheThrIleAspArg
      g) translation        :        E  F  T  I  D  R
      h) tagged blank line  :   ###
      i) blank line         :
      j) 2nd sequence (diff):                C     G
      k) 2nd sequence (all) :        GAATTCACCATCGAGCGTAG
      l) match line         :        |||||||| ||||| |||||

  Select the lines in the order you wish them to appear
  and then press <Return>.  Use uppercase to identify the
  lines that you want numbered at the ends
                       (* cDefii *)  Bcfii
 What number is the first symbol in the row 1 ( * 1 *) ?  {return}
 Please enter the ranges of translation for
 translation line number 1
 using the original coordinates of the sequence.
                  Begin (*    1 *) ?  106
                    End (*  944 *) ?  876
 Get another range from this sequence
 (* Yes *) ?  No
 How many symbols per block (* 60 *)? 50
 How many blocks do you want on each line (* 1 *) ? {return} 
 What should I call the output file (* Dmhish1.publish *) ?  {return}

Here is the formatted result:

 
 
      1           .         .         .         .         .   50
         AGTGTTAAAGTGCTCTCCTCCTCGATTCTCATCAGAGCAAAGGAGGTTGG         
                                                                    
                                                                    
     51           .         .         .         .         .   100
         TAGGCAGCGCGCGAGCCATTTTTAACAGAAAAAAAGTGTTCTCAGTGAAA         
                                                                    
                                                                    
    101           .         .         .         .         .   150
         AAAAGATGTCTGATTCTGCAGTTGCAACGTCCGCTTCCCCAGTGGCTGCC         
              MetSerAspSerAlaValAlaThrSerAlaSerProValAlaAla         
                                                                    
                                                                    
    151           .         .         .         .         .   200
         CCACCAGCGACAGTTGAGAAGAAAGTGGTCCAAAAAAAGGCATCTGGATC         
         ProProAlaThrValGluLysLysValValGlnLysLysAlaSerGlySe         
                                                                    
                                                                    
    201           .         .         .         .         .   250
         TGCTGGCACAAAGGCAAAGAAAGCCTCTGCGACGCCGTCACATCCGCCAA         
         rAlaGlyThrLysAlaLysLysAlaSerAlaThrProSerHisProProT         
                                                                    
                                                                    
    251           .         .         .         .         .   300
         CTCAGCAAATGGTGGACGCTTCCATTAAAAATTTAAAGGAACGTGGCGGT         
         hrGlnGlnMetValAspAlaSerIleLysAsnLeuLysGluArgGlyGly         
                                                                    
                                                                    
    301           .         .         .         .         .   350
         TCATCACTTCTGGCAATCAAAAAATATATCACTGCCACTTATAAATGCGA         
         SerSerLeuLeuAlaIleLysLysTyrIleThrAlaThrTyrLysCysAs         
                                                                    
                                                                    
    351           .         .         .         .         .   400
         CGCCCAAAAGTTAGCGCCATTCATCAAGAAGTACTTAAAATCGGCCGTGG         
         pAlaGlnLysLeuAlaProPheIleLysLysTyrLeuLysSerAlaValV         
                                                                    
                                                                    
    401           .         .         .         .         .   450
         TCAATGGAAAGCTTATTCAAACTAAGGGAAAGGGTGCATCTGGATCTTTC         
         alAsnGlyLysLeuIleGlnThrLysGlyLysGlyAlaSerGlySerPhe         
                                                                    
                                                                    
    451           .         .         .         .         .   500
         AAACTGTCGGCCTCTGCCAAGAAGGAAAAGGATCCGAAGGCAAAGTCGAA         
         LysLeuSerAlaSerAlaLysLysGluLysAspProLysAlaLysSerLy         
                                                                    
                                                                    
    501           .         .         .         .         .   550
         GGTTTTGTCTGCTGAGAAAAAAGTTCAAAGCAAGAAGGTAGCCTCTAAGA         
         sValLeuSerAlaGluLysLysValGlnSerLysLysValAlaSerLysL         
                                                                    
                                                                    
    551           .         .         .         .         .   600
         AGATTGGTGTCTCCTCCAAAAAAACTGCCGTTGGGGCTGCTGACAAAAAG         
         ysIleGlyValSerSerLysLysThrAlaValGlyAlaAlaAspLysLys         
                                                                    
                                                                    
    601           .         .         .         .         .   650
         CCCAAAGCTAAGAAGGCTGTGGCTACCAAAAAGACTGCCGAAAATAAGAA         
         ProLysAlaLysLysAlaValAlaThrLysLysThrAlaGluAsnLysLy         
                                                                    
                                                                    
    651           .         .         .         .         .   700
         AACTGAGAAGGCAAAAGCCAAGGATGCCAAGAAAACTGGAATCATAAAGT         
         sThrGluLysAlaLysAlaLysAspAlaLysLysThrGlyIleIleLysS         
                                                                    
                                                                    
    701           .         .         .         .         .   750
         CGAAGCCCGCCGCAACAAAGGCGAAAGTGACTGCAGCGAAGCCAAAGGCT         
         erLysProAlaAlaThrLysAlaLysValThrAlaAlaLysProLysAla         
                                                                    
                                                                    
    751           .         .         .         .         .   800
         GTAGTAGCGAAAGCGTCAAAGGCAAAGCCAGCGGTGTCTGCAAAACCCAA         
         ValValAlaLysAlaSerLysAlaLysProAlaValSerAlaLysProLy         
                                                                    
                                                                    
    801           .         .         .         .         .   850
         AAAGACGGTGAAGAAAGCATCGGTTTCTGCTACCGCCAAGAAGCCGAAAG         
         sLysThrValLysLysAlaSerValSerAlaThrAlaLysLysProLysA         
                                                                    
                                                                    
    851           .         .         .         .         .   900
         CGAAGACTACGGCTGCCAAAAAGTAAATTGTGAAAAAGTGCAGTATTTGG         
         laLysThrThrAlaAlaLysLysEnd                                 
                                                                    
                                                                    
    901           .         .         .         .      944
         TACATGTTCGCAATTAAAATTTTAGATTTATGATTTATAGATCT        
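
For readers who want to see how such a figure is assembled, here is a
minimal Python sketch (not the Publish program itself) that builds the
same kinds of lines: a dot scale numbered at its ends, the sequence, a
three-letter translation placed under its codons, and two blank lines.
It uses the standard genetic code and assumes the sequence contains
only A, C, G and T.  The column spacing only approximates the output
above, and all function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
THREE = {"A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
         "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
         "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
         "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
         "*": "End"}


def dot_line(n, every=10):
    """A line of n columns with a dot at every tenth position."""
    return "".join("." if i % every == 0 else " " for i in range(1, n + 1))


def translation_track(seq, begin, end):
    """Three-letter codes under their codons; begin/end are 1-based."""
    track = [" "] * len(seq)
    for pos in range(begin - 1, end - 2, 3):
        codon = seq[pos:pos + 3].upper()
        track[pos:pos + 3] = THREE[CODON_TABLE[codon]]
    return "".join(track)


def publish(seq, begin, end, width=50):
    """Return the figure as a list of text lines."""
    track = translation_track(seq, begin, end)
    out = []
    for start in range(0, len(seq), width):
        chunk = seq[start:start + width]
        first, last = start + 1, start + len(chunk)
        out.append(f"{first:>7}  {dot_line(len(chunk))}   {last}")
        out.append(" " * 9 + chunk)
        out.append((" " * 9 + track[start:start + width]).rstrip())
        out.extend(["", ""])
    return out


# The first 24 bases of the 101-150 line above, renumbered from 1,
# so the ATG at original position 106 is at position 6 here.
SEQ = "AAAAGATGTCTGATTCTGCAGTTG"
print("\n".join(publish(SEQ, begin=6, end=23, width=12)))
```

Because the translation is laid out on the full sequence before it is
cut into lines, a codon that straddles a line break is split across
two lines, as in the "Se" / "r" break in the output above.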
                                                             



Problem group 4.  Moving graphics files to your own machine

4A.  Configure GCG graphics to use the CGM, EPSF, or HPGL
     graphics driver, then regenerate the PlasmidMap graphic from
     problem 1, being careful to use the /FONT=0 qualifier, and move
     the file so produced to your Macintosh or Windows machine.  
     Which programs do you have, if any, which could open/import the
     file?  Which pieces were you able to modify?

See the lecture for the list of tools and platforms known to work with
these file types.  If you found a tool that would open or import the
file, you should have been able to modify the fonts and sizes of the
text and the various line properties.  Arcs and circles, however,
should have turned out to be composed of many small line segments, and
the "arrowheads" on their ends to be separate line segments.