Fundamentals of Sequence Analysis, 1998-1999

Lecture 8. Formatting data for publication


Introduction

Today we're going to go over some of the tools that you can use to format your data for publication. We'll start with a review of GCG graphics, then discuss methods for moving and modifying graphics files, and conclude with a look at several programs that you can use to generate publication-quality graphics.


Review of GCG graphics formats

The first thing that I want to do is to review in some detail the way the GCG graphics system works. All GCG programs that output graphics write the instructions for the graphic to a common set of graphics routines. The graphics driver that the routines use must be preconfigured by the user before the program is run. The GCG graphics system supports 8 classes of graphics devices, which are:

  GKS
  HPGL
  POSTSCRIPT
  REGIS
  SIXEL
  TEKTRONIX
  XWINDOWS
  CGM        Not a standard GCG driver

Within each class of graphics device there are several specific pieces of hardware or graphics formats that are supported. For instance, GCG Postscript supports the following:

 LaserWriter
 Lzr1200
 LN03-ScriptPrinter
 LPS20
 ColorScript-100
 EPSF (single page encapsulated postscript format)

Modifying graphics files on a Macintosh or Windows machine

Unfortunately, and this is a big, big problem, there is no direct path from GCG graphics to either the Macintosh PICT or Windows metafile (WMF) formats. This means that if you want to modify a graphic on your Macintosh or Windows machine at the "object" level, you will need some specialized piece of software to convert one of the supported output formats to PICT or WMF.

Let me explain further just where the limitations lie. Let's say that you have a copy of Versaterm-Pro or some other Tektronix emulator. Most of these have an option to save the graphics screen to a PICT file. Unfortunately, the GCG graphics driver for Tektronix devices uses "stroke fonts" to represent letters. That is, rather than saying to the terminal "put this text at this position" it sends a series of "move,draw,move,draw" commands, resulting in a large number of line segments that draw the text. That graphics driver also does similar things with otherwise atomic graphics objects, such as circles. This means that when you open that PICT file it will contain a humongous number of short line segments, and no editable text or atomic graphics objects other than lines. On our system there are three GCG graphics drivers which will preserve some atomic graphics objects, most notably, text as text.


Graphics drivers which retain objects as objects

EPSF format

Some commercial programs, for instance Adobe Illustrator, can open and manipulate encapsulated postscript (EPSF) files. If you have one, that is the method of choice for moving GCG graphics to your system. Encapsulated postscript differs from plain Postscript primarily in that it restricts the graphic to lie within a declared box. This makes it possible for another program to output a file that says, essentially, "this region on the page is a graphic specified by this EPSF code," and then insert that code verbatim. This isn't an option with regular Postscript, because the pasted-in Postscript would usually contain page control information that conflicts with the outer, wrapper Postscript. Transfer EPSF files using ASCII mode in FTP.
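The box that an EPSF file is restricted to is declared near the top of the file in a %%BoundingBox comment, which is how an importing program knows what region the graphic occupies. As a rough illustration only (Python is not part of the GCG system, and the function name and sample header here are made up for the example), this sketch pulls the box out of the text of an EPSF file:

```python
import re

def read_bounding_box(eps_text):
    """Return (llx, lly, urx, ury) from the first %%BoundingBox comment
    that carries four integers, or None if there is no such comment."""
    pattern = re.compile(
        r"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)",
        re.MULTILINE,
    )
    match = pattern.search(eps_text)
    if match is None:
        return None  # plain Postscript, or a header this sketch does not handle
    return tuple(int(value) for value in match.groups())

# Hypothetical header, not copied from real GCG output.
sample = "%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 36 36 576 756\n%%EndComments\n"
print(read_bounding_box(sample))
```

If this returns None for a file you made with the EPSF driver, that is a hint the file is not really encapsulated and an import into a drawing program is likely to fail.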


HPGL format

HPGL is a page description language more or less of the same type as Postscript. If you still have access to a Macintosh with HyperCard installed, you may use the HP2PICT HyperCard stack to translate the HPGL file into PICT format. Otherwise, use Thorsten Lemke's GraphicConverter program. By selecting the Convert More option on the File menu you can easily convert HPGL to PICT while retaining the object information in the file. On Windows, Corel Draw 6.0 can import an HPGL file and translate it on the fly, so that you can then edit it. Transfer HPGL files in BINARY mode in FTP.


CGM format

The only standard GCG graphics drivers that leave text as text are those for the HPGL and Postscript formats. Because it is so difficult to read these into Windows and Macintosh programs for modification, I wrote a driver for the computer graphics metafile (CGM) format, which also represents graphical images as a collection of objects. Support for this format is extremely variable though. Some versions of some programs work, some don't. In most cases you have to explicitly ask for extra graphic file support at installation or the relevant CGM files won't be installed. Support within application "suites" tends to be the same for every program in the package. The CGM file produced by this driver must be transferred in binary FTP mode.

CGM driver compatibility table

Platform          Software              Graphics  Text
Mac               Canvas 3.0            good      good
Mac               Canvas 5.0            good      broken
Mac               GraphicConverter 2.5  broken    broken
Windows 95/98/NT  MS Office 97          broken    broken
Windows 95/98/NT  MS Office 97 SR-1     good      good
Windows 95/98/NT  Lotus SmartSuite 97   good      good
Windows 95/98/NT  Corel Draw 6.0        good      good

The importance of using the hardware font

Whenever you use the CGM, HPGL, or Postscript drivers to make a file which you intend to convert eventually to an object graphics file on another platform, it is essential that you add the /FONT=0 qualifier to the GCG command. This forces the graphics driver to use the "hardware" font, instead of one of the 21 stroke fonts. That is, it says, "use text, not a series of lines, to represent each character." For instance, in the following example, we show that the text strings "5,000" and "10,000" are present in the first output file (for font 0), but not in the second (for font 1).

$ postscript epsf killme.ps
$ frames/infile=gb_in:dmwhite/font=0/default
$ search killme.ps ",000"
2 0 (5,000) ashow
2 0 (10,000) ashow
$ frames/infile=gb_in:dmwhite/font=1/default
$ search killme.ps ",000"
%SEARCH-I-NOMATCHES, no strings matched
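
If you have already moved the Postscript file to another machine, the same check can be done there. The following is a minimal Python sketch, assuming that the driver writes text in the "(string) ashow" form seen in the SEARCH output above; the function name is made up, and the stroke-font sample line is invented to stand in for a file that contains only line segments.

```python
import re

# Matches "(some text) show" or "(some text) ashow". Escaped or nested
# parentheses inside the string are not handled in this sketch.
SHOW_STRING = re.compile(r"\(([^()]*)\)\s+a?show\b")

def text_strings(postscript_text):
    """Return the literal strings passed to show/ashow, in file order."""
    return SHOW_STRING.findall(postscript_text)

font0 = "2 0 (5,000) ashow\n2 0 (10,000) ashow\n"   # lines from the example above
font1 = "10 20 moveto 12 22 lineto stroke\n"        # invented stroke-style output
print(text_strings(font0))  # ['5,000', '10,000']
print(text_strings(font1))  # []
```

An empty list means that the labels were drawn as strokes, so there will be no editable text after conversion.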

When GCG graphics are good enough

Many times GCG graphics are good enough and do not require any further modification on a Macintosh or Windows machine before publication. In this case, configure your illustration using only the GCG tools, and print the result to a laser printer. Color output can be obtained from our color laser printer, and also from the slide maker.


Printing to the color laser printer

In the following example, a color graphic is sent to the color laser printer. Remember, there is a charge for each page printed on this device, so it is a very good idea to suppress the flag page (the one which says who sent the job).

$ postscript ColorScript-100 |print/queue=saf_magicolor_ps/noflag
$ frames/infile=gb_in:dmwhite/default

Making a slide

Making a slide is a bit more work. The first method, using an intermediate postscript file, is known to work reliably.

A color postscript file is created and moved to a disk with more free space (vardisk:); if you work on usrdisk: instead, you will likely exceed your disk quota. The file is then converted to a TIFF file with Ghostscript and transferred to a user's directory on slidemaker, the Windows NT computer which drives the LaserGraphics Personal LFR Plus film recorder. Later, use the WinRascol program on slidemaker to actually render the slide. Note that the .tif extension uses only one f; otherwise WinRascol and other applications may not automatically recognize the file as a TIFF.

$ postscript ColorScript-100 slide.ps
$ frames/infile=gb_in:dmwhite/default
$ copy slide.ps vardisk:
$ set default vardisk:
$ gs "-sDEVICE=tiff24nc" -
     "-sOutputFile=slide.tif" -
     "-dORIENT1=false" "-r200" -
     "-dNOPAUSE" "-dBATCH" slide.ps
$ anonftp slidemaker
     cd "yourname X1234"
     binary
     put slide.tif slide.tif
     exit
$ delete slide.tif;
$ delete slide.ps;

The second method is to direct graphics output to an HPGL file, which the WinRascol program can render directly. This method is simpler than the one just described, and since it makes much smaller files, it isn't necessary to work on a different disk. However, it has not been tested extensively, so we don't know how reliable it is.

$ hpgl HP7550 SLIDE.HPGL a4
$ frames/infile=gb_in:dmwhite/default
$ anonftp slidemaker
     cd "yourname X1234"
     binary
     put slide.hpgl slide.hpgl
     exit
$ delete slide.hpgl;

Editing graphics using only GCG tools

You don't always have to move a graphics file to your Macintosh or Windows machine to modify it. All of the GCG graphics programs can be induced to emit a FIGURE file, instead of sending a plot directly to a target device, by adding the /FIGure qualifier to the command line. The resultant FIGURE file is a text file which contains instructions for drawing, scaling, pen size, font size, and so forth. The program FIGURE can then process this file and send it to a graphics device. So you may produce a final plot by cycling between editing its FIGURE file and viewing the result with the FIGURE program. Once you understand how these files work, it is often quite easy to modify them to show what you want. For instance, you can overlap plots from different programs by using an editor to merge the two FIGURE files, and then adjusting the Viewport command for each piece. To learn more about the commands inside a FIGURE file, consult the manual or online help system.

$ genhelp figure

Here is an example of how to make, and then view, a FIGURE file.

$ tektronix versaterm tt
$ mapplot/infile=gb_in:dmwhite/enzyme=ecori -
  /begin=1000/end=2000/figure/default
$ figure/infile=mapplot.figure/nomark

Text lines which will appear in the plot are marked by .plottext or .pt, so now edit the file (it doesn't matter which editor), and redraw. Here we change the title line, and delete the bottom two text lines.

$ edit mapplot.figure
$ diff mapplot.figure
************
File MAPPLOT.FIGURE;5
   13   .pt New title
   14   .relm 0 -2.6
******
File MAPPLOT.FIGURE;4
   13   .pt (Linear) MAPPLOT of: Gb_In:Dmwhite ck: 9858, 1000 to: 2000  March  5, 1999 10:09.
   14   .relm 0 -2.6
************
************
File MAPPLOT.FIGURE;5
   56   .relm 0 -2
******
File MAPPLOT.FIGURE;4
   56   .pt Enzymes that do not cut:
   57   .relm 0 -2
************
************
File MAPPLOT.FIGURE;5
   59   .relm 0 -2
******
File MAPPLOT.FIGURE;4
   60   .pt NONE
   61   .relm 0 -2
************

$ figure/infile=mapplot.figure/nomark
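
Before editing a long FIGURE file it can help to see just the text lines and where they are. Here is a small Python sketch of that idea, for use on a copy of the file on your own machine. It assumes only what was said above, that text lines begin with .pt or .plottext; treating the command as case-insensitive is my assumption, and the sample text is pieced together from lines in the diff output.

```python
def plot_text_lines(figure_text):
    """Return (line_number, text) for each .pt / .plottext line."""
    found = []
    for number, line in enumerate(figure_text.splitlines(), start=1):
        parts = line.strip().split(None, 1)   # command, then the rest of the line
        if parts and parts[0].lower() in (".pt", ".plottext"):
            found.append((number, parts[1] if len(parts) > 1 else ""))
    return found

sample = ".vp 20 120 0 100\n.pt New title\n.relm 0 -2.6\n.pt NONE\n"
for number, text in plot_text_lines(sample):
    print(number, text)
```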

Each GCG graphics driver writes to an abstract device with platen coordinates (0,0) in the lower left corner and (150,100) in the upper right corner. Use the Viewport (.vp) command to change the region of this device which will be drawn, and the Window (.wn) command to change the coordinates within that region to be used by all subsequent drawing commands. For instance, a final figure which consists of four plots in a "plate" organization would usually have four Viewport commands in it, one for each plate. Here we modify the Viewport in the example FIGURE file to decrease the size of the plot, and then display the modified plot:

$ edit mapplot.figure
 using the editor, change

.vp 20 120 0 100

  to

.vp 40 80 0 100

$ figure/infile=mapplot.figure/nomark

In this example we make a plate with two figures in it. Usually the second plate would be from a different source than the first!

$ edit mapplot.figure
 using the editor, duplicate the entire contents from
.vp on down, and modify the .vp and .pt lines

.vp 20 60 0 50
.pt First title

.vp 90 130 50 100
.pt Second title

$ figure/infile=mapplot.figure/nomark
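
Working out the numbers for a plate by hand gets tedious when there are more than two plots. The Python sketch below computes evenly spaced .vp lines for a grid on the 150 by 100 platen. The argument order (xmin xmax ymin ymax) is inferred from the examples above rather than quoted from the manual, and the function name and the margin value are arbitrary choices for illustration.

```python
PLATEN_WIDTH, PLATEN_HEIGHT = 150, 100

def grid_viewports(rows, cols, margin=5):
    """Return one '.vp xmin xmax ymin ymax' line per cell, top row first."""
    cell_w = (PLATEN_WIDTH - margin * (cols + 1)) / cols
    cell_h = (PLATEN_HEIGHT - margin * (rows + 1)) / rows
    lines = []
    for row in range(rows):
        # Row 0 is the top row, so it gets the highest y range.
        y_min = PLATEN_HEIGHT - (row + 1) * (cell_h + margin)
        for col in range(cols):
            x_min = margin + col * (cell_w + margin)
            lines.append(".vp %g %g %g %g"
                         % (x_min, x_min + cell_w, y_min, y_min + cell_h))
    return lines

for line in grid_viewports(2, 2):
    print(line)
```

For a 2 by 2 plate with the default margin this prints four lines, starting with ".vp 5 72.5 52.5 95" for the upper left plot. Paste each line in place of the .vp line at the top of the corresponding piece of the merged FIGURE file.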


Drawing circular DNA

Many of you will at some point have to draw a plasmid or other circular DNA for a meeting or a paper. The GCG program PlasmidMap is the program to use for this task. The first thing you need when running this program is a "tick file", which is a text file with a specific format. Generate a tick file, or at least the beginnings of one, using the MapSort program with the /PLAsmid qualifier. This example is for pBluescript II KS(+)/LIC.

$ mapsort/infile=gb_sy:cvkslic/once/plasmid/default
$ extract/head=22 CVKSLIC.TICK
 (Circular) (Plasmid) MAPSORT of: Cvkslic  check: 3063  from: 1  to: 2979

LOCUS       CVKSLIC      2979 bp    DNA   circular  SYN       16-MAY-1995
DEFINITION  Ligation-independent cloning vector pBluescript II KS(+)/LIC,
            complete sequence.
ACCESSION   U25267
NID         g806875
KEYWORDS    beta-lactamase; ligation-independent cloning region; NarI . . . 

 Mismatch: 0  MinCuts = 1  MaxCuts: 1
 With 229 enzymes: * 

                         March  8, 1999 09:43

     Name     From       To   Strand  Color  FromSymbol  ToSymbol  Style ..

AccI           753      753      .    Green       .          .     Tick
AflIII        1171     1171      .    Green       .          .     Tick
AhdI          2064     2064      .     Blue       .          .     Tick
AloI           174      174      .    Black       .          .     Tick
AlwNI         1587     1587      .     Blue       .          .     Tick
ApaI           771      771      .     Blue       .          .     Tick

Next create a ranges file using the information present in the GenBank entry. This is usually done manually, with a text editor. Then create a file cvkslic.fil that has comments, a ".." separator, and then lists first the tick file and then the ranges file. Finally, render the plasmid with a PlasmidMap command.

$ create cvkslic.range
Any comments we want in the range file
     Name     From       To   Strand  Color  FromSymbol  ToSymbol  Style ..
f1Origin         3      459      .     Blue       [          ]     Range
ProT7          626      645      +     Red        >          >     Range
Ligate         708      738      .     Green      |          |     Range
ProT3          809      790      -     Red        >          >     Range
lacZ           956      835      -     Black      >          >     Range
beta-lactamase 2851    1991      -     Black      >          >     Range
^Z
$ create cvkslic.fil
any comments we want in the master file
..
cvkslic.tick
cvkslic.range
^Z
$ plasmidmap/infile=@cvkslic.fil/noboldranges

Notice that in this first example the arrows or other shapes drawn at the ends of short ranges are truncated by the label for the range. For instance, see ProT3. The /noboldranges qualifier instructs the program not to try to "bold" ranges, which GCG graphics can only do by drawing multiple lines, which tends to make the plot look blurry rather than bold. The various lines of text in the plot are read from the tick file.
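
Typing a ranges file by hand is error prone when the entry has many features. As a hedged illustration, this Python sketch writes rows in the same layout as the cvkslic.range file shown earlier. Two things are assumptions on my part: that the columns only need to be separated by white space, and that minus-strand features list the higher coordinate in the From column (which is how ProT3 and lacZ appear in that file). The function names are made up.

```python
HEADER = "     Name     From       To   Strand  Color  FromSymbol  ToSymbol  Style .."

def range_line(name, start, end, strand=".", color="Black", symbols=(">", ">")):
    """Format one row; start and end are the feature's low and high coordinates."""
    # Assumption drawn from the example file: minus-strand features put
    # the higher coordinate in the From column.
    first, last = (end, start) if strand == "-" else (start, end)
    return "%-14s %6d %8d %6s %9s %7s %10s     Range" % (
        name, first, last, strand, color, symbols[0], symbols[1])

def range_file(comment, features):
    """Build the whole file: a comment line, the header ending in '..', then rows."""
    rows = [range_line(*feature) for feature in features]
    return "\n".join([comment, HEADER] + rows) + "\n"

features = [
    ("f1Origin", 3, 459, ".", "Blue", ("[", "]")),
    ("ProT3", 790, 809, "-", "Red"),
]
print(range_file("Ranges for cvkslic", features), end="")
```

Check the result against the PlasmidMap documentation before relying on it, since the exact column rules are not spelled out in this lecture.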

Here is a more complicated example of the use of PlasmidMap from the GCG manual:

$ plasmidmap/infile=@gendocdata:pgamma2.fil -
  /init=gendocdata:pgamma2.init

The init file contains all of the settings which would normally have been entered on the command line.

$ extract/head=10 gendocdata:pgamma2.init
PLASMIDMAP command line initializing file.

This is the file we use to create Figure 2 of the PLASMIDMAP document

1/3/89 ..

! Switches:

/BOLDCircle          ! draws a thick circle
/BOLDMajorTicks      ! draws diamond-shaped major scale ticks
/BOLDRanges          ! draws thick ranges
/NOSORTRanges        ! sorts the ranges by size

The PlasmidMap program has many command line options that will let you configure the output just about any way you please. The one serious limitation the program has is that since GCG graphics cannot shade areas except by hatching with many drawn lines, it isn't possible to shade blocks satisfactorily. However, as described above, by directing the output through the appropriate driver you can move the graphic file to a Windows or Macintosh machine for final touch up.


Formatting aligned sequences

Aligned sequences may be formatted with any of several programs. The first option is the GCG Pretty program, which reads in an alignment from an MSF file, formats it, and writes the result to a text file.

$ pretty/infile=class:azurin.msf{*}/consens -
  /differences="-"/out=azurin.pretty/default
$ extract/head=20 azurin.pretty   
Plurality: 2.00  Threshold: 1.00  AveWeight 1.00  AveMatch 0.54  AvMisMatch -0.40

PRETTY of: Class:Azurin.Msf{*}   March  5, 1999 11:34  ..

                        1                                                   50
 Azurin.Msf{H81_Neigo}  n-aa------ n--------q -s-a--e--- t-----tq-- as----l-ia 
  Azurin.Msf{H8_Neime}  n-aa------ n--------q -s-a--e--- t-----tq-- as----l-ia 
Azurin.Msf{Azur_Alcde}  q-ea------ a----l--mv ------q--- h---v-k-a- va-------- 
Azurin.Msf{Azur_Alcfa}  a-d-s--g-- s------s-v ---t--e--- ------k--- aa----v-vs 
Azurin.Msf{Azur_Alcsp}  ----d-ag-- ----dk---t -s----q--- ----p-k-a- ---------- 
Azurin.Msf{Azur_Borbr}  ----d-agt- ----dk-a-e -s----q--- ------k--r ---------- 
Azurin.Msf{Azur_Pseae}  ----d-qg-- ------na-t ------q--- --s-p-n--- ---------s 
Azurin.Msf{Azur_Psede}  ----d-qg-- ----s-na-t ---a--t--- --s-p----- ---------- 
Azurin.Msf{Azur_Psefb}  --kt----t- --s----a-e ---a--t--- e-t-s----- ------l-is 
Azurin.Msf{Azur_Psepu}  --k-----t- --s------a ------t--- e-t-s----- ------l-is 
Azurin.Msf{Azur_Psefc}  --k-----t- --s-d--a-e ------t--- d---s-n--- ---------- 
Azurin.Msf{Azur_Psefd}  --k-d---t- --s------t ------t--- --t-s----- ---------s 
 Azurin.Msf{Azu1_Metj}  g---d--a-- a------n-d ------e--- ---------- ------l-i- 
 Azurin.Msf{Azu2_Metj}  s-et--t-g- t-t-s-rs-s -pa--ae--- --e-k-h--- tg-------a 
             Consensus  ECSVT-ESND QMQFNTK-I- VDKSCK-FTV NLKHTGSLPK NVMGHNWVLT 

Since the output of Pretty is just a text file, it may be moved to a Macintosh or Windows machine and further modified there with any word processor. In particular, graphics are commonly added above, below, or even over the aligned sequences.


PrettyPlot, an improved Pretty

PrettyPlot is the second option for formatting aligned sequences. It is a much enhanced version of Pretty. It sends its output through the GCG graphics system so that an assortment of formatting options, such as boxing or coloring, can be accomplished. One problem you will see with Pretty is that the names are not usually what you want: they look like "Azurin.Msf{H81_Neigo}". These may be reduced to just "H81_Neigo", which is typically the desired format, by using the EGCG program PrettyPlot with the qualifiers /TEXT/SHORT, which redirect the output to a text file and force the use of short sequence names.

$ prettyplot/infile=class:azurin.msf{*}/consens/differences="-" -
  /outfile=azurin.pretty/short/text/noplot/default
$ extract/head=20 azurin.pretty
PRETTYPLOT of: Class:Azurin.Msf{*}   March  5, 1999 12:34  ..
Plurality: 7.50  Threshold: 1.00  AveWeight 1.00  AveMatch 0.54  AvMisMatch -0.40


            1                                                   50
 H81_Neigo  n-aat----- n------d-q -s-a--e--- t---t-tq-- as----l-ia 
  H8_Neime  n-aat----- n------d-q -s-a--e--- t---t-tq-- as----l-ia 
Azur_Alcde  q-eat----- a----l-emv ------q--- h---v-k-a- va-------t 
Azur_Alcfa  a-d-s--g-- s------s-v ---t--e--- ----t-k--- aa----v-vs 
Azur_Alcsp  --s-d-ag-- ----dk-e-t -s----q--- ----p-k-a- ---------t 
Azur_Borbr  --s-d-agt- ----dk-a-e -s----q--- ----t-k--r ---------t 
Azur_Pseae  --s-d-qg-- ------na-t ------q--- --s-p-n--- ---------s 
Azur_Psede  --s-d-qg-- ----s-na-t ---a--t--- --s-p-s--- ---------t 
Azur_Psefb  --ktt---t- --s----a-e ---a--t--- e-t-s-s--- ------l-is 
Azur_Psepu  --k-t---t- --s----d-a ------t--- e-t-s-s--- ------l-is 
Azur_Psefc  --k-t---t- --s-d--a-e ------t--- d---s-n--- ---------t 
Azur_Psefd  --k-d---t- --s----e-t ------t--- --t-s-s--- ---------s 
 Azu1_Metj  g-s-d--a-- a------n-d ------e--- ----t-s--- ------l-it 
 Azu2_Metj  s-ett-t-g- t-t-s-rs-s -pa--ae--- --e-k-h--- tg-------a 
 Consensus  EC-V--ESND QMQFNTK-I- VDKSCK-FTV NLKH-G-LPK NVMGHNWVL- 

This is pretty much the same display as before, but PrettyPlot can do much more. In the next example the output is sent through the GCG graphics system, and similar regions are surrounded by boxes. You would rarely want to show differences and boxes in the same plot.

$ prettyplot/infile=class:azurin.msf{*}/consens/short

There are two ways to obtain a color plot of the same data. To color by composition, use the /DOCOLORS qualifier and then specify which sets of residues are to be green, red, blue, cyan, yellow, or violet, or accept the defaults. To color by similarity to the consensus sequence, use the /CCOLORS qualifier and then specify colors for residues which are identical to, similar to, or different from the consensus. The color of the consensus sequence itself may also be set.

$ prettyplot/infile=class:azurin.msf{*}/consens/short -
  /docolors/default

$ prettyplot/infile=class:azurin.msf{*}/consens/short -
  /ccolors/default

$ prettyplot/infile=class:azurin.msf{*}/consens/short -
  /ccolors/cconsens=black/cident=red/csimi=blue -
  /cothers=cyan/default


Shading alignments

Rather than drawing boxes around groups of letters, sometimes sequence alignments are shaded. It is easier to see the regions of similarity in a shaded figure than in either of the two preceding formats, especially from a distance, as in a slide show. Conversely, it is usually more difficult to read the characters in the shaded alignment. Black and white shaded figures do not photocopy well, and color shaded figures photocopy with even lower fidelity. Both are usually impossible to read accurately in a second or third generation copy. Plain or boxed alignments photocopy as well as regular text.

Use the EGCG program Prettybox to make a shaded figure. It is similar in many ways to the preceding two programs. The output from Prettybox does not go through the GCG graphics system. Instead it creates a Postscript output file directly. However, the name of the output file is still set through the GCG Postscript command. In the following example, the output is directed to a file. Later it would be sent to a printer, or through Ghostscript to convert it to some other format.

$ postscript laserwriter azurin.ps

Note that the only part of this that Prettybox uses is the final parameter - the file name or print command. The next command makes a shaded plot. The /seqname=partial forces it to use short sequence names, /orient=p sets the orientation to portrait, and /number=t puts the numbers on top of the columns, rather than on the sides.

$ prettybox/infile=class:azurin.msf{*} -
  /consens/seqname=partial/orient=p/begin=1/end=40 -
  /number=t/default

For all of the "pretty" line of programs, you can modify conditions needed for a consensus. The way a consensus is calculated is this:

  1. For each column each possible consensus character is tested (20 for amino acids, 4 for nucleic acids).
  2. The test character is compared with every character in that column by looking up the value in a comparison matrix determined by those two indices.
  3. If that value is greater than the value set by the /THReshold qualifier then that row counts as one vote for the test character.
  4. After all possible consensus characters are tested the one with the highest vote total is considered the consensus, so long as the total number of votes for it is at least as high as set by the /PLUrality qualifier.
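  5. The default value for plurality is different in each of these three programs and depends on N, the number of sequences in the alignment. For instance, in the runs shown above on the 14-sequence azurin alignment, Pretty reported a plurality of 2.00 and PrettyPlot reported 7.50.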
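
To make the voting in steps 1 through 4 concrete, here is a short Python sketch of the procedure. It is an illustration and not the GCG code: the comparison "matrix" is a toy scoring function that I made up (1.5 for identical residues, 0.5 for a handful of similar pairs, 0 otherwise), and ties between candidates are broken arbitrarily in favor of whichever is tested first.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy similarity groups, invented for this example only.
SIMILAR = {frozenset("IV"), frozenset("IL"), frozenset("LV"),
           frozenset("KR"), frozenset("DE")}

def toy_score(a, b):
    """Stand-in for a lookup in a real comparison matrix."""
    if a == b:
        return 1.5
    return 0.5 if frozenset((a, b)) in SIMILAR else 0.0

def consensus_column(column, threshold=1.0, plurality=2.0, score=toy_score):
    """Apply steps 1-4 to one alignment column; '-' means no consensus."""
    best_char, best_votes = "-", 0
    for candidate in AMINO_ACIDS:                       # step 1
        votes = sum(1 for residue in column             # steps 2 and 3
                    if score(candidate, residue) > threshold)
        if votes > best_votes:
            best_char, best_votes = candidate, votes
    return best_char if best_votes >= plurality else "-"   # step 4

def consensus(sequences, **options):
    """Consensus string for equal-length, upper-case aligned sequences."""
    return "".join(consensus_column(col, **options) for col in zip(*sequences))

aligned = ["ECSV", "ECAI", "QCSV"]
print(consensus(aligned))               # ECSV
print(consensus(aligned, plurality=3))  # -C--
```

Raising the plurality makes the consensus stricter, as the second call shows, while lowering the threshold lets similar residues count as votes.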

Reformatting alignments for input to a spreadsheet program

Some people like to use Excel or another spreadsheet to do column formatting, but in order to move an MSF file into Excel you must first convert it to a format that the program can read. To do this, use the locally written program Delimit, which can convert any text file to comma or tab delimited text. I don't suggest that you use TAB delimiting though, as it will be hard to verify with a TYPE command that the file was formatted as you wanted. The search command removes all of the header information from the MSF file, leaving only the aligned sequence.

$ search/remaining/out=azurin.txt class:azurin.msf "//"
$ delimit
azurin.txt          input file to delimit
azurin.comma        name of output file
                    comma delimits (default)
12                  delimit skips N characters per line
$ extract/head=10 azurin.comma
//

            ,1, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,5,0
 H81_Neigo  ,N,C,A,A,T,V,E,S,N,D, ,N,M,Q,F,N,T,K,D,I,Q, ,V,S,K,A,C,K,E,F,T,I, ,T,L,K,H,T,G,T,Q,P,K, ,A,S,M,G,H,N,L,V,I,A,
  H8_Neime  ,N,C,A,A,T,V,E,S,N,D, ,N,M,Q,F,N,T,K,D,I,Q, ,V,S,K,A,C,K,E,F,T,I, ,T,L,K,H,T,G,T,Q,P,K, ,A,S,M,G,H,N,L,V,I,A,
Azur_Alcde  ,Q,C,E,A,T,I,E,S,N,D, ,A,M,Q,Y,N,L,K,E,M,V, ,V,D,K,S,C,K,Q,F,T,V, ,H,L,K,H,V,G,K,M,A,K, ,V,A,M,G,H,N,W,V,L,T,
Azur_Alcfa  ,A,C,D,V,S,I,E,G,N,D, ,S,M,Q,F,N,T,K,S,I,V, ,V,D,K,T,C,K,E,F,T,I, ,N,L,K,H,T,G,K,L,P,K, ,A,A,M,G,H,N,V,V,V,S,
Azur_Alcsp  ,E,C,S,V,D,I,A,G,N,D, ,Q,M,Q,F,D,K,K,E,I,T, ,V,S,K,S,C,K,Q,F,T,V, ,N,L,K,H,P,G,K,L,A,K, ,N,V,M,G,H,N,W,V,L,T,
Azur_Borbr  ,E,C,S,V,D,I,A,G,T,D, ,Q,M,Q,F,D,K,K,A,I,E, ,V,S,K,S,C,K,Q,F,T,V, ,N,L,K,H,T,G,K,L,P,R, ,N,V,M,G,H,N,W,V,L,T,
Azur_Pseae  ,E,C,S,V,D,I,Q,G,N,D, ,Q,M,Q,F,N,T,N,A,I,T, ,V,D,K,S,C,K,Q,F,T,V, ,N,L,S,H,P,G,N,L,P,K, ,N,V,M,G,H,N,W,V,L,S,

At this point you would use ASCII ftp to move the file to your Macintosh or PC. If you go this route, you might want to try using /BLOCK=200 and /LINESIZE=200 or something like that, to remove all blocking columns, or spaces, and put each sequence into a single line of the output file.
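
If Delimit isn't available where you are working, its effect is easy to approximate. The Python sketch below mimics what the output above appears to show: the first N characters of each line are kept as they are, and a comma is placed in front of every character after that. This is inferred from the example output only, not from the Delimit source, and the function name is made up.

```python
def delimit_line(line, skip=12, delimiter=","):
    """Keep the first `skip` characters, then put the delimiter
    in front of each remaining character (spaces included)."""
    head, tail = line[:skip], line[skip:]
    return head + "".join(delimiter + char for char in tail)

print(delimit_line(" H81_Neigo  NCAAT VES"))
# ->  H81_Neigo  ,N,C,A,A,T, ,V,E,S
```

Note that the blank between blocks of ten residues becomes a cell of its own, which is why removing the blocking first gives a cleaner spreadsheet.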


Using Alscript to make illustrations

Alscript is a program that can reformat aligned sequences for publication. It is in many ways similar to PrettyPlot and Prettybox, but is a bit more general. With Alscript it is possible to add arbitrary comments and to do nearly arbitrary font and color replacements.

Alscript takes the instructions in an ALS file, and uses them to format a BLC file, producing a postscript file. Use the program MSF2BLC to convert a sequence alignment from MSF format to BLC format. As with Figure you will run through several cycles in which you modify the ALS file using an editor, use that to make a postscript file, and then print or preview the postscript file to see if you need to make further changes.

There are several example ALS files in the ALSCRIPT directory, along with their resultant postscript output files. The postscript manual for the program is also in this directory. Print it out, because you will need to refer to it. This manual is also available on the web, but access to the site tends to be a bit slow.

If you publish any figures that were constructed with Alscript you must cite: Barton, G. J. (1993), ALSCRIPT: A Tool to Format Multiple Sequence Alignments, Protein Engineering, Volume 6, No. 1, pp. 37-40. The following example is from the ALSCRIPT distribution. The ALS file in this example is 307 lines long, and so is not shown here.

$ msf2blc
myfile.msf
myfile.blc
$ copy alscript:ipns.ALS []
$ copy alscript:ipns.BLC []
$ alscript ipns.ALS
$ printg/noflag ipns.ps


Formatting single sequences for publication

For formatting single DNA or RNA sequences the tool you would most often start with is Publish or the EGCG variant of it, Epublish. The main difference between the two is that the latter supports command line control and the former doesn't. Both programs will show two aligned DNA sequences. However, Epublish will also show the translations from both and the matches between them, whereas Publish will show only the first set of translations. If the plot is to show translations you must know the exon positions before starting the program, since when it prompts for them you won't be able to go look them up, unless you open another session or spawn a subprocess. The output file produced by Publish is plain text, so you can easily merge the results from several runs of the program using only a text editor.

$ epublish/infile=gb_in:dmwhite/begin=11201/end=11300/symbols=50
cDefii

11228
11299
no
1

$ type dmwhite.publish

         GGGTCCAATTACCAATTTGAAACTCAGTTTGCGGCGTGGCCTATCCGGGC         
  11201  ---------+---------+---------+---------+---------+   11250
         CCCAGGTTAATGGTTAAACTTTGAGTCAAACGCCGCACCGGATAGGCCCG         
                                    PheAlaAlaTrpProIleArgAl         
                                                                    
         GAACTTTTGGCCGTGATGGGCAGTTCCGGTGCCGGAAAGACGACCCTGC        
  11251  ---------+---------+---------+---------+---------+   11300
         CTTGAAAACCGGCACTACCCGTCAAGGCCACGGCCTTTCTGCTGGGACG        
         aAsnPheTrpProEndTrpAlaValProValProGluArgArgProCys          

You may put the individual lines in any order and repeat them several times. For instance, to put the translation on top and to put more space between lines, and to lose the reverse sequence, use:

$ epublish/infile=gb_in:dmwhite/begin=11201/end=11300/symbols=50
fcDiiiii

11228
11299
no
1

$ type dmwhite.publish


                                    PheAlaAlaTrpProIleArgAl         
         GGGTCCAATTACCAATTTGAAACTCAGTTTGCGGCGTGGCCTATCCGGGC         
  11201  ---------+---------+---------+---------+---------+   11250
                                                                    
                                                                    
                                                                    
                                                                    
                                                                    
         aAsnPheTrpProEndTrpAlaValProValProGluArgArgProCys          
         GAACTTTTGGCCGTGATGGGCAGTTCCGGTGCCGGAAAGACGACCCTGC        
  11251  ---------+---------+---------+---------+---------+   11300
                                                                    
                                                                    
                                                                    
                                                                    
                                                                    


Next week we'll cover RNA folding. Are there any questions?