Fundamentals of Sequence Analysis, 1998-1999
Problem set 5:  Assembling sequences.

If you get stuck, refer to the OpenVMS and GCG resources in
the class home page. 

References:
 
 See the GCG and EGCG manuals.


Problem group 1.  Dealing with ABI sequences

Create a subdirectory and set your default directory to it. 
Issue the command: 

  $ copy class:*example*.* []

You will now see two files with ugly names.

1A.  How do you fix these names?
1B.  What might have caused these awful names?


Issue the command:   $ fix_abi_here

You should see this happen:
was: $ABI_EXAMPLE.M13F$5N$BIN
is:  ABI_EXAMPLE_M13F.ABI
was: $ABI_EXAMPLE.M13F$5NS$EQ$5NTXT
is:  ABI_EXAMPLE_M13F.SEQ

The "$" and "$5N" result from the encoding of case shifts in
the original Macintosh file name using the allowed
characters on OpenVMS.  You can avoid this problem by using
only single case names on the Macintosh, and avoiding spaces
and other characters illegal on OpenVMS. 


Configure your terminal for GCG graphics.  Issue these commands:

  $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200
  $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200/pnt=500
  $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200/pnt=FIT

1B.  How do the two plots differ?

The first one is narrower then the second.  The /pnt
qualifier can be used to expand a region of a plot, smaller
/pnt -> wider output.  The third plot in this case shows a
simpler way to expand (or compress) a region so that it will
just fit into one line.  Note that the program will not
under any circumstances put more than 200 bases on a single
line. 

Basically this question is just a ruse to get you to
actually try the abiprintout program. 


1C.  What two commands could be used to get the sequence
         into GCG format?
     What two commands could you use to extract a subsequence 
         (ie, trim off the cruddy sequence on the ends)?


Sequence file, trace file -> GCG sequence

  $ ABITOGCG ABI_EXAMPLE_M13F.SEQ output_in_gcg_format.seq

  $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI -
          /out=output_in_gcg_format.seq/noftr/nolnk -
          /begin=1/end=300

Trim off bad sequence on the ends (here, keep 20 <-> 250) :

  $ reformat/infile=output_in_gcg_format.seq/begin=20/end=250

  $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI -
          /out=output_in_gcg_format.seq/noftr/nolnk -
          /begin=20/end=250


Problem group 2.  Sequence assembly

(When you are done remember to delete the files created
during this exercise!) 

Create a sequencing project (assuming that bluescript was
the only vector used) and use gelenter to put into it these
files: 

   class:test*.seq
   class:bad*.seq
   class:rest*.seq

Assemble it.

2A.  What was wrong with bad0002.seq?
     What was wrong with bad0001.seq?


Bad0002 had some vector contamination, which you *should*
have removed with 

 $ gelmerge/excise/nomerge

That is, assuming you told GELSTART to watch out for
GB_SY:CVKSLIC (= bluescript). 

Bad0001 is a hybrid insert, but you cannot tell that from
the fragment assembly programs.  All you know is that
gelmerge put it in its own contig. 


2B.  Is there a difference in the quality of the TEST* and
     REST* sequences? 

Absolutely.  There are roughly 3 times as many mismatches in
the REST* sequences.  You can see this in the output from
GelStatus. 


2C.  Ignoring the BAD0001 contig, there are two contigs
     covering 1164 and 1569 bases (the pieces were
     shotgunned from a fragment of size 3000). Should you
     continue with shotgun sequencing for this insert? 

No.  At this point you have sequenced the gene to 4X length.
There are only 267 unsequenced bases left in the insert,
spread over 1, 2, or 3 gaps. The output of GelAnalyze tells
you that the probability that the gap(s) you observe are
real is 0.77.  Even if you did another 20 random sequences
(6X length) this probability would only drop to 0.67.  That
is, there would still be a 2/3 probability that you would
not close the last gap.  Since the remaining regions are so
small, and there is so little to gain by adding more random
sequences, you should start directed sequencing out from the
ends of the contigs. 

In this instance, you would be immediately rewarded, since
the true positions of the contigs (which you wouldn't
normally know) are: 

  2000 - 2022        gap  (22)
  2023 - 3565      contig
  3566 - 3631        gap  (66)
  3632 - 4773      contig
  4774 - 5000        gap  (227)

Note 1:  These numbers don't add up to 267 because the input
         fragments were corrupted with indels.

Note 2:  Here you see why you *always* want to do directed
         sequencing on the ends of a fragment.  Had you done
         that you would have known exactly how many gaps
         were left in this project (assuming that all gaps
         were real). 



2D.  Anything else you might want to do?

Besides using GelAssemble to fix up the sequences, you would
probably want to consider directed sequencing over the
regions which are represented by just a single sequence. 
For most purposes the minimum accuracy which you should
accept is achieved by ensuring that every region is
sequenced at least once in each direction.  The genome
sequencing sites typically sequence each site a total of
8-10 times.