Fundamentals of Sequence Analysis, 1995-1996
Problem set 5:  Assembling sequences.

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

References:
 
 See the GCG and EGCG manuals.


Problem group 1.  Dealing with ABI sequences

Create a subdirectory and set your default directory to it.  Issue the
command:

  $ copy class:*example*.* []

You will now see two files with ugly names.

1A.  How do you fix these names?


Issue the command:   $ fix_abi_here

You should see this happen:
was: $ABI_EXAMPLE.M13F$5N$BIN
is:  ABI_EXAMPLE_M13F.ABI
was: $ABI_EXAMPLE.M13F$5NS$EQ$5NTXT
is:  ABI_EXAMPLE_M13F.SEQ


Configure your terminal for GCG graphics.  Issue these commands:

  $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200
  $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200/pnt=500

1B.  How do the two plots differ?

The first one is narrower.  The /pnt qualifier can be used to expand a
region of a plot, smaller /pnt -> wider output.

Basically this question is just a ruse to get you to actually try the
abiprintout program.


1C.  What two commands could be used to get the sequence into GCG format?
     What two commands could you use to extract a subsequence (ie, trim off 
     the cruddy sequence on the ends)?


Sequence file, trace file -> GCG sequence

  $ ABITOGCG ABI_EXAMPLE_M13F.SEQ output_in_gcg_format.seq

  $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI/out=output_in_gcg_format.seq/noftr/nolnk -
          /begin=1/end=300

Trim off bad sequence on the ends (here, keep 20 <-> 250) :

  $ reformat/infile=output_in_gcg_format.seq/begin=20/end=250

  $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI/out=output_in_gcg_format.seq/noftr/nolnk -
          /begin=20/end=250


Problem group 2.  Sequence assembly

(When you are done remember to delete the files created during this exercise!)

Create a sequencing project (assuming that bluescript was the only vector 
used) and use gelenter to put into it these files:

   class:test*.seq
   class:bad*.seq
   class:rest*.seq

Assemble it.

2A.  What was wrong with bad0002.seq?
     What was wrong with bad0001.seq?


Bad0002 had some vector contamination, which you *should* have removed with

 $ gelmerge/excise/nomerge

That is, assuming you told GELSTART to watch out for GB_SY:CVKSLIC  
(= bluescript).

Bad0001 is a hybrid insert, but you cannot tell that from the fragment
assembly programs.  All you know is that gelmerge put it in its own contig.


2B.  Is there a difference in the quality of the TEST* and REST* sequences?

Absolutely.  There are roughly 3 times as many mismatches in the REST* 
sequences.  You can see this in the output from GelStatus.


2C.  Ignoring the BAD0001 contig, there are two contigs covering 1164 and 
     1569 bases (the pieces were shotgunned from a fragment of size 3000).
     Should you continue with shotgun sequencing for this insert?

No.  At this point you have sequenced the gene to 4X length.  There are
only 267 unsequenced bases left in the insert, spread over 1, 2, or 3 gaps.
The output of GelAnalyze tells you that the probability that the gap(s)
you observe are real is 0.77.  Even if you did another 20 random sequences
(6X length) this probability would only drop to 0.67.  That is, there would
still be a 2/3 probability that you would not close the last gap.  Since
the remaining regions are so small, and there is so little to gain by
adding more random sequences, you should start directed sequencing out from
the ends of the contigs. 

In this instance, you would be immediately rewarded, since the true 
positions of the contigs (which you wouldn't normally know) are:

  2000 - 2022        gap  (22)
  2023 - 3565      contig
  3566 - 3631        gap  (66)
  3632 - 4773      contig
  4774 - 5000        gap  (227)

Note 1:  These numbers don't add up to 267 because the input fragments
         were corrupted with indels.

Note 2:  Here you see why you *always* want to do directed sequencing on
         the ends of a fragment.  Had you done that you would have known
         exactly how many gaps were left in this project (assuming that all
         gaps were real). 



2D.  Anything else you might want to do?

Besides using GelAssemble to fix up the sequences, you would probably want
to consider directed sequencing over the regions which are represented by 
just a single sequence.