Fundamentals of Sequence Analysis, 1998-1999 Problem set 5: Assembling sequences. If you get stuck, refer to the OpenVMS and GCG resources in the class home page. References: See the GCG and EGCG manuals. Problem group 1. Dealing with ABI sequences Create a subdirectory and set your default directory to it. Issue the command: $ copy class:*example*.*  You will now see two files with ugly names. 1A. How do you fix these names? 1B. What might have caused these awful names? Issue the command: $ fix_abi_here You should see this happen: was: $ABI_EXAMPLE.M13F$5N$BIN is: ABI_EXAMPLE_M13F.ABI was: $ABI_EXAMPLE.M13F$5NS$EQ$5NTXT is: ABI_EXAMPLE_M13F.SEQ The "$" and "$5N" result from the encoding of case shifts in the original Macintosh file name using the allowed characters on OpenVMS. You can avoid this problem by using only single case names on the Macintosh, and avoiding spaces and other characters illegal on OpenVMS. Configure your terminal for GCG graphics. Issue these commands: $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200 $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200/pnt=500 $ abiprintout/infile=ABI_EXAMPLE_M13F.ABI;1/begin=180/end=200/pnt=FIT 1B. How do the two plots differ? The first one is narrower then the second. The /pnt qualifier can be used to expand a region of a plot, smaller /pnt -> wider output. The third plot in this case shows a simpler way to expand (or compress) a region so that it will just fit into one line. Note that the program will not under any circumstances put more than 200 bases on a single line. Basically this question is just a ruse to get you to actually try the abiprintout program. 1C. What two commands could be used to get the sequence into GCG format? What two commands could you use to extract a subsequence (ie, trim off the cruddy sequence on the ends)? Sequence file, trace file -> GCG sequence $ ABITOGCG ABI_EXAMPLE_M13F.SEQ output_in_gcg_format.seq $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI - /out=output_in_gcg_format.seq/noftr/nolnk - /begin=1/end=300 Trim off bad sequence on the ends (here, keep 20 <-> 250) : $ reformat/infile=output_in_gcg_format.seq/begin=20/end=250 $ ABI2FTR/infile=ABI_EXAMPLE_M13F.ABI - /out=output_in_gcg_format.seq/noftr/nolnk - /begin=20/end=250 Problem group 2. Sequence assembly (When you are done remember to delete the files created during this exercise!) Create a sequencing project (assuming that bluescript was the only vector used) and use gelenter to put into it these files: class:test*.seq class:bad*.seq class:rest*.seq Assemble it. 2A. What was wrong with bad0002.seq? What was wrong with bad0001.seq? Bad0002 had some vector contamination, which you *should* have removed with $ gelmerge/excise/nomerge That is, assuming you told GELSTART to watch out for GB_SY:CVKSLIC (= bluescript). Bad0001 is a hybrid insert, but you cannot tell that from the fragment assembly programs. All you know is that gelmerge put it in its own contig. 2B. Is there a difference in the quality of the TEST* and REST* sequences? Absolutely. There are roughly 3 times as many mismatches in the REST* sequences. You can see this in the output from GelStatus. 2C. Ignoring the BAD0001 contig, there are two contigs covering 1164 and 1569 bases (the pieces were shotgunned from a fragment of size 3000). Should you continue with shotgun sequencing for this insert? No. At this point you have sequenced the gene to 4X length. There are only 267 unsequenced bases left in the insert, spread over 1, 2, or 3 gaps. The output of GelAnalyze tells you that the probability that the gap(s) you observe are real is 0.77. Even if you did another 20 random sequences (6X length) this probability would only drop to 0.67. That is, there would still be a 2/3 probability that you would not close the last gap. Since the remaining regions are so small, and there is so little to gain by adding more random sequences, you should start directed sequencing out from the ends of the contigs. In this instance, you would be immediately rewarded, since the true positions of the contigs (which you wouldn't normally know) are: 2000 - 2022 gap (22) 2023 - 3565 contig 3566 - 3631 gap (66) 3632 - 4773 contig 4774 - 5000 gap (227) Note 1: These numbers don't add up to 267 because the input fragments were corrupted with indels. Note 2: Here you see why you *always* want to do directed sequencing on the ends of a fragment. Had you done that you would have known exactly how many gaps were left in this project (assuming that all gaps were real). 2D. Anything else you might want to do? Besides using GelAssemble to fix up the sequences, you would probably want to consider directed sequencing over the regions which are represented by just a single sequence. For most purposes the minimum accuracy which you should accept is achieved by ensuring that every region is sequenced at least once in each direction. The genome sequencing sites typically sequence each site a total of 8-10 times.