Fundamentals of Sequence Analysis, 1998-1999

Lecture 1. Computing Basics

Class organization

This class is offered by the Sequence Analysis Facility (SAF). The SAF provides computational tools for analyzing nucleic acid and protein sequences, and support for users of these tools. The computers, printers, and my office are located across the hall in 158 Braun.

The purpose of this class is to provide enough training so that those who have completed it will be able to use and understand the tools the SAF has to offer.

This class is informal - it has no units and there will be no grades.

Homework will be assigned but not collected. Answers to each homework will be provided the following week.

Please interrupt me during the lecture if I say something that is confusing or unclear.

All class material will be provided through our Web server, located at http://seqaxp.bio.caltech.edu/. The FAQ and OVERVIEW sheets in the DOCUMENTATION sections will get you going faster than anything else; they contain the important points distilled from other manuals. You probably don't want to print these documents, since they are tied together with hypertext links that are lost on paper.

The last two lectures are special topics - RNA folding and Web based tools. If there are any special topics you want covered instead, please let me know.

Computing basics

I'm going to start the instruction part of this class by reviewing some of the essentials that you'll need to use the SAF, and of course, to do the homework and follow this class. This may be review for some of you, but you might pick up some shortcuts that you weren't aware of.

The SAF has an assortment of computers, but the primary one for sequence analysis work is Seqaxp, a Digital 2100 4/200 server, running the OpenVMS operating system. We use this combination because it is fast, reliable and relatively easy to use.

To use SEQAXP you will need an account, which you may access via telnet by providing your Username and Password. You can submit a request for an account by filling out the form on our web site, at URL: http://seqaxp.bio.caltech.edu/www/mail_account.html

We have to charge for facility usage to recover costs. The current rate for connect time is $3.00/hour, or 5 cents/minute. There is an idle job killer that removes unused sessions, so you won't be charged much should you forget to log out. There is no charge for accessing our Web pages.

Getting there

You will need a terminal emulator capable of at least VT100 emulation and some form of graphics (Tektronix or Regis). (Warning - freeware like NCSA telnet and MacIP tends to be a lot less functional than commercial programs like Versaterm pro.)

Suggested freeware:

telnet, ssh, or ftp (tcp/ip transport): connect to seqaxp.bio.caltech.edu
cterm or LAT: connect to SEQAXP

Command basics

OpenVMS is a command-based operating system. That means that you have to TELL it what to do by typing commands at a terminal or terminal emulator. It is also a multitasking, multiuser operating system, and some of the commands reflect this. All of these things make using it a bit different from using a Macintosh or Windows machine. Practically speaking, it is easier to do something once using the Graphical User Interface you'd find on the Web, Mac, or Windows, but it is much easier to automate repetitious tasks using command lines. It's also much easier to write a command line program than a GUI program, and if you dig down underneath the various web sites and email servers, what you'll usually find at the bottom is a command line program.

Some programs make use of the keypad area on the keyboard. They have defined commands for each button press there. For instance, the editors do this, as does the MAIL utility. For these to work, your terminal emulator must be configured correctly. In particular, it should emulate a VT100 or VT200 series terminal and should communicate this to Seqaxp when you connect. Use:

$ SHOW TERMINAL

to see what Seqaxp thinks your emulator is, and make sure it agrees with the emulator settings. The "delete previous character" symbol on OpenVMS is "del", not "backspace". Be sure your terminal emulator sends the former, or every time you try to delete a character you will instead move the cursor to the beginning of the line.

If there is one golden rule for using OpenVMS, it is "when in doubt, type HELP".

$ HELP

This will list the various help categories, and at the bottom, the available help libraries. Move through this information by entering words at the prompts, or by specifying the full path at the command line.
For instance:

$ HELP HINTS
HINTS

   Type the name of one of the categories listed below to obtain a list
   of related commands and topics.  To obtain detailed information on a
   topic, press the RETURN key until you reach the "Topic?" prompt and then
   type the name of the topic.

   Topics that appear in all uppercase are DCL commands.


  Additional information available:

  Batch_and_print_jobs  Command_procedures    Contacting_people
  Creating_processes    Developing_programs   Executing_programs
  Files_and_directories Logical_names         Operators_in_expressions
  Physical_devices      Security              System_management
  Terminal_environment  User_environment

HINTS Subtopic?
^Z
$ HELP @LYNX LYNX
LYNX

   NAME
   lynx - a general purpose distributed information browser for the World
   Wide Web



  Additional information available:

  SYNOPSIS   DESCRIPTION           OPTIONS    COMMANDS   NOTES
  ACKNOWLEDGMENTS       AUTHORS

@LYNX LYNX Subtopic?

The golden rule for using the SAF is: check the SAF "Software Documentation" web pages.

There you will find Frequently Asked Question documents for both OpenVMS and GCG usage, and many of the HELP files have been indexed, so that you can search them by keyword.

This is what an OpenVMS command looks like:

$ verb/qualifier parameter/qualifier

The VERB tends to be the English word you'd expect for a particular operation, like COPY or SEARCH. Commands are not case sensitive; you can type them in upper or lower case. However, parts of parameters and qualifiers can be case sensitive. When they are, enclose the case-sensitive part in double quotes.

$ search/exact login.com "type"

It's a good idea to remember what the parts of the command line are called because the error messages use these terms. (These are all bad commands:)

$ foobar          bad verb
$ dir/foobar      bad qualifier
$ dir ^foobar     bad parameter
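
To make the terminology concrete, here is a minimal sketch in Python (an illustration only, not anything that runs on Seqaxp) that splits a simple command line into its verb, qualifiers, and parameters. The function name split_command is hypothetical, and the sketch deliberately ignores quoted strings and the other subtleties of real DCL parsing.

```python
def split_command(line):
    """Split a simple DCL-style command into (verb, verb_qualifiers, parameters).

    Simplified illustration only: quoted strings, qualifier values containing
    spaces or slashes, and other real DCL parsing rules are not handled.
    """
    text = line.strip()
    if text.startswith("$"):        # drop the prompt if it was copied along
        text = text[1:].strip()
    tokens = text.split()
    if not tokens:
        raise ValueError("empty command")

    # The first token is the verb, optionally followed by /qualifiers.
    verb, *verb_qualifiers = tokens[0].split("/")

    # Each remaining token is a parameter, optionally with its own /qualifiers.
    parameters = []
    for token in tokens[1:]:
        name, *qualifiers = token.split("/")
        parameters.append((name, [q.upper() for q in qualifiers]))

    return verb.upper(), [q.upper() for q in verb_qualifiers], parameters


print(split_command("$ dir/date *d*.*"))
```

The verb and qualifiers are upper-cased because commands are not case sensitive; parameters are left as typed.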

Here is another command example which shows the names and dates of any files having a D in the name:

$ dir/date *d*.*
CONFIDENCE.;1       22-APR-1998 09:12:10.24
DEC.KBD;3           24-AUG-1995 16:20:33.46

"*" is a wildcard that matches any run of characters; "%" matches exactly one character. Commands can be recalled and edited; use the arrow keys to do that. Use RECALL (only the first 4 letters are required) to recall commands by name, for example:

$ reca d
$ reca/all
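
The wildcard rules described above can be sketched in Python by translating a pattern into a regular expression. This is only an illustration of the matching idea, not how OpenVMS implements it; the function names are hypothetical, LOGIN.COM is included just as a non-matching example, and the sketch ignores version numbers and the defaults DIRECTORY applies to missing fields.

```python
import re


def vms_wildcard_to_regex(pattern):
    """Translate a pattern using * and % into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")          # any run of characters, including none
        elif ch == "%":
            parts.append(".")           # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    # Matching is done without regard to case in this sketch.
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, filename):
    """Return True if the whole filename matches the wildcard pattern."""
    return vms_wildcard_to_regex(pattern).fullmatch(filename) is not None


names = ["CONFIDENCE.", "DEC.KBD", "LOGIN.COM"]
print([n for n in names if matches("*d*.*", n)])
```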

To control your process or terminal use these control keys. In each case, hold down control (shown here as a caret) and the key. The most commonly used control keys are:

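^Y   interrupt the current command or program and return to the $ prompt
^C   cancel the current operation (acts like ^Y in programs that do not handle it)
^A   toggle between insert and overstrike mode on the command line
^T   show a one-line status report for your process
^U   delete from the cursor back to the beginning of the line
^H   (backspace) move the cursor to the beginning of the line
^E   move the cursor to the end of the line
^S   suspend terminal output
^Q   resume terminal output suspended with ^S
^O   toggle discarding of terminal output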


Common commands

HELP LOGOUT COPY CREATE DELETE DIRECTORY EDIT PIPE PURGE PRINT RENAME SEARCH
SET [DEFAULT, PROCESS/PRIO=3]
SHOW [DEFAULT, QUOTA, ENTRY]
SUBMIT TYPE


File information

Files have the general form:

node::disk:[directory.subdir]name.extension;version_number

Everything defaults if not specified. We only have one node, and usually your files will all be on your login disk, so you can usually get by with one of these forms:

[.subdirectory]name.extension
name.extension
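
As an illustration of how the pieces of the general form fit together, here is a small Python sketch that pulls a file specification apart with a regular expression. The field names follow the form shown above; the function name parse_file_spec is hypothetical, and the pattern is a simplification that does not cover every specification OpenVMS accepts. Fields that are left out come back as None, which is where OpenVMS would apply its defaults.

```python
import re

# Every field is optional, mirroring "everything defaults if not specified".
FILE_SPEC = re.compile(
    r"""
    ^(?:(?P<node>[^:\[\];]+)::)?          # node::
    (?:(?P<disk>[^:\[\];]+):)?            # disk:
    (?:\[(?P<directory>[^\]]*)\])?        # [directory.subdir]
    (?P<name>[^.;\[\]:]*)                 # name
    (?:\.(?P<extension>[^.;]*))?          # .extension
    (?:;(?P<version>-?\d+))?$             # ;version_number
    """,
    re.VERBOSE,
)


def parse_file_spec(spec):
    """Return a dict of the fields in spec; missing fields are None."""
    match = FILE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a recognizable file specification: {spec!r}")
    return match.groupdict()


print(parse_file_spec("[.subdirectory]name.extension"))
```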

It is best to organize your files in directories. On OpenVMS you have a "default directory", which is "where you are":

$ show default
$ set default [.subdir]  move into the subdirectory
$ set default [-]        move up from a subdirectory
$ set default SYS$LOGIN  move to the login directory

File protections

$ dir/prot/owner   
                          show file ownership and protection
$ set file/prot=(s:rwed,o:rwed,g:re,w) filename
                          set the protection on this file

Do the homework if you really want to understand these.
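
If the protection string looks cryptic, this Python sketch spells one out. It relies on the standard meanings of the letters: the categories are System, Owner, Group, and World, and the access types are Read, Write, Execute, and Delete. The function name parse_protection is hypothetical, and only the one-letter abbreviations used in the example above are handled.

```python
CATEGORIES = {"s": "system", "o": "owner", "g": "group", "w": "world"}
ACCESS = {"r": "read", "w": "write", "e": "execute", "d": "delete"}


def parse_protection(text):
    """Turn '(s:rwed,o:rwed,g:re,w)' into {'system': ['read', ...], ...}."""
    result = {}
    for item in text.strip().strip("()").split(","):
        category, _, letters = item.strip().lower().partition(":")
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        # A category listed with no letters is granted no access at all.
        result[CATEGORIES[category]] = [ACCESS[ch] for ch in letters]
    return result


for who, rights in parse_protection("(s:rwed,o:rwed,g:re,w)").items():
    print(f"{who:7s} {', '.join(rights) or 'no access'}")
```

In the example, the bare "w" at the end means that World (everyone else on the system) gets no access to the file.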


Configuring your process

When you log in, a procedure called "LOGIN.COM" is run automatically if it is present in your home directory. You can use this to customize your environment and to define various commands and other information. This is described in some detail in the homework and the OpenVMS beginner's FAQ.


Data transfer

Use ASCII mode FTP for sequence files. If they originate on other systems make sure that they have been formatted as a series of short lines, less than 132 characters each. If the file contains only sequence (specifically, no comments), you can use the CHOPUP command, then the REFORMAT command to convert the result of just about any transfer mode into a valid GCG format file.
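
If you would rather fix line lengths on the originating system before transfer, the idea can be sketched in Python as below. This is not the CHOPUP program, only an illustration of the same idea; the function names and the 60-character default width are my own assumptions, and wrap_sequence is only safe on files that contain bare sequence with no comments.

```python
MAX_LINE = 131   # lines must be shorter than 132 characters


def long_lines(text):
    """Return the 1-based numbers of the lines that are too long."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if len(line) > MAX_LINE]


def wrap_sequence(sequence, width=60):
    """Break a bare sequence (no comments) into lines of at most width characters."""
    residues = "".join(sequence.split())   # drop existing whitespace and line breaks
    return "\n".join(residues[i:i + width]
                     for i in range(0, len(residues), width))


raw = "ACGT" * 100                 # one 400-character line
print(long_lines(raw))             # the single line is too long
wrapped = wrap_sequence(raw)
print(long_lines(wrapped))         # no long lines remain after wrapping
```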

Use BINARY mode FTP transfer for a few things like ABI sequencer traces or CGM graphics files.

Make sure names are consistent with OpenVMS usage (not more than one period and one semicolon, best if one case). Your PC or Mac FTP program (FETCH on the Mac, for example) may let you enter any name, but the resulting name on the OpenVMS side will be horrific, full of dollar signs and 5n's and so forth.
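
A quick way to check names before you transfer them is sketched below in Python. It tests only the three rules mentioned above (periods, semicolons, and case); the function name transfer_name_problems is hypothetical, and OpenVMS has further naming restrictions that this sketch does not check.

```python
def transfer_name_problems(name):
    """Return a list of reasons this name may turn out badly on OpenVMS."""
    problems = []
    if name.count(".") > 1:
        problems.append("more than one period")
    if name.count(";") > 1:
        problems.append("more than one semicolon")
    # "Best if one case": flag names that mix upper and lower case letters.
    if name != name.upper() and name != name.lower():
        problems.append("mixed case")
    return problems


for candidate in ["clone1.seq", "Clone1.seq", "my.seq.txt"]:
    print(candidate, transfer_name_problems(candidate) or "looks fine")
```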


GCG basics

GCG stands for Genetics Computer Group, a small company (http://www.gcg.com/) that branched off from the University of Wisconsin and was recently purchased by Oxford Molecular. The GCG package is arguably the best of the commercial molecular biology software packages in terms of completeness, cost and support. Up through version 8.1 they also provided full source code so that local debugging and modification of their programs was possible, but at 9.0 they changed the terms, so we elected to stay at version 8.1.

EGCG is a set of programs that were written by an assortment of people, primarily in Europe. These programs are built on top of the GCG code. When GCG changed the license terms at 9.0 it made it impossible to upgrade the EGCG set, so that too is stuck at 8.1.

$ GENHELP      help on GCG programs by program name
$ GENMANUAL    help on GCG programs by topic
$ EGENHELP     help on EGCG programs by program name
$ EGENMANUAL   help on EGCG programs by topic

These are best accessed from the SAF Webserver Software Documentation page, which has links to these, as well as indices for each.


GCG graphics

Use SETPLOT or the specific graphic command, usually one of TEKTRONIX/REGIS/POSTSCRIPT/XWINDOWS/CGM, to configure graphics BEFORE you use them. Confirm the settings with SHOWPLOT and test them with PLOTTEST.

$ setplot
+--------------------->  displaying all of 12 option(s)  <---------------------+
|ColorX     Color X Windows Graphics Window                                    |
|Versaterm  Tektronix 4105 mode on Versaterm                                   |
|Tek4014    Tektronix 4014, for NCSA Telnet                                    |
|PCSmartTermRegis VT340 mode for PC SmartTerm 340                              |
|DECTermRegisRegis VT340 mode for DECTerm                                      |
|PStoFILE   Print postscript -> file                                           |
|PStoLaser  Print postscript -> local laserwriter (no flag page)               |
|PStoMAIN   Print postscript - > Braun 158 printer (flag page)                 |
|PS2toLaser Print postscript at 2/page -> local laserwriter(no flag page)      |
|PS2toMain  Print postscript at 2/page -> Braun 158 printer (flag page)        |
|COLORPS    Print color postscript -> Braun 158 (noflag page, _NOT_FREE_)      |
|CGM        Print through the CGM driver to a file CGM.OUT                     |
+------------------------------------------------------------------------------+
Enter a command. Choices are:
         
<up-arrow> and <down-arrow> scroll the list
<return> makes GCG use the selected device
Q quits without doing anything
C creates and edits a new device
(you can't delete from the site file)
         
V views the selection (use C to edit a copy)

Or skip the menu and set the device directly:

$ TEKTRONIX VERSATERM-TEK4105 TT

$ showplot
 Plotting Configuration set to:

       Language: TEKTRONIX
         Device: VERSATERM-TEK4105
  Port or Queue: TERM:

$ plottest
PlotTest plots a test pattern to see if your plotter is configured
properly.  The test pattern uses every GCG graphics feature.  It should
resemble the example test pattern in the PROGRAM MANUAL.

  Process set to plot with VERSATERM-TEK4105 attached to TERM:
  using the TEKTRONIX graphic interface.

 When your VERSATERM-TEK4105 attached to _Seqaxp$Nty1166: is ready, press <Return>.

GCG commands look like any other OpenVMS command. Most of them take only qualifiers and no parameters, except that a single input file can usually be passed either as a parameter or as a qualifier:

$ reformat filename
$ reformat/infile=filename

GCG and EGCG options are generally entered on the command line as qualifiers. Mandatory ones will be prompted for from within the program if they are not present on the command line. Optional ones will not be, that is, if you want to use an optional qualifier for a GCG program, it MUST go on the command line.

$ reformat                it will prompt for a filename

The /CHECK qualifier may be used in any GCG or EGCG program to get a quick list of command line options:

$ reformat/check

Many local modifications are not documented in GENHELP etc., and are only evident if you do a /CHECK on the command!!! For instance, the /begin and /end qualifiers on REFORMAT are only present at our site.

The /DEFAULT qualifier may be used to force GCG or EGCG programs to supply default values for mandatory qualifiers. For instance, it forces /begin/end to be the start and end of the sequence.

The locations of some important files are pointed to by logical names, which are a sort of shortcut for pointing to directories. For instance, the logical names for the databases are GB, PIR, SW, NRL_3D, and EPD, and entries can be referenced like this:

$ fetch gb:X02974      accession_number
$ fetch gb:dmwhite     name

Some of the matrices and other accessory data are in GENRUNDATA, GENMOREDATA, etc, all of which are subsumed under GCGDATADIRS. So to find all comparison matrices, for instance:

$ dir GCGDATADIRS:*.cmp

There are a bunch of "gotchas" when using GCG, mostly having to do with syntax. Rather than repeat them here, have a look at the GCG beginner's FAQ (http://seqaxp.bio.caltech.edu/www/GCG_BEGINNERS_FAQ.HTML), see especially the section "Confusing parts of the GCG system".


Non-GCG basics

The SAF also has dozens of other programs, many from Unix, each with its own type of interface. In general you have to read their documentation to use them properly. In addition, there are DCL scripts wrapped around many programs, so that you don't actually ever see the "real" program. For instance, when you run BLAST on seqaxp, the prompts you see all come from such a script.

Next week we'll cover sequence alignment.


Pico demonstration

Pico is the text editor which comes with the Pine mail program. It is very easy to use, but not particularly powerful. Start it like this:

$ pico killme.txt
UW PICO(tm) 2.5                 File: killme.txt




                             [ New file ]
^G Get Help  ^O WriteOut  ^R Read File ^Y Prev Pg  ^K Cut Text  ^C Cur Pos
^X Exit      ^J Justify   ^W Where is  ^V Next Pg  ^U UnCut Text^T To Spell

Then follow the directions on the screen to learn how to use the commands, which are mostly just control key combinations. For instance, press ^G to read the help file.


EDT demonstration

There are many editors on OpenVMS. Of these, EDT is somewhat easier to learn and use than TPU, which is actually the default editor. Here is how to use EDT:

$ edit/edt
[EOB]




Input file does not exist

The trick is to know how to get to the keypad help, which is accomplished by pushing the second key from the left on the top row of the numeric keypad. On a PC, it is labeled "/", on a Macintosh "=", and on a Digital keyboard "PF2". Doing so brings up this screen.


+-----------------------------------+     +-----------------------------------+
|   ^    |  DOWN  |       |        |      |        |        | FNDNXT | DEL L  |
|   |    |   |   | <----  |  ----> |      | GOLD   |  HELP  |        |        |
|   |    |   |   |  LEFT  |  RIGHT |      |        |        |  FIND  | UND L  |
|   UP   |   v   |        |        |      +-----------------------------------+
+-----------------------------------+     |  PAGE  |  SECT  | APPEND | DEL W  |
DELETE      Delete character              |        |        |        |        |
LINEFEED    Delete to beginning of word   | COMMAND|  FILL  | REPLACE| UND W  |
BACKSPACE   Backup to beginning of line   +-----------------------------------+
CTRL/A      Compute tab level             | ADVANCE| BACKUP |  CUT   | DEL C  |
CTRL/D      Decrease tab level            |        |        |        |        |
CTRL/E      Increase tab level            | BOTTOM |  TOP   | PASTE  | UND C  |
CTRL/K      Define key                    +-----------------------------------+
CTRL/R      Refresh screen                |  WORD  |  EOL   |  CHAR  |        |
CTRL/T      Adjust tabs                   |        |        |        | ENTER  |
CTRL/U      Delete to beginning of line   |CHNGCASE| DEL EOL| SPECINS|        |
CTRL/W      Refresh screen                +--------------------------|        |
CTRL/Z      Exit to line mode             |      LINE       | SELECT |        |
                                          |                 |        |  SUBS  |
Press a key for help on that key.         |    OPEN LINE    | RESET  |        |
To exit, press the spacebar.              +-----------------------------------+

If at this point you touch one of the keypad keys, more help will be shown. For instance, touch the key shown as DEL L/UND L to see:


DEL L - (PF4)

Deletes text from the cursor position to the end of the current line, including
the line terminator.  If the cursor is positioned at the beginning of a line,
the entire line is deleted.  The deleted text is saved in the delete line
buffer.

UND L - (GOLD PF4)

Inserts the contents of the delete line buffer directly to the left of the
cursor.

To return to the keypad diagram, press the return key
To exit from HELP, press the spacebar
For help on any other keypad key, press the key

Most keyboards currently on the market fuse the DEL W/DEL C keys, and on these keyboards, pressing that fused key usually results in a DEL C action.

To get out of EDT, press ^Z to get to line mode. There are a variety of commands available in line mode, but most of you will not use most of them. If you are interested, enter HELP to find out more. In any case, you will need to know exactly two line mode commands, which are

EXIT         leave editor, save changes
QUIT         leave editor, do not save changes