Contents

1 Preface

2 Introduction

3 Sequence input, editing and sequence library use

4 Managing sequencing projects

5 Analysing sequences to find genes

6 Searching for motifs in nucleic acid sequences

7 Using patterns to analyse nucleic acid sequences

8 Searching for restriction sites

9 Statistical and structural analysis of nucleotide sequences

10 Translating and listing nucleic acid sequences

11 Statistical and structural analysis of protein sequences

12 Searching for motifs in protein sequences

13 Using patterns to analyse protein sequences

14 Comparing sequences

1. Preface

This latest update contains changes to the way we access the sequence libraries and describes how we now provide facilities to browse through and search the PROSITE library. I believe that the way we handle the sequence libraries is one of the strengths of the package and yet I don't believe many groups take advantage of it. Most other packages provide "keyword" type searches by actually scanning through the contents of the sequence libraries which naturally takes a considerable time. By using precomputed indexes for text, authors and species we give instantaneous searches. In addition because we leave each library in the format in which it arrives on the distribution tape, cdrom, etc we avoid reformatting and save a great deal of disk space and processing time.

The new release of the manual describes improvements to our routines for using these indexes. We have added a species (taxon) search and have added OR and NOT operators.

A powerful part of the package - its ability to search for patterns in nucleic acid and protein sequences - includes the facility to use the PROSITE library. In previous editions of the manual this was poorly described. We have now improved the documentation, but more importantly have added a new way of accessing the PROSITE data. The latest EMBL CDROM includes text indexes for PROSITE and this has allowed us to provide searching and browsing routines for this valuable resource.

Rodger Staden, January 1994.

1.1 Preface to third edition (November, 1993)

As with the previous update of the manual the changes to this, the third edition, are mostly concerned with additions to the sequence assembly programs. The following new features have been added, some in response to dealing with Alu-rich human DNA.

We now include a routine to automatically find and fill single stranded regions using previously hidden poor quality data. Another routine finds single stranded regions and selects oligos and templates for custom primer experiments that will help fill the "holes" and extend contigs. Both operations were previously possible from the contig editor, but these are fully automated and hence save time. We have added a routine to find editing errors at the end of a sequencing project by checking that the original readings contain evidence for every pair of adjacent bases in the final consensus. The "find internal joins" function has been improved so that all joins, including those where small contigs are wholly contained in large ones, will be found. The consensus calculation has been changed to allow any level of majority to prevail. We have added a repeat search to the assembly program which can find and tag all direct or inverted repeats. One test for the correct assembly of repetitive DNA is to check if the full length of each reading (including the poor quality hidden data) aligns well. Two functions provide for this. One is a sub-option of the assembly routine which is used to screen readings prior to entry into the database, and the other checks assembled data and displays any suspect alignments. A further test for correct assembly, which can also suggest primer walking experiments, is to examine forward and reverse readings from the same templates. Several ways of doing this, including a graphical display, have been added. A natural counterpart to such screenings is to be able to disassemble readings, and a comprehensive option for this purpose has been provided. One strategy for assembling Alu-rich DNA is to find the Alu containing sequences prior to assembly, and then to deal with them separately and carefully after the other data has been put together. We have devised a new program (REP) for locating and tagging the extent of Alu sequence at both ends of gel readings. Prompted by the mass of alternative alignments found when assembling human DNA we have altered the assembly routine to restrict the amount of output produced.

The tRNA search routine in NIP has been improved. LIP is a new program for searching and extracting multiple entries from all forms of sequence library. Changes to the manual include more information about postscript files, a rewrite of our methods for organising the sequence libraries, and an explanation of the fate of readings during vector screening and assembly. There are still programs in the package, including MEP and NIPF, for which chapters in the manual have not yet been written, but they will have to wait for a further edition. Please read the preface to the first edition before proceeding.

1.2 Preface to second edition (November, 1992)

This second edition of the manual contains only minor revisions. The changes are mostly to do with managing sequencing projects, which is the subject on which we are currently concentrating our efforts. We have replaced our previous Developing Assembly Program DAP with another developing assembly program BAP that can assemble Bigger projects. Although this new program can handle 8000 readings as opposed to the miserly 1000 of the previous version, it actually uses its space more efficiently over the course of a project. It contains a mechanism for preventing simultaneous use (and hence corruption) of databases. In addition it is approximately four times faster during assembly and five times faster when looking for "internal joins". It now contains a routine for selecting primers and templates during the "walking" stage of a project. The "find internal joins" function now calls up the contig joining editor with the two contigs aligned in the window and the editor has also been speeded up. Numerous other changes have also been made but we still regard BAP as temporary, and are actively working on its replacement which we believe will overcome the limitations that BAP's aged structure has imposed on it. We have also included routines for converting ABI 373A and Pharmacia A.L.F. data to our new trace file format, for automatically marking poor quality regions of readings from these machines and for converting DAP databases to BAP databases.

Other changes include providing a postscript option for saving graphics output, and facilities for using the author and freetext indexes of the sequence libraries. The sequence library indexes are very useful and allow rapid searching. The freetext index is derived from ALL the text in the annotations - not just the keywords. We have also added a new repeat examining routine in NIP and a new repeat listing option in SIP.

1.3 Preface to first edition (March 1992)

It could be said that this manual is long overdue, for, apart from the extensive online help available from within the programs, it is the first printed guide to using a package that has been around for longer than I care to remember. On the other hand, to misquote a cliche much used by reviewers, it could be said that this manual fills a much needed gap, in that I believe the best way to learn about computer programs is to use them. Those who are prepared to experiment and play with programs will discover far more than any manual of reasonable size can hope to convey. However the manual serves to give users an overview of what is available and a starting point for their exploration of the programs.

One of my objectives was to be able to distribute the manual on floppy disk so that each site using the programs could print as many copies as they need. We had to balance the quality of the graphics and the sophistication of the layout against the ease of producing updates and the availability of software, and decided to use the WORD4 program running on the Apple Macintosh. The graphics figures reproduced in the manual are far below the quality seen on the terminal screen, and in some cases should be viewed as merely schematic.

In future editions we will add chapters on other programs in the package and expand the Notes sections to give more information about the theory and algorithms used. We welcome comments and suggestions for improvements.

I thank Brian Pashley for transforming my original documents into, what I hope will be, a useful manual.

2. Introduction

Table of contents

1. Introduction

2. Materials

2.1 Versions

2.2 Terminals

2.3 Digitizers

2.4 Sequencing machines

3. User interfaces

3.1 The xterm and VAX interface

3.2 The X interface

3.3 Use of the bell

3.4 Printing and saving results in files

3.5 Use of feature tables

3.6 Use of graphics

3.7 The active region

3.8 Files of file names

4. Character sets

4.1 Character sets for finished sequences

4.2 Symbols used in gel readings

5. Sequence formats

5.1 Personal sequence files

5.2 Sequence libraries

6. Conventions used in text

7. Notes

1. Introduction

In this chapter we give an overview of the chapters on the "Staden Package" of programs. Here we describe the equipment required and outline the scope of the package and the user interfaces. In the next chapter we cover character sets, sequence formats and sequence library access.

Most of the chapters are self-contained but users are strongly advised to read sections 3 to 7 of this chapter, as to do so will save a lot of time.

The main programs in the package are as follows:

GIP Gel input program

SAP Sequence assembly program

BAP Sequence assembly program

NIP Nucleotide interpretation program

PIP Protein interpretation program

SIP Similarity investigation program

MEP Motif exploration program

NIPL Nucleotide interpretation program (library)

PIPL Protein interpretation program (library)

SIPL Similarity investigation program (library)

XBAP Sequence assembly program

XNIP Nucleotide interpretation program

XPIP Protein interpretation program

XSIP Similarity investigation program

XMEP Motif exploration program

GIP uses a digitiser for entry of DNA sequences from autoradiographs. SAP, BAP and XBAP handle everything relating to assembling and editing gel readings. NIP provides functions for analysing and interpreting individual nucleotide sequences. PIP provides functions for analysing and interpreting individual protein sequences. MEP analyses families of nucleotide sequences to help discover new motifs. NIPL performs pattern searches on nucleotide sequence libraries. PIPL performs pattern searches on protein sequence libraries. SIP provides functions for comparing and aligning pairs of protein or nucleotide sequences. SIPL searches nucleotide and protein sequence libraries for entries similar to probe sequences. The programs whose names begin with a letter X are X11 (see below) versions of the programs. For example XNIP is an X11 version of NIP.

2. Materials

2.1 Versions.

The programs run on UNIX machines and there are also old, unsupported versions for Apple Macintosh and for VAX computers using the VMS operating system.
2.1.1 UNIX version.
The UNIX version is being distributed for SPARCstations running SUNOS and Solaris, DECstation 5000/240s running ULTRIX, DEC ALPHAs running OSF, and Silicon Graphics R3000 and R4000 machines. We recommend machines with at least 32 megabytes of memory, a gigabyte of disk, and a colour monitor. We also use Exabyte tape drives for archiving, and a cdrom drive for handling the sequence libraries.
2.1.2 Other UNIX versions.
Users of UNIX machines other than those listed above will require a FORTRAN compiler and ANSI C. When operated directly on the workstation screen all UNIX versions require X11 release 4 or above.

2.2 Terminals.

The programs can also be operated via a serial port using Tektronix terminals, PCs running MS-Kermit, or Apple Macintoshes running Versaterm Pro. The UNIX versions can also be run from X terminals or microcomputers running X emulators.

2.3 Digitizers.

The gel reading input program uses a sonic digitizer called a GRAPHBAR GP7 made by Science Accessories Corp., 200 Watson Blvd., Stratford, CT 06497, USA. When ordering specify that the device should be set to use metric units.

2.4 Sequencing machines and film readers.

The programs can handle data produced by the Applied Biosystems Inc. 373A and Pharmacia A.L.F fluorescent sequencing machines. They can also use data from the Amersham film reader. Traces can be displayed from within the assembly program for all three machines. For the Amersham film reader the original digitised image can also be viewed from within the assembly program.

3. User Interfaces

The programs have two user interfaces. The first runs under the terminal emulator xterm and the second runs directly under X. On the VAX, at present only the xterm interface is available, but on UNIX systems either interface can be used. The xterm version of the package will operate on the workstation screen, X terminals, Tektronix terminals, PCs or Macintoshes (see above). When run on the workstation screen the programs have separate text and graphics windows, each of which can be moved, resized and iconized, and the text window can be scrolled in both directions. The versions that run directly under X can only be used on the workstation screen, X terminals or using an X emulator. They produce separate text and graphics windows, an independent, constantly available help window and a separate dialogue window. All input is controlled by mouse selection and dialogue boxes.

3.1 The xterm interface

The user interface is common to all programs. It consists of a set of menus and a uniform way of presenting choices and obtaining input from the user. This section describes: the menu system; how options are selected and other choices made; how values are supplied to the program; how help is obtained, and how to escape from any part of a program. In addition it gives information about saving results in files and the use of graphics for presenting results.
3.1.1. Menus and option selection
Each program has several menus and numerous options. Each menu or option has a unique number that is used to identify it. Menu numbers are distinguished from option numbers by being preceded by the letter m (or M; the programs make no distinction between upper and lower case letters). With the exception of some parts of program SAP, the menus are not hierarchical; rather, the options they each contain are simply lists of related functions and their identifying numbers. Therefore options can be selected independently of the menu that is currently being shown on the screen, and the menus are simply memory aids. All options and menus are selected by typing their number when the programs present the prompt

"? Menu or option number ="

To select a menu type its number preceded by the letter M. To select an option type its number. If users type only "return" they will get menu m0 which is simply a list of menus. If users select an option they will return to the current menu after the function is completed. Where possible, equivalent or identical options have been given the same numbers in all programs, and so users quickly learn the numbers for the functions they employ most often.

3.1.2 Execution and dialogue
All inputs requested by the program (apart from file names) have default values. In addition most of the analytical functions have a default path through which they will pass, so when users select an option, in many cases the program will immediately perform the operation selected without further dialogue. However if users precede an option number by the letter d (e.g. D17), they will force the program to offer dialogue about the selected option before the function operates, hence allowing them to change the value of any of its parameters. In addition, alternative suboptions will be made available.
3.1.3 Help
Help about each option can be obtained by preceding the option number by the symbol ? when users are presented with the prompt "? Menu or option number =" (e.g. ?17 gives help on option 17), but there are two further ways of obtaining help. Whenever the program asks a question users can respond by typing the symbol ? and they will receive information about the current option. In addition, option number 1 in all the programs will give help on all of a program's functions.
3.1.4. Quitting
To exit from any point in a program users type ! for quit. If a menu is on the screen this will stop the program, otherwise they will be returned to the last menu.
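
The conventions of sections 3.1.1 to 3.1.4 can be summarised in a short sketch. The Python below is purely illustrative and is not code from the package: the function name parse_response and the labels it returns are hypothetical, and the treatment of a bare ? at the menu prompt is an assumption.

```python
def parse_response(text):
    """Classify a reply typed at the prompt "? Menu or option number =".

    Returns a (kind, number) tuple; number is None where it does not apply.
    All names used here are illustrative, not taken from the programs.
    """
    reply = text.strip().lower()      # upper and lower case are equivalent
    if reply == "":
        return ("menu", 0)            # "return" only gives menu m0
    if reply == "!":
        return ("quit", None)         # ! quits
    if reply == "?":
        return ("help", None)         # assumption: bare ? asks for general help
    prefixes = {"m": "menu", "d": "option_with_dialogue", "?": "help"}
    if reply[0] in prefixes:
        kind, digits = prefixes[reply[0]], reply[1:]
    else:
        kind, digits = "option", reply
    if not digits.isdecimal():
        return ("invalid", None)      # the question would be asked again
    return (kind, int(digits))
```

For example, the replies m3, 17, D17 and ?17 are classified as menu 3, option 17, option 17 with dialogue, and help on option 17 respectively.
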
3.1.5. Making selections
Questions and choices are dealt with in three ways. Where there are choices that are not obvious opposites, or there are more than two choices, "radio buttons" and "check boxes" are used.
3.1.5.1. Choosing between opposites.
Obvious opposites such as "clear screen" and "keep picture" are presented with only the default shown. For example in this case the default is generally "keep picture" so the program will display: "Keep picture (y/n) (y) =" and the picture will be retained if the user types Y or y or only return. If the user types N or n the picture will be cleared. Anything other than these or ? or ! will cause the question to be asked again.
3.1.5.2. Choosing one from many.
Radio buttons are used when only one of a number of choices can be made at any one time. The choices are presented arranged one above the other, each choice with a number for its selection, and the default choice marked with an X. For example when the user is reading a new sequence file the following choices of format are offered.

  Select sequence file format
    1 Staden
    2 EMBL
  X 3 GenBank
    4 PIR
    5 GCG
    6 FASTA
  ? Selection (1-6) (3) =

Any single option can be selected by typing the option number, and the default option (here shown as 3) is also obtained by typing only "return". Again help can be obtained by typing ? and quit by typing !.

3.1.5.3. Choosing at least one from many.
Check boxes are used when any number of a set of choices can be made (i.e. the choices are not exclusive). Choices are made by typing choice numbers. Each choice can be considered as a switch whose setting is reversed when it is selected. Choices that are currently switched on are marked with an X. The user quits from making selections by typing only "return". For example in the routine that plots base composition users can elect to plot the frequencies of any combination of bases, e.g. only A, or A+T, or A+T+G etc. The following check box is offered to the user:

  X 1 T
    2 C
  X 3 A
    4 G
  ? Selection (1-4) ( ) =

As shown this will plot the A+T composition. To switch off T select 1, to switch on C select 2, etc; to quit, having set the bases required, type only "return".
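
The switch-like behaviour of check boxes can be sketched as follows. This is an illustrative Python fragment rather than code from the package; the function name apply_check_box_replies and its arguments are hypothetical.

```python
def apply_check_box_replies(labels, switched_on, replies):
    """Return the labels left switched on after a series of typed replies.

    labels      -- the choices in display order, e.g. ["T", "C", "A", "G"]
    switched_on -- the choice numbers (1-based) marked with an X at the start
    replies     -- the strings typed by the user; "" stands for "return" only
    """
    state = set(switched_on)
    for reply in replies:
        reply = reply.strip()
        if reply == "":
            break                                   # "return" only ends selection
        if reply.isdecimal() and 1 <= int(reply) <= len(labels):
            state ^= {int(reply)}                   # reverse that switch
        # any other reply is ignored here; the programs would ask again
    return [labels[n - 1] for n in sorted(state)]
```

Starting from the check box shown above (T and A switched on), the replies 1, 2 and "return" leave C and A switched on.
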

3.1.6. Input of numerical values
All input of integer or decimal numbers is presented in a standard way with the allowed range shown in brackets and the default value also in brackets. For example:

  ? Window (5-31) (11) =

In this example users could type any number between 5 and 31, or "return" only, or ! or ? (see above). Any other input will cause the program to ask the question again. Typing only "return" gives the default value (here 11).
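
As a rough illustration of these rules, the following Python sketch classifies a reply to a numeric prompt. It handles integers only, and the function name read_number and the returned labels are hypothetical; none of this is taken from the programs themselves.

```python
def read_number(reply, low, high, default):
    """Interpret a reply to a prompt such as "? Window (5-31) (11) =".

    Returns ("value", n), ("help", None), ("quit", None) or ("ask_again", None).
    """
    reply = reply.strip()
    if reply == "":
        return ("value", default)      # "return" only gives the default
    if reply == "?":
        return ("help", None)
    if reply == "!":
        return ("quit", None)
    try:
        value = int(reply)
    except ValueError:
        return ("ask_again", None)     # not a number
    if low <= value <= high:
        return ("value", value)
    return ("ask_again", None)         # outside the allowed range
```
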
3.1.7. Input of character strings
Character strings are requested using informative prompts of the form:

  ? Search string =

Or where possible the prompt will be preceded by a default value as in:

  Default search string = atatatata
  ? Search string =

Question mark (?) or ! will get help or quit. Where appropriate, for example when a whole list of strings have been defined one after the other, typing return only will be a signal to the program that input is complete.

3.2. The X interface

This interface deals with all the types of interactions described above but options are selected using pulldown menus and all inputs are via appropriately styled dialogue boxes and buttons. Default values are accepted by clicking on an "OK" button, or typing return on the keyboard. Values are changed by overtyping the defaults. Quit is available from each dialogue via a "CANCEL" button. Help is constantly available via a "HELP" button in the main dialogue window. Details such as requesting dialogue when an option is selected are dealt with using a button labelled "execute with dialogue" which toggles to "execute".

3.3. Use of the bell

The programs use the bell to indicate that a task is completed. When the bell sounds, the programs will wait until return is typed. Users can quit from these points by typing ! but no help is available.

3.4. Printing and saving results in files

A few of the functions in the programs automatically write their textual results to disk files, but for most functions users can choose whether results appear on the terminal screen or go to a file. For these functions the normal, or default, place for results to appear is on the screen, and users need to decide before the function is selected if they want to redirect the results to a file. In all programs the option "Redirect output" gives control over whether results appear on the screen or go to a file. When a program is started results will be sent to the screen. If the option "Redirect output" is selected users will be given the choice of redirecting either text or graphics to a file or of creating a postscript file for the graphics. The program will then ask users to supply a file name. If users elect to redirect output, from that point on, all results will be sent to the file until the option is selected again, in which case the "redirection file" will be closed, and results will again appear on the screen. If these files contain textual results they can be looked at from within the programs by using option "List a text file". Once the program is left users can employ an appropriate system command to print the files. There is no function within the programs to direct files to a printer. If users elect to create a postscript file for the graphics the graphics will also appear on the screen. If they redirect graphics the graphics commands (in Tektronix codes) will only go to the file and will not appear on the screen.

3.5. Use of feature tables

One particular use of redirection should be noted. The programs can use EMBL/GenBank feature tables as input for directing translation of DNA to protein, etc, but the tables must be stored in separate text files, and cannot be read directly from the sequence libraries. The only routines that can read the sequence libraries are those available under "Read a sequence". So to create a text file containing the feature table for a particular library entry users must redirect text output to disk, and then use the "Read a sequence" option to display the appropriate feature table. The feature table will be written to the file, and then the file can be used for controlling translation etc. Note however that the redirection mechanism is a general function and it therefore does not add the required header and tail to saved files. To make the files useable as feature tables they need, as a minimum, a line at the top with the word FEATURES starting in column 1, and two empty lines at the end of the file.
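
The header and tail can of course be added with any text editor. As an alternative, a small script along the following lines could do the same job. This is only a sketch under the minimum requirements stated above; the function name make_feature_table is hypothetical and the script is not part of the package.

```python
def make_feature_table(saved_text):
    """Add the minimum header and tail to a redirected feature table.

    saved_text is the content of the file written by "Redirect output".
    The result starts with a line holding the word FEATURES in column 1
    and ends with two empty lines.
    """
    lines = saved_text.splitlines()
    # Drop any blank lines already present at the end of the saved text.
    while lines and lines[-1].strip() == "":
        lines.pop()
    # Add the header only if it is not already there.
    if not lines or not lines[0].startswith("FEATURES"):
        lines.insert(0, "FEATURES")
    # The last table line is terminated, then two empty lines follow.
    return "\n".join(lines) + "\n\n\n"
```
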

3.6. Use of graphics

The analytical programs including NIP, PIP and SIP present the results of many of their analyses graphically.
3.6.1. The drawing board and plot positions
The position at which the results for any function appear on the screen is defined relative to a notional user's "drawing board" of dimension 10,000 by 10,000. This drawing board fills the screen and results are drawn in windows defined using symbols x0,y0 and xlength,ylength, where x0,y0 is the position of the bottom left hand corner of the window, and xlength is the width of the window and ylength the height of the window. The window positions for each option are read from a file when a program is started. If required, individual users can have their own set of plot positions, and also the positions can be redefined from within the programs using the option "Reposition plots".
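
To make the coordinate system concrete, the sketch below converts a plot window given in drawing board units to a rectangle in screen pixels. It is an illustration only: the function name board_to_pixels is hypothetical, and it assumes a display whose pixel rows are counted downwards from the top edge, which is not stated in this manual.

```python
BOARD_SIZE = 10000  # the notional drawing board is 10,000 by 10,000


def board_to_pixels(x0, y0, xlength, ylength, screen_width, screen_height):
    """Return (left, top, width, height) in pixels for a plot window.

    x0, y0 is the bottom left hand corner of the window on the drawing
    board; xlength and ylength are its width and height in board units.
    """
    left = x0 * screen_width // BOARD_SIZE
    width = xlength * screen_width // BOARD_SIZE
    height = ylength * screen_height // BOARD_SIZE
    # Board y runs upwards from the bottom of the screen, whereas pixel
    # rows are assumed to run downwards from the top.
    top = screen_height - (y0 + ylength) * screen_height // BOARD_SIZE
    return (left, top, width, height)
```

On a display of 1000 by 800 pixels, a window with x0,y0 = 1000,5000 and xlength,ylength = 8000,4000 would occupy the rectangle starting 100 pixels from the left and 80 pixels from the top, 800 pixels wide and 320 pixels high.
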
3.6.2. The plot interval
For those analyses that draw continuous lines to represent results (for example a plot of base composition) the user is asked to supply the "Plot interval". All the analyses produce a value for every point along the sequence but often it is unnecessary to actually plot the values for all the points. The plot interval is simply the distance between the points shown on the screen. If the user selects a plot interval of 1, every point will be plotted; a plot interval of 3 will show every third point.
3.6.3. The window length
The word "window" is used in a further way by the programs. Most of the functions that analyse the content of a sequence (the simplest such routine plots the base composition) perform their calculations over a segment of the sequence of a certain length, display the result, then move on by 1 position, and recalculate. The fixed size of segment over which a calculation is performed is called a "window" and the segment size is the "window length". Many analytical functions request "? Window length =", or more frequently "? Odd window length =". An odd number is used so that when a result is displayed for a particular window position it is derived from an equal number of points either side of the window's midpoint.
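
The interplay of window length and plot interval can be illustrated with a base composition calculation. The Python below is a sketch written for this manual, not the routine used by NIP; the function name composition_points is hypothetical. For brevity it evaluates only the windows that would actually be plotted, which gives the same points as calculating every position and showing every nth one.

```python
def composition_points(sequence, bases="at", window_length=11, plot_interval=1):
    """Return (midpoint, fraction) pairs for a sliding window composition plot.

    midpoint is the 1-based sequence position at the centre of the window;
    fraction is the proportion of characters in the window found in bases.
    """
    if window_length % 2 == 0:
        raise ValueError("window length must be odd")
    seq = sequence.lower().replace("u", "t")   # T and U are equivalent
    wanted = set(bases.lower())
    half = window_length // 2                  # points either side of the midpoint
    points = []
    for start in range(0, len(seq) - window_length + 1, plot_interval):
        window = seq[start:start + window_length]
        count = sum(1 for base in window if base in wanted)
        points.append((start + half + 1, count / window_length))
    return points
```

With the sequence aaaccc, a window length of 3 and bases "a", a plot interval of 1 gives points at positions 2, 3, 4 and 5, whereas a plot interval of 3 gives points at positions 2 and 5 only.
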
3.6.4. Use of the cross hair
All programs that produce graphical output provide a function for using a cross hair to examine the plots. After the cross hair function is selected the cross will appear in the graphics window and can be steered around using the mouse or directional keys. Special keyboard characters hit while the function is in operation produce the following results. For all programs the letter s (for sequence) will show the local sequence around the cross hair position. For the sequence comparison programs that show a dot matrix the two sequences will be displayed above one another. For the sequencing project management programs all the aligned sequences in the contig will be displayed. For the sequence comparison programs the letter m (for matrix) will show a matrix in which all identical characters for a window around the cross hair are marked. The punctuation symbol , will show the local position in sequence units, but leave the cross hair on the screen, whereas the space bar and any other non-special character will show the local position and exit the cross hair function. Further special characters are defined in the chapter on managing sequencing projects.
3.6.5 Drawing scales on plots
All the programs have a function "Draw a ruler" which will allow users to add scales to the axes of graphical plots. The scale can be positioned anywhere on the plot.
3.6.6 Saving graphics
There are several alternative methods of obtaining hard copy of the graphics output. The best way of saving the graphics is to use the "Redirect output" function to open a postscript file which will then contain a copy of all plots that appear on the screen. This of course requires the file to be opened before the plots are drawn. Most machines have screen dump utilities. Another way of obtaining hard copy of graphical results is to use a microcomputer as a terminal. On the Macintosh we use the terminal emulator Versaterm Pro. This allows graphics to be saved as Macintosh files that can be annotated and printed using Mac drawing and painting programs. Alternatively graphics can be redirected to a file and printed using a laser printer with Tektronix capability (see "Printing and saving results in files").

3.6.7 Postscript files

As is stated above the best way to save graphical results (not the traces from sequencing machines, but all graphical results from analytical programs) is to redirect output to a postscript file. There are two ways of controlling the appearance of the plots from the postscript files. The features under the user's control include: landscape or portrait mode, which parts of the plot to save, and the line width. These can be specified when the "Redirect output" function is selected or later by editing the postscript file. The section of the plot to be saved is specified using the coordinate system defined in section 3.6.1. Line widths are defined in units of one seventy-second of an inch (e.g. a line width of 5 is 5/72 of an inch). Editing the postscript file is not as daunting as it might sound as those things that can be changed are defined and annotated at the top of the file.

3.7. The active region

All the analytical programs use an "active region" for most of their functions. This is simply the current section of the sequence over which the analysis will be applied. When a sequence is first read in the active region will be set to its whole length, but the user can restrict the scope of analytical functions by use of an option called "Define active region". However some functions such as "List the sequence" are always given access to the whole sequence and will allow the user to define a limited range after they have been selected.

3.8. Files of file names

A useful device that is employed by many of the programs is that of "files of file names". If a program needs to perform the same operation in turn on each of 20 files, the user should not have to type in 20 file names. Instead the user types in the name of a single file which contains the names of the other 20 files. This single file is a file of file names. They are used, for example, to process batches of gel readings, or to compare a sequence against a library of motifs.
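
The idea is simple enough to show in a few lines. The sketch below reads a file of file names and returns the names it contains; it assumes one name per line with blank lines ignored, which is an assumption made for illustration, and the function name read_file_of_file_names is hypothetical.

```python
from pathlib import Path


def read_file_of_file_names(path):
    """Return the list of file names stored in a file of file names.

    Assumes one file name per line; surrounding spaces and blank lines
    are ignored.
    """
    names = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if name:
            names.append(name)
    return names
```

A program would then loop over the returned list, applying the same operation to each named file in turn.
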

4. Character Sets

There are two types of character sets employed by the programs: those for finished sequences and those used during sequencing projects.

4.1 Character sets for finished sequences

The analytical programs will operate with uppercase or lowercase sequence characters. For nucleic acids T and U are equivalent. For proteins the standard 1 letter codes are used. The analytical programs also use IUB symbols for redundancy in back translations and for sequence searches. The symbols are shown in table 2.1.

A,C,G,T

R (A,G) 'puRine'

Y (T,C) 'pYrimidine'

W (A,T) 'Weak'

S (C,G) 'Strong'

M (A,C) 'aMino'

K (G,T) 'Keto'

H (A,T,C) 'not G'

B (G,C,T) 'not A'

V (G,A,C) 'not T'

D (G,A,T) 'not C'

N (G,A,C,T) 'aNy'

Table 2.1 The NC-IUB characters used by the analytical programs
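
The way such symbols behave in a sequence search can be sketched as follows. The table of symbols is the one given above; the search itself is a deliberately simple illustration, not the algorithm used by the analytical programs, and the name find_matches is hypothetical.

```python
# Each symbol maps to the set of bases it stands for (see the table above).
IUB = {
    "a": "a", "c": "c", "g": "g", "t": "t", "u": "t",
    "r": "ag", "y": "tc", "w": "at", "s": "cg", "m": "ac", "k": "gt",
    "h": "atc", "b": "gct", "v": "gac", "d": "gat", "n": "gact",
}


def find_matches(pattern, sequence):
    """Return the 1-based positions at which pattern matches sequence.

    pattern may contain any of the symbols in IUB; sequence is expected
    to contain only a, c, g, t or u. Case is ignored.
    """
    pat = pattern.lower()
    seq = sequence.lower().replace("u", "t")   # T and U are equivalent
    if not pat:
        return []
    hits = []
    for start in range(len(seq) - len(pat) + 1):
        if all(seq[start + i] in IUB[symbol] for i, symbol in enumerate(pat)):
            hits.append(start + 1)
    return hits
```

For example, the pattern GRATC matches both ggatc and gaatc, because R stands for either A or G.
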

4.2 Symbols used in gel readings

The information stored about a sequence reading has to show the original sequence, recording any doubts about its interpretation, and also, where possible, allow the changes made during editing to be indicated. Lowercase characters are used by the sequence project management programs for recording readings, and uppercase symbols are used when changes are made during editing. Alternatively the reverse convention can be used. Any other characters in a sequence are treated as dash (-) characters. The symbols are shown in table 2.2.

5. Sequence Formats

The data formats for the programs that deal with sequencing projects are described in the chapter on managing sequencing projects. All analytical programs can read sequences stored in several formats. We distinguish between two sources of input namely: "sequence libraries" and "personal files".

Symbol  Meaning

c       Definitely c
t       Definitely t
a       Definitely a
g       Definitely g
1       Probably c
2       Probably t
3       Probably a
4       Probably g
d       Probably c, possibly cc
v       Probably t, possibly tt
b       Probably a, possibly aa
h       Probably g, possibly gg
k       Probably c, possibly c-
l       Probably t, possibly t-
m       Probably a, possibly a-
n       Probably g, possibly g-
r       a or g
y       c or t
5       a or c
6       g or t
7       a or t
8       g or c
-       a or g or c or t
A       a set by auto edit or corrected by user
C       c set by auto edit or corrected by user
G       g set by auto edit or corrected by user
T       t set by auto edit or corrected by user
*       padding character placed by auto assembler
else    treated as -

Table 2.2 The symbols used to record gel readings
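The last row of the table (any other character is treated as a dash) can be illustrated with a short Python sketch. It is not taken from the package; the function name is hypothetical and it covers only the default convention of lowercase readings with uppercase edits.

```python
# Every symbol listed in the table of gel reading symbols.
GEL_SYMBOLS = set("ctag1234dvbhklmnry5678-ACGT*")


def normalise_reading(reading: str) -> str:
    """Replace any character that is not a recognised symbol with '-'."""
    return "".join(ch if ch in GEL_SYMBOLS else "-" for ch in reading)
```

With this rule a reading typed as `acgtx1*Z` would be stored as `acgt-1*-`.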

5.1 Personal sequence files

The programs can read sequences from files in PIR, EMBL, GenBank, GCG, FASTA and Staden formats. Staden format means text files with records of up to 80 characters; all spaces are removed; lines with ";" in the first position are treated as comments and will be displayed when the file is read but not included in the sequence; if the first line of data contains a 20 character header of the form <---abcdefghij-----> it too will not be included in the processed sequence. This last facility allows the programs to read consensus sequences created by the sequence project management programs. Files in FASTA and PIR format can contain any number of entries (which the user selects by entry name), but all other formats are expected to contain only one sequence. If they contain more only the first will be read.
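The rules for Staden format can be sketched as follows. This is a reading of the description above rather than the package's own reader: the function name is hypothetical, and the header is recognised simply as a first data line whose 1st character is "<" and whose 20th character is ">".

```python
from typing import Iterable, List, Tuple


def read_staden(lines: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (sequence, comments) from the lines of a Staden format file."""
    comments = []
    parts = []
    first_data_line = True
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(";"):
            # Comment lines are shown to the user but are not sequence.
            comments.append(line[1:].strip())
            continue
        if not line.strip():
            continue
        if first_data_line:
            first_data_line = False
            # Drop a 20 character header such as <---abcdefghij----->.
            if len(line) >= 20 and line[0] == "<" and line[19] == ">":
                line = line[20:]
        # All spaces are removed from sequence lines.
        parts.append(line.replace(" ", ""))
    return "".join(parts), comments
```

Given the lines `; test sequence`, `<---contig0001----->ACGT ACGT` and `TTGG  CC`, the sketch returns the sequence `ACGTACGTTTGGCC` and one comment.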

5.2 Sequence libraries

Users may not appreciate the fact that because the sequence libraries are so large, programs need to use indexes to provide rapid retrieval of individual entries. An index is a list of entry names and pairs of offsets. For each entry name the offsets define the position at which its sequence and annotation start in the large file. The index, which is in any case relatively small, is arranged so that it can be searched quickly - for example the EMBL CDROM index is sorted alphabetically. When the user supplies an entry name the program rapidly finds it in the index file and then uses the associated offsets to locate the entry in the larger sequence files.
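The principle of a sorted index of entry names and offsets can be illustrated in Python. The record layout here (name, annotation offset, sequence offset) and the function names are assumptions made for the example; the real index files are binary and are not described in this section.

```python
import bisect
from typing import BinaryIO, List, Optional, Tuple

Record = Tuple[str, int, int]  # (entry name, annotation offset, sequence offset)


def build_index(records: List[Record]) -> Tuple[List[str], List[Record]]:
    """Sort the records by entry name so they can be searched quickly."""
    ordered = sorted(records)
    return [rec[0] for rec in ordered], ordered


def lookup(index, entry_name: str) -> Optional[Tuple[int, int]]:
    """Binary search the sorted names; return the pair of offsets or None."""
    names, records = index
    i = bisect.bisect_left(names, entry_name)
    if i < len(names) and names[i] == entry_name:
        return records[i][1], records[i][2]
    return None


def read_annotation(library: BinaryIO, index, entry_name: str) -> Optional[bytes]:
    """Seek straight to the entry and read from annotation to sequence start."""
    offsets = lookup(index, entry_name)
    if offsets is None:
        return None
    annotation_start, sequence_start = offsets
    library.seek(annotation_start)
    return library.read(sequence_start - annotation_start)
```

Only the small index is searched; the large library file is touched once, at the offset the index supplies.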

The EMBL, SWISSPROT, GenBank, PIR and NRL3D sequence libraries all use different formats. The EMBL CDROM has useful indexes for entry name and accession number, brief descriptions, authors, species (taxon) and freetext. These indexes point to the data for each entry, and can be used to extract the data for any entry quickly and to do rapid text searches. We use the EMBL CDROM as our main source for sequence libraries.

The VAX version of our package used PIR format which meant we had to convert all libraries other than PIR into that format. This required, at least temporarily, having space for two copies of the libraries, and a lot of cpu time. The UNIX version of our package avoids this by leaving all libraries in their distributed format. However, to provide rapid access and searching, we create EMBL CDROM style indexes for each of them. These indexes are relatively small and the package contains programs and scripts to make them for all the libraries listed above. The analysis programs in the package can read and search all the libraries once the indexes are created. The indexing programs index the data in its distributed form: WE DO NOT REFORMAT OR COPY THE LIBRARIES but simply create indexes to the original files. Obviously this saves a lot of disk space, and for those content to use only EMBL and SWISSPROT from the CDROM, almost no disk space is required and the indexes are on the disk ready for use (see below about the division lookup file).

Below we describe how the files and programs are organised to allow all the different library formats to be used, and also how the libraries are installed and, if necessary, indexes created. This information is only of interest to those involved in installing the sequence libraries, and to be honest we have not yet found a simple way of describing the file organisation!

5.2.1 Introduction to the EMBL CDROM indexing method

The EMBL CDROM indexing system includes the following files.

1. Division files. These are the files that contain the sequence entries. An entry includes annotation followed by sequence. For EMBL there are currently 15 division files divided, roughly speaking, by taxonomy.

2. Division lookup files. These files associate division files stored on disk with numbers stored inside the index files. For example the division file for mammalian data is coded as 4 in the indexes. The division lookup file allows us to store the data files anywhere we wish. At present it has 15 records and the divisions are numbered 1 to 15.

3. The indexes. Actually there are several indexes, but for our purposes, the main index is the entryname index that associates entrynames with byte offsets into division files (coded as division numbers). The accession number, author and freetext indexes are divided into "target" and "hit" files and provide references into the entryname index. A further file is called the "brief directory or index", and this contains one record for each entry which includes its entryname, primary accession number, sequence length and an 80 character description.

5.2.2 Organisation of the sequence library files.

The programs that use the libraries need to know which libraries are available, what their format is, where their data files are, where their division lookup file is, and where their indexes are. To achieve this the libraries are defined at several levels using a treelike structure and a number of additional files. Figure 2.1 shows a schematic of the relationship of the files. Each of the three levels in the tree corresponds to a particular type of file. The first level is a list of available libraries and their format types. For each library, the second level names the indexes and the division lookup file. Finally the division lookup file associates the division numbers with real files on the disk. Notice that EMBL has 15 divisions and SWISSPROT only 1. The details for the other libraries are not shown. Below we outline the contents of these three levels of file. Environment variables are used for most of the files.

Figure 2.1 The organisation of the files used by the analytical software to locate and process the sequence libraries.

Level 1: The file listing the available sequence libraries.

This file contains a list of the available libraries. It defines the library type (EMBL/SWISSPROT, GenBank or PIR) so that the software knows how to read the data files; it names a file that contains the next level of information about each library, and it provides messages to appear on the users screen.

A EMBLFILES EMBL nucleotide library ! comment

C GENBFILES GenBank nucleotide library

A SWISSFILES SWISSPROT protein library

B PIRFILES PIR protein library

B NRL3DFILES NRL3D protein library

D PROSITEDATFILES Prosite data

E PROSITEDOCFILES Prosite documentation

An example of this type of file is shown above. The libraries have types A,B,C,D,E. The names such as EMBLFILES are actually environment variables linked to real files that contain the next level of information. The prompts are "EMBL nucleotide library" and "SWISSPROT protein library", etc. Anything to the right of a ! is a comment. (Note that we also include the PROSITE library here because it has EMBL CDROM style indexes which can be handled identically to those for the sequence libraries.) The environment variable SEQUENCELIBRARIES is associated with this file. Below we follow the branch of the tree that deals with the EMBL library.
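A file of the kind shown above could be read with a sketch like the following. The field layout (type letter, environment variable name, prompt, optional comment after "!") is taken from the example; the class and function names are hypothetical and this is not the package's own parser.

```python
from typing import Iterable, List, NamedTuple


class Library(NamedTuple):
    lib_type: str        # A, B, C, D or E
    files_variable: str  # environment variable naming the level 2 file
    prompt: str          # text shown on the user's screen


def parse_library_list(lines: Iterable[str]) -> List[Library]:
    """Parse the level 1 file listing the available libraries."""
    libraries = []
    for raw in lines:
        # Anything to the right of a ! is a comment.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # not a usable record
        prompt = fields[2] if len(fields) > 2 else ""
        libraries.append(Library(fields[0], fields[1], prompt))
    return libraries
```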

Level 2: The list of files for a particular library (using EMBL as an example)

This type of file names the indexes and division lookup file for a particular library. For the EMBL library the file is called EMBLFILES. This file uses environment variable EMBLDIVPATH for the directory that contains the division lookup file, and EMBLINDPATH for the directory containing the index files. An example is shown below. The software that searches the library uses the "types", A,B,C,...,K to know which files contain which type of information. For example if it needs the entryname index it looks for a line starting with the letter B and then reads the file EMBLINDPATH/entrynam.idx.

A EMBLDIVPATH/embl_div.lkp

B EMBLINDPATH/entrynam.idx

C EMBLINDPATH/acnum.trg

D EMBLINDPATH/acnum.hit

E EMBLINDPATH/brief.idx

F EMBLINDPATH/freetext.trg

G EMBLINDPATH/freetext.hit

H EMBLINDPATH/author.trg

I EMBLINDPATH/author.hit

J EMBLINDPATH/taxon.trg

K EMBLINDPATH/taxon.hit

Level 3: The division lookup file.

This type of file associates the division numbers in the entryname index with real file names stored on disk. For the EMBL library it is called EMBLDIVPATH/embl_div.lkp and it contains the information shown below.

1 EMBLPATH/bb.dat

2 EMBLPATH/fun.dat

3 EMBLPATH/inv.dat

4 EMBLPATH/mam.dat

5 EMBLPATH/org.dat

6 EMBLPATH/patent.dat

7 EMBLPATH/phg.dat

8 EMBLPATH/pln.dat

9 EMBLPATH/pri.dat

10 EMBLPATH/pro.dat

11 EMBLPATH/rod.dat

12 EMBLPATH/syn.dat

13 EMBLPATH/una.dat

14 EMBLPATH/vrl.dat

15 EMBLPATH/vrt.dat

Note that the EMBL CDROM contains a division lookup file for the data on the CDROM, but this is not the one used by our software. We rewrite it so the directory structure and file names can be chosen locally. Its format is I6,1x,A.
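The format I6,1x,A is a FORTRAN edit descriptor: a six column integer, one blank column, then text. A minimal Python sketch of reading such records follows. The function name is hypothetical, and the expansion of a leading name such as EMBLPATH from the environment is an assumption about how the path is resolved.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def parse_division_lookup(
    lines: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> Dict[int, str]:
    """Map division numbers to file names from I6,1x,A records."""
    environ = os.environ if environ is None else environ
    divisions = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        number = int(line[:6])    # I6: integer in columns 1-6
        name = line[7:].strip()   # 1x skips column 7, A is the rest
        head, sep, tail = name.partition("/")
        if head in environ:
            # Assumed: the leading component is an environment variable.
            name = environ[head] + sep + tail
        divisions[number] = name
    return divisions
```

Because only this small file names real paths, the division files can be moved without touching the indexes.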

5.2.3 Installing the sequence libraries.

Installation of EMBL and SWISSPROT from CDROM is straightforward as no indexes have to be created. Installation of data from any other source requires indexes to be created.

5.2.3.1 Installing from the EMBL CDROM

The data can be left on the CDROM or copied to hard disk. The files staden.login and staden.profile source the files $STADTABL/libraries.config.csh and $STADTABL/libraries.config.sh respectively. Refer to these files to see what is required to install, add or move a sequence library that you want to be used by the programs. The environment variables EMBLFILES, EMBLPATH, EMBLINDPATH and EMBLDIVPATH may need to be redefined. If the names or number of the division files change, the division lookup file will need to be edited. Note that so far the division numbers have corresponded to an alphabetical ordering of the division file names.

5.2.3.2 Installing all libraries other than those on the EMBL CDROM

For all libraries other than those on the EMBL CDROM, including EMBL updates, indexes must be created. The package includes programs and scripts for creating indexes for all the libraries. It is beyond the scope of this manual to detail these operations. To produce any of the indexes requires the creation of several intermediate files, and the indexing programs are written so that the intermediate files are the same for all libraries. This means that only the programs that read the distributed form of each library need to be unique to that library, and all the other processing programs can be used for all libraries. In addition to our own programs the scripts that produce the indexes also use the UNIX sort program. We give no further details here but the programs are described in Staden and Dear, 1992. Refer to the file $STADTABL/libraries.config which is distributed with the package.

6. Conventions Used In The Manual

Obviously the programs can perform many more operations than there is space to describe but, in the selection of uses shown, we have tried to give some feel for the programs' scope. For this reason, and the need to conform as closely as possible to the format of the book, we have chosen specific paths through the programs, rather than attempt to describe all routes. For some sections, such as that on the facilities available for editing contigs, this has not been possible and we have instead described how the major commands are used. It should also be noted that the user interactions described in the methods sections are those that would be required if the options were selected in the "Execute with dialogue" mode. In practice many of the options would normally be used without any dialogue being required.

In the section on the user interface we outlined the different modes of obtaining input from users. Throughout the specific chapters we have adopted the following conventions to indicate which mode of input is being employed. When a program requests numerical or string input we have used the term "Define", as in Define "Minimum search score". When a program requests that a choice is made between several options, as in the case of radio buttons or check boxes, we have used the term "Select". When a program offers a choice between two options in the form of a yes or no answer, as in "Hide translation", we use the terms "Accept" or "Reject". When the digitizer program uses the stylus for input we have used the term "Hit".

Because it is difficult to produce figures including pull down menus and dialogue boxes, almost all examples containing user input are taken from the xterm interface. However the actual wording of the prompts is the same for both interfaces.

The programs contain routines for drawing scales on plots and for simple annotation, but in general such embellishment is not done automatically by the programs. This is because the programs are designed so that many plots can be superimposed, and it is better for the user to explicitly decide to add scales and annotation. More elaborate annotation can be added by saving the graphics output to files which can be handled by, say Macintosh, painting and drawing programs. None of the examples of graphical results shown in the following chapters have added scales: all are exactly as drawn by the programs.


7. NOTES

7.1

Although all the programs in the Macintosh version of the package work, the conversion to this machine was never finished. The package does not provide access to the sequence libraries, handling only simple text files containing sequences, or those generated by the assembly program SAP. The user interface, although using pull down menus and dialogue boxes for all interactions, is not as "Mac like" as many would expect. However many people find this version very useful, and for others, the digitizer program alone makes the package worth having. Data input from a digitizer is a task suited to a machine like the Macintosh, and the data files can be transferred to a larger machine for assembly and other analysis. With the exception of sequence library access, all the options available in the 1990 VAX version are contained in the package (See Staden, 1990). We give no further details specific to the Macintosh version.

8. References

1. Staden, R. 1990. An improved sequence handling package that runs on the Apple Macintosh. Comput. Applic. Biosc. 4, 387-393.

2. Staden, R. and Dear, S. 1992. Indexing the sequence libraries: Software providing a common indexing system for all the standard sequence libraries. DNA Sequence 3, 99-105.

3. Sequence Input, Editing and Sequence Library Use

Table of contents

1. Introduction

1.1 Introduction to sequence input

1.2 Introduction to keyboard input

1.3 Introduction to input from digitizer

1.4 Introduction to editing single sequences

1.5 Introduction to using the sequence libraries

2. Methods

2.1 Sequence input from keyboard

2.2 Sequence input from digitizer

2.3 Sequence input from the Pharmacia A.L.F.

2.4 Sequence input from the ABI 373A.

2.5 Editing a nucleic acid sequence using restriction sites and a translation and base numbering as landmarks.

2.6 Searching the freetext and author indexes of a sequence library

2.7 Using accession numbers to retrieve data from a sequence library

2.8 Displaying the annotations for an entry in a sequence library

2.9 Reading a sequence from sequence library

2.10 Worked example of sequence library access

3. Notes

4. References

1. Introduction

In this chapter we describe sequence input and editing and the use of sequence libraries.

1.1 Introduction to sequence input and editing

The package contains facilities for input of sequence data from the keyboard, sonic digitizers, and ABI 373A and Pharmacia A.L.F fluorescent sequencing machines. Editing of single sequences can be performed using system editors such as EDT on the VAX and EMACS on the SUN. Editing of sequence alignments is discussed in the chapter on managing sequencing projects.

1.2 Introduction to keyboard input

The program SAP contains an option to enter sequence at the keyboard. It also creates a file of file names and will list the sequences. Users may choose any 4 keys to represent the characters A, C, G and T. For example 4 adjacent keys in the same order as the lanes on a gel could be used. The program translates these symbols to A, C, G and T, and any other characters are left unchanged. No line of input should be longer than 80 characters. Terminate input with the symbol @.
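The key translation described above can be sketched in Python. This is an illustration only, with hypothetical function names; it assumes the four keys are given in the order A, C, G, T and that anything typed after the @ is ignored.

```python
from typing import Iterable


def make_key_translator(keys_for_acgt: str) -> dict:
    """Build a translation table from four chosen keys to A, C, G and T."""
    if len(keys_for_acgt) != 4 or len(set(keys_for_acgt)) != 4:
        raise ValueError("exactly four different keys are needed")
    return str.maketrans(keys_for_acgt, "ACGT")


def read_typed_sequence(lines: Iterable[str], table: dict) -> str:
    """Translate typed lines until the terminating @ symbol is met."""
    pieces = []
    for line in lines:
        if len(line) > 80:
            raise ValueError("input lines must not exceed 80 characters")
        text, terminator, _ = line.partition("@")
        # Characters other than the four chosen keys are left unchanged.
        pieces.append(text.translate(table))
        if terminator:
            break
    return "".join(pieces)
```

With the keys `jkl;` chosen for A, C, G and T, typing `jkl;jj-` and then `k;@` gives `ACGTAA-CT`.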

1.3 Introduction to input from digitizer

Digitizers provide a convenient way of entering sequences from films into a computer. The digitizer, which is connected directly to the computer, operates on a light box, and is controlled by a program named GIP (1). The film to be read is taped firmly to the surface of the light box, and the user defines the lane order and the centres of the four lanes to be read. These positions are defined at the point where reading will commence and the program adjusts their values as the film is read. The user reads the sequence and transfers it to the computer by hitting the centres of the bands progressing up the film. Any number of sets of lanes and films can be read in a single run of the program. Each sequence is stored in a separate file and a file of file names is also written. The program also uses a menu, which is a series of reserved areas of the light box surface, for entering commands and uncertainty codes. When the pen is pressed in these areas the program responds accordingly. Each time the pen tip is depressed in the digitizing area the program sounds the bell on the terminal to indicate to the user that a point has been recorded. As the sequence is read the program displays it on the screen.

1.4 Introduction to editing single sequences

The editing method used by the programs is designed to give users access to an editor with which they are familiar, i.e. the one on their machine, say EDT on a VAX or EMACS on a UNIX system, and yet to allow them to edit a sequence which contains all the landmarks they need in order to know where they are. Users can create a file containing a simple listing of the sequence (single stranded) with numbering, using "list the sequence", and then edit it with their system editor, using the numbering to know where they are within the sequence. When the edits are complete they exit from the editor and the program "analyses" the edited file to extract only the sequence characters. Similarly a file containing a three phase translation, or a file containing a sequence plus its three phase translation, plus its restriction sites marked above the sequence (see figure 3.1), can be edited. In order to be able to "analyse" such complicated listings and correctly extract the sequence the following simple rule is used: all lines in the file that contain a character that is not A,C,T,G or U are deleted. It is obviously important to be aware of this rule and its implications. For protein sequences only a simple listing, i.e. the sequence plus numbering, can be used.
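The rule, and one of its implications, can be illustrated with a short Python sketch. It is not the package's code: the function name is hypothetical, and it assumes that spaces are ignored and that the test is case-insensitive.

```python
from typing import Iterable

SEQUENCE_CHARS = set("ACGTU")


def extract_sequence(listing_lines: Iterable[str]) -> str:
    """Keep only lines made up entirely of A, C, G, T or U."""
    kept = []
    for line in listing_lines:
        chars = [ch for ch in line if not ch.isspace()]
        # A line containing any other character is deleted as a whole.
        if chars and all(ch.upper() in SEQUENCE_CHARS for ch in chars):
            kept.append("".join(chars))
    return "".join(kept)


listing = [
    "MspI        MseI",     # enzyme names: deleted
    "ccggttagactgttaac",    # sequence: kept
    "        10",           # numbering: deleted
    " P  V  R  L  L",       # translation: deleted
    " A  C  T  G",          # translation spelling only A, C, T, G: kept
]
sequence = extract_sequence(listing)
```

Under these assumptions the last line of the listing shows the pitfall: a translation line that happens to contain only the residues A, C, T and G would be taken as sequence.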

1.5 Introduction to using the sequence libraries

The installation of the sequence libraries is described in the introductory chapter. Direct access to the libraries is provided by all programs that need such a facility: it is not performed by separate programs. The facilities currently offered in NIP, PIP, SIP, NIPL, PIPL, and SIPL include the following:

Get a sequence by knowing its entry name

Get a sequence's annotation by knowing its entry name

Get an entry name by knowing its accession number

Search the author index for author names

Search the freetext index for keywords

Search the taxon index for species

The facilities currently offered in NIPL, PIPL and SIPL include:

Search whole library

Search only a list of entry names

Search all but a list of entry names

HapII
HpaII
MspI MseI
. .HincII
. .HindII
. .HpaI DsaV
. .. EcoRII
. .. TspAI
. .. . ApyI
. .. . BstNI
. .. . MvaI
. .. . ScrFI MaeIII
. .. . . . BsrI MseI
ccggttagactgttaacaacaaccaggttttctactgatataactggttacatttaacgc
10 20 30 40 50 60
P V R L L T T T R F S T D I T G Y I * R
R L D C * Q Q P G F L L I * L V T F N A
G * T V N N N Q V F Y * Y N W L H L T P

Figure 3.1 The first page width of a sequence display that can be edited by the program.

2. Methods

2.1 Sequence input from keyboard

1. Select "Type in gel readings".

2. Accept "Use special keys for A,C,T,G".

3. Define the keys in turn.

4. Define "File of file names". This is a file of file names, so that the readings can be processed as a batch.

5. Define the sequence by typing it in using the selected keys. Finish by typing an @ symbol.

6. Define "File name for this gel reading". This is the name for the sequence just entered.

7. Accept "Type in another reading". This cycles round to step 5. If rejected the next step follows.

8. Accept "List gel readings". The batch of readings entered will each be listed, one after the other, headed by their file names, on the screen.

2.2 Sequence input from digitizer

1. Tape the autoradiograph down securely on the light box.

2. Start the program (GIP).

3. Define "File of file names".

4. Using the digitizer pen hit the digitizer menu ORIGIN, program menu ORIGIN, program menu START.

After the bell has sounded the program will give the default lane order.

5. If correct hit CONFIRM otherwise hit RESET. To reset the lane order hit the A,C,G,T boxes in the menu in left to right order.

6. Hit START, then hit in left to right order, at a height level with the first band to be read, the start positions for the next four lanes. The program will report the mean lane separations and ask for confirmation that they are correct.

7. Hit START.

8. Hit the bands on the film in sequence order. If necessary use the uncertainty codes in the program menu. Continue until the sequence is finished.

9. Hit STOP.

10. Define "Name for this reading".

11. Accept "Read another sequence". Otherwise the program will stop.

2.3 Sequence input from the Pharmacia A.L.F.

After processing and base calling on the PC the data for all 10 clones is contained in a single file, and the user names each clone using local conventions. This single file is then transferred to the SUN using PC-NFS. This program allows SUN directories to be mounted as if they were DOS disks, and data can be transferred by use of the DOS copy command. On the SUN, to prepare for processing by program XBAP, the 10 clones are split into 10 separate files, each with the name given on the PC. In addition a file of file names is written. The reads for the individual clones then need to be examined to clip off the vector sequence and the poor data at the 5' end. See note 2.

2.4 Sequence input from the ABI 373A.

After processing and base calling on the Macintosh the data for each clone is contained in 2 files: one is simply the sequence but the main file contains the raw data, trace data and sequence. For our processing we do not use the sequence file as we can extract all we need from the main file. The user names each file using local conventions and then the folder is transferred to the SUN using TOPS. This program allows SUN directories to be mounted as if they were on the Macintosh, and data can be transferred by simply dragging folders on the Macintosh screen. On the SUN, to prepare for processing by program XBAP, a file of file names is written and the reads for the individual clones are examined to clip off the vector sequence and the poor data at the 5' end. See note 2.

2.5 Editing a nucleic acid sequence using restriction sites and a translation and base numbering as landmarks.

1. Select NIP.

2. Read in the sequence to be edited.

3. Direct output to disk, say creating file edit.seq.

4. Use the restriction enzyme site search routine (See the relevant chapter) to create a file showing "Names above the sequence", as in figure 3.1.

5. Close the redirection file.

6. Select "Edit the sequence".

7. Define "Name of file to edit". This is the file containing the sequence listing, say edit.seq. The system editor will start up.

8. Edit the sequence.

9. Exit from the editor.

10. Accept "Make edited sequence active". The edited sequence will replace the original sequence.

2.6 Searching the freetext (or author, or taxon) index of a sequence library

The index searches are effectively instantaneous and produce a list of entry names, which can be displayed. A typical list is shown in figure 3.2. Each search uses an AND, OR or NOT operator. When the routine is entered the list is empty, so the NOT operator is not offered; successful AND and OR searches will create a current list. After the list is created the operators combine each individual search with the current list of hits. AND means that hits must already be on the list. OR means that hits will be added to the list. NOT means that hits will be removed from the list. AND with an empty list will produce an empty list (the list must be deleted once it becomes empty). When several strings are searched for simultaneously the selected operator also determines how the strings are combined before the result is, in turn, combined with the current list: AND means that all strings must match, OR that at least one of them must match, and NOT that none of them may match.

1. Select "Read new sequence".

2. Select "Sequence library". The alternative is "Personal file", and if taken would be followed by questions about which of the formats "Staden, EMBL, GenBank, PIR, GCG or FASTA" it was stored in.

3. Select, say, "EMBL nucleotide library".

4. Select "Search indexes". The list of currently available indexes for this library will be displayed. At most this will include author, freetext and taxon. At this stage both AND and OR searches will be offered for each index.

5. Select AND text. Alternatives are OR text, AND author, OR author, AND taxon, OR taxon.

6. Define "Text". Type up to 5 words separated by spaces, i.e. space is the delimiting character (see the note below about author searches).

7. The search will start and for each match the program will display the contents of the matching line, which includes the entry name, primary accession number, sequence length and an 80 character description. After every 20 matches the program will ring the bell and the user can escape by typing "!".

The commands for searching the author and taxon indexes are effectively the same. Note that for authors it is useful to be able to link words together for names such as De Gaule or von Meyenberg. The symbol underscore (_) can be used for this purpose - e.g. De_Gaule or von_meyenberg. The same facility is available for the text and taxon searches.

LAMBDA V00636 48502 Genome of the bacteriophage lambda (Styloviridae)

MIBTXX V00654 16338 Complete bovine mitochondrial genome.

MIHSCG J01415 16569 Human mitochondrion, complete genome.

MIHSM1 M10546 2771 Human mitochondrial DNA, fragment M1, encoding tr

MIHSXX V00662 16569 H.sapiens mitochondrial genome

MIPX1C01 M10860 130 Bacteriophage phi-X174, nucleotides 3920-4049.

MIPX1C02 M10861 115 Bacteriophage phi-X174, nucleotides 3480-3595.

MIPX1C03 M10862 121 Bacteriophage phi-X174, nucleotides 4260-4380.

MIPX1CTI M10849 130 Bacteriophage phi-X174, nucleotides 3389-3520.

PHIX174 V01128 5386 Bacteriophage phi-X174 (cs70 mutation) complete g

R17CPRAA M24826 61 Bacteriophage R17 coat protein RNA fragment.

11 different entries found

Figure 3.2 A typical list of hits for an index search. It includes the entry name, principal accession number, sequence length and a description.
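The way the AND, OR and NOT operators combine searches with the current list, as described at the start of this section, can be sketched with Python sets. This is one reading of the description, not the package's code: the function name and the index contents are invented for illustration, and "no list yet" is represented by None to distinguish it from a list that has become empty.

```python
from typing import Dict, List, Optional, Set


def apply_search(
    current: Optional[Set[str]],
    index: Dict[str, Set[str]],
    strings: List[str],
    operator: str,
) -> Set[str]:
    """Combine the hits for strings with the current list of entry names."""
    if not strings:
        raise ValueError("at least one search string is needed")
    hit_sets = [index.get(s.upper(), set()) for s in strings]
    if operator == "AND":
        hits = set.intersection(*hit_sets)   # all strings must match
        return hits if current is None else current & hits
    if operator == "OR":
        hits = set.union(*hit_sets)          # at least one must match
        return hits if current is None else current | hits
    if operator == "NOT":
        if current is None:
            raise ValueError("NOT is not offered until a list exists")
        # Remove entries matching any string, so none of them match.
        return current - set.union(*hit_sets)
    raise ValueError("operator must be AND, OR or NOT")


# Invented index contents, for illustration only.
author = {"SANGER": {"MIBTXX", "MIHSCG", "PHIX174"},
          "COULSON_A.R": {"MIBTXX", "MIHSCG", "LAMBDA"}}
text = {"MITOCHONDRIA": {"MIBTXX", "MIHSCG", "MIHSM1"}}
taxon = {"HUMAN": {"MIHSCG", "MIHSM1"}}

hits = apply_search(None, author, ["sanger", "coulson_a.r"], "AND")
hits = apply_search(hits, text, ["mitochondria"], "AND")
hits = apply_search(hits, taxon, ["human"], "NOT")
```

With the invented data above, the three searches leave a single entry, MIBTXX, on the list.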

2.7 Using accession numbers to retrieve data from a sequence library

1. Select "Read new sequence".

2. Select "Sequence library".

3. Select, say, "EMBL nucleotide library".

4. Select "Get entry names from accession numbers".

5. Define "Accession number".

6. The program will display the entry names corresponding to the accession number. The last entry name found will become the default entry name.

2.8 Displaying the annotations for an entry in a sequence library

1. Select "Read new sequence".

2. Select "Sequence library".

3. Select, say, "EMBL nucleotide library".

4. Select "Get annotations".

5. Define "Entry name". The program will display the annotation for the entry. After every 20 lines the program will ring the bell and the user can escape by typing "!".

2.9 Reading a sequence from a sequence library

1. Select "Read new sequence".

2. Select "Sequence library".

3. Select, say, "EMBL nucleotide library".

4. Select "Get a sequence".

5. Define "Entry name". The program will make the sequence the active sequence and display its base composition.

2.10 Worked example of sequence library access

The worked example in figure 3.3 shows: a. selection of the EMBL library; b. selection of "search indexes"; c. an AND author search for Sanger and Coulson a.r. (note the use of underscore (_) to link words together); d. an AND text search for mitochondria; e. a NOT taxon search for human; f. display of the current list; g. escape to the previous menu; h. selection of "list annotation"; i. escape from the listing; j. selection of "get a sequence"; k. selection of the default sequence.

Select sequence source

X 1 Personal file

2 Sequence library

? Selection (1-2) (1) =2

Select a library

X 1 EMBL 36 nucleotide library Nov 93

2 SWISSPROT 26 protein library Nov 93

3 PIR 37 protein library June 93

4 NRL3D 59 From Brookhaven protein library March 92

5 prosite data

6 prosite documentation

? Selection (1-6) (1) =1

Library is in EMBL format with indexes

Select a task

X 1 Get a sequence

2 Get annotation

3 Get entryname from accession number

4 Search indexes

? Selection (1-4) (1) =4

Select a task

X 1 Author AND search

2 Author OR search

3 Text AND search

4 Text OR search

5 Taxon AND search

6 Taxon OR search

? Selection (1-6) (1) =1

Search for Authors

? Authors=sanger coulson_a.r

SANGER hits 26

COULSON A.R hits 30

Current number of hits on list is 12

Select a task

X 1 Author AND search

2 Author OR search

3 Author NOT search

4 Text AND search

5 Text OR search

6 Text NOT search

7 Taxon AND search

8 Taxon OR search

9 Taxon NOT search

10 Delete current list

11 Display current list

? Selection (1-11) (1) =4

Search for Text

? Text=mitochondria

MITOCHONDRIA hits 3841

Current number of hits on list is 4

Select a task

X 1 Author AND search

2 Author OR search

3 Author NOT search

4 Text AND search

5 Text OR search

6 Text NOT search

7 Taxon AND search

8 Taxon OR search

9 Taxon NOT search

10 Delete current list

11 Display current list

? Selection (1-11) (1) =9

Search for Taxon

? Taxon=human

HUMAN hits 40226

Current number of hits on list is 1

Select a task

X 1 Author AND search

2 Author OR search

3 Author NOT search

4 Text AND search

5 Text OR search

6 Text NOT search

7 Taxon AND search

8 Taxon OR search

9 Taxon NOT search

10 Delete current list

11 Display current list

? Selection (1-11) (1) =11

MIBTXX V00654 16338 Complete bovine mitochon

1 different entries found

Current number of hits on list is 1

Select a task

X 1 Author AND search

2 Author OR search

3 Author NOT search

4 Text AND search

5 Text OR search

6 Text NOT search

7 Taxon AND search

8 Taxon OR search

9 Taxon NOT search

10 Delete current list

11 Display current list

? Selection (1-11) (1) =!

Select a task

X 1 Get a sequence

2 Get annotation

3 Get entryname from accession number

4 Search indexes

? Selection (1-4) (1) =2

Default Entry name=MIBTXX

? Entry name=

ID MIBTXX standard; circular DNA; ORG; 16338 BP.

XX

AC V00654; J01394;

XX

DT 03-NOV-1982 (Rel. 02, Created)

DT 12-SEP-1993 (Rel. 36, Last updated, Version 38)

XX

DE Complete bovine mitochondrial genome.

XX

KW 12S ribosomal RNA; 16S ribosomal RNA; ATPase; cytochrome;

KW cytochrome oxidase; genome; origin of replication; ribosomal RNA;

KW transfer RNA; transfer RNA-Ala; transfer RNA-Arg; transfer RNA-Asn;

KW transfer RNA-Asp; transfer RNA-Cys; transfer RNA-Gln;

KW transfer RNA-Glu; transfer RNA-Gly; transfer RNA-His;

KW transfer RNA-Ile; transfer RNA-Leu; transfer RNA-Lys;

KW transfer RNA-Met; transfer RNA-Phe; transfer RNA-Pro;

KW transfer RNA-Ser; transfer RNA-Thr; transfer RNA-Trp;

KW transfer RNA-Tyr; transfer RNA-Val; unidentified reading frame.

XX

OS Bos taurus (cattle)

OC Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;

OC Theria; Eutheria; Artiodactyla; Ruminantia; Pecora; Bovidae.

XX

OG Mitochondrion

XX

RN [1]

RP 1-16338

RA Anderson S., de Bruijn M.H.L., Coulson A.R., Eperon I.C.,

RA Sanger F., Young I.G.;

RT "Complete sequence of bovine mitochondrial DNA. Conserved features

RT of the mammalian mitochondrial genome";

RL J. Mol. Biol. 156:683-717(1982).

XX

DR SWISS-PROT; P00157; CYB_BOVIN.

DR SWISS-PROT; P00396; COX1_BOVIN.

DR SWISS-PROT; P00404; COX2_BOVIN.

DR SWISS-PROT; P00415; COX3_BOVIN.

DR SWISS-PROT; P00847; ATP6_BOVIN.

DR SWISS-PROT; P03887; NU1M_BOVIN.

DR SWISS-PROT; P03892; NU2M_BOVIN.

DR SWISS-PROT; P03898; NU3M_BOVIN.

DR SWISS-PROT; P03902; NULM_BOVIN.

DR SWISS-PROT; P03910; NU4M_BOVIN.

DR SWISS-PROT; P03920; NU5M_BOVIN.

DR SWISS-PROT; P03924; NU6M_BOVIN.

DR SWISS-PROT; P03929; ATP8_BOVIN.

XX

FH Key Location/Qualifiers

FH

FT source 1..16338

FT /organism="Bos taurus"

FT /mitochondrion

FT tRNA 364..430

FT /note="tRNA-Phe"

FT rRNA 431..1385

FT /note="12 S rRNA"

FT tRNA 1386..1452

FT /note="tRNA-Val"

FT rRNA 1453..3023

FT /note="16 S rRNA"

FT tRNA 3024..3098

FT /note="tRNA-Leu(UUR)"

FT CDS 3101..4057

FT /note="U.R.F. 1"

FT tRNA 4057..4125

FT /note="tRNA-Ile"

FT tRNA complement(4123..4194)

FT /note="tRNA-GLN"

FT tRNA 4197..4265

FT /note="tRNA-fMet"

FT CDS 4266..5309

FT /note="U.R.F. 2"

FT tRNA 5308..5374

FT /note="tRNA-Trp"

FT tRNA complement(5376..5444)

FT /note="tRNA-Ala"

FT tRNA complement(5446..5518)

FT /note="tRNA-AsN"

FT rep_origin 5519..5550

FT /note="Origin of L-strand replication"

FT tRNA complement(5551..5617)

FT /note="tRNA-Cys"

FT tRNA complement(5618..5685)

FT /note="tRNA-Tyr"

FT CDS 5687..7231

FT /product="cytochrome oxidase I"

FT tRNA complement(7229..7299)

Select a task

X 1 Get a sequence

2 Get annotation

3 Get entryname from accession number

4 Search indexes

? Selection (1-4) (1) =

Default Entry name=MIBTXX

? Entry name=

DE Complete bovine mitochondrial genome.

Sequence length 16338

Sequence composition

T C A G -

4443. 4237. 5460. 2198. 0.

27.2% 25.9% 33.4% 13.5% 0.0%

Figure 3.3 A worked example of sequence library use.
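The way successive searches refine the hit list in Figure 3.3 can be pictured as set operations. The Python fragment below is only an illustrative sketch and not the package's own code: it assumes the hit list behaves like a set of entry names, and the function name and every entry name except MIBTXX are invented for the example.

```python
# Illustrative sketch only: models the hit list as a Python set.
# The function name and all entry names except MIBTXX are hypothetical.

def combine(current, hits, op):
    """Combine the current hit list with the hits from a new search."""
    if op == "AND":
        return current & hits      # keep entries found by both
    if op == "OR":
        return current | hits      # keep entries found by either
    if op == "NOT":
        return current - hits      # drop entries found by the new search
    raise ValueError(f"unknown operator: {op}")

# Four hits on the list, then a "Taxon NOT search" for human.
current = {"MIBTXX", "HSMITO1", "HSMITO2", "HSMITO3"}
human = {"HSMITO1", "HSMITO2", "HSMITO3", "HSOTHER"}
current = combine(current, human, "NOT")
print(sorted(current))
```

Under this assumption a NOT search can only shorten the list, which agrees with the drop from 4 hits to 1 in the figure.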

3. NOTES

1. The program menu for GIP is simply a set of boxes drawn on the digitizing surface that each contain a command or uncertainty code. Right-handed users will find it is best to position the menu to the right of the digitizing area, but in practice as long as its top edge is parallel to the digitizer box, it can be put anywhere in the active region. As well as the codes a,c,g,t,1,2,3,4,b,d,h,v,r,y,x,-,5,6,7,8 the following commands are included in the menu: DELETE removes the last character from the sequence; RESET allows the lane centres to be redefined; START means begin the next stage of the procedure; STOP means stop the current stage in the procedure; CONFIRM means confirm that the last command or set of coordinates is correct.

The digitizing device also has a menu of its own. This lies in a two inch wide strip immediately in front of the digitizing box. Pen positions within this two inch strip are interpreted as commands to the digitizer and are not sent to the GIP program. In general the only time users will need to use the device menu is when they tell GIP where the program menu lies in the digitizing area. This is done by first hitting ORIGIN in the device menu and then hitting the bottom left hand corner of the program menu. If the bell does not sound after hitting START try hitting METRIC in the device menu (the program uses metric units, and some digitizers are set to default to use inches; hitting METRIC switches between the two).

The user should try to hit the bands as near as possible to the centre of the lanes because the program tracks the lanes up the film using the pen positions. If the lane centres get too close the program stops responding to the pen positions of bands and hence does not ring the bell. If this occurs users must hit the reset box in the menu and the program will request them to redefine the lane centres at the current reading position. Then they can continue reading. As a further safeguard the program will only respond to pen positions either in the menu or very close to the current reading position.

2. Details about preparing the data from fluorescent sequencing machines for processing by XBAP are contained in the notes for the chapter on managing sequencing projects.

3. All of the operations described for the EMBL nucleotide library can be performed in exactly the same way for GenBank and the SWISSPROT and PIR protein libraries. For keyword searching the freetext index is most useful because it contains all words in feature tables, definition lines, title lines, keywords and comment lines. The searches are very fast. The search will find all words that start with the given keywords: e.g. keyword sugar will match with sugar, sugaractivating, sugars, etc. When several keywords are used together, only entries indexed on all the words will be reported. On the VAX, EMBL, GenBank, SWISSPROT and PIR can all be processed.

4. The package also contains a program called LIP which processes the sequence libraries. It can perform all the searches on authors and freetext etc, but its unique feature is that it can process lists of entry names. For example the output from a freetext search (which will contain a list of matching entry names) can be saved in a file and then used as input to an option that will extract copies of all those entries from the library. The program can extract sequences only, annotations only, and sequences plus annotations. Output can be in the original format of whichever library is being processed, in Staden format or FASTA format. All extracted entries are written to separate files with file name extensions that are a function of the selected format. If required these files can then be concatenated to produce a single sub-library. Obviously it is best to work in a new directory.
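The keyword matching rules given in note 3 (a keyword matches every indexed word that starts with it, and several keywords together report only entries indexed on all of them) can be sketched as follows. This is a toy illustration under our own assumptions, not the library index code: the index is a small in-memory dictionary, and the entry names and function names are hypothetical.

```python
# Illustrative sketch only: a toy free-text index mapping words to entry names.
# The index contents and function names are hypothetical.

def prefix_hits(index, keyword):
    """Return entries indexed on any word that starts with the keyword."""
    key = keyword.lower()
    hits = set()
    for word, entries in index.items():
        if word.startswith(key):
            hits |= entries
    return hits

def search_all(index, keywords):
    """Return entries indexed on every keyword (prefix match, combined with AND)."""
    result = None
    for keyword in keywords:
        hits = prefix_hits(index, keyword)
        result = hits if result is None else result & hits
    return result or set()

index = {
    "sugar": {"ENTRY1"},
    "sugars": {"ENTRY2"},
    "kinase": {"ENTRY2", "ENTRY3"},
}
print(sorted(search_all(index, ["sugar"])))            # both sugar entries
print(sorted(search_all(index, ["sugar", "kinase"])))  # only the entry with both
```

A real precomputed index would presumably be held sorted on disk so that the matching words can be located without scanning every word; the linear scan here is only for brevity.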

4. References

1. Staden, R. 1984. A computer program to enter DNA gel reading data into a computer. Nucl. Acids Res. 12, 499-503.

4. Managing Sequencing Projects

Table of contents

1. Introduction

2. Methods

2.1 Starting a project database

2.2 Screening against restriction enzyme recognition sequences

2.3 Screening against vector sequences and repeat families

2.4 Entering readings into the project database (assembly)

2.5 Searching for internal joins

2.6 Editing in XBAP

2.7 Joining contigs interactively in XBAP

2.8 Selecting primers and templates

2.9 Examining the quality of a consensus

2.10 Using graphical displays to examine contigs

2.11 Checking assembly

2.12 Examining reads from the same template

2.13 Disassembling contigs

2.14 Searching for repeats

2.15 Filling single stranded regions with hidden data

2.16 Shuffling pads

2.17 Checking for editing mistakes

2.18 Displaying a contig

2.19 Highlighting differences between readings and the consensus

2.20 Screen editing contigs in SAP

2.21 Automatic editing in SAP

2.22 Using the original editor in SAP

3. Notes

4. References

1. Introduction

Data input, assembly, checking and editing are the major tasks of sequence project management. Data input is described in a previous chapter and here we cover everything else. The programs can deal with data derived from autoradiographs, from automated gel reading machines such as the Applied Biosystems 373A and the Pharmacia A.L.F., and from film readers such as the Amersham scanner.

We describe two alternative programs for managing sequencing projects. They contain the same assembly and vector screening routines but they differ in their editing methods. One program SAP (see references 1 and 2) can be operated from simple terminals and emulators but the other XBAP (3) requires an X terminal or emulator. XBAP contains a superior editor plus the facility to annotate sequences and display the coloured traces for data derived from fluorescent sequencing machines. Those using autoradiographs will find that SAP is adequate but XBAP is essential for users of fluorescent sequencing machines. Readers should note that several of the methods for displaying contigs described below are probably of value only to those unable to use the screen based contig editor in XBAP.

Fluorescent sequencing machines provide machine readable data. This means, given appropriate software, that while making editing decisions the user can see, displayed on the screen, the coloured traces used to derive the sequence. However data from these machines requires some extra processing. First the machines tend to produce long sequences with poor quality at their 3' ends and so we have to decide how much of the data to use. Secondly the sequencing machine does not recognise the primer region (as the user would) so we need to have some way of removing it from the data. The poor quality data from both ends of the sequence and the vector sequences are identified non-interactively by programs clip-seqs and vep. Alternatively these tasks can be performed interactively using program TED (4). We term the data from the 3' end of a reading that is not employed in the assembly process "hidden" sequence. Note that we do not lose this data but simply ignore it until such time as it can be useful for locating joins between contigs, or for double stranding regions of the sequence.

We have devised our own file format (called SCF) for storing traces, sequences and confidence values for data produced by automated sequence readers (5). For ABI data this typically reduces the storage required to 30% of the original. Data from the ABI 373A and the Pharmacia A.L.F. can be converted to this form using the program makeSCF.

We use a database to store all the data for each sequencing project. The individual sequence readings derived from autoradiographs or from sequencing machines are initially stored in separate files but the program copies them into the database during the assembly process. For normal operation the program handles batches of readings - say 36 from a film or machine run. Batch processing is achieved by use of files of file names.
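Batch processing through files of file names can be pictured with the short Python sketch below. It is an illustration only: it assumes a file of file names holds one reading file name per line, and the function names and the example names are hypothetical.

```python
# Illustrative sketch only: assumes one reading file name per line.
# Function names and example reading names are hypothetical.

def parse_names(text):
    """Return the reading file names listed in the text of a file of file names."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def format_names(names):
    """Return the text of a file of file names, one name per line."""
    return "".join(f"{name}\n" for name in names)

def run_stage(names, passes):
    """Apply one screening stage and return the names to pass to the next stage."""
    return [name for name in names if passes(name)]

batch = parse_names("read1.s1\nread2.s1\n\nread3.s1\n")
# A stand-in screening test: pretend read2.s1 was found to be entirely vector.
passed = run_stage(batch, lambda name: name != "read2.s1")
print(format_names(passed), end="")
```

Each screening step described below follows this pattern: it reads one file of file names and writes another containing only the readings that should go on to the next step.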

Depending on the strategy employed and the stage of the project the following operations may be performed.

1) Start a project database.

2) Select primers and templates.

3) Obtain readings.

4) Put individual readings into the computer and write a file of file names. For data derived from fluorescent sequencing machines choose which data from the 3' end of the reading should not be used for the assembly process.

5) Screen the batch against any vectors that may be present, excising any vector sequence found and passing to the next step, the names of those readings that contain some non-vector sequence. Those dealing with data such as that from humans which is likely to contain multiple copies of repeat families like Alu may wish to add a further step at this point. By screening against the repeat family sequences the readings can be divided into those that contain the repeat and those that do not. Those that do not contain the repeat can then be assembled first and those that do can be assembled carefully afterwards.

6) Screen the batch against any restriction sites whose presence would indicate a problem, passing those that do not match on to the next step.

7) Compare each reading in the batch with the current contents of the project database adding them to the contigs they overlap, joining contigs or starting new contigs.

8) Check the number of contigs and the quality of the consensus sequence and plan further experiments. Try to join contigs by searching for overlaps between their ends. (This is particularly useful for those using data from fluorescent sequencing machines, where although the 3' end of the sequence is not good enough for automatic assembly, it can be valuable for finding overlaps between contigs).

9) Edit the contigs to resolve disagreements.

10) Produce a consensus sequence.

11) Analyse the consensus sequence, possibly discovering further errors.

Subsets of these operations will be cycled through repeatedly. A pure shotgun strategy would continue using steps 3-7, a pure primer walking strategy would also include step 2. A number of the steps require almost no user intervention, however checking quality and final editing decisions are still interactive procedures. The program contains several options, such as displays of the overlapping readings in a contig, to help indicate, not only the poorly determined regions, but also which clones could be resequenced to resolve ambiguities, or those which can usefully be extended or sequenced in the reverse direction, to cover difficult regions. It is best to use a command procedure or script for handling steps 5-7.

For our projects we have a script which users employ by typing "assemble filename", where filename is the file of file names for the current batch of readings. This script calls all the necessary options in SAP or BAP (see notes) in order to make a backup of the database, screen against any vectors, assemble readings and print a report. In the text below we describe how these operations are performed interactively.

2. Methods

2.1 Starting a project database

The assembled data for each project is stored in a database. At the beginning of a project it is necessary to create an empty database using program SAP or XBAP.

1. Select "Open database"

2. Select "Start new database"

3. Define the database name. Database names can have from one to 12 letters and must not include full stop (.).

4. Accept "Database is for DNA"

5. Define "Database size". This is an initial size and if necessary can be increased later using "Copy database". Roughly speaking it is the number of readings expected to be needed to complete the project. Currently BAP limits the maximum to 8000 and SAP has a limit of 1000.

6. Define "Maximum reading length". This is the length of the longest reading that will be added to the database. The minimum is 512 bases, and the maximum 4096.

The program should confirm that "copy 0" of the database has been started. See Note 14 for important information.
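The limits quoted in the steps above can be collected into a small check, sketched here in Python. This is not part of the package: the function name is hypothetical, the lower bound of one reading for the database size is our own assumption, and the name check only tests the length and the absence of a full stop.

```python
# Illustrative sketch only: checks the limits quoted in the steps above.
# The function name and the lower bound of 1 for the size are assumptions.
MAX_READINGS = {"SAP": 1000, "BAP": 8000}   # limits quoted in step 5

def check_database_settings(name, size, max_reading_length, program="BAP"):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not (1 <= len(name) <= 12) or "." in name:
        problems.append("name must have 1 to 12 letters and no full stop")
    if not (1 <= size <= MAX_READINGS[program]):
        problems.append(f"size must be between 1 and {MAX_READINGS[program]}")
    if not (512 <= max_reading_length <= 4096):
        problems.append("maximum reading length must be between 512 and 4096")
    return problems

print(check_database_settings("demo", 500, 512))
print(check_database_settings("my.db", 2000, 5000, program="SAP"))
```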

2.2 Screening against restriction enzyme recognition sequences

For some strategies it is necessary to compare readings against any restriction enzyme recognition sequences that may have been used during cloning and which should not be present in the data. The function operates on single readings or processes batches accessed through files of file names. The algorithm looks for exact matches to recognition sequences. The recognition sequences should be stored in a simple text file with one recognition sequence per record.

1. Accept "Use file of filenames".

2. Define "File of gel reading names". The input file of file names.

3. Define "File for names of sequences that pass". A file of file names for those readings that do not contain the recognition sequences. After the run it will contain the names of all the files in the batch that do not match any of the restriction enzyme recognition sequences. Hence it can be used for further processing of the batch.

4. Define "File name of recognition sequences". The name of the file of recognition sequences.
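The screening logic (exact matches only, with non-matching readings passed on) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than the program's code: readings are held in a dictionary instead of separate files, only the strand as read is searched (which is sufficient for palindromic recognition sequences), and all names are hypothetical.

```python
# Illustrative sketch only: readings are held in a dict rather than in files.
# Function and reading names are hypothetical.

def screen_restriction_sites(readings, sites):
    """Split reading names into those without and with a recognition sequence."""
    sites = [site.lower() for site in sites]
    passed, failed = [], []
    for name, seq in readings.items():
        seq = seq.lower()
        if any(site in seq for site in sites):   # exact match only
            failed.append(name)
        else:
            passed.append(name)
    return passed, failed

readings = {"r1": "acgtgaattcgt", "r2": "acgtacgtacgt"}
passed, failed = screen_restriction_sites(readings, ["GAATTC"])
print("pass:", passed, "fail:", failed)
```

The list of passed names corresponds to the contents of the "File for names of sequences that pass".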

2.3 Screening against vector sequences and repeat families

For most strategies it is necessary to compare readings against any vector sequences that may have been picked up during cloning. The package contains two routines for screening against vectors. The original function simply reports any matches between the readings and the vector sequences and only passes on those that do not match. This function should now only be used to screen for any other sequences that should be excluded from the database, because the newer one (program name VEP for vector excising program) is capable of both finding the vector sequences and editing them out automatically.

We have recently written a new program (REP) for detecting the presence of repeat families in gel readings. At this stage the program must be regarded as experimental but it appears to be able to detect Alu sequences quite reliably. The program reports the extent of each reading that contains Alu, including those that have Alu at both ends. For data from projects containing highly repetitive sequences such as Alus we collect all the shotgun data prior to assembly and screen it for Alu with program REP. This tells us which reads contain Alu and which do not. We assemble the Alu free data in the normal way. Then we sort the Alu containing data on the amount of nonAlu data it contains. These readings are then assembled carefully (i.e. at low levels of mismatch) taking those with most nonAlu sequence first. Prior to assembling them we screen them (including their hidden data) against the data already in the database (see below). At present we do not know the error rate for REP although we believe it to be low, and it is not fast - depending on the machine used it takes around 30 minutes to compare 1000 readings with our current Alu library.

2.3.1 Clipping off vector sequences

There are two types of vector that may need to be screened out of gel readings: the sequencing vector and, for cases where, say, whole cosmids have been shotgunned, the cloning vector. The two tasks are different. When screening out the sequencing vector we may expect to find data to exclude, both from the primer region and from the other side of the cloning site (when, for example, the insert is short). When screening out cosmid vector we may find that either the 5' end, or the 3' end, or the whole of the sequence is vector. Also for the cosmid search we need to compare both strands of the sequence. The program (VEP) works slightly differently for each of the two cases. Having read the vector sequence from a file the program asks for the "Position of the cloning site". A value of zero signifies that the search will be for the cosmid vector. A nonzero value signifies that the search is for the sequencing vector, and so in this case the program then asks for the "Relative position of the primer site". A negative relative position signifies that a reverse primer is being used, otherwise a forward primer is assumed. See Note 3 for a description of how to calculate these values.

The program screens a batch of readings using a file of file names and creates a new file of file names which contains the names of all those sequences that include some nonvector sequence. For each sequence that contains some vector it writes out a new copy of the file in which the vector portion is identified. It also copies the original file, giving it the name "<original>.<vector>", where "original" is the original name of the file, and "vector" is the name of the vector file.

The search, which uses a hashing algorithm, is very rapid. Users specify a "Word length", the "Number of diagonals to combine" and a "Minimum score". The word length is the minimum number of consecutive bases that will count as a match. The algorithm treats the problem like a dot matrix comparison and finds the diagonal with the highest score. Then it adds the scores for the adjacent diagonals, up to the "Number of diagonals to combine". If the combined score is at least the "Minimum score" the sequence is marked to indicate that it contains vector. The score represents the proportion of a diagonal that contains matching words, so the maximum score for any diagonal is 1.0.

1. Define "Input file of file names". This is the file containing the names of all the readings to be screened.

2. Define "File name of vector sequence".

3. Define "Position of cloning site". This is the base number, relative to the beginning of the vector sequence, that is on the 3' side of the insert site. For example for m13mp18 the SmaI site is at 6249. A zero value signifies that the search is for cosmid vector.

4. Define "Relative position of 3' end of primer site". This is the position, relative to the cloning site, of the first base that could be included in the sequence. For m13mp18, the 17mer Sequencing Primer and the SmaI site, the position is 41.

5. Define "Word length". Only words of this length will be counted as matches.

6. Define "Number of diagonals to combine". The scores for this number of diagonals around the highest scoring diagonal will be combined to give the total score.

7. Define "Cutoff score". For a match, at least this proportion of the total length of the summed diagonals must contain identical words.

8. Define "Output file of passed file names". The name of the file to contain the names of the readings to pass on to the assembly program.

Processing then commences and finishes with a summary stating the number of files processed, the number completely vector, the number partly vector and the number free of vector.
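The diagonal scoring used in the vector search can be illustrated with the Python sketch below. It is our own reconstruction from the description above and not the VEP source: word positions in the vector are hashed, each matching word marks the bases it covers on its diagonal, and the combined score is taken as the covered proportion of the total length of a window of diagonals centred on the best one. The function names, the centring of the window and the treatment of the "Number of diagonals to combine" as an odd window width are all assumptions.

```python
# Illustrative sketch only: a reconstruction of diagonal scoring, not VEP itself.
from collections import defaultdict

def diagonal_length(d, n_read, n_vec):
    """Number of dot-matrix cells on diagonal d (vector pos minus reading pos)."""
    return max(0, min(n_read, n_vec - d) - max(0, -d))

def covered_per_diagonal(reading, vector, word_length):
    """Count, for each diagonal, the reading bases covered by matching words."""
    words = defaultdict(list)                    # hash of vector word positions
    for j in range(len(vector) - word_length + 1):
        words[vector[j:j + word_length]].append(j)
    covered = defaultdict(set)
    for i in range(len(reading) - word_length + 1):
        for j in words.get(reading[i:i + word_length], ()):
            covered[j - i].update(range(i, i + word_length))
    return {d: len(positions) for d, positions in covered.items()}

def combined_score(reading, vector, word_length, n_diagonals):
    """Covered proportion of a window of diagonals around the best diagonal."""
    n_read, n_vec = len(reading), len(vector)
    covered = covered_per_diagonal(reading, vector, word_length)
    if not covered:
        return 0.0
    best = max(covered, key=lambda d: covered[d] / diagonal_length(d, n_read, n_vec))
    flank = (n_diagonals - 1) // 2               # assumed: odd, centred window
    window = range(best - flank, best + flank + 1)
    total_covered = sum(covered.get(d, 0) for d in window)
    total_length = sum(diagonal_length(d, n_read, n_vec) for d in window)
    return total_covered / total_length

vector = "ttacgtacgtggcc"
print(combined_score("acgtacgtgg", vector, 4, 1))   # reading lies wholly in vector
print(combined_score("ggggggg", vector, 4, 1))      # no shared words
```

A reading would then be marked as containing vector when the value returned is at least the cutoff score.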

2.3.2 Screening for "vectors"

This function is contained in both SAP and XBAP and operates on single readings or processes batches accessed through files of file names. The algorithm looks for exact matches of length "minimum match length" and displays the overlapping sequences.

1. Accept "Use file of filenames".

2. Define "File of gel reading names". The input file of file names.

3. Define "File for names of sequences that pass". A file of file names for those readings that do not contain the vector sequence. After the run it will contain the names of all the files in the batch that do not match the vector sequence. Hence it can be used for further processing of the batch.

4. Define "File name of vector sequence". The name of the file containing the vector sequence.

2.3.3 Screening for repeat families

This program is new and has only been tested on Alu sequences. The default score (see below) is for Alu sequences. It assumes you have the Alu library and a file of file names whose name is given by the environment variable ALUNAMES. The program works by comparing each gel reading in a batch with all the Alu sequences named in ALUNAMES. It takes as input two files of file names: one containing the names of all the readings to screen and the other ALUNAMES. Its output is two new files of file names and a log file. In addition it modifies all the reading files that are found to contain Alu so that when they are assembled the Alu containing segments are tagged and can be seen in the contig editor. One output file of file names contains the names of the reads that pass, the other those that fail. The log file lists the highest score for each reading in turn and is really for assessment of the program performance. For both output files of file names REP writes out the reading name, the top score, the top score for any part of the reading that was not covered by the best match, and the number of bases in the reading that are not found to match Alu. An example is shown in figure 4.1. The first line shows a read (h4a02.s1) with top score 0.97 and score for the other end of the reading of 0.89 and consequently 0 sequence at either end that does not contain Alu. The next line is for a reading that scores 0.89 and the match is over the whole reading. The next line is for a reading that has a score of 0.75 at one end and 0.0 at the other with 129 bases free of Alu.

1. Select program rep.

2. Define "Input file of gel reading file names".

3. Define "Input file of repeat file names". At present this file is called ALUNAMES.

4. Define "Output file of passed file names".

5. Define "Output file of failed file names".

6. Define "Log file name".

7. Define "Cutoff score". The value must be between 0 and 1.0, and at present the default value of 0.6 is judged to be optimal.

Have a cup of coffee.

h4a02.s1 0.97 0.89 0

h4a03.s1 0.89 0.00 0

h4a01.s1 0.75 0.00 129

Figure 4.1 showing a file of file names output by program rep.
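The ordering described earlier for Alu containing readings (those with most nonAlu sequence are assembled first) can be derived from a file like the one in figure 4.1. The Python sketch below is an illustration only; it assumes the four whitespace separated columns shown in the figure, and the function and field names are our own.

```python
# Illustrative sketch only: assumes the four columns shown in figure 4.1.
# Function and field names are hypothetical.

def parse_rep_line(line):
    """Split one line of a REP file of file names into named fields."""
    name, top, other, free = line.split()
    return {
        "name": name,
        "top_score": float(top),
        "other_score": float(other),
        "alu_free_bases": int(free),
    }

def order_for_assembly(lines):
    """Return reading names with the most Alu free sequence first."""
    records = [parse_rep_line(line) for line in lines if line.strip()]
    records.sort(key=lambda r: r["alu_free_bases"], reverse=True)
    return [r["name"] for r in records]

figure_4_1 = [
    "h4a02.s1 0.97 0.89 0",
    "h4a03.s1 0.89 0.00 0",
    "h4a01.s1 0.75 0.00 129",
]
print(order_for_assembly(figure_4_1))
```

Because the sort is stable, readings with equal amounts of Alu free sequence keep their original order.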

2.4 Entering readings into the project database (Assembly)

Readings are entered into the database using the auto assemble function. This function compares each reading and its complement with a consensus of all the readings already stored in the database. If it finds any overlaps it aligns the overlapping sequences by inserting padding characters, and then adds the new reading to the database. Readings that overlap are added to existing contigs and readings that do not overlap any data in the database start new contigs. If a new reading overlaps two contigs they are joined. Any readings that appear to overlap but which cannot be aligned sufficiently well are not entered and have their names written to a file of failed gel reading names. Note that it is possible that a reading may align well with two contigs (indicating a possible join) but that after it has been added to one of the contigs, the two contigs do not align sufficiently well. In this case, although the reading has been entered into the database its name will also be added to the file of failed readings. Alignments using more than the maximum number of padding characters, or exceeding the maximum mismatch may be displayed, but the readings will not be entered into the database. It is advisable to set the consensus cutoff to 1% before running the assembly routine as this will improve the alignments. A typical run of the assembly routine is shown in figure 4.2.

1. Accept "Permit entry". If entry is not permitted (see figure 4.3) the program will operate in a screening mode in which the readings are compared with the current consensus. In this mode the program will optionally screen the full length (i.e. including the poor quality hidden data) of each reading. It also saves the results in a file that is suitable for use as a file of file names. The file (see figure 4.4) includes the reading name and its length, and for the best match found, the percentage mismatch and length of overlap. This file can be sorted on the percentage mismatch value and hence, if it is subsequently used as a file of file names, the readings that match best can be entered first.

2. Select "Show all alignments". Alternatives will hide all alignments - the normal action; show only passed alignments; show only failed alignments.

3. Accept "Use file of file names"

4. Define "File of gel reading names". The name of the input file of file names, probably passed on from "Screen against vector".

5. Define "File for names of failures". A file to contain the names of the readings that the program fails to enter, or for which joins are not made.

6. Select "Perform normal shotgun assembly"

7. Accept "Permit joins"

8. Define "Minimum initial match". Only possible overlaps containing exact matches of at least this number of consecutive identical characters will be considered for alignment.

9. Define "Maximum number of pads per reading". This is the maximum number of padding characters permitted in any new reading during the alignment procedure.

10. Define "Maximum number of pads per reading in contig". This is the maximum number of padding characters permitted in the contig in order to align any new reading.

11. Define "Maximum percent mismatch after alignment"
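The role of the "Minimum initial match" can be illustrated with the following Python sketch, which looks for exact matches of a given length between a reading, its complement and a consensus. It is a simplified illustration and not the assembly code: it handles only the characters a, c, g and t (no uncertainty codes or padding), the strand numbering 1 and 2 is modelled on the transcript in figure 4.2, and the function names and sequences are hypothetical.

```python
# Illustrative sketch only: exact seed matches between a reading and a consensus.
# Handles only a, c, g, t; names and sequences are hypothetical.
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Return the complementary strand read in the opposite direction."""
    return seq.translate(COMPLEMENT)[::-1]

def initial_matches(reading, consensus, min_match):
    """Return (strand, reading_pos, consensus_pos) for each exact match.

    Positions are 1-based; strand 1 is the reading as given and
    strand 2 is its complement.
    """
    index = {}
    for j in range(len(consensus) - min_match + 1):
        index.setdefault(consensus[j:j + min_match], []).append(j)
    found = []
    for strand, seq in ((1, reading), (2, reverse_complement(reading))):
        for i in range(len(seq) - min_match + 1):
            for j in index.get(seq[i:i + min_match], ()):
                found.append((strand, i + 1, j + 1))
    return found

consensus = "ttgacctagcatcgga"
print(initial_matches("cctagcat", consensus, 8))   # matches as given
print(initial_matches("atgctagg", consensus, 8))   # matches as its complement
```

Only positions found in this way would be considered for alignment; the pad and mismatch limits in steps 9 to 11 then decide whether the reading is entered.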

Automatic sequence assembler

Database is logically consistent

? (y/n) (y) Permit entry

Select display mode

X 1 Hide all alignments

2 Show passed alignments

3 Show all alignments

4 Show only failed alignments

? Selection (1-4) (1) =3

? (y/n) (y) Use file of file names

? File of gel reading names=demo.nam

? File for names of failures=demo.fail

Select entry mode

X 1 Perform normal shotgun assembly

2 Put all sequences in one contig

3 Put all sequences in new contigs

? Selection (1-3) (1) =

? (y/n) (y) Permit joins

? Minimum initial match (12-4097) (15) =

? Maximum pads per gel (0-25) (8) =

? Maximum pads per gel in contig (0-25) (8) =

? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =

Results skipped to save space

>>>>>>>>>>>>>>>>>>>>gt;>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Processing 4 in batch

Gel reading name=hinw.009

Gel reading length= 292

Working

Contig 1 position 263 matches strand 1 at position 14

Contig 2 position 1 matches strand 1 at position 156

Total matches found 2

Trying to align with contig 1

Padding in contig= 1 and in gel= 0

Percentage mismatch after alignment = 2.9

Best alignment found

251 261 271 281

aattacagcg tt,cctattg acgggcgcat ccac

********** ** ** **** ********** ****

aattacagcg ttcccvattg acgggcgcat ccac

1 11 21 31

Trying to align with contig 2

Padding in contig= 0 and in gel= 2

Percentage mismatch after alignment = 1.4

Best alignment found

1 11 21 31 41 51

tgcacgacat cgagtatgag agttatatcc cgggcgcgct ctgcttgtac atggacctca

********** ********** ********** ********** ********** **********

tgcacgacat cgagtatgag agttatatcc cgggcgcgct ctgcttgtac atggacctca

156 166 176 186 196 206

61 71 81 91 101 111

tgtacctctt tgtctccgtg ctctacttca tgccctccga gcccggcagc gcccacactg

********** ********** ********** ********** ***** ** * **********

tgtacctctt tgtctccgtg ctctacttca tgccctccga gcccg,ca,c gcccacactg

216 226 236 246 256 266

121 131

ctcagacgac ggtcgctgc

********** *********

ctcagacgac ggtcgctgc

276 286

Overlap between contigs 2 and 1

Length of overlap between the contigs= -122

Entering the new gel reading into contig 1

This gel reading has been given the number 4

Working

Trying to align the two contigs

Padding in contig= 2 and in gel= 0

Percentage mismatch after alignment = 1.5

Best alignment found

406 416 426 436 446 456

tgcacgacat cgagtatgag agttatatcc cgggcgcgct ctgcttgtac atggacctca

********** ********** ********** ********** ********** **********

tgcacgacat cgagtatgag agttatatcc cgggcgcgct ctgcttgtac atggacctca

1 11 21 31 41 51

466 476 486 496 506 516

tgtacctctt tgtctccgtg ctctacttca tgccctccga gcccg,ca,c gcccacactg

********** ********** ********** ********** ***** ** * **********

tgtacctctt tgtctccgtg ctctacttca tgccctccga gcccggcagc gcccacactg

61 71 81 91 101 111

526 536

ctcagacgac ggtcgct

********** *******

ctcagacgac ggtcgct

121 131

Editing contig 1

Completing the join between contigs 1 and 2

(Results for other readings skipped to save space)

Batch finished

9 sequences processed

9 sequences entered into database

2 joins made

Figure 4.2 Part of a typical run of "Auto assemble".

Automatic sequence assembler

Database is logically consistent

? Permit entry (y/n) (y) = n

Select display mode

X 1 Hide all alignments

2 Show passed alignments

3 Show all alignments

4 Show only failed alignments

? Selection (1-4) (1) =

? Use file of file names (y/n) (y) =

? File of gel reading names=10

? Save alignment scores in a file (y/n) (y) =

? File for names and scores=top.10

? Use poor data (y/n) (y) =

Select entry mode

X 1 Perform normal shotgun assembly

2 Put all sequences in one contig

3 Put all sequences in new contigs

? Selection (1-3) (1) =

? Permit joins (y/n) (y) =

? Minimum initial match (14-4097) (15) =

? Maximum pads per gel (0-25) (8) =

? Maximum pads per gel in contig (0-25) (8) =

? Maximum percent mismatch after alignment (0.00-100.00) (8.00) =

Working

>>>>>>>>>>>>>>>>>>>>gt;>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Processing 1 in batch

Gel reading name=c38e11.s1

Gel reading length= 401

Working

Contig 47 position 369 matches strand 2 at position 24

Total matches found 1

Trying to align with contig 47

Percent mismatch= 1.5, pads in contig= 3, pads in gel= 4

>>>>>>>>>>>>>>>>>>>>gt;>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

.

.

.

Processing 9 in batch

Gel reading name=c53b11.s1

Gel reading length= 337

Working

Contig 47 position 1380 matches strand 1 at position 4

Total matches found 1

Trying to align with contig 47

Percent mismatch= 3.8, pads in contig= 4, pads in gel= 7

Batch finished

9 sequences processed

0 sequences entered into database

0 joins made

0 joins failed

Figure 4.3 An example of screening full length (i.e. including hidden data) readings. (Note that readings 2 to 8 have been skipped to save space.)

c38e11.s1 1.5 401 40

c38e5.s1 0.0 375 0

c38g11.s1 1.3 302 30

c53a11.s1 0.0 369 0

c53a5.s1 3.0 456 64

c53a6.s1 2.4 333 35

c53b1.s1 0.6 448 59

c53b10.s1 3.8 317 217

c53b11.s1 3.8 337 244

Figure 4.4 A typical file produced by screening a batch of readings. It includes reading names, percentage mismatch, length and length of overlap.
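
A file of this kind is easy to post-process outside the program. The following is a minimal sketch, assuming the four fields are separated by white space exactly as in figure 4.4; the names `ScreenResult`, `parse_score_line` and `parse_scores` are hypothetical and are not part of the package.

```python
from typing import NamedTuple


class ScreenResult(NamedTuple):
    name: str                # gel reading name
    percent_mismatch: float  # percentage mismatch after alignment
    length: int              # reading length
    overlap: int             # length of overlap


def parse_score_line(line: str) -> ScreenResult:
    """Parse one white-space separated line of the names-and-scores file."""
    name, mismatch, length, overlap = line.split()
    return ScreenResult(name, float(mismatch), int(length), int(overlap))


def parse_scores(text: str) -> list[ScreenResult]:
    """Parse every non-blank line of the file contents."""
    return [parse_score_line(ln) for ln in text.splitlines() if ln.strip()]


# Three of the lines shown in figure 4.4.
sample = """\
c38e11.s1 1.5 401 40
c53b10.s1 3.8 317 217
c53b11.s1 3.8 337 244
"""
results = parse_scores(sample)

# Readings with the longest overlap first.
by_overlap = sorted(results, key=lambda r: r.overlap, reverse=True)
```

Sorting on `overlap` or `percent_mismatch` gives a quick way to decide which readings to look at first.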

2.5 Searching for internal joins

The purpose of this function is to use data already in the database to find possible joins between contigs. Although most joins will be made automatically during assembly, due to poor alignments, some may not have been done. The function is particularly useful for sequences from fluorescent sequencing machines because it may be possible to find potential joins within the unused data from the 3' ends of readings. For each potential join found, when the X version is used, the contig joining editor is automatically called up with the two contigs aligned in the edit windows.

The program strategy is as follows. First sort the contigs so that the shortest is first in the list. Then take the first contig and calculate its consensus. If hidden data is being employed, examine all readings that are in the complementary orientation and sufficiently near to the contig's left end, to see if they have sufficiently good hidden sequence which would protrude from the left end of the contig. If any are found, add the longest such sequence to the left end of the consensus. Do the same for the right end by examining readings that are in their original orientation. Repeat the consensus calculations and extensions for all contigs, hence producing an extended consensus for the whole database. If hidden data is not being employed, simply calculate the consensus for the whole database.

Now look for possible joins by processing the extended consensus in the following way. Take the last, say 500, bases (termed the "probe length" by the program) of the rightmost consensus and compare it in both orientations with the extended consensus of all the other contigs. Display any sufficiently good alignments. Repeat with the left end of the rightmost contig. Do the same for the ends of all the contigs, always comparing only with the contigs to their left, so that the same matches do not appear twice.

Good hidden data is defined by sliding a window of "Window size for good data scan" bases outwards along the sequence and stopping when more than "Maximum number of dashes in scan window" dashes appear in the window. Note that it is advisable to have some sort of cutoff, because if we simply take all the data it might be of such poor quality that we won't find any good matches. An initial run employing no hidden data is also recommended. Sufficiently good alignments are defined by criteria equivalent to those used in auto assemble; however, here we only display alignments that pass all tests.
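
The window scan can be illustrated with a short sketch. It is not the program's own code: the function name `good_hidden_length` is hypothetical, the hidden sequence is assumed to be written outwards from the used part of the reading, and it is an assumption that the bases kept are those before the start of the first window that fails the test.

```python
def good_hidden_length(hidden: str, window: int, max_dashes: int) -> int:
    """Return how many leading bases of `hidden` count as good data.

    A window of `window` bases slides outwards one base at a time. The
    scan stops at the first window containing more than `max_dashes`
    dashes. Assumption: the bases before the start of that window are kept.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # If the hidden data is shorter than the window, scan it as one window.
    last_start = max(len(hidden) - window, 0)
    for start in range(last_start + 1):
        if hidden[start:start + window].count("-") > max_dashes:
            return start
    return len(hidden)


# A window of 4 bases allowing at most 1 dash.
clean = good_hidden_length("ACGTACGT", 4, 1)     # no window fails
clipped = good_hidden_length("ACGT--GT", 4, 1)   # window "GT--" fails
```

A smaller window or a lower dash limit makes the scan stop earlier, so less of the hidden data is added to the consensus.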

All numbering is relative to base number one in the contig: matches to the left (i.e. in the unused data) have negative positions, matches off the right end of the contig (i.e. in the hidden data) have positions greater than the contig length. The convention for reporting the orientations of overlaps is as follows: if neither contig needs to be complemented the positions are as shown. If the program says "contig x in the - sense" then the positions shown assume contig x has been complemented. For example in the results given in figure 4.5 the positions for the first overlap are as reported, but those for the second assume that the contig in the minus sense (i.e. 443) has been complemented.

1. Select "Find internal joins".

2. Define "Minimum initial match". Only matches containing this number of consecutive identical characters will be found.

3. Define "Maximum pads per sequence". Only alignments containing no more than this number of padding characters in each sequence will be found.

4. Define "Maximum percent mismatch after alignment". Only alignments with no more than this percentage of mismatch will be found. Particularly when poor data from the 3' ends of sequences derived from fluorescent sequencing machines is used, it is important to allow for a high degree of mismatch - say around 75%.

5. Define "Probe length". This is the size of sequence from each end of each contig, that is compared with the total length of all other contigs.

6. Accept "Employ hidden data". This means, where available, add the hidden data from the 3' ends of sequences, to the ends of the contigs.

7. Define "Window size for good data scan". To decide how much of the hidden data should be added to the end of a contig the program scans outwards, counting the numbers of dashes (-) over a window of the size defined here.

8. Define "Number of dashes in scan window". If the program finds this many dashes in the scan window it will add no more of the hidden data to the end of the contig.

Possible join between contig 445 in the + sense and contig 405

Percentage mismatch after alignment = 4.9

412 422 432 442 452 462

405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA

********* * ******** ***** *** ********** ********** **********

445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA

-127 -117 -107 -97 -87 -77

472 482 492 502 512

405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT

********** ********** ********** ********** **

445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT

-67 -57 -47 -37 -27

Possible join between contig 443 in the - sense and contig 423

Percentage mismatch after alignment = 10.4

64 74 84 94 104 114

423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG

**** ***** ********** ********** ****** ** ***** **** *********

443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,

3610 3620 3630 3640 3650 3660

124 134 144 154 164

423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA

*** ****** ********** ** ******* *** ***** ** **

443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-

3670 3680 3690 3700 3710

Figure 4.5 Typical output from "Find internal joins".
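
When many possible joins are reported it can help to collect them in one list, for example to look at the best ones first. The sketch below assumes the two report lines have exactly the wording shown in figure 4.5; `PossibleJoin` and `parse_joins` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class PossibleJoin(NamedTuple):
    contig: int              # contig whose sense is reported
    sense: str               # "+" or "-"; "-" means positions assume it is complemented
    other_contig: int        # the contig it may join to
    percent_mismatch: float  # percentage mismatch after alignment


JOIN_RE = re.compile(
    r"Possible join between contig (\d+) in the ([+-]) sense and contig (\d+)"
)
MISMATCH_RE = re.compile(r"Percentage mismatch after alignment\s*=\s*([\d.]+)")


def parse_joins(report: str) -> list[PossibleJoin]:
    """Pair each "Possible join" line with the mismatch line that follows it."""
    joins = []
    pending = None
    for line in report.splitlines():
        m = JOIN_RE.search(line)
        if m:
            pending = (int(m.group(1)), m.group(2), int(m.group(3)))
            continue
        m = MISMATCH_RE.search(line)
        if m and pending is not None:
            joins.append(PossibleJoin(*pending, float(m.group(1))))
            pending = None
    return joins


# The two header pairs from figure 4.5 (alignments omitted).
report = """\
Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9
Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4
"""
joins = sorted(parse_joins(report), key=lambda j: j.percent_mismatch)
```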

2.6 Editing in XBAP

The XBAP editor is mouse-driven and can insert, delete and change readings in contigs. It has facilities to display the traces for data from fluorescent sequencing machines and for annotation of readings. In addition it allows the poor quality data from the ends of readings to be viewed and, if required, added to the sequences.

A typical view of the editor is shown in figure 4.6. This includes the edit window showing an 80 character section of a contig, (position 3899 to 3978). Each reading is numbered and named in the left hand panel, minus signs indicating those in their reverse orientation. Underneath is their consensus. Some of the sequence letters are lighter than the majority showing that they are "unused". One segment (3933 to 3949) is shaded which signifies that it has been annotated. The editing cursor is at position 3921. Above this window are the main buttons the user employs to direct the editing process. Below the edit window is a panel showing the traces for readings 37 and 123. Notice they are centred on the cursor position. Here the traces are shown in four different line styles, but on a colour screen they each have different colours. In the bottom of the figure is the search window. These features are described in the relevant sections below.

2.6.1 Scrolling through the contig
The editor allows scrolling from one end of a contig to the other using the scroll bar and scroll buttons and also the arrow keys.

Action of mouse button presses when the mouse pointer is in the scroll bar:

Middle Mouse Button Set editor position

Left Mouse Button Scroll forward one screenful

Right Mouse Button Scroll backwards one screenful

Figure 4.6 A typical display from the contig editor in XBAP.

The four scroll buttons operate as follows:

"<<" Scroll left half a screenful

"<" Scroll left one character

">" Scroll right one character

">>" Scroll right half a screenful

The Editor cursor can be positioned anywhere in the edit window by moving the mouse pointer over the character of interest, then pressing the left mouse button. The Editor cursor can also be moved by using the direction arrow keys.

2.6.2 Editing operations
The editor operates in two main edit modes - Replace and Insert. Replace allows a character to be replaced by another. Insert allows characters to be inserted into a reading. Characters are entered by typing them from the keyboard. Only valid characters are permitted. Characters can be deleted by positioning the cursor one character to their right, then pressing the delete key. Normally Insert and Delete apply to the consensus line of the contig only. This restraint can be overridden by using the "Super Edit" mode of operation, though it should be employed with caution as misuse may corrupt alignments.

Edits can also be performed on the consensus, though they are restricted to insertion and deletion of padding characters ("*"). These edits also have special meanings. A deletion will delete all characters at the position to the left of the cursor in the contig, and move the relative positions of all sequences starting to the right of the cursor position left one character. An insertion will insert the character typed ("*") into all gel reading sequences at the cursor's position in the contig, and move the relative positions of all sequences starting to the right of the cursor position right one character.

2.6.3 Use of buttons
The effect of the last edit can be undone by pressing the "Undo" button at the top of the editor window. Pressing it n times will undo the last n edits.

The cursor will automatically be positioned at the next problem when the "Find Next Problem" button is selected. The next problem is where the consensus shows either a disagreement ("-") or a pad ("*") character.

The edits to the contig can be saved by pressing the "Leave Editor" button and replying "Yes" to the prompt to "Save changes?".

As no changes are made to the working copy of the database until this point it is possible to abort the editor if the edit session ends up in an unsatisfactory state.

2.6.4 Displaying traces for readings from fluorescent sequencing machines
The original trace data from which the gel reading sequences were derived can be seen by double clicking (two quick clicks) with the middle mouse button on the area of interest. The trace will be displayed with the point clicked at the centre of the trace viewport. All traces that are displayed are maintained in one window, which will display a maximum of four traces. When four traces are already being displayed and a new one is requested, the one at the top of the window is removed and the new one is added to the bottom. Traces can be removed individually by using the "quit" button in the panel next to the trace.
2.6.5 Extending reads with the hidden data
Sequence data from fluorescent sequencing machines is normally clipped to remove the primer region and the poor quality data from the 3' end is marked to be ignored during assembly. Only the sequence used during assembly is made visible in the XBAP editor. However the unused data is copied into the database and can be viewed from within the editor. Also the position of this "cutoff" can be altered. To display the unused sequences, press the "Display Cutoff" button at the top of the editor window. The cutoff sequence appears in grey. This sequence can be incorporated into the editable sequence, by moving the cutoff position. This is done by positioning the cursor at the end of the sequence, and using Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff. The Meta key is a diamond on the Sun keyboard. As an alternative to Meta the control key (CTRL) can be used.
2.6.6 Using the pop-up menu
A pop-up menu is revealed by depressing the "Control" key on the keyboard and at the same time pressing the left mouse button.

The menu has the following functions:

Find Next Problem

Highlight Disagreements

Save Contig

Create Tag

Edit Tag

Delete Tag

Search

Select Oligo

"Find Next Problem" and "Save Contig" are described above. Operations on tags are described in the section on annotation below, and then searching is outlined.

2.6.7 Annotating readings
Parts of a sequence can be annotated to record the positions of primers used for walking, or to mark sites, such as compressions, that have caused problems during sequencing. The annotations are termed "tags". Each tag has a type such as "primer", a position, a length and a comment. Each type has an associated colour that will be shown on the display. First the segment to tag is selected, then it is annotated. The consensus sequence cannot be annotated.
2.6.8 Creating a new annotation
Use the left mouse button to position the start of the selection. While this button is being held down, move the mouse to the other end of the segment. The selection can be extended further using the right mouse button. To create the annotation, invoke the pop-up menu, and select the "Create Tag" function. A small "tag editor" will appear which allows users to select the type of the annotation from a pull-down menu, and specify a comment if desired. To select a new type pull down the Type menu, and select the entry desired. To enter a comment, simply type into the text window in the tag editor. The annotation is created when the "Leave" button on the tag editor is pressed, and is displayed in the colour defined in the tag database file (TAGDB).
2.6.9 Editing an existing annotation
Position the cursor with the left mouse button on the tag, and select the "Edit Tag" off the pop-up menu. This invokes the tag editor, and changes to the type and comment of the annotation can be made. The tag is updated when the "Leave" button is pressed.
2.6.10 Deleting an annotation
To delete an existing annotation, position the cursor with the left mouse button on the tag, and select the "Delete Tag" off the pop-up menu.
2.6.11 Searching
Selecting "Search" brings up a window which can remain present during normal editor operation. The window allows the user to select the direction of search, the type of search and a value to search on. The value is entered into a value text window, then pressing the "search" button performs the search. If successful, the cursor is positioned accordingly. An audible tone indicates failure. Pressing the "ok" button removes the search window. The search window is automatically removed when the contig editor is exited. There are seven different search modes.
2.6.11.1 Search by position
This positions the cursor at the numeric position specified in the value text window. Eg a value of "1234" causes the cursor to be placed at base number 1234 in the contig. Positioning within a reading is achieved by prefixing the number with the "@" character, eg "@123" positions the cursor at base 123 of the sequence in which the cursor lies. Relative positions can be specified by prefixing the number with a plus or minus character. Eg "+1234" will advance the cursor 1234 bases. If possible, the cursor is positioned within the same sequence. The direction buttons have no effect on the operation of "search by position".
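
The three forms of value can be summarised by a small sketch. It only classifies the value string and is not the editor's own code; the function name `parse_position_value` and the labels it returns are hypothetical.

```python
def parse_position_value(value: str) -> tuple[str, int]:
    """Classify a "search by position" value.

    Returns ("contig", n) for an absolute position in the contig,
    ("reading", n) for a position within the current reading ("@" prefix),
    or ("relative", n) for a signed offset from the cursor ("+" or "-").
    """
    value = value.strip()
    if value.startswith("@"):
        return ("reading", int(value[1:]))
    if value[:1] in ("+", "-"):
        return ("relative", int(value))
    return ("contig", int(value))


absolute = parse_position_value("1234")
in_reading = parse_position_value("@123")
forward = parse_position_value("+1234")
backward = parse_position_value("-20")
```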
2.6.11.2 Search by reading name
This positions the cursor at the left end of the gel reading specified in the value text window. If the value is prefixed with a slash it is assumed to be a gel reading name. Otherwise it is assumed to be a gel reading number. Eg "123" positions the cursor at the left end of gel reading number 123. "/a16a12.s1" positions at the start of reading a16a12.s1. If the value was "/a16" the cursor is positioned at the first reading which starts with "a16". The direction buttons have no effect on the operation of "search by reading name".
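
As an illustration of the rule, the sketch below resolves a value to a gel reading number. It assumes that "first" means lowest reading number and that the names are held in a simple list; both are assumptions, and `find_reading` is a hypothetical name.

```python
from typing import Optional


def find_reading(value: str, names: list[str]) -> Optional[int]:
    """Return the gel reading number selected by the value, or None.

    `names[i]` is the name of gel reading number i + 1 (hypothetical layout).
    A leading slash means the value is a reading name or name prefix;
    otherwise it is taken to be a reading number.
    """
    if value.startswith("/"):
        prefix = value[1:]
        for number, name in enumerate(names, start=1):
            if name.startswith(prefix):
                return number
        return None
    number = int(value)
    return number if 1 <= number <= len(names) else None


names = ["c38e11.s1", "a16a9.s1", "a16a12.s1"]
by_full_name = find_reading("/a16a12.s1", names)
by_prefix = find_reading("/a16", names)
by_number = find_reading("2", names)
```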
2.6.11.3 Search by tag type
This positions the cursor at the start of the next tag which has the same type as specified by the type value menu. To change the type, select from the menu that pops up when the mouse is clicked on the button labeled "Type:". The search can be performed either forwards or backwards from the current cursor position. To find all tags, use "search by annotation", with a null text value string.
2.6.11.4 Search by annotation
This positions the cursor at the start of the next tag which has a comment containing the string specified in the value text window. The search performed is a regular expression search, and certain characters have special meanings. Be careful when your value string contains ".", "*", "[", "^" or "$". The search can be performed either forwards or backwards from the current cursor position.
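
The effect of the special characters can be seen in the following sketch. The manual does not say which regular expression syntax the editor uses, so Python's `re` module is used here purely as a stand-in; the comments and the helper `matching_comments` are invented for the illustration.

```python
import re


def matching_comments(pattern: str, comments: list[str]) -> list[str]:
    """Return the comments in which the regular expression `pattern` is found."""
    regex = re.compile(pattern)
    return [c for c in comments if regex.search(c)]


comments = ["primer for walk [2]", "compression at 3.5 kb", "primer 2"]

# Unescaped, "[2]" is a character class that matches any "2".
loose = matching_comments("[2]", comments)

# Escaped, it matches only the literal text "[2]".
exact = matching_comments(re.escape("[2]"), comments)
```

In the editor the equivalent of escaping would be to precede each special character with whatever quoting character its regular expression syntax provides.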
2.6.11.5 Search by sequence
This positions the cursor at the start of the next piece of sequence that matches the value specified in the text value window. The search is for an exact match, which means that the case of the value string is important. The search is performed on the gel readings themselves, rather than the consensus sequence. The search can be performed either forwards or backwards from the current cursor position.
2.6.11.6 Search by problem
This positions the cursor at the next place in the consensus sequence which is not "A", "C", "G" or "T". The search can be performed either forwards or backwards from the current cursor position.
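
A minimal sketch of such a search is given below. It assumes positions are numbered from one as elsewhere in this chapter, that the search starts at the base next to the cursor, and that upper and lower case letters are treated alike; `next_problem` is a hypothetical name.

```python
from typing import Optional


def next_problem(consensus: str, cursor: int, forward: bool = True) -> Optional[int]:
    """Return the position of the next consensus problem, or None.

    A problem is any consensus character other than A, C, G or T,
    for example a disagreement ("-") or a pad ("*").
    """
    step = 1 if forward else -1
    pos = cursor + step
    while 1 <= pos <= len(consensus):
        if consensus[pos - 1].upper() not in "ACGT":
            return pos
        pos += step
    return None


consensus = "ACG-TTA*CA"
first = next_problem(consensus, 1)                 # the "-" at position 4
second = next_problem(consensus, 4)                # the "*" at position 8
back = next_problem(consensus, 8, forward=False)   # back to position 4
```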

2.6.11.7 Search by quality
This positions the cursor at the next place in the consensus sequence where the consensus for each strand is not "A", "C", "G" or "T" or where the two strands disagree. The search can be performed either forwards or backwards from the current cursor position.


2.7 Joining contigs interactively using XBAP

The operation of the join editor in XBAP is very similar to the one for single contigs described above. It allows the user to align the ends of the two contigs by editing each contig separately. First specify which two contigs are to be joined. The program checks that the two contig numbers are different (it will not allow circles to be formed!). The Join Editor consists of two Contig Editors in between which is sandwiched a disagreement box. This disagreement box uses exclamation marks to denote mismatches between the two consensuses. A typical example is shown in figure 4.7. Here we see in the top window the right end of one contig and in the bottom window the left end of another. The left end of the overlap is correctly aligned, as indicated by an absence of exclamation marks, but the top contig has an extra character at position 558 which is spoiling the alignment over the next segment. Notice that the "lock" button is highlighted denoting that the user has asked for the two contigs to scroll together.

The best strategy for joining is to align the leftmost character of the right contig with its counterpart in the left contig. Then press the "Lock" button before editing the contigs to make them align for the whole overlap. The overlap must be of at least one character. Use the scroll bar and the scroll buttons ("<<", "<", ">", and ">>") for positioning the relative positions of the two contigs. The join position can be fixed by pressing the "lock" button at the top of the Join Editor. Locking allows the two contigs to be scrolled as one when using the scroll bar and buttons, the left ends always in the same position relative to each other. Once locked, it is best to proceed to the right along the contigs, inserting padding characters ("*") into the consensuses to minimise the disagreements. It is important that the user aligns the two contigs throughout the whole region of overlap before completing the join because it is only at this stage that the two contigs can be edited independently. If a join is completed leaving a region of mismatch the consensus will consist of dashes and the assembly function will fail to find overlaps in the bad section. Misaligned sections can be corrected using the "super edit" mode of the editor. The join can be completed by pressing the "Leave Editor" button. The percentage mismatch is displayed, and users are required to confirm that they want to perform the join.

Figure 4.7 A typical display from the join editor in XBAP.

2.8 Selecting primers and templates

Primers and templates can be selected automatically by the program finding single stranded regions, or from the contig editor.

2.8.1 Selecting primers and templates interactively

1. Select "Edit contig". The primer and template selection function is available from the popup menu of the contig editor.

2. Open the oligo selection window by selecting "Select Oligo" from the contig editor popup menu.

3. Position the cursor to where you want the oligo to be chosen. While the oligo selection window is visible, you will still have complete control over positioning and editing within the contig editor.

4. Indicate the strand for which you require an oligo. This is done by toggling the direction arrow ("----->" or "<------"), if necessary.

5. Press the "Find Oligos" button to find all suitable oligos (See "Oligo selection" in Note 17.) Information for the closest oligo to the cursor position is given in the output text window. In the contig editor the position of the oligo is marked by a temporary tag on the consensus. The window is recentered if the oligo is off the screen. Selecting "Display Selection Information" will print a short report on the numbers of oligos considered and rejected during oligo selection.

6. If this oligo is not suitable (it may have been previously chosen, and found to be unsuitable by experimentation, say), the next closest oligo can be viewed by pressing "Select Next".

7. Suitable templates are automatically identified for the currently displayed oligo (See "Template selection" in Note 18.) By default, the template is that closest to the oligo site. If the choice is not suitable (it may be known to be a poor quality template, say) another can be chosen from the "Choose Template for this Oligo" menu. Templates that do not appear on the menu can be specified by selecting "other". However, the template must be on the correct strand and be upstream of the oligo.

8. A tag can be created for the current oligo by pressing the button "Create a tag for this oligo". The annotation for this tag holds the name of the template and the oligo primer sequence. There are fields to allow the user to specify their own primer name ("serial#") and comments ("flags") for this tag. An example of oligo tag annotation:

serial#=

template=a16a9.s1

sequence=CGTTATGACCTATATTTTGTATG

flags=

9. The oligo selection window is closed when "Create a tag for this oligo" or "Quit" is selected.

2.8.2 Automated oligo selection

The purpose of this function is to suggest custom primer experiments that would help to "double strand" regions of a contig. As contig ends are usually single stranded the routine will also help to extend contigs (or walk off them). The routine finds regions of contigs with data for only one strand and selects suitable templates and primers. This information is written to a file (default name "primers").

The file generated contains the gel reading name, the primer sequence, its offset in the contig and the orientation. An example file is shown in figure 4.8.

c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +

c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +

c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +

c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +

c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +

c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +

c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +

c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +

c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +

c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +

c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -

c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -

c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -

c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -

c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -

c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -

c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -

Figure 4.8 Typical output from the auto-select oligos routine.
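
The primers file can be read by other software, for example to prepare an order for oligos. The following sketch assumes each line has the layout shown in figure 4.8 (the exact spacing in a real file may differ); `Primer`, `PRIMER_RE` and `parse_primer_line` are hypothetical names.

```python
import re
from typing import NamedTuple


class Primer(NamedTuple):
    reading: str   # gel reading (template) name
    sequence: str  # primer sequence
    offset: int    # offset in the contig
    strand: str    # orientation, "+" or "-"


# name, sequence, "(@ offset )", orientation; white space is flexible.
PRIMER_RE = re.compile(r"^(\S+)\s+([ACGT]+)\s+\(@\s*(\d+)\s*\)\s*([+-])$")


def parse_primer_line(line: str) -> Primer:
    """Parse one line of the primers file."""
    m = PRIMER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised primer line: {line!r}")
    return Primer(m.group(1), m.group(2), int(m.group(3)), m.group(4))


first = parse_primer_line("c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +")
reverse = parse_primer_line("c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -")
```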

The routine is best employed after having previously used the "Double strand" option. When selecting the option you will be asked for the contig, a region within this contig and the file to write the list of primers to. For each primer suggested a tag is automatically created containing details of the gel reading name and the sequence. The program will try to place the tag on the gel reading from which the primer was selected. However this is not always possible so failing that the tag will be on another sequence overlapping the primer position.

1. Select "Auto-select oligos".

2. Define "Contig identifier", which is the contig to be processed.

3. Define "Start position in contig", which is the left end of the region to process.

4. Define "End position in contig", which is the right end of the region to be processed.

5. Define "Name of file for primers", which is the file the results will be written to.

6. Define "Start of oligo choice region", the value of which defines how close to the single stranded region oligos can be selected.

7. Define "End of oligo choice region", which defines the furthest distance that oligos can be from the single stranded region.


2.9 Examining the "quality" of a contig

This function reports on the proportion of the consensus that is "well determined" and will display a sequence of symbols that indicate the quality of the consensus at each position or produce a graphical display. Each strand of the contig is analysed separately using the consensus algorithm, and a position is declared "well determined" if it is assigned one of the symbols a,c,g,t. The current consensus calculation cutoff score is used.

A summary showing the percentage of the consensus that falls into each category of quality is shown. The analysis divides the data into five categories, assigning each a code as shown in figure 4.9. Code 0 means well determined on both strands and they agree, 1 means well determined on the plus strand only, 2 means well determined on the minus strand only, 3 means not well determined on either strand and 4 means well determined on both strands but they disagree.

If the user chooses to have the data displayed graphically the following scheme is used. A rectangular box is drawn so that the x coordinate represents the length of the contig. The box is notionally divided vertically into 5 possible levels which are given the y values: -2,-1,0,1,2. The quality codes assigned to each base position are plotted as rectangles. Each rectangle represents a region in which the quality codes are identical, so a single base having a different code from its immediate neighbours will appear as a very narrow rectangle. Obviously a single line at the mid-height shows a perfect sequence. In figure 4.10 we show the result for the section of contig shown in figure 4.11.

Strands OK           Quality code   Y coordinates

+ - and the same     0               0 to 0
+                    1               0 to 1
-                    2              -1 to 0
neither              3              -1 to 1
+ - but different    4              -2 to 2

Figure 4.9 The codes and coordinates used by the "Quality plot".
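
The coding scheme can be expressed as a short sketch. It assumes that the consensus symbol for each strand is available as a single character and that the minus strand symbol is already written in the same sense as the plus strand; `quality_code` and `Y_RANGE` are hypothetical names, not part of the program.

```python
# Symbols that count as "well determined".
WELL_DETERMINED = frozenset("acgt")

# Y coordinates plotted for each quality code (figure 4.9).
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (-1, 0), 3: (-1, 1), 4: (-2, 2)}


def quality_code(plus: str, minus: str) -> int:
    """Return the quality code (0 to 4) for one consensus position."""
    plus_ok = plus.lower() in WELL_DETERMINED
    minus_ok = minus.lower() in WELL_DETERMINED
    if plus_ok and minus_ok:
        # Both strands are good: code 0 if they agree, 4 if they differ.
        return 0 if plus.lower() == minus.lower() else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


codes = [quality_code(p, m) for p, m in zip("aa--a", "a-g-c")]
```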

94.67 % OK on both strands and they agree(0)

0.67 % OK on plus strand only(1)

2.00 % OK on minus strand only(2)

2.67 % Bad on both strands(3)

0.00 % OK on both strands but they disagree(4)

3310 3320 3330 3340 3350

0000000000 0000000000 0000000000 0000000000 0000000000

3360 3370 3380 3390 3400

0020000000 0000000032 0000032000 0000000000 0300000030

3410 3420 3430 3440 3450

0000000000 0010000000 0000000000 0000000000 0000000000

Figure 4.10 Listed output from "Examine Quality" showing the results for the section of contig displayed in figure 4.11.
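
The percentages in the summary follow directly from the listed codes. As a check, the sketch below recomputes them from the three rows of codes in figure 4.10; `quality_summary` is a hypothetical name and the rounding to two decimal places is an assumption based on the listing.

```python
from collections import Counter


def quality_summary(codes: str) -> dict[int, float]:
    """Return the percentage of positions with each quality code (0 to 4).

    White space between blocks of codes is ignored.
    """
    digits = [int(ch) for ch in codes if not ch.isspace()]
    if not digits or any(d > 4 for d in digits):
        raise ValueError("expected a non-empty string of codes 0 to 4")
    counts = Counter(digits)
    return {code: round(100.0 * counts[code] / len(digits), 2) for code in range(5)}


# The code rows of figure 4.10 (the position numbering lines are left out).
listing = """
0000000000 0000000000 0000000000 0000000000 0000000000
0020000000 0000000032 0000032000 0000000000 0300000030
0000000000 0010000000 0000000000 0000000000 0000000000
"""
summary = quality_summary(listing)
```

For these 150 positions the result agrees with the summary lines of the figure: 94.67, 0.67, 2.00, 2.67 and 0.00 percent for codes 0 to 4.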

2.10 Using graphical displays to examine contigs

The programs contain three graphical displays to aid the examination of contigs. The first simply gives an overview of all the contigs in the database and provides, with the use of a crosshair, a mechanism for the other two displays to select contigs. One of these displays produces a schematic representation of each of the readings in a contig. The lines in the display show the relative positions of each reading and also their sense. The plot is divided vertically into two sections by a line that is identified by an asterisk drawn at each end. All lines that lie above this line represent readings that are in their original sense, all lines below show readings that are in the complementary sense. The final graphical display is of the "quality" of the data as described above.

When these graphical displays are visible users may employ a crosshair, moved by mouse or keyboard commands, to examine the data in more detail. The crosshair is positioned and when keyboard characters S, Q, N, I or Z are typed the program will show the local aligned sequences in a text window, produce the quality plot, give the names of the nearest readings, identify the nearest reading or zoom into the display.

A typical display of all three plots is shown in figure 4.12. The top rectangle shows a separate line for each of the project's contigs. The righthand one is bisected by a vertical line indicating that it has been selected by the user. The next rectangle below is divided by a horizontal line marked at each end by an asterisk. Each of the other horizontal lines in the box represents one of the selected contig's gel readings. Those above the dividing line are in their original orientation, those below have been complemented. The box below is also divided by a horizontal line and shows the "quality" for each base in the contig. Rectangular areas marked above the central line show sections that only have a good consensus on the plus strand, and rectangles below show good sections from the other strand, in agreement with the coordinates given in figure 4.9. Places where the vertical lines reach the top and bottom of the box show disagreements between the two strands. Places with only the midline have a good consensus on both strands.

Figure 4.12 A typical graphical display from XBAP or SAP.

2.11 Check assembly

This new function is used for checking the positioning of assembled readings by examining the quality of the alignment between their hidden data and the consensus they overlap. It is useful for checking sequences that contain repeats of length similar to that of a single gel reading. It takes the poor quality (hidden) data for each reading and compares it to the segment of the consensus to which it should align. If the extension of the read does not match the consensus then the read (or its neighbours) has possibly been assembled into the wrong place. The program displays the bad alignments. The quality of an alignment is defined by the percentage mismatch. Naturally the user should select a value that takes into account the poor quality of the data being aligned.

When the routine is used from the X version the user is offered the editor to examine poor alignments. If alignments are reported as poor, but on inspection are OK, the user can set a tag so that the poor quality data is ignored on subsequent passes through the routine. Note however that such data will then also be ignored by the automatic double stranding routine!

The user defines the percentage mismatch, and the window size and number of dashes allowed in the window used for selecting the amount of the hidden data to be employed. The user can also choose to save the names of the poorly aligned reads in a file, and can select an individual contig or scan the whole database. The file containing the names of the poorly aligned reads can be used by the disassembly routine to remove them from the database, and can then be used to reassemble them. Note that the current alignment routine is not very good at aligning very poor data and so some bad alignments displayed will be due to the shortcomings of the program and not to the data.

1. Select "Check assembly".

2. Define "Maximum percent mismatch after alignment". Any alignments with a higher value will be displayed.

3. Define "Window size for good data scan", which is the length of sequence the routine will use when selecting the amount to extend into the hidden data. A "window" of this size is moved along the hidden data until either the end is reached or a segment is found in which more than "Maximum number of dashes in scan window" is found. Only the sequence up to this point is used for comparison with the consensus.

4. Define "Maximum number of dashes in scan window". See above.

5. Accept "Save failed names in a file". This file can be used with the disassembly routine to move the readings out of the contig.

6. Define "File name for failed readings".

7. Reject "Select contigs". The routine will search all contigs in turn. Alternatively an individual contig can be selected.

Searching will commence and the poor alignments will be displayed. For the X version of the program the contig editor can be used to check and, if necessary, tag the data. A typical run of the nonX version is shown in figure 4.13.

Check assembly

Database is logically consistent

? Maximum percent mismatch after alignment (0.00-100.00) (13.00) =

? Window size for good data scan (1-1024) (100) =

? Maximum number of dashes in scan window (1-100) (5) =

? Save failed names in a file (y/n) (y) = n

? Select contigs (y/n) (y) =

Default Contig identifier=/c53b9.s1

? Contig identifier=143

? Start position in contig (1-3057) (1) =

? End position in contig (1-3057) (3057) =

Working

Percentage mismatch 17.4, Pads 0 1

446 456 466

C 143 CCAATGGGTG GTCC*ACGTG AGT

***** *** ********* ***

R 128 CCAATTGGT- GTCC,ACGTT AGT

1 11 21

Working

Percentage mismatch 14.7, Pads 15 0

1250 1260 1270 1280 1290 1300

C 143 TTTTCC,TGT AAATAATTTA AA,TTGCAGG G,,CTTATTG CAA,,TTTTA GGG,AAATTT

****** ** ********** ** ******* * * ****** *** ***** *** *****

R 56 TTTTCCCGGT AAATAATTTA AAATTGCAGG GG-TTTATTG CAAATTTTTA GGGGGAATTT

1 11 21 31 41 51

1310 1320 1330 1340 1350 1360

C 143 T,*CGC,TGA TTTAACT*TC G,AGAATTA, TTGAATTA,, TTTATTTAAA ,,GTAGAGGC

* ** *** ****** ** * ****** ******** ********** ****

R 56 TTCGGCTTGA TTTAACCTTC GGGGAATTAT TTGAATTAAT TTTATTTAAA AG-AGGAGGG

61 71 81 91 101 111

1370

C 143 TGAGCGAAG

* *

R 56 TTGAGGCGA

121

Working

Complementing contig 143

Working

Percentage mismatch 14.5, Pads 8 4

2763 2773 2783 2793 2803 2813

C 8 TC,GGCGGAG CAA*CA*CTC GAAATG,ATA ,GGTTCATCT C*GGTTTCCA GG,ATTCCAG

** ***** * * ** ****** *** ********* * ** ***** ** *** ***

R 155 TCGGGCGG-A ACC,AC,TTC GAAATGGATA GGGTTCATCT CGGG,TTCCA GGGATT-CAG

1 11 21 31 41 51

2823 2833 2843 2853 2863 2873

C 8 TCG,CCCGAG TTGAATCGAT CCATAACGAG ACG,CACCAG ATTCCAAATT GGATTT,CCC

*** *** ** ********** ******** * *** **** *** ****** ****** ***

R 155 TCGGCCC,AG TTGAATCGAT CCATAACG-G ACGG-ACCAT ATT-CAAATT GGATTTTCCC

61 71 81 91 101 111

2883 2893

C 8 CCAGAA,GTC T

*** ** *** *

R 155 CCA-AAGGTC T

121 131

Working

Complementing contig 8

Number of possible problems 3

Figure 4.13 showing a typical run of "Check assembly".
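The window scan and the percentage mismatch described for "Check assembly" can be illustrated with a short Python sketch. It is not the code used by the program: the function names are hypothetical, cutting the hidden data at the start of the first failing window is an assumption, and the mismatch count is a plain character-by-character comparison of two already aligned strings.

```python
def usable_hidden_length(hidden, window=100, max_dashes=5):
    """Return how many leading characters of the hidden data pass the scan.

    A window is slid along the hidden data one base at a time. The scan
    stops at the first window holding more than max_dashes dashes and,
    by assumption, only the data before that window is kept.
    """
    if len(hidden) <= window:
        # Data shorter than one window: treat it as a single window.
        return len(hidden) if hidden.count("-") <= max_dashes else 0
    for start in range(len(hidden) - window + 1):
        if hidden[start:start + window].count("-") > max_dashes:
            return start
    return len(hidden)


def percent_mismatch(consensus, reading):
    """Percentage of aligned positions at which the two strings differ."""
    if len(consensus) != len(reading) or not consensus:
        raise ValueError("aligned strings must be non-empty and equal in length")
    mismatches = sum(1 for c, r in zip(consensus, reading) if c != r)
    return 100.0 * mismatches / len(consensus)
```

Under this strict comparison the first alignment in figure 4.13 has 4 differences in 23 aligned positions, which gives 17.4 and agrees with the value printed there. How the program itself scores padding characters is not stated, so this agreement should not be relied on for other alignments.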

2.12 Examining the positions of reads from the same template

This function is used to check the positions of readings taken from each end of the same template. If the sequencing templates have been selected on size this routine will allow the user to detect any for which their forward and reverse reads have anomalous separations and directions. It can also help to find the relative positions of contigs when the read pairs are in different contigs.

For each forward read the routine searches for a corresponding reverse reading. The search can be over the whole database or over a single contig. The results can be presented graphically for single contig searches and the crosshair function can be used to identify the readings displayed.

Note that at present the function only knows that two reads are from the same template by comparing reading names. For our local projects we use the following naming convention: forward reads are named abcdefgh.s1x and reverse reads abcdefgh.r1y. The program expects this naming convention and so if it finds read fred.s1* and fred.r1* it assumes they are the forward and reverse reads for template fred (* means any character). In the very near future we will make the routine more general!
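The name matching described above can be sketched in Python. This is only an illustration of the convention, not the routine itself: the regular expression is an assumed reading of "fred.s1*" and "fred.r1*" (any trailing characters allowed), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed form of a reading name: template, a dot, "s1" or "r1",
# then any trailing characters.
READ_NAME = re.compile(r"^(?P<template>.+)\.(?P<kind>[sr])1.*$")


def pair_reads(names):
    """Group reading names by template, keeping templates read from both ends."""
    pairs = defaultdict(lambda: {"forward": [], "reverse": []})
    for name in names:
        m = READ_NAME.match(name)
        if m is None:
            continue  # the name does not follow the convention
        direction = "forward" if m.group("kind") == "s" else "reverse"
        pairs[m.group("template")][direction].append(name)
    # Keep only templates that have at least one read in each direction.
    return {t: d for t, d in pairs.items() if d["forward"] and d["reverse"]}
```

For example, `pair_reads(["fred.s1", "fred.r1a", "jim.s1"])` returns an entry for template fred only, because jim has no reverse read.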

If a single contig is selected and the output is listed (see figure 4.14) the program displays two lines of results for each pair: the first line shows the reading name, its position and length, and the distance between the extremities of the two reads; the second line shows the other read name, its position and length. If there are pairs that are in separate contigs or are facing away from one another they are listed after the pairs that face inwards and their separations are meaningless.

If the results are plotted (see figure 4.15) the full length of the template is drawn with arrows indicating the direction of reads and the extent of each reading. Those reads that have their partner in another contig are marked by asterisks. Those that face away from one another have a small vertical bar at their midpoint. Figure 4.15 includes one pair of reads that face away and several reads whose partners are in other contigs.

If contigs are not selected (see figure 4.16) the pairs are sorted on their separations. The output includes: reading name, reading number, position, separation, contig number. Those pointing away from one another are given separation 99999. Those in different contigs are marked by *.

1. Select "Find read pairs".

2. Accept "Select contigs". The alternative searches all contigs.

3. Define "Contig identifier", which is the contig to search.

4. Define "Start position in contig".

5. Define "End position in contig".

6. Reject "Plot results", which causes listed output to be displayed. The alternative plots the results as shown in figure 4.15.

Typical results for a selected contig are shown in figure 4.14 and for all contigs in figure 4.16.

? Select contigs (y/n) (y) =

Default Contig identifier=/i55d8.s1

? Contig identifier=

? Start position in contig (1-15227) (1) =

? End position in contig (1-15227) (15227) =

? Plot results (y/n) (y) = n

852 k23a1.r1 249 238 1615

806 k23a1.s1 1529 -335

238 i68e6.s1 422 193 1632

868 i68e6.r1 1756 -298

576 k17a2.s1 2370 213 1676

885 k17a2.r1 3790 -256

84 k27g6.s1 3456 291 1777

867 k27g6.r1 4905 -328

453 k01g10.s1 5805 142 1251

881 k01g10.r1 6909 -147

781 i98b8.r1 6754 338 1079

10 i98b8.s1 7653 -180

883 k02d11.r1 7327 276 1597

283 k02d11.s1 8726 -198

269 i68f9.s1 8191 169 1055

777 i68f9.r1 8891 -355

710 i91c6.s1 8245 95 1516

780 i91c6.r1 9403 -358

596 k27d12.s1 136 329 -329

219 k27d12.r1 1 -116

159 k27d11.r1 1830 -263 -263

317 k27d11.s1 2902 343

886 k17g11.r1 7107 -123 -123

647 k17g11.s1 1867 265

851 i69g10.r1 8045 -137 -137

277 i69g10.s1 4658 174

Figure 4.14 showing typical output from "Find read pairs" for a single contig.
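For the inward-facing pairs listed in figure 4.14, the separation printed on the first line equals the position of the second read plus the magnitude of its length, minus the position of the first read. The sketch below reproduces that arithmetic. The formula is inferred from the figure rather than taken from the program, and the function name is hypothetical.

```python
def separation(first_pos, second_pos, second_len):
    """Distance between the outer ends of an inward-facing read pair.

    In figure 4.14 the second read's length is printed with a minus sign
    (presumably marking its orientation), so its magnitude is used.
    """
    return second_pos + abs(second_len) - first_pos
```

For the first pair in the figure, `separation(249, 1529, -335)` gives 1615, the value listed for k23a1.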

Figure 4.15 showing typical graphical output from "Find read pairs". Note that in crosshair mode the special symbol I will Identify the reading close to the cross hair.

h4b01h12.s1 77 39208 99999 778

h4b01h12.r1 887 32187 99999 778

h4b01d2.s1 99 33968 99999 778

h4b01d2.r1D 990 27558 99999 778

h4a11e3.s1 16 35246 22366 778

h4a11e3.r1 907 13023 22366 778

h4a11b8.s1 194 12959 15090 778

h4a11b8.r1 906 27775 15090 778

h4a10g6.s1 78 12339 9909 778

h4a10g6.r1 891 22143 9909 778

.

. missing data

.

h4a11d9.r1 845 39099 1167 778

h4a11d9.s1 873 38341 1167 778

h4a11e12.s1 82 6879 1157 778

h4a11e12.r1 848 5957 1157 778

h4a11b4.s1 614 19998 1151 778

h4a11b4.r1 831 21003 1151 778

h4a11a11.s1 139 31785 1148 778

h4a11a11.r1 853 30850 1148 778

.

. missing data

.

h4a11b10.s1 56 27577 884 778

h4a11b10.r1 833 28212 884 778

h4a11c4.s1 228 38265 868 778

h4a11c4.r1 836 37627 868 778

h4a10e9.s1 310 39995 787 778

h4a10e9.r1 905 40582 787 778

h4a11a6.s1 344 38098 661 778

h4a11a6.r1 852 37737 661 778

h4a08a5.s1 246 35089 35685 778*

h4a08a5.r1 879 1 35685 879*

h4a24e3.d1 994 1 35129 994*

h4a24e3.d10 999 35001 35129 778*

h4a11e3.s1a 982 110 28363 211*

h4a11e3.r1D 992 27710 28363 778*

h4a09f8.s1 586 13971 14287 778*

h4a09f8.r1 904 1 14287 904*

.

. missing data

.

h4b01c12.s1 483 249 305 211*

h4b01c12.r1a 943 1 305 943*

Figure 4.16 showing typical output from "Find read pairs" for all contigs.


2.13 Disassembling and breaking contigs

Sometimes it is necessary to drastically alter contigs. Users may need to break a contig in two, remove a single reading, remove a whole set of consecutive readings from a contig, or remove a set of readings from the database independent of which contigs they are in. Sometimes we may wish to move readings from one contig to another or to another part of the same contig.

2.13.1 Disassembling contigs

This function is used to remove readings from a database, move readings to new contigs, or make a list of unattached readings. If readings are removed from the database all reference to them is deleted. If readings are moved to new contigs each will start a new contig containing only the named reading. Such readings can then be processed by the "find internal joins" or "join editor" functions. The latter is useful for repositioning a reading in a repeat: once separated it can be placed in the join editor and scrolled by the other copies.

Unattached readings are those that have no left and right neighbours and so form contigs with only one reading. At the end of a project they usually represent data from contamination that has not been detected by other screening processes, and this function allows them to be removed from the database. Removing them is a two step process: first use the function to make a list, then use it again to remove the readings on the list.

Removal of sets of readings works in three modes:

1. A set of adjacent readings in a contig can be removed by the user naming the two end ones.

2. A batch of readings from any number of contigs can be defined by the user naming a file containing a list of reading names.

3. The user defines single readings by name or number.

The program cleans up the database by moving data to fill up any holes made in the files.

For the first two modes of operation the program will ask for a file of file names. If users create their own file (ie mode 2) each reading NAME must be on a separate line. For mode 1 the user types the names or numbers of the leftmost and rightmost readings to be removed. They, and all intervening readings, will be removed. For the third mode the user types in reading names. If the user types only return or quit the option will be left. For all modes, if necessary, new contigs will be created. The operations and choices for moving readings to new contigs are exactly the same as for removal except that the readings do not disappear from the database.

Figure 4.17 shows a worked example of removing reads. Here six adjacent reads are removed. First the user defines the numbers of the reads at the ends of the segment to be removed, then the program writes their names to a file (reads.out). At this stage it allows the user to quit. Then it uses the names in the file to make the changes. Note that it gives diagnostics as it goes along: it lists the reading names, then for each of them it displays the operations required to move/remove them from the database. For example e05a9.s1 is reading number 111 and its removal requires all the readings in the contig to be shifted 124 positions to the left. In order to fill up the gap in the reading numbers, reading 222 is renumbered 111.

Disassemble readings

Database is logically consistent

You are advised to make a backup copy first!

Select task

X 1 Remove readings from database

2 Move readings to new contigs

3 Make a list of unattached readings

? Selection (1-3) (1) =

Select definition mode

X 1 Define a region by reading names

2 Use a file of reading names

3 Define single readings

? Selection (1-3) (1) =

? Leftmost reading=111

? Rightmost reading=69

Position of this reading= 673

Number of leftmost reading this contig= 111

? Name for temporary file of reading names=reads.out

e05a9.s1

e05f6.s1

c99b11.s1

e07g6.s1

e04g11.s1

e17c5.s1

? Process the list (y/n) (y) =

Shifting readings in contig by distance= -124

Renumbering reading 222 to 111

Shifting readings in contig by distance= 0

Renumbering reading 221 to 121

Shifting readings in contig by distance= -33

Renumbering reading 220 to 52

Shifting readings in contig by distance= -74

Renumbering reading 219 to 168

Shifting readings in contig by distance= -3

Renumbering reading 218 to 105

Shifting readings in contig by distance= 0

Renumbering reading 217 to 200

Database is logically consistent

Figure 4.17 showing typical output from Disassemble readings.
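The renumbering messages in figure 4.17 are consistent with filling each hole by giving the highest-numbered reading the number of the one removed. A minimal Python sketch of that bookkeeping follows. The dictionary layout and the function name are assumptions made for illustration; the real database records far more than reading names.

```python
def remove_reading(readings, number):
    """Remove one reading and keep the numbering contiguous.

    readings maps reading number -> reading name, numbered 1..N.
    The highest-numbered reading is moved into the hole left behind.
    Returns (old_number, new_number) for the reading that was renumbered,
    or None if the removed reading was already the last one.
    """
    last = max(readings)
    readings.pop(number)
    if number == last:
        return None
    readings[number] = readings.pop(last)
    return (last, number)
```

With 222 readings in the database, removing reading 111 in this sketch returns `(222, 111)`, matching the message "Renumbering reading 222 to 111".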

2.13.2 Breaking a contig

This function is found in the "Alter relationships" menu. It can be used to break a contig at the beginning of a particular reading so that the identified reading becomes the left end of a new contig. The user types in the number of the reading that will become the left end.

2.14 Finding and labelling repeats

This function can find and, optionally tag, direct or inverted repeats in the consensus. It only finds exact matches - ie runs of consecutive bases that match without interruption. Hence long repeats with occasional mismatches will be reported as several separated matches. The search is currently (October 93) new and should be regarded as experimental. The routine requires a lot of memory to run but is very fast. The function works in two stages: first it does the search and writes a list of matches to the screen and also to a file; then the file is used to create the tags in the database. The program requires the user to confirm the tagging step. Before doing so the user can use an editor such as emacs to alter the file - for example to stop previously labelled repeats being tagged again. That is the user can edit the file before answering "yes" to the question "? Add tags to database (y/n) (y) = ".

At present we are unable to tag the consensus sequence and so are forced to find suitable reads to hold any labels we wish to create. A repeat has two ends. The tag at one end of the repeat will have as its comment the name of the reading at the other end. The file describing the repeats (see figure 4.18) has the following format. Each repeat occupies 7 lines of the file and is followed by a blank line. Line 1 is a header showing the repeat number; line 2 labels the first end of the repeat; line 3 gives the number of the reading to be tagged; line 4 defines the tag to be of type REPT, its start position in the read, the length of the tag, and the comment field for the tag as the name of the reading labelled as the other end of the repeat; line 5 labels the other end of the repeat; line 6 gives the read number; line 7 is equivalent to line 4. Figure 4.18 defines two repeats. Repeat number 1 is tagged on reading 131 at position 48; the tag is 12 bases long and the comment is e05a1.s1; the other end of the repeat is tagged on reading 107 (whose name will be e05a1.s1) at position 2, etc.

Repeat number 1

End 1

; 131

;;REPT 48 12 e05a1.s1

End 2

; 107

;;REPT 2 12 e06a3.s1

Repeat number 2

End 1

; 84

;;REPT 31 13 e03a6.s1

End 2

; 70

;;REPT 166 13 e03h10.s1

Figure 4.18 showing a repeat tagging file containing two repeats.
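Because the repeat file may be edited by hand before tagging, it can help to check it with a small parser. The Python sketch below reads the 7-line layout described above. It is an illustration only: the function name and the returned dictionary layout are hypothetical, and blank lines are simply ignored rather than checked.

```python
def parse_repeat_file(text):
    """Parse a repeat tagging file into a list of repeats.

    Blank lines are ignored, so each repeat is read as 7 non-empty lines:
    header, end label, read number, tag line, end label, read number, tag line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 7 != 0:
        raise ValueError("expected 7 non-empty lines per repeat")
    repeats = []
    for i in range(0, len(lines), 7):
        block = lines[i:i + 7]
        number = int(block[0].split()[-1])  # e.g. "Repeat number 1"
        ends = []
        for read_line, tag_line in ((block[2], block[3]), (block[5], block[6])):
            read_number = int(read_line.lstrip(";").strip())  # e.g. "; 131"
            # e.g. ";;REPT 48 12 e05a1.s1"
            tag_type, start, length, comment = tag_line.lstrip(";").split(None, 3)
            ends.append({
                "read": read_number,
                "type": tag_type,
                "start": int(start),
                "length": int(length),
                "comment": comment,
            })
        repeats.append({"number": number, "ends": ends})
    return repeats
```

A file that has lost part of an entry during editing raises `ValueError` here, which is a cheap way to catch a partly deleted repeat before answering the tagging prompt.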

The function can find direct or inverted repeats, and can search the whole database or sections of single contigs. Typical output is shown in figure 4.19.

1. Select "Find repeats".

2. Define "File name for results", which is the file used to save the descriptions of the repeats used for tagging the readings.

3. Define "Minimum repeat". Only exact matches of at least this length will be reported. Of most interest during assembly are repeats at least as long as the "Minimum match length" used by the assembly routine.

4. Select "Find direct repeats". Alternatively the program will report inverted repeats.

5. Select "Search whole database". Alternatively the program will search a region of a particular contig.

6. Wait for the program to finish listing its results, at which point it will display the message "? Add tags to database". If required use an editor like emacs to edit the results file. After closing the file accept "Add tags to database" and the program will create tags for all the repeats. If a repeat is removed from the file make sure that all 8 of its lines (the 7 lines of the entry and the blank line that follows) are deleted.

Find repeats

? File name for results=repeats

? Minimum repeat (7-1000) (25) =12

Select task

X 1 Find direct repeats

2 Find inverted repeats

? Selection (1-2) (1) =

Select task

X 1 Search whole database

2 Search single contig

? Selection (1-2) (1) =

Working

Working

Direct repeat of 12: contig 143 at 2625 and contig 143 at 189

Direct repeat of 13: contig 122 at 1565 and contig 143 at 1261

Direct repeat of 15: contig 122 at 1663 and contig 143 at 987

Direct repeat of 12: contig 122 at 2840 and contig 143 at 732

Direct repeat of 12: contig 27 at 77 and contig 27 at 43

Direct repeat of 12: contig 27 at 533 and contig 122 at 1158

Direct repeat of 14: contig 180 at 1192 and contig 180 at 335

Direct repeat of 12: contig 180 at 1198 and contig 180 at 949

Direct repeat of 14: contig 177 at 313 and contig 122 at 923

Direct repeat of 15: contig 177 at 411 and contig 180 at 584

Direct repeat of 12: contig 177 at 1287 and contig 122 at 1461

Direct repeat of 12: contig 177 at 1315 and contig 143 at 1282

Direct repeat of 27: contig 184 at 1 and contig 122 at 3066

Direct repeat of 15: contig 184 at 26 and contig 122 at 3092

Direct repeat of 17: contig 184 at 42 and contig 122 at 3109

Direct repeat of 12: contig 184 at 243 and contig 122 at 3099

Direct repeat of 15: contig 184 at 460 and contig 184 at 458

Direct repeat of 13: contig 184 at 462 and contig 184 at 458

? Add tags to database (y/n) (y) =

Figure 4.19 showing a typical run of "Find repeats".
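To make the notion of an exact direct repeat concrete, here is a small Python sketch that reports maximal runs of consecutive matching bases within one sequence. It uses a simple index of fixed-length words and is not the algorithm used by the program, which is not described here; the function name is hypothetical and positions are counted from 0.

```python
from collections import defaultdict


def find_direct_repeats(seq, min_len):
    """Return maximal exact direct repeats as (length, first_start, second_start).

    Every pair of identical words of length min_len is located through an
    index. A pair is reported only when it cannot be extended to the left,
    and is then extended to the right for as long as the bases match.
    """
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)

    repeats = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue  # part of a longer match found further left
                length = min_len
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                repeats.append((length, i, j))
    return sorted(repeats, key=lambda r: (r[1], r[2]))
```

Because only uninterrupted matches are extended, a long repeat containing a single mismatch comes out as two shorter entries, which is the behaviour described for the function above.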

2.15 Filling single stranded regions with hidden data

The purpose of this function is to "double strand" regions of contigs that have data on only one strand. This is best explained by describing how the routine operates. First it finds regions that have data for only one strand. Then it examines the nearby readings on the other strand to see if they have hidden data covering that region. If so it finds the best alignment between this hidden data and the consensus over the region. If this alignment is good enough the data is converted from hidden to visible. Significant portions of the sequence can be covered by this operation, hence saving a great deal of experimental work. The function is a standard part of cleaning up a sequencing project and is best used prior to "primer walking". The hidden data is used carefully to try and minimise the number of data disagreements created. However it must be noted that an overall slight degradation in quality will still occur.

The criteria for evaluating the amount of hidden data to be used are based upon a maximum number of mismatches and a score (derived by accumulating points for mismatches, matches and insertions over the length of an alignment). The defaults are:

maximum mismatches : 6

score for mismatch : -8

score for correct match : +1

score for insertion : -5

Note that with successive calls to this option it is possible to double strand more and more data. Naturally however the quality of the data generated will diminish each time.
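One plausible way to combine the mismatch limit with the accumulated score is sketched below in Python. The manual does not say exactly how the routine applies these values, so choosing the extension length with the highest running score is an assumption, as are the function name and the use of "-" to mark an insertion in either aligned string.

```python
def best_extension(consensus, hidden, max_mismatches=6,
                   mismatch=-8, match=1, insertion=-5, pad="-"):
    """Return (length, score) of the best-scoring prefix of an alignment.

    The two strings are already aligned and of equal length. The running
    score is accumulated position by position, and scanning stops once the
    number of mismatches would exceed max_mismatches.
    """
    score = 0
    best_len, best_score = 0, 0
    mismatches = 0
    for i, (c, h) in enumerate(zip(consensus, hidden), start=1):
        if c == pad or h == pad:
            score += insertion
        elif c == h:
            score += match
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
            score += mismatch
        if score > best_score:
            best_len, best_score = i, score
    return best_len, best_score
```

With the default scores a single mismatch costs as much as eight matches earn, so in this sketch an extension is only taken past a mismatch when it is followed by a long enough run of agreement.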

1. Select "Double strand".

2. Define "Contig identifier", which is the contig to be processed.

3. Define "Start position in contig", which is where processing will start.

4. Define "End position in contig", which is where processing will end.

5. Define "Maximum number of mismatches". Any alignments with more mismatches will not be used.

6. Define "Score for mismatch", which is the score for each mismatch in the alignment.

7. Define "Score for correct match", which is the value assigned to each matching base in the alignment.

8. Define "Score for insertion", which is the score for each insertion in either sequence.

Processing will then begin and the routine will report any double stranding it is able to perform. An example is shown in figure 4.20.

Double strand

Default Contig identfier=/e05f8.s1

? Contig identfier=

? Start position in contig (1-3143) (1) =

? End position in contig (1-3143) (3143) =

? Maximum number of mismatches (0-99) (5) =

? Score for mismatch (-100-0) (-8) =

? Score for correct match (0-100) (1) =

? Score for insertion (-100-0) (-5) =

Working

Double stranded e07b9.s1 by 70 bases at offset 2435

Positive strand : double stranded 70 bases with 3 inserts into consensus

Complementing contig 122

Double stranded e06g6.s1 by 11 bases at offset 1225

Double stranded e04e6.s1 by 20 bases at offset 966

Double stranded c53c5.s1 by 94 bases at offset 202

Negative strand : double stranded 127 bases with 3 inserts into consensus

Total : double stranded 197 bases with 6 inserts

Complementing contig 82

Figure 4.20 showing a worked example of the double stranding routine.

2.16 Shuffling pads

One weakness of the assembly routine is that padding characters introduced to line up the readings are not always aligned with the pads in other sequences: a single problem such as a compression can give rise to pads apparently randomly arranged in the different readings covering the region. This function attempts to shuffle the pads around so that they align with one another, hence simplifying editing. No information is lost in the process: only the positions of padding characters are changed. The function is best used prior to editing.

2.17 Checking for editing mistakes

This function should be employed when the editing for a project is (apparently) completed. Its purpose is to detect bases (actually pairs of adjacent bases, see below) in the final consensus for which there is no evidence in the original readings. It will find insertions, deletions and changes. The program assumes that every pair of adjacent bases in the final sequence must appear in at least one of the original overlapping readings. The program's task is to find, and inform the user of, any pairs of adjacent bases for which this assumption fails. These positions can then be checked and, if necessary, corrected. Notice that the program does not find all places that disagree with the final consensus (of which there will generally be many); rather it tries to find one reading that does agree. Most disagreements are in poor data and at this stage of a project they are no longer of interest as they should have been checked. Here we are checking that there is evidence for the final sequence and hence should detect any inadvertent errors made during its assembly and editing.

The program works in the following way. A consensus is calculated. A table of the same length as the consensus is set to zero. Each element in this table corresponds to a pair of adjacent bases in the consensus. The program then finds an alignment between each of the original readings (extracted from their trace files) and the consensus. When a pair of bases in the consensus is found to match exactly with a pair in an original reading the corresponding element in the table is set to 1. At the end of this process the positions of any elements in the table that are still set to zero are reported to the user.

The program allows the user to specify the project name, the project version, the consensus calculation cutoff, a search path for where traces (raw data) are to be found, and the contig to process. Two interfaces are provided: all options can be prompted for, or they can be specified on the command line. The command line interface is as follows.

cop-bap

[-p project]

[-v version]

[-c consensus_cutoff_percentage]

[-r raw_data_search_path]

[-h]

[-c contig]

For example: COP-BAP can be run on T05G5 version 0, contig 66, taking data from ~kt/T05G5 with the command:

cop-bap -p t05g5 -v 0 -r ~kt/T05G5 -c 66

The "standard" interface is as follows.

1. Select program "cop-bap". Notice that this is not an option in bap but is a separate program.

2. Define "Project name". This is the name of the project database to be checked.

3. Define "Version". This is the version of the database (usually 0).

4. Define "Consensus cutoff". The value supplied will be used to calculate the consensus (usually 100%).

5. Define "Trace directory". By default the program will look in the current directory for the original trace files. This option allows users to define a different directory.

6. Define the name or number of the contig to check. By default all contigs will be checked. If the contig is specified by a reading name it must be preceded by a slash (/) symbol.

The program will then start to check the data and will write its results to the screen and to a file called "<project>.<version>.LOG". The results specify the contig and each of its problem areas.

Figure 4.21 shows part of a run where the user has elected to be prompted for each input. The program was unable to find one trace file. It reports 7 problem regions.

cop-bap

COP v1.2: Check Out Project

Checks xbap database for errors

Project name ? T05G5

Version ? 0

Consensus cutoff = 100%

Trace directory = /home/jkb/data/T05G5

Check which contig? [all] i49b2.s1

Checking contig 952: i49b2.s1

Error reading SCF trace file /home/jkb/data/T05G5/g23e9.s1SCF

Problem areas:

1616-1617

4893-4896

18369

21426-21427

26022-26023

36304-36308

36314-36318

Figure 4.21 showing typical output when checking for editing mistakes
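The table of adjacent-base pairs used when checking for editing mistakes can be sketched in Python as follows. This is a simplified illustration, not the program's code: readings are assumed to be supplied already aligned as (offset, sequence) with no gaps, the table holds one element per adjacent pair (one fewer than the consensus length), and both function names are hypothetical.

```python
def unsupported_pairs(consensus, aligned_reads):
    """Return 1-based start positions of consensus pairs with no supporting read.

    aligned_reads is a list of (offset, sequence) tuples, where offset is the
    0-based consensus position at which the reading starts.
    """
    n = len(consensus)
    seen = [0] * max(n - 1, 0)
    for offset, read in aligned_reads:
        for j in range(len(read) - 1):
            i = offset + j
            if 0 <= i < n - 1 and consensus[i:i + 2] == read[j:j + 2]:
                seen[i] = 1  # this pair has evidence in an original reading
    return [i + 1 for i, flag in enumerate(seen) if not flag]


def group_ranges(positions):
    """Collapse sorted positions into strings such as '4893-4896' or '18369'."""
    ranges = []
    start = prev = None
    for p in positions:
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            ranges.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = p
    if start is not None:
        ranges.append(str(start) if start == prev else f"{start}-{prev}")
    return ranges
```

For instance, `group_ranges(unsupported_pairs("ACGTAC", [(0, "ACGT"), (3, "TTC")]))` yields `["4-5"]`, in the same style as the problem areas listed by the program. Whether the program reports the first base of an unsupported pair or both bases is not stated, so the positions here are only indicative.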

2.18 Displaying a contig

The "Display a contig" option shows the aligned readings for any part of a contig. Users select "Display a contig", then select the contig. The number, name and strandedness of each reading is shown and the consensus is written below. A typical example, showing part of a contig from positions 3301 to 3450, is seen in figure 4.21. Overlapping this region are readings 3, 40, 8, 37, 35 and 2, with archive names L3.SEQ, A21A7.S1 and so on. Readings 3, 8, 35 and 2 are in reverse orientation as indicated by the minus signs. There are a few padding characters in the working versions, but the consensus (shown below each page width) has a definite assignment for every position except 3376.

2.19 Highlighting differences between readings and the consensus

During the latter stages of a project this option is used to highlight disagreements between individual gel readings and their consensus sequences. Typical output is seen in figure 4.22 which shows the result for the section of contig shown in figure 4.21. Characters that agree with the consensus are shown as + symbols for the plus strand and - for the minus strand. Characters that disagree with the consensus are left unchanged and so stand out clearly. Note that a similar display is now more conveniently available within the contig editor.

1. Set the consensus cutoff score.

2. Redirect output to disk.

3. Display the contig.

4. Close the redirection file.

5. Select "Highlight disagreements".

6. Define the name of the redirection file.

7. Define an output file name.

8. Select a symbol for good plus strand data.

9. Select a symbol for good minus strand data.

10. Print the file.

3310 3320 3330 3340 3350

-3 L3.SEQ atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

40 A21A7.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

-8 A16A2.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

37 A21A2.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

CONSENSUS atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

3360 3370 3380 3390 3400

-3 L3.SEQ gatctgaccaagcgacag*tttaaa*gtgctgcttgccatt*ctgcgt*a

40 A21A7.S1 gatctgaccaagcgacag*gttaaagttgctgctt

-8 A16A2.S1 gatctgaccaagcgacag*tttaaa*gtgctgcttgccatt*ctgcgt*a

37 A21A2.S1 ga-ctgaccaagcgacag*tttaaa*gtgctgcttgccatt*ctgcgt*a

35 A16D12.S1 gttttaaa-gtgctgcttgccatttctgcgtaa

-2 L2.SEQ t*ctgcgt*a

CONSENSUS gatctgaccaagcgacag*tttaaa-gtgctgcttgccatt*ctgcgt*a

3410 3420 3430 3440 3450

-3 L3.SEQ aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

-8 A16A2.S1 aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

37 A21A2.S1 aaacctatgggtgggaataaaccaatggacagaatcaccgattctcaact

35 A16D12.S1 aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

-2 L2.SEQ aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

CONSENSUS aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

Figure 4.21 Typical output from "Display contig".

3310 3320 3330 3340 3350

-3 L3.SEQ --------------------------------------------------

40 A21A7.S1 ++++++++++++++++++++++++++++++++++++++++++++++++++

-8 A16A2.S1 --------------------------------------------------

37 A21A2.S1 ++++++++++++++++++++++++++++++++++++++++++++++++++

atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

3360 3370 3380 3390 3400

-3 L3.SEQ -------------------------*------------------------

40 A21A7.S1 +++++++++++++++++++g+++++gt++++++++

-8 A16A2.S1 -------------------------*------------------------

37 A21A2.S1 ++-++++++++++++++++++++++*++++++++++++++++++++++++

-35 A16D12.S1 -t----------------------t------a-

-2 L2.SEQ ----------

gatctgaccaagcgacag*tttaaa-gtgctgcttgccatt*ctgcgt*a

3410 3420 3430 3440 3450

-3 L3.SEQ --------------------------------------------------

-8 A16A2.S1 --------------------------------------------------

37 A21A2.S1 ++++++++++++g+++++++++++++++++++++++++++++++++++++

-35 A16D12.S1 --------------------------------------------------

-2 L2.SEQ --------------------------------------------------

aaacctatgggt*ggaataaaccaatggacagaatcaccgattctcaact

Figure 4.22 Typical output from "Highlight disagreements", showing the results for the section of contig displayed in figure 4.21.
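The substitution performed by "Highlight disagreements" can be shown in a few lines of Python. This is a sketch of the display rule only; the function name is hypothetical and the reading and consensus segments are assumed to be already aligned and equal in length.

```python
def highlight(reading, consensus, strand="+"):
    """Replace characters that agree with the consensus by the strand symbol.

    Characters that disagree are left unchanged so that they stand out.
    """
    if len(reading) != len(consensus):
        raise ValueError("reading and consensus segment must be the same length")
    return "".join(strand if r == c else r for r, c in zip(reading, consensus))
```

Applied to reading A21A2.S1 and the consensus for positions 3351 to 3400, this leaves the "-" at position 3353 and the "*" at position 3376 visible among the + symbols, as in figure 4.22.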

2.20 Screen editing contigs in SAP

When using SAP the best way for users to edit a whole contig interactively is to use their preferred external editor on the standard display of a contig. When the screen edit function is selected SAP writes a text file containing a display of the contig and passes it to an external editor - say EDT on the VAX or emacs on a UNIX system. The user modifies the file using the editor and when the editor is exited SAP moves the changed contig back into the project database.

1. Select "Screen edit".

2. Select the contig to edit.

3. Define a temporary file for use by the editor. After a slight pause the editor will start and the first page of the contig will appear on the screen.

4. Edit the contig using the editor's standard commands.

5. Exit from the editor.

6. Accept "Put contig back into the database".

2.21 Automatic editing of contigs in SAP

This function automatically changes characters in gel readings to make them agree with the consensus sequence. At first sight this may seem like an unethical procedure but as is explained in the notes it is quite legitimate and saves a great deal of time. In figure 4.23 we show the effect of using autoedit on the section of contig displayed in figure 4.21. All changed characters (for example position 3369, reading A21A7.S1) are denoted by uppercase letters. Note that apart from position 3375 which has an unresolved consensus all other changes have been made. These edits were made using a combined consensus for both strands, but the standard version of the program treats each strand separately and will only make a change if the consensus for the two strands agree.

1. Redirect output to disk.

2. Select "Display contig".

3. Identify the contig to edit/display.

4. Close the redirection file.

5. Print the file containing the displayed contig.

6. Check the contig and the original films and annotate the printout to indicate the required edits.

7. Set the cutoff for the consensus calculation.

8. Select "Auto edit".

9. Identify the contig and the section to edit.

10. The program will display a summary of changes made.

11. Display the contig and compare it with the annotated printout.

12. Use another editing method to finish the editing.

3310 3320 3330 3340 3350

-3 L3.SEQ atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

40 A21A7.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

-8 A16A2.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

37 A21A2.S1 atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

CONSENSUS atggttacgccagactatcaaatatgctgcttgaggcttattcgggcgca

3360 3370 3380 3390 3400

-3 L3.SEQ gatctgaccaagcgacagtttaaa*gtgctgcttgccattctgcgtaaaa

40 A21A7.S1 gatctgaccaagcgacagTttaaagGtgctg

-8 A16A2.S1 gatctgaccaagcgacagtttaaa*gtgctgcttgccattctgcgtaaaa

37 A21A2.S1 gaTctgaccaagcgacagtttaaa*gtgctgcttgccattctgcgtaaaa

-35 A16D12.S1 gtttaaa-gtgctgcttgccattctgcgtaaaa

-2 L2.SEQ tctgcgtaaaa

CONSENSUS gatctgaccaagcgacagtttaaa-gtgctgcttgccattctgcgtaaaa

3410 3420 3430 3440 3450

-3 L3.SEQ cctatgggtggaataaaccaatggacagaatcaccgattctcaacttag

-8 A16A2.S1 cctatgggtggaataaaccaatggacagaatcaccgattctcaacttagc

37 A21A2.S1 cctatgggtggaataaaccaatggacagaatcaccgattctcaacttagc

-35 A16D12.S1 cctatgggtggaataaaccaatggacagaatcaccgattctcaacttagc

-2 L2.SEQ cctatgggtggaataaaccaatggacagaatcaccgattctcaacttagc

CONSENSUS cctatgggtggaataaaccaatggacagaatcaccgattctcaacttagc

Figure 4.23 The result of applying the "Auto editor" to the section of contig displayed in figure 4.21.

2.22 Using the original editor in SAP

This simple editor can insert, delete and change gel reading sequences by performing one selected operation at a time. It is used during the interactive entry of new readings and interactive joining of contigs. The commands request the position at which the edit is required and the number of characters to insert, delete or change.

3. NOTES

1. As each reading is entered into a project database it is given a unique number. The first is numbered 1, the second 2 and so on. Their original file names (known as "archives" because they are kept outside the database and never edited) are also copied into the database. During assembly contigs are constantly being changed and reordered so the program identifies them by the numbers or names of the readings they contain. Whenever the program asks users to identify a contig or reading they can type its number or its archive name. If they type its archive name they must precede the name by a slash "/" symbol to denote that it is a name rather than a number. For example if the archive name is fred.gel with number 99, users should type /fred.gel or 99 when asked to identify the contig. Generally, when it asks for the reading to be identified, the program will offer the user a default name, and if the user types only return, that contig will be accessed. When a database is opened the default contig will be the longest one, but if another is accessed, it will subsequently become the current default.
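The identification rule described in this note can be sketched in a few lines of Python. This is only an illustration of the convention (a leading slash marks an archive name, anything else is a reading number, and a bare return accepts the default); the function name parse_identifier and its return values are hypothetical and are not part of SAP or XBAP.

```python
def parse_identifier(reply, default=None):
    """Classify a user's reply as a reading number or an archive name.

    Hypothetical helper: it mirrors the rule in the note, not program code.
    """
    reply = reply.strip()
    if not reply:
        # Typing only return accepts the default offered by the program.
        return default
    if reply.startswith("/"):
        # A leading slash denotes an archive name rather than a number.
        return ("name", reply[1:])
    return ("number", int(reply))


print(parse_identifier("/fred.gel"))
print(parse_identifier("99"))
```

Both replies in the usage lines would identify the same reading in the example given above, where fred.gel has number 99.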

2. An XBAP database is made from five separate files: the "archive names" file *.ARN, the "relationships" file *.RLN, the "sequences" file *.SQN, the "tag" file *.TGN, and the "comments" file *.CCN. If the database is called FRED then version 0 of database FRED comprises files FRED.AR0, FRED.RL0, FRED.SQ0, FRED.TG0 and FRED.CC0. The version is the last symbol in the file names. If the "copy database" option is used it will ask the user to define a new "version". The normal strategy is to use version 0 for all work and to use other versions as backups. Program SAP uses databases formed from only the first three of these files. Normally the program is used to handle DNA sequences but many of the functions also work on protein sequences. The choice of sequence type is made when the database is started.
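As a small illustration of the naming scheme, the following Python sketch lists the files that make up a given version of a database. The function and constant names are ours (hypothetical); only the file name pattern is taken from the note.

```python
# Two-letter stems of the five XBAP files; the version symbol is appended.
XBAP_STEMS = ("AR", "RL", "SQ", "TG", "CC")
# SAP databases are formed from only the first three files.
SAP_STEMS = XBAP_STEMS[:3]


def database_files(name, version="0", stems=XBAP_STEMS):
    """Return the file names for one version of a database."""
    return [f"{name}.{stem}{version}" for stem in stems]


print(database_files("FRED"))
print(database_files("FRED", "1", SAP_STEMS))
```

The first call lists version 0 of an XBAP database called FRED; the second lists a backup (version 1) of a SAP database of the same name.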

3. Here we describe how to define the positions of cloning and primer sites for vep. The vector sequence should be stored in a simple text file with up to 80 characters of data per line. We define "sequencing vectors" to be those vectors such as m13 used to produce templates for sequencing. All other vectors, such as cosmid vectors, that are used to purify and grow the DNA prior to it being subcloned into sequencing vectors are termed "cloning vectors". It is important that the files containing cloning vector sequences that are used by vep are arranged so that the cloning site follows the last base in the file. For example:

start of file

acatacatacatatata

acatagatagatacaga

.

.

.

cagatataX

end of file

where X is the cloning site.

For sequencing vectors it is somewhat tedious to calculate the correct values for vep. The numbers can be worked out from listings of the vector sequences but it is far easier to use the restriction enzyme search in nip to do it. The last section of this note explains how to define the positions of cloning site and primer in a single search using nip. First we do it in two steps to explain the operations. The position of the cloning site depends on the ordering of the bases in the particular vector sequence file being used. That is, as the sequences are circular, the file may be arranged to start at any base and still give the same circular sequence. Vep must be told the correct position of the cloning site and then, relative to that, the position of the first base that will be included in the reading, i.e. the relative position of the first base 3' of the primer.

For "forward" primers we search for the complement of the primer sequence in the vector. For "reverse" primers we search for the primer sequence in the vector. The relative positions of reverse primers, being to the "left" of the cloning site, have negative values. Below we use EMBL entry M13MP18 as an example. Here the SmaI site is at 6249, the forward primer (ForwardP) is at relative position 41, and the reverse primer (ReverseP) is at position -24. The figure was produced by nip.

ECORI BANII

. BSP1286

. HGIAI

. SACI

. . BANI

. . . AVAI

. . . BINI

. . . KPNI

. . . .NCII

. . . ..NCII

. . . ..SMAI

. . . ... BAMHI

. . . ... XHOII XBAI

. . . ... . . BINI

123456789012

123456789012345678901234

ReversePaacagctatgaccatg

acacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctcta

6210 6220 6230 6240 6250 6260

SALI

.ACCI

..HINCII PSTI

... . BSPMI

... . . SPHI

... . . . HINDIII EAEI

34567890123456789012345678901

tgaccggcagcaaaatg ForwardP

gagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgg

6270 6280 6290 6300 6310 6320

Figure 4.24 The positions of SmaI site and a forward and reverse primer for M13MP18.

3.1 Finding the cloning site.

Figure 4.25 shows how to use nip to search for the restriction enzyme site.

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =

? Search for all names (y/n) (y) = n

? Name=smai

? Name=

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =

? List matches (y/n) (y) =

? The sequence is circular (y/n) (y) =

? Search for definite matches (y/n) (y) =

Working

Matches found= 1

Name Sequence Position Fragment lengths

1 SMAI ccc'ggg 6249 7249 7249

Figure 4.25 Searching for the SmaI site (6249) in M13MP18.

3.2 Finding the primer site

3.2.1 The forward primer.
We must search for the complement of the primer sequence in the vector and find where its last base lies relative to the cloning site. Note that we define the cut site to be 5' of the search string, which means that the position reported will be that of the first base in the reading. We search for the -20 forward primer in M13MP18 as shown in figure 4.26.

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =5

Define search strings by typing a string name

followed by the string(s)

? Name=forward

? String(s)='actggcc

? Name=

? select names (y/n) (y) = n

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =

? List matches (y/n) (y) =

? The sequence is circular (y/n) (y) =

? Search for definite matches (y/n) (y) =

Working

Matches found= 1

Name Sequence Position Fragment lengths

1 forward 'actggcc 6290 7249 7249

Figure 4.26 A search for the -20 forward primer in M13MP18.

So the absolute position is 6290 and the relative position is 6290-6249 = 41

3.2.2 The reverse primer
We must search for the primer sequence in the vector and find where its last base lies relative to the cloning site. Note that we define the cut site to be 3' of the search string, which means that the position reported will be that of the first base in the reading. We search for a reverse primer in M13MP18 as shown in figure 4.27.

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =5

Define search strings by typing a string name

followed by the string(s)

? Name=reverse

? String(s)=gaccatg'

? Name=

? Search for all names (y/n) (y) =

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =

? List matches (y/n) (y) =

? The sequence is circular (y/n) (y) =

? Search for definite matches (y/n) (y) =

Working

Matches found= 1

Name Sequence Position Fragment lengths

1 reverse gaccatg' 6225 7249 7249

Figure 4.27 A search for a reverse primer in M13MP18.

So the absolute position is 6225 and the relative position is 6249-6225 = 24. However it is to the left of the cloning site so it is -24.
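The arithmetic of sections 3.1 and 3.2 can be checked with a short Python sketch. It is not part of nip or vep: the function names are hypothetical, the 120 bases are copied from figure 4.24 (positions 6201 to 6320 of M13MP18), and we assume that the position reported for a search string is that of the first base after the cut mark ('), which is what figures 4.25 to 4.27 show.

```python
# Bases 6201-6320 of M13MP18 as printed in figure 4.24.
REGION = (
    "acacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctcta"
    "gagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgg"
)
REGION_START = 6201


def reverse_complement(seq):
    """Complement a lowercase DNA string and reverse it."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]


def cut_position(pattern, region=REGION, start=REGION_START):
    """Position of the first base after the cut mark (') in the first match."""
    offset = pattern.index("'")
    bases = pattern.replace("'", "")
    hit = region.find(bases)
    if hit < 0:
        raise ValueError(f"{bases} not found")
    return start + hit + offset


site = cut_position("ccc'ggg")       # SmaI cloning site
forward = cut_position("'actggcc")   # complement of the forward primer
reverse = cut_position("gaccatg'")   # reverse primer itself

print(site, forward - site, reverse - site)
```

Subtracting the cloning site position from each primer position gives the signed relative positions directly, so the reverse primer comes out negative without a separate rule. The forward primer search string can also be derived rather than typed: reading the ForwardP line of figure 4.24 from right to left gives gtaaaacgacggccagt, and reverse_complement of that string begins with actggcc.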

3.3 How to do the calculations in a single step

3.3.1 The forward primer
Figure 4.28 shows how to find the values for the forward primer. The SmaI recognition sequence is ccc'ggg and the primer sequence is 'actggcc.

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =5

Define search strings by typing a string name

followed by the string(s)

? Name=b

? String(s)=ccc'ggg/'actggcc/

? Name=

? Search for all names (y/n) (y) =

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =

? List matches (y/n) (y) =

? The sequence is linear (y/n) (y) = n

? Search for definite matches (y/n) (y) =

Working

Matches found= 2

Name Sequence Position Fragment lengths

1 b ccc'ggg 6249 7208 41

2 b 'actggcc 6290 41 7208

Figure 4.28 Finding the vep values for the forward primer in a single step.

This gives us the position of the cloning site (6249) and the relative position of the primer site (41).
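The single step search works because, on a circular sequence with two cuts, one of the two fragment lengths is the distance between the cloning site and the primer site. The following Python sketch shows the calculation; it is our own illustration, not nip code, and the function name is hypothetical. The sequence length of 7249 is the fragment length reported in figure 4.25.

```python
def circular_fragments(cuts, length):
    """Fragment lengths between successive cuts on a circular sequence."""
    cuts = sorted(cuts)
    fragments = []
    for i, cut in enumerate(cuts):
        next_cut = cuts[(i + 1) % len(cuts)]
        # A single cut yields one fragment of the full length.
        fragments.append((next_cut - cut) % length or length)
    return fragments


print(circular_fragments([6249, 6290], 7249))  # cloning site and forward primer
print(circular_fragments([6225, 6249], 7249))  # reverse primer and cloning site
```

The shorter fragment in each case is the magnitude of the relative position; the sign still has to be taken from the order of the two sites.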

3.3.2 The reverse primer
Figure 4.29 shows how to find the reverse primer in a single step. The SmaI recognition sequence is ccc'ggg and the primer sequence is gaccatg'

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =5

Define search strings by typing a string name

followed by the string(s)

? Name=p

? String(s)=gaccatg'/ccc'ggg/

? Name=

? Search for all names (y/n) (y) =

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =

? List matches (y/n) (y) =

? The sequence is linear (y/n) (y) = n

? Search for definite matches (y/n) (y) =

Working

Matches found= 2

Name Sequence Position Fragment lengths

1 p gaccatg' 6225 7225 24

2 p ccc'ggg 6249 24 7225

Figure 4.29 Finding the vep values for the reverse primer in a single step.

This gives us the position of the cloning site (6249) and the relative position of the primer site (-24).

4. Almost all readings are assembled automatically in their first pass through the assembly routine. Those that are not can be dealt with in two ways. Either they can be put through assembly again as single named readings (users should type n when asked "Use file of file names"), with the parameters set to allow the reading in. Or they can be entered through the assembly routine using the "Put all readings in new contigs" mode, and then joined to the contig they overlap using the Contig Joining Editor. If it is found that readings are not being assembled in their first pass through the assembler, then it is likely that the contigs require some editing to improve the consensus. Also it may be that poor quality data is being used, possibly by users overinterpreting films or traces. In the long term it can be more efficient to stop reading early and save time on editing. For those using fluorescent sequencing machines the unused data can be incorporated after assembly.

5. Obviously we cannot use a script to operate a program that expects to be controlled by mouse clicks! The program BAP is an xterm version of XBAP which can be used from a script.

6. The "copy database" option allows users to make a backup of their database. At the same time they can also change the database size or the maximum reading length.

7. For those using fluorescent sequencing machines and XBAP the combination of the contig editor and the graphical displays of consensus "quality" will probably be sufficient for checking and editing contigs as everything can be done at the computer screen. For those using autoradiographs the facility to produce printouts of "display" and "highlight disagreements" options for use while checking films, and the autoedit command are most appropriate.

8. In general the quality of a reading deteriorates along the length of the gel and so it is also possible to use a length cutoff for the quality calculation. Only the data from the first section of each reading will be included in the calculation.

9. There are some limitations on the changes that can be made to the contigs when using the SAP screen editor. Alignments must be maintained during editing. Whole lines of sequence should not be deleted or added unless the order of the gel readings in the contig is preserved. Each line in the contig display consists of gel reading numbers, their names and 50 character sections of sequence. Insertions are limited in the following way. No line of sequence can be extended rightwards more than 5 characters beyond the end of a full length line (a full length line is 50 characters long). Only one character can be added to the left end of full length lines, but sections of sequence beginning further into a line can be extended leftwards up to an equivalent position. Do not delete any non-sequence lines in the file. Before returning the contig to the database the program checks that the rules have been obeyed. If an error is found the number of the erroneous line in the file is displayed and the contig will not be changed.

10. The following is a justification for using the auto edit function. The general strategy employed when collecting shotgun sequence data is to keep sequencing until the redundancy in the contigs is fairly high, and then to get a printout of a contig, check problems against the films, note corrections on the printout, and make the changes using an interactive editor. In general the consensus is correct except for places where padding characters have been used to accommodate a single gel with an extra character, or where the consensus is dash. The important point for the auto editor is that most edits simply make the gel readings conform to the consensus, or remove columns of pads. The auto editor does the following.

1) It calculates a consensus for the contig (or part of a contig) to be edited, and then uses this consensus to direct the editing of the contig in 3 stages.

2) Stage 1: find and correct all places where, if the order of two adjacent characters is swapped, they will both agree with the consensus (given that they did not match the consensus before). These corrections are termed "transpositions".

3) Stage 2: find and correct all places where there is a definite consensus but the gel reading has a different character. These corrections are termed "changes".

4) Stage 3: delete all positions in which the consensus is a padding character. These corrections are termed "deletions".

All changed characters are shown in uppercase letters so it will be obvious which characters have been assigned by the program (except for deletions). The number of each type of correction will be displayed.
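The three stages can be sketched in Python as follows. This is a simplified illustration and not the program's code: it assumes the readings have already been aligned and padded with spaces to the length of the consensus, it uses a single consensus rather than treating each strand separately, and the function name auto_edit and the example data are hypothetical.

```python
def auto_edit(readings, consensus, pad="*", unresolved="-"):
    """Apply transpositions, changes and deletions; return readings and counts."""
    counts = {"transpositions": 0, "changes": 0, "deletions": 0}
    edited = []
    for reading in readings:
        chars = list(reading)
        # Stage 1: swap adjacent characters if both then match the consensus.
        i = 0
        while i < len(chars) - 1:
            a, b = chars[i].lower(), chars[i + 1].lower()
            ca, cb = consensus[i], consensus[i + 1]
            if a != ca and b != cb and a == cb and b == ca:
                chars[i], chars[i + 1] = b.upper(), a.upper()
                counts["transpositions"] += 1
                i += 2
            else:
                i += 1
        # Stage 2: where the consensus is definite, make the reading agree.
        for i, c in enumerate(consensus):
            if c != unresolved and chars[i] != " " and chars[i].lower() != c:
                chars[i] = c.upper()
                counts["changes"] += 1
        edited.append(chars)
    # Stage 3: delete every column in which the consensus is a pad.
    keep = [i for i, c in enumerate(consensus) if c != pad]
    counts["deletions"] = len(consensus) - len(keep)
    return ["".join(chars[i] for i in keep) for chars in edited], counts


reads = ["agc*t-a", "acgat-a", "acg*tga", "tcg*tca"]
new_reads, summary = auto_edit(reads, "acg*t-a")
print(new_reads)
print(summary)
```

In the example the first reading has a transposition, the second has an extra character opposite a pad (changed and then deleted with the pad column), the third differs only where the consensus is unresolved and so is left alone, and the fourth has a single change. Changed characters are written in uppercase, as in figure 4.23.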

11. The "calculate consensus" function, the "display contig" routine, the contig editor and the "show quality" option use the rules outlined here to calculate a consensus from aligned gel readings. The consensus sequence can contain any of 6 possible symbols: a,c,g,t,* and -. The last symbol is assigned if none of the others makes up a sufficient proportion of the aligned characters at any position in the contig. The following calculation is used to decide which symbol to place in the consensus at each position. Each uncertainty code contributes a score to one of a,c,g,t,* and also to the total at each point. Symbols like r and y which don't correspond to a single base type contribute only to the total at each point.

Definite assignments, i.e. A,C,G,T,a,c,g,t,b,d,h,v,k,l,m,n,* = 1

Probable assignments, i.e. 1,2,3,4 = 0.75

Other uncertainty codes, including r,y,5,6,7,8,- = 0.1

A cutoff score between 1 and 100% is set by the user (when the program starts this is set to 1%). At each position in the contig we calculate the total score for each of the 5 symbols a,c,g,t and * (denote these by Xi, where i=a,c,g,t or *), and also the sum of these totals (denote this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed in the consensus; otherwise - is assigned. However if the cutoff is <51% and the highest score is equalled by more than one base the algorithm will still assign one of these bases to the consensus. For the "examine quality" algorithm each strand is treated separately but the calculation is the same.
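A minimal Python sketch of this calculation is given below. It is an illustration under stated assumptions rather than the program's code. Only a,c,g,t and * are scored by default, because the mapping of the other definite and probable codes onto bases is not given in this note; callers who know that mapping can pass it in through the hypothetical codes argument. We also assume that S includes the contributions of symbols that score for no single base, and that a space marks a position not covered by a reading.

```python
# symbol -> (consensus symbol it supports, weight); extend with care.
DEFAULT_CODES = {base: (base, 1.0) for base in "acgt*"}


def consensus_symbol(column, cutoff=1.0, codes=DEFAULT_CODES, other_weight=0.1):
    """Return the consensus symbol for one column of aligned characters."""
    scores = dict.fromkeys("acgt*", 0.0)
    total = 0.0
    for symbol in column:
        symbol = symbol.lower()
        if symbol == " ":
            continue  # reading does not cover this position
        if symbol in codes:
            base, weight = codes[symbol]
            scores[base] += weight
            total += weight
        else:
            # Codes such as r and y contribute only to the total.
            total += other_weight
    if total == 0:
        return "-"
    best = max(scores, key=scores.get)  # ties: one of the equal bases
    return best if 100.0 * scores[best] / total > cutoff else "-"


def consensus(readings, cutoff=1.0):
    """Consensus for equal-length aligned readings, column by column."""
    return "".join(consensus_symbol(col, cutoff) for col in zip(*readings))


print(consensus(["acg*", "acgt", "acg*"], cutoff=60))
```

Note that the comparison is strictly "greater than", so a column of three a and one c gives a with a cutoff of 70% but a dash with a cutoff of 75%.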

12. Databases can become corrupted if the machine crashes so the programs contain a function "Check database for logical consistency" which checks to see if all the relational data is internally consistent. Some routines automatically perform this check before they start. Users are advised to make frequent copies of their databases using the "Copy database" option. Note that if BAP is used in "execute with dialogue" mode the "Check logical consistency" function also creates a consensus for the whole database and scans it to find any regions which contain 15 dashes in 20 characters. Such a finding would indicate problems with the database.
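The scan of the consensus for dash-rich regions can be illustrated with a short Python sketch. We assume that "15 dashes in 20 characters" means at least 15 dashes in any window of 20 consecutive consensus characters; the function name and the test sequence are hypothetical and this is not the BAP code.

```python
def dash_rich_regions(consensus, window=20, min_dashes=15):
    """Return 1-based start positions of windows with too many dashes."""
    return [
        start + 1
        for start in range(len(consensus) - window + 1)
        if consensus[start:start + window].count("-") >= min_dashes
    ]


suspect = "a" * 10 + "-" * 15 + "c" * 10
print(dash_rich_regions(suspect))
print(dash_rich_regions("acgt" * 10))
```

An empty list means that no region of the consensus reached the threshold.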

13. We have covered many of the most important or complicated operations performed by SAP and XBAP, but several others have not been mentioned. These include those for creation of consensus sequence files for processing by other programs, and complementing contigs, both of which are trivial. There is also a set of routines for fixing corrupted databases.

14. The VAX version of SAP will only allow one person to access a sequencing database at a time, producing an "unable to open database" error message if a second person tries. On UNIX machines there is no such check in program SAP so users need to make sure that simultaneous use does not occur. Otherwise the data will be corrupted. Program BAP prevents more than one person from using a database at any time. It does so using the following mechanism. When a user requests to open a particular copy (say 0) of a database (say DB) the program checks for the existence of a file named DB_BUSY0 in the current directory. In normal circumstances, if the file exists, it indicates that somebody else is currently using the database and the program displays the message "Sorry database busy" and does not open the files. If the file does not exist the program creates it and opens the database. When a user stops using the database (usually by quitting the program) the "busy file" is deleted, hence allowing others to use the database. If the program terminates abnormally the busy file will not be deleted and so the database will not be usable until the busy file is explicitly deleted using the rm command. Obviously it is dangerous to delete the file before checking if another user is using the database.
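The busy file mechanism can be sketched in Python as shown below. This is an illustration only, with hypothetical function names. Note one deliberate difference from the description above: instead of checking for the file and then creating it as two steps, the sketch creates the file with an exclusive flag so that the check and the creation are a single operation. We do not know how BAP itself performs this step.

```python
import os
from contextlib import contextmanager


def busy_file_name(database, version):
    """Name of the busy file, e.g. DB_BUSY0 for copy 0 of database DB."""
    return f"{database}_BUSY{version}"


@contextmanager
def database_lock(database, version, directory="."):
    """Hold the busy file while the database is in use."""
    path = os.path.join(directory, busy_file_name(database, version))
    try:
        # O_EXCL makes creation fail if the busy file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("Sorry database busy") from None
    os.close(fd)
    try:
        yield path
    finally:
        # Normal exit (or a Python exception) deletes the busy file.
        os.remove(path)
```

A caller would wrap all database access in "with database_lock("DB", 0):". If the process is killed outright the busy file is left behind, exactly as described above, and must be removed by hand once it is certain that nobody else is using the database.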

15. After a run of the assembly routine, the names of the readings that fail to be entered are written to a file of failed reading names. Each is given a failure code. Reasons for failure and codes are as follows: 1. the reading file was not found (0); 2. the reading file was too short (less than the minimum match length) (1); 3. the reading appeared to match somewhere but failed to align sufficiently well (too many padding characters or too high a percentage mismatch) (2); 4. a reading of the same name was already present in the database (3).

16. We have recently devised our own file format (called SCF) for storing traces, sequences and confidence values for data produced by automated sequence readers (Dear and Staden, 1992). For ABI data these typically reduce the storage required to 30% of the original. Data from the ABI 373A and the Pharmacia A.L.F. can be converted to this form using the program makeSCF. Note that A.L.F. files must first be processed by program alfsplit which splits the original data into one file per reading. Sequences can be extracted from SCF files in a form suitable for assembly by use of the program trace2seq. To locate and mark regions of a sequence from an automated sequence reader that are of too low a quality to be used for assembly we use the script clip-seqs. This script takes as input a file of reading file names. For each reading it renames the original file "original-filename~" and writes a new file called "original-filename" in which the poor quality regions are marked.

makeSCF is used for creating SCF files from ABI 373A or A.L.F. trace files. A formal definition of the user interface is very simple, as is shown below.

makeSCF [(-ABI | -ALF) {tracefilename} -output {outputfilename}]

For example to make an SCF file called fred.SCF from an ABI file called fred.trace the user would type:

makeSCF -ABI fred.trace -output fred.SCF

17. The oligo selection engine is the one used in the program OSP (6). The parameters controlling the selection of oligos can be changed in the "Oligo Selection Parameters" window. The weights controlling the scoring of selected oligos can be changed in the "Oligo Selection Weights" window. By default, the oligos are selected from a window that extends 40 bases either side of the cursor. The size and location of this window relative to the cursor position can be changed in the "Parameters" window. In XBAP oligos are ranked according to their proximity to the cursor position, rather than by their scores.

18. For simplicity, each reading is considered to represent a template. In practice, many readings can be made off the same template. Suitable templates that are identified are those that: 1. are in the appropriate sense, 2. have 5' ends that start upstream of the oligo, and 3. are sufficiently close to the oligo to be useful. This last criterion relates to the insert size for the subclones used for sequencing and the average reading length. A template is considered useful if a full reading can be made from it, taking into account both of these factors. The default insert size is 1000 bases, and the default average reading length is 400 bases. These values can be changed in the "Parameters" window.
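One possible reading of these three criteria is sketched below in Python. It is our interpretation and not the XBAP code: we assume positions are measured along the strand of the oligo, that the template insert extends insert_size bases from the 5' end of the reading, and that a "full reading" means read_length bases starting at the oligo must fit inside that insert. All names are hypothetical.

```python
def is_suitable_template(template_start, template_sense, oligo_pos, oligo_sense,
                         insert_size=1000, read_length=400):
    """Apply the three criteria of the note under the stated assumptions."""
    if template_sense != oligo_sense:
        return False  # 1. must be in the appropriate sense
    if template_start >= oligo_pos:
        return False  # 2. 5' end must start upstream of the oligo
    # 3. a full reading from the oligo must fit within the insert
    return oligo_pos + read_length <= template_start + insert_size


print(is_suitable_template(100, "+", 300, "+"))
print(is_suitable_template(100, "+", 800, "+"))
```

With the default values, a template whose reading starts at position 100 is usable for an oligo at position 300 but not for one at position 800, because a 400 base reading from 800 would run past the assumed end of the 1000 base insert.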

19. The consensus calculation routine is used to create a consensus file that can be analysed by other programs. It can create a consensus for single contigs or for all contigs. It can produce its file in Staden format or FASTA format. A special mode of operation produces a consensus identical to the standard one except that all places that are judged to be well determined on both strands of the sequence are written as dashes (-), and all "single stranded" regions contain the appropriate A,C,G or T. If a large scale project is sequencing several overlapping cosmids it is important to reduce the duplication of effort for the overlapping regions as much as possible. This special form of consensus can be used in the following way. All the readings from neighbouring cosmids can be compared with the consensus using the vector screening option of BAP. The ones that match will correspond to regions that require extra data and so they can be assembled into the database where they will provide useful coverage. A script for this purpose called STEALDATA is included in the package.

20. The Alu screening program REP works in the following way. Each reading is compared in both orientations with the complete contents of a library of Alu sequences obtained from J. Jurka (Jurka et al., 1992). Although the underlying comparison is fast, because the program always compares against all the Alu sequences (126 of them at present) the overall running time is slow. In the future we hope to either reduce the number of Alu sequences in the library, or use a different type of search in which the whole family is represented by a single pattern or model.

21. Our methods for assembling Alu containing sequences are experimental. We apply REP to produce a list of Alu containing readings and a list of Alu free readings. We sort these lists using the UNIX sort function prior to assembly. The Alu free list (with file name pass) is sorted into ascending order of Alu match score using "sort -n +1 -o pass.sort pass" to produce the file pass.sort; and the Alu list (with file name fail) is sorted into descending order on the amount of non-Alu sequence at either end of the readings using "sort -n -r +3 -o fail.sort fail" to produce the file fail.sort. The Alu free data is assembled first using the file of file names pass.sort. Then the Alu containing data is assembled using the file of file names fail.sort. It is useful to assemble at high stringency (i.e. using a low percentage mismatch), to screen the full length of each reading prior to assembly, and to sort the resulting file into ascending order on percentage mismatch. It is also useful to use the "check assembly" function. An independent and important check is provided by sequencing both ends of each template. The program can plot out the positions of these read pairs and hence show any inconsistencies in their locations in the contigs.
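For readers without the older "+N" form of the sort command, the two orderings can be sketched in Python. The layout of the pass and fail files is not described here, so we assume whitespace separated fields with a numeric value in the field used as the key: "+1" skips one field, so the key starts at the second field (index 1), and "+3" starts at the fourth (index 3). The function name and the example lines are hypothetical.

```python
def sort_by_field(lines, field, descending=False):
    """Sort lines numerically on one whitespace separated field (0-based)."""
    return sorted(
        lines,
        key=lambda line: float(line.split()[field]),
        reverse=descending,
    )


# Hypothetical contents standing in for the files "pass" and "fail".
pass_lines = ["r1.gel 12", "r2.gel 3", "r3.gel 7"]
fail_lines = ["r4.gel 80 10 40", "r5.gel 75 5 120"]

print(sort_by_field(pass_lines, 1))                   # like sort -n +1
print(sort_by_field(fail_lines, 3, descending=True))  # like sort -n -r +3
```

The sorted lists would then be written out as pass.sort and fail.sort and used as the files of file names for the two assembly runs.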

4. References

1. Staden, R. 1982. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucl. Acids Res. 10 (15):4731-4751.

2. Staden, R. 1990. An improved sequence handling package that runs on the Apple Macintosh. Comput. Applic. Biosci. 4, 387-393.

3. Dear, S. and Staden, R. 1991. A sequence assembly and editing program for efficient management of large projects. Nucl. Acids Res. 19, 3907-3911.

4. Gleeson, T. and Hillier, L. 1991. A trace display and editing program for data from fluorescence based sequencing machines. Nucl. Acids Res. 19, 6481-6483.

5. Dear, S. and Staden, R. 1992. A standard file format for data from DNA sequencing instruments. DNA Sequence 3, 107-110.

6. Hillier, L. and Green, P. 1991. OSP: an oligonucleotide selection program. PCR Methods and Applications 1, 124-128.

7. Jurka, J., Walichiewicz, J. and Milosavljevic, A. 1992. Prototypic sequences for human repetitive DNA. J. Mol. Evol. 35, 286-291.

5. Analysing Sequences to Find Genes

Table of contents

1. Introduction

2. Methods

2.1 The uneven positional base frequencies method.

2.2 The positional base preferences method

2.3 The codon usage method

2.4 Searching for open reading frames

2.5 Searching for tRNA genes

3. Notes

1. Introduction

We outline three methods for finding protein genes and one for locating tRNA genes, plus routines for finding open reading frames and displaying the positions of stop codons. All the methods are contained in the program NIP. The correct interpretation of the analyses presented requires a good understanding of the underlying ideas used by the methods. Despite this we concentrate here on the use of the techniques and refer the reader to earlier publications (1-5) for more background information.

The assumption made by the methods for finding protein genes is that protein coding regions, when analysed in terms of 3 letter nonoverlapping "words", will look different to noncoding regions analysed in the same way. Suppose we analyse a sequence in one reading frame and count its codons. Then we define the "positional base composition" as the frequency at which each of the four base types occupies each of the three positions in codons. In coding regions the positional base frequencies will be less random than they are in noncoding regions. This is the basis of method 1: the "Uneven positional base frequencies method". If this reading frame is coding for a protein the positional base composition will tend towards a particular bias which is common to the majority of genes. This is the basis of method 2, the "Positional base preferences method". If the sequence has a very biased base composition then in protein genes this may affect the choice of amino acids, and will affect the use of bases in the third positions of codons. This bias is also utilised by the positional base preferences method. Finally if the reading frame is coding for a protein its use of codons is also likely to be nonrandom and this is the basis of method 3, the "Codon usage method".
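To make the definition of positional base composition concrete, here is a minimal Python sketch that counts, for one reading frame, how often each base type occupies each of the three codon positions. It illustrates the definition only and is not the NIP implementation; the function name is hypothetical and codons containing symbols other than a,c,g,t are simply skipped.

```python
def positional_base_composition(seq, frame=0):
    """Frequencies of a,c,g,t at codon positions 1-3 in one reading frame."""
    seq = seq.lower()
    counts = [dict.fromkeys("acgt", 0) for _ in range(3)]
    n_codons = 0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if all(base in "acgt" for base in codon):
            for position, base in enumerate(codon):
                counts[position][base] += 1
            n_codons += 1
    if n_codons == 0:
        return counts
    return [{b: c / n_codons for b, c in pos.items()} for pos in counts]


for position, freqs in enumerate(positional_base_composition("atgatgatg"), 1):
    print(position, freqs)
```

For the toy sequence in the usage lines every codon in frame 0 is atg, so position 1 is all a, position 2 all t and position 3 all g; a real noncoding region would give frequencies much closer to the overall base composition at all three positions.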

All the methods perform their analyses over segments of the sequence of size "window", and then move the window on by three bases and repeat the calculation. The "Uneven positional base frequencies" method only produces a single value for each segment and hence cannot distinguish between frames or strands: it only measures the probability that a region is coding and nothing more. The other two methods produce different values for each of the three potential reading frames and hence can help to decide which is coding. Their results are plotted in three separate boxes arranged one above the other. For these we also indicate which of the three reading frames is the highest scoring at each position along the sequence. This is done by plotting a single dot at the mid-height of the box that contains the highest score, so that if one frame is the highest scoring for many consecutive positions, the dots will produce a solid line at the mid-height of its box. We also mark the positions of stop codons. These are represented by short vertical lines and are positioned so that they bisect the mid-height of each box. Start codons are marked at the base of the box for each reading frame.
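The stop codon marks in the three boxes correspond to a simple scan of the sequence, sketched here in Python. The sketch assumes the three stop codons of the standard genetic code (taa, tag, tga) and numbers frames 1 to 3 by the position of the first base of the codon; the function name is hypothetical and NIP may offer other genetic codes.

```python
STOP_CODONS = {"taa", "tag", "tga"}  # standard genetic code assumed


def stop_codon_positions(seq):
    """Return {frame: [1-based positions]} of stop codons on the given strand."""
    seq = seq.lower()
    frames = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            frames[i % 3 + 1].append(i + 1)
    return frames


print(stop_codon_positions("atgtaaatgatag"))
```

A long stretch of one frame with no entries in its list is an open reading frame, which is why the stop codon marks are useful alongside the coding scores.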

The search for tRNA genes involves looking for segments that could fold into the cloverleaf structure and which have the expected conserved bases in the appropriate positions.

Notice that we have not mentioned searches for relevant "signals" like promoters or splice junctions which are also useful for finding genes. These searches are described in the chapter on searching for motifs. In the current chapter the only "signal" we include is the stop codon. However as all results are presented graphically it is easy for users to overlay the displays of signal searches with those presented here and so effectively combine them.

2. Methods

2.1 The uneven positional base frequencies method.

This method produces a single value for each segment of the sequence, and would give the same result if applied to each reading frame or to the complementary strand. The results are plotted in a box that is cut by a horizontal line. This line is labelled 76% and we expect 76% of noncoding sequences to score below this line and 76% of coding sequence to score above it. Of the methods described this one makes the fewest assumptions and so is a good unbiased indicator of the probability that a sequence is coding.

1. Select "Uneven positional base frequencies".

2. Define "Odd window length".

3. Define "Plot interval". The plot will appear as in figure 5.1. In the example shown the 5' end of the sequence codes for several proteins and the 3' end codes for ribosomal RNAs.

Figure 5.1. Example output from the uneven positional base frequencies method. The 5' end codes for proteins and the 3' end contains ribosomal RNA genes.

2.2 The positional base preferences method

As a result of the genetic code and the relative frequencies with which amino acids are used in proteins, DNA sequences coding for proteins have a particular bias in their positional base frequencies. This method scans DNA sequences and measures how closely each reading frame matches this bias. The closeness to the expected bias is expressed as a "score". By default the program will use a "global" set of expected values for the positional base frequencies which are derived from average amino acid compositions in known proteins. Alternatively users may create their own set of expected values by analysing known genes from the same genome. In addition users can combine the "global" values for the first two positions in codons with third position values derived from other genes of the same genome.

In order to use a nonglobal standard, a codon table in the format described in the chapter on statistical analysis of nucleic acid sequences can be created using the method "Creating a codon usage file". Alternatively a section of the sequence being analysed can be scanned to produce an internal standard. The method is particularly useful for selecting which reading frame is coding.

2.2.1 Using the global standard
1. Select "Positional base preferences method".

2. Select "Standard source" as "Global".

3. Define "Window length". The default length of 67 should be used for most cases. Shorter windows give noisier plots and the longer the window the more chance there is of missing a short exon.

4. Define "Plot interval". The plot will appear as in figure 5.2. This shows a 10,000 base section of sequence that codes for several proteins in each of the three reading frames. See the introduction for an explanation of the plotting scheme used.

Figure 5.2 Example output from the positional base preferences method. Most of the sequence is coding for proteins.

2.2.2 Using a nonglobal standard
1. Make an appropriate codon usage file as described in the chapter on statistical analysis of nucleotide sequences.

2. Select "Positional base preferences method".

3. Select "Standard source" as "Codon usage table".

4. Define "File name of standard". The file will be read and displayed on the screen.

5. Select "Normalisation" as "Combine with global standard". This choice means we will use the global values for the first two positions of codons combined with the third position values from our codon table. The alternative ("Use observed frequencies") uses all three positions from our codon table. The positional base frequencies to be used will be displayed.

6. Accept "Use 1.0 for positional weights". The alternative allows users to give greater or lesser emphasis to any of the three positions by defining weights for each. The program displays the "Expected scores per codon in each frame".

7. Define "Window length". Windows shorter than the default of 67 may be useful if the bias is sufficiently strong. Look at the "Expected scores in each frame" to help decide.

8. Define "Plot interval".

9. Accept "Plot relative scores". This means that for each frame we plot its score divided by the sum of the scores for all three frames. It produces smoother plots than the alternative "Plot absolute scores" which simply plots the scores for each frame. The minimum and maximum expected scores for the given standard and window length are displayed.

10. Accept "Leave scaling values unchanged". The expected scores just displayed will be used to scale the plots. If required the user can change the scaling values at this point.

The plot will now appear as in figure 5.2. Typical dialogue is shown in figure 5.3.
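The "Plot relative scores" option in step 9 amounts to a simple normalisation, sketched below. The function names are hypothetical and the code is only an illustration of the arithmetic described in the step, not the program's implementation.

```python
def relative_scores(frame_scores):
    """Divide each frame's score by the sum of the scores for all three frames."""
    total = sum(frame_scores)
    if total == 0:
        raise ValueError("scores sum to zero, so relative scores are undefined")
    return [score / total for score in frame_scores]


def best_frame(frame_scores):
    """Return the number (1, 2 or 3) of the highest scoring frame."""
    return max(range(len(frame_scores)), key=lambda i: frame_scores[i]) + 1
```

Relative scores always sum to one, so a frame scoring close to one third is indistinguishable from the other two at that position.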

2.3 The codon usage method

The codon usage method scans along a sequence and measures the closeness of each reading frame's codon composition to an expected set of codons. Of the methods described it is the most sensitive, but consequently has to make the strongest assumption, namely that we know the approximate codon usage for the genes being searched for. The codon usage will depend on the codon preferences and the amino acid composition of the protein product. For this reason the program contains three methods of "normalisation". The table of codon usage may be used as read ("Observed frequencies"); the table may be transformed to reflect an average amino acid composition ("Normalise to average amino acid composition"); or the table may be transformed to have no amino acid bias ("Normalise to no amino acid bias"). The table can be read from a file produced by "Creating a codon usage file" as described in the chapter on statistical analysis of nucleic acid sequences, or an "internal standard" can be used by the user defining a region of the current sequence. In the latter case the program will calculate the codon usage for the defined region.

1. Select "Codon usage method".

2. Reject "Define internal standard". If an internal standard is used the program will ask for the end points of the segments over which to calculate the codon usage.

3. Define "File name of standard". The file will be read and displayed on the screen.

4. Select "Normalisation" as "Average amino acid composition". The program will display the expected values for each reading frame for the window lengths 21, 31 and 41 codons.

5. Select "Window length".

6. Select "Plot interval". The plot will appear as in figure 5.4. This shows a 10,000 base section of sequence that codes for several proteins in each of the three reading frames. See the introduction for an explanation of the plotting scheme used.

Positional base preferences method to find protein genes

Select standard source

X 1 Use global standard

2 Use internal standard

3 Use codon usage table

? Selection (1-3) (1) =3

? File name of standard=atpase.cods

===========================================

F TTT 21. S TCT 33. Y TAT 15. C TGT 5.

F TTC 55. S TCC 40. Y TAC 40. C TGC 4.

L TTA 8. S TCA 7. * TAA 8. * TGA 0.

L TTG 19. S TCG 12. * TAG 1. W TGG 17.

===========================================

L CTT 22. P CCT 17. H CAT 6. R CGT 73.

L CTC 21. P CCC 4. H CAC 30. R CGC 23.

L CTA 1. P CCA 10. Q CAA 19. R CGA 5.

L CTG 168. P CCG 48. Q CAG 80. R CGG 3.

===========================================

I ATT 47. T ACT 14. N AAT 17. S AGT 8.

I ATC 98. T ACC 54. N AAC 52. S AGC 26.

I ATA 6. T ACA 7. K AAA 85. R AGA 0.

M ATG 75. T ACG 13. K AAG 28. R AGG 0.

===========================================

V GTT 67. A GCT 56. D GAT 41. G GGT 90.

V GTC 29. A GCC 53. D GAC 66. G GGC 66.

V GTA 49. A GCA 59. E GAA 101. G GGA 5.

V GTG 57. A GCG 64. E GAG 41. G GGG 8.

===========================================

Select normalisation

X 1 Use observed frequencies

2 Combine with global standard

? Selection (1-2) (1) =2

T C A G Range

1 0.177 0.211 0.277 0.336 0.159

2 0.271 0.238 0.310 0.182 0.128

3 0.242 0.301 0.168 0.289 0.132

? Use 1.0 for positional weights (y/n) (y) =

Expected scores per codon in each frame

0.785 0.736 0.736

? odd span length (31-101) (67) =

? plot interval (1-11) (5) =

? Plot relative scores (y/n) (y) =

Minimum maximum range

0.3219 0.3519 0.0214

? Leave scaling values unchanged (y/n) (y) =

Figure 5.3 Typical dialogue from the "Positional base preferences method" using a nonglobal standard in the form of a codon table to specify the values for the third positions in codons.
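The "Expected scores per codon in each frame" shown in figure 5.3 can be approximately reproduced from the table of positional base frequencies in the same dialogue. The sketch below rests on an assumption that is ours and is not stated in the text: that a codon's score is the sum of the expected frequencies of the bases found at its three positions. The function name `expected_score` is hypothetical.

```python
# Positional base frequencies copied from the dialogue in figure 5.3
# (rows are codon positions 1 to 3).
FREQS = [
    {"T": 0.177, "C": 0.211, "A": 0.277, "G": 0.336},
    {"T": 0.271, "C": 0.238, "A": 0.310, "G": 0.182},
    {"T": 0.242, "C": 0.301, "A": 0.168, "G": 0.289},
]


def expected_score(freqs, offset):
    """Expected score per codon when reading `offset` bases out of frame.

    Assumption for this sketch: the score of a codon is the sum, over its
    three positions, of the expected frequency of the base observed there.
    """
    total = 0.0
    for pos in range(3):
        actual = freqs[pos]                  # bases really present at this position
        assumed = freqs[(pos + offset) % 3]  # table row the scorer would look up
        total += sum(actual[base] * assumed[base] for base in "TCAG")
    return total
```

With these frequencies the in-frame value comes to about 0.786 and both out-of-frame values to about 0.737, close to the 0.785, 0.736 and 0.736 displayed by the program. The small differences are presumably due to the rounding of the displayed frequencies.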

Figure 5.4 Example output from the codon usage method. Most of the sequence is coding for proteins.

2.4 Searching for open reading frames

This routine finds all open reading frames of some minimum length and writes its results in the form of an EMBL feature table.

1. Select "Find open reading frames".

2. Define "Minimum open frame in amino acids".

3. Select "Strands". The alternatives are: + strand only, - strand only, or both strands. Typical output is shown in figure 5.5.

FT CDS 525..965

FT CDS 956..1789

FT CDS 2128..2607

FT CDS 2604..3155

FT CDS 3159..4709

FT CDS 4733..5623

FT CDS 5539..7032

FT CDS 7044..7454

FT CDS 7797..8134

FT CDS complement(2227..2634)

FT CDS complement(2250..3023)

FT CDS complement(3027..3899)

FT CDS complement(3903..4760)

FT CDS complement(4327..4626)

FT CDS complement(4646..5332)

FT CDS complement(5345..5647)

FT CDS complement(5635..6012)

FT CDS complement(6016..6441)

FT CDS complement(6445..7083)

FT CDS complement(7035..7445)

FT CDS complement(7406..7777)

Figure 5.5 Typical output from "Find open reading frames"
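Output of the kind shown in figure 5.5 could be produced along the following lines. This is a minimal sketch under stated assumptions, not the program's algorithm: an open frame is taken to be any run of complete codons containing no stop codon (no start codon is required), the reported span excludes the terminating stop codon, and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def stop_free_spans(seq, min_codons):
    """Return sorted 1-based inclusive spans of stop-free codon runs in the 3 frames."""
    seq = seq.upper()
    spans = []
    for frame in range(3):
        # Index just past the last complete codon in this frame.
        last = frame + ((len(seq) - frame) // 3) * 3
        run_start = frame
        for i in range(frame, last + 1, 3):
            if i >= last or seq[i:i + 3] in STOPS:
                if (i - run_start) // 3 >= min_codons:
                    spans.append((run_start + 1, i))
                run_start = i + 3
    return sorted(spans)


def feature_lines(seq, min_codons):
    """Format open frames on both strands as feature table CDS lines."""
    seq = seq.upper()
    n = len(seq)
    lines = [f"FT CDS {s}..{e}" for s, e in stop_free_spans(seq, min_codons)]
    reverse = seq.translate(COMPLEMENT)[::-1]
    # Convert positions on the reverse strand back to forward-strand numbering.
    back = sorted((n - e + 1, n - s + 1) for s, e in stop_free_spans(reverse, min_codons))
    lines += [f"FT CDS complement({s}..{e})" for s, e in back]
    return lines
```

Because no start codon is required, frames on the same strand may overlap, as several of the entries in figure 5.5 do.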

2.5 Searching for tRNA genes

tRNA genes have two classes of feature that can be used to locate them in genomic sequences: their ability to fold into the cloverleaf secondary structure, and the presence of specific "conserved" bases at particular positions relative to this structure. The level of congruence with the canonical structure is quite variable: some tRNA genes contain intervening sequences and others, particularly those from organelles, have few of the conserved bases. The program searches for potential cloverleaf forming structures and optionally the presence of conserved bases. The user can define the range of loop sizes, the minimum numbers of potential base pairs, a range of intron sizes, and which, if any, of the conserved bases should be present. The results are presented either textually or graphically.

1. Select "tRNA search".

2. Define "Maximum tRNA length".

3. Define "Aminoacyl stem score". See note 8.

4. Define "Tu stem score".

5. Define "Anticodon stem score".

6. Define "D stem score".

7. Define "Minimum base pairing total".

8. Define "Minimum intron length".

9. Define "Maximum intron length".

10. Define "Minimum length for TU loop".

11. Define "Maximum length for TU loop".

12. Accept "Skip search for conserved bases". See notes section.

13. Reject "Plot results". This gives listed output in which the potential cloverleafs are displayed. The alternative plotted output simply draws a vertical line to represent the score for the potential gene, at the position it has been found. Typical dialogue and the beginning of some listed output is shown in figure 5.6.

3. Notes

1. In general, for finding protein genes, we recommend the use of all the methods. The "Uneven positional base frequencies" method can show which regions are likely to be coding but not which strand or frame. The "Positional base preferences" method can show the correct frame and also help to find which regions are coding. The "Codon usage" method has the greatest resolution, having been used successfully with windows of 11 codons, and can help to find small exons and to pinpoint exon/intron boundaries.

2. When the "Uneven positional base frequencies" calculation was applied to all the sequences in the 1984 version of the EMBL library 14% of noncoding segments failed to reach the value represented by the base of the box, whereas all coding segments did. The top value of the box was not reached by any noncoding segments but was exceeded by 16% of coding sequences. 76% of noncoding segments failed to reach the line labelled 76% but 76% of coding segments fell above it. We would not expect this result to change significantly if it were to be recalculated on the current libraries.

3. When the "Positional base preferences" method, using "global" values, was applied to all the E. coli genes in the 1984 version of the EMBL library it chose the correct reading frame for 91% of coding segments. E. coli sequences were used for technical rather than scientific reasons and we have no reason to believe that other organisms should give significantly different results. This result used only the values for the first two positions in codons and so for genes with a strongly biased base composition we would expect even better discrimination.

tRNA search

? Maximum trna length (70-130) (92) =

? Aminoacyl stem score (0-14) (11) =

? Tu stem score (0-10) (8) =

? Anticodon stem score (0-10) (8) =

? D stem score (0-8) (3) =

? Minimum base pairing total (30-44) (30) =

? Minimum intron length (0-30) (0) =

? Maximum intron length (0-30) (0) =

? Minimum length for TU loop (4-12) (6) =

? Maximum length for TU loop (6-12) (9) =

? Skip search for conserved bases (y/n) (y) =n

Give a score for each base, then a minimum total at the end

? Base 8, T is 100% conserved. Score (0-100) (0) =

? Base 10, G is 95% conserved. Score (0-100) (0) =

? Base 11, Y is 96% conserved. Score (0-100) (0) =

? Base 14, A is 100% conserved. Score (0-100) (0) =

? Base 15, R is 100% conserved. Score (0-100) (0) =

? Base 21, A is 97% conserved. Score (0-100) (0) =

? Base 32, Y is 100% conserved. Score (0-100) (0) =

? Base 33, T is 98% conserved. Score (0-100) (0) =

? Base 37, A is 91% conserved. Score (0-100) (0) =

? Base 48, Y is 100% conserved. Score (0-100) (0) =

? Base 53, G is 100% conserved. Score (0-100) (0) =

? Base 54, T is 95% conserved. Score (0-100) (0) =

? Base 55, T is 97% conserved. Score (0-100) (0) =

? Base 56, C is 100% conserved. Score (0-100) (0) =

? Base 57, R is 100% conserved. Score (0-100) (0) =

? Base 58, A is 100% conserved. Score (0-100) (0) =

? Base 60, Y is 92% conserved. Score (0-100) (0) =

? Base 61, C is 100% conserved. Score (0-100) (0) =

? Minimum total conserved base score (0-0) (0) =

? Plot results (y/n) (y) =n

264

t

t-a

c-g

a-t

t+g

a-t

a a

a-t gta

c aacgc

a t !!!! c

cgt gtgcg a

!!! t cga

a gca c

g t g

c aa t

a-t a

t-a t a

t-a

t-a

g t

c g

caa

Figure 5.6 Typical dialogue and textual output from "Find tRNA genes".

4. If the codon table used by the "Codon usage" method is normalised to have average amino acid composition it retains its codon preference bias for each amino acid type but now the amino acid composition is the average of all proteins. In general this is optimal: we have the expected codon preference bias plus an expected amino acid bias. If we normalise to no amino acid bias we are safeguarding ourselves against missing a protein of anomalous composition but at the expense of not employing all of the useful information for distinguishing coding from noncoding.

5. The program also contains a graphical version of Fickett's method (6), except here we use a window to analyse each segment of the sequence rather than giving a single value for each open reading frame. The tables used are those from the original publication.

6. If the results from the "Find open reading frames" option are directed to disk (See the introductory chapter), the file can be used by the routines that use feature tables as input.

7. The program also contains several routines for plotting the positions of stop and start codons for either strand of the sequence. One form of the output is included in figures 5.2 and 5.4.

8. The tRNA gene search uses a simple scoring system for base pairing: A-T and G-C base pairs each score 2 and G-T scores 1. The use of a "Minimum base pairing total" allows low cutoffs to be set for each individual stem while still requiring a reasonable overall level of stability. In this way a low score for one stem can be compensated by a high score in another.

10. The cloverleaf is composed of four base-paired stems and four loops. Three of the stems are of fixed length but the fourth, the dhu stem which usually has four base pairs, sometimes has only three. All of the loops can vary in size. The following relationships between the stems in the cloverleaf are assumed in the program: (a) there are no bases between one end of the aminoacyl stem and the adjoining tuc stem; (b) there are two bases between the aminoacyl stem and the dhu stem; (c) there is one base between the dhu stem and the anticodon stem; (d) there are at least three bases between the anticodon stem and the tuc stem.
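The base pair scoring of note 8 can be illustrated with a short sketch. The pair scores are those given in the note; the function names, the way the two strands of a stem are passed in, and the layout of the cutoffs are assumptions made for illustration only.

```python
# Pair scores from note 8: A-T and G-C score 2, G-T scores 1, anything else 0.
PAIR_SCORE = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(left, right):
    """Score a stem whose two strands are both written 5' to 3'.

    The first base of `left` pairs with the last base of `right`, and so on.
    """
    pairs = zip(left.upper(), reversed(right.upper()))
    return sum(PAIR_SCORE.get(pair, 0) for pair in pairs)


def acceptable(stem_scores, stem_cutoffs, min_total):
    """Every stem must reach its own cutoff and the total must reach min_total."""
    each_ok = all(s >= c for s, c in zip(stem_scores, stem_cutoffs))
    return each_ok and sum(stem_scores) >= min_total
```

With the default cutoffs of figure 5.6 (11, 8, 8 and 3) and a minimum total of 30, a candidate whose stems score exactly the cutoffs is accepted. Raising the minimum total while leaving the stem cutoffs low gives the compensation between stems described in note 8.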

4. References

1. Staden, R. and McLachlan, A.D. 1982. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucl. Acids Res. 10:151-156.

2. Staden, R. 1984. Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucl. Acids Res. 12:551-567.

3. Staden, R. 1985. Computer methods to locate genes and signals in nucleic acid sequences. (in) Genetic Engineering, Principle and Methods, Setlow J.K., Hollaender A., (eds.), 7:67-114, (Plenum Press, New York).

4. Staden, R. 1990. Finding Protein Coding Regions in Genomic Sequences. (in) Methods in Enzymology R.F. Doolittle (ed.), 183:163-180 (Academic Press, New York).

5. Staden, R. 1980. A computer program to search for tRNA genes. Nucl. Acids Res. 8:817-825.

6. Fickett, J.W. 1982. Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10:5303-5318.

6. Searching for Motifs in Nucleic Acid Sequences

Table of contents

1. Introduction

2. Methods

2.1 Searching for percentage matches to consensus sequences

2.2 Searching for consensus sequences using a score matrix

2.3 Using weight matrices for searching nucleotide sequences

2.4 Using "hardwired" motif searches.

3. Notes

4. References

1. Introduction

The program NIP contains several ways of defining and searching for motifs (1-4), and also contains a number of "hardwired" motifs that are already defined and can be selected as separate searches. We describe searches for percentage matches to consensus sequences, the use of score matrices and the creation and use of nucleotide and dinucleotide weight matrices (see note 7). In addition we give details of the "hardwired" motifs available from the program. In another chapter we have covered searches for exact matches to consensus sequences by describing how to find restriction enzyme recognition sequences. When searching for exact matches, percentage matches or using a score matrix the search string or consensus sequence may include IUB redundancy codes. All of the searches produce both listed and graphical output. The listed output displays the matching sequence and its position and the graphical output draws a box to represent the length of the sequence, and plots vertical lines within the box at the positions of matches. The heights of the lines are proportional to the match score (see figure 6.1).

Figure 6.1 Typical graphical output from a motif search. It shows a rectangular box in which each match is identified by a vertical line whose height gives the match score and whose x coordinate indicates the position in the sequence.

2. Methods

2.1 Searching for percentage matches to consensus sequences

1. Select "Find percentage matches".

2. Accept "Type in strings". The alternative allows the string to be extracted from a named file.

3. Reject "Keep picture". This will cause the graphics window to be cleared. The alternative leaves it unchanged.

4. Define "String". Type in the search string. When the program cycles round to this point again the previous string will be offered as a default.

5. Accept "This sense". The alternative directs the program to search for the complement of the string.

6. Define "Percent match". The search is performed, the results are presented graphically (see figure 6.1), the number of matches displayed, and the scores and positions of the top 10 matches displayed.

7. Define the number of matches to "Display". For the number of matches chosen the program will display the search string and matching sequence written one above the other with matching characters indicated by asterisk symbols. The program now cycles round to step 3. See figure 6.2.

Find percentage matches

? Type in string (y/n) (y) =

? Keep picture (y/n) (y) =

? String=AAAATTTT

STRING=AAAATTTT

? This sense (y/n) (y) =

? Percent match (1.00-100.00) (70.00) =

Total scoring positions above 70.000 percent = 41

Scores 7 7 7 7 6 6 6 6 6 6

Positions 428 534 2994 7026 130 191 192 372 427 429

? Display (0-41) (0) =4

428

aaaatatt

***** **

AAAATTTT

1

534

aaagtttt

*** ****

AAAATTTT

1

2994

aaaatttc

*******

AAAATTTT

1

7026

aaaacttt

**** ***

AAAATTTT

1

Figure 6.2 Worked example for the percentage match search
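The search in figure 6.2 can be sketched as a scan that counts identical characters at each position. This is a simplified illustration: it compares characters for plain identity and so does not handle IUB redundancy codes, and the function name `percent_matches` is hypothetical.

```python
def percent_matches(seq, string, percent):
    """Return (position, score) for every position matching at or above `percent`.

    Positions are 1-based and refer to the first base of the match. The score
    is the number of identical characters. IUB redundancy codes are not
    interpreted in this sketch.
    """
    seq, string = seq.upper(), string.upper()
    n = len(string)
    hits = []
    for i in range(len(seq) - n + 1):
        score = sum(a == b for a, b in zip(seq[i:i + n], string))
        if score * 100 >= percent * n:
            hits.append((i + 1, score))
    return hits
```

For the 8 character string of figure 6.2 a cutoff of 70 percent requires at least 6 identical characters, which agrees with the scores of 7 and 6 listed in the dialogue.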

2.2 Searching for consensus sequences using a score matrix

A score matrix gives a score for the alignment of each possible pair of sequence symbols. The matrix used by this program includes all the IUB redundancy codes and gives scores that represent the level of redundancy. The matrix is shown in figure 6.3.

1. Select "Find matches using a score matrix".

2. Accept "Type in strings". The alternative allows the string to be extracted from a named file.

3. Reject "Keep picture". This will cause the graphics window to be cleared. The alternative leaves it unchanged.

4. Define "String". Type in the search string. When the program cycles round to this point again the previous string will be offered as a default.

5. Accept "This sense". The alternative directs the program to search for the complement of the string. The program displays the maximum possible score for the string.

6. Define "Score". The search is performed, the results are presented graphically (see figure 6.1), the number of matches displayed, and the scores and positions of the top 10 matches displayed.

7. Define the number of matches to "Display". For the number of matches chosen the program will display the search string and matching sequence written one above the other with matching characters indicated by asterisk symbols. The program now cycles round to step 3. The dialogue shown in figure 6.2 is almost exactly the same as that for "Searching for consensus sequences using a score matrix".

   T  C  A  G  -  R  Y  W  S  M  K  H  B  V  D  N  ?
T 36  0  0  0  9  0 18 18  0  0 18 12 12  0 12  9  0
C  0 36  0  0  9  0 18  0 18 18  0 12 12 12  0  9  0
A  0  0 36  0  9 18  0 18  0 18  0 12  0 12 12  9  0
G  0  0  0 36  9 18  0  0 18  0 18  0 12 12 12  9  0
-  9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
R  0  0 18 18 18 36  0  9  9  9  9  6  6 12 12 18  0
Y 18 18  0  0 18  0 36  9  9  9  9 12 12  6  6 18  0
W 18  0 18  0 18  9  9 36  0  9  9 12  6  6 12 18  0
S  0 18  0 18 18  9  9  0 36  9  9  6 12 12  6 18  0
M  0 18 18  0 18  9  9  9  9 36  0 12  6 12  6 18  0
K 18  0  0 18 18  9  9  9  9  0 36  6 12  6 12 18  0
H 12 12 12  0 27  6 12 12  6 12  6 36  8  8  8 27  0
B 12 12  0 12 27  6 12  6 12  6 12  8 36  8  8 27  0
V  0 12 12 12 27 12  6  6 12 12  6  8  8 36  8 27  0
D 12  0 12 12 27 12  6 12  6  6 12  8  8  8 36 27  0
N  9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
?  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Figure 6.3 The DNA score matrix using IUB symbols
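To show how such a matrix is applied, the sketch below scores a search string against sequence segments using a small subset of figure 6.3 (the four bases plus R, Y and N). Summing the pairwise scores along the string is our assumption about how the total is formed, and the names `SCORES`, `match_score` and `matrix_search` are hypothetical.

```python
# A subset of figure 6.3: rows are search string symbols, columns are bases.
SCORES = {
    "T": {"T": 36, "C": 0, "A": 0, "G": 0},
    "C": {"T": 0, "C": 36, "A": 0, "G": 0},
    "A": {"T": 0, "C": 0, "A": 36, "G": 0},
    "G": {"T": 0, "C": 0, "A": 0, "G": 36},
    "R": {"T": 0, "C": 0, "A": 18, "G": 18},
    "Y": {"T": 18, "C": 18, "A": 0, "G": 0},
    "N": {"T": 9, "C": 9, "A": 9, "G": 9},
}


def match_score(string, segment):
    """Sum the matrix scores for each aligned pair of symbols (an assumption)."""
    return sum(SCORES[s][b] for s, b in zip(string.upper(), segment.upper()))


def matrix_search(string, seq, min_score):
    """Return (1-based position, score) for segments scoring at least min_score."""
    n = len(string)
    hits = []
    for i in range(len(seq) - n + 1):
        score = match_score(string, seq[i:i + n])
        if score >= min_score:
            hits.append((i + 1, score))
    return hits
```

A redundant symbol contributes less to the total than a specific base, so the string GANTC scores 153 against GATTC where GATTC itself would score 180.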

2.3 Using weight matrices for searching nucleotide sequences

A weight matrix is the most sensitive way of defining a motif. It is a table of values that gives scores for each base type in each position along a motif. For a motif of length 8 bases the weight matrix would be a table 8 positions long and 4 deep. The simplest way of choosing the values for the table is to take an alignment of all known examples of the motif and to count the frequency of occurrence of each base type at each position. These frequencies can be used as the table of weights. When the table is used to search a new sequence the program calculates a score for each position along the sequence by adding or multiplying (see note 6) the relevant values in the table. All positions that exceed some cutoff score are reported as matching the original set of motifs.

How can we select a suitable cutoff score? The simplest way is to apply the weight matrix to all the known occurrences of the motif - i.e. the set of sequence segments used to create the table - and to see what scores they achieve. The cutoff can be selected accordingly. For convenience the weight matrix is stored as a file along with its cutoff score, a title that is displayed when the file is read, and a few other values needed by the program. A routine for creating weight matrix files from sets of aligned sequences is included in the program. When a search using the weight matrix is performed the program will either list the matching sequence segments or plot their positions as for the other motif search methods.
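The procedure just described (count base frequencies in an alignment, score segments, choose a cutoff from the scores of the known examples) can be sketched as follows. Several details are assumptions made for this illustration and are not taken from the program: a pseudocount of 1 avoids taking the log of zero, natural logarithms are used, symbols other than A, C, G and T are given a neutral weight of 0.25, and the population standard deviation is used. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def make_weights(aligned, pseudocount=1.0):
    """Build per-position base frequencies from equal-length aligned sequences."""
    weights = []
    for pos in range(len(aligned[0])):
        counts = {base: pseudocount for base in "ACGT"}  # pseudocount is an assumption
        for seq in aligned:
            base = seq[pos].upper()
            if base in counts:
                counts[base] += 1
        total = sum(counts.values())
        weights.append({base: count / total for base, count in counts.items()})
    return weights


def log_score(weights, segment):
    """Multiply the weights by summing their logs; unknown symbols count as 0.25."""
    return sum(math.log(w.get(b, 0.25)) for w, b in zip(weights, segment.upper()))


def default_cutoff(weights, aligned):
    """Mean minus three standard deviations of the scores of the known examples."""
    scores = [log_score(weights, seq) for seq in aligned]
    return mean(scores) - 3 * pstdev(scores)


def weight_search(weights, seq, cutoff):
    """Return (1-based position, score) for every segment scoring at least cutoff."""
    n = len(weights)
    hits = []
    for i in range(len(seq) - n + 1):
        score = log_score(weights, seq[i:i + n])
        if score >= cutoff:
            hits.append((i + 1, score))
    return hits
```

Because the scores are sums of logs of values below one they are negative, as are the scores in the program's own output, and a less negative score indicates a better match.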

2.3.1 Creating a weight matrix file from a set of aligned sequences
1. Select "Motif search using weight matrix".

2. Select "Make weight matrix".

3. Define "Name of aligned sequences file". We assume the file of aligned sequences has already been created (See note 3). The program reads and displays the contents of the file numbering each sequence as it goes. Then it displays the length of the longest sequence.

4. Accept "Sum logs of weights". The alternative is to sum the weights when calculating scores (see note 4).

5. Accept "Use all motif positions". The alternative allows the user to define a "mask" which identifies positions within the motif that should be ignored when the matrix is created (see note 5). The program now calculates the weights and applies them in turn to each of the sequences in the file. The number and score for each sequence is displayed, followed by the top, bottom and mean scores and the standard deviation. In addition the mean plus and minus 3 standard deviations is displayed.

6. Define "Cutoff score". The default is the mean minus 3 standard deviations, but users may, for example, decide to use the lowest score obtained by the sequences in the file.

7. Define "Top score for scaling plots". This parameter is used by the graphics output routine when scaling the plots. Its value will influence the height of lines plotted to represent matches.

8. Define "Position to identify". When a search is performed it is not always appropriate to report the position of a match relative to the leftmost base in the motif. For example when performing a splice junction search we may want to know the position of the G in the conserved GT, rather than the position of the first base in the matrix. The "Position to identify" allows the user to define which base is marked. The bases in the table are numbered 1, 2, 3 and so on.

9. Define a "Title". This is a title that will be displayed when the matrix file is read prior to performing a search. It is limited to 60 characters.

10. Define "Name for new weight matrix file". Give a name for the weight matrix file. Typical dialogue is shown in figure 6.4.

2.3.2 Searching using a weight matrix
Once a weight matrix has been stored in a file it can be used to search any sequence. Results can be displayed graphically or the matching sequence segments can be listed out with their scores.

1. Select "Motif search using weight matrix".

2. Select "Use weight matrix".

3. Define "Motif weight matrix file". The name of the file containing the weight matrix. The program reads the file and displays its title.

4. Define "Cutoff score". The default will be the value set when the weight matrix file was created. If the score is negative the program will calculate sums of logs of frequencies, otherwise it will add frequencies.

5. Accept "Plot results". Alternatively they will be listed. The results will appear as in figure 6.5.

Motif search using weight matrix

Select operation

X 1 Use weight matrix

2 Make weight matrix

3 Rescale weight matrix

? Selection (1-3) (1) =2

? Name of aligned sequences file=heatshock.seq

1 ATAAAGAATATTCTAGAA

2 CTCGAGAAATTTCTCTGG 144

3 TTCTCGTTGCTTCGAGAG 36

4 GCCTCGAATGTTCGCGAA 15

5 GACTGGAATGTTCTGACC 45 DROSOPHILA HSP68

6 ATCTCGAATTTTCCCCTC 12

7 ATCCAGAAGCCTCYAGAA 35 DROSOPHILA HSP83

8 CTCTAGAAGTTTCTAGAG 25

9 TTCTAGAGACTTCCAGTT 15

10 CCCCAGAAACTTCCACGG 147 DROSOPHILA HSP22

11 GCGAAGAAAATTCGAGAG 46

12 TGCCGGTATTTTCTAGAT 26

13 CCCGAGAAGTTTCGTGTC 97 DROSOPHILA HSP23

14 TTCCGGACTCTTCTAGAA 13 DROSOPHILA HSP26

15 CTCGAGAAAGCTCGCGAA 204 XENOPUS HSP70

16 CTCGCGAATCTTCCGCGA 194

17 CTCGCGAAAGTTCTTCGG 139

18 CTCGGGAAACTTCGGGTC 72

19 TGCCAGAAGTTGCTAGCA 124 XENOPUS HSP30

20 CTCGGGAACGTCCCAGAA 14

21 ATCCCGAAACTTCTAGTT 129 SOYBEAN HSP17

22 GTCCAGAATGTTTCTGAA 98

23 TTTCAGAAAATTCTAGTT 78

24 CCCAAGGACTTTCTCGAA 28

25 TTTTAGAATGTTCTAGAA 179 DICTYOSTELIUM DIRS-1

26 TTCTAGAACATTCGAAGA 169

Length of motif 18

? Sum logs of weights (y/n) (y) =

? Use all motif positions (y/n) (y) =

Applying matrix to input sequences

1 -15.609 ATAAAGAATATTCTAGAA

2 -15.965 CTCGAGAAATTTCTCTGG

3 -18.186 TTCTCGTTGCTTCGAGAG

4 -15.331 GCCTCGAATGTTCGCGAA

5 -20.897 GACTGGAATGTTCTGACC

6 -17.347 ATCTCGAATTTTCCCCTC

7 -16.271 ATCCAGAAGCCTCYAGAA

8 -12.227 CTCTAGAAGTTTCTAGAG

9 -15.933 TTCTAGAGACTTCCAGTT

10 -15.604 CCCCAGAAACTTCCACGG

11 -17.866 GCGAAGAAAATTCGAGAG

12 -17.159 TGCCGGTATTTTCTAGAT

13 -16.399 CCCGAGAAGTTTCGTGTC

14 -14.646 TTCCGGACTCTTCTAGAA

15 -14.801 CTCGAGAAAGCTCGCGAA

16 -16.163 CTCGCGAATCTTCCGCGA

17 -16.280 CTCGCGAAAGTTCTTCGG

18 -15.598 CTCGGGAAACTTCGGGTC

19 -17.721 TGCCAGAAGTTGCTAGCA

20 -16.257 CTCGGGAACGTCCCAGAA

21 -14.243 ATCCCGAAACTTCTAGTT

22 -16.456 GTCCAGAATGTTTCTGAA

23 -15.453 TTTCAGAAAATTCTAGTT

24 -17.443 CCCAAGGACTTTCTCGAA

25 -13.335 TTTTAGAATGTTCTAGAA

26 -15.914 TTCTAGAACATTCGAAGA

Top score -12.227 Bottom score -20.897

Mean -16.119 Standard deviation 1.636

Mean minus 3.sd -21.028 Mean plus 3.sd -11.210

? Cutoff score (-999.00-9999.00) (-21.03) =

? Top score for scaling plots (-21.03-999.00) (-11.21) =

? Position to identify (0-18) (1) =

? Title=Heatshock weights 24-10-91

? Name for new weight matrix file=heatshock.wts

Figure 6.4 An example run of creating a weight matrix

Motif search using weight matrix

Select operation

X 1 Use weight matrix

2 Make weight matrix

3 Rescale weight matrix

? Selection (1-3) (1) =

? Motif weight matrix file=heatshock.wts

Heatshock weights 24-10-91

? Cutoff score (-9999.00-9999.00) (-21.03) =

? Plot results (y/n) (y) =

619 -20.84 gctcggaagcttctgctc

818 -20.74 ttggcgaagctttcaaag

1190 -21.02 gccaggtaagtttcagac

1601 -20.91 tttgcgactgttcggtaa

2387 -20.24 cgctcgcagattctggac

2534 -20.87 gccgagaagatcatcgaa

2890 -16.38 ctcccggatgttctggag

2989 -19.54 ctcgcgaaaatttctgct

3451 -20.76 atcctggaagttccggtt

6020 -20.73 tctcaggaactgctggaa

6335 -20.51 gctgagaaattccgtgac

7107 -20.31 ctctggtctggtcgagaa

7117 -19.61 gtcgagaaaatccaggta

7892 -20.18 cttccgaaagtgctgcat

Figure 6.5 Example run of a search using a weight matrix to produce text output.

2.4 Using "hardwired" motif searches.

The program contains predefined motif definitions for the following:

E. coli promoters

prokaryotic ribosome binding sites

mRNA splice junctions

eukaryotic ribosome binding sites

polyadenylation sites

All except the polyadenylation site, which is simply defined as an exact match to the string AATAAA, are represented as weight matrices. Each search is performed simply by the user selecting the appropriate option from the menu and each plots its results in its own graphics window. The ribosome binding site searches are reading frame specific and so their results are normally plotted to line up with the output from the "gene search by content" methods described in the chapter on finding genes. Likewise the splice junction searches produce separate output for each of the three reading frames. Below, as an example of using the hardwired motifs, we show how to perform such a search.
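Of the hardwired motifs only the polyadenylation site is a plain string, so it is the one that can be illustrated without the program's weight matrices. The sketch below simply reports every exact occurrence of AATAAA on the given strand. The function name is hypothetical, and reporting the position of the first base of the match is an assumption.

```python
def polyadenylation_sites(seq, motif="AATAAA"):
    """Return the 1-based start positions of every exact match to the motif."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        # Step on by one base so that overlapping matches are not missed.
        start = seq.find(motif, start + 1)
    return positions
```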

2.4.1 Searching for splice junctions
1. Select "Splice search using weight matrix". The program automatically reads in weight matrices that define the donor and acceptor sites and displays their titles.

2. Define "Donor cutoff score". The default is stored in the file.

3. Define "Acceptor cutoff score". The default is stored in the file.

4. Accept "Plot results". The alternative lists the results giving the position, score, matching sequence and reading frame. A typical plotted result appears in figure 6.6.

Figure 6.6 Typical graphical output from using the hardwired splice junction search. The results are presented in a reading frame specific way: the bottom three boxes show results for donor sites and the top three boxes those for acceptor sites. In both cases the vertical ordering of the boxes is frame 0 at the bottom, frame 1 in the middle and frame 2 at the top. For example there is a very strong peak corresponding to an acceptor in frame 1 that can be seen just over halfway along the sequence.

3. Notes

1. For this program a motif is a short segment of sequence of fixed length. More complex structures termed "patterns" which we define as sets of motifs separated by varying gaps, are covered in another chapter. The current chapter should be read before the chapter on patterns.

2. It is debatable whether the gain in sensitivity afforded by the use of a score matrix is of value for searching nucleotide sequences; however, it is very important for protein sequences.

3. The files of aligned sequences used to make weight matrices have the following format. Each sequence should be on a separate line. The sequence should start in column 2 and is terminated by a new line or a space. Anything after the space is treated as a comment. The files can be created by previous searches or using an editor.

4. The frequencies in the weight matrix can be used in two ways to calculate scores for sequences. Some users prefer to add the frequencies to give a total score, and others to multiply them by summing their logs. If we regard the frequencies as probabilities then multiplication seems the correct procedure. The user chooses which method will be employed when the weight matrix is created; however, the choice can be overridden when the matrix is used. If multiplication is selected then all results will be presented as sums of logs.

5. Masking the weight matrix is particularly useful in cases where a limited number of examples of a motif are available, or when the motif may have several components. In the first case the limited number of examples may make the matrix unrepresentative of the motif because the bases in the unconserved positions may bias the results of searches. When a large number of examples is available to create the matrix, the unconserved positions should tend towards equal base composition and hence have no influence on the overall score. We stated that a motif might have several components: for example a motif might have both structural and specificity components. We may want to separate out the two parts and masking provides such a facility.

6. The weight matrix handling routine contains a further option "Rescale weight matrix". If the user has edited a weight matrix to change the frequency values this provides a way of selecting a new cutoff score. It allows users to read in a set of aligned sequences and a weight matrix and to apply the matrix to the set of sequences to see the range of scores achieved. A new weight matrix file containing the selected cutoff score is written to disk.

7. The program also contains a set of routines identical to those used to create and search for nucleotide weight matrices, but which deal instead with dinucleotide weight matrices.

8. The reader is reminded that most options in the program, if selected when in "execute without dialogue" mode, will automatically use a set of defaults and produce a result with little or no user input. Most motif searches require far less user input than that shown above, where we have tried to show the scope of the methods.

9. Although the program contains hardwired motifs we expect most sites that use the programs to accumulate their own libraries of motifs and patterns, which users can employ by simply knowing the names of the corresponding files.
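Notes 3 and 4 can be illustrated with a short Python sketch. It is not the program's own code: the function names, the pseudocount used to avoid taking the log of zero, and the decision to skip empty lines are all assumptions made for illustration. The sketch reads aligned sequences in the format of note 3, builds a table of base frequencies, and scores a segment either by adding the frequencies or by summing their logs.

```python
import math

def read_aligned(lines):
    """Parse aligned sequences: each starts in column 2 and ends at a space or newline."""
    seqs = []
    for line in lines:
        body = line.rstrip("\n")[1:]            # drop column 1
        seq = body.split(" ", 1)[0].upper()     # text after the first space is a comment
        if seq:
            seqs.append(seq)
    return seqs

def build_frequencies(seqs, pseudocount=1.0):
    """Return one {base: frequency} table per column (pseudocount is an assumption)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for s in seqs:
            if s[pos] in counts:
                counts[s[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: n / total for base, n in counts.items()})
    return matrix

def score(matrix, segment, multiply=True):
    """Score a segment of A, C, G, T only: sum of logs (multiply) or sum of frequencies."""
    if multiply:
        return sum(math.log(col[b]) for col, b in zip(matrix, segment))
    return sum(col[b] for col, b in zip(matrix, segment))

def search(matrix, seq, cutoff, multiply=True):
    """Return (1-based position, score) for every window scoring at least cutoff."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(matrix, seq[i:i + width], multiply)
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits

# Hypothetical input in the format of note 3.
lines = [" TATAAT promoter 1\n", " TATGAT\n", " TACAAT second example\n"]
matrix = build_frequencies(read_aligned(lines))
cutoff = score(matrix, "TATAAT") - 1e-9
print(search(matrix, "GGTATAATGG", cutoff))
```

With sums of logs the scores are negative, which is consistent with the negative cutoff scores shown in the worked examples of the pattern chapter.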

4. References

1. Staden, R. 1984. Computer methods to locate signals in nucleic acid sequences. Nucl. Acids Res. 12:521-538.

2. Staden, R. 1985. Computer methods to locate genes and signals in nucleic acid sequences. (in) Genetic Engineering, Principle and Methods, Setlow J.K., Hollaender A., (eds.), 7:67-114, (Plenum Press, New York).

3. Staden, R. 1988. Methods to define and locate patterns of motifs in sequences. CABIOS 4 (1):53-60.

4. Staden, R. 1990. Searching for patterns in protein and nucleic acid sequences. (in) Methods in Enzymology R.F. Doolittle (ed.), 183:193-211 (Academic Press, New York).

7. Using Patterns to Analyse Nucleic Acid Sequences

Table of contents

1. Introduction

2. Methods

2.1 Creating a pattern file containing an exact match motif and weight matrix motif.

2.2 Searching a sequence using a pattern file

2.3 Comparing a sequence against a library of patterns

2.4 Searching sequence libraries for patterns

3. Notes

4. References

1. Introduction

Here we describe one of the most powerful facilities provided by the program NIP: the ability to define and search for complex patterns of motifs (1-3). In another chapter we give details of searching for individual motifs but here we show how to create patterns and libraries of patterns and to use them to search single sequences and sequence libraries. Once a pattern has been defined and stored in a file it can be used to search any sequence. In addition if users want to routinely screen sequences against libraries of patterns this can be achieved by use of files of file names. The program can produce several alternative forms of output. It can display the segment of sequence matching each individual motif in the pattern, display all the sequence between and including the two outermost motifs, produce a description of the match in the form of an EMBL feature table, or draw a simple graphical plot.

At the end of the chapter we describe how a related program NIPL is used to search libraries of sequences to find patterns. NIPL is capable of producing alignments of sequence families.

Patterns are defined as sets of motifs with variable spacing. Each motif in a pattern can be defined using any of several methods, and their positions relative to one another are defined in terms of minimum and maximum separations. In addition, by the use of logical operators, each motif can be declared to be essential (the AND operator), optional (the OR operator), or forbidden (the NOT operator). The following methods (termed "classes" by the program) for defining motifs are provided: 1) exact match to a short sequence; 2) percentage match to a short sequence; 3) match to a short sequence using a score matrix and cutoff score; 4) match to a weight matrix; 5) match to the complement of a weight matrix; 6) inverted repeat or stem-loop; 7) exact match to a short sequence with a defined step; 8) direct repeat. Classes 1, 2, 3 and 7 permit the use of IUB redundancy codes.

The motifs in a pattern are numbered sequentially and motif spacing is defined in the following way. When a new motif is added to a pattern the user specifies the "Reference motif" by its number and then a "Relative start position". The "Relative start position" is defined by taking the first base of the "Reference motif" as position 1, the next as 2, and so on. Then the user defines the allowed variation in the spacing by specifying the "Number of extra positions". Notice that the position of a motif can be defined relative to any other motif, and that a negative "Relative start position" declares the motif to be to the left of its "Reference motif".
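The spacing rule can be written as a small calculation. The Python sketch below is an illustration only: the function names are hypothetical, and the treatment of negative relative positions (simple arithmetic with no special handling of position 0) is an assumption, since the text does not say how the program counts leftwards. The numbers used are those of the worked example in figure 7.1, where the reference motif starts at base 505, the "Relative start position" is 10 and the "Number of extra positions" is 20.

```python
def allowed_starts(ref_start, relative_start, extra_positions):
    """Return the first and last allowed start positions of the new motif.

    ref_start is the sequence position of the first base of the reference
    motif, which counts as relative position 1.
    """
    first = ref_start + relative_start - 1
    return first, first + extra_positions

def spacing_ok(ref_start, motif_start, relative_start, extra_positions):
    """True if a motif starting at motif_start satisfies the spacing rule."""
    first, last = allowed_starts(ref_start, relative_start, extra_positions)
    return first <= motif_start <= last

# Reference motif at 505; second motif may start at relative positions 10 to 30.
print(allowed_starts(505, 10, 20))
print(spacing_ok(505, 528, 10, 20))
```

In the worked example the second motif was found at base 528, which is relative position 24 and so lies inside the allowed range.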

The probability of finding each individual motif in the current sequence, the product of the probabilities for all the motifs in a pattern ("Probability of finding pattern"), and the "Expected number of matches" are calculated and displayed by the program. In addition to the cutoffs used for the individual motifs, users can apply two pattern cutoffs: "Maximum pattern probability" and "Minimum pattern score".
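As a rough illustration of how such probabilities combine, the Python sketch below multiplies motif probabilities to give a pattern probability. Two things are assumed rather than taken from the program: that the probability of an exact match motif is simply the product of the frequencies of its bases in the sequence, and the base composition used in the example, which is invented. The calculation for weight matrix motifs is more involved and is not attempted here.

```python
import math

def exact_match_probability(motif, composition):
    """Probability of an exact match at one position, assuming independent bases."""
    return math.prod(composition[base] for base in motif.upper())

def pattern_probability(motif_probabilities):
    """Product of the individual motif probabilities."""
    return math.prod(motif_probabilities)

# Hypothetical base composition (fractions summing to 1).
composition = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(exact_match_probability("TTTTT", composition))

# Motif probabilities as displayed in figure 7.1.
print(pattern_probability([0.870e-3, 0.117e-2]))
```

The second product is about 1.018E-06, close to the 0.1015E-05 displayed in figure 7.1; the small difference presumably reflects rounding of the displayed motif probabilities.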

Below we describe: how to create a pattern; how to use a pattern file to search a sequence; how to use a "File of pattern file names" to search a sequence for a whole library of patterns. To describe how to create a pattern file we first show all the steps to make one containing two motifs, and then, to save space, the parts specific to the individual motif types are sketched in the notes section.

2. Methods

2.1 Creating a pattern file containing an exact match motif and weight matrix motif.

1. Select "Pattern searcher".

2. Select "Pattern definition mode" as "Use keyboard".

3. Select "Results display mode" as "Motif by motif". The alternatives are listed in the introduction.

4. Select "Motif definition mode" as "Exact match".

5. Define "Motif name". Each motif can be given an 8 character name.

6. Define "String". Type in the sequence of the motif. The program will display the probability of finding the motif.

7. Select "Motif definition mode" as "Weight matrix".

8. Define "Motif name".

9. Select "Logical operator" as "AND". The alternatives are "OR" and "NOT".

10. Select "Number of reference motif". At this stage the only choice is 1 and this is the default.

11. Define "Relative start position". The base position relative to the "Reference motif". See the introduction.

12. Define "Number of extra positions".

13. Define "Weight matrix file name". Type the name of the file containing the weight matrix.

The program now cycles round to step 7 and all subsequent passes round the loop to add further motifs to the pattern would differ only in the details for the different motif "classes".

14. Select "Pattern complete"

15. Accept "Save pattern in a file". The alternative does not save the pattern and so it can only be used once on the current sequence.

16. Define "Pattern definition file". Give a name for the new file.

17. Define "Pattern title". All patterns can have a 60 character title that can be displayed when the pattern file is read and the sequence searched. The program will now display a detailed textual description of the pattern, the "Probability of finding the pattern" and the "Expected number of matches".

18. Define "Maximum pattern probability". Note that this is a maximum: any match with a greater probability of being found will be rejected. If no value is specified the search will be quicker (see notes).

19. Define "Minimum pattern score". A minimum pattern score only makes sense if all the motifs in the pattern are defined with compatible scoring methods. For example percentage matches and weight matrices using sums of logs are incompatible. Searching will now commence and any matches displayed using the chosen method. A worked example of creating such a pattern and performing a search is shown in figure 7.1, and the actual pattern file is shown in figure 7.2.

Pattern searcher

Select pattern definition mode

X 1 Use keyboard

2 Use pattern file

3 Use file of pattern file names

? Selection (1-3) (1) =

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Graphical

4 EMBL feature table

? Selection (1-4) (1) =

Select motif definition mode

X 1 Exact match

2 Percentage match

3 Cut-off score and score matrix

4 Cut-off score and weight matrix

5 Complement of weight matrix

6 Inverted repeat or stem-loop

7 Exact match, defined step

8 Direct repeat

9 Pattern complete

? Selection (1-9) (1) =

? Motif name=T run

? String=TTTTT

Probability of score 5.0000 = 0.870E-03

Select motif definition mode

X 1 Exact match

2 Percentage match

3 Cut-off score and score matrix

4 Cut-off score and weight matrix

5 Complement of weight matrix

6 Inverted repeat or stem-loop

7 Exact match, defined step

8 Direct repeat

9 Pattern complete

? Selection (1-9) (1) =4

? Motif name=heat

Select logical operator

X 1 And

2 Or

3 Not

? Selection (1-3) (1) =

? Number of reference motif (1-1) (1) =

? Relative start position (-1000-1000) (6) =10

? Number of extra positions (0-1000) (0) =20

? Weight matrix file name=heatshock.wts

Heatshock weights 18-12-90

Probability of score -21.0280 = 0.117E-02

Select motif definition mode

1 Exact match

2 Percentage match

3 Cut-off score and score matrix

X 4 Cut-off score and weight matrix

5 Complement of weight matrix

6 Inverted repeat or stem-loop

7 Exact match, defined step

8 Direct repeat

9 Pattern complete

? Selection (1-9) (4) =9

? Save pattern in a file (y/n) (y) =

? Pattern definition file=_paper.pat

? Pattern title=demo pattern

Pattern description

demo pattern

Motif 1 named T run is of class 1

Which is an exact match to the string

TTTTT

Motif 2 named heat is of class 4

Which is a match to a weight matrix with score -21.028

and the 5 prime base can take positions 10 to 30

relative to the 5 prime end of motif 1

It is anded with the previous motif.

Probability of finding pattern = 0.1015E-05

Expected number of matches = 0.1734E+00

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

Working

Match

505 T run

ttttt

528 heat

ttaaagaaagttttatac

Total matches found 1

Minimum and maximum observed scores -15.34 -15.34

Figure 7.1 Worked example of creating a simple pattern and performing a search.

demo pattern

A1 T run Class

TTTTT

@ End of string

A4 heat Class

1 Relative motif

10 Relative start position

20 Number of extra positions

heatshock.wts

Figure 7.2 The pattern file created by the work shown in figure 7.1.

2.2 Searching a sequence using a pattern file

1. Select "Pattern searcher"

2. Select "Pattern definition mode" as "Use pattern file".

3. Select "Results display mode" as "Inclusive"

4. Define "Pattern definition file". Type the name of the file containing the pattern. The program will read the file then display its title, a detailed textual description of the pattern, the "Probability of finding the pattern", and the "Expected number of matches".

5. Define "Maximum pattern probability".

6. Define "Minimum pattern score". Searching will now commence and any matches displayed using the chosen method. A worked example, using the pattern file created in figure 7.1, is shown in figure 7.3.

Pattern searcher

Select pattern definition mode

X 1 Use keyboard

2 Use pattern file

3 Use file of pattern file names

? Selection (1-3) (1) =2

? Pattern definition file=_paper.pat

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Graphical

4 EMBL feature table

? Selection (1-4) (1) =2

Probability of score 5.0000 = 0.870E-03

Heatshock weights 18-12-90

Probability of score -21.0280 = 0.117E-02

Pattern description

demo pattern

Motif 1 named T run is of class 1

Which is an exact match to the string

TTTTT

Motif 2 named heat is of class 4

Which is a match to a weight matrix with score -21.028

and the 5 prime base can take positions 10 to 30

relative to the 5 prime end of motif 1

It is anded with the previous motif.

Probability of finding pattern = 0.1015E-05

Expected number of matches = 0.1734E+00

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

Working

505 T run

tttttgatgcttgactctaagccttaaagaaagttttatac

Total matches found 1

Minimum and maximum observed scores -15.34 -15.34

Figure 7.3 Worked example of using a pattern file as input.

2.3 Comparing a sequence against a library of patterns

This mode of operation allows a sequence to be searched, in turn, for any number of patterns each stored in a separate pattern file. The names of the files containing the individual patterns must be stored in a simple text file. This file is called "a file of pattern file names" and its name is the only user input required to define the search.

1. Select "Pattern searcher"

2. Select "Pattern definition mode" as "Use file of pattern file names".

3. Select "Results display mode" as "Inclusive"

4. Define "File of pattern file names". Type the name of the file containing the list of pattern file names. The program will read the file and then, in turn, all the pattern files it names. Each of these patterns will be compared against the current sequence but only those that give matches will produce any output. The pattern title and each match will be displayed.

2.4 Searching sequence libraries for patterns

The program NIPL can be used to search sequence libraries for patterns. Its use is similar to the pattern search routine described above, except that it does not have the facility for creating pattern files, so they must be created beforehand using NIP. In addition to its obvious application of finding new occurrences of patterns or checking on their frequency it is a useful way of obtaining sequence alignments. It can restrict its search to a list of named entries or can search all but those on a list of entries. It can restrict its output to showing the highest scoring match in each sequence, but by default it will show all matches.

Of its modes of output, two require further description. The first "Padded sections" creates a new file for each match. The file will contain the sequence between and including the two outermost motifs in the pattern. It will be gapped to the furthest extent defined by the pattern, which means that if all the files were subsequently written one above the other all the motifs in the pattern would be exactly aligned, with the sections between them containing the requisite numbers of padding characters. The second such mode of output is called "Complete padded sequences". Here the user must know the maximum distance between the leftmost motif and the start of all the sequences that match. A trial run in which only the positions of matches are reported is usually required. The user gives this maximum distance to the program. The program then writes a new file containing the full length of all matching sequences, again maximally gapped (including their left ends) so that they would all align if written above one another. For both of these modes of output the files created are named "entryname" where "entryname" is the name given to the sequence in the sequence library. These modes are best used with the option "Report all matches" rejected, so that only the best match for each sequence is reported. The sequences can be lined up using the sequence assembly program SAP.

1. Select NIPL.

2. Define "Name for results file."

3. Select a library.

4. Select "Search whole library". The alternatives are "Search only a list of entries" and "Search all but a list of entries". The files containing the list of entries should contain one entry name per line, left justified.

5. Select "Results display mode" as "Inclusive". The alternatives include "Motif by motif", "Scores only", "Complete padded sequences" and "Padded sections".

6. Accept "Report all matches". The alternative only shows the best match for each sequence.

7. Define "Pattern definition file". The name of the file containing the pattern created using NIP.

The program displays a textual description of the pattern and the expected number of matches per 1000 residues assuming an average nucleic acid composition.

8. Define "Maximum pattern probability". The program will run much more quickly if none is given.

9. Define "Minimum pattern score". The search will start.
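The idea behind the "Padded sections" output can be sketched in a few lines of Python. This is an illustration, not NIPL's own code: the function name, the use of "-" as the padding character, and the placing of the padding at the right-hand end of the section between the motifs are all assumptions. The example uses the match shown in figure 7.3, in which the pattern allows at most 24 bases between the two motifs.

```python
def pad_section(left_motif, spacer, right_motif, max_spacer, pad="-"):
    """Pad the section between two motifs to the maximum length the pattern allows."""
    if len(spacer) > max_spacer:
        raise ValueError("spacer is longer than the pattern allows")
    padding = pad * (max_spacer - len(spacer))
    return left_motif + spacer + padding + right_motif

# The match from figure 7.3, followed by a hypothetical second match
# with a shorter section between the motifs.
rows = [
    pad_section("ttttt", "gatgcttgactctaagcc", "ttaaagaaagttttatac", 24),
    pad_section("ttttt", "gatgcttg", "ttaaagaaagttttatac", 24),
]
for row in rows:
    print(row)
```

Because every section is padded to the same maximum length, the right-hand motif starts in the same column of every row.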

3. Notes

1. The "exact match" motif class requires a consensus sequence. The "percentage match" motif class requires a consensus sequence and a cutoff score. The "score matrix" motif class requires a consensus sequence and a cutoff score. The "weight matrix" search and the "complement of a weight matrix" only require the name of the file containing the matrix. The "inverted repeat" or "stem-loop" requires a stem length, minimum and maximum loop sizes, and a cutoff score using scores A-T = G-C = 2, G-T = 1. Note that if the user defines an inverted repeat as a "Reference motif" the "Relative position" can be defined from either its 5' or 3' ends. The "direct repeat" motif class requires a repeat length, the minimum and maximum gap between the two occurrences of the repeat, and a minimum score.

2. The motif class "Exact match, defined step" is rarely used. A typical use might be to find a start codon followed, for some minimum distance, by no stop codons in the same reading frame. The step would have the value 3 to keep the reading frame the same as that of the start codon, and the stop codon searches would be included using the NOT operator.

3. The details of the probability calculations are outside the scope of this article. They are quite rapid and are essential both for assessing the statistical significance of any matches found and for allowing meaningful cutoffs to be applied to patterns. In general, cutoff scores are inappropriate for patterns containing a mixture of motif classes.

4. The program calculates the "Probability of finding the pattern" and the "Expected number of matches". The first figure is actually the product of the individual motif probabilities but the latter figure is more useful because it takes into account the allowed variation in spacing between motifs and the length of the current sequence. In both cases the composition of the current sequence is also used so that different probabilities would be calculated for other sequences.

5. The pattern definition system is very flexible. Assume that a laboratory has a large library of patterns stored in its computer. Different groups or users may want to screen their sequences against different subsets of a pattern library. Each group therefore uses its own "File of pattern file names" which contains only the names of the pattern files that are relevant to their sequences. Of course a pattern may contain only one motif. Hence a library of patterns can include both simple and complex patterns. In the same way a laboratory may have a large library of weight matrices defining different motifs and different users may want to combine them in different ways to produce their own patterns.
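The stem-loop scoring described in note 1 (A-T = G-C = 2, G-T = 1) can be sketched in Python as follows. The sketch is not the program's implementation: the function names are hypothetical, and scoring every other pairing as 0 and reporting 1-based start positions are assumptions.

```python
# Scores from note 1; any pairing not listed is assumed to score 0.
PAIR_SCORES = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_score(seq, start, stem_len, loop_len):
    """Score the stem whose 5' arm begins at 0-based index start."""
    end = start + 2 * stem_len + loop_len - 1   # index of the last base of the 3' arm
    if end >= len(seq):
        raise ValueError("stem-loop runs off the end of the sequence")
    return sum(
        PAIR_SCORES.get((seq[start + k], seq[end - k]), 0)
        for k in range(stem_len)
    )

def find_stem_loops(seq, stem_len, min_loop, max_loop, cutoff):
    """Return (1-based start, loop length, score) for stems scoring at least cutoff."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        for loop_len in range(min_loop, max_loop + 1):
            if start + 2 * stem_len + loop_len > len(seq):
                break
            s = stem_score(seq, start, stem_len, loop_len)
            if s >= cutoff:
                hits.append((start + 1, loop_len, s))
    return hits

print(find_stem_loops("GGGCAAAAGCCC", stem_len=4, min_loop=3, max_loop=5, cutoff=8))
```

In this invented example the four-base stem GGGC pairs perfectly with GCCC across a loop of four bases, giving the maximum score of 8.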

4. References

1. Staden, R. 1988. Methods to define and locate patterns of motifs in sequences. CABIOS 4(1):53-60.

2. Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. CABIOS 5(2):89-96.

3. Staden, R. 1990. Searching for patterns in protein and nucleic acid sequences. (in) Methods in Enzymology R.F. Doolittle (ed.), 183:193-211 (Academic Press, New York).

8. Searching for Restriction Sites

Table of contents

1. Introduction

2. Methods

2.1 Search for restriction sites and list them enzyme by enzyme

2.2 Search for restriction sites and list them by position

2.3 Search for restriction sites and list their names above the sequence

2.4 Search for restriction sites and plot their positions

2.5 Find restriction enzymes that cut infrequently

2.6 Producing a back translation from a protein sequence

3. Notes

1. Introduction

The program NIP contains a routine for finding and displaying the positions of the cut sites of restriction enzyme recognition sequences. Linear or circular sequences can be searched and the results can be listed in various forms or displayed graphically. The recognition sequences to be searched for can be typed on the keyboard or read from files. The format of these files is given in note 1. At the end of the chapter we also describe how to produce back translations of protein sequences so that these routines can be used to search them for restriction sites.

2. Methods

2.1 Search for restriction enzyme sites and list them enzyme by enzyme

1. Select "Search".

2. Select "Input source" as "All enzymes file". A number of standard files are available and users may also have their own.

3. Accept "Search for all names".

4. Select "Order results enzyme by enzyme".

5. Accept "List matches".

6. Accept "The sequence is linear". The alternative is circular.

7. Accept "Search for definite matches". The alternative is to search for possible matches in a sequence containing IUB redundancy codes. The results will then appear in the form shown in figure 8.1. Each match is numbered and its enzyme name given, followed by the matching sequence with the cut site indicated by a ' symbol. The position of the cut site is given followed by the length of the potential fragment ending at that site, followed by a list of fragment sizes sorted on length.

Matches found= 3

Name Sequence Position Fragment length

1 AccII cg'cg 313 312 51

2 AccII cg'cg 364 51 188

3 AccII cg'cg 552 188 312

449 449

Matches found= 6

Name Sequence Position Fragment length

1 AciI cc'gc 503 502 12

2 AciI gc'gg 553 50 12

3 AciI gc'gg 714 161 50

4 AciI gc'gg 872 158 105

5 AciI gc'gg 884 12 158

6 AciI cc'gc 896 12 161

105 502

Matches found= 3

Name Sequence Position Fragment length

1 AcyI gg'cgtc 698 697 5

2 AcyI gg'cgtc 765 67 67

3 AcyI ga'cgcc 996 231 231

Figure 8.1 Typical output from "List enzyme by enzyme".
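The fragment lengths in figure 8.1 follow directly from the cut positions, as the Python sketch below shows. It is an illustration only: the function name is hypothetical, the sequence length of 1000 is inferred from the final fragment lengths in the figure rather than stated in the text, and the circular case (joining the last and first fragments) is an assumption about how such a sequence would be treated.

```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Fragment lengths, in order along the sequence, for a set of cut positions.

    Each cut position is the number of the first base after the cut,
    as in the "Position" column of figure 8.1.
    """
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    fragments = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # Assumption: the piece after the last cut joins the piece before the first.
        fragments.append(seq_length - cuts[-1] + cuts[0])
    else:
        fragments.insert(0, cuts[0] - 1)
        fragments.append(seq_length - cuts[-1] + 1)
    return fragments

# AccII cut positions from figure 8.1; length 1000 is inferred, not stated.
by_position = fragment_lengths([313, 364, 552], 1000)
print(by_position)
print(sorted(by_position))
```

The first list reproduces the "Fragment length" column for AccII (with the final fragment added) and the sorted list reproduces the column beside it.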

2.2 Search for restriction enzyme sites and list them by position

1. Select "Search".

2. Select "Input source" as "All enzymes file".

3. Accept "Search for all names".

4. Select "Order results by position".

5. Accept "List matches".

6. Accept "The sequence is linear".

7. Accept "Search for definite matches". The results will then appear in the form shown in figure 8.2. Each match is numbered and its enzyme name given, followed by the matching sequence with the cut site indicated by a ' symbol. The position of the cut site is given followed by the length of the potential fragment ending at that site.

2.3 Search for restriction enzyme sites and list their names above the sequence

1. Select "Search".

2. Select "Input source" as "All enzymes file".

3. Accept "Search for all names".

4. Select "Show names above the sequence".

5. Reject "Hide translation".

6. Accept "Use 1 letter codes".

7. Define "Line length". This is the number of bases that will appear on each line of output. It must be a multiple of 30.

Name Sequence Position Fragment length

1 HapII c'cgg 2 1

2 HpaII c'cgg 2 0

3 MspI c'cgg 2 0

4 MseI t'taa 14 12

5 HincII gtt'aac 15 1

6 HindII gtt'aac 15 0

7 HpaI gtt'aac 15 0

8 DsaV 'ccagg 23 8

9 EcoRII 'ccagg 23 0

10 TspAI 'ccagg 23 0

11 ApyI cc'agg 25 2

12 BstNI cc'agg 25 0

13 MvaI cc'agg 25 0

14 ScrFI cc'agg 25 0

15 MaeIII 'gttac 47 22

16 BsrI actggt' 49 2

17 MseI t'taa 55 6

18 MaeII a'cgt 63 8

19 SfaNI gcatcaacaa'gata 86 23

20 MaeII a'cgt 91 5

Figure 8.2 Typical output from "List by position".

8. Accept "The sequence is linear".

9. Accept "Search for definite matches". The results will then appear in the form shown in figure 8.3. The sequence is listed with a 3 phase translation underneath and every tenth base numbered. Above the sequence the positions of the cut sites of restriction enzymes are marked.

2.4 Search for restriction enzyme sites and plot their positions

1. Select "Search".

2. Select "Input source" as "All enzymes file".

3. Accept "Search for all names".

4. Select "Order results by position".

5. Reject "List matches".

6. Accept "The sequence is linear".

7. Accept "Search for definite matches". The results will then appear in the form shown in figure 8.4. Each enzyme that has a match is named at the left edge of the display and its cut sites are marked by short vertical lines. If the display window fills up the bell will ring. Users may then take a screen dump before typing return. The program then displays the message " ? Restart plotting from bottom of frame". To do so type return. To quit type !.

Search for restriction enzyme sites

Select operation

X 1 Search

2 List enzyme file

3 Clear text

4 Clear graphics

? Selection (1-4) (1) =

Select input source

1 All enzymes file

X 2 Six cutter file

3 Four cutter file

4 Personal file

5 Keyboard

? Selection (1-5) (2) =1

? Search for all names (y/n) (y) =

Select results display mode

X 1 Order results enzyme by enzyme

2 Order results by position

3 Show only infrequent cutters

4 Show names above the sequence

? Selection (1-4) (1) =4

? Hide translation (y/n) (y) =n

? Use 1 letter codes (y/n) (y) =

? Line length (30-90) (60) =

? The sequence is linear (y/n) (y) =

? Search for definite matches (y/n) (y) =

HapII

HpaII

MspI MseI

. .HincII

. .HindII

. .HpaI DsaV

. .. EcoRII

. .. TspAI

. .. . ApyI

. .. . BstNI

. .. . MvaI

. .. . ScrFI MaeIII

. .. . . . BsrI MseI

ccggttagactgttaacaacaaccaggttttctactgatataactggttacatttaacgc

10 20 30 40 50 60

P V R L L T T T R F S T D I T G Y I * R

R L D C * Q Q P G F L L I * L V T F N A

G * T V N N N Q V F Y * Y N W L H L T P

Figure 8.3 Typical dialogue and output for a "Names above the sequence" search.

2.5 Finding restriction enzymes that cut infrequently

1. Select "Search".

2. Select "Input source" as "All enzymes file".

3. Accept "Search for all names".

4. Select "Show only infrequent cutters".

5. Define "Maximum number of cuts".

6. Accept "The sequence is linear".

Figure 8.4 Typical output from "Plot positions".

7. Accept "Search for definite matches". The names and numbers of cut sites of all enzymes that cut no more than the "Maximum number of cuts" will then be displayed.

2.6 Producing a back translation from a protein sequence

The routine for producing back translations is contained in the program PIP. It back translates protein sequences into DNA using the standard genetic code. The translation can use either the IUB symbols or a set of codon preferences. If a set of codon preferences is used they must conform to the format of codon tables produced by the nucleotide interpretation program, and the back translation will contain the favoured codons. If, for any amino acid there is no favoured codon, the IUB symbols will be employed. The program will plot the redundancy along the sequence and hence can be used to find the best sequences to use as primers. The DNA sequence can be saved to a file and analysed using the nucleotide analysis program.

1. Select "Back translate".

2. Accept "No codon preference". The alternative will cause the program to ask for "File name of codon table", which should be in the same format as those created by the nucleotide interpretation program.

3. Reject "Plot redundancy". The alternative will ask for a window length to use for the plot. The window length is in codons. A plot will appear in which the best primers are sited at the peaks and the worst at the troughs.

4. Accept "Save DNA to disk"

5. Define "File name for DNA sequence". This file can later be read into program NIP and all the searches described above employed.
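A back translation using IUB symbols can be sketched in Python as follows. This is not PIP's code. The sketch takes, at each codon position, the set of bases used by any codon for the amino acid and writes the IUB symbol for that set; for the amino acids with six codons (Leu, Ser, Arg) this gives symbols that also cover codons for other amino acids, and how PIP handles those cases is not stated here. Using the number of codons per amino acid as the measure of redundancy is likewise an assumption.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)

# IUB redundancy symbols, indexed by the set of bases each one represents.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
       "M": "AC", "K": "GT", "S": "CG", "W": "AT", "H": "ACT",
       "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT"}
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUB.items()}

def back_translate(protein):
    """Back translate one-letter amino acid codes into DNA with IUB symbols."""
    dna = []
    for aa in protein.upper():
        codons = CODONS[aa]
        for i in range(3):
            dna.append(SYMBOL_FOR[frozenset(c[i] for c in codons)])
    return "".join(dna)

def redundancy(protein):
    """Number of codons for each amino acid (an assumed measure of redundancy)."""
    return [len(CODONS[aa]) for aa in protein.upper()]

print(back_translate("MWAFK"))
print(redundancy("MWAFK"))
```

Stretches with low redundancy, such as runs of Met and Trp, are the ones best suited to primer design.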

3. Notes

1. The file containing the definitions of the restriction enzyme names and their recognition sequences uses the standard IUB redundancy symbols and has the following format. Each name is followed by a /, then each of its recognition sequences is followed by a /. The last recognition sequence for each enzyme is followed by //. The cut sites should be indicated by a '. If the cut site is not contained in the recognition sequence, the recognition sequence should be extended by sufficient N symbols. For example the two lines from the standard file shown below define the enzymes Alw21I and Alw26I. These files are kindly updated each month by Dr. Rich Roberts.

Alw21I/GWGCW'C//

Alw26I/GTCTCN'NNNN/'NNNNNGATCC//

2. To search for a subset of the restriction enzymes in a file the user should reject "Search for all names" and the program will ask for the names of the enzymes wanted and extract their recognition sequences from the file. Alternatively, if a user was always using the same subset, then a file containing only those enzymes could be created by editing the standard file. This file would then be selected as "Personal file" for "Input source".

3. The routine also allows names and recognition sequences to be entered on the keyboard. This is selected as "Keyboard" for "Input source", and the program will prompt for names and their recognition sequences. In this way the routine can be used to search for exact matches to any short sequence. Again IUB redundancy codes can be used.

4. When back translating from proteins it is often useful to produce a back translation using both a table of codon preferences and one using the IUB symbols. This is because the restriction enzyme search program can distinguish between definite and possible cuts in the sequence. Those matches that the program terms "definite matches" are ones in which the specification of the recognition sequence corresponds exactly to that of the back translation. The program will also find what it terms "possible matches" which are ones that depend on the particular codons chosen for each amino acid. These are sites at which recognition sequences could be engineered to produce a cut in the DNA without changing the amino acid, but which are not necessarily found in the original sequence.
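The file format described in note 1 is simple to read, as the following Python sketch suggests. It is an illustration rather than the routine used by NIP: the function names are hypothetical, each recognition sequence is searched for only as written (on the strand given), and cut positions are reported as the number of the first base after the cut, which agrees with the positions listed in figure 8.2. The test sequence is the 60 bases shown in figure 8.3.

```python
import re

# IUB redundancy symbols and the bases they stand for.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
       "M": "AC", "K": "GT", "S": "CG", "W": "AT", "H": "ACT",
       "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT"}

def parse_enzyme_line(line):
    """Split 'Name/SITE/SITE//' into the name and its list of recognition sequences."""
    line = line.strip()
    if not line.endswith("//"):
        raise ValueError("enzyme definition must end with //")
    fields = line[:-2].split("/")
    return fields[0], fields[1:]

def find_cuts(site, seq):
    """Return cut positions (1-based, first base after the cut) for one site."""
    offset = site.index("'")                     # bases to the left of the cut
    bases = site.replace("'", "").upper()
    # A lookahead lets overlapping matches be found.
    pattern = re.compile("(?=" + "".join("[%s]" % IUB[b] for b in bases) + ")")
    return [m.start() + offset + 1 for m in pattern.finditer(seq.upper())]

print(parse_enzyme_line("Alw26I/GTCTCN'NNNN/'NNNNNGATCC//"))

seq = "ccggttagactgttaacaacaaccaggttttctactgatataactggttacatttaacgc"
print(find_cuts("T'TAA", seq))    # MseI sites
print(find_cuts("'GTNAC", seq))   # a site containing the redundancy symbol N
```

For the MseI sequence T'TAA the sketch reports positions 14 and 55, the same positions as listed for MseI in figure 8.2.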

9. Statistical and Structural Analysis of Nucleotide Sequences

Table of contents

1. Introduction

2. Methods

2.1 Calculating the base composition

2.2 Calculating the dinucleotide composition

2.3 Calculating the codon composition

2.4 Creating a codon usage file

2.5 Plotting the base composition

2.6 Searching for anomalous compositions

2.7 Search for anomalous word usage

2.8 Calculate codon constraint

2.9 Searching for stem-loops

2.10 Searching for long range inverted repeats

2.11 Searching for long range repeats

2.12 Searching for repeated words

2.13 Searching for possible Z DNA

3. Notes

4. References

1. Introduction

In this chapter we deal with performing simple statistical and structural analysis of nucleotide sequences and also describe some more unusual tests. We cover base, dinucleotide and codon compositions, potential amino acid compositions, and the relative frequencies of each base in each position of codons. We describe how to produce plots to show regions of unusual composition and to measure the codon bias for a gene. In addition we describe a set of functions for finding "structures" in nucleotide sequences, including short range inverted repeats or stem-loops, long range inverted repeats, long range direct repeats, and Z DNA. All the methods are contained in the program NIP.

2. Methods

2.1 Calculating the base composition

Select "Calculate base composition". The composition of the active region is shown.

2.2 Calculating the dinucleotide composition

Select "Calculate dinucleotide composition". The dinucleotide composition of the active region and an expected dinucleotide composition are shown. The expected composition is calculated from the base composition assuming a random order of bases in the sequence. See figure 9.1.

            T               C               A               G
       Obs  Expected   Obs  Expected   Obs  Expected   Obs  Expected
  T   5.86    5.97    6.18    5.99    4.24    5.91    8.14    6.56
  C   6.10    5.99    5.14    6.02    5.91    5.93    7.38    6.59
  A   5.57    5.91    5.64    5.93    7.91    5.84    5.05    6.49
  G   6.90    6.56    7.56    6.59    6.11    6.49    6.30    7.22

Figure 9.1 The dinucleotide composition display
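The calculation behind this display can be sketched as follows. The sketch assumes, as the text states, that the expected value for a dinucleotide is the product of the frequencies of its two bases; the function name dinucleotide_composition is hypothetical and the values are returned as percentages.

```python
from collections import Counter

def dinucleotide_composition(seq):
    """Return {dinucleotide: (observed %, expected %)} for a sequence."""
    seq = seq.upper()
    bases = "TCAG"
    base_freq = {b: seq.count(b) / len(seq) for b in bases}
    # Count every overlapping pair of adjacent bases.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    table = {}
    for first in bases:
        for second in bases:
            observed = 100.0 * pairs[first + second] / total
            # Random ordering: the two bases are treated as independent.
            expected = 100.0 * base_freq[first] * base_freq[second]
            table[first + second] = (observed, expected)
    return table

for pair, (obs, exp) in dinucleotide_composition("AACCGGTTACGT").items():
    print(pair, round(obs, 2), round(exp, 2))
```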

2.3 Calculating the codon composition

This function counts codons, amino acid composition, protein molecular weights, hydrophobicity and base compositions. Users select the segments of the sequence to be analysed. The segments can be defined on the keyboard or from an EMBL/GenBank feature table.

1. Select "Calculate codon composition".

2. Accept "Show observed counts". The alternative displays its codon tables so that the total for each amino acid sums to 100. This makes it easier to see any bias present in the codon usage.

3. Accept "Define segments using keyboard". The alternative is to use a feature table.

4. Define "From". The start of the segment to be analysed.

5. Define "To". The end of the segment to be analysed.

The results will be displayed as in figure 9.2 and then the program will again ask "From". The user should define a zero value for "From" when all segments of interest have been analysed. The program will then display a cumulative total for all the values it calculates.

The counts are broken down into several figures. Apart from the codon counts we see: the base composition by position in codon expressed as a percentage of each base's own frequency; the base composition by position in codon expressed as a percentage of the overall base composition of the segment; the base composition expected for the observed amino acid composition if there was no codon preference; the percentage deviations of the observed amino acid composition from an average amino acid composition (1); and the molecular weight and hydrophobicity (2) of the putative amino acid sequence.
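As a rough illustration of the counting involved, the sketch below tallies codons for one segment and reports, for each codon position, the percentage occupied by each base. It is an assumption-laden simplification: the function name codon_statistics is hypothetical, only the plus strand is handled, and the amino acid, molecular weight and hydrophobicity figures are left out.

```python
from collections import Counter

def codon_statistics(seq, start, end):
    """Count codons in seq[start..end] (1-based, inclusive).

    Returns (codon_counts, percent) where percent[p][base] is the percentage
    of codon position p+1 occupied by that base, so each row sums to 100.
    """
    segment = seq[start - 1:end].upper()
    codons = [segment[i:i + 3] for i in range(0, len(segment) - 2, 3)]
    if not codons:
        raise ValueError("segment is shorter than one codon")
    counts = Counter(codons)
    by_position = [Counter(codon[p] for codon in codons) for p in range(3)]
    percent = [{b: 100.0 * by_position[p][b] / len(codons) for b in "TCAG"}
               for p in range(3)]
    return counts, percent

counts, percent = codon_statistics("ATGATGTTT", 1, 9)
print(counts["ATG"], percent[0])
```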

2.4 Creating a codon usage file

This method writes a file of codon usage in the form of a codon table (see figure 9.2). Such tables can be used by several other methods contained within the programs. If required the user can start with an existing file and add to it.

1. Select "Calculate a codon table and write it to disk".

2. Accept "Start with empty table".

Calculate base, codon and amino acid compositions

? Show observed counts (y/n) (y) =

? Define segments using keyboard (y/n) (y) =

? From (0-8134) (0) =1

? To (1-8134) (8134) =1000

? + strand (y/n) (y) =

===========================================

F TTT 5. S TCT 7. Y TAT 4. C TGT 2.

F TTC 17. S TCC 3. Y TAC 5. C TGC 3.

L TTA 3. S TCA 4. * TAA 3. * TGA 1.

L TTG 4. S TCG 3. * TAG 0. W TGG 7.

===========================================

L CTT 3. P CCT 6. H CAT 6. R CGT 3.

L CTC 1. P CCC 1. H CAC 4. R CGC 2.

L CTA 0. P CCA 4. Q CAA 3. R CGA 1.

L CTG 36. P CCG 6. Q CAG 5. R CGG 4.

===========================================

I ATT 12. T ACT 3. N AAT 6. S AGT 0.

I ATC 13. T ACC 5. N AAC 7. S AGC 7.

I ATA 1. T ACA 2. K AAA 9. R AGA 0.

M ATG 9. T ACG 7. K AAG 3. R AGG 1.

===========================================

V GTT 6. A GCT 5. D GAT 7. G GGT 9.

V GTC 3. A GCC 6. D GAC 6. G GGC 9.

V GTA 7. A GCA 2. E GAA 5. G GGA 5.

V GTG 9. A GCG 7. E GAG 3. G GGG 3.

===========================================

Total codons= 333.

T C A G

1 25.00 34.27 40.28 35.94

2 45.42 28.63 36.02 22.27

3 29.58 37.10 23.70 41.80

----- ----- ----- -----

= 100% 100% 100% 100%

1 21.32 25.53 25.53 27.63 = 100%

2 38.74 21.32 22.82 17.12 = 100%

3 25.23 27.63 15.02 32.13 = 100%

% 28.43 24.82 21.12 25.63 Observed, overall totals

% 29.65 23.25 23.95 23.15 Expected, even codons per acid

A C D E F G H I K L

20. 5. 13. 8. 22. 26. 10. 26. 12. 47.

O-E % -27. -11. -25. -61. 71. 10. 38. 52. -36. 59.

M N P Q R S T V W Y

9. 13. 17. 8. 11. 24. 17. 25. 7. 9.

O-E % 14. -10. 1. -39. -41. 6. -11. 15. 64. -15.

Total acids= 329. Molecular weight= 36493. Hydrophobicity= 64.7

Figure 9.2 A worked example of calculating codon, base and amino acid compositions.

3. Accept "Show observed counts". The alternative is to have the counts for each amino acid type sum to 100.

4. Accept "Define segments using keyboard". The alternative is to use an EMBL/GenBank feature table.

5. Define "From". The start of the segment to count over.

6. Define "To". The end of the segment.

7. Accept "+ strand". Alternatively the minus strand.

The table will appear on the screen and the program will cycle round to step 5. When all segments have been defined a zero value for "From" will instruct the program to display on the screen a table which is the sum of all the individual tables.

8. Define "Name for codon table file". Give the name of the file in which to save the final table.

2.5 Plotting the base composition

This function plots the base composition for each "window length" of the sequence. The frequency of any combination of bases can be plotted.

1. Select "Plot base composition".

2. Select which combination of bases to plot. The default is A+T, but any single base or combination of bases can be used.

3. Select "Odd window length". This is the size of window over which each count is made. It is "odd" so that the plotted point corresponds exactly to the centre of each window. The count is made over the window and then the window is moved on by 1 base, and the count repeated.

4. Define "Plot interval". Especially when using long windows it is unnecessary to plot the results for every point along the sequence. A plot interval of 5 will mean the value for every fifth point will be plotted.

The plot will appear in the form shown in figure 9.3.

Figure 9.3 A typical base composition plot. This is an A+T plot for bacteriophage Lambda and shows that one half is A+T rich and the other G+C rich.
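The windowed count can be sketched as below. This is a minimal illustration under stated assumptions: the function name window_composition is hypothetical, values are reported as percentages of the window, and windows that would run off either end of the sequence are simply skipped.

```python
def window_composition(seq, bases="AT", window=31, interval=1):
    """Return [(centre position, % of window made up of `bases`), ...]."""
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    seq = seq.upper()
    half = window // 2
    points = []
    # `centre` is the 0-based index of the middle base of each window.
    for centre in range(half, len(seq) - half, interval):
        chunk = seq[centre - half:centre + half + 1]
        count = sum(chunk.count(b) for b in bases)
        points.append((centre + 1, 100.0 * count / window))
    return points

print(window_composition("AATTGGCC", "AT", window=3))
```

A plot interval greater than 1 only thins out the points that are returned; each value is still computed over a full window.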

2.6 Searching for anomalous compositions

This "search" is performed by comparing a standard composition against each segment of the sequence and plotting the difference. The difference between the observed and expected composition at each point is expressed as the chi-square value. Any one of the base, dinucleotide or trinucleotide compositions can be used as the standard. No expected level of divergence is used, so the program always scales the results so that the plots fill the allotted space on the screen. At the end the observed range is displayed.

1. Select "Plot dinucleotide composition differences as chi squared". Alternatively select base or trinucleotides.

2. Define "Start". Define the position of the first base to be used in the standard.

3. Define "End". Define last base of the standard. The default standard region is the whole sequence.

4. Define "Odd window length".

5. Define "Plot interval". The plot will appear as in figure 9.4.

Figure 9.4 An anomalous composition plot. This shows an immunoglobulin switch region and the plateau corresponds to a segment composed entirely of A and G bases.
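One way to compute such a profile for the base composition is sketched below. The exact statistic used by the program is not spelled out in this manual, so the sketch assumes the usual chi-square form, the sum over the four bases of (observed - expected)^2 / expected, with the expected counts taken from the standard region. The function name chi_square_profile is hypothetical.

```python
def chi_square_profile(seq, window=31, std_start=1, std_end=None):
    """Return [(centre position, chi-square value), ...] for base composition.

    The standard composition is taken from seq[std_start..std_end]
    (1-based, inclusive); by default the whole sequence is used.
    """
    seq = seq.upper()
    standard = seq[std_start - 1:std_end]
    freq = {b: standard.count(b) / len(standard) for b in "TCAG"}
    half = window // 2
    profile = []
    for centre in range(half, len(seq) - half):
        chunk = seq[centre - half:centre + half + 1]
        # Bases absent from the standard are skipped to avoid dividing by zero.
        chi2 = sum((chunk.count(b) - freq[b] * window) ** 2 / (freq[b] * window)
                   for b in "TCAG" if freq[b] > 0)
        profile.append((centre + 1, chi2))
    return profile

print(chi_square_profile("ACGT" * 5, window=5)[:3])
```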

2.7 Search for anomalous word usage

This function is designed to examine the abundances of short words in a nucleotide sequence to see if particular ones are either under or over represented (3). It compares the observed and expected frequencies and plots them for each segment of the sequence. There has been some work on the relative abundances of CG dinucleotides in eukaryotic sequences (e.g. reference 4) and this routine can be used to examine such biases or any others that might be of interest.

1. Select "Plot observed-expected word usage".

2. Define "String". That is the word to search for. The default is CG.

3. Define "Odd window length".

4. Define "Plot interval".

5. Define "Maximum plot value". Define the maximum expected value for the plot.

6. Define "Minimum plot value". The plot will appear as in figure 9.5.

Figure 9.5 A plot of anomalous word usage. This shows a plot of CG usage for the Human CMV immediate-early region. The frequency of CG is much lower than would be expected from the composition.
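The calculation described in note 2 of this chapter can be sketched as follows. The scaling of the expected value is an assumption made for this illustration: the product of the window's base frequencies is multiplied by the number of positions at which the word could start. The function name word_usage_profile is hypothetical.

```python
def word_usage_profile(seq, word="CG", window=31):
    """Return [(centre position, observed - expected occurrences), ...]."""
    seq, word = seq.upper(), word.upper()
    half = window // 2
    positions = window - len(word) + 1   # possible start points in a window
    profile = []
    for centre in range(half, len(seq) - half):
        chunk = seq[centre - half:centre + half + 1]
        observed = sum(chunk.startswith(word, i) for i in range(positions))
        # Random ordering: multiply the frequencies of the word's bases.
        expected = float(positions)
        for base in word:
            expected *= chunk.count(base) / window
        profile.append((centre + 1, observed - expected))
    return profile

print(word_usage_profile("CGCGCGCGC", "CG", window=9))
```

Negative values mark windows depleted in the word and positive values mark enriched windows.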

2.8 Calculate codon constraint

This method measures the level of constraint imposed on a sequence by coding for a protein. The codon constraint is the difference between the observed codon improbability and the mean improbability for a sequence of the same composition. That is, it is a measure of the codon bias, and the program performs the calculation over windows of length 99 codons. See reference 5. The user can select segments to analyse either by defining them on the keyboard or by using an EMBL/GenBank feature table. The result for each selected segment, which is simply a single number, is displayed.

1. Select "Calculate codon constraint".

2. Accept "Define segments using keyboard".

3. Define "From". The start of the segment.

4. Define "To". The end of the segment.

5. Accept "+ strand".

The result will be displayed, and the program will ask for the next segment to be defined.

2.9 Searching for stem-loop structures

This routine finds simple putative stem-loop structures having a minimum number of base pairs in their stems. Results can be listed or plotted.

1. Select "Search for hairpin loops".

2. Define "Minimum loop size".

3. Define "Maximum loop size".

4. Define "Minimum number of base pairs".

5. Reject "Plot results". Rejecting the plot writes out the stem-loops as text, as shown in figure 9.6. The plotted output marks the position of each stem, the height of the mark showing the length of the stem.

g

g.t

t.g

c-g

a-t

t.g

t.g

g-c

t.g

g.t

g.t

t.g

t.g

g-c

t.g

tggcga gttttaa

843

Figure 9.6 A typical textual display from the routine for finding simple hairpin loops.
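A simple search of this kind can be sketched as below. It follows the rule given in note 4 of this chapter (A-T, G-C and G-T count as pairs, and stems contain no mismatches or loopouts), but the scanning strategy and the function name find_stem_loops are assumptions made for illustration, not a description of the program's own algorithm.

```python
# Pairs counted as matching: A-T, G-C and G-T, in either orientation.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}

def find_stem_loops(seq, min_loop=3, max_loop=8, min_pairs=4):
    """Return [(stem start, number of base pairs, loop size), ...].

    stem start is the 1-based position of the first base of the 5' arm.
    Every allowed loop size is tried at every position.
    """
    seq = seq.upper()
    hits = []
    for left in range(len(seq)):             # last base of the 5' arm
        for loop in range(min_loop, max_loop + 1):
            right = left + loop + 1          # first base of the 3' arm
            pairs = 0
            # Extend the stem outwards while the bases continue to pair.
            while (left - pairs >= 0 and right + pairs < len(seq)
                   and (seq[left - pairs], seq[right + pairs]) in PAIRS):
                pairs += 1
            if pairs >= min_pairs:
                hits.append((left - pairs + 2, pairs, loop))
    return hits

print(find_stem_loops("GGGGAAAACCCC", min_loop=4, max_loop=4, min_pairs=4))
```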

2.10 Searching for long range inverted repeats

This method finds inverted repeats. It allows for no mismatches, insertions or deletions within the matching segments.

1. Select "Find long range inverted repeats".

2. Accept "Plot results". The alternative lists out all the matching segments.

3. Define "Start". The beginning of the region to analyse. In general the whole sequence will be analysed.

4. Define "End".

5. Define "Minimum inverted repeat". The length of the minimum match.

The results will now be plotted in an unusual way as shown in figure 9.7, in which the positions of matching segments are joined by rectangular lines.

Figure 9.7 A plot of direct or inverted repeats. Each matching segment is joined by a rectangular line. Here we show the direct repeats of at least 25 bases in a mouse immunoglobulin switch region.

2.11 Searching for long range repeats

This method finds direct repeats. It allows for no mismatches, insertions or deletions within the matching segments.

1. Select "Find long range repeats".

2. Accept "Plot results". The alternative lists out all the matching segments.

3. Define "Start". The beginning of the region to analyse. In general the whole sequence will be analysed.

4. Define "End".

5. Define "Minimum repeat". The length of the minimum match.

The results will now be plotted in an unusual way as shown in figure 9.7 in which the positions of matching segments are joined by rectangular lines.
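Because only exact matches are sought, both the direct and the inverted repeat searches can be sketched with a simple index of words of the minimum length. The function name exact_repeats is hypothetical, and unlike the program this sketch reports every matching word separately rather than joining adjacent matches into longer segments.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def exact_repeats(seq, min_length, inverted=False):
    """Return sorted (position, position) pairs of exact matches, 1-based."""
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_length + 1):
        index[seq[i:i + min_length]].append(i + 1)
    matches = []
    for word, positions in index.items():
        if inverted:
            # An inverted repeat matches the reverse complement of the word.
            partner = word.translate(COMPLEMENT)[::-1]
            matches += [(p, q) for p in positions
                        for q in index.get(partner, []) if p < q]
        else:
            matches += [(p, q) for i, p in enumerate(positions)
                        for q in positions[i + 1:]]
    return sorted(matches)

print(exact_repeats("ACGTGGACGT", 4))
print(exact_repeats("AAAGCCCCGCTTT", 3, inverted=True))
```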

2.12 Searching for repeated words

This function can be used to examine the frequencies of repeated words within a sequence. It finds all words that occur more than once. A "word" is a particular sequence of bases so we are dealing only with exact repeats. The user selects a minimum word length and the program finds all words of that length that occur more than once. Then it "follows" each repeated word until it becomes unique. For each word length it can report the number of different repeated words, the number of occurrences of each word, and their actual sequences and positions.

1. Select "Examine repeats".

2. Define "Minimum word length". The maximum expected and observed word lengths are displayed.

3. Define "Minimum word length for display of repeated word frequencies". The number of different repeated words of each length is listed.

4. Define "Minimum frequency for display of repeated words".

5. Define "Minimum word length for display of repeated words". All words occurring this number of times and of this given word length will be displayed.

Expected length of longest repeat 12

? Minumim word length (1-6) (6) =

Working

Memory used in bytes 75164. Length of longest repeat 13

? Show repeat frequencies for words of at least length (6-13) (13) = 10

For length 10 the number of different repeated words is 86

For length 11 the number of different repeated words is 21

For length 12 the number of different repeated words is 5

For length 13 the number of different repeated words is 2

? Show repeats for words of length (6-13) (13) = 10

? Show repeats for words occuring with frequency (2-9999) (2) = 3

aaggcatcat

occurs at 276

occurs at 969

occurs at 6938

gtctggcggc

occurs at 1891

occurs at 4714

occurs at 7250

? Show repeats for words of length (6-13) (13) = 12

? Show repeats for words occuring with frequency (2-9999) (2) =

gttactggtggt

occurs at 641

occurs at 851

aaaggcatcatg

occurs at 968

occurs at 6937

aaggcatcatgg

occurs at 969

occurs at 6938

ttactggtggtg

occurs at 642

occurs at 852

ctgctgggccgt

occurs at 3477

occurs at 6424

? Show repeats for words of length (6-13) (13) =!

Figure 9.8 Typical output from "Examine repeats".
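The kind of report shown in figure 9.8 can be approximated with the sketch below. It is a simplified illustration: the function names repeated_words and repeat_summary are hypothetical, and each word length is indexed afresh rather than "following" each repeated word as the program does, which is slower but gives the same counts.

```python
from collections import defaultdict

def repeated_words(seq, length):
    """Return {word: [1-based positions]} for words occurring more than once."""
    seq = seq.lower()
    where = defaultdict(list)
    for i in range(len(seq) - length + 1):
        where[seq[i:i + length]].append(i + 1)
    return {word: pos for word, pos in where.items() if len(pos) > 1}

def repeat_summary(seq, min_length):
    """Return {word length: number of different repeated words}."""
    summary = {}
    length = min_length
    while True:
        found = repeated_words(seq, length)
        if not found:
            break                      # no word of this length is repeated
        summary[length] = len(found)
        length += 1
    return summary

print(repeat_summary("acgtacgtaa", 4))
print(repeated_words("acgtacgtaa", 5))
```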


2.13 Searching for possible Z DNA

The program contains three algorithms for searching for sequences with the potential for forming Z DNA. In varying ways they look for segments of alternating purines and pyrimidines and they all plot their results. A typical result is shown in figure 9.9.

Figure 9.9 A plot of predictions for potential Z DNA containing some high peaks produced by regions of alternating purines and pyrimidines.

3. Notes

1. Whenever the program reads a sequence file it always displays the base composition to provide the user with a check on the correctness of the file.

2. The search for anomalous words function operates in the following way. Users select a "word" (say CG) and a window length. The program examines each successive window along the sequence, with each window overlapping the previous one by window length - 1 bases. For each window position the program calculates the base composition and the number of occurrences of the chosen word. From the base composition it calculates an expected number of occurrences of the chosen word by simply multiplying the relevant frequencies and assuming random ordering. It plots observed - expected, hence showing regions that are enriched or depleted in the chosen word.

3. The codon constraint calculation offers a measure of the codon bias that is independent of any set tables of expected codons. Although some users may find the underlying mathematics difficult to understand, the values obtained provide an interesting measure. It was shown (5) for a set of E. coli genes that their values of codon constraint correlated with their levels of expression.

4. The algorithm for finding possible stem loops counts A-T, G-C and G-T pairs as matching but will only find stems with no mismatches or loopouts.

5. The long range inverted and direct repeat routines are fast but only find exact matches. More flexible and exhaustive methods are described in the chapter on sequence comparisons.

6. It is also possible to use the pattern searching routines to define and search for inverted and direct repeats. They are particularly useful for finding specific structures - for example tRNA folds.

7. It is possible that the "Examine repeats" algorithm may run out of memory, particularly if a short minimum word length is chosen or the sequence is very long or very repetitive. If this occurs the maximum word length reported may not be the longest in the sequence: the memory will have been consumed before it was found.

4. References

1. McCaldon,P. and Argos,P. 1988. Oligopeptide biases in protein sequences and their use in predicting protein coding regions in nucleotide sequences. Proteins 4, 99-122.

2. Sweet,R.M. and Eisenberg,D. 1983. Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J. Mol. Biol. 171, 479-488.

3. Honess,R.W., Gompels,U.A., Barrell,B.G., Craxton,M., Cameron,K.R., Staden,R., Chang,Y.-N. and Hayward,G.S. 1989. Deviations from expected frequencies of CpG dinucleotides in herpesvirus DNAs may be diagnostic of differences in the states of their latent genomes. J. Gen. Virol. 70, 837-855.

4. Bird,A.P. 1980. DNA methylation and the frequency of CpG in animal DNA. Nucl. Acids Res. 8, 1499-1504.

5. McLachlan,A.D., Staden,R. and Boswell,D.R. 1984. A method for measuring the non-random bias of a codon usage table. Nucl. Acids Res. 12, 9567-9575.

10. Translating and Listing Nucleic Acid Sequences

Table of contents

1. Introduction

2. Methods

2.1 Listing the sequence with all six reading frames translated

2.2 Listing the sequence with its open reading frames translated

2.3 Listing the sequence with defined segments translated

2.4 Listing the sequence with translated segments defined from a feature table

2.5 Producing a file of protein sequences for all open reading frames

2.6 Producing a file of protein sequences for segments defined from a feature table

3. Notes

1. Introduction

In this chapter we deal with producing simple listings from nucleotide sequences. All functions are contained in the program NIP. We can list the sequence alone, in single or double stranded format, or with translations to protein. The translations can be of all six phases, all open reading frames, or of specified segments. The positions of these segments can be defined on the keyboard or read from an EMBL/GenBank feature table. Translations can use the one letter or three letter codes. In addition we can produce files containing only the protein translations, which are suitable for processing by other programs. Again the positions of the translated segments can be defined on the keyboard, read from a feature table, or be all open reading frames. For the user, producing all these results is very simple, so we only give examples of "methods" and show what the results look like. All outputs that list the sequence can be produced from the menu option named "Translate and list".

2. Methods

2.1 Listing the sequence with all six reading frames translated

1. Select "Translate and list".

2. Accept "Show translation".

3. Select "The segments to translate will be "All six frames"".

4. Accept "Use 1 letter codes".

5. Define "Start". Where to list from.

6. Define "End". Where to list to.

7. Define "Line length". The number of characters in each line of output, which must be a multiple of 30.

8. Reject "Number ends of lines". This alternative writes the positions underneath each line. The listing will then appear. Given the choices taken it will look the same as figure 10.1.

Q D Y I G H H L N N L Q L D L R T F S L

R I T * D T T * I T F S W T C V H S R W

G L H R T P P E * P S A G P A Y I L A

caggattacataggacaccacctgaataaccttcagctggacctgcgtacattctcgctg

1010 1020 1030 1040 1050 1060

gtcctaatgtatcctgtggtggacttattggaagtcgacctggacgcatgtaagagcgac

L I V Y S V V Q I V K L Q V Q T C E R Q

P N C L V G G S Y G E A P G A Y M R A P

S * M P C W R F L R * S S R R V N E S

V D P Q N P P A T F W T I N I D S M F F

W I H K T P Q P P S G Q S I L T P C S S

G G S T K P P S H L L D N Q Y * L H V L

gtggatccacaaaaccccccagccaccttctggacaatcaatattgactccatgttcttc

1070 1080 1090 1100 1110 1120

cacctaggtgttttggggggtcggtggaagacctgttagttataactgaggtacaagaag

H I W L V G W G G E P C D I N V G H E E

P D V F G G L W R R S L * Y Q S W T R R

T S G C F G G A V K Q V I L I S E M N K

S V V L G L L F L V L F R S V A K K A T

R W C W V C C S W F Y S V A * P K R R P

L G G A G S V V P G F I P * R S Q K G D

tcggtggtgctgggtctgttgttcctggttttattccgtagcgtagccaaaaaggcgacc

1130 1140 1150 1160 1170 1180

agccaccacgacccagacaacaaggaccaaaataaggcatcgcatcggtttttccgctgg

R H H Q T Q Q E Q N * E T A Y G F L R G

P P A P D T T G P K I G Y R L W F P S W

E T T S P R N N R T K N R L T A L F A V

S G V P G K F Q T A I E L V I G F V N G

A V C Q V S F R P R L S W * S A L L M V

Q R C A R * V S D R D * A G D R L C * W

agcggtgtgccaggtaagtttcagaccgcgattgagctggtgatcggctttgttaatggt

1190 1200 1210 1220 1230 1240

tcgccacacggtccattcaaagtctggcgctaactcgaccactagccgaaacaattacca

A T H W T L K L G R N L Q H D A K N I T

R H A L Y T E S R S Q A P S R S Q * H Y

L P T G P L N * V A I S S T I P K T L P

Figure 10.1 A six phase translation using the 1 letter codes
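A six phase translation of this kind can be sketched as follows. The sketch assumes the standard genetic code (the program lets users select others with "Set genetic code"), writes unknown codons as X, and returns each frame as a plain string rather than laying the frames out around the sequence as in figure 10.1. The names translate and six_frames are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons generated in T, C, A, G order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base using the 1 letter codes."""
    dna = dna.upper()
    return "".join(CODE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the translations of all six frames keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames

# The first line of sequence shown in figure 10.1.
line = "caggattacataggacaccacctgaataaccttcagctggacctgcgtacattctcgctg"
print(six_frames(line)["+1"])   # QDYIGHHLNNLQLDLRTFSL, as in the figure
```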

2.2 Listing the sequence with its open reading frames translated

1. Select "Translate and list".

2. Accept "Show translation".

3. Select "The segments to translate will be "Open reading frames"".

4. Define "Minimum open frame in amino acids".

5. Accept "Use 1 letter codes".

6. Define "Start". Where to list from.

7. Define "End". Where to list to.

8. Define "Line length". The number of characters in each line of output, which must be a multiple of 30.

9. Select "Both strands".

10. Accept "Number ends of lines". A typical result is shown in figure 10.2.

Q D Y I G H H L N N L Q L D L R T F S L

caggattacataggacaccacctgaataaccttcagctggacctgcgtacattctcgctg 1060

. : . : . : . : . : . :

gtcctaatgtatcctgtggtggacttattggaagtcgacctggacgcatgtaagagcgac

L I V Y S V V Q I V K L Q V Q T C E R Q

* S S R R V N E S

V D P Q N P P A T F W T I N I D S M F F

gtggatccacaaaaccccccagccaccttctggacaatcaatattgactccatgttcttc 1120

. : . : . : . : . : . :

cacctaggtgttttggggggtcggtggaagacctgttagttataactgaggtacaagaag

H I W L V G W G G E P C D I N V G H E E

T S G C F G G A V K Q V I L I S E M N K

S V V L G L L F L V L F R S V A K K A T

tcggtggtgctgggtctgttgttcctggttttattccgtagcgtagccaaaaaggcgacc 1180

. : . : . : . : . : . :

agccaccacgacccagacaacaaggaccaaaataaggcatcgcatcggtttttccgctgg

R H H Q T Q Q E Q N * E T A Y G F L R G

E T T S P R N N R T K N R L T A L F A V

S G V P G K F Q T A I E L V I G F V N G

agcggtgtgccaggtaagtttcagaccgcgattgagctggtgatcggctttgttaatggt 1240

. : . : . : . : . : . :

tcgccacacggtccattcaaagtctggcgctaactcgaccactagccgaaacaattacca

A T H W T L K L G R N L Q H D A K N I T

L P T G P L N * V A I S S T I P K T L P

S V K D M Y H G K S K L I A P L A L T I

agcgtgaaagacatgtaccatggcaaaagcaagctgattgctccgctggccctgacgatc 1300

. : . : . : . : . : . :

tcgcactttctgtacatggtaccgttttcgttcgactaacgaggcgaccgggactgctag

A H F V H V M A F A L Q N S R Q G Q R D

L T F S M Y W P L L L S I A G S A R V I

Figure 10.2 A listing showing the translation of open reading frames from both strands of a sequence from position 1001 to 1300

2.3 Listing the sequence with defined segments translated

1. Select "Translate and list".

2. Accept "Show translation".

3. Select "The segments to translate will be "Typed on the keyboard"".

4. Accept "Use 1 letter codes".

5. Define "Start". Where to list from.

6. Define "End". Where to list to.

7. Define "Line length". The number of characters in each line of output, which must be a multiple of 30.

8. Select "Both strands".

9. Accept "Number ends of lines".

10. Define "Translate from". Define the start of the next segment to translate - say the next exon.

11. Define "Translate to". Define the end of the next segment to translate.

12. Select "Strand". As both strands have been selected above the program will allow either to be translated for each defined segment.

The program will now cycle around through steps 10, 11 and 12 until a zero value is defined for "Translate from", at which point the listing will appear. Given the choices made it will look the same as figure 10.2.

2.4 Listing the sequence with translated segments defined from a feature table

1. Select "Translate and list".

2. Accept "Show translation".

3. Select "The segments to translate will be "Read from a feature table"".

4. Define "Feature table file name". Type the name of the file containing the appropriate feature table in EMBL/GenBank format.

5. Define "Operator". This defines which feature table operators should be employed when selecting the segments to translate.

6. Accept "Use 1 letter codes".

7. Define "Start". Where to list from.

8. Define "End". Where to list to.

9. Define "Line length". The number of characters in each line of output, which must be a multiple of 30.

10. Select "Both strands".

11. Accept "Number ends of lines".

The program will now read the feature table file and translate the segments defined using the selected operator(s) and the listing will appear as in figure 10.2.

2.5 Producing a file of protein sequences for all open reading frames

1. Select "Translate and write protein sequences to disk".

2. Reject "Translate selected regions". The alternative is "Open reading frames".

3. Define "Minimum open frame in amino acids".

4. Select "Both strands".

5. Define "File name for translation".

A typical results file is shown in figure 10.3. The file is written in FASTA format: each entry has an entry name line starting with a > symbol (here the first entry name is 188, the start of the DNA segment), followed by a title (here in EMBL feature table format, giving the start and end of the DNA that produced the protein), followed by the sequence terminated by an *.

>188 188..733

TMEVNKKQLADIFGASIRTIQNWQEQGMPVLRGGGKGNEVLYDSAAVIKWYAERDAEIEN

EKLRREVEELRQASEADLQPGTIEYERHRLTRAQADAQELKNARDSAEVVETAFCTFVLS

RIAGEIASILDGLPLSVQRRFPELENRHVDFLKRDIIKAMNKAAALDELIPGLLSEYIEQ

SG*

>711 711..2633

VNISNSQVNRLRHFVRAGLRSLFRPEPQTAVEWADANYYLPKESAYQEGRWETLPFQRAI

MNAMGSDYIREVNVVKSARVGYSKMLLGVYAYFIEHKQRNTLIWLPTDGDAENFMKTHVE

PTIRDIPSLLALAPWYGKKHRDNTLTMKRFTNGRGFWCLGGKAAKNYREKSVDVAGYDEL

AAFDDDIEQEGSPTFLGDKRIEGSVWPKSIRGSTPKVRGTCQIERAASESPHFMRFHVAC

PHCGEEQYLKFGDKETPFGLKWTPDDPSSVFYLCEHNACVIRQQELDFTDARYICEKTGI

WTRDGILWFSSSGEEIEPPDSVTFHIWTAYSPFTTWVQIVKDWMKTKGDTGKRKTFVNTT

LGETWEAKIGERPDAEVMAERKEHYSAPVPDRVAYLTAGIDSQLDRYEMRVWGWGPGEES

WLIDRQIIMGRHDDEQTLLRVDEAINKTYTRRNGAEMSISRICWDTGGIDPTIVYERSKK

HGLFRVIPIKGASVYGKPVASMPRKRNKNGVYLTEIGTDTAKEQIYNRFTLTPEGDEPLP

GAVHFPNNPDIFDLTEAQQLTAEEQVEKWVDGRKKILWDSKKRRNEALDCFVYALAALRI

SISRWQLDLSALLASLQEEDGAATNKKTLADYARALSGEDE*

>74 complement(74..727)

LFDIFTQQPRYQFIQRGCFVHGFDDIPFQEINMSVFQFRKTPLHRQGEPVENTGNFTCDP

RQHESTECGFHHFSGVSGILQFLCVGLRTRKSMAFVLNSSWLEICLAGLPQFFNLPAQLF

VLNFSIPFGIPFYDGGRVIKHLITLATASQNGHSLFLPVLNGTDTRTENVSQLLFVDFHC

SFHGQKQRKETTEAKKPRFQHLSFPFFSEGILNKNIKL*

>313 complement(313..732)

PDCSIYSLSNPGISSSSAAALFMALMISRFRKSTCRFSSSGKRRCTDRGSPSRILAISPA

IRDSTKVQNAVSTTSAESLAFFSSCASACARVSRWRSYSIVPGWRSASLACRSSSTSRRS

FSFSISASLSAYHFMTAAES*

Figure 10.3 The contents of a file containing the protein sequences of the open reading frames found by the program
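The sketch below shows one way such a file could be produced. Several points are assumptions made for illustration: open reading frames are taken to run from one stop codon to the next, the reported coordinates exclude the stop codon (which is consistent with the lengths in figure 10.3), only the plus strand is searched, and frames that reach the end of the sequence without a stop are ignored. The names open_frames and to_fasta are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons generated in T, C, A, G order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}

def open_frames(dna, min_acids=30):
    """Return sorted [(first base, last base, protein + '*'), ...], 1-based."""
    dna = dna.upper()
    records = []
    for frame in range(3):
        start, protein = frame, []
        for i in range(frame, len(dna) - 2, 3):
            acid = CODE.get(dna[i:i + 3], "X")
            if acid == "*":
                if len(protein) >= min_acids:
                    # i is the 0-based start of the stop codon, so the last
                    # coding base is at 1-based position i.
                    records.append((start + 1, i, "".join(protein) + "*"))
                start, protein = i + 3, []
            else:
                protein.append(acid)
    return sorted(records)

def to_fasta(records, width=60):
    """Format records as in figure 10.3: '>start start..end' then sequence."""
    lines = []
    for first, last, protein in records:
        lines.append(">%d %d..%d" % (first, first, last))
        lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines) + "\n"

print(to_fasta(open_frames("ATGGCCAAATAAGG", min_acids=3)))
```

Handling the minus strand would require translating the reverse complement and converting the coordinates back, written as complement(start..end) as in the figure.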

2.6 Producing a file of protein sequences for segments defined from a feature table

1. Select "Translate and write protein sequences to disk".

2. Accept "Translate selected regions".

3. Reject "Define segments using keyboard". The alternative is to use a feature table.

4. Define "Feature table file name". Type the name of the file containing the appropriate feature table in EMBL/GenBank format.

5. Define "Operator". This defines which feature table operators should be employed when selecting the segments to translate.

6. Define "File name for translation".

The program will now read the feature table file and translate the segments defined using the selected operator(s). The results will be stored as in figure 10.3.

3. Notes

1. To produce a listing without translation the "Translate and list" function can be used with the "Show translation" option rejected. Alternatively the function "List the sequence" can be used.

2. Some users may be confused by the fact that the program asks "Where to list from, and to" and also "Define segments to translate". This allows for 5' and 3' untranslated regions to be included in the listing.

3. The feature table file employed by the programs is a simple text file containing the data for the current sequence. Because of the multiplicity of different sequence library formats we have not provided the facility of reading such data directly from libraries. The feature tables for individual library entries must be extracted (see the introductory chapter) or files can be created for new sequences.

4. The current feature tables use "operators" such as "join" or "order" to specify which segments should be translated together to make a complete protein sequence. The program allows users to select which ones to employ, the default being "Use all operators".

5. The program contains a function "Set genetic code" which allows users to choose from a menu of codes or to define their own by specifying amino acid and codon pairs. This sets the code for all functions.

11. Statistical and Structural Analysis of Protein Sequences

Table of contents

1. Introduction

2. Methods

2.1 Plotting hydrophobicity

2.2 Plotting charge

2.3 Plotting hydrophobic moment and hydrophobicity

2.4 Drawing helical wheels

2.5 Producing a Robson secondary structure prediction

2.6 Calculating the amino acid composition and molecular weight

3. Notes

4. References

1. Introduction

In this chapter we describe the use of routines for plotting hydrophobicity, charge and hydrophobic moments, drawing helix wheels and predicting secondary structure. Use of all these routines is very straightforward and they are contained in the program PIP.

2. Methods

2.1 Plotting hydrophobicity

This method uses the values of Kyte and Doolittle (1).

1. Select "Plot hydrophobicity".

2. Define "Window length".

3. Define "Plot interval". The plot will appear as in figure 11.1.

2.2 Plotting charge

1. Select "Plot charge".

2. Define "Window length".

3. Define "Plot interval". The plot will appear and will be similar to that shown in figure 11.1.

Figure 11.1 A hydrophobicity plot using the values of Kyte and Doolittle.
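The values behind a hydrophobicity plot can be sketched as below, using the published Kyte and Doolittle hydropathy values. Two details are assumptions made for this illustration: the mean over each window is reported, and each value is assigned to the centre of its window. The function name hydrophobicity_profile is hypothetical.

```python
# Hydropathy values of Kyte and Doolittle.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_profile(protein, window=11, interval=1):
    """Return [(centre residue number, mean hydropathy of window), ...]."""
    values = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    profile = []
    for i in range(0, len(values) - window + 1, interval):
        mean = sum(values[i:i + window]) / window
        profile.append((i + window // 2 + 1, mean))
    return profile

print(hydrophobicity_profile("MKTLLLTLVVVTIVCLDLGYT", window=7))
```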

2.3 Plotting hydrophobic moment and hydrophobicity

This method plots the hydrophobic moment and the hydrophobicity as defined by Eisenberg et al (2).

1. Select "Plot hydrophobic moment".

2. Define "Angle". This is the angle between the residues when the helix is viewed end on. The default value of 100 degrees is that found in alpha helices.

3. Define "Window length". The default of 18, if used in conjunction with the default "Angle", is equivalent to 5 turns of the helix.

4. Define "Plot interval". The plot will appear as in figure 11.2, with the hydrophobicity shown above the hydrophobic moment. The scale for the hydrophobicity runs from -1.0 to 1.5 and for the hydrophobic moment from 0.0 to 1.5. The program plots the mean values for each window position, with the value at position x representing the segment from x - window length + 1 to x.

Figure 11.2 A hydrophobic moment (below) and hydrophobicity plot. The hydrophobicity plot displays the mean values on a scale of -1.5 to 1.0 and the hydrophobic moment on a scale of 0.0 to 1.5.
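The relationship between "Angle" and "Window length", and the moment itself, can be sketched in Python as below. The hydrophobic moment of Eisenberg et al (2) is the length of the vector sum of the residue hydrophobicities when each residue is placed at its angle around the helix axis. The function names are hypothetical, the caller is assumed to supply the hydrophobicity values for one window, and dividing by the window length to give a mean is an assumption based on the statement that mean values are plotted.

```python
import math


def turns_in_window(window_length, angle_degrees=100.0):
    """Number of helix turns spanned by a window of residues."""
    return window_length * angle_degrees / 360.0


def mean_hydrophobic_moment(values, angle_degrees=100.0):
    """Mean hydrophobic moment for one window of hydrophobicity values.

    Residue n (counting from 0) is placed at angle n * angle_degrees.
    """
    delta = math.radians(angle_degrees)
    sin_sum = sum(h * math.sin(n * delta) for n, h in enumerate(values))
    cos_sum = sum(h * math.cos(n * delta) for n, h in enumerate(values))
    return math.hypot(sin_sum, cos_sum) / len(values)
```

With the defaults, 18 residues at 100 degrees per residue span exactly 5 turns, so a window of identical residues gives a moment of zero: the contributions cancel around the helix.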

2.4 Drawing helical wheels

This method draws helical wheels for any segment of the sequence (3). In addition it displays the hydrophobic moment for the segment (2).

1. Select "Draw helix wheel".

2. Define "Angle". The default angle of 100 degrees is that found in alpha helices.

3. Define "Window length". The default of 18, if used in conjunction with the default "Angle", is equivalent to 5 turns of the helix.

4. Define "Step". To produce a display for a sequence position N residues from the current one type N, and the display will appear in place of the previous one. The default value of N is 1, so by repeatedly hitting carriage return the user can step, residue by residue, through the sequence, displaying the helix wheel for each position. The display for the current position in the sequence will appear as in figure 11.3 and the bell will ring.

Figure 11.3 A typical helix wheel display using a window of only 13 residues. The display includes a schematic of the helix showing the links between residues, with each vertex numbered according to position; the residue type at each vertex; a symbol denoting a classification as hydrophobic (.), positively charged (+), negatively charged (-), or otherwise (). The residue number of the first sequence element in the current window is displayed at the top left corner along with the sequence. Below this is the total hydrophobicity and hydrophobic moment according to Eisenberg et al (2).

2.5 Producing a Robson secondary structure prediction

This routine uses the method of Garnier et al (4) to predict the positions of alpha helices, beta sheets, turns and random coil. The results can be either plotted or listed.

1. Select "Robson secondary structure prediction".

2. Accept "Plot results". The alternative produces a listing like that shown in figure 11.4. The plot will appear as in figure 11.5, and the program also prints a count of the number of positions at which each of the 4 structure types is the highest scoring.

350 P 274 -178 -84 -77

351 L 16 -192 -21 -38

352 K 371 -223 -75 -68

353 L 365 -152 -101 -65

354 S 331 -82 -84 -63

355 K 311 -43 -110 -88

356 A 280 -23 -110 -80

357 V 234 -12 -135 -75

358 H 177 -10 -143 -92

359 K 153 2 -180 -138

360 A 158 52 -175 -130

361 V 144 78 -187 -115

362 L 132 58 -186 -80

363 T 124 63 -142 -78

364 I 144 32 -111 -43

365 D 120 -49 -29 5

366 E 103 -80 13 43

367 K 111 -113 23 42

368 G 132 -127 -13 64

369 T 172 -132 -42 52

370 E 216 -170 -122 -4

Figure 11.4 A listing of the Robson secondary structure prediction. It includes the sequence position, the residue type and the values for the four structure classes.

Figure 11.5 A secondary structure plot using the method of Robson. The likelihoods that each 17 residue segment of the sequence forms each of the four structure classes, helix (H), extended (E) (normally termed sheet), turn (T) and coil (C), are plotted across the screen in four strips. Below this is a "decision" strip (D) in which a single dot is plotted for the highest scoring structure class at each point. Here we see a sequence that is predicted to be predominantly helical.
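The "decision" strip and the printed counts can be reproduced from a listing such as that in figure 11.4 with a few lines of Python. This is only a sketch: the function names are hypothetical, and the order of the four value columns is assumed to be helix, extended, turn, coil, which the listing itself does not state.

```python
from collections import Counter

# Assumed order of the four value columns in the listing.
CLASSES = ("H", "E", "T", "C")


def decisions(listing):
    """Return (position, residue, best_class) for each line of a listing."""
    result = []
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        position, residue = int(fields[0]), fields[1]
        scores = [int(value) for value in fields[2:]]
        best = CLASSES[scores.index(max(scores))]
        result.append((position, residue, best))
    return result


def class_counts(listing):
    """Count the positions at which each class is the highest scoring."""
    counts = Counter(best for _, _, best in decisions(listing))
    return {name: counts.get(name, 0) for name in CLASSES}
```

Applied to the lines of figure 11.4, every position is assigned to the first column, which agrees with the statement that the sequence shown is predicted to be predominantly helical.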

2.6 Calculating the composition and molecular weight of a sequence.

Select "Count amino acid composition". The composition and molecular weight are displayed as in figure 11.6. Each column contains the one letter code for the amino acid (row A), the number of occurrences of that amino acid in the sequence (row N), that number expressed as a percentage (row %), and its contribution to the molecular weight (row W).

Sequence composition

A C S T P A G N D E Q B Z H

N 0. 14. 19. 12. 30. 26. 3. 10. 11. 4. 0. 0. 0.

% 0.0 5.3 7.3 4.6 11.5 9.9 1.1 3.8 4.2 1.5 0.0 0.0 0.0

W 0. 1219. 1921. 1165. 2132. 1483. 342. 1151. 1420. 513. 0. 0. 0.

A R K M I L V F Y W - X ?

N 7. 7. 10. 15. 39. 23. 13. 11. 8. 0. 0. 0. 0.

% 2.7 2.7 3.8 5.7 14.9 8.8 5.0 4.2 3.1 0.0 0.0 0.0 0.0

W 1093. 897. 1312. 1697. 4413. 2280. 1913. 1795. 1490. 0. 0. 0. 0.

Total molecular weight= 28256.254

Figure 11.6 A typical molecular weight and composition display. It includes the residue type, their number, their percentage and their contribution to the molecular weight.
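The figures in such a display can be checked with a short Python sketch. The residue masses below are standard average values for amino acid residues (the free amino acid less one water); they are not taken from the program, although they agree with the contributions shown in figure 11.6 (for example 14 serines contribute about 1219). Adding one water molecule to the total is an assumption, as are the function names.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.015  # added once for the free amino and carboxyl ends (assumption)


def composition(sequence):
    """Return {residue: (number, percentage, weight contribution)}."""
    counts = Counter(sequence.upper())
    length = len(sequence)
    table = {}
    for residue, number in counts.items():
        mass = RESIDUE_MASS.get(residue, 0.0)  # unknown symbols weigh nothing
        table[residue] = (number, 100.0 * number / length, number * mass)
    return table


def molecular_weight(sequence):
    """Sum of the residue contributions plus one water."""
    if not sequence:
        return 0.0
    return sum(weight for _, _, weight in composition(sequence).values()) + WATER
```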

3. Notes

1. The methods described in the chapters on motif and pattern searching can also be used to search for specific structures. For example a sequence can be searched for all the structures contained in the PROSITE motif library.

2. It is often convenient to produce displays in which several of the plots described above appear together on the screen.

4. References

1. Kyte, J. and Doolittle, R.F. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157:105-132.

2. Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. 1984. Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol. 179:125-142.

3. Schiffer, M. and Edmundson, A.B. 1967. Use of helical wheels to represent the structures of proteins and to identify the segments with helical potential. Biophys. J. 7:121-135.

4. Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97-120.

12. Searching for Motifs in Protein Sequences

Table of contents

1. Introduction

2. Methods

2.1 Searching for exact matches.

2.2 Searching for percentage matches to consensus sequences

2.3 Searching for consensus sequences using a score matrix

2.4 Using weight matrices for searching protein sequences

3. Notes

4. References

1. Introduction

The program PIP contains several ways of defining and searching for motifs (1,2). We describe searches for exact matches and percentage matches, the use of score matrices and the creation and use of weight matrices. All of the searches produce both listed and graphical output.

2. Methods

2.1 Searching for exact matches.

The routine for finding and displaying the positions of exact matches to sequences can display its results in various forms. It is equivalent to the restriction enzyme search routine in the nucleotide analysis programs. The sequences to be searched for can be typed on the keyboard or read from files. The format of these files is given in the notes. Here we give only a single example of the use of the routine which shows how to produce a plot of the positions of all amino acid types in a sequence.

1. Select "Search".

2. Select "Input source" as "All acids file". A number of standard files are available and users may also have their own. The one selected simply contains the one letter codes for all the standard amino acids.

3. Accept "Search for all names". The alternative allows users to select a subset of the entries in the file by name.

4. Select "Order results name by name".

5. Reject "List matches". If results are listed the output gives the name and position of each match and also the separations between matches. The results will then appear in the form shown in figure 12.1.

Figure 12.1 Typical graphical output from "Search for exact matches" in which the position of each matching string (here individual amino acid types) is marked.

2.2 Searching for percentage matches to sequences

1. Select "Find percentage matches".

2. Accept "Type in strings". The alternative allows the string to be extracted from a named file.

3. Reject "Keep picture". This will cause the graphics window to be cleared. The alternative leaves it unchanged.

4. Define "String". Type in the search string. When the program cycles round to this point again the previous string will be offered as a default.

5. Define "Percent match". The search is performed, the results are presented graphically, the number of matches displayed, and the scores and positions of the top 10 matches displayed.

6. Define the number of matches to "Display". For the number of matches chosen the program will display the search string and matching sequence written one above the other with matching characters indicated by asterisk symbols. The program now cycles round to step 3.
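The search performed in steps 4 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the routine used by PIP: the function names are hypothetical, the score is taken to be the number of identical characters, and ties are listed in order of position.

```python
def percentage_matches(sequence, string, percent_match):
    """Return (score, position) pairs, best first; positions are 1-based.

    A segment matches if at least percent_match per cent of its
    characters are identical to those of the search string.
    """
    hits = []
    for start in range(len(sequence) - len(string) + 1):
        segment = sequence[start:start + len(string)]
        score = sum(1 for a, b in zip(segment, string) if a == b)
        # Integer arithmetic avoids rounding problems at the threshold.
        if score * 100 >= percent_match * len(string):
            hits.append((score, start + 1))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits


def display(sequence, string, position):
    """Matching segment and search string, with asterisks under identities."""
    segment = sequence[position - 1:position - 1 + len(string)]
    marks = "".join("*" if a == b else " " for a, b in zip(segment, string))
    return "\n".join([segment, marks, string])
```

For example, percentage_matches("XXALPIIXXALPHA", "ALPHA", 60) reports the exact match at position 10 first, followed by the 60 per cent match at position 3.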

2.3 Searching for sequences using a score matrix

A score matrix gives a score for the alignment of each possible pair of sequence symbols. This method is more sensitive than the simple percentage match search. The default matrix MDM78 used by this program is shown in figure 12.2.

1. Select "Find matches using a score matrix".

2. Accept "Type in strings". The alternative allows the string to be extracted from a named file.

3. Reject "Keep picture". This will cause the graphics window to be cleared. The alternative leaves it unchanged.

4. Define "String". Type in the search string. When the program cycles round to this point again the previous string will be offered as a default. The program displays the minimum and maximum possible scores for the string.

5. Define "Score". The search is performed, the results are presented graphically, the number of matches displayed, and the scores and positions of the top 10 matches displayed.

6. Define the number of matches to "Display". For the number of matches chosen the program will display the search string and matching sequence written one above the other with matching characters indicated by asterisk symbols. The program now cycles round to step 3. An example run is shown in figure 12.3.

C S T P A G N D E Q B Z H R K M I L V F Y W - X ?

C 22 10 8 7 8 7 6 5 5 5 5 5 7 6 5 5 8 4 8 6 10 2 10 10 10 10

S 10 12 11 11 11 11 11 10 10 9 10 10 9 10 10 8 9 7 9 7 7 8 10 10 10 10

T 8 11 13 10 11 10 10 10 10 9 10 10 9 9 10 9 10 8 10 7 7 5 10 10 10 10

P 7 11 10 16 11 9 9 9 9 10 9 10 10 10 9 8 8 7 9 5 5 4 10 10 10 10

A 8 11 11 11 12 11 10 10 10 10 10 10 9 8 9 9 9 8 10 6 7 4 10 10 10 10

G 7 11 10 9 11 15 10 11 10 9 10 10 8 7 8 7 7 6 9 5 5 3 10 10 10 10

N 6 11 10 9 10 10 12 12 11 11 12 11 12 10 11 8 8 7 8 6 8 6 10 10 10 10

D 5 10 10 9 10 11 12 14 13 12 13 12 11 9 10 7 8 6 8 4 6 3 10 10 10 10

E 5 10 10 9 10 10 11 13 14 12 12 13 11 9 10 8 8 7 8 5 6 3 10 10 10 10

Q 5 9 9 10 10 9 11 12 12 14 11 13 13 11 11 9 8 8 8 5 6 5 10 10 10 10

B 5 10 10 9 10 10 12 13 12 11 13 11 11 10 10 8 8 6 8 5 7 4 10 10 10 10

Z 5 10 10 10 10 10 11 12 13 13 11 14 12 10 10 8 8 8 8 5 6 4 10 10 10 10

H 7 9 9 10 9 8 12 11 11 13 11 12 16 12 10 8 8 8 8 8 10 7 10 10 10 10

R 6 10 9 10 8 7 10 9 9 11 10 10 12 16 13 10 8 7 8 6 6 12 10 10 10 10

K 5 10 10 9 9 8 11 10 10 11 10 10 10 13 15 10 8 7 8 5 6 7 10 10 10 10

M 5 8 9 8 9 7 8 7 8 9 8 8 8 10 10 16 12 14 12 10 8 6 10 10 10 10

I 8 9 10 8 9 7 8 8 8 8 8 8 8 8 8 12 15 12 14 11 9 5 10 10 10 10

L 4 7 8 7 8 6 7 6 7 8 6 8 8 7 7 14 12 16 12 12 9 8 10 10 10 10

V 8 9 10 9 10 9 8 8 8 8 8 8 8 8 8 12 14 12 14 9 8 4 10 10 10 10

F 6 7 7 5 6 5 6 4 5 5 5 5 8 6 5 10 11 12 9 19 17 10 10 10 10 10

Y 10 7 7 5 7 5 8 6 6 6 7 6 10 6 6 8 9 9 8 17 20 10 10 10 10 10

W 2 8 5 4 4 3 6 3 3 5 4 4 7 12 7 6 5 8 4 10 10 27 10 10 10 10

- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

Figure 12.2 The amino acid score matrix MDM78.
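How the matrix is used to score a search string can be sketched in Python. To keep the example short only the rows of figure 12.2 needed for the string ALPHA are copied; the function names are hypothetical and the sketch is not the program's own code. The minimum and maximum possible scores are obtained by summing the smallest and largest entry in the row for each character of the string.

```python
# Column order of figure 12.2; the final unlabelled column is a space.
SYMBOLS = "CSTPAGNDEQBZHRKMILVFYW-X? "

# Rows copied from figure 12.2 for the residues of the string ALPHA.
ROWS = {
    "P": "7 11 10 16 11 9 9 9 9 10 9 10 10 10 9 8 8 7 9 5 5 4 10 10 10 10",
    "A": "8 11 11 11 12 11 10 10 10 10 10 10 9 8 9 9 9 8 10 6 7 4 10 10 10 10",
    "H": "7 9 9 10 9 8 12 11 11 13 11 12 16 12 10 8 8 8 8 8 10 7 10 10 10 10",
    "L": "4 7 8 7 8 6 7 6 7 8 6 8 8 7 7 14 12 16 12 12 9 8 10 10 10 10",
}
SCORES = {
    residue: dict(zip(SYMBOLS, (int(v) for v in row.split())))
    for residue, row in ROWS.items()
}


def score_range(string):
    """Minimum and maximum possible scores for a search string."""
    low = sum(min(SCORES[c].values()) for c in string)
    high = sum(max(SCORES[c].values()) for c in string)
    return low, high


def segment_score(segment, string):
    """Sum of the matrix scores for each aligned pair of characters."""
    return sum(SCORES[s][c] for c, s in zip(segment, string))


def search(sequence, string, cutoff):
    """Return (score, position) pairs scoring at least cutoff, best first."""
    hits = []
    for start in range(len(sequence) - len(string) + 1):
        score = segment_score(sequence[start:start + len(string)], string)
        if score >= cutoff:
            hits.append((score, start + 1))
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))
```

For the string ALPHA this gives a range of 23 to 72, and the segments PLDHD, ALANT, QLDHG, SLPGN and ALPII score 62, 62, 62, 61 and 61, in agreement with the run shown in figure 12.3.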

2.4 Using weight matrices for searching protein sequences

A weight matrix is the most sensitive way of defining a motif. It is a table of values that gives scores for each amino acid type in each position along a motif. For a motif of length 8 amino acids the weight matrix would be a table 8 positions long and, allowing for 26 amino acid symbols, 26 deep. The simplest way of choosing the values for the table is to take an alignment of all known examples of the motif and to count the frequency of occurrence of each amino acid type at each position. These frequencies can be used as the table of weights. When the table is used to search a new sequence the program calculates a score for each position along the sequence by adding or multiplying (see notes) the relevant values in the table. All positions that exceed some cutoff score are reported as matching the original set of motifs.

How can we select a suitable cutoff score? The simplest way is to apply the weight matrix to all the known occurrences of the motif - i.e. the set of sequence segments used to create the table - and to see what scores they achieve. The cutoff can be selected accordingly. For convenience the weight matrix is stored as a file along with its cutoff score, a title that is displayed when the file is read, and a few other values needed by the program. A routine for creating weight matrix files from sets of aligned sequences is included in the program. When a search using the weight matrix is performed the program will either list the matching sequence segments or plot their positions as for the other motif search methods.

Find matches using a score matrix

? Keep picture (y/n) (y) =

? String=ALPHA

Minimum score= 23 Maximum score= 72

? Score (23-72) (72) =60

For score 60 the number of matches= 5

Scores 62 62 62 61 61

Positions 120 217 420 54 326

? Display (0-5) (0) =

120

PLDHD

 * *

ALPHA

1

217

ALANT

**

ALPHA

1

420

QLDHG

 * *

ALPHA

1

54

SLPGN

 **

ALPHA

1

326

ALPII

***

ALPHA

1

? Keep picture (y/n) (y) =

Default String=ALPHA

? String=!

Figure 12.3 An example of the listed output from "Search using a score matrix".

2.4.1 Creating a weight matrix file from a set of aligned sequences
1. Select "Motif search using weight matrix".

2. Select "Make weight matrix".

3. Define "Name of aligned sequences file". We assume the file of aligned sequences has already been created (see note 5). The program reads and displays the contents of the file numbering each sequence as it goes. Then it displays the length of the longest sequence.

4. Accept "Sum logs of weights". The alternative is to sum the weights when calculating scores (see note 6).

5. Accept "Use all motif positions". The alternative allows the user to define a "mask" which identifies positions within the motif that should be ignored when the matrix is created (see note 7). The program now calculates the weights and applies them in turn to each of the sequences in the file. The number and score for each sequence is displayed, followed by the top, bottom and mean scores and the standard deviation. In addition the mean plus and minus 3 standard deviations is displayed.

6. Define "Cutoff score". The default is the mean minus 3 standard deviations, but users may, for example, decide to use the lowest score obtained by the sequences in the file.

7. Define "Top score for scaling plots". This parameter is used by the graphics output routine when scaling the plots. Its value will influence the height of lines plotted to represent matches.

8. Define "Position to identify". When a search is performed it is not always appropriate to report the position of a match relative to the leftmost amino acid in the motif. For example when performing a helix-turn-helix motif search we may want to know the position of the well conserved glycine rather than the position of the first amino acid in the matrix. The "Position to identify" allows the user to define which amino acid is marked. The amino acids in the table are numbered 1,2,3 and so on.

9. Define a "Title". This is a title that will be displayed when the matrix file is read prior to performing a search. It is limited to 60 characters.

10. Define "Name for new weight matrix file". Give a name for the weight matrix file. See the example run in figure 12.4.

Motif search using weight matrix

Select operation

X 1 Use weight matrix

2 Make weight matrix

3 Rescale weight matrix

? Selection (1-3) (1) =2

? Name of aligned sequences file=atpbinding.seq

1 GETLGIVGESGSGKSQSLR

2 GESLGVVGESGGGKSTFAR OppF

3 GDVISIDGSSGSGKSTFLR HisP

4 GEFVVFVGPSGGGKSTLLR MalK E. coli

5 NQVTAFIGPSGGGKSTLLR PstB

6 GRVMALVGENGAGKSTMMK RbsA(N)

7 GEVIGIVGRSGSGKSTLTK HlyB

8 GECFGLLGPNGAGKSTITR NodI R. leguminosarum

9 GEMAFLTGHSGAGKSTLLK FtsE E. coli

10 GQRELIIGDRQTGKTALAI ATPase

11 GGKVGLFGGAGVGKTVNMM ATPase

12 GRIVEIYGPESSGKTTLTL RecA

13 RSNLLVLAGAGSGKTRVLV UvrD

14 GGKIGLFGGAGVGKTVGIM ATPase Bovine

15 SKIIFVVGGPGSGKGTQCE Adenylate Kinase Rabbit

16 NQSILITGESGAGKTVNTK Myosin Rabbit

17 HVNVGTIGHVDHGKTTLTA EF-Tu E. coli

18 YRNIGISAHIDAGKTTERI EF-G E. coli

19 EYKLVVVGARGVGKSALTI v-ras (HARVEY)

20 EYKLVVVGASGVGKSALTI v-ras (KIRSTEN)

21 EYKLVVVGAVGVGKSALTI pEJ BLADDER CARCINOMA TRANSFORMING

22 EYKLVVVGAGGVGKSALTI pEJ BLADDER CARCINOMA CELLULAR

Length of motif 19

? Sum logs of weights (y/n) (y) =

? Use all motif positions (y/n) (y) =

Applying weights to input sequences

1 -36.651 GETLGIVGESGSGKSQSLR

2 -35.780 GESLGVVGESGGGKSTFAR

3 -38.180 GDVISIDGSSGSGKSTFLR

4 -35.403 GEFVVFVGPSGGGKSTLLR

5 -39.039 NQVTAFIGPSGGGKSTLLR

6 -40.653 GRVMALVGENGAGKSTMMK

7 -34.017 GEVIGIVGRSGSGKSTLTK

8 -37.454 GECFGLLGPNGAGKSTITR

9 -36.474 GEMAFLTGHSGAGKSTLLK

10 -43.431 GQRELIIGDRQTGKTALAI

11 -40.210 GGKVGLFGGAGVGKTVNMM

12 -40.720 GRIVEIYGPESSGKTTLTL

13 -45.143 RSNLLVLAGAGSGKTRVLV

14 -40.684 GGKIGLFGGAGVGKTVGIM

15 -45.197 SKIIFVVGGPGSGKGTQCE

16 -39.098 NQSILITGESGAGKTVNTK

17 -43.832 HVNVGTIGHVDHGKTTLTA

18 -44.817 YRNIGISAHIDAGKTTERI

19 -36.305 EYKLVVVGARGVGKSALTI

20 -35.101 EYKLVVVGASGVGKSALTI

21 -36.305 EYKLVVVGAVGVGKSALTI

22 -36.711 EYKLVVVGAGGVGKSALTI

Top score -34.017 Bottom score -45.197

Mean -39.146 Standard deviation 3.441

Mean minus 3.sd -49.470 Mean plus 3.sd -28.822

? Cutoff score (-999.00-9999.00) (-49.47) =

? Top score for scaling plots (-49.47-999.00) (-28.82) =

? Position to identify (0-19) (1) =13

? Title=ATP binding motif

? Name for new weight matrix file=atpbinding.wts

Figure 12.4 An example run of the creation of a weight matrix from a set of aligned sequences.
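The steps carried out when a weight matrix is made can be sketched in Python. This is a minimal illustration and will not necessarily reproduce the scores of figure 12.4: the function names are hypothetical, frequencies are expressed as fractions of the number of sequences before taking natural logs, and the sample standard deviation is used. All three choices are assumptions.

```python
import math
import statistics


def make_weight_matrix(aligned):
    """Count each amino acid type at each position of the aligned motifs."""
    length = max(len(sequence) for sequence in aligned)
    matrix = [dict() for _ in range(length)]
    for sequence in aligned:
        for position, residue in enumerate(sequence):
            matrix[position][residue] = matrix[position].get(residue, 0) + 1
    return matrix


def log_score(matrix, segment, n_sequences):
    """Sum of logs of the frequencies (multiplication of the weights).

    Every residue of the segment must have been counted at its position.
    """
    return sum(
        math.log(matrix[position][residue] / n_sequences)
        for position, residue in enumerate(segment)
    )


def suggest_cutoffs(aligned):
    """Scores of the input sequences, mean minus 3 sd and mean plus 3 sd."""
    matrix = make_weight_matrix(aligned)
    scores = [log_score(matrix, s, len(aligned)) for s in aligned]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return scores, mean - 3 * sd, mean + 3 * sd
```

The second value returned corresponds to the default offered for "Cutoff score" and the third to the default for "Top score for scaling plots".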

2.4.2 Searching using a weight matrix
Once a weight matrix has been stored in a file it can be used to search any sequence. Results can be displayed graphically or the matching sequence segments can be listed out with their scores.

1. Select "Motif search using weight matrix".

2. Select "Use weight matrix".

3. Define "Motif weight matrix file". The name of the file containing the weight matrix. The program reads the file and displays its title.

4. Accept "Use frequencies as weights". The alternative will use the weight matrix file as a definition of a "Membership of set" motif (see note 10).

5. Define "Cutoff score". The default will be the value set when the weight matrix file was created. If the score is negative the program will calculate sums of logs of frequencies, otherwise it will add frequencies.

6. Accept "Plot results". Alternatively they will be listed.

The results will appear.
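The search itself can be sketched in Python as below. The matrix is represented as one dictionary of frequencies per motif position; this representation, the function name and the decision to reject any segment containing a residue of zero frequency when logs are summed are assumptions. The rule that a negative cutoff selects sums of logs is taken from step 5, and the membership of a set scoring is the alternative described in step 4.

```python
import math


def search_weight_matrix(sequence, matrix, cutoff, identify=1, membership=False):
    """Return (position, score) pairs for segments scoring at least cutoff.

    matrix is a list with one {residue: frequency} dictionary per motif
    position. The position reported is that of motif residue number
    identify, counted from 1 in the sequence.
    """
    length = len(matrix)
    hits = []
    for start in range(len(sequence) - length + 1):
        segment = sequence[start:start + length]
        weights = [matrix[i].get(residue, 0.0) for i, residue in enumerate(segment)]
        if membership:
            # Any non-zero entry counts as a match and scores 1.
            score = sum(1 for weight in weights if weight > 0)
        elif cutoff < 0:
            if min(weights) <= 0:
                continue  # log of zero is undefined; treated as no match
            score = sum(math.log(weight) for weight in weights)
        else:
            score = sum(weights)
        if score >= cutoff:
            hits.append((start + identify, score))
    return hits
```

Setting identify to 13 for the matrix of figure 12.4, for instance, would report the position of the thirteenth motif residue rather than the first.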

3. Notes

1. The files containing the definitions of peptides that can be searched for by the exact match search routine have the following format. Each name is followed by a /, then each of its peptide sequences is followed by a /. The last peptide sequence for each name is followed by //. For example a file might contain the following.

Acidic/D/E//

Basic/R/K/H//

Glyco/N-S/N-T//

Users could then search for these named sets of sequences. Note that the symbol - matches any amino acid.

2. To search for a subset of the names in a file employed by the exact match routine the user should reject "Search for all names" and the program will ask for the names wanted and extract their sequences from the file. Alternatively, if a user was always using the same subset, then a file containing only those names could be created. This file would then be selected as "Personal file" for "Input source".

3. The exact match routine also allows names and their sequences to be entered on the keyboard. This is selected as "Keyboard" for "Input source", and the program will prompt for names and their sequences. In this way the routine can be used to search for exact matches to any short sequence.

4. For this program a motif is a short segment of sequence of fixed length. More complex structures termed "patterns" which we define as sets of motifs separated by varying gaps, are covered in another chapter. The current chapter should be read before the chapter on patterns.

5. The files of aligned sequences used to make weight matrices have the following format. Each sequence should be on a separate line. The sequence should start in column 2 and is terminated by a new line or a space. Anything after the space is treated as a comment. The files can be created by previous searches or using an editor.

6. The frequencies in the weight matrix can be used in two ways to calculate scores for sequences. Some users prefer to add the frequencies to give a total score, and others to multiply them by summing their logs. If we regard the frequencies as probabilities then multiplication seems the correct procedure. The user chooses which method will be used when the weight matrix is created; however the choice can be overridden when the matrix is used. If multiplication is selected then all results will be presented as sums of logs.

7. Masking the weight matrix is particularly useful in cases where a limited number of examples of a motif are available, or when the motif may have several components. In the first case the limited number of examples may make the matrix unrepresentative of the motif because the amino acids in the unconserved positions may bias the results of searches. We stated that a motif might have several components: for example it might have both structural and specificity components. We may want to separate out the two parts and again masking provides such a facility.

8. The weight matrix handling routine contains a further option "Rescale weight matrix". If the user has edited a weight matrix to change the frequency values this provides a way of selecting a new cutoff score. It allows users to read in a set of aligned sequences and a weight matrix and to apply the matrix to the set of sequences to see the range of scores achieved. A new weight matrix file containing the selected cutoff score is written to disk.

9. The program contains no hardwired motifs as we expect most sites that use the programs to accumulate their own libraries of motifs and patterns, and to use the PROSITE library, both of which users can employ by simply knowing the names of the corresponding files.

10. The weight matrix search can also be used as a "Membership of a set" search. This means that at each position in the motif, any amino acid type that is non-zero in the weight matrix is counted as a match and scores a value 1. See the chapter on searching protein sequences for patterns.
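The file format described in note 1 is simple enough to read with a few lines of Python. The sketch below parses such a file and finds exact matches, treating the symbol - as matching any amino acid. The function names are hypothetical and the code is an illustration of the format, not the routine used by PIP.

```python
def parse_names_file(text):
    """Return {name: [peptide, ...]} from records of the form Name/pep/pep//."""
    entries = {}
    for record in text.split("//"):
        fields = [field.strip() for field in record.split("/") if field.strip()]
        if len(fields) >= 2:
            entries[fields[0]] = fields[1:]
    return entries


def find_exact(sequence, peptide):
    """Return the 1-based positions at which peptide matches the sequence."""
    positions = []
    for start in range(len(sequence) - len(peptide) + 1):
        segment = sequence[start:start + len(peptide)]
        if all(p == "-" or p == s for p, s in zip(peptide, segment)):
            positions.append(start + 1)
    return positions
```

For the example file of note 1, parse_names_file returns the peptides N-S and N-T under the name Glyco, and each can then be passed to find_exact in turn.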

4. References

1. Staden, R. 1988. Methods to define and locate patterns of motifs in sequences. CABIOS 4(1):53-60.

2. Staden, R. 1990. Searching for patterns in protein and nucleic acid sequences. (in) Methods in Enzymology R.F. Doolittle (ed.), 183:193-211 (Academic Press, New York).

13. Using Patterns to Analyse Protein Sequences

Table of contents

1. Introduction

1.1 Introduction to the PROSITE motif library

2. Methods

2.1 Creating a pattern file containing a weight matrix motif and a membership of a set motif.

2.2 Searching a sequence using a pattern file

2.3 Comparing a sequence against a library of patterns including PROSITE

2.4 Searching libraries for patterns

2.5 Preparing the PROSITE motif library for use by the programs

3. Notes

4. References

1. Introduction

Here we describe one of the most powerful facilities provided by the program PIP: the ability to define and search sequences or libraries of sequences for complex patterns of motifs. In another chapter we give details of searching for individual motifs but here we show how to create individual patterns and libraries of patterns and to use them to search sequences. Once a pattern has been defined and stored in a file it can be used to search any sequence. In addition if users want to routinely screen sequences against libraries of patterns this can be achieved by use of files of file names. For example, the program can use the PROSITE protein motif library. The program can produce several alternative forms of output. It will display the segment of sequence matching each individual motif in the pattern, display all the sequence between and including the two outermost motifs, produce a description of the match in the form of a SWISSPROT feature table, or draw a simple graphical plot.

Towards the end of the chapter we describe how a related program PIPL is used to search libraries of sequences to find patterns. This program can produce alignments of sequence families.

Patterns are defined as sets of motifs with variable spacing. Each motif in a pattern can be defined using any of several methods, and their positions relative to one another are defined in terms of minimum and maximum separations. In addition, by the use of logical operators, each motif can be declared to be essential (the AND operator), optional (the OR operator), or forbidden (the NOT operator). The following methods (termed "classes" by the program) for defining motifs are provided: 1) exact match to a short sequence; 2) percentage match to a short sequence; 3) match to a short sequence using a score matrix and cutoff score; 4) match to a weight matrix; 5) direct repeat; 6) membership of a set.

The motifs in a pattern are numbered sequentially and motif spacing is defined in the following way. When a new motif is added to a pattern the user specifies the "Reference motif" by its number and then a "Relative start position". The "Relative start position" is defined by taking the first residue of the "Reference motif" as position 1, the next as 2, and so on. Then the user defines the allowed variation in the spacing by specifying the "Number of extra positions". Notice that the position of a motif can be defined relative to any other motif, and that a negative "Relative start position" declares the motif to be to the left of its "Reference motif".
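The arithmetic of this spacing scheme can be sketched in Python. The function name is hypothetical, and two points are assumptions rather than statements from the text: the extra positions are taken to extend the range to the right of the "Relative start position", and only positive relative positions are handled because the numbering used for negative values is not spelled out here.

```python
def allowed_starts(reference_start, relative_start, extra_positions):
    """Return the (first, last) sequence positions at which a motif may start.

    reference_start is the sequence position of the first residue of the
    reference motif; that residue is relative position 1.
    """
    if relative_start < 1:
        raise ValueError("only positive relative start positions are handled")
    first = reference_start + relative_start - 1
    return first, first + extra_positions
```

For example, if the reference motif is found at position 10, a relative start position of 5 with 3 extra positions allows the new motif to start anywhere from position 14 to position 17.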

The probability of finding each individual motif in the current sequence, the product of the probabilities for all the motifs in a pattern ("Probability of finding pattern"), and the "Expected number of matches" are calculated and displayed by the program. In addition to the cutoffs used for the individual motifs, users can apply two pattern cutoffs: "Maximum pattern probability" and "Minimum pattern score".

Below we describe: how to create a pattern; how to use a pattern file to search a sequence; how to use a "File of pattern file names" to search a sequence for a whole library of patterns; how to use a pattern file to search a whole library of sequences; how to reformat the PROSITE motif library into a form compatible with these search programs; how to browse through the PROSITE library. To describe how to create a pattern file we first show all the steps to make one containing two motifs, and then, to save space, the parts specific to the individual motif types are sketched in the notes section.

1.1 Introduction to the PROSITE motif library

A very useful library of protein motifs (in our terminology, because they include variable gaps, many would be called patterns) is available from Amos Bairoch, Departement de Biochimie Medicale, University of Geneva. It is also contained on the EMBL CDROM. Currently it contains over 800 patterns/motifs and arrives on cdrom in two files: a .DAT file and a .DOC file. The cdrom also contains EMBL CDROM style indexes for both the .DAT file and the .DOC file. There is also a user documentation file PROSITE.USR. Here we outline the library structure and what is required to prepare the PROSITE library for use by our programs.

We can use the library in two ways. Firstly we may browse through either the .DAT or the .DOC files. The .DAT file contains the pattern definitions and information about which SWISSPROT entries they match. The .DOC file contains a wealth of textual information and references about the patterns. Secondly we may want to search sequences for individual patterns in the library, or for all the patterns in the library.

Browsing through the files is greatly helped by the indexes supplied on the cdrom. At present there is a full text index for each file and so we can use the sequence library index searching routines described in chapter 3. We give an example later in this section.

A typical entry in the .DAT file is shown in figure 13.1.

Each entry has an accession number (in figure 13.1 PS00197), a pattern definition (in figure 13.1 C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C) and a documentation file cross reference (in figure 13.1 PDOC00175). This pattern means: C, gap of 1 or 2, any of STA, gap of 2, C, any of STA, not P, C.
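The meaning of this notation can be made concrete by translating it into a regular expression. This is a different technique from the pattern files and "membership of a set" weight matrices used by our programs, and is given only to illustrate how the notation reads. The Python sketch below handles just the elements that occur in this example (single residues, x gaps with counts, [ ] sets and { } exclusions); the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        repeat = ""
        if element.endswith(")"):
            # x(2) or x(1,2): a repeat count or range follows the element.
            element, counts = element[:-1].split("(")
            repeat = "{" + counts + "}"
        if element == "x":
            core = "."  # any amino acid
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any amino acid except these
        else:
            core = element  # a single residue or a [set]
        parts.append(core + repeat)
    return "".join(parts)


def find_pattern(sequence, pattern):
    """Return (position, matching segment) pairs; positions are 1-based."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]
```

The pattern of figure 13.1 becomes C.{1,2}[STA].{2}C[STA][^P]C. Note that finditer reports only non-overlapping matches.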

We need to convert all of these patterns into our pattern definitions (as membership of a set, with the appropriate gap ranges) and write each into a separate pattern file with corresponding "membership of a set" weight matrices. After the conversion each pattern file is named accession_number.pat (here PS00197.PAT). The corresponding weight matrix files are accession_number.wtsa, accession_number.wtsb, etc for however many are needed (here PS00197.WTSA and PS00197.WTSB): two are needed because of the variable gap. We also create a file of file names for all the patterns in the library.

Note that the files staden.login and staden.profile define the environment variable PROSITENAMES. This should be defined to be the name used for the PROSITE file of file names. To use the complete PROSITE library from program PIP, users select "pattern searcher" and choose the option "use file of pattern file names", and give the file name PROSITENAMES. For any matches found, the accession number and pattern title will be displayed. A further environment variable PROSITEP is used to give the path to the directory containing the pattern files. This means that any individual PROSITE pattern (say PS00197.PAT) can easily be used (say PROSITEP/PS00197.PAT).

ID 2FE2S_FERREDOXIN; PATTERN.

AC PS00197;

DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).

DE 2Fe-2S ferredoxins, iron-sulfur binding region signature.

PA C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C.

NR /RELEASE=14,15409;

NR /TOTAL=69(69); /POSITIVE=63(63); /UNKNOWN=0(0); /FALSE_POS=6(6);

NR /FALSE_NEG=5(5);

CC /TAXO-RANGE=A?EP?; /MAX-REPEAT=1;

CC /SITE=1,iron_sulfur; /SITE=5,iron_sulfur; /SITE=8,iron_sulfur;

DR P15788, FER$APHHA , T; P00250, FER$APHSA , T; P00223, FER$ARCLA , T;

DR P00227, FER$BRANA , T; P07838, FER$BRYMA , T; P13106, FER$BUMFI , T;

DR P00247, FER$CHLFR , T; P07839, FER$CHLRE , T; P00222, FER$COLES , T;

DO PDOC00175;

//

Figure 13.1 A typical entry from the PROSITE library.

In order to make the PROSITE library usable by the search programs it is only necessary to run a program named SPLITP3. SPLITP3 creates a separate pattern file and weight matrix files for each PROSITE entry from the file PROSITE.DAT. Pattern files are named PSentry_number.PAT, weight matrix files PSentry_number.WTSA, PSentry_number.WTSB, etc. The pattern title is the one line description of the motif. SPLITP3 also creates a file of file names. Notice that it will ask for a path name so that the path can be included in the file of file names. This is the path to the directory in which the pattern files are stored. The distribution tape includes all the necessary files in $STADTABL/prosite.
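The fields of an entry that matter for this conversion can be picked out with a short Python sketch. It is not a description of SPLITP3: the function names and the directory used in the example are hypothetical, and only the AC, DE, PA and DO lines shown in figure 13.1 are read.

```python
def parse_prosite_entry(text):
    """Return the accession, description, pattern and documentation reference."""
    fields = {"AC": "", "DE": "", "PA": "", "DO": ""}
    for line in text.splitlines():
        code, _, value = line.strip().partition(" ")
        if code in fields:
            fields[code] += value.strip()  # PA may continue over several lines
    return {
        "accession": fields["AC"].rstrip(";"),
        "description": fields["DE"],
        "pattern": fields["PA"].rstrip("."),
        "documentation": fields["DO"].rstrip(";"),
    }


def pattern_file_name(accession, directory):
    """Name of the pattern file for an accession number, e.g. PS00197.PAT."""
    return directory.rstrip("/") + "/" + accession + ".PAT"
```

The description line supplies the pattern title, and the accession number supplies the stem of the pattern and weight matrix file names.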

1.1.1 Browsing through PROSITE.

Here we give an example of using the indexes supplied on the EMBL CDROM to browse through the PROSITE library. It shows a text search of the documentation file (for the word "zinc") followed by a listing of entry PDOC00028. This entry gives PS00028 as the pattern accession number, which is then searched for by the pattern search routines.

Note that access is through the "Read a new sequence" option of the programs because they treat the PROSITE library as a sequence library! PROSITE entrynames can be longer than allowed by the EMBL CDROM style indexes so the entryname index contains the accession number instead. Hence the accession number appears twice when "hits" are listed. The listing also has a field for sequence length which appears as 0! The example shows the level of information available in the library, which, combined with the instantaneous text searching, makes up for the current shortcomings of the interface. The example shows that the user lists accession number PDOC00028 which is for entryname ZINC_FINGER_C2H2. The second line of the entry states that the corresponding pattern has accession number PS00028 and this is how the user knows which pattern to search for.

When the user selects pattern searching for PS00028 the pattern number is prefixed with PROSITEP to define the correct path and followed by .PAT to give the pattern file name.

Select sequence source

X 1 Personal file

2 Sequence library

? Selection (1-2) (1) =2

Select a library

1 EMBL 36 nucleotide library Nov 93

X 2 SWISSPROT 26 protein library Nov 93

3 PIR 37 protein library June 93

4 NRL3D 59 From Brookhaven protein library March 92

5 prosite data

6 prosite documentation

? Selection (1-6) (2) =6

Library is in prosite documentation format

Select a task

X 1 Get annotation

2 Search indexes

? Selection (1-2) (1) =2

Select a task

X 1 Text AND search

2 Text OR search

? Selection (1-2) (1) =

Search for Text

? Text=zinc

ZINC hits 36

Current number of hits on list is 32

Select a task

X 1 Text AND search

2 Text OR search

3 Delete current list

4 Display current list

? Selection (1-4) (1) =4

PDOC00028 PDOC00028 0 Zinc finger, C2H2 type, domain

PDOC00031 PDOC00031 0 Nuclear hormones receptors DNA-binding region signature

PDOC00043 PDOC00043 0 Bacterial regulatory proteins, lysR family signature

PDOC00058 PDOC00058 0 Zinc-containing alcohol dehydrogenases signature

PDOC00059 PDOC00059 0 Iron-containing alcohol dehydrogenases signature

PDOC00060 PDOC00060 0 Short-chain alcohol dehydrogenase family signature

PDOC00082 PDOC00082 0 Copper/Zinc superoxide dismutase signatures

PDOC00113 PDOC00113 0 Alkaline phosphatase active site

PDOC00123 PDOC00123 0 Zinc carboxypeptidases, zinc-binding regions signatures

PDOC00129 PDOC00129 0 Neutral zinc metallopeptidases, zinc-binding region

PDOC00146 PDOC00146 0 Eukaryotic-type carbonic anhydrases signature

PDOC00153 PDOC00153 0 Delta-aminolevulinic acid dehydratase active site

PDOC00180 PDOC00180 0 Class I metallothioneins signature

PDOC00275 PDOC00275 0 S-100/ICaBP type calcium binding protein signature

PDOC00293 PDOC00293 0 Hemolysin-type putative calcium-binding region

PDOC00300 PDOC00300 0 GATA-type zinc finger domain

PDOC00351 PDOC00351 0 Disintegrins signature

PDOC00357 PDOC00357 0 Prokaryotic zinc-dependent phospholipase C signature

PDOC00360 PDOC00360 0 Poly(ADP-ribose) polymerase zinc finger domain

PDOC00378 PDOC00378 0 Fungal Zn(2)-Cys(6) binuclear cluster domain

PDOC00379 PDOC00379 0 Phorbol esters / diacylglycerol binding domain

PDOC00382 PDOC00382 0 LIM domain

PDOC00383 PDOC00383 0 TFIIS cysteine-rich domain signature

PDOC00401 PDOC00401 0 Dihydroorotase signatures

PDOC00449 PDOC00449 0 Zinc finger, C3HC4 type, signature

PDOC00472 PDOC00472 0 Matrixins cysteine switch

PDOC00548 PDOC00548 0 Cytosol aminopeptidase signature

PDOC00586 PDOC00586 0 Prokaryotic-type carbonic anhydrases signatures

PDOC00599 PDOC00599 0 AP endonucleases family 2 signatures

PDOC00606 PDOC00606 0 Beta-lactamases class B signatures

Current number of hits on list is 32

Select a task

X 1 Text AND search

2 Text OR search

3 Delete current list

4 Display current list

? Selection (1-4) (1) =!

Select a task

X 1 Get annotation

2 Search indexes

? Selection (1-2) (1) =

Default Entry name=PDOC00606

? Entry name=PDOC00028

{PDOC00028}

{PS00028; ZINC_FINGER_C2H2}

{BEGIN}

**********************************

* Zinc finger, C2H2 type, domain *

**********************************

'Zinc finger' domains [1-5] are nucleic acid-binding protein structures first

identified in the Xenopus transcription factor TFIIIA. These domains have

since been found in numerous nucleic acid-binding proteins. A zinc finger

domain is composed of 25 to 30 amino acid residues. There are two cysteine or

histidine residues at both extremities of the domain, which are most probably

involved in the tetrahedral coordination of a zinc atom. It has been proposed

that such a domain interacts with about five nucleotides. A schematic

representation of a zinc finger domain is shown below:

                            x  x
                           x    x
                          x      x
                         x        x
                         x        x
                         x        x
                          C      H
                         x \    / x
                         x   Zn   x
                         x /    \ x
                          C      H
                 x x x x x        x x x x x

Two major classes of zinc fingers are characterized according to the number

and positions of the histidine and cysteine residues involved in the zinc

atom coordination. In the first class, called C2H2, the first pair of zinc

coordinating residues are cysteines, while the second pair are histidines.

Transcription factor TFIIIA is the prototype example for this class of zinc

fingers. A number of experimental reports have demonstrated the zinc-dependent

DNA or RNA binding property of some members of this class. The other class of

zinc fingers, called C4, groups together many different regulatory proteins

that happen to have several cysteines within a short stretch of sequence. The

steroid hormone receptors are an example of proteins belonging to this class.

Some of the proteins which are known to include C2H2-type zinc fingers are

listed below. We have indicated, between brackets, the number of zinc finger

regions found in each of these proteins; a '+' symbol indicates that only

partial sequence data is available and that additional finger domains may be

present.

- Saccharomyces cerevisiae: metallothionein expression activator ACE2 (3),

transcriptional activator ADR1 (2), transcriptional factor SWI5 (3).

- Aspergillus nidulans: developmental protein brlA (2).

- Drosophila: Cf2 (4+), ci-D (5), Disconnected (2), Glass (5), Hunchback (6),

Kruppel (5), Kruppel-H (4+), Odd-skipped (4), Ref(2)P (1), Snail (5),

Serependity locus beta (6), delta (7), h-1 (8), Suppressor of hairy wing

su(Hw) (12), Tramtrack (2).

- Xenopus: transcription factor TFIIIA (9), p43 from RNP particle (9), Xfin

(37 !!), Xsna (5), gastrula XlcGF5.1 to XlcGF71.1 (from 4+ to 11+), Oocyte

XlcOF2 to XlcOF22 (from 7 to 12).

- Mammalian: transcription factor Sp1 (3), Wilms'tumor protein (4), YY1 (4),

ZfX (13), ZfY (13), Zfp-35 (18), EGR1/Krox24 (3), EGR2/Krox20 (3), Evi-1

(10), GLI1 (5), GLI2 (4+), GLI3 (3+), KR1 (9+), KR2 (9), KR3 (15+), KR4

(14+), KR5 (11+), HF.10 (10), HF.12 (6+), REX-1 (4).

-Consensus pattern: C-x(2,4)-C-x(12)-H-x(3,5)-H

-Sequences known to belong to this class detected by the pattern: ALL.

-Other sequence(s) detected in SWISS-PROT: 25 other proteins.

-Note: generally, but not always, the residue in position +4 after the second

cysteine is an aromatic residue, and that in position +10 is a leucine.

-Note: in proteins that include many copies of the C2H2 zinc finger domain it

is not rare to find one or more incomplete copies of the domain (generally at

the extremity of the region(s) containing zinc fingers) or some degenerated

copies of the domain (which have either lost one or more of zinc-coordinating

residues, or which are interrupted by insertions or deletions). Our pattern

will not detect these incomplete or degenerate finger domains.

-Last update: June 1992 / Text revised.

[ 1] Klug A., Rhodes D.

Trends Biochem. Sci. 12:464-469(1987).

[ 2] Evans R.M., Hollenberg S.M.

Cell 52:1-3(1988).

[ 3] Payre F., Vincent A.

FEBS Lett. 234:245-250(1988).

[ 4] Miller J., Mc Lachlan A.D., Klug A.

EMBO J. 4:1609-1614(1985).

[ 5] Berg J.M.

Proc. Natl. Acad. Sci. U.S.A. 85:99-102(1988).

{END}

(Then user selects pattern searching for PS00028. The pattern number is prefixed with PROSITEP to define the correct path and followed by .PAT to give the pattern file name.)

Search for patterns of motifs

Pattern searcher

Select pattern definition mode

X 1 Use keyboard

2 Use pattern file

3 Use file of pattern file names

? Selection (1-3) (1) =2

? Pattern definition file=PROSITEP/PS00028.PAT

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Graphical

4 EMBL feature table

? Selection (1-4) (1) =2

Zinc finger, C2H2 type, domain.

Probability of score 1.0000 = 0.964E-01

Zinc finger, C2H2 type, domain.

Probability of score 2.0000 = 0.783E-02

Zinc finger, C2H2 type, domain.

Probability of score 1.0000 = 0.812E-01

Pattern description

Zinc finger, C2H2 type, domain.

Motif 1 named 00028 is of class 6

Which is membership of a set with score 1.000

Motif 2 named 00028 is of class 6

Which is membership of a set with score 2.000

It is anded with the previous motif.

Motif 3 named 00028 is of class 6

Which is membership of a set with score 1.000

It is anded with the previous motif.

Probability of finding pattern = 0.6136E-04

Expected number of matches = 0.1093E+00

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

Working

8

CSECGKCFIKSSELTVHQMTH

36

CSECGKCFASLSHLRVHQKIH

64

CSECGKCFLNRGSLVRHHRTH

92

CSECGKRFAASSDLRVHRRTH

120

CSECEKRFLNPWSLVRHYRTH

148

CSECGKCFARSSDLTVHRRRSH

177

CSECGKCFTSSSELTVHLRTH

Total matches found 7

Minimum and maximum observed scores 4.00 4.00

Figure 13.2 An example of browsing through PROSITE and performing a search for a PROSITE entry.
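The consensus pattern quoted in the PDOC00028 entry of figure 13.2 can also be read as a regular expression. As a rough illustration only, the following Python sketch translates the simpler elements of the PROSITE pattern syntax and checks the result against two of the matches listed in the figure. The function name prosite_to_regex is hypothetical, and this is not how our pattern search routines work: as the figure shows, they represent the entry as membership of a set motifs and also report probabilities.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE consensus pattern into a regular expression.

    Handles single residues, x (any residue), [ABC] (allowed set),
    {ABC} (forbidden set) and repeat counts such as x(12) or x(2,4).
    The terminus anchors < and > are not handled in this sketch.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        if low is None:
            parts.append(core)
        elif high is None:
            parts.append(core + "{" + low + "}")
        else:
            parts.append(core + "{" + low + "," + high + "}")
    return "".join(parts)

zinc_finger = re.compile(prosite_to_regex("C-x(2,4)-C-x(12)-H-x(3,5)-H"))

# Two of the matches listed in figure 13.2.
for fragment in ("CSECGKCFIKSSELTVHQMTH", "CSECGKCFARSSDLTVHRRRSH"):
    print(fragment, bool(zinc_finger.fullmatch(fragment)))
```

A regular expression of this kind gives only a yes or no answer; it provides no probability or expected number of matches.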

2. Methods

2.1 Creating a pattern file containing a weight matrix motif and a membership of a set motif.

1. Select "Pattern searcher".

2. Select "Pattern definition mode" as "Use keyboard".

3. Select "Results display mode" as "Inclusive". The alternatives are listed in the introduction.

4. Select "Motif definition mode" as "Weight matrix".

5. Define "Motif name". Each motif can be given an 8 character name.

6. Define "Weight matrix file name". Type in the name of the file containing the weight matrix. The program will display the probability of finding the motif.

7. Select "Motif definition mode" as "Membership of a set".

8. Define "Motif name".

9. Select "Logical operator" as "AND". The alternatives are "OR" and "NOT".

10. Select "Number of reference motif". At this stage the only choice is 1 and this is the default.

11. Define "Relative start position". The base position relative to the "Reference motif". See the introduction.

12. Define "Number of extra positions".

13. Select input mode as "Keyboard". The alternative is an existing file in the form of a weight matrix.

14. Define "String". Type in the sets of allowed residue types using the one letter code. See note 1.

15. Define the "Minimum matches". This is the number of positions within the motif that must match. The default is that all positions must match but users may want to allow some flexibility by giving a lower score.

The program now cycles round to step 7 and all subsequent passes round the loop to add further motifs to the pattern would differ only in the details for the different motif "classes".

16. Select "Pattern complete"

17. Accept "Save pattern in a file". The alternative does not save the pattern and so it can only be used once on the current sequence.

18. Define "Pattern definition file". Give a name for the new file.

19. Define "Pattern title". All patterns can have a 60 character title that can be displayed when the pattern file is read and the sequence searched.

20. Define "Weight matrix file name". Motifs of the "membership of a set" class are stored in the form of weight matrices, and so the program needs the user to define a file name.

21. Define "Title". Type in a title for the weight-matrix-like file. The title will be displayed when the file is read.

The program will now display a detailed textual description of the pattern, the "Probability of finding the pattern" and the "Expected number of matches" (see figure 13.3).

22. Define "Maximum pattern probability". Note that this is a maximum: any match with a greater probability of being found will be rejected. If no value is specified the search will be quicker (see notes).

Pattern searcher

Select pattern definition mode

X 1 Use keyboard

2 Use pattern file

3 Use file of pattern file names

? Selection (1-3) (1) =1

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Graphical

4 SWISSPROT feature table

? Selection (1-4) (1) =2

Select motif definition mode

X 1 Exact match

2 Percentage match

3 Cut-off score and score matrix

4 Cut-off score and weight matrix

5 Direct repeat

6 Membership of set

7 Pattern complete

? Selection (1-7) (1) =4

? Motif name=atp

? Weight matrix file name=atpbinding.wts

ATP binding

Probability of score -47.8010 = 0.302E-04

Select motif definition mode

1 Exact match

2 Percentage match

3 Cut-off score and score matrix

X 4 Cut-off score and weight matrix

5 Direct repeat

6 Membership of set

7 Pattern complete

? Selection (1-7) (4) =6

? Motif name=hydro

Select logical operator

X 1 And

2 Or

3 Not

? Selection (1-3) (1) =

? Number of reference motif (1-1) (1) =

? Relative start position (-1000-1000) (20) =22

? Number of extra positions (0-1000) (0) =5

Select input mode

X 1 Keyboard

2 File

? Selection (1-2) (1) =

Separate sets with commas

? String=ivl,ivl,,,rkhde

? Minimum matches (1.00-5.00) (3.00) =

Probability of score 3.000 = 0.145E-01

Select motif definition mode

1 Exact match

2 Percentage match

3 Cut-off score and score matrix

4 Cut-off score and weight matrix

5 Direct repeat

X 6 Membership of set

7 Pattern complete

? Selection (1-7) (6) =7

? Save pattern in a file (y/n) (y) =

? Pattern definition file=_paper.pat

? Pattern title=atpbinding plus

? Weight matrix file name=_hydro.wts

Weight matrix needs a title

? Title=hydrophobic and + spot

Pattern description

atpbinding plus

Motif 1 named atp is of class 4

Which is a match to a weight matrix with score -47.801

Motif 2 named hydro is of class 6

Which is membership of a set with score 3.000

It is anded with the previous motif.

Probability of finding pattern = 0.4368E-06

Expected number of matches = 0.1350E-02

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

162

GQRELIIGDRQTGKTALAIDAIINQR

Total matches found 1

Minimum and maximum observed scores -38.35 -38.35

Figure 13.3 The creation and use of a pattern containing a weight matrix motif and a membership of a set motif.

23. Define "Minimum pattern score". A minimum pattern score only makes sense if all the motifs in the pattern are defined with compatible scoring methods. For example membership of a set motifs and weight matrices using sums of logs are incompatible.

Searching will now commence and any matches displayed using the chosen method. In figure 13.3 we show a typical run in which a pattern containing a weight matrix and a membership of a set motif is created and stored on disk. Figure 13.4 shows the contents of the pattern file.

atpbinding plus

A4 atp Class

atpbinding.wts

A6 hydro Class

1 Relative motif

22 Relative start position

5 Number of extra positions

_hydro.wts

Figure 13.4 The pattern file created in the worked example shown in figure 13.3.

2.2 Searching a sequence using a pattern file

1. Select "Pattern searcher"

2. Select "Pattern definition mode" as "Use pattern file".

3. Select "Results display mode" as "Inclusive"

4. Define "Pattern definition file". Type the name of the file containing the pattern. The program will read the file then display its title, a detailed textual description of the pattern, the "Probability of finding the pattern", and the "Expected number of matches".

5. Define "Maximum pattern probability".

6. Define "Minimum pattern score".

Searching will now commence and any matches displayed using the chosen method. Figure 13.5 shows a typical run using a pattern file and output in the form of a SWISSPROT feature table. To use a PROSITE entry the accession number must be preceded by PROSITEP/ and followed by .PAT. For example PROSITEP/PS00034.PAT defines the pattern with accession number PS00034.

2.3 Comparing a sequence against a library of patterns including PROSITE

This mode of operation allows a sequence to be searched, in turn, for any number of patterns each stored in a separate pattern file. The names of the files containing the individual patterns must be stored in a simple text file. This file is called "a file of pattern file names" and its name is the only user input required to define the search. The file of file names could contain references to entries in the PROSITE motif library and also include the names of other patterns. To use the whole of the PROSITE library users should type PROSITENAMES for the name of the file of pattern file names.

1. Select "Pattern searcher".

2. Select "Pattern definition mode" as "Use file of pattern file names".

3. Select "Results display mode" as "Inclusive"

4. Define "File of pattern file names". Type the name of the file containing the list of pattern file names. The program will read the file and then, in turn, all the pattern files it names. Each of these patterns will be compared against the current sequence but only those that give matches will produce any output. The pattern title and each match will be displayed.

Pattern searcher

Select pattern definition mode

X 1 Use keyboard

2 Use pattern file

3 Use file of pattern file names

? Selection (1-3) (1) =2

? Pattern definition file=_paper.pat

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Graphical

4 SWISSPROT feature table

? Selection (1-4) (1) =4

ATP binding sequences

Probability of score -47.8010 = 0.302E-04

hydrophobic and + spot

Probability of score 3.0000 = 0.145E-01

Pattern description

atpbinding plus

Motif 1 named atp is of class 4

Which is a match to a weight matrix with score -47.801

Motif 2 named hydro is of class 6

Which is membership of a set with score 3.000

It is anded with the previous motif.

Probability of finding pattern = 0.4368E-06

Expected number of matches = 0.1350E-02

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

FT atp 162 187 Program

Total matches found 1

Minimum and maximum observed scores -38.35 -38.35

Figure 13.5 Worked example of using a pattern file to search a sequence, and writing the results in the form of a SWISSPROT feature table.

2.4 Searching libraries for patterns

The program PIPL can be used to search whole sequence libraries for patterns. Its use is similar to the pattern search routine described above, except that it does not have the facility for creating pattern files, so they must be created beforehand using PIP. In addition to its obvious application of finding new occurrences of patterns or checking on their frequency it is a useful way of obtaining sequence alignments. It can restrict its search to a list of named entries or can search all but those on a list of entries. It can restrict its output to showing the highest scoring match in each sequence, but by default it will show all matches.

Of its modes of output two require further description. The first "Padded sections" creates a new file for each match. The file will contain the sequence between and including the two outermost motifs in the pattern. It will be gapped to the furthest extent defined by the pattern, which means that if all the files were subsequently written one above the other all the motifs in the pattern would be exactly aligned, with the sections between them containing the requisite numbers of padding characters. The second such mode of output is called "Complete padded sequences". Here the user must know the maximum distance between the leftmost motif and the start of all the sequences that match. A trial run in which only the positions of matches are reported is usually required. The user gives this maximum distance to the program. The program then writes a new file containing the full length of all matching sequences, again maximally gapped (including their left ends) so that they would all align if written above one another. For both of these modes of output the files created are named "entryname" where "entryname" is the name given to the sequence in the sequence library. These modes are best used with the option "Report all matches" rejected, so that only the best match for each sequence is reported. The sequences can be lined up using the sequence assembly program SAP.
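The left-end padding used by "Complete padded sequences" can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the function name pad_left_ends, the use of "-" as the padding character and the toy sequences are all hypothetical, and the sketch pads only the left ends. It does not insert the gaps between motifs that PIPL also writes.

```python
def pad_left_ends(entries, max_distance, pad="-"):
    """Pad the left end of each sequence so that the leftmost motifs line up.

    entries      -- list of (entry name, sequence, start of leftmost motif),
                    with the motif start numbered from 1
    max_distance -- largest number of residues that precede the leftmost
                    motif in any of the matching sequences
    """
    padded = {}
    for name, sequence, motif_start in entries:
        distance = motif_start - 1          # residues before the motif
        if distance > max_distance:
            raise ValueError("max_distance is too small for " + name)
        padded[name] = pad * (max_distance - distance) + sequence
    return padded

# Toy data: the leftmost motif (TAY) starts at residue 3 in one entry
# and at residue 1 in the other.
example = pad_left_ends([("ENTRYA", "MKTAYG", 3), ("ENTRYB", "TAYGLL", 1)], 2)
for name in sorted(example):
    print(name, example[name])
```

The sketch raises an error when the distance supplied is too small, which is one reason a trial run reporting only the positions of matches is useful.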

The searches, which have recently been recoded, are very rapid. For example a search of the current SWISSPROT library for a pattern defining the globin family as 6 weight matrices with widely varying gaps, finds only globins and takes less than 4 minutes using a single processor on an Alliant FX2800. This time includes reading in the whole library as stored in EMBL CDROM format.

1. Select PIPL.

2. Define "Name for results file."

3. Select a library.

4. Select "Search whole library". The alternatives are "Search only a list of entries" and "Search all but a list of entries". The files containing the list of entries should contain one entry name per line, left justified.

5. Select "Results display mode" as "Inclusive". The alternatives include "Motif by motif", "Scores only", "Complete padded sequences" and "Padded sections".

6. Accept "Report all matches". The alternative only shows the best match for each sequence.

7. Define "Pattern definition file". The name of the file containing the pattern created using PIP.

The program displays a textual description of the pattern and the expected number of matches per 1000 residues assuming an average amino acid composition.

8. Define "Maximum pattern probability". The program will run much more quickly if none is given.

9. Define "Minimum pattern score". The search will start.

A typical run is shown in figure 13.6.

PIPL (Protein interpretation program (library)) V4.1 Jul 1991

Author: Rodger Staden

Searches protein libraries for patterns of motifs

? Name for results file=globin.res

Select a library

1 EMBL nucleotide library

X 2 SWISSPROT protein library

3 Personal file in PIR format

4 Personal file in FASTA format

? Selection (1-4) (2) =

Library is in EMBL format with indexes

Select a task

X 1 Search whole library

2 Search only a list of entries

3 Search all but a list of entries

? Selection (1-3) (1) =

Select results display mode

X 1 Motif by motif

2 Inclusive

3 Scores only

4 Complete padded sequences

5 Padded sections

? Selection (1-5) (1) =5

? (y/n) (y) Report all matches n

? Pattern definition file=globin.pat

globin 1

Probability of score -34.5300 = 0.197E-02

globin 2

Probability of score -44.6000 = 0.409E-02

globin 3

Probability of score -75.1000 = 0.293E-01

globin 4

Probability of score -36.1000 = 0.147E-01

globin 5

Probability of score -73.7000 = 0.375E-01

globin 6

Probability of score -55.9000 = 0.483E-01

Pattern description

Globin pattern file

Motif 1 named g1 is of class 4

Which is a match to a weight matrix with score -34.530

Motif 2 named g2 is of class 4

Which is a match to a weight matrix with score -44.600

and the N-terminal residue can take positions 17 to 22

relative to the N-terminal end of motif 1

It is anded with the previous motif.

Motif 3 named g3 is of class 4

Which is a match to a weight matrix with score -75.100

and the N-terminal residue can take positions 27 to 35

relative to the N-terminal end of motif 2

It is anded with the previous motif.

Motif 4 named g4 is of class 4

Which is a match to a weight matrix with score -36.100

and the N-terminal residue can take positions 29 to 53

relative to the N-terminal end of motif 3

It is anded with the previous motif.

Motif 5 named g5 is of class 4

Which is a match to a weight matrix with score -73.700

and the N-terminal residue can take positions 12 to 16

relative to the N-terminal end of motif 4

It is anded with the previous motif.

Motif 6 named g6 is of class 4

Which is a match to a weight matrix with score -55.900

and the N-terminal residue can take positions 29 to 33

relative to the N-terminal end of motif 5

It is anded with the previous motif.

Probability of finding pattern = 0.6273E-11

Expected number of matches per 1000 residues = 0.2119E-03

? Maximum pattern probability (0.00-1.00) (1.00) =

? Minimum pattern score (-9999.00-9999.00) (-9999.00) =

Figure 13.6 A typical run of PIPL using a pattern of 6 weight matrices to search the SWISSPROT library.

2.5 Preparing the PROSITE motif library for use by the programs

Only the program SPLITP3 is essential for preparing the PROSITE library for use by our programs. Change directory to the one in which the pattern files will be stored (in the distribution this is $STADTABL/prosite/pats).

1. Select SPLITP3.

2. Define "Prosite library file". Type the name of the file containing the prosite library (on the distribution ../indices/dat/prosite.dat).

3. Define "Name for file of pattern file names". This is the file of file names that users will employ to search the whole library. Environment variable PROSITENAMES is defined for this file name.

4. Define "Path name of motif directory". This is the full path name, including the final /, to the directory in which the converted library will be stored. Environment variable PROSITEP should be used.

3. Notes

1. The "exact match" motif class requires a consensus sequence. The "percentage match" motif class requires a consensus sequence and a cutoff score. The "score matrix" motif class uses the MDM78 matrix and requires a consensus sequence and a cutoff score. The "weight matrix" search only requires the name of the file containing the matrix. The "direct repeat" motif class requires a repeat length, the minimum and maximum gap between the two occurrences of the repeat, and a minimum score. The "membership of a set" motif class defines sets of residue types that are allowed at each position in the motif. When they are first entered into the pattern they are normally typed on the keyboard, but when they are stored in a file, they are written in the same format as a weight matrix. To enter them on the keyboard use the following format. Type the one letter codes for the set of residue types allowed at each position terminated by a comma (,). For positions where any residue type is allowed simply type an extra comma. For example VLI,FY,,,DE means any of Valine, Leucine or Isoleucine in the first position, either Phenylalanine or Tyrosine in the next position, anything in the next two positions, and Aspartic acid or Glutamic acid in the next. When the pattern is stored on the disk the program will request a name for the file and a title for the motif.
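The keyboard format for "membership of a set" motifs described in note 1 can be illustrated with a short sketch. The Python below is not the package's code. The function names are hypothetical, and it assumes that a position where any residue type is allowed adds nothing to the score, which appears consistent with the default of 3.00 offered for the five-position string in figure 13.3.

```python
def parse_sets(string):
    """Split a string such as VLI,FY,,,DE into one set per motif position.

    An empty field (an extra comma) gives an empty set, used here to
    mean that any residue type is allowed at that position.
    """
    return [frozenset(field.upper()) for field in string.split(",")]

def score_window(sets, window):
    """Count the restricted positions at which the window matches."""
    return sum(1 for allowed, residue in zip(sets, window.upper())
               if allowed and residue in allowed)

def find_set_motif(sets, sequence, minimum_matches):
    """Return (start, score) for each window scoring at least minimum_matches.

    Start positions are numbered from 1.
    """
    hits = []
    width = len(sets)
    for start in range(len(sequence) - width + 1):
        score = score_window(sets, sequence[start:start + width])
        if score >= minimum_matches:
            hits.append((start + 1, score))
    return hits

sets = parse_sets("VLI,FY,,,DE")
print(find_set_motif(sets, "GGVFAADGG", 3))
```

Lowering minimum_matches in the call corresponds to allowing some flexibility, as described for "Minimum matches" in the methods.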

2. The details of the probability calculations are outside the scope of this article. They are quite rapid and are essential both for assessing the statistical significance of any matches found and for allowing meaningful cutoffs to be applied to patterns. Obviously, in general, cutoff scores are inappropriate for patterns containing a mixture of motif classes.

3. The program calculates the "Probability of finding the pattern" and the "Expected number of matches". The first figure is actually the product of the individual motif probabilities but the latter figure is more useful because it takes into account the allowed variation in spacing between motifs and the length of the current sequence. In both cases the composition of the current sequence is also used so that different probabilities would be calculated for other sequences.
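The statement in note 3 that the "Probability of finding the pattern" is the product of the individual motif probabilities can be checked against the figures. The Python sketch below simply multiplies the motif probabilities printed in figures 13.2, 13.3 and 13.6; the function name is hypothetical. The products agree with the values reported by the program to within about one percent, the small differences presumably arising because the displayed motif probabilities are rounded. The sketch does not attempt the "Expected number of matches", which also depends on the allowed spacings and the sequence length.

```python
import math

def pattern_probability(motif_probabilities):
    """Product of the individual motif probabilities."""
    return math.prod(motif_probabilities)

# Motif probabilities as displayed in the figures, with the pattern
# probability reported by the program alongside.
examples = {
    "zinc finger (fig 13.2)": ([0.964e-1, 0.783e-2, 0.812e-1], 0.6136e-4),
    "atpbinding plus (fig 13.3)": ([0.302e-4, 0.145e-1], 0.4368e-6),
    "globin (fig 13.6)": ([0.197e-2, 0.409e-2, 0.293e-1,
                           0.147e-1, 0.375e-1, 0.483e-1], 0.6273e-11),
}

for title, (probabilities, reported) in examples.items():
    product = pattern_probability(probabilities)
    print("%-28s product %.4e  reported %.4e" % (title, product, reported))
```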

4. The pattern definition system is very flexible. Assume that a laboratory has a large library of patterns stored in its computer. Different groups or users may want to screen their sequences against different subsets of a pattern library. Each group therefore uses its own "File of pattern file names" which contains only the names of the pattern files that are relevant to their sequences. Of course a pattern may contain only one motif. Hence a library of patterns can include both simple and complex patterns. In the same way a laboratory may have a large library of weight matrices defining different motifs and different users may want to combine them in different ways to produce their own patterns.

Also, of course, a library does not have to be used solely for performing mass screenings: each individual entry can be used as a single pattern by giving the name of its pattern file - eg PROSITEP/PS00002.PAT.
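As an illustration of note 4, the sketch below builds a group's own "file of pattern file names" for a subset of PROSITE entries. It is a minimal sketch with hypothetical function names, and it assumes that the file simply holds one pattern file name per line, which should be checked against the file written by SPLITP3. Pattern file names follow the PROSITEP/accession.PAT form used above.

```python
def pattern_file_name(accession, prefix="PROSITEP/"):
    """Build the pattern file name for a PROSITE accession number."""
    return prefix + accession + ".PAT"

def file_of_names_text(pattern_files):
    """Return the text of a file of pattern file names.

    Assumption: one pattern file name per line.
    """
    return "".join(name + "\n" for name in pattern_files)

def write_file_of_names(path, pattern_files):
    """Write the file of pattern file names to disk."""
    with open(path, "w") as handle:
        handle.write(file_of_names_text(pattern_files))

# A subset mixing two PROSITE entries with a locally created pattern.
subset = [pattern_file_name("PS00028"),
          pattern_file_name("PS00034"),
          "_paper.pat"]
print(file_of_names_text(subset), end="")
```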

5. Note that 5 of the PROSITE motifs contain the symbols < or >, which mean that the motifs must appear exactly at the N or C termini of the sequences. Currently our methods have no mechanism for such definitions, so these motifs (KDEL motifs, for example) will be permitted to occur anywhere in a sequence.

4. References

1. Staden, R. 1988. Methods to define and locate patterns of motifs in sequences. CABIOS 4(1):53-60.

2. Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. CABIOS 5(2):89-96.

3. Staden, R. 1990. Searching for patterns in protein and nucleic acid sequences. (in) Methods in Enzymology R.F. Doolittle (ed.), 183:193-211 (Academic Press, New York).

14. Comparing Sequences

Table of contents

1. Introduction

2. Methods

2.1 Producing a dot matrix plot (or list) of exact matches

2.2 Producing a dot matrix plot using the proportional algorithm

2.3 Producing a dot matrix plot using the quick scan algorithm

2.4 Producing a list of all matching segments using the proportional algorithm

2.5 Calculating the expected scores for the proportional algorithm

2.6 Calculating the observed scores for the proportional algorithm

2.7 Producing an optimal alignment

2.8 Comparing a sequence against a library of sequences

3. Notes

4. References

1. Introduction

In this chapter we describe methods for comparing and aligning pairs of nucleic acid or protein sequences. The program described (SIP), the original version of which was first described in 1982 (1), is based around several methods for producing "dot matrix" plots and includes routines for assessing the statistical significance of the plots, plus a dynamic programming algorithm for finding optimal alignments. At the end of the chapter we describe a program SIPL that is used for comparing a single sequence against a whole library of sequences.

We assume the reader is familiar with the general principle of dot matrix diagrams. The program uses a number of different algorithms to calculate the score for each point in a dot matrix and the user defines a minimum score so that only those points in the diagram for which the score is at least this value will be marked with a dot. The first scoring method finds uninterrupted sections of perfect identity i.e. those that contain no mismatches, insertions or deletions. Generally this method, termed "the identities algorithm" is of limited value, but runs very quickly.
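To make the idea behind the identities algorithm concrete, here is a minimal Python sketch that lists every uninterrupted run of identity of at least a given length. It is an illustration only: the function name is hypothetical and the straightforward scan along each diagonal is not claimed to be the method used by the program, which runs far more quickly.

```python
def identity_runs(seq_a, seq_b, min_length):
    """List the uninterrupted runs of identity between two sequences.

    Returns (start in seq_a, start in seq_b, length) for every maximal
    run of at least min_length characters; positions are numbered from 1.
    """
    runs = []
    len_a, len_b = len(seq_a), len(seq_b)
    # Each diagonal of the dot matrix has a fixed offset d = j - i.
    for d in range(-(len_a - 1), len_b):
        i = max(0, -d)
        j = i + d
        run = 0
        while i < len_a and j < len_b:
            if seq_a[i] == seq_b[j]:
                run += 1
            else:
                if run >= min_length:
                    runs.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:               # run reaching the end of a diagonal
            runs.append((i - run + 1, j - run + 1, run))
    return runs

print(identity_runs("ACGTAC", "TTCGTA", 3))
```

Here min_length plays the role of the minimum score: only runs at least that long would be marked in the diagram.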

The second method looks for sections where a proportion of the characters in the sequence are similar, again allowing no insertions or deletions. For a thorough analysis this method, termed "the proportional algorithm", is the best. The original method of this type was first described by McLachlan (2) and involves calculating a score for each position in the matrix by summing points found when looking forwards and backwards along a diagonal line of a given length (the window). The algorithm does not simply look for identity but uses a score matrix that contains scores for every possible pair of characters. For comparing amino acid sequences we usually use the score matrix MDM78 (3) which is shown in figure 14.1. It is also possible to use other matrices, including an identity matrix for proteins. For nucleic acids we usually use an identity matrix.

C S T P A G N D E Q B Z H R K M I L V F Y W - X ?

C 22 10 8 7 8 7 6 5 5 5 5 5 7 6 5 5 8 4 8 6 10 2 10 10 10 10

S 10 12 11 11 11 11 11 10 10 9 10 10 9 10 10 8 9 7 9 7 7 8 10 10 10 10

T 8 11 13 10 11 10 10 10 10 9 10 10 9 9 10 9 10 8 10 7 7 5 10 10 10 10

P 7 11 10 16 11 9 9 9 9 10 9 10 10 10 9 8 8 7 9 5 5 4 10 10 10 10

A 8 11 11 11 12 11 10 10 10 10 10 10 9 8 9 9 9 8 10 6 7 4 10 10 10 10

G 7 11 10 9 11 15 10 11 10 9 10 10 8 7 8 7 7 6 9 5 5 3 10 10 10 10

N 6 11 10 9 10 10 12 12 11 11 12 11 12 10 11 8 8 7 8 6 8 6 10 10 10 10

D 5 10 10 9 10 11 12 14 13 12 13 12 11 9 10 7 8 6 8 4 6 3 10 10 10 10

E 5 10 10 9 10 10 11 13 14 12 12 13 11 9 10 8 8 7 8 5 6 3 10 10 10 10

Q 5 9 9 10 10 9 11 12 12 14 11 13 13 11 11 9 8 8 8 5 6 5 10 10 10 10

B 5 10 10 9 10 10 12 13 12 11 13 11 11 10 10 8 8 6 8 5 7 4 10 10 10 10

Z 5 10 10 10 10 10 11 12 13 13 11 14 12 10 10 8 8 8 8 5 6 4 10 10 10 10

H 7 9 9 10 9 8 12 11 11 13 11 12 16 12 10 8 8 8 8 8 10 7 10 10 10 10

R 6 10 9 10 8 7 10 9 9 11 10 10 12 16 13 10 8 7 8 6 6 12 10 10 10 10

K 5 10 10 9 9 8 11 10 10 11 10 10 10 13 15 10 8 7 8 5 6 7 10 10 10 10

M 5 8 9 8 9 7 8 7 8 9 8 8 8 10 10 16 12 14 12 10 8 6 10 10 10 10

I 8 9 10 8 9 7 8 8 8 8 8 8 8 8 8 12 15 12 14 11 9 5 10 10 10 10

L 4 7 8 7 8 6 7 6 7 8 6 8 8 7 7 14 12 16 12 12 9 8 10 10 10 10

V 8 9 10 9 10 9 8 8 8 8 8 8 8 8 8 12 14 12 14 9 8 4 10 10 10 10

F 6 7 7 5 6 5 6 4 5 5 5 5 8 6 5 10 11 12 9 19 17 10 10 10 10 10

Y 10 7 7 5 7 5 8 6 6 6 7 6 10 6 6 8 9 9 8 17 20 10 10 10 10 10

W 2 8 5 4 4 3 6 3 3 5 4 4 7 12 7 6 5 8 4 10 10 27 10 10 10 10

- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

Figure 14.1 The amino acid score matrix MDM78.

For the proportional method, plotting dots at the centres of windows that reach the cutoff leads to a persistence effect that, to some extent, can be mitigated by a variation on the method. If, for example, all the high scoring amino acids are clustered at the left end of a particular diagonal segment, dots will continue to be plotted to their right until the window score drops below the cutoff. Instead of plotting a single point for each window that reaches the cutoff score, the variant method plots points for all the identities that lie in windows that reach the cutoff. Obviously the persistence effect can be more pronounced for long windows and low cutoff scores, but note that the variant method will plot nothing if there are no identities present, and so similar regions could be missed! A further variant, useful for comparing a sequence against itself, ignores the main diagonal.
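The proportional algorithm and its identities variant can be sketched as follows. This is only an illustration under stated assumptions: the names `proportional_dots` and `identity_score` are hypothetical, a simple identity matrix stands in for MDM78, positions are 0-based, and windows that would overhang the ends of the sequences are simply skipped (the program may treat the ends differently).

```python
def identity_score(a, b):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def proportional_dots(seq_h, seq_v, window, cutoff,
                      score=identity_score, identities_only=False):
    """Return the set of (h, v) positions that would be marked with a dot."""
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    half = window // 2
    dots = set()
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            # Sum the pair scores along the diagonal, centred on (i, j).
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            if total < cutoff:
                continue
            if identities_only:
                # Variant: plot only the identities inside the window.
                for k in range(-half, half + 1):
                    if seq_h[i + k] == seq_v[j + k]:
                        dots.add((i + k, j + k))
            else:
                # Standard method: plot the centre of the window.
                dots.add((i, j))
    return dots


if __name__ == "__main__":
    print(sorted(proportional_dots("AAAXX", "AAAYY", window=3, cutoff=2)))
```

Calling the function with `identities_only=True` on a pair of segments that score well but share no identical characters returns an empty set, which is the situation the text warns about.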

The third comparison method, called "quick scan", is really a combination of the first two, and is similar to the FASTP program of Lipman and Pearson (4), but produces a dot matrix diagram. The algorithm is as follows. The dot matrix positions are found for all words of some minimum length (obviously length 1 is most sensitive) that are common to both sequences. Imagine a diagonal line running from corner to corner of the diagram, at right angles to the diagonals in the dot matrix. The scores for the common words (according to the current score matrix, e.g. MDM78) are accumulated at the appropriate positions on that imaginary line, hence producing a histogram. The histogram is analysed to find its mean and standard deviation. The diagonals that lie above some cutoff score (defined in standard deviation units) are rescanned using the proportional algorithm, and a diagram is produced. The method is very fast, and is also employed by the library comparison program (see below).
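The first stage of the quick scan, building the histogram of diagonal scores and picking out the diagonals to rescan, might look like the sketch below. The names `diagonal_histogram` and `best_diagonals` are hypothetical, an identity matrix replaces MDM78, overlapping words each contribute their full score, and the population standard deviation is used; all of these are assumptions, and the program may differ in each respect.

```python
from statistics import mean, pstdev


def identity_score(a, b):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def diagonal_histogram(seq_h, seq_v, word_length, score=identity_score):
    """Accumulate the score of every common word on its diagonal (i - j)."""
    hist = {d: 0 for d in range(-(len(seq_v) - 1), len(seq_h))}
    # Index the positions of every word in the vertical sequence.
    positions = {}
    for j in range(len(seq_v) - word_length + 1):
        positions.setdefault(seq_v[j:j + word_length], []).append(j)
    # Look up each word of the horizontal sequence in that index.
    for i in range(len(seq_h) - word_length + 1):
        word = seq_h[i:i + word_length]
        for j in positions.get(word, ()):
            hist[i - j] += sum(score(c, c) for c in word)
    return hist


def best_diagonals(hist, sd_cutoff):
    """Return diagonals scoring more than sd_cutoff s.d. above the mean."""
    values = list(hist.values())
    mu = mean(values)
    sd = pstdev(values)
    return sorted(d for d, v in hist.items() if v > mu + sd_cutoff * sd)


if __name__ == "__main__":
    hist = diagonal_histogram("XXXXACDEFG", "ACDEFGYYYY", word_length=2)
    print(best_diagonals(hist, 3.0))
```

Only the diagonals returned by `best_diagonals` would then be rescanned with the proportional algorithm, which is why the resulting plot shows almost no background.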

The dynamic programming alignment algorithm contained in the program is based on that of Myers and Miller (5). It guarantees to produce alignments with the optimum score given a score matrix, a gap start penalty, and a gap extension penalty. It is very useful to have the dot matrix methods and the alignment routine together in the same program because it allows users to produce a dot matrix diagram to help select which regions of the sequence they wish to align. Selection is made by use of the crosshair. The crosshair is positioned first at the bottom left hand end of the segment to be aligned and then at the top right of the segment. When the alignment routine is selected the segment will be aligned. The alignment can replace the original segment of the sequence. By repeated plotting of dot matrices, followed by alignment, very long sequences can easily be aligned.

2. Methods

2.1 Producing a dot matrix plot (or list) of exact matches

This method is relatively fast and can be useful for very similar sequences. It marks the position of every exact match of some minimum length with a dot or lists out the matching segments.

1. Select "Apply identities algorithm".

2. Define "Identity score".

3. Select "Plot or List". The plot will appear as in figure 14.2, which shows a comparison of two protein sequences using a score of 2. Listed output displays the matching segments and defines their positions.

Figure 14.2 A dot-matrix for two related protein sequences using the "Identities algorithm" and a score of 2. Notice that the similarity is not apparent.

2.2 Producing a dot matrix plot using the proportional algorithm

This method gives the most thorough analysis.

1. Select "Apply proportional algorithm".

2. Define "Odd window length". The size of window over which the scores for each point are summed.

3. Define "Proportional score". All points achieving at least this score will be marked with a dot in the diagram. The plot will appear as in figure 14.3.

Figure 14.3 A dot-matrix for the two related protein sequences shown in figure 14.2, but here using the "Proportional algorithm" with a window of 21 and a score of 240. Notice that the similarity is now apparent.

2.3 Producing a dot matrix plot using the quick scan algorithm

This method is very fast. Using the current score matrix it accumulates the scores for all the exact matches that lie on each diagonal. The mean diagonal score and its standard deviation are calculated, and those diagonals that have scores more than a chosen number of standard deviations above the mean are rescanned using the proportional algorithm. The points above the proportional algorithm's cutoff are then plotted.

1. Select "Apply quick scan algorithm".

2. Define "Identity score". The minimum number of consecutive identical sequence symbols that count as a match.

3. Define "Odd window length". The size of window over which the scores for each point are summed when the proportional algorithm is applied to the best diagonals.

4. Define "Proportional score". For the best diagonals all points achieving at least this score will be marked with a dot in the diagram.

5. Define "Number of s.d. above mean". Diagonals with scores above the minimum number of standard deviations are rescanned using the proportional algorithm. The plot will appear as in figure 14.4.

Figure 14.4 A dot-matrix for the two related protein sequences shown in figures 14.2 and 14.3, but here using the "Quick scan algorithm" with an identity score of 1 and a window of 21 and a score of 240 for the proportional algorithm. Notice that the similarity is now apparent but the absence of background "noise" is misleading.

2.4 Producing a list of all matching segments using the proportional algorithm

1. Select "List matching segments".

2. Define "Odd window length". The size of window over which the scores for each point are summed.

3. Define "Proportional score". All segments achieving at least this score will be listed out with the two sequences written one above the other. See figure 14.5.

List matching segments

? Odd window length (1-401) (11) =

? Proportional score (1-567) (252) =

Working

62

GLRRGLDVKDLEHPIEVPVGK

DLAEGMKVKCTGRILEVPVGR

81

63

LRRGLDVKDLEHPIEVPVGKA

LAEGMKVKCTGRILEVPVGRG

82

65

RGLDVKDLEHPIEVPVGKATL

EGMKVKCTGRILEVPVGRGLL

84

66

GLDVKDLEHPIEVPVGKATLG

GMKVKCTGRILEVPVGRGLLG

85

67

LDVKDLEHPIEVPVGKATLGR

MKVKCTGRILEVPVGRGLLGR

86

Figure 14.5 A typical run of "List matching segments".

2.5 Calculating the expected scores for the proportional algorithm

This function calculates the probability of achieving each possible score using the proportional algorithm. Hence it provides a method of setting cutoff scores and assessing the statistical significance of the scores found. The algorithm calculates the "Double matching probability" described by McLachlan (2) which is defined as the probability of finding the scores in two infinitely long sequences of the same composition as the pair being compared. It is very much faster than the alternative of repeatedly scrambling and recomparing the sequences. The program offers three ways for the user to see the results of the calculation: the user can type a score and the program will display its probability; the user can type a probability and the program will display the corresponding score; alternatively the program will list the full range of scores and probabilities.

1. Select "Calculate expected scores".

2. Define "Odd window length".

The calculation takes a noticeable time.

3. Select "List scores and probabilities".

4. Define "Number of steps between scores". This allows, say, every fifth score to be listed if the user defines the number of steps to be 5. The list will appear as in figure 14.6.
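The general idea behind an expected score calculation of this kind can be sketched by treating the window score as the sum of independent pair scores drawn according to the compositions of the two sequences. The sketch below is an assumption-laden illustration, not McLachlan's exact formulation or the program's code: the function names are hypothetical, an identity matrix replaces MDM78, and the reported probability is taken to be that of reaching at least a given score, which is how the listed values appear to behave.

```python
from collections import Counter


def identity_score(a, b):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def pair_score_distribution(seq_a, seq_b, score=identity_score):
    """Probability of each score for one random pair of residues."""
    dist = {}
    for a, na in Counter(seq_a).items():
        for b, nb in Counter(seq_b).items():
            p = (na / len(seq_a)) * (nb / len(seq_b))
            s = score(a, b)
            dist[s] = dist.get(s, 0.0) + p
    return dist


def window_score_distribution(pair_dist, window):
    """Convolve the single-pair distribution over the window length."""
    dist = {0: 1.0}
    for _ in range(window):
        nxt = {}
        for s1, p1 in dist.items():
            for s2, p2 in pair_dist.items():
                nxt[s1 + s2] = nxt.get(s1 + s2, 0.0) + p1 * p2
        dist = nxt
    return dist


def probability_at_least(dist, cutoff):
    """Probability that a window scores at least cutoff."""
    return sum(p for s, p in dist.items() if s >= cutoff)


if __name__ == "__main__":
    pairs = pair_score_distribution("AC", "AC")
    windows = window_score_distribution(pairs, 3)
    for cutoff in range(4):
        print(cutoff, probability_at_least(windows, cutoff))
```

Because the calculation works on the score distribution rather than on the sequences themselves, its cost depends on the window length and the range of scores, not on repeated scrambling and recomparison.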

Calculate expected proportional scores

? Odd window length (1-401) (21) =

Working

Average score= 196.99062

Select probability display mode

1 Show probability for a score

X 2 Show score for a probability

3 List scores and probabilities

? Selection (1-3) (2) =3

? Number of steps between scores (1-10) (5) =

5 0.10000E+01 200 0.40004E+00 395 0.00000E+00

10 0.10000E+01 205 0.24037E+00 400 0.00000E+00

15 0.10000E+01 210 0.12555E+00 405 0.00000E+00

20 0.10000E+01 215 0.56905E-01 410 0.00000E+00

25 0.10000E+01 220 0.22402E-01 415 0.00000E+00

30 0.10000E+01 225 0.76821E-02 420 0.00000E+00

35 0.10000E+01 230 0.23031E-02 425 0.00000E+00

40 0.10000E+01 235 0.60614E-03 430 0.00000E+00

45 0.10000E+01 240 0.14064E-03 435 0.00000E+00

50 0.10000E+01 245 0.28888E-04 440 0.00000E+00

55 0.10000E+01 250 0.52741E-05 445 0.00000E+00

60 0.10000E+01 255 0.85917E-06 450 0.00000E+00

65 0.10000E+01 260 0.12534E-06 455 0.00000E+00

70 0.10000E+01 265 0.16433E-07 460 0.00000E+00

75 0.10000E+01 270 0.19425E-08 465 0.00000E+00

80 0.10000E+01 275 0.20772E-09 470 0.00000E+00

85 0.10000E+01 280 0.20155E-10 475 0.00000E+00

90 0.10000E+01 285 0.17801E-11 480 0.00000E+00

95 0.10000E+01 290 0.14353E-12 485 0.00000E+00

100 0.10000E+01 295 0.10599E-13 490 0.00000E+00

105 0.10000E+01 300 0.71886E-15 495 0.00000E+00

110 0.10000E+01 305 0.44920E-16 500 0.00000E+00

115 0.10000E+01 310 0.25938E-17 505 0.00000E+00

120 0.10000E+01 315 0.13881E-18 510 0.00000E+00

Figure 14.6 A typical run of "Calculate expected proportional scores". The scores are listed in three columns alongside their probabilities, e.g. score 250 has a probability of 0.527 x 10^-5.

2.6 Calculating the observed scores for the proportional algorithm

This function applies the proportional algorithm, but instead of producing a dot matrix it accumulates the scores and their frequencies of occurrence. It provides a method of setting cutoff scores and assessing the statistical significance of the scores found. The program offers three ways for the user to see the results of the calculation: the user can type a score and the program will display its frequency; the user can type a frequency and the program will display the corresponding score; alternatively the program will list the full range of scores and frequencies. The frequencies are expressed as percentages.

1. Select "Calculate observed scores".

2. Define "Odd window length".

The calculation takes a noticeable time.

3. Select "List scores and percentages".

4. Define "Number of steps between scores". This allows, say, every fifth score to be listed if the user defines the number of steps to be 5. The list will appear as in figure 14.7.
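The observed score calculation can be pictured as running the proportional algorithm over every window position and tallying the scores instead of plotting them. The following sketch uses hypothetical names (`observed_scores`, `percentage_reaching`), an identity matrix in place of MDM78, and skips windows that would overhang the sequence ends; these are assumptions for illustration only.

```python
from collections import Counter


def identity_score(a, b):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def observed_scores(seq_h, seq_v, window, score=identity_score):
    """Count how many window positions achieve each score."""
    half = window // 2
    counts = Counter()
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            counts[total] += 1
    return counts


def percentage_reaching(counts, cutoff):
    """Percentage of all points whose score is at least cutoff."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    reaching = sum(n for s, n in counts.items() if s >= cutoff)
    return 100.0 * reaching / total


if __name__ == "__main__":
    counts = observed_scores("AAAXX", "AAAYY", window=3)
    for cutoff in sorted(counts):
        print(cutoff, counts[cutoff], percentage_reaching(counts, cutoff))
```

Comparing these observed percentages with the expected probabilities for the same window length gives a rough indication of whether high scores occur more often than chance would suggest.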

Calculate observed proportional scores

? Odd window length (1-401) (21) =

Working

Maximum observed score is 285

Select score display mode

X 1 Show percentage reaching a score

2 Show score for a percentage

3 List scores and percentages

? Selection (1-3) (1) =3

? Number of steps between scores (1-10) (5) =

156 236949 0.99998E+02

161 236938 0.99993E+02

166 236792 0.99932E+02

171 235882 0.99548E+02

176 232582 0.98155E+02

181 222875 0.94058E+02

186 203232 0.85769E+02

191 171507 0.72380E+02

196 131216 0.55376E+02

201 89194 0.37642E+02

206 52791 0.22279E+02

211 27315 0.11528E+02

216 12117 0.51137E+01

221 4890 0.20637E+01

226 1774 0.74867E+00

231 656 0.27685E+00

236 263 0.11099E+00

241 111 0.46845E-01

246 66 0.27854E-01

251 36 0.15193E-01

256 23 0.97065E-02

261 16 0.67524E-02

266 15 0.63303E-02

271 10 0.42202E-02

276 6 0.25321E-02

281 2 0.84405E-03

Figure 14.7 A typical run of "Calculate observed scores". The scores are followed by their observed number of occurrences expressed both absolutely and as a percentage of the total number of points.

2.7 Producing an optimal alignment

This function produces an optimal alignment for any segments of the two sequences using the algorithm of Myers and Miller (5). It guarantees to produce alignments with the optimum score, given a score matrix, a "gap start penalty" and a "gap extension penalty". That is, starting a gap costs a fixed penalty F and each residue added to the gap costs a further penalty E, so for a gap of length K residues the penalty is F + KE. Gaps at the ends of sequences incur no penalty. The size of the segments of sequence that can be aligned at once is limited to 5000 characters. The user can select the start and end of the segments by use of the crosshair simply by clicking on any dot matrix plot. After the alignment has been produced the user can elect to have it replace the original sequence segments. By alternate use of dot matrix plotting and alignment, very long sequences can be aligned.

1. Select "Align sequences". The crosshair will appear in the graphics window.

2. Position the crosshair on the bottom left of the segment to be aligned and hit the space bar on the keyboard. The bell will ring.

3. Position the crosshair on the top right of the segment to be aligned and hit the space bar on the keyboard. The bell will ring.

4. Define "Penalty for starting each gap".

5. Define "Penalty for each residue in gap".

A noticeable time will elapse before the alignment is displayed on the screen. A typical alignment is shown in figure 14.8.

6. Reject "Keep alignment". If the alignment is "kept" the padded sequences from the alignment will replace the original sequences in the active region.

Align the sequences

Aligning region 1 to 461

with region 1 to 514

Working

V 1 11 21 31 41 51

MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----

* * * ** * * * * *

MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY

H 1 11 21 31 41 51

V 61 71 81 91 101 111

EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE

* * ** * * ** ***** *** * ** * * **

AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP

H 61 71 81 91 101 111

V 121 131 141 151 161 171

IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM

* ** * ** * * * * * * ***

LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI

H 121 131 141 151 161 171

V 181 191 201 211 221 231

ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV

* * ** * * *

DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA

H 181 191 201 211 221 231

V 241 251 261 271 281 291

ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ

* * *** * * * * * * ** * * *

RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL

H 241 251 261 271 281 291

V 301 311 321 331 341 351

ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ

** **** * * * * * *

ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN

H 301 311 321 331 341 351

V 361 371 381 391 401 411

IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD

** *** * * ** * * * * * **

LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--

H 361 371 381 391 401 411

V 421 431 441 451 461 471

ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI

* * * * * * * * * * * *

DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA

H 421 431 441 451 461 471

V 481 491 501 511 521

MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*

** * * * * *

LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*

H 481 491 501 511 521

Conservation 22.5%

Number of padding characters inserted 63 and 10

Figure 14.8 A typical output from "Align the sequences". The horizontal and vertical sequences are labelled H and V.
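The gap penalty rule used by the alignment routine (F to start a gap plus E for each residue in it, with no charge for gaps at the ends of the sequences) can be checked on a pair of padded sequences with a short sketch such as the one below. The function name `total_gap_penalty` is hypothetical, "-" is assumed to be the padding character, and the padded sequences are assumed to be given without the spaces that separate the blocks of ten in the printed output.

```python
def total_gap_penalty(padded_a, padded_b, gap_start, gap_extend, pad="-"):
    """Sum F + K*E over every internal gap of length K in either sequence."""
    if len(padded_a) != len(padded_b):
        raise ValueError("padded sequences must have equal length")
    total = 0
    for seq in (padded_a, padded_b):
        # Gaps at the ends of a sequence incur no penalty, so drop them.
        core = seq.strip(pad)
        run = 0
        for ch in core:
            if ch == pad:
                run += 1
            else:
                if run:
                    total += gap_start + run * gap_extend
                run = 0
    return total


if __name__ == "__main__":
    # One internal gap of two residues: 10 + 2 * 10
    print(total_gap_penalty("MA--TGK", "MQLNSTE", 10, 10))
```

With both penalties set to 10, a single two-residue gap costs 30, while the same two padding characters placed at the start or end of a sequence cost nothing.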

2.8 Comparing a sequence against a library of sequences

The program SIPL is used for comparing a probe sequence against a whole library of sequences. The searches are very fast and use the "Quick scan" algorithm described above to produce a list of matching sequences sorted in score order. Optionally, this is followed by the production of optimal alignments using the Myers and Miller (5) algorithm. The program will search the whole of a library or restrict its search using a list of entry names. The list of entry names can be used either as a list of sequences to search or conversely as a list of sequences to exclude from a search.

1. Select SIPL.

2. Select "Personal file".

3. Select "Format".

4. Define "Name of sequence file". The name of the file containing the probe sequence.

5. Define "Name of results file".

6. Accept "Display alignments". The alternative will stop after producing a list of the best matching sequences.

7. Define "Minimum library sequence length". This permits the search to skip sequences that are too short to be of interest.

8. Define "Maximum number of scores to list". The maximum number of sequences that will be included in the results file.

9. Define "Identity score". This is the minimum number of consecutive sequence characters that will be counted as a match. Only matches of at least this length will be included in the overall score. For proteins maximum sensitivity is gained using a value of 1, but for nucleic acids values of 4 or 6 are necessary to achieve reasonable speed.

10. Define "Number of sd above mean". This means the number of standard deviations above the mean that a diagonal must score in order for it to be scanned using the proportional algorithm.

11. Define "Odd window length". This is the window size for the rescanning of high scoring diagonals using the proportional algorithm.

12. Define "Proportional score". The score used by the proportional algorithm. It depends on the window length and the score matrix.

13. Define "Minimum global score". This is the total score achieved using the proportional algorithm when all the diagonals scoring the defined number of standard deviations above the mean, are rescanned.

14. Define "Penalty for starting a gap". This is for the alignment algorithm.

15. Define "Penalty for each residue in gap". See above.

16. Select a library to search. The default library will reflect the composition of the probe sequence. That is, a probe sequence that is less than 85% acgt will be guessed to be a protein.

17. Select "Search whole library". The alternatives allow the search to be restricted using a list of entry names. The search will start.

A large number of parameters are required but for normal use the default values can be taken for them all. A worked example is shown in figure 14.9.

SIPL (Similarity investigation program (Library)) V3.0 June 1991

Author: Rodger Staden

Compares a probe protein or nucleic acid

sequence against a library of sequences

Select probe sequence

Select sequence source

X 1 Personal file

2 Sequence library

? Selection (1-2) (1) =2

Select a library

1 EMBL nucleotide library

X 2 SWISSPROT protein library

3 PIR protein library

? Selection (1-3) (2) =

Library is in EMBL format with indexes

Select a task

X 1 Get a sequence

2 Get annotations

3 Get entry names from accession numbers

4 Search titles for keywords

5 Search keyword index for keywords

? Selection (1-5) (1) =

? Entry name=bacr$halha

DE BACTERIORHODOPSIN PRECURSOR (BR) (GENE NAME: BOP).

Sequence length= 262

Sequence composition

A C S T P A G N D E Q B Z H

N 0. 14. 19. 12. 30. 26. 3. 10. 11. 4. 0. 0. 0.

% 0.0 5.3 7.3 4.6 11.5 9.9 1.1 3.8 4.2 1.5 0.0 0.0 0.0

W 0. 1219. 1921. 1165. 2132. 1483. 342. 1151. 1420. 513. 0. 0. 0.

A R K M I L V F Y W - X ?

N 7. 7. 10. 15. 39. 23. 13. 11. 8. 0. 0. 0. 0.

% 2.7 2.7 3.8 5.7 14.9 8.8 5.0 4.2 3.1 0.0 0.0 0.0 0.0

W 1093. 897. 1312. 1697. 4413. 2280. 1913. 1795. 1490. 0. 0. 0. 0.

Total molecular weight= 28256.254

? Results file=sipl.res

? Display alignments (y/n) (y) =

? Minimum library sequence length (10-20000) (209) =

? Maximum number of scores to list (1-10000) (20) =10

? Identity score (1-3) (1) =

? Number of sd above mean (0.00-10.00) (3.00) =

? Odd window length (1-31) (11) =

? Proportional score (1-297) (132) =

? Minimum global score (1-69168) (1729) =

? Penalty for starting a gap (1-100) (10) =

? Penalty for each residue in gap (1-100) (10) =

Select a library

1 EMBL nucleotide library

X 2 SWISSPROT protein library

3 PIR protein library

4 Personal file in PIR format

? Selection (1-4) (2) =

Library is in EMBL format with indexes

Select a task

X 1 Search whole library

2 Search only a list of entries

3 Search all but a list of entries

? Selection (1-3) (1) =3

? File of entry names=skip.nam

21794 entries processed, 25 above cutoff, sorting now

Entries exceeding sd cutoff= 4439

Mean number of diagonals above span cutoff 1.32012

List in score order

31007 BACA$HALSA DE ARCHAERHODOPSIN PRECURSOR (AR).

12177 BACH$NATPH DE HALORHODOPSIN PRECURSOR (HR) (GENE NAME: HOP).

10999 BACH$HALSP DE HALORHODOPSIN PRECURSOR (HR) (GENE NAME: HOP).

3999 HYAC$ECOLI DE HYPOTHETICAL 27.6 KD PROTEIN IN HYAB 3'REGION (GENE NAM

2670 OPS4$DROME DE OPSIN RH4 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NA

2573 PYR1$MESAU DE CAD PROTEIN (CONTAINS: GLUTAMINE-DEPENDENT CARBAMOYL-PH

2328 PFLA$ECOLI DE PYRUVATE FORMATE-LYASE ACTIVATING ENZYME.

2194 DCOP$CANAL DE OROTIDINE 5'-PHOSPHATE DECARBOXYLASE (EC 4.1.1.23) (OMP

2145 BCM1$HUMAN DE LYMPHOCYTE ACTIVATION MARKER BLAST-1 PRECURSOR (BCM1 SU

2103 LAG3$HUMAN DE LAG-3 PROTEIN PRECURSOR (FDC PROTEIN) (GENE NAME: LAG3

BACA$HALSA DE ARCHAERHODOPSIN PRECURSOR (AR).

V 1 11 21 31 41 51

MLELLPTAVE GVSQAQITGR PEWIWLALGT ALMGLGTLYF LVKGMGVSDP DAKKFYAITT

* ** ** ** ** ** ** ** *** ** * * * **

M-DPIALTAA VGADLLGDGR PETLWLGIGT LLMLIGTFYF IVKGWGVTDK EAREYYSITI

H 1 11 21 31 41 51

V 61 71 81 91 101 111

LVPAIAFTMY LSMLLGYGLT MVPFGGEQNP IYWARYADWL FTTPLLLLDL ALLVDADQGT

*** ** * *** * *** * * * ** ******* ********** *** *

LVPGIASAAY LSMFFGIGLT EVQVGSEMLD IYYARYADWL FTTPLLLLDL ALLAKVDRVS

H 61 71 81 91 101 111

V 121 131 141 151 161 171

ILALVGADGI MIGTGLVGAL TKVYSYRFVW WAISTAAMLY ILYVLFFGFT SKAESMRPEV

* *** * ** ******* * * * ** * ** * * ***

IGTLVGVDAL MIVTGLVGAL SHTPLARYTW WLFSTICMIV VLYFLATSLR AAAKERGPEV

H 121 131 141 151 161 171

V 181 191 201 211 221 231

ASTFKVLRNV TVVLWSAYPV VWLIGSEGAG IVPLNIETLL FMVLDVSAKV GFGLILLRSR

**** * *** *** * ** **** * * ***** ****** *** *** ******

ASTFNTLTAL VLVLWTAYPI LWIIGTEGAG VVGLGIETLL FMVLDVTAKV GFGFILLRSR

H 181 191 201 211 221 231

V 241 251 261

AIFGEAEAPE PSAGDGAAAT SD

** * **** **** * *

AILGDTEAPE PSAG-AEASA AD

H 241 251 261

Conservation 56.1%

Number of padding characters inserted 0 and 2

Figure 14.9 A run of SIPL using an entry from a sequence library and a file of entries to be excluded from the search.
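The rule used to choose the default library in step 16 (a probe that is less than 85% acgt is guessed to be a protein) can be expressed as a short sketch. The function name `guess_sequence_type` is hypothetical, and the case-insensitive comparison and the treatment of exactly 85% as nucleic acid are assumptions; the manual does not say how other nucleotide symbols such as u are counted.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'protein' if fewer than 85% of the characters are a, c, g or t."""
    if not sequence:
        raise ValueError("empty sequence")
    acgt = sum(1 for ch in sequence.lower() if ch in "acgt")
    if acgt / len(sequence) < threshold:
        return "protein"
    return "nucleic acid"


if __name__ == "__main__":
    print(guess_sequence_type("MLELLPTAVE"))
    print(guess_sequence_type("ACGTACGTAC"))
```

A guess of this kind only sets the default; the user can still select a different library from the menu.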

3. Notes

1. The variants on the proportional algorithm are selected by setting parameters using a special menu. This includes the facility to switch off the main diagonal for all options, which is useful when comparing a sequence against itself.

2. For nucleotide sequences the program also has a function to complement a sequence. If the sequence on one axis is the complement of that on the other, the plots will show possible base pairing.

3. When the crosshair is being employed, in addition to the standard special keys, the letter m will produce a display showing all the identical sequence characters around the crosshair position. The display is in the form of a matrix.

4. Users should not be misled by the "Quick scan" algorithm. Its function is to perform rapid comparisons. The plots it produces may look quite striking because they will contain almost no background; however, such plots tell nothing about the significance of the similarities displayed.

5. By using the "Reposition plots" function users can display several dot matrix plots on the screen at the same time. In this way plots from several pairs of sequence comparisons can be viewed together.

6. The library search program SIPL is included in the package only for those who are unable to obtain FASTA, or who want to avoid having to support yet another sequence library format. It is of limited use for searching the nucleic acid libraries because it does not deal properly with sequences longer than 50,000 characters, but simply truncates them. Note that FASTA can read the EMBL CDROM copies of the EMBL nucleic acid library and SWISSPROT.
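As an illustration of note 2, complementing a nucleotide sequence before comparison turns every potential base pair into an identity, so that the identities and proportional algorithms mark positions that could pair. The sketch below uses a hypothetical function name, `complement`, and handles only the four standard bases in either case; it does not reverse the sequence, and the program's own function may handle further symbols.

```python
# Translation table for the four standard bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence):
    """Return the base-by-base complement (not reversed) of a sequence."""
    return sequence.translate(_COMPLEMENT)


if __name__ == "__main__":
    print(complement("ACGTTGCA"))
```

Characters outside the table are left unchanged by `str.translate`, so ambiguity codes pass through untouched in this sketch.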

4. References

1. Staden, R. 1982. An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences. Nucl. Acids Res. 10(9):2951-2961.

2. McLachlan, A.D. 1971. Test for comparing related amino acid sequences. J. Mol. Biol. 61:409-424.

3. Schwartz, R.M. and Dayhoff, M.O. 1978. Matrices for detecting distant relationships. (in) Atlas of Protein Sequence and Structure, 5 suppl. 3:353-358, Nat. Biomed. Res. Found., Washington D.C.

4. Lipman, D.J. and Pearson, W.R. 1985. Rapid and sensitive protein similarity searches. Science 227:1435-1441.

5. Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Applic. Biosci. 4:11-17.