Fundamentals of Sequence Analysis, 1998-1999

Lecture 10. Web site survey

In this lecture we'll take a look at some of the more interesting web sites and web based tools.
Some of the web pages described in this lecture may not work with all browsers as they require very recent versions of Java.  This lecture was prepared using Netscape 4.5 on Windows NT, which worked with all of the pages shown.

Genome, Contig, Gene, and Sequence Viewers

The vast amount of sequence and sequence related information which the many genome projects have been accumulating has led several sites to develop and distribute viewers.  This is useful because the data in text form is often too hard to "digest" quickly.  Ultimately the goal is to provide the end user community with tools which can easily move from a chromosome scale view all the way down to a base pair view, and which can easily and quickly locate and display any feature, on any of these scales, that the user might want to see.  Designing a viewer which works well in displaying genetic features at sizes which vary by 8 orders of magnitude is no easy task!  Here are some of the currently available viewers.
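To get a feel for that range of scales, the following minimal Python sketch computes how many bases each screen pixel must represent, and how many tenfold zoom steps separate a whole-chromosome view from a single-base view. The chromosome length of 10^8 bp, the 800 pixel display width, and the function names are assumptions chosen for illustration; they do not describe any particular viewer.

```python
def bases_per_pixel(region_bp: int, width_px: int) -> float:
    """Average number of bases that one screen pixel has to represent."""
    return region_bp / width_px


def zoom_steps(start_bp: int, target_bp: int, factor: int) -> int:
    """Count how many zooms by `factor` shrink the visible span
    from start_bp down to target_bp (or below)."""
    steps = 0
    span = start_bp
    while span > target_bp:
        span //= factor  # each zoom shows 1/factor of the previous span
        steps += 1
    return steps


if __name__ == "__main__":
    chromosome_bp = 10**8   # assumed chromosome-scale span
    window_px = 800         # assumed display width in pixels
    print(bases_per_pixel(chromosome_bp, window_px), "bases per pixel")
    print(zoom_steps(chromosome_bp, 1, 10), "tenfold zooms to reach one base")
```

At the chromosome scale a single pixel stands for far more sequence than most genes occupy, which is why a viewer has to summarize features at low magnification and reveal them only as the user zooms in.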
 


AceDB

The forerunner of all of the web based viewers is AceDB (A C. elegans DataBase), which was developed for the C. elegans sequencing project.  Originally the name referred to both the software and the database it coordinated.  Later, as other genome projects adopted the software, the name usually came to refer just to the software, with a different name for the database.  In any case, AceDB has both GUI (using the X11 window system) and text versions.  The GUI version is pretty straightforward to use, but unfortunately, only the text version of the browser seems to be available on the web at this time.  Still, it will output maps in graphical format.  Here is one example of how it is used.

Starting from http://probe.nal.usda.gov:8000/other/aboutacedb.html click on  browse, then click on  ACEDB Query Language (at the bottom of the screen).  Enter find Locus unc-9, then click on submit.  Select unc-9 by clicking on it,  then click on the button labeled Select one or more objets and press HERE.  The page which comes up in the browser (not shown) has links to all sorts of data about the unc-9 gene.  For instance, on the resulting page click on display graphic to see:


The Genome Channel v2.0 and v1.0

The Genome Channel, from the Oak Ridge National Laboratory Computational Biosciences group, is an extremely ambitious project to provide a viewer which will work for all genomes.  V2.0 is the current prototype, but it requires a more recent Java version in the web browser than is common at this time, so many people will have to use V1.0, or do without.  Here is the opening page for this site (V2.0):

The main frame shows the various chromosomes for the organism.  The upper left frame controls which organism is shown. The lower left frame color codes the data shown in the main frame.  Here yellow regions are from TIGR.  Click on the first chromosome to bring up the chromosome map.  Select the magnifying glass to expand the view, which is usually required to read the contigs.  In the next image, the tip of human chromosome 1 is shown, after magnification, with three contigs indicated.

The next frame shows what happens in Genome Channel V2.0 if the contig Chr_1ctg158 is selected.  The clones are shown in red, the STS (sequence tagged sites) in orange, and the genes, as called by GRAIL, in green.  The sequence of each gene is available, in either nucleotide or peptide form.  Summary information is also available (not shown).


BDGP (Berkeley Drosophila Genome Project)

The BDGP has several map viewers available.  Click on ArmView, select chromosome 3R, and check the Show Contigs box to obtain the display shown here.  Green boxes are contigs of clones placed by in situ hybridization; blue boxes are contigs which have been, or are being, sequenced.  The black boxes numbered 82 to 100 correspond to the numbered regions on the Drosophila chromosome maps (visual cytology maps).

Check as many types of objects as desired, then select one of the black boxes.  The Get Checked items in button will become available and the area next to it will be filled in with the name of the cytological region.  Click it to download the information, which comes in a web page (not shown).  The download may take several minutes to complete, so be patient.  Select the blue box under cytological region 89 and the Show map for contig button will become available, and the box next to it will be filled in with the name of the selected contig (here Fas1).  Click on that button to obtain the next view:

In this view the STS are shown as white dots.  A report on one may be obtained by selecting it and clicking the Report button.  Similarly, a report on any other object (colored box) may be retrieved by selecting that object and again clicking the Report button.  The reports come up in another browser window (not shown).


Entrez Graphical Display

The Entrez search tool also has a built in viewer.  Retrieve the Genbank entry for the Drosophila white gene, accession number X02974, and select graphical view to obtain the following display:

Zoom in on any element by clicking on it, for instance, click on the rightmost purple arrow to obtain this next view:

Move the mouse around the figure, and the message window will either say Click to zoom in, or it will show the name of the object.  Clicking when the first message is shown will bring up the sequence around the point of the click.  Clicking when the latter message is shown will bring up a Genbank report describing the object.  For instance, clicking on any of the purple regions above will bring up a report, but clicking on the white areas will bring up just the DNA sequence at that location.
 


Other types of databases and their viewers

All the world is not a genome project, and there are many other databases around which have different display requirements than do the genome databases.  Many of these make do with simple text interfaces, since the data they contain consists of either tables, or long descriptions, but others have viewers. A few of these databases are covered here.
 


The BodyMap project

The BodyMap project is an attempt to describe the levels of expression of genes in a wide variety of tissues and developmental stages.  cDNA libraries are made from various tissues, and randomly selected clones are sequenced from the 3' end.  Those sequences are then clustered by similarity with each other and with the known genes.  After enough clones have been sequenced, a dataset results showing which genes are expressed, and to what levels, in the tissue from which the cDNA library was made.  Here is a fragment of one page from this database (the whole thing isn't shown because it is very large):

The GS num indicates different classes of tyrosine kinases.  For instance, 8362 corresponds to granulocyte colony-stimulating factor receptor / D-7.  Ten clones were found from this class, and all 10 were isolated from a single tissue specific library made from granulocytes (gr).
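The counting behind such a table can be sketched in a few lines of Python.  This is only an illustration of the idea, not the BodyMap project's own software: each sequenced clone is reduced to a (cluster number, tissue library) pair, and the pairs are tallied.  The function names are hypothetical, and apart from the ten granulocyte clones of class 8362 mentioned above, the clone list (including cluster number 1 and the liver library) is invented for the example.

```python
from collections import Counter, defaultdict


def expression_profile(clones):
    """Tally clones per cluster and per tissue library.

    clones: list of (cluster_id, tissue) pairs, one per sequenced clone.
    Returns a dict mapping cluster_id to a Counter of tissue counts.
    """
    per_cluster = defaultdict(Counter)
    for cluster_id, tissue in clones:
        per_cluster[cluster_id][tissue] += 1
    return per_cluster


def tissue_fraction(clones, cluster_id, tissue):
    """Fraction of one tissue library's clones that belong to one cluster."""
    library_size = sum(1 for _, t in clones if t == tissue)
    hits = sum(1 for c, t in clones if c == cluster_id and t == tissue)
    return hits / library_size if library_size else 0.0


if __name__ == "__main__":
    # Ten clones of class 8362 from the granulocyte (gr) library, as in the
    # text; every other entry is made-up filler for illustration only.
    clones = [(8362, "gr")] * 10 + [(1, "gr")] * 30 + [(1, "liver")] * 5
    profile = expression_profile(clones)
    print(dict(profile[8362]))                   # clone counts per tissue
    print(tissue_fraction(clones, 8362, "gr"))   # share of the gr library
```

Dividing by the library size matters because libraries differ in how many clones were sequenced; raw counts from a large library and a small one are not directly comparable.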


The Visible Human Project

There is not yet a viewer for the BodyMap project, but we can guess what it might look like by visiting the Visible Human Project.  This database consists of a set of voxels obtained by serially sectioning and digitizing an entire human body.  Currently the voxel information consists solely of color, where the color is that of the sectioned tissue itself.  Fat is white, muscle is red, and so forth.  By classifying the tissues in this project to match those in the BodyMap project, it would be possible to map one data set onto the other, and so produce a visual image of the distribution of gene expression.  There are already several different viewers available for the Visible Human data.  The following image using the NPAC viewer shows some of the tissues measured in the BodyMap project.


OMIM (Online Mendelian Inheritance in Man)

Each new gene discovered, in any organism, is immediately compared using BLAST, FASTA, or some other search tool, to the compendium of known sequences.  Frequently a similar protein is discovered in humans, in which case a visit to the OMIM database may prove useful, especially concerning the range of possible phenotypes for mutant forms of this gene.  For instance, imagine a gene causing early death is discovered in the brain of zebrafish, and found to be quite similar to human alpha-hexosaminidase.  Visit OMIM, enter hexosaminidase in the search menu, and retrieve a list of entries.  The first two entries describe Tay-Sachs and Sandhoff diseases, both of which result from hexosaminidase deficiency.  The articles describe both diseases in some detail, which would suggest in this case that the fish were probably dying due to the accumulation of gangliosides.
 


Web sites offering sequence analysis tools

There are a few web sites which offer a variety of web based sequence analysis tools, and many more sites which simply link to these "active" sites.  Even the "active" sites have links to other sites, so the two groups cannot be cleanly separated.  As with the rest of the web, links to defunct sites are very common.  If you encounter one of these dead links, you will often be able to find the current location of the tool, if it still exists, by using AltaVista or one of the other general purpose web search engines.  Here are some of the better sites:

"Active" web sites

NCBI:  Pubmed, Entrez, Blast, BankIt, Databases, ...
The place to go for all things Genbank related, including data entry, data searches by keyword, data searches by sequence similarity (using BLAST), and so forth.  Their ftp site has Genbank distributions and versions of many other databases.
BCM:  Search Launcher, Gene Finder, Secondary Structure Prediction, ...
Some of the links on this site run locally, many run elsewhere.  In any case, there are some very handy pages here.  The Search Launcher page provides an easy method for a user to send the same sequence to several different active sites for processing.  For example, if a multiple alignment is desired, the sequences may be directed to ClustalW, CAP, MAP, and so forth, with the target chosen from a checkbox menu.
CGG:  Nucleotide Seq. Analysis, Protein Seq. Analysis, 3-D, Infogen
Assorted tools provided by the Sanger Centre.
CBRG: (Computational Biochemistry Research Group)
Peptide related tools.  MassSearch is a particularly useful tool, as it offers a method for identifying a peptide by its molecular weight.
EBI:  (European Bioinformatics Institute)
Assorted tools.
ExPASy: (Swiss Institute of Bioinformatics)
Assorted protein and peptide tools.  Those that run locally on their server are marked with a red script uppercase letter L.
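
The idea behind identifying a peptide by its molecular weight, as mentioned for MassSearch above, can be sketched as follows.  This is not the algorithm MassSearch uses; it is a minimal illustration that assumes standard average residue masses, a hypothetical list of candidate sequences, and an arbitrary tolerance of 0.5 Da.  The function names are likewise made up for the example.

```python
# Average masses (in daltons) of amino acid residues, i.e. each amino acid
# minus the water lost on forming a peptide bond.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N and C termini


def peptide_mass(sequence):
    """Average molecular weight of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def match_by_mass(observed, candidates, tolerance=0.5):
    """Return the candidate sequences whose mass is within `tolerance`
    daltons of the observed mass (tolerance is an assumed value)."""
    return [s for s in candidates
            if abs(peptide_mass(s) - observed) <= tolerance]


if __name__ == "__main__":
    candidates = ["GG", "GA", "AA"]  # hypothetical candidate peptides
    print(round(peptide_mass("GG"), 2))
    print(match_by_mass(132.12, candidates))
```

Note that a mass by itself cannot distinguish peptides with the same composition in a different order (GA and AG weigh the same), nor leucine from isoleucine, so a single weight usually narrows the candidates rather than naming one.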

"Link" web sites

ABIM (Atelier BioInformatique)
Links to active sites, arranged by subject.
Amos' WWW links page
One of the most comprehensive list sites, maintained by Amos Bairoch, of Swiss-Prot fame.
Pedro's BioMolecular Research Tools
Links to active sites, arranged alphabetically by service name. One of the old favorites.  Still useful, but it has not been updated since June 1996.
Bio-wURLd
A superset of Pedro's list.   It may be used by either searching everything, or by selecting one of several lists and displaying it.