Fundamentals of Sequence Analysis, 1998-1999
Lecture 10. Web site survey
In this lecture we'll take a look at some of the more interesting web sites
and web based tools.
Some of the web pages described in this lecture may not work with all
browsers as they require very recent versions of Java. This lecture
was prepared using Netscape 4.5 on Windows NT, which worked with all of
the pages shown.
Genome, Contig, Gene, and Sequence Viewers
The vast amount of sequence and sequence related information which the many genome
projects have been accumulating has led several sites to develop and distribute
viewers. This is useful because the data in text form is often
too hard to "digest" quickly. Ultimately the goal is to provide the
end user community with tools which can easily move from a chromosome scale
view all the way down to a base pair view, and which can easily and quickly
locate and display any feature, on any of these scales, that the user might
want to see. Designing a viewer which works well in displaying genetic
features at sizes which vary by 8 orders of magnitude is no easy task!
Here are some of the currently available viewers.
The forerunner of all of the web based viewers is AceDB (A C. elegans
Data Base), which was developed for the C. elegans sequencing project.
Originally the name referred to both the software and the database it coordinated.
Later, as other genome projects adopted the software, the name usually
came to refer just to the software, with a different name for the
database. In any case, AceDB has both GUI (using the X11 window
system) and text versions. The GUI version is pretty straightforward
to use, but unfortunately, only the text version of the browser seems to
be available on the web at this time. Still, it will output maps
in graphical format. Here is one example of how it is used.
Starting from http://probe.nal.usda.gov:8000/other/aboutacedb.html
click on browse,
then click on ACEDB
Query Language (at the bottom of the screen). Enter find Locus
unc-9, then click on submit.
Select unc-9 by clicking on it, then click on the button
labeled Select one or more objects and press HERE. The page
which comes up in the browser (not shown) has links to all sorts of data
about the unc-9 gene. For instance, on the resulting page
click on display
graphic to see:
The Genome Channel v2.0 and v1.0
The Genome Channel, from the
Oak Ridge National Laboratory Computational Biosciences group, is an extremely ambitious project to provide a viewer which will work
for all genomes. V2.0 is the current prototype, but it requires
a more recent Java version in the web browser than is common at this time,
so many people will have to use V1.0, or do without. Here is the opening
page for this site (V2.0):
The main frame shows the various chromosomes for the organism.
The upper left frame controls which organism is shown. The lower left frame
color codes the data shown in the main frame. Here yellow regions
are from TIGR. Click on the first chromosome to bring up the chromosome
map. Select the magnifying glass to expand the view, which is usually
required to read the contigs. In the next image, the tip of human
chromosome 1 is shown, after magnification, with three contigs indicated.
The next frame shows what happens in Genome Channel V2.0 if the contig
Chr_1ctg158 is selected. The clones are shown in red,
the STS (sequence tagged sites) in orange, and the genes, as called by
GRAIL, in green. The sequence of each gene is available, in
either nucleotide or peptide form. Summary information is also available
(not shown.)
BDGP (Berkeley Drosophila Genome Project)
The BDGP has several map viewers
available. Click on ArmView, select chromosome 3R, and check
the Show Contigs box to obtain the display shown here. Green
boxes are contigs of clones placed by in situ hybridization; blue boxes
are contigs which have been, or are being, sequenced. The black boxes
numbered 82 to 100 correspond to the numbered regions on the Drosophila
chromosome maps (visual cytology maps).
Check as many types of objects as desired, then select one of the black
boxes. The Get Checked items in button will become available
and the area next to it will be filled in with the name of the cytological
region. Click it to download the information, which comes in a web
page (not shown.) The download may take several minutes to complete,
so be patient. Select the blue box under cytological region 89 and
the Show map for contig button will become available, and the box
next to it will be filled in with the name of the selected contig (here
Fas1). Click on that button to obtain the next view:
In this view the STS are shown as white dots. A report on one
may be obtained by selecting it and clicking the Report button.
Similarly, a report on any other object (colored box) may be retrieved
by selecting that object, and again, clicking the Report button.
The reports come up in another browser window (not shown.)
Entrez Graphical Display
The Entrez search tool also has a built in viewer. Retrieve
the Genbank entry for the Drosophila white gene, accession number X02974,
and select graphical
view to obtain the following display:
Zoom in on any element by clicking on it, for instance, click on the
rightmost purple arrow to obtain this next view:
Move the mouse around the figure, and the message window will either
say Click to zoom in, or it will show the name of the object.
Clicking when the first message is shown will bring up the sequence around
the point of the click. Clicking when the latter message is shown
will bring up a Genbank report describing the object. For instance,
clicking on any of the purple regions above will bring up a report, but
clicking on the white areas will bring up just the DNA sequence at that
location.
Other types of databases and their viewers
All the world is not a genome project, and there are many other databases
around which have different display requirements than do the genome databases.
Many of these make do with simple text interfaces, since the data they
contain consists of either tables, or long descriptions, but others have
viewers. A few of these databases are covered here.
The BodyMap project is an attempt to describe the levels of expression
of genes in a wide variety of tissues and developmental stages. cDNA
libraries are made from various tissues, and random clones selected and
3' end sequenced. Those sequences are then clustered by similarity with each other
and with the known genes. After enough clones have been sequenced,
a dataset results showing which genes are expressed,
and to what levels, in the tissue from which the cDNA library was made.
Here is a fragment of one page
from this database (the whole thing isn't shown because it is very large):
The GS num indicates different classes of tyrosine kinases. For
instance, 8362 corresponds to granulocyte colony-stimulating factor
receptor / D-7. Ten clones were found from this class, and
all 10 were isolated from a single tissue-specific library made from granulocytes
(gr).
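The clone counting behind such a table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BodyMap project's actual software: the function names are invented for the example, and the second class (9999, with its liver and muscle clones) is hypothetical. Only the 8362 counts come from the fragment described above.

```python
from collections import Counter, defaultdict


def count_clones(assignments):
    """Tally clones per gene class (GS number) and per tissue library.

    `assignments` is an iterable of (gs_number, tissue) pairs, one per
    sequenced clone, as a clustering step might produce.
    """
    counts = defaultdict(Counter)
    for gs_number, tissue in assignments:
        counts[gs_number][tissue] += 1
    return counts


def tissue_distribution(counts, gs_number):
    """Return the fraction of a class's clones found in each tissue."""
    per_tissue = counts[gs_number]
    total = sum(per_tissue.values())
    return {tissue: n / total for tissue, n in per_tissue.items()}


# GS 8362: 10 clones, all from the granulocyte (gr) library, as in the text.
# GS 9999 and its counts are hypothetical, added only for contrast.
clones = [(8362, "gr")] * 10 + [(9999, "liver")] * 3 + [(9999, "muscle")]

counts = count_clones(clones)
print(tissue_distribution(counts, 8362))
print(tissue_distribution(counts, 9999))
```

A class whose clones all come from one library, like 8362 here, shows up as a distribution with a single tissue, which is what makes it a candidate tissue-specific gene.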
There is not yet a viewer for the BodyMap project, but we can guess what
it might look like by visiting the Visible Human Project. This database
consists of a set of voxels obtained by serially sectioning and digitizing
an entire human body. Currently the voxel information consists solely
of color, where the color is that of the sectioned tissue itself.
Fat is white, muscle is red, and so forth. By classifying the tissues
in this project to match those in the BodyMap project, it would be possible
to map one data set onto the other, and so produce a visual image of the
distribution of gene expression. There are already several different
viewers available for the Visible Human data. The following image,
from the NPAC
viewer, shows some of the tissues measured in the BodyMap project.
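The mapping suggested above (voxel color to tissue, then tissue to expression level) might look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the reference RGB values (chosen only so that fat is white and muscle is red, as in the text), the expression numbers, and the function names. Neither project is known to provide such a tool.

```python
# Reference colors (R, G, B). Fat as white and muscle as red follow the text;
# the exact values are assumptions.
TISSUE_COLORS = {
    "fat": (255, 255, 255),
    "muscle": (200, 30, 30),
}

# Hypothetical relative expression of one gene in each tissue (0.0 to 1.0).
EXPRESSION = {"fat": 0.1, "muscle": 0.8}


def classify_voxel(rgb):
    """Assign a voxel to the tissue whose reference color is nearest."""
    def squared_distance(tissue):
        return sum((a - b) ** 2 for a, b in zip(rgb, TISSUE_COLORS[tissue]))
    return min(TISSUE_COLORS, key=squared_distance)


def expression_slice(voxels, expression):
    """Replace each voxel color in a 2-D slice by its tissue's expression level."""
    return [[expression[classify_voxel(rgb)] for rgb in row] for row in voxels]


# A tiny made-up 2 x 2 slice of voxel colors.
slice_ = [[(250, 250, 250), (210, 40, 35)],
          [(190, 20, 25), (240, 245, 250)]]
print(expression_slice(slice_, EXPRESSION))
```

A real classifier would need many more tissue classes and something more robust than nearest color, but the two-step lookup is the core of the idea.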
Each new gene discovered, in any organism, is immediately compared using
BLAST, FASTA, or some other search tool, to the compendium of known sequences.
Frequently a similar protein is discovered in humans, in which case a visit
to the OMIM database may prove useful, especially concerning the range
of possible phenotypes for mutant forms of this gene. For instance,
imagine a gene causing early death is discovered in the brain of zebrafish,
and found to be quite similar to human alpha-hexosaminidase. Visit
OMIM, enter hexosaminidase in the search menu, and retrieve a list of entries.
The first two entries describe Tay-Sachs and Sandhoff diseases, both of
which result from hexosaminidase deficiency. The articles describe
both diseases in some detail, which would suggest in this case that the
fish were probably dying due to the accumulation of gangliosides.
Web sites offering sequence analysis tools
There are a few web sites which offer a variety of web based sequence analysis
tools, and many, many more sites which have links pointing to these
"active" sites. The "active" sites themselves also have links to other
sites, so the two groups cannot be cleanly separated. As with the
rest of the web, links to defunct sites are very common. If you
encounter one of these dead links, you will often be able to find the
current location of the tool, if
it still exists, by using AltaVista or
one of the other general purpose web search engines.
Here are some of the better sites:
"Active" web sites
NCBI: Pubmed,
Entrez,
Blast,
BankIt,
Databases,
...
The place to go for all things Genbank related, including data
entry, data searches by keyword, data searches by sequence similarity (using
BLAST), and so forth. Their ftp site has Genbank distributions and
versions of many other databases.
BCM: Search
Launcher, Gene
Finder, Secondary
Structure Prediction, ...
Some of the links on this site run locally; many run elsewhere.
In any case, there are some very handy pages here. The Search
Launcher page provides an easy method for a user to send the same sequence
to several different active sites for processing. For example, if a multiple
alignment is desired, the sequences may be directed to ClustalW, CAP, MAP, and so
forth, with the targets chosen from a checkbox menu.
CGG: Nucleotide Seq. Analysis,
Protein Seq. Analysis, 3-D, Infogen
Assorted tools provided by the Sanger Centre.
CBRG: (Computational
Biochemistry Research Group)
Peptide related tools. MassSearch is a particularly useful
tool, as it offers a method for identifying a peptide by its molecular weight.
EBI: (European
Bioinformatics Institute)
Assorted tools.
ExPASy: (Swiss Institute
of Bioinformatics)
Assorted protein and peptide tools. Those that run locally
on their server are marked with a red script uppercase letter L.
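The idea of identifying a peptide by its molecular weight, mentioned above for MassSearch, can be illustrated with a short Python sketch. This is not MassSearch's actual algorithm or interface. It simply sums standard monoisotopic residue masses, adds one water, and compares the result against an observed mass; the function names, the candidate list, and the tolerance are assumptions made for the example.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def match_by_mass(observed, candidates, tolerance=0.01):
    """Return the candidate peptides whose mass is within `tolerance` Da."""
    return [p for p in candidates
            if abs(peptide_mass(p) - observed) <= tolerance]


# Hypothetical observed mass and candidate peptides.
print(match_by_mass(146.07, ["GG", "GA", "AA"]))
```

Mass alone cannot distinguish every peptide: leucine and isoleucine have identical masses, and any reordering of the same residues gives the same total, so a match by weight narrows the candidates without proving identity.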
"Link" web sites
ABIM (Atelier
BioInformatique)
Links to active sites, arranged by subject.
Amos' WWW links
page
One of the most comprehensive list sites, maintained by Amos
Bairoch, of Swiss-Prot fame.
Pedro's
BioMolecular Research Tools
Links to active sites, arranged alphabetically by service name. One of
the old favorites. Still useful, but it has not been updated since
June 1996.
Bio-wURLd
A superset of Pedro's list. It may be used by either searching
everything, or by selecting one of several lists and displaying it.