[Chicago-talk] A Book Review: Sequence Analysis in a Nutshell
Patrick Fleury
pfleury at medicine.bsd.uchicago.edu
Tue Nov 18 11:38:22 CST 2003
A couple of months ago, I went to a meeting and was given a book called
_Sequence Analysis in a Nutshell_ and asked to write a review of it. Here
it is:
--PatF
========================================================================================
Title: Sequence Analysis in a Nutshell (SAIAN)
Authors: Markel, S. and Leon, D.
Publisher: O'Reilly and Associates
Year: 2003
The basic idea behind sequence analysis is the classification of DNA or
protein sequences in terms of other known DNA or protein sequences. To
take a simple case, suppose there is a laboratory team that decodes a
section of human - or mouse or rat - DNA and finds it corresponds to a
sequence of letters, perhaps something like AGTTCGATTGATTGCA. (This is a
fairly small sequence.) The team might want to find out what is already
known about this particular sequence. To do this, they would compare their
sequence to a known database of sequences.
This database searching is not a trivial matter because, not only would
they want to find out if there are any exact matches for their sequence,
they might also want to find out if there are any approximate
matches. Here, approximate takes on a new meaning because it not only
means sequences that share a large number of exact matches, but also
sequences where parts of their sequence appear separated by other
letters. For example, if you consider the above sequence, it appears in
the sequence
ATAGTAATTCGAGCTTTGAATTTTGCA
except that there are a few other letters interspersed within
it. Or they might be happy to find a sequence like the above except that some
of the letters have been changed to other letters. For example, the
sequence ATTTCGGTAGATGCA is the above sequence with a few random
letter changes.
Such alignments, although highly unintuitive to the uninitiated, might be
useful to the biological researcher.
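
As a rough illustration of the two kinds of approximate match described
above, here is a minimal Python sketch. It is not taken from the book, and
the function names are made up for this example. The first function checks
whether one sequence appears inside another with extra letters interspersed;
the second counts single-letter changes (Levenshtein edit distance, which is
only one of several ways such changes could be scored).

```python
def is_subsequence(query, target):
    """Return True if every letter of query appears in target, in order,
    possibly with other letters in between."""
    letters = iter(target)
    # 'in' on an iterator consumes it up to the match, so order is enforced.
    return all(base in letters for base in query)


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-letter
    substitutions, insertions and deletions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


query = "AGTTCGATTGATTGCA"
print(is_subsequence(query, "ATAGTAATTCGAGCTTTGAATTTTGCA"))
print(edit_distance(query, "ATTTCGGTAGATGCA"))
```

The first call prints True for the interspersed example given above. Real
tools score matches far more carefully than this, but the sketch shows why a
plain exact-match search is not enough.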
The team might also want to search not only databases of human DNA but also
mouse DNA, rat DNA or perhaps even that of the worm C. elegans.
I could go on with this, but I am merely trying to convince you that
searching for one sequence among other sequences is not just a matter of
bringing up a regular expression engine and letting it do its
job. Instead, it's a very sophisticated process with lots of variations
and parameters. Indeed, a lot of work has gone into tuning the particular
types of algorithm used in such searches. These algorithms have been
codified into families with names such as BLAST (Basic Local Alignment
Search Tool), BLAT (BLAST-Like Alignment Tool) and ClustalW, and they are
available in various places on the web.
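
To give a feel for what "local alignment" means, here is a small Python
sketch of the classic Smith-Waterman dynamic program, which finds the
best-scoring pair of matching regions in two sequences. This is an
illustration only: BLAST itself uses faster heuristics rather than running
this full computation against a database, and the function name and the
scoring values (match, mismatch, gap) are arbitrary assumptions chosen for
the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b,
    using a simple linear gap penalty (assumed scoring values)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            # A score never drops below zero: a local alignment may
            # start afresh at any position.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(local_alignment_score("AGTTCGATTGATTGCA", "ATTTCGGTAGATGCA"))
```

The variations and parameters mentioned above correspond to choices like
these scoring values, which is one reason a regular expression engine is
not a substitute.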
This brings us to the volume under discussion. While it is possible to
find out about these tools by searching the net, it would be useful to have
one source that contained information about all of them in one easy-to-use
format. This volume is that source.
This is another of O'Reilly's Nutshell series. Like the others in the
series such as "Perl in a Nutshell", "C++ in a Nutshell" etc., the volume
does not have as its main point the explication of the theory of sequence
analysis. You will need to look elsewhere for that. Instead, it collects
in one place a lot of information about the tools that are useful.
The first five chapters are devoted to clear descriptions of the common
data formats you will run into in sequence analysis. These include FASTA,
SWISS-PROT, GenBank and some of their relatives.
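
FASTA is the simplest of these formats: each record is a header line
beginning with ">" followed by one or more lines of sequence. As a hedged
illustration (not code from the book; the function name and the sample
records are invented for this example), a minimal Python reader might look
like this:

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data for illustration only.
example = """>seq1 hypothetical example record
AGTTCGAT
TGATTGCA
>seq2 another made-up record
ATTTCGGTAGATGCA
"""
records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

The other formats, such as SWISS-PROT and GenBank, carry much richer
annotation and are not handled by this sketch.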
The next few chapters are devoted to the tools that make these analyses
work. Surprisingly, BLAST, one of the most popular of the search
algorithms, gets pretty short shrift. It has only about seven pages devoted
to it. This might be due to the fact that O'Reilly recently published a
book devoted entirely to BLAST. (There will be more about that later.)
The short space given to BLAST might also be because the authors wanted to
save a lot of space for EMBOSS (European Molecular Biology Open Software
Suite). EMBOSS is a suite of over 100 programs for sequence analysis that
have been released as open source and whose code is available on the
web. Anyone who wants to see real working C code that performs sequence
analysis would do well to download these programs and study
them. Markel and Leon devote almost 170 pages to this suite and all of its
possible options and flags. By the way, the section on EMBOSS is really
the only place where a particular programming language appears in the
book, and even there it doesn't really appear, because you need to download
the code to see it. There is no Perl in "SAIAN".
Besides data formats and descriptions of tools, the book also has some
other useful parts. For example, it has appendices devoted to amino acid
and nucleotide tables, and genetic codes. It also lists a lot of websites
where interested parties can go to find more information.
This book looks useful for anyone who would like to have a good single
reference for sequence analysis tools.
All of the above notwithstanding, the book is a manual and sometimes
reading it is just like reading a Unix man page. It may be informative,
but, if you really want to know what is going on, you may need to look
elsewhere for some further explanation. In particular, the treatment of
BLAST in "SAIAN" does not really tell you what is going on. I would be much
harder on "SAIAN" were it not for the fact that O'Reilly recently published
another book titled simply "BLAST".
"BLAST", which was written by Ian Korf, Mark Yandell and Joseph Bedell, is
subtitled "An Essential Guide to the Basic Local Alignment Search Tool" and
it is indeed that. It contains not only a detailed introduction to BLAST,
but also a short introduction to the theory behind BLAST. If you want to
find out a little bit about basic genetics and how BLAST fits into
sequence alignment, you could do a lot worse than read this book. It goes
through the algorithms in some detail and actually shows you some
elementary Perl code to carry out some of the algorithms. Furthermore, it
contains an introduction to some of the statistical methods behind the
code. (If you want to go deeply into the theory behind the algorithms, I
recommend the book by Durbin, Eddy, Krogh and Mitchison referenced at the
end of this review.)
In summary, "Sequence Analysis in a Nutshell" is a useful tool.
It collects in one place common data formats.
It also collects references to common algorithms such as BLAST and BLAT.
It has a large section on EMBOSS.
It has appendices on genetic codes and nucleotides.
It has a lot of references to URLs for finding more information and for
downloading code.
It does not have enough about BLAST, but the book called "BLAST", also
from O'Reilly, provides a very good reference for that tool along with
other more theoretical information.
Finally, I want to point out that the animal on the cover of SAIAN works as
symbolism on several levels. It is a liger, a cross between a male lion
and a female tiger. (A cross between a male tiger and a female lion is
called a tigon. Ah, the wonderful things you learn from reading the
colophon of an O'Reilly book.) Not only is it fitting that such a mixture
of genes be on the cover of this book, but it is also nice to note that the
authors work for LION bioscience.
Patrick Fleury
Books referenced in the above
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G., 1998, Biological Sequence
Analysis, New York: Cambridge University Press
Korf, I., Yandell, M. and Bedell, J., 2003, BLAST, Sebastopol: O'Reilly
Markel, S. and Leon, D., 2003, Sequence Analysis in a Nutshell, Sebastopol:
O'Reilly