[Chicago-talk] A Book Review: Sequence Analysis in a Nutshell

Tue Nov 18 11:38:22 CST 2003

A couple of months ago, I went to a meeting and was given a book called 
_Sequence Analysis in a Nutshell_ and asked to write a review of it.  Here 
it is:

--PatF
========================================================================================
Title: Sequence Analysis in a Nutshell  (SAIAN)
Authors: Markel, S. and Leon, D.
Publisher: O'Reilly and Associates
Year: 2003

The basic idea behind sequence analysis is the classification of DNA or 
protein sequences in terms of other known DNA or protein sequences.  To 
take a simple case, suppose there is a laboratory team that decodes a 
section of human - or mouse or rat - DNA and finds it corresponds to a 
sequence of letters, perhaps something like AGTTCGATTGATTGCA.  (This is a 
fairly small sequence.)  The team might want to find out what is already 
known about this particular sequence.  To do this, they would compare their 
sequence to a known database of sequences.

This database searching is not a trivial matter because, not only would 
they want to find out if there are any exact matches for their sequence, 
they might also want to find out if there are any approximate 
matches.  Here, approximate takes on a new meaning because it not only 
means sequences that share a large number of exact matches, but also 
sequences where parts of their sequence appear separated by other 
letters.  For example, if you consider the above sequence, it appears in 
the sequence

ATAGTAATTCGAGCTTTGAATTTTGCA

except that there are a few other letters interspersed within 
it.  Or they might be happy to find a sequence like the original except that 
some of the letters have been changed to other letters.  For example, the 
sequence ATTTCGGTAGATGCA is the original sequence with a couple of random 
letter changes.

Such alignments, although highly unintuitive to the uninitiated, might be 
useful to the biological researcher.
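
The two kinds of approximate match described above can be sketched in a few 
lines of Python.  This is only an illustration, not how BLAST or any tool in 
the book works.  The function names are hypothetical.  The first check asks 
whether the letters of the query appear, in order, inside a longer sequence.  
The second uses the standard Levenshtein edit distance as a stand-in for 
counting letter changes (the changed sequence quoted above is one letter 
shorter than the original, so a strict position-by-position comparison 
would not apply).

```python
def is_subsequence(query, target):
    """Return True if every letter of query appears in target, in order,
    possibly with other letters in between."""
    remaining = iter(target)
    # 'letter in remaining' advances the iterator until it finds a match,
    # so each query letter must be found after the previous one.
    return all(letter in remaining for letter in query)


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-letter
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        current = [i]
        for j, letter_b in enumerate(b, start=1):
            cost = 0 if letter_a == letter_b else 1
            current.append(min(previous[j] + 1,          # delete letter_a
                               current[j - 1] + 1,       # insert letter_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


query = "AGTTCGATTGATTGCA"
print(is_subsequence(query, "ATAGTAATTCGAGCTTTGAATTTTGCA"))
print(edit_distance(query, "ATTTCGGTAGATGCA"))
```

For the two sequences quoted above, the first check prints True.  The edit 
distance is a much cruder measure than the scoring schemes real tools use.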

The team might also want to search not only databases of human DNA but also 
mouse DNA, rat DNA or perhaps even that of the worm C. elegans.

I could go on with this, but I am merely trying to convince you that 
searching for one sequence among other sequences is not just a matter of 
bringing up a regular expression engine and letting it do its 
job.  Instead, it's a very sophisticated process with lots of variations 
and parameters.  Indeed, a lot of work has gone into tuning the algorithms 
used in such searches.  These algorithms have been codified into families 
of programs with names such as BLAST (Basic Local Alignment Search Tool), 
BLAT (BLAST-Like Alignment Tool) and ClustalW, and they are available in 
various places on the web.
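
To give a flavor of what lies beyond a regular expression engine, here is a 
minimal sketch of the Smith-Waterman dynamic programming method for scoring 
a local alignment.  This is not BLAST itself (BLAST uses faster heuristics 
to approximate this kind of search), and the function name and the scoring 
values (match=2, mismatch=-1, gap=-2) are assumptions chosen only for 
illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b.

    Only the score is returned, not the alignment itself, so two rows
    of the dynamic programming table are enough.
    """
    best = 0
    previous = [0] * (len(b) + 1)
    for letter_a in a:
        current = [0]
        for j, letter_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if letter_a == letter_b else mismatch)
            score = max(0,                      # start a fresh local alignment
                        diagonal,               # align letter_a with letter_b
                        previous[j] + gap,      # gap in b
                        current[j - 1] + gap)   # gap in a
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(local_alignment_score("AGTTCGATTGATTGCA", "ATAGTAATTCGAGCTTTGAATTTTGCA"))
```

Changing the three scoring values changes which alignments win, which is 
one small example of the "variations and parameters" mentioned above.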

This brings us to the volume under discussion.  While it is possible to 
find out about these tools by searching the net, it would be useful to have 
one source that contained information about all of them in one easy-to-use 
format.  This volume is that source.

This is another volume in O'Reilly's Nutshell series.  Like the others in 
the series, such as "Perl in a Nutshell" and "C++ in a Nutshell", the 
volume does not have as its main point the explication of the theory of 
sequence analysis.  You will need to look elsewhere for that.  Instead, it 
collects in one place a lot of information about the tools that are useful.

The first five chapters are devoted to clear descriptions of the common 
data formats you will run into in sequence analysis.  These include FASTA, 
SWISS-PROT, GenBank and some of their relatives.
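
As an illustration of the simplest of these formats, a FASTA file is a 
series of records, each consisting of a header line that begins with ">" 
followed by one or more lines of sequence.  The sketch below reads such 
text into (header, sequence) pairs.  The function name and the two example 
records are made up for this illustration and do not come from the book or 
from any real database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_1 hypothetical record for illustration
AGTTCGATTG
ATTGCA
>example_2 another made-up record
ATTTCGGTAGATGCA
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```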

The next few chapters are devoted to the tools that make these analyses 
work.  Surprisingly, BLAST, one of the most popular of the search 
algorithms, gets pretty short shrift.  Only about seven pages are devoted 
to it.  This might be due to the fact that O'Reilly recently published a 
book devoted entirely to BLAST.  (There will be more about that later.)

The short space given to BLAST might also be because the authors wanted to 
save a lot of space for EMBOSS (European Molecular Biology Open Software 
Suite).  EMBOSS is a suite of over 100 programs for sequence analysis that 
have been released as open source and whose code is available on the 
web.  Anyone who wants to see real working C code to perform sequence 
analysis and matching would do well to download these programs and study 
them.  Markel and Leon devote almost 170 pages to this suite and all of its 
possible options and flags.  By the way, the section on EMBOSS is really 
the only place where a particular programming language appears in the 
book, and even there it doesn't really appear, because you need to download 
the code to see it.  There is no Perl in "SAIAN".

Besides data formats and descriptions of tools, the book also has some 
other useful parts.  For example, it has appendices devoted to amino acid 
and nucleotide tables, and genetic codes.  It also lists a lot of websites 
where interested parties can go to find more information.

This book looks useful for anyone who would like to have a good single 
reference for sequence analysis tools.

All of the above notwithstanding, the book is a manual and sometimes 
reading it is just like reading a Unix man page.  It may be informative, 
but, if you really want to know what is going on, you may need to look 
elsewhere for some further explanation.  In particular, the treatment of 
BLAST in "SAIAN" does not really tell you what is going on.  I would be much 
harder on "SAIAN" were it not for the fact that O'Reilly recently published 
another book titled simply "BLAST".

"BLAST", which was written by Ian Korf, Mark Yandell and Joseph Bedell, is 
subtitled "An Essential Guide to the Basic Local Alignment Search Tool" and 
it is indeed that.  It contains not only a detailed introduction to BLAST, 
but also a short introduction to the theory behind it.  If you want to 
find out a little bit about basic genetics and how BLAST fits into 
sequence alignment, you could do a lot worse than read this book.  It goes 
through the algorithms in some detail and actually shows you some 
elementary Perl code to carry out some of them.  Furthermore, it 
contains an introduction to some of the statistical methods behind the 
code.  (If you want to go deeply into the theory behind the algorithms, I 
recommend the book by Durbin, Eddy, Krogh and Mitchison referenced at the 
end of this review.)

In summary, "Sequence Analysis in a Nutshell" is a useful tool.
	It collects in one place common data formats.
	It also collects references to common algorithms such as BLAST and BLAT.
	It has a large section on EMBOSS.
	It has appendices on genetic codes and nucleotides.
	It has a lot of references to URLs for finding more information and for 
downloading code.
	It does not have enough about BLAST, but the book called "BLAST", also 
from O'Reilly, provides a very good reference for that tool along with 
other more theoretical information.

Finally, I want to point out that the animal on the cover of SAIAN works as 
symbolism on several levels.  It is a liger, a cross between a male lion 
and a female tiger.  (A cross between a male tiger and a female lion is 
called a tigon.  Ah, the wonderful things you learn from reading the 
colophon of an O'Reilly book.)  Not only is it fitting that such a mixture 
of genes be on the cover of this book, but it is nice to note that the 
authors work for LION bioscience.

Patrick Fleury

Books referenced in the above

Durbin, R., Eddy, S., Krogh, A. and Mitchison, G., 1998, Biological Sequence 
Analysis, New York: Cambridge University Press
Korf, I., Yandell, M. and Bedell, J., 2003, BLAST, Sebastopol: O'Reilly
Markel, S. and Leon, D., 2003, Sequence Analysis in a Nutshell, Sebastopol: 
O'Reilly