Feature
Genomics and Proteomics: A Signal Processor’s Tour
P. P. Vaidyanathan
Abstract
The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end.
Keywords—Genomic-signal-processing, bioinformatics, genes, protein-coding, DNA, and ncRNA.
6 IEEE CIRCUITS AND SYSTEMS MAGAZINE 1531-636X/04/$20.00©2004 IEEE FOURTH QUARTER 2004
1. Introduction
Since the sensational announcement of the double helix structure of the DNA molecule by Watson and Crick more than fifty years ago [1], there has been phenomenal progress in genomics. With the enormous amount of genomic and proteomic data available to us in the public domain, it is becoming increasingly important to be able to process this information in ways that are useful to humankind. Traditional as well as modern signal processing methods have played an important role in these fields. Genomic signal processing is primarily the processing of DNA sequences, RNA sequences, and proteins. A DNA sequence is made from an alphabet of four elements, namely A, T, C, and G. For example,

. . . ATCCCAAGTATAAGAAGTA . . .
The letters A, T, C, G represent molecules called nucleotides or bases (to be described soon). Since DNA contains the genetic information of living organisms, we see that life is governed by quaternary codes. Another example of discrete-alphabet sequences in life forms is the protein. A large number of functions in living organisms are governed by proteins. A protein can be regarded as a sequence of amino acids. There are twenty distinct amino acids, and so a protein is a sequence defined on an alphabet of size twenty. The twenty letters used to denote the amino acids are the letters of the English alphabet except B, J, O, U, X, and Z. For example, a part of a protein sequence could be

. . . PPVACATDEEDAFGGAYPQ . . .
Notice that some letters representing amino acids are identical to some letters representing bases. For example, the A in the DNA is a base called adenine, and the A in the protein is an amino acid called alanine.
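To make the two alphabets concrete, here is a minimal Python sketch (not from the original article) that builds the twenty-letter amino acid alphabet by removing B, J, O, U, X, and Z from the English alphabet and checks the two example fragments against each alphabet. The names `AMINO_ACID_LETTERS`, `DNA_LETTERS`, and `is_over_alphabet` are illustrative assumptions.

```python
import string

# The six capital letters that, per the text, are not used for amino acids.
UNUSED = set("BJOUXZ")

# Hypothetical names chosen for this illustration.
AMINO_ACID_LETTERS = [c for c in string.ascii_uppercase if c not in UNUSED]
DNA_LETTERS = ["A", "C", "G", "T"]


def is_over_alphabet(sequence, alphabet):
    """Return True if every symbol of `sequence` belongs to `alphabet`."""
    allowed = set(alphabet)
    return all(symbol in allowed for symbol in sequence)


# The two example fragments from the text, with extraction spaces removed.
dna_fragment = "ATCCCAAGTATAAGAAGTA"
protein_fragment = "PPVACATDEEDAFGGAYPQ"

# Letters that appear in both alphabets (for example A: adenine vs. alanine).
shared = sorted(set(DNA_LETTERS) & set(AMINO_ACID_LETTERS))

print(len(AMINO_ACID_LETTERS), "amino acid letters")
print("letters shared with DNA:", shared)
print("DNA fragment valid as DNA:", is_over_alphabet(dna_fragment, DNA_LETTERS))
print("protein fragment valid as DNA:", is_over_alphabet(protein_fragment, DNA_LETTERS))
```

Because A, C, G, and T are all amino acid letters as well, the DNA fragment also passes the protein-alphabet check. The string alone does not say which kind of sequence it is, so the sequence type has to be recorded separately.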
Figure 1. (a) The DNA double helix, (b) linearized schematic, and (c) details of the sugar-phosphate backbone. In part (b), the bottom strand is complementary to the top strand in the sense that A and T are paired and so are C and G. This is because of a weak bonding called hydrogen bonding between these pairs of molecules.
[Figure labels: sugar-phosphate backbone; 3.4 nm or 34 Å; 5′ and 3′ strand ends; base sequence; nucleotide = base + sugar + phosphate.]

If we assign numerical values to the four letters in the DNA sequence, we can perform a number of signal processing operations such as Fourier transformation [26, 3], digital filtering [27], time-frequency analyses such as wavelet transforms [17], and Markov modelling [4]. Some of those are quite interesting and in fact have important practical applications. Similarly, once we assign numerical values to the twenty amino acids in protein sequences, we can do useful signal processing.
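As an illustration of assigning numerical values to the four letters, the sketch below assumes one common choice, binary indicator sequences (one 0/1 sequence per base), and computes a naive discrete Fourier transform of each. The mapping and the function names `indicator_sequences`, `dft`, and `total_power_spectrum` are assumptions made for this example and are not prescribed by the paragraph above; other numerical assignments are possible.

```python
import cmath


def indicator_sequences(dna):
    """Map a DNA string to four binary sequences, one per base.

    Assumed mapping: u_X[n] = 1 if dna[n] == X, else 0.
    """
    return {base: [1 if symbol == base else 0 for symbol in dna] for base in "ACGT"}


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a list of numbers."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def total_power_spectrum(dna):
    """Sum of |DFT|^2 over the four indicator sequences, per frequency bin."""
    spectra = [dft(u) for u in indicator_sequences(dna).values()]
    return [sum(abs(spectrum[k]) ** 2 for spectrum in spectra) for k in range(len(dna))]


# The example DNA fragment from the text, with extraction spaces removed.
fragment = "ATCCCAAGTATAAGAAGTA"
u = indicator_sequences(fragment)
power = total_power_spectrum(fragment)

for k, value in enumerate(power):
    print(f"k={k:2d}  power={value:8.3f}")
```

For this fragment, `power[0]` is the sum of the squared base counts, since the zero-frequency DFT coefficient of an indicator sequence is simply the number of times that base occurs. A practical implementation would replace the O(N^2) loop with an FFT.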
P. P. Vaidyanathan 1 is with the Department of Electrical Engineering, 136-93, California Institute of Technology, Pasadena, CA 91125. Email:
1 Work supported in part by the ONR grant N00014-99-1-1002.