Current Bioinformatics
ISSN: 1574-8936

Current Bioinformatics
Volume 2, Number 2, May 2007
Contents

Advances in Exploration of Machine Learning Methods
for Predicting Functional Class and Interaction Profiles of
Proteins and Peptides Irrespective of Sequence Homology
Pp. 95-112
Juan Cui, Lianyi Han, Honghuang Lin, Zhiqun Tang, Zhiliang
Ji, Zhiwei Cao, Yixue Li and Yuzong Chen
[Abstract]
A Decade of Computing to Traverse the Labyrinth of
Protein Domains Pp. 113-131
Rajani R. Joshi
[Abstract]
Gene Set Enrichment Analysis (GSEA) for Interpreting
Gene Expression Profiles Pp. 133-137
Jing Shi and Michael G. Walker
[Abstract]
Inference of Gene Regulatory Networks and its Validation
Pp. 139-144
Fang-Xiang Wu
[Abstract]
Spectral Estimation Techniques for DNA Sequence and
Microarray Data Analysis Pp. 145-156
Hong Yan and Tuan D. Pham
[Abstract]
Abstracts

Advances in Exploration of Machine Learning
Methods for Predicting Functional Class and Interaction Profiles
of Proteins and Peptides Irrespective of Sequence Homology
Juan Cui, Lianyi Han, Honghuang Lin, Zhiqun Tang, Zhiliang
Ji, Zhiwei Cao, Yixue Li and Yuzong Chen
Various computational methods have been used for predicting
protein function from clues contained in protein sequence.
A particular challenge is the functional prediction of proteins
that show low or no sequence similarity to proteins of known
function. Recently, machine learning methods have been explored
for predicting functional class of proteins from a variety
of sequence-derived structural and physicochemical properties
independent of sequence similarity, which showed promising
potential for a broad spectrum of proteins including those
that show low or no similarity to other proteins. These methods
can thus be explored as potential tools to complement similarity-based,
clustering-based and structure-based methods for predicting
protein function. This article reviews the strategies, algorithms,
current progress, available software and web servers, and
underlying difficulties in using machine learning methods
for predicting the functional class of proteins and peptides,
and protein-protein interactions. The reported prediction
performances in the application of these methods are also
presented.
A Decade of Computing to Traverse the Labyrinth of
Protein Domains
Rajani R. Joshi
Detection and characterization of the structural domains of a protein
are crucial for determining its tertiary structure, elucidating
its functions, and designing and producing its biologically
active analogs. Identification of domain segments at the sequence
level is also important in deciphering protein structural
genomics and in evolutionary studies. The diversity of domain
folds and sequences and high structural flexibility of the
inter-domain linker regions pose great challenges for determination
of multi-domain protein structures even from X-ray crystallographic
or NMR spectroscopic data or by homology modeling. The problems
multiply in the absence of any such data or sequence homologies.
Interestingly, though, identification of protein domains is
a unique research problem where ab initio computational
investigations supersede the experimental ones or offer better
applications of the latter. Advances in bioinformatics
and computational biology in post-genomic research have led
to a plethora of approaches, algorithms and web servers
for predicting protein domains from 3D coordinates,
partial structural information including secondary structure,
or only the primary sequence. Here we assess the state-of-the-art
developments in the field. Trend-setting as well as widely
used computational methods and web-servers/databases are reviewed
here with a focus on their applicability, novelty and strength
in mining the multiple features of sequence/structure that
contribute to the formation, distinction and diversity of
protein domains. Future possibilities of a unified system
with optimal decision support are highlighted.
Gene Set Enrichment Analysis (GSEA) for Interpreting
Gene Expression Profiles
Jing Shi and Michael G. Walker
Gene set enrichment analysis (GSEA) is a statistical method
to determine if predefined sets of genes are differentially
expressed in different phenotypes. Predefined gene sets may
be genes in a known metabolic pathway, located in the same
cytogenetic band, sharing the same Gene Ontology category,
or any user-defined set. In microarray experiments where no
single gene shows statistically significant differential expression
between phenotypes, GSEA has identified significantly differentially
expressed sets of genes, even where the average difference
in expression between two phenotypes is only 20% for genes
in the gene set. The gene set identified in the first GSEA
analysis (oxidative phosphorylation genes differentially expressed
in diabetic versus non-diabetic patients) was subsequently
confirmed by independent laboratory studies published in the
New England Journal of Medicine. Since the first paper on
GSEA was published, many extensions and alternative methods
have been described in the literature. In this paper, we describe
the original GSEA algorithm, subsequent extensions and alternatives,
results of some of the applications, some limitations of the
methods and caveats for users, and possible future research
directions. GSEA and related methods are complementary to
conventional single-gene methods. Single-gene methods work
best when individual genes have large effects and the variance
within each phenotype is small. GSEA is likely to be
more powerful than conventional single-gene methods for studying
the large number of common diseases in which many genes each
make subtle contributions. It is a tool that deserves to be
in the toolbox of bioinformatics practitioners.
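The idea of scoring a whole gene set rather than individual genes can be illustrated with a minimal sketch. It assumes a simplified, unweighted Kolmogorov-Smirnov-style running sum over a list of genes already ranked by differential expression, and it estimates significance by drawing random gene sets of the same size. That is a simplification chosen for brevity (the published method permutes phenotype labels), and the function names and toy gene identifiers are hypothetical, not taken from the paper.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of a KS-style running sum.

    ranked_genes: gene identifiers sorted from most up-regulated to
    most down-regulated. gene_set: identifiers of the predefined set.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up on a set member, step down otherwise; the sum ends near zero.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def random_set_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Assumed simplification: compare against random sets of equal size."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = len(set(gene_set) & set(ranked_genes))
    at_least = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)


if __name__ == "__main__":
    ranked = ["g%d" % i for i in range(1, 11)]  # toy ranking, g1 is the top gene
    pathway = ["g1", "g2", "g3"]                # hypothetical gene set
    print(enrichment_score(ranked, pathway))
    print(random_set_p_value(ranked, pathway))
```

A set whose members cluster at the top of the ranking yields a score near +1, and one clustered at the bottom yields a score near -1, even when no single gene stands out on its own.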
Inference of Gene Regulatory Networks and its Validation
Fang-Xiang Wu
Genes encode proteins, some of which in turn regulate other
genes. Such interactions make up a gene regulatory network.
Understanding and unraveling gene regulatory networks
has proven very useful in disease diagnosis and genomic
drug design. Because of the complexity of gene regulatory networks,
a complete understanding of their dynamics is difficult
to achieve through biological experiments alone, without
computational aids. As a consequence, computational models
for gene regulatory networks are indispensable. Recently, a
wide variety of computational models have been proposed
for inferring gene regulatory networks. This paper surveys
some of the computational models for inferring large gene regulatory
networks, in particular Boolean network models, differential/difference
equation models, and state-space models. Some advantages
and disadvantages of these models are noted. Some criteria
for validating the inferred gene regulatory networks are also
discussed from the bioinformatics perspective. Finally, several
directions for future work on modeling gene regulatory
networks are proposed.
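As an illustration of the simplest model class named above, here is a minimal sketch of a Boolean network with synchronous updates. The three genes and their update rules are hypothetical and chosen only to show how such a model is simulated; they are not taken from the paper.

```python
def step(state, rules):
    """Synchronous update: every gene reads the same current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def find_attractor(state, rules):
    """Iterate until a state repeats; return the repeating cycle of states."""
    seen = {}
    trajectory = []
    while True:
        key = tuple(sorted(state.items()))
        if key in seen:
            return trajectory[seen[key]:]
        seen[key] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)


# Hypothetical rules: C represses A, A activates B, A and B jointly activate C.
RULES = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}

if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    for s in find_attractor(start, RULES):
        print(s)
```

Because the state space is finite, every trajectory eventually enters a cycle (an attractor). In this toy network the negative feedback from C to A produces an oscillation rather than a fixed point.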
Spectral Estimation Techniques for DNA Sequence and
Microarray Data Analysis
Hong Yan and Tuan D. Pham
Spectral estimation techniques are widely used in modern signal
processing systems. Recently, they have found important applications
in the analysis of DNA data. In this paper, we review parametric
and non-parametric spectral estimation methods for DNA sequence
and microarray data analysis. The discrete Fourier transform
(DFT) is the most commonly used technique for spectral analysis
of digital signals. It can reveal the gene locations in a
DNA sequence. The DFT can also be used to detect repetitive
elements in a DNA sequence. However, the DFT produces so-called
windowing or data truncation artifacts when it is applied
to a short data segment. Parametric spectral estimation methods,
such as the autoregressive (AR) model, overcome this problem
and can be used to obtain a high-resolution spectrum of the
input signal. In this paper, we demonstrate the advantages
of the AR model for the identification of protein coding regions
and the detection of DNA repeats. We also review DFT and AR
models and other spectral estimation techniques for the analysis
of microarray time series data.
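A minimal sketch of the DFT-based approach follows. It assumes the commonly used binary indicator representation (one 0/1 sequence per base) and relies on the well-known observation that protein-coding regions tend to show a spectral peak at frequency 1/3, reflecting codon structure. The function names, window size and step are illustrative assumptions, not the authors' implementation.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over four indicator sequences."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=351, step=3):
    """Period-3 power in overlapping windows; peaks suggest candidate coding regions."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    periodic = "ATG" * 40      # toy sequence with perfect period 3
    non_periodic = "ATGC" * 30  # toy sequence with period 4
    print(period3_power(periodic), period3_power(non_periodic))
```

Short windows localize a region better but blur the spectrum, which is the truncation artifact the abstract mentions and the motivation for parametric alternatives such as the AR model.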