



* Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
Department of Human Biological Chemistry and Genetics, University of Texas Medical Branch, Galveston, Texas 77550
Shriners Hospital for Children, Department of Surgery, University of Texas Medical Branch, Galveston, Texas 77550
| ABSTRACT |
|---|
| I. Introduction and Outline |
|---|
While early microarray studies mostly focused on the basic methodological and technical aspects of DNA arrays (e.g., data normalization, error correction, replication), emphasis has shifted to biological, medical, and clinical applications. DNA chips are being used in pharmacogenomics and pharmacogenetics, toxicogenomics, human disease studies, disease screening, profiling and classification, diagnosis and clinical applications, and basic biological science studies. In each case, the experimental design has to be planned to fit the questions being addressed. Following is a brief list describing the most-frequently used experimental designs.
In a typical experimental situation with microarrays, one may want to:
1. Compare gene expressions obtained from two or more different tissues (for example, healthy versus diseased tissue) in order to compare or classify them. This type of experimental design has been used, for instance, to get clues regarding mechanisms and causes of disease processes or to classify specific clinical varieties of cancers/tumors according to their expression profiles, in order to better predict prognosis (Golub et al., 1999; Dirix and van Oosterom, 2002).
2. Compare gene expression data obtained from the same tissue or cell line at different time points, in order to follow the time course of expression. This design can be used to monitor temporal gene expression patterns during the cell cycle (Spellman et al., 1998; Tamayo et al., 1999) or during development (Wen et al., 1998), temporal progression of a disease (Agrawal et al., 2002; Pomeroy et al., 2002; Spies et al., 2002), or response to a treatment (Nesic et al., 2002; Sotiriou et al., 2002).
3. Compare gene expression data obtained from different parts of the same tissue, in order to reconstruct spatial distribution patterns of gene expression. An expanded version of this design, called voxelation, was used recently by Brown et al. (2002), who correlated microarray data with the site of gene expression in tissues by creating signatures of expression patterns in coronal hemisections at the level of the hippocampus of the human brain. By combining the data for the entire surface of a volume of brain section, a three-dimensional spatial pattern of gene expression was generated. This important study (Peterson, 2002) combines DNA array technology with brain-imaging techniques, such as functional magnetic resonance imaging (fMRI), to represent the expression patterns of the whole organ.
Irrespective of the type of microarray employed (e.g., cDNA, oligonucleotide, spotted), such experiments generate tens of thousands of data points per measurement. In addition, depending on the experimental design, the number of samples, or the number of time points, the complete data set to be analyzed often contains hundreds of thousands of gene expression levels. These data are most commonly presented in tabular form (Quackenbush, 2001), called an expression matrix (see Table I).
The main difficulty in statistical analysis of such data sets stems primarily from the fact that one must deal with a small number of samples or "Experiments" (i.e., cell lines, patients, time points), relative to the large number of probes ("Genes"). Moreover, the unnormalized, raw expression levels of different genes in the same experiment (or under one condition, or at one time) (i.e., the numbers $E_{1,1}, E_{2,1}, E_{3,1}, \ldots$) may have values that range over several orders of magnitude, from values close to unity to values on the order of $10^5$. The ultimate goals are to establish how the expression level of some gene changes from experiment to experiment and to identify groups of genes that exhibit similar coexpression patterns. Statistical methods designed to deal with these issues continue to be adapted and developed, since they are crucial for providing useful data and for extracting reliable biological information from DNA array experiments. This chapter reviews some of these methods, starting from the most basic and working towards more complex ones. Some of our results, obtained by using Affymetrix GeneChips, are described briefly in the form of illustrative examples.
| II. Fold Changes in Expression Levels |
|---|
2-fold change in expression level. But even with this simple analysis, care must be taken because of the aforementioned "large number of probes vs. small number of experiments" problem. This is especially important if one wants to attach statistical significance to the observed changes. For instance, to determine the expression level of a single gene in one experiment (or under one condition), one needs to make several replicate measurements; the more, the better. Performing many replicate experiments, however, often is not feasible, due to the high cost of DNA chips or the limited amount of RNA or DNA material available. Nevertheless, it has been our experience, in agreement with more-formal reliability studies (Lee et al., 2000), that at least three replicates per experiment must be made to have reasonable statistical confidence in the expression values obtained. Once the expression value of a gene has been established through replicate experiments under one condition, one wants to compare that with the expression value of that same gene under some other condition. A usual way to make the comparison is through a two-sided t-test, which assumes that the replicated expression values are normally distributed, or through a nonparametric alternative. With three or so replicates per experiment, the statistical significance of the difference between the two experiments typically is not very impressive and only those genes that exhibit large up- or downregulation between the two experiments can be identified with some confidence. Thus, due to the various sources of errors or chance variations between two measurements, DNA arrays cannot be used with great confidence to detect small changes (i.e., less than 1.5-fold) in expression levels across experiments. Even with this constraint, one is left with, for example, 1000 genes that are identified as significantly changed. To assign some confidence level to this finding, one can perform t-tests, one for each gene, requiring 1000 total tests.
Then, to correct for the repeated testing, one can impose the usual Bonferroni correction to the individual significance levels (i.e., require that the p-value for each gene be 1000 times smaller than, say, 0.05). Unfortunately, under these settings, the Bonferroni condition turns out to be extremely restrictive and almost no genes with significantly changed expression levels are detected with required statistical confidence.
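The arithmetic of the Bonferroni correction can be sketched in a few lines of Python. This is only an illustration: the function name and the p-values below are hypothetical and are not taken from any of the studies discussed here.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the indices of tests that pass the Bonferroni-corrected threshold."""
    n_tests = len(p_values)
    # Each individual p-value must be n_tests times smaller than alpha.
    threshold = alpha / n_tests
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical p-values for 1000 genes: 998 unremarkable ones and two very small ones.
p_values = [0.01] * 998 + [1e-6, 2e-5]
significant = bonferroni_significant(p_values)
print(len(significant), "of", len(p_values), "genes pass the corrected threshold")
```

Note that a gene with p = 0.01, which would look convincing in a single test, falls far short of the corrected threshold of 0.05/1000; this is the restrictiveness described above.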
One way out of this impasse was suggested by Tusher and coworkers (2001), who proposed a new method, significance analysis of microarrays (SAM). The SAM procedure assigns an "observed score" to each gene, depending on that gene's change in expression level relative to the standard deviation of replicated measurements. Next, a number of "balanced" permutations of expression values are performed and a similar score is assigned in each case; these scores are then averaged over all permutations to compute the "expected score." The scatter plot of observed vs. expected scores is then used to identify significant changes in gene expression. With the additional adjustable threshold parameter (denoted Δ in the original article; Tusher et al., 2001), one can control the overall false discovery rate (FDR), the percentage of genes identified as potentially significant by chance alone. With an FDR of 5–10%, which is deemed acceptable, one is still left with dozens of genes that show statistically significant changes in expression levels. SAM analysis has become a standard statistical technique for detecting groups of genes with potentially significant change in their expression levels. (SAM software is available at http://www-stat.stanford.edu/~tibs/SAM/index.html.) Following such analysis, one can compile a table of significantly over- or underexpressed genes, under different circumstances, with the expectation that these genes most actively participate in the phenomenon under study.
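The observed-versus-expected comparison can be illustrated with a deliberately simplified Python sketch. It is not the SAM software and makes several assumptions of its own: plain random relabelling of arrays instead of "balanced" permutations, a fixed small constant `s0` in the denominator, and a simple rule that flags any gene whose observed score departs from the expected score at the same rank by more than `delta`. The function names are hypothetical.

```python
import random
from statistics import mean


def d_score(a, b, s0=0.1):
    """Difference of group means scaled by a pooled standard error plus a constant s0."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s = ((1 / na + 1 / nb) * ss / (na + nb - 2)) ** 0.5
    return (mb - ma) / (s + s0)


def sam_sketch(data, n_a, delta=1.0, n_perm=200, seed=0):
    """data: one list per gene; the first n_a values come from condition A, the rest from B.

    Returns the indices of genes whose observed score differs from the
    permutation-averaged score of the same rank by more than delta.
    """
    rng = random.Random(seed)
    observed = sorted((d_score(g[:n_a], g[n_a:]), i) for i, g in enumerate(data))
    expected = [0.0] * len(data)
    columns = list(range(len(data[0])))
    for _ in range(n_perm):
        rng.shuffle(columns)  # relabel the arrays, identically for every gene
        scores = sorted(
            d_score([g[c] for c in columns[:n_a]], [g[c] for c in columns[n_a:]])
            for g in data
        )
        for rank, d in enumerate(scores):
            expected[rank] += d / n_perm  # average the ordered scores over permutations
    return [i for (d, i), e in zip(observed, expected) if abs(d - e) > delta]
```

With only a handful of genes and replicates the expected scores are noisy, so the sketch is meant to show the structure of the computation rather than to give reliable calls.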
| III. Data Classification and Clustering |
|---|
To illustrate the issues involved, consider Table I. With each "Gene" (e.g., "Gene 2") from this table, one can associate an "expression vector" with entries $E_{2,1}, E_{2,2}, E_{2,3}, \ldots$ that are simply read off from the row corresponding to that gene. In other words, the expression vector of a gene contains expression levels of that gene in different experiments. The number of components (dimension) of the expression vector equals the number of experiments ($N_E$) and the number of expression vectors equals the number of genes ($N_G$). Geometrically, one can think of the expression vector as a point (tip of the expression vector) in the $N_E$-dimensional "expression space," so that each gene is uniquely assigned a single point. The dimensionality of the expression space is equal to the number of experiments (typically, between 10 and 100), while the number of points in this space is equal to the number of genes (typically, several thousand). In order to group the genes (or points in expression space) into clusters, one needs to define some measure of distance between them. The most straightforward and most commonly used one is the geometric, Euclidean distance between the two points (expression vectors) i and j, the square of which is defined as
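$$d_{ij}^2 = \sum_{k=1}^{N_E} \left(E_{i,k} - E_{j,k}\right)^2$$

where $E_{i,k}$ is the expression level of gene $i$ in experiment $k$ and the sum runs over all $N_E$ experiments.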
As a somewhat "orthogonal" procedure to gene clustering, it often is useful to perform experiment clustering. To achieve this, one can represent each experiment by an "experiment vector," with its entries read off from the corresponding column of the expression matrix (see Table I). Thus, "Experiment 2" would be represented by an experiment vector with entries $E_{1,2}, E_{2,2}, E_{3,2}$, etc. The number of experiment vectors equals the number of different experiments, while the length (dimension) of this vector equals the number of genes. Each point in this "experiment space" corresponds to one experiment. By introducing an appropriate distance measure between two experiment vectors, one then can cluster experiments according to their similarity.
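The row and column views of the expression matrix, and the Euclidean distance between two vectors, can be sketched as follows. The matrix values and the helper names are hypothetical and serve only to make the indexing concrete.

```python
import math

# Hypothetical expression matrix: rows are genes, columns are experiments.
expression = [
    [1.0, 2.0, 3.0, 4.0],  # Gene 1
    [1.0, 2.5, 3.0, 3.5],  # Gene 2
    [9.0, 1.0, 0.5, 7.0],  # Gene 3
]


def gene_vector(matrix, i):
    """Expression vector of gene i: one row, with one entry per experiment."""
    return matrix[i]


def experiment_vector(matrix, j):
    """Experiment vector of experiment j: one column, with one entry per gene."""
    return [row[j] for row in matrix]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance between genes 1 and 2 in expression space,
# and between experiments 1 and 2 in experiment space.
d_genes = euclidean(gene_vector(expression, 0), gene_vector(expression, 1))
d_experiments = euclidean(experiment_vector(expression, 0), experiment_vector(expression, 1))
```

The same distance function serves both kinds of clustering; only the choice of rows or columns changes.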
Clustering experiments is particularly useful as a preliminary step to discover, for instance, possible gross discrepancies between microarrays that may occur with faulty arrays or because of other systematic errors. As an illustration, we recently reported on a microarray experiment involving burn injury in rats (Spies et al., 2002), where gene expressions in the skin tissues from burned rats and normal rats were compared at four time points (2 hours, 6 hours, 24 hours, and 240 hours after the injury). We used three replicate experiments for each group: thus, 3 replicates × 2 groups × 4 time points = 24 experiments (arrays). After clustering of experiments (arrays), it was discovered that one of the 24 arrays differed markedly from the rest. In this particular array, only about 800 genes were expressed, while in all others, the number of expressed genes averaged around 4000. This difference was immediately visible in the clustering of experiments. The faulty array was discarded from further analysis and a proper one was substituted.
Experiment clustering also can be used to determine the overall effect of treatment, or healing, on global expression profiles. For instance, in a recent study of spinal cord injury (SCI) in rats (Nesic et al., 2002), we compared expression levels from spinal cord tissues of 1) rats with injured cord, 2) rats with injured cord that were treated with N-methyl-D-aspartate (NMDA) receptor antagonist MK-801, and 3) control (sham) animals. We used three replicates per group and performed hierarchical clustering of nine experiments. The resulting dendrogram, shown in Figure 1, correctly demonstrated that, overall, the injured and MK-801-treated groups are more similar to each other than to the sham group.
With the defined distance measure, whether in the expression space or in the experiment space, the next step is to select the appropriate clustering algorithm.
A. HIERARCHICAL CLUSTERING ALGORITHMS
Most gene-clustering algorithms are hierarchical. These methods are derived (Eisen et al., 1998) from algorithms used to construct phylogenetic trees; the most-similar genes are clustered first, while those with more-diverse profiles are subsequently included in a stepwise hierarchy of increasing diversity. This means that, in the first clustering step, the single most-similar expression profiles are linked to form nodes, the most similar of which are linked further in the second clustering step, and so forth, until all nodes finally are linked and the complete hierarchical tree of proximities (dendrogram) is obtained. Starting from the second clustering step and higher, each node may consist of two or more objects. The distances between nodes must be recomputed at each step. This can be done, for example, by computing the distance between two nodes as the average distance between their objects, as in the average linkage procedure, or as the distance between their two closest objects, as in the nearest-neighbor linkage procedure. Other options include distances computed between the centers of mass of clusters or their modifications. In most cases, however, the average linkage procedure is considered acceptable.
These different linkage choices are made to compensate for potential problems with hierarchical clustering. Namely, as clusters grow in size, at higher levels of hierarchy, the expression vector that represents the cluster may no longer be representative of any of the genes in the cluster. Thus, actual expression patterns of the genes themselves become less relevant on higher levels of hierarchy. If a gene is assigned to the "wrong" cluster, this error cannot be corrected later under hierarchical clustering.
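A minimal Python sketch of agglomerative clustering with average linkage is given below. It is a naive illustration written for readability, not an efficient implementation and not the code used in the studies cited; the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_distance(points, a, b):
    """Average linkage: mean pairwise distance between the members of clusters a and b."""
    return sum(euclidean(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def hierarchical_clustering(points):
    """Return the merge history as a list of (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        # Link the two most similar clusters; distances are recomputed at every step.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(points, *pair))
        merges.append((a, b, average_distance(points, a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges


# Hypothetical two-dimensional "expression vectors".
points = [[0, 0], [0, 1], [5, 5], [5, 6.5], [20, 20]]
for a, b, dist in hierarchical_clustering(points):
    print(a, "+", b, "at distance", round(dist, 2))
```

The merge history is exactly the information a dendrogram displays. Once two clusters are merged they are never split again, which is why an early wrong assignment cannot be corrected later.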
B. NONHIERARCHICAL CLUSTERING ALGORITHMS
1. K-means Clustering
Sometimes, when a priori knowledge exists about the number of clusters that should be obtained, one can use nonhierarchical K-means clustering to partition the data. In this procedure, one first specifies the number of clusters (K), then randomly assigns expression vectors to them. The cluster centers (mean expression vectors) are then recomputed, each expression vector is reassigned to the nearest cluster, and the procedure is iterated until no new assignments are made. The K-means clustering procedure simply partitions expression data into K groups and does not produce a dendrogram, although one can be constructed later by a hierarchical procedure.
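The iteration can be sketched in plain Python as follows. One detail differs from the description above and is an assumption of this sketch: instead of a random initial assignment, K randomly chosen data points serve as the initial cluster centers, a common variant that avoids empty clusters at the start. The function names are hypothetical.

```python
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, seed=0, max_iter=100):
    """Partition vectors into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    # Assumption of this sketch: start from k distinct data points as centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every vector to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        if new_labels == labels:  # no new assignments: converged
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical data with two well-separated groups.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(vectors, k=2)
```

In general the result depends on the random start, so in practice the procedure is often repeated with several seeds.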
2. Self-organizing Maps
Another frequently used nonhierarchical procedure is self-organizing maps (SOMs), a neural network-based procedure for clustering. In this algorithm, one also specifies in advance the number of clusters, chosen usually as the nodes of a grid. The nodes are mapped into the K-dimensional data space (here K denotes the dimension of the expression vectors, not the number of clusters), initially at random, and then iteratively adjusted. During each iteration, a data point is randomly selected and the nodes are moved towards that point: the closest node is moved the most, and the other nodes are moved by smaller amounts that decrease with their distance from the closest node in the initial geometry. In this way, neighboring points in the initial geometry are mapped to nearby points in the data space. This process usually is iterated tens of thousands of times. SOMs are particularly useful for exploratory data analysis, in order to expose the global patterns in the data.
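A toy version of the SOM update rule is sketched below for nodes arranged on a one-dimensional grid. The linearly decaying learning rate, the Gaussian neighborhood function, and all parameter values are assumptions chosen for simplicity; published implementations differ in these details.

```python
import math
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_som(data, n_nodes, n_iter=5000, seed=0):
    """Train a one-dimensional SOM; returns the node positions in data space."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Nodes of the grid are mapped into the data space at random.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(n_iter):
        frac = t / n_iter
        rate = 0.5 * (1 - frac)                         # learning rate decays over time
        radius = max(n_nodes / 2 * (1 - frac), 0.5)     # neighborhood shrinks over time
        x = rng.choice(data)                            # randomly selected data point
        winner = min(range(n_nodes), key=lambda i: sq_dist(nodes[i], x))
        for i, node in enumerate(nodes):
            grid_dist = abs(i - winner)                 # distance in the initial geometry
            pull = rate * math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            for d in range(dim):
                node[d] += pull * (x[d] - node[d])      # closest node moves the most
    return nodes


def nearest_node(nodes, x):
    return min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], x))
```

After training, each data point is assigned to its nearest node, and the nodes play the role of cluster centers.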
C. PRINCIPAL COMPONENT ANALYSIS
A somewhat more-familiar method for data reduction is the singular value decomposition (SVD), or principal component analysis (PCA) as it is known in statistics (Alter et al., 2000; Yeung and Ruzzo, 2001). In this procedure, expression data from the "genes x experiments" expression space are transformed to diagonalized "eigengenes x eigenexperiments" space, where eigengenes (or eigenexperiments) are unique orthonormal superpositions of genes (or experiments). PCA is essentially a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, with each succeeding component accounting for as much of the remaining variability as possible. This transformation represents the data in the new reduced coordinate space, in which individual genes or experiments appear to be classified into groups of similar functions or similar cellular state or phenotype. A simple illustration of PCA is given in Figure 2, in which the first principal component of a two-dimensional data set is shown by a straight line.
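For a two-dimensional data set, the first principal component can be computed in closed form from the 2 x 2 covariance matrix, which makes for a compact sketch. The function name and the sample points are hypothetical; real expression data would call for a general SVD routine.

```python
import math


def first_principal_component(points):
    """Return (direction, explained_fraction) for a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Orientation of the leading eigenvector of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Largest eigenvalue, as a fraction of the total variance.
    top = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    return direction, top / (sxx + syy)


# Hypothetical points scattered around the line y = 2x.
points = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8), (4, 8.1)]
direction, explained = first_principal_component(points)
```

The returned direction is the straight line of maximal variance through the data, and the explained fraction states how much of the total variability that single component captures.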
D. BIOMEDICAL APPLICATIONS OF CLUSTERING
Use of clustering/classification procedures in microarray experiments has been particularly fruitful in cancer research because cancers are complex, multigenic diseases with a natural control group for the analysis: noncancerous tissue (Alon et al., 1999; Lin et al., 2002). This was studied in prostate cancer behavior (Singh et al., 2002), where a set of gene expression differences between healthy and diseased tissues was detectable at the time of diagnosis. Alternatively, a clustering procedure can be used to compare cancerous tissues of the same type and to distinguish between clinical subtypes, as was done in two types of breast cancer (Hedenfalk et al., 2001). The procedure also was very efficient in finding genes that distinguish between small blue cell tumors and leukemias (Tibshirani et al., 2002) as well as in the discovery of a new subset of melanomas (Bittner et al., 2000). The general conclusion drawn from these and other studies is that different cancers can be classified by the characteristic expression patterns of not more than dozens of genes. With more than 200 types of cancer, DNA microarray experiments are becoming an important tool to distinguish between their types and subtypes on the molecular level.
E. CLUSTERING OF TIME-COURSE EXPERIMENTS
An important class of DNA array experiments in which data classification and clustering have been used successfully is that of time-course experiments. In this setup, genome-wide expressions are measured at different time points in order to discover the temporal pattern in the course of development, or during a response to a treatment, or during a healing process. In this context, we mention the important pioneering work by Tamayo et al. (1999), where the temporal patterns of gene expression during the yeast cell cycle were classified by the SOM algorithm. The expression measurements were taken at 16 equally spaced, 10-minute intervals over two cell cycles (160 minutes), yielding a total of 30 different patterns. The classification was able to successfully extract yeast cell-cycle periodicity as the most prominent feature in the data and to select the appropriate group of genes that participate in the cycling process.
Following this work, numerous articles have reported results of temporal gene expression patterns under a variety of conditions. These include the temporal gene expression mapping of central nervous system development in rat cervical spinal cord (Wen et al., 1998); response of human bronchial cells to smoke and hydrogen peroxide (Yoneda et al., 2001); differentially expressed genes in human myometrium during pregnancy and labor (Aguan et al., 2000); and a range of experiments in toxicogenomics that measure response following exposure to toxicants, to identify drugs that provoke adverse reactions (Castle et al., 2002). A large-scale study of development and metabolic pathways in mice, with approximately 1.8 million measurements of gene expressions based on 294 microarray analyses of 49 adult and embryonic tissues (Miki et al., 2001), is perhaps the best illustration of the versatility of time-course DNA array experiments.
The time points where expression levels are measured in time-course experiments need not be equally spaced, since biologically important events often occur over different time scales. To give an example (Spies et al., 2002), we analyzed the time course of healing and recovery of burn wounds in rats, with measurements made at the following four time points: 2 hours, 6 hours, 24 hours, and 240 hours after the burn injury.
The goal of our study was to identify local responses and initial cellular responses to skin thermal injury by comparing expression profiles in burned and unburned rat skin tissue. The associated genomic events include differential expression of genes involved in cell survival and death, growth regulation, metabolism, inflammation, and immune response. The dynamics of these events is most clearly seen when genes with similar temporal expression patterns are clustered together.
With only four data collection time points, the temporal change of the ratio of burned vs. unburned expressions can be analyzed in considerable detail. Note that, in this case, there are 27 possible dynamical patterns of temporal development. At each time point, the value of the expression ratio (burned vs. unburned) can 1) increase with respect to the previous time point, 2) decrease, or 3) remain the same as the value at the previous time point. Thus, starting at some base value at time 1, the expression ratio can change/not change at later time points 2, 3, and 4, giving $3^3 = 27$ possibilities of development (Figure 3).
More generally, with t time points, there are $3^{t-1}$ dynamical patterns. This number can become quite large quickly with increasing numbers of time points. For instance, with 17 time points equally spread over a 160-minute interval during two yeast cell cycles, as used in the already-mentioned work of Tamayo et al. (1999), there are over 43 million ($3^{16}$) mathematically possible different patterns. Of course, in this and similar cases, it is quite unrealistic and, indeed, completely unnecessary to consider all possible patterns in detail. What one needs is to classify existing data into a small number of characteristic patterns (clusters) with some global features like "clusters with peak expressions at 25–45 minutes and 85–105 minutes" (Tamayo et al., 1999) that correspond to some meaningful biological events. This is precisely what a clustering algorithm such as SOM performs: it searches through the large "pattern space" for the small set of characteristic patterns that reflects global features of the entire set. Alternatively, in dynamical system parlance, one can think of the final characteristic clusters as attractors and of sets of patterns assigned to them as points that fall into their domains of attraction.
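The counting argument, and the assignment of a measured time course to one of the patterns, can be sketched in Python. The symbols "+", "-", and "0" for increase, decrease, and no change, the tolerance parameter, and the function names are conventions of this sketch only.

```python
from itertools import product


def count_patterns(t):
    """Number of up/down/unchanged patterns for t time points."""
    return 3 ** (t - 1)


def classify(ratios, tol=0.0):
    """Encode a time course of expression ratios as a string of '+', '-', '0' steps.

    tol is a hypothetical tolerance below which a change counts as 'unchanged'.
    """
    steps = []
    for prev, cur in zip(ratios, ratios[1:]):
        if cur - prev > tol:
            steps.append("+")
        elif prev - cur > tol:
            steps.append("-")
        else:
            steps.append("0")
    return "".join(steps)


# All patterns for four time points (three transitions).
all_patterns = ["".join(p) for p in product("+-0", repeat=3)]

print(count_patterns(4))   # four time points
print(count_patterns(17))  # seventeen time points
print(classify([1.0, 2.0, 2.0, 0.5]))
```

Grouping genes by the string returned from `classify` is a direct, if crude, way of sorting them into the patterns described above.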
Returning to the example at hand, the complete set of 27 patterns that can be obtained with four time points is shown in Figure 3. The total number of genes in our data is 781.
The particular set of genes participating in these processes can be analyzed in detail by examining each cluster separately. Consider, for illustration, cluster 6 (i.e., pattern 6) that includes 24 genes, as shown in Figure 4. The numbers 1–24 in this figure label the particular genes that belong to the cluster (their names are listed in the separate table, not included). In this and other figures, the time points are drawn as equidistant to simplify the drawing.
| IV. Beyond Simple Clustering: Genetic Regulatory Networks |
|---|
In order to move beyond simple coexpression, one has to establish which genes in some cluster also share common regulatory elements that control their expression levels (groups of genes regulated by a common element have been dubbed "regulons" in recent literature). The final goal of such analysis is to construct genetic regulatory networks and to identify the function of many thousands of novel genes (Tavazoie et al., 1999). This approach has been successful in yeast (Lyons et al., 2000), for which the complete sequence of promoter regions is known. Unfortunately, this is not the case with mammalian or other systems, where untranslated first exons, followed by introns greater than 10 kb in size, can make promoter identification extremely difficult. In many organisms, the promoter regions have not been fully sequenced. To construct the network for the phenomenon in question, one must use statistical algorithms for clustering and motif discovery in combination with genomic data, cis-regulatory analysis, and known molecular biology of the process studied. In spite of all the difficulties, several genetic regulatory networks, or parts of them, have been constructed. Davidson et al. (2002) have mapped a gene regulatory network in sea urchin embryo that controls the specification of endoderm and mesoderm. Such studies reveal that, in addition to comprehensive gene expression maps (Kim et al., 2001) obtained by DNA array measurements, one needs as much other genome-wide information as can be mustered to unravel the intricate patterns of genetic interactions in biological processes.
Simultaneously with this development, additional knowledge is accumulating regarding the statistical nature of naturally occurring networks. Many biological networks (e.g., genetic, metabolic) exhibit "small-world," scale-free behavior (Watts and Strogatz, 1998). This means that although the network may possess thousands of nodes, the path leading from one node to another is remarkably short. Such architecture may serve to minimize transition times between metabolic states or provide robustness against mutations (Fell and Wagner, 2000; Jeong et al., 2000; Wagner, 2000). These new insights, combined with the knowledge of biological processes, may lead us for the first time towards understanding biology at the systems level (Kitano, 2002).
| REFERENCES |
|---|