Analysing breast cancer microarrays from African Americans using shrinkage-based discriminant analysis

Breast cancer tumours among African Americans are usually more aggressive than those found in Caucasian populations. African-American patients with breast cancer also have higher mortality rates than Caucasian women. A better understanding of the disease aetiology of these breast cancers can help to improve and develop new methods for cancer prevention, diagnosis and treatment. The main goal of this project was to identify genes that help differentiate between oestrogen receptor-positive and -negative samples among a small group of African-American patients with breast cancer. Breast cancer microarrays from one of the largest genomic consortiums were analysed using 13 African-American and 201 Caucasian samples with oestrogen receptor status. We used a shrinkage-based classification method to identify genes that were informative in discriminating between oestrogen receptor-positive and -negative samples. Subset analysis and permutation were performed to obtain a set of genes unique to the African-American population. We identified a set of 156 probe sets, which gave a misclassification rate of 0.16 in distinguishing between oestrogen receptor-positive and -negative patients. The biological relevance of our findings was explored through literature-mining techniques and pathway mapping. An independent dataset was used to validate our findings and we found that the top ten genes mapped onto this dataset gave a misclassification rate of 0.15. The described method allows us best to utilise the information available from small sample size microarray data in the context of ethnic minorities.


Introduction
Breast cancer is the most commonly diagnosed cancer in women of all ethnic groups in the United States. It is also the second leading cause of cancer deaths in women. The Surveillance, Epidemiology, and End Results (SEER) database of the National Cancer Institute shows that African-American women, by comparison with Caucasian women, have a higher mortality rate for breast cancer, despite a lower incidence. Between 2000 and 2004, the age-adjusted breast cancer incidence rates were 118.3 cases per 100,000 African-American women and 132.5 cases per 100,000 Caucasian women. 1 By contrast, mortality rates were worse for African Americans, with 33.8 deaths per 100,000 women compared with 25.0 deaths per 100,000 Caucasian women. 1 In addition, a greater proportion of African-American women are diagnosed at a younger age compared with Caucasian women. The median age at breast cancer diagnosis is 57 years for African-American women and 62 years for Caucasian women. 2 Between 1996 and 2004, the five-year breast cancer survival rates were 77.1 per cent for African-American women and 89.9 per cent for Caucasian women. 1 These statistics highlight the disproportionate burden of breast cancer among African-American women. 3 One reason for this ethnic cancer disparity may be lower socioeconomic status. Roetzheim et al. mentioned that the lower percentage of health insurance among African Americans has led to late-stage diagnosis, which results in higher mortality rates. 4 In their review article, Gerend and Pai suggested that in addition to socioeconomic status, cultural factors may also play a role. 5 Another potential reason may be the lack of access to mammography. 6 Smigal et al. also reported that the rate of mammography uptake varies among ethnic groups. 7 These previous reports collectively suggest that disparities in breast cancer survival may be attributed to lower socioeconomic status.
Multivariate modelling approaches show, however, that ethnic differences remain a significant independent risk factor for survival even after adjustment for co-morbidity and socioeconomic variables. 8-10 This indicates that socioeconomic variables are not sufficient to capture the survival disparity.
The variation may instead be explained by differences in the underlying ethnic-specific tumour biology. For example, the incidence rate for oestrogen receptor-negative (ER-) breast cancer is higher in African-American women than in Caucasian women. In the study by Joslyn, 39 per cent of African-American women had ER- tumours compared with 23 per cent of Caucasian women. 11 The fact that women with ER- tumours usually have a worse prognosis than those with oestrogen receptor-positive (ER+) tumours may serve as one avenue for explaining the differences in breast cancer survival rates. 12 In the era of public health genomics, and with the lowering costs of high-throughput technologies in recent years, research among ethnic minority populations in this area is still lagging behind. Moreover, there is a biological basis for ethnic differences in breast cancer 13,14 and a pressing need to understand the biological mechanism of the disease by utilising the widely available high-throughput data and technologies. Gene expression analyses have been used extensively to characterise breast cancer subtypes; 15,16 however, there has not been any research looking specifically at classifying ER status among African-American patients with breast cancer. In this paper, we review data gathered from one of the largest cancer genomics studies and apply a recently developed discriminant method for small sample size data to help to identify genes and biomarkers of interest.

Materials and Methods
Data were obtained from the International Genomics Consortium's expression project (GEO2109) for oncology, an ongoing project to collect gene expression data with a clinically annotated set of de-identified tumour samples. We obtained over 300 samples of microarray data with demographic and clinical information. The chipset used for these gene expressions was based on Affymetrix HG-U133 plus 2.0 with 54,613 probe sets. We looked at the distribution of different tumour types by ethnicity in the 1392 tumour samples of the microarray data sets made publicly available on or before 31st December 2008. A validation dataset with African-American breast tumour gene expression samples and oestrogen receptor status was obtained from the GEO public repository (GSE5847) uploaded by Boersma et al. 17 This gene expression dataset was based on an older Affymetrix HG-U133A with 22,215 probe sets.
For breast cancer, there were 310 and 20 female samples for Caucasians and African Americans, respectively. Roughly 65 per cent of both ethnicities had pathological ER+/ER- status, giving a total of 201 Caucasian and 13 African-American patients for our analysis. The validation dataset consisted of 18 African-American samples with pathological ER+/ER- status.

Oestrogen receptor status
We chose to study breast cancer and ER+/ER- status because breast cancer has been studied extensively in the literature. One of the earlier publications that used ER status for classification was that of West et al. 18 They demonstrated the use of gene expression data for determining clinically relevant phenotypes in breast tumour samples. More recent publications include the identification of prognostic gene expression classifiers based on ER+ breast cancer; 19,20 an increased risk of ER- breast cancer among Hispanic women with a family history of the disease; 21 and a gene expression profile of good prognosis subtype in ER- breast cancer. 22 ER status has also been used to guide breast cancer therapy, predict breast cancer survival rates and estimate the risk of breast cancer. 23-25 ER+ breast cancers are usually treated with hormone therapy, whereas ER- breast cancers are treated using chemotherapy. Not all breast carcinomas are responsive to treatment, however. Therefore, it is important to know the biological mechanisms behind the disease to help identify therapeutic targets and develop novel agents.

Small sample size classification
In the microarray setting, classification is performed on a matrix of n samples by p genes. Linear discriminant analysis (LDA) is a well-known classification method, that is, a technique for distinguishing between groups of samples. For the two-group classification problem, it finds a projection of the samples in which the two groups are well separated. LDA has been extended to diagonal linear discriminant analysis (DLDA). The difference between LDA and DLDA is that DLDA assumes no correlation among genes. Only the diagonal of the covariance matrix is used in the classification rule, hence the word 'diagonal' before LDA. DLDA has been reported to be one of the best-performing classifiers. 26 The performance of DLDA itself can be unsatisfactory, however, because the commonly used sample variances are unreliable estimates when the sample size of each group is much smaller than the number of genes. The dataset in this study is an illustrative example of such a situation, in which there are only 13 African-American breast cancer samples compared with 54,613 probe sets.
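The diagonal rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; the function names `fit_dlda` and `predict_dlda` and the toy expression values are assumptions made purely for illustration. Each class receives a per-gene mean, variances are pooled across classes gene by gene, and a new sample is assigned to the class with the smallest variance-scaled squared distance, adjusted by the log prior.

```python
import math

def fit_dlda(samples, labels):
    """Estimate per-class gene means, pooled per-gene variances and class priors."""
    classes = sorted(set(labels))
    n, p = len(samples), len(samples[0])
    means, priors = {}, {}
    pooled_ss = [0.0] * p
    for k in classes:
        rows = [x for x, y in zip(samples, labels) if y == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        priors[k] = len(rows) / n
        for r in rows:
            for j in range(p):
                pooled_ss[j] += (r[j] - means[k][j]) ** 2
    # Only the diagonal of the covariance matrix is kept: one pooled variance per gene.
    variances = [ss / (n - len(classes)) for ss in pooled_ss]
    return means, variances, priors

def predict_dlda(x, means, variances, priors):
    """Assign x to the class with the smallest diagonal discriminant score."""
    def score(k):
        dist = sum((x[j] - means[k][j]) ** 2 / variances[j] for j in range(len(x)))
        return dist - 2.0 * math.log(priors[k])
    return min(means, key=score)

# Toy data (hypothetical): six samples, two genes.
train = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.1], [3.0, 2.0], [3.2, 2.2], [2.9, 1.9]]
labels = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
model = fit_dlda(train, labels)
prediction = predict_dlda([1.1, 4.9], *model)
```

With p far larger than n, the per-gene variances in `fit_dlda` become the weak point of the rule, which is the motivation for the shrinkage step discussed next.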
In 2009, Pang et al. pointed out that regularisation and shrinkage techniques could help to enhance and improve the performance of the diagonal discriminant analysis. 27 Specifically, they described two strategies in their paper. First, they applied shrinkage to DLDA, which, in essence, is a method for borrowing information across genes to improve the estimation of the gene-specific variances by shrinking them toward a pooled variance. Secondly, they applied regularisation to the shrinkage-based diagonal discriminant rules, which is essentially a weighted version of the shrinkage-based DLDA and shrinkage-based diagonal quadratic discriminant analysis (DQDA). Combining shrinkage-based variance in diagonal discriminant analysis and regularisation resulted in a new classification scheme which showed improvement over the original DLDA, DQDA and other commonly used classifiers. This new scheme was named regularised shrinkage-based diagonal discriminant analysis (RSDDA). For more details on this algorithm, see Appendix 1.
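The idea of borrowing information across genes can be sketched as follows. This is a deliberately simplified linear shrinkage toward the average variance, with a fixed weight `alpha` chosen by hand; both the functional form and the name `shrink_variances` are assumptions for illustration, and the estimator actually used in RSDDA is the one defined by Pang et al. 27, which may differ in form.

```python
def shrink_variances(sample_variances, alpha):
    """Shrink each gene-specific variance toward the pooled (average) variance.

    alpha = 0 keeps the raw variances; alpha = 1 replaces them all by the pooled value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pooled = sum(sample_variances) / len(sample_variances)
    return [alpha * pooled + (1.0 - alpha) * v for v in sample_variances]

# Hypothetical raw variances for four genes; the first is unrealistically small,
# as can happen by chance when only a handful of samples is available.
raw = [0.01, 1.0, 4.0, 0.5]
shrunk = shrink_variances(raw, alpha=0.5)
```

The practical effect is that extreme variance estimates, which would otherwise dominate the discriminant score, are pulled toward a more stable common value.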
Given the seven ER+ and six ER- African-American patients, we performed a nested leave-pair-out cross-validation. Specifically, one ER+ sample and one ER- sample were left out of the training set, which was used to build the classifier. The classifier was then used on the test set that consisted of the left-out pair of ER+ and ER- patients. This gave 7 x 6 = 42 different combinations for this procedure. A permutation analysis was used to check that the error rates were not due to chance. The permutation was performed by allocating three ER+ samples and three ER- samples to one group, with the remaining seven samples in another. This procedure was repeated 100 times to obtain a permutation p-value, which was based on the number of times that the permuted misclassification rate fell below the observed misclassification rate.
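The enumeration of left-out pairs can be made concrete with a short Python sketch. The sample identifiers and the function name `leave_pair_out_splits` are hypothetical; only the group sizes (seven ER+ and six ER-) come from the text.

```python
from itertools import product

def leave_pair_out_splits(pos_ids, neg_ids):
    """Yield (training IDs, held-out pair) for every ER+/ER- pair that can be left out."""
    for pos, neg in product(pos_ids, neg_ids):
        train = [s for s in pos_ids + neg_ids if s not in (pos, neg)]
        yield train, (pos, neg)

pos_ids = [f"P{i}" for i in range(1, 8)]  # seven ER+ samples (hypothetical labels)
neg_ids = [f"N{i}" for i in range(1, 7)]  # six ER- samples (hypothetical labels)
splits = list(leave_pair_out_splits(pos_ids, neg_ids))
print(len(splits))  # 42
```

Each training set contains the remaining 11 samples, and any gene selection has to be repeated inside each split so that the held-out pair never influences the classifier.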

Gene selection
At each cross-validation run, probe sets were selected using the ratio of between-group to within-group sums of squares. We selected the top ten and 20 probe sets and performed the classification. Despite a small number of top probe sets chosen, ten probe sets had been shown to perform well in practice. 27 In addition, we standardised the expression data; that is, the observations (arrays) had mean 0 and variance 1 across probe sets.
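A minimal sketch of ranking probe sets by the ratio of between-group to within-group sums of squares is given below. The probe-set names, expression values and function names are hypothetical, and the sketch assumes every probe set has non-zero within-group variation.

```python
def bw_ratio(values, labels):
    """Between-group to within-group sum-of-squares ratio for one probe set."""
    overall = sum(values) / len(values)
    between = within = 0.0
    for k in set(labels):
        group = [v for v, y in zip(values, labels) if y == k]
        mean_k = sum(group) / len(group)
        between += len(group) * (mean_k - overall) ** 2
        within += sum((v - mean_k) ** 2 for v in group)
    return between / within  # assumes within > 0

def top_probe_sets(expression, labels, n_top):
    """Rank probe sets by BW ratio; expression maps probe-set ID to a list of values."""
    ranked = sorted(expression, key=lambda pid: bw_ratio(expression[pid], labels), reverse=True)
    return ranked[:n_top]

labels = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
expression = {  # hypothetical training-set values
    "probe_A": [5.0, 5.2, 4.9, 2.0, 2.1, 1.8],  # separates the two groups well
    "probe_B": [3.0, 1.0, 2.0, 2.5, 1.5, 2.0],  # identical group means
    "probe_C": [4.0, 4.5, 3.5, 3.0, 3.6, 2.9],  # moderate separation
}
selected = top_probe_sets(expression, labels, n_top=2)
```

In a cross-validation setting, `top_probe_sets` would be called on the training samples of each split only.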

Subset analysis
To refine our set of probe sets further, we looked at the top ten probe sets from each cross-validation run and identified a unique set that we referred to as 'uniqueAA'. We performed a subset analysis to see how many of these probe sets would be selected by chance for a similarly sized sample of Caucasians. Seven ER+ and six ER- Caucasian patients were randomly selected from a pool of 135 ER+ and 66 ER- samples by resampling 100 times. We counted and tabulated the number of times that the selected probe sets belonged to the 'uniqueAA' set. To identify the probe sets that could potentially be common targets for both African Americans and Caucasians, we investigated the probe sets that had high counts.
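The resampling-and-counting step can be sketched in Python as follows. The pool sizes (135 ER+ and 66 ER-) and subset sizes (seven and six) are taken from the text; the sample identifiers, the function `count_reselected` and the stand-in `toy_selector` are hypothetical, and a real analysis would replace the stand-in with the actual probe-set selection on each subset.

```python
import random
from collections import Counter

def count_reselected(pos_pool, neg_pool, select_top, reference_set, n_rep=100, seed=0):
    """Count how often probe sets from reference_set are selected in random subsets."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rep):
        # Draw a Caucasian subset matching the African-American sample sizes.
        subset = rng.sample(pos_pool, 7) + rng.sample(neg_pool, 6)
        counts.update(p for p in select_top(subset) if p in reference_set)
    return counts

pos_pool = [f"C_pos_{i}" for i in range(135)]  # hypothetical sample IDs
neg_pool = [f"C_neg_{i}" for i in range(66)]

def toy_selector(subset):
    # Hypothetical stand-in for selecting the top probe sets on this subset.
    return ["probe_A", "probe_Z"]

counts = count_reselected(pos_pool, neg_pool, toy_selector, {"probe_A", "probe_B"})
```

Probe sets with high counts are candidates for targets shared by both groups, while those with zero counts are candidates for being specific to the African-American samples.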

Identification of ethnic-specific probe sets
The top set of unique probe sets was further refined to determine a list of probe sets that were common in identifying both ethnic groups. Using the probe set 'uniqueAA', we looked at how this set predicted the Caucasian and the African-American samples. Misclassification rates were obtained to see how well the 'uniqueAA' set performed in classification. Since there were a larger number of Caucasian samples in the dataset, we were able to perform this by subsetting the dataset into 105 ER+ and 36 ER- samples for the training set and 30 ER+ and 30 ER- samples for the test set. This procedure was repeated 100 times.
To identify the probe sets that were unique to the African-American patients, we took the bottom two probe sets from each run of the 100 repetitions. A unique set of these probe sets was obtained, which we referred to as 'uniqueAAbottomC'. This gave us a set of probe sets that were good predictors of African Americans' ER status but were less valuable for discrimination among Caucasians. Once again, we performed classification to see if this set had any predictive ability in the Caucasian samples. If our expectation was correct, this set should give a misclassification rate of around 0.5 in the Caucasian samples; that is, close to random classification. Using 'uniqueAAbottomC', we re-ran the procedure using the African-American data. At each run, the top two probe sets from this set would be identified as potential unique targets for African Americans in this sample.

Validation
We validated our findings by taking 'uniqueAA' probe sets found from the procedure described above and mapping these probe sets to the validation dataset. Since the validation set contains only a subset of the probe sets in our primary dataset, not all probe sets are mapped. Of those that are mapped, the top ten probe sets from the new dataset were chosen. We then performed the leave-pair-out cross-validation approach on the five ER+ and 13 ER- samples to obtain a misclassification rate.

Literature mining
We utilised PubMatrix to compare the discovered gene lists from the previous section to keywords such as 'breast cancer' and 'oestrogen receptor'. Moreover, we mapped the unique probe set 'uniqueAA' described in the previous section to pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 28 and BioCarta (http://www.biocarta.com/) to see which pathway had the most probe sets mapped onto it. Fisher's exact tests were performed to assess the significance of mapping to these pathways relative to the chipset as a whole. Pathways were sets of genes that served a particular biological or physiological function.

Results
Table 1

Small sample size African-American microarrays
Table 2 shows the misclassification rates for the top ten and 20 probe sets chosen from cross-validation of different methods. All of the methods performed better when ten probe sets were chosen. For the top 20 probe sets chosen, the misclassification rate was near 0.5. The results were similar when a larger number of top probe sets were chosen. RSDDA performed best for the top ten probe sets, with a 0.32 misclassification rate. Five probe sets were selected more than 30 per cent of the time in the training sets and two of them, 1570001_at (CASP8AP2) and 202653_s_at (MARCH7), were selected over 75 per cent of the time. A permutation, as described in the Methods section, was performed to ensure that this was not due to chance. Out of 100 permutations of ER+ and ER- status, none of the misclassification rates fell below 0.32, giving a permutation p-value of 0 (that is, below 0.01 at the resolution of 100 permutations). This p-value measures the proportion of the 100 permutations in which the misclassification rate falls below the observed misclassification rate.
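The permutation p-value described above can be sketched as follows. The allocation of three ER+ and three ER- samples to one group, with the remaining seven in the other, follows the Methods; the sample identifiers, the function names and the stand-in `chance_level_error` are hypothetical, and a real analysis would re-run the full cross-validated classifier on each permuted grouping.

```python
import random

def permuted_groups(pos_ids, neg_ids, rng):
    """Mix the classes: 3 ER+ and 3 ER- samples form group A, the other 7 form group B."""
    group_a = set(rng.sample(pos_ids, 3) + rng.sample(neg_ids, 3))
    return {s: ("A" if s in group_a else "B") for s in pos_ids + neg_ids}

def permutation_p_value(observed_error, pos_ids, neg_ids, cv_error, n_perm=100, seed=0):
    """Proportion of permutations whose error rate falls below the observed one."""
    rng = random.Random(seed)
    below = sum(
        cv_error(permuted_groups(pos_ids, neg_ids, rng)) < observed_error
        for _ in range(n_perm)
    )
    return below / n_perm

pos_ids = [f"P{i}" for i in range(1, 8)]  # seven ER+ samples (hypothetical labels)
neg_ids = [f"N{i}" for i in range(1, 7)]  # six ER- samples (hypothetical labels)

def chance_level_error(groups):
    # Hypothetical stand-in for the cross-validated error on permuted groups.
    return 0.5

p_value = permutation_p_value(0.32, pos_ids, neg_ids, chance_level_error)
```

With 100 permutations the smallest non-zero value this estimate can take is 0.01, so a reported value of 0 is best read as p < 0.01.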

Subset analysis
From the small sample size classification, we identified a small group of 156 probe sets by keeping unique probe sets that were identified as top ten probe sets at each iteration during feature selection in the training set. We called this set 'uniqueAA'. A subset analysis was performed to see whether these probe sets would also be selected as top probe sets for the Caucasian samples. Only two, 229578_at and 243338_at, were selected once as the top ten in the Caucasian samples and the remaining sets were not selected at all for the top ten. When increased to 156 probe sets, by lowering the threshold, we found that 34 probe sets were selected at least once, with three pairs of probes selected at least twice and one selected five times. A set of the top genes, with probes that were mapped, can be found in Table 3.
To check for possible common targets, we used the 156 'uniqueAA' probe sets to investigate the top probe sets chosen from the run using Caucasian samples. Ten genes that are common targets were mapped and each was selected at least once. Six of these ten genes had literature evidence related to 'breast cancer' or 'oestrogen receptor' (see Table 4). For example, the testis derived transcript gene (TES) was found to be a tumour suppressor gene related to breast cancer. 29 Using the 'uniqueAA' set of 156 probe sets, we again performed classification on the African-American samples using RSDDA and found that the misclassification rate had fallen to 0.16 using a second nested leave-pair-out cross-validation with the top 100 probe sets. By comparison, when we performed the classification on the Caucasian samples, we obtained a misclassification rate of 0.23. This indicated that the 156 probe sets have much stronger discriminant power in distinguishing between ER+ and ER- for African Americans than for Caucasians.

Unique probe sets for African Americans
To refine our probe set further, to probe sets that were unique to African-American patients, we picked the bottom two probe sets from the run using Caucasian samples. We ended up with a set of 28 probe sets that gave a 0.51 misclassification rate for Caucasian samples and a 0.17 misclassification rate for African-American samples. Seven of the 28 probe sets were mapped to genes, of which four had literature evidence relating them to 'breast cancer' or 'oestrogen receptor' (see Table 4). For example, RAB31 was one gene for which there was literature evidence relating it to both breast cancer and the ER. 30,31

Validation
We mapped the 156 'uniqueAA' probe sets onto the validation dataset. Since the validation dataset came from an older Affymetrix chipset, only 63 probe sets could be mapped. Using the top ten probe sets, we obtained a misclassification rate of 0.15 in distinguishing between ER+ and ER- for African-American breast tumour samples (see Table 5).

Biological pathways
We mapped the 156 'uniqueAA' probe sets onto the KEGG and BioCarta pathways. The mitogen-activated protein kinase (MAPK) signalling pathway, the Wnt signalling pathway, purine metabolism and oxidative phosphorylation were found to have three, three, three and four mapped probe sets, respectively. The p-values for the corresponding tests of association between these pathways and the whole chipset were 0.44, 0.08, 0.07 and 0.00015 for the MAPK signalling pathway, the Wnt signalling pathway, purine metabolism and oxidative phosphorylation, respectively. In the literature, biological evidence suggests that oxidative phosphorylation and mitochondrial mutation may play a role in the development of both breast and prostate cancers in African Americans. 32
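For readers who want to see how such a pathway p-value arises, the following Python sketch computes a one-sided Fisher's exact (hypergeometric upper-tail) p-value. Whether the original tests were one- or two-sided is not stated, so the one-sided form is an assumption, as are the function name and the pathway size of 120 probe sets used in the example; the counts of 156 selected probe sets, four hits and 54,613 probe sets on the chip come from the text.

```python
from math import comb

def fisher_exact_enrichment(hits_in_pathway, selected_total, pathway_total, chip_total):
    """One-sided Fisher's exact p-value: probability of at least this many hits."""
    max_hits = min(selected_total, pathway_total)
    denom = comb(chip_total, selected_total)
    tail = sum(
        comb(pathway_total, k) * comb(chip_total - pathway_total, selected_total - k)
        for k in range(hits_in_pathway, max_hits + 1)
    )
    return tail / denom

# Small check by hand: 10 probe sets on the chip, 3 in the pathway,
# 4 selected, at least 2 of them in the pathway -> 70 / 210.
p_small = fisher_exact_enrichment(2, 4, 3, 10)

# Illustration with the counts from the text and a hypothetical pathway size of 120.
p_example = fisher_exact_enrichment(4, 156, 120, 54613)
```

Because the pathway sizes on the chip are not given in the text, `p_example` is not expected to reproduce the reported values.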

Discussion
The goal of this paper was to help to identify possible novel biomarker targets for further investigation. The percentage of breast cancer tumour samples from African-American women in the microarray data was only slightly over 5 per cent and did not reflect the age-adjusted incidence rate in the population. This paper illustrates an example of health disparities among ethnic minorities in the genomics field and a possible solution to the lack of available gene expression data with ethnic information in public repositories. Moreover, by applying the regularised shrinkage-based discriminant method, we were able to make the best use of the information from small sample size breast cancer microarray data for African Americans.

Table 3. Sixty-two probes with mapped gene symbols (out of 156 probe sets) and counts of occurrences in the top ten and top 156 using the Caucasian dataset by resampling 100 times. Columns: Common targets; Top 10; Top 156. Note: the genes below all have zero counts (i.e. genes unique to African Americans).

As seen in the Results section, RSDDA obtained the lowest misclassification rate among the methods compared. Since we had a small sample size, we performed permutation analysis and subset analysis to confirm the significance of our findings. When using the 156 probe sets identified, we were able to achieve a misclassification rate as low as 0.16 in distinguishing between ER+ and ER- among African-American patients with breast cancer. These findings were further validated using an external dataset, which gave a misclassification error of 0.15, close to what we found for the training set. Furthermore, we showed potential biological relevance of our findings using literature-mining methods and mapping genes to biological pathways.
African-American breast cancer tumours are usually more aggressive and are associated with higher mortality rates than those found in Caucasian populations. 8,9,11,12,14 Although mortality rates in both ethnic groups have declined over the past decade, in a 2002 study, African-American women still showed a 37 per cent higher mortality rate than Caucasian women. 7 Despite efforts to eliminate this disparity, the African-American population is still under-represented in clinical research protocols. This is evident in the difference in proportions between African-American and Caucasian women in the number of breast cancer samples collected, as noted in the Results section. These numbers are disproportionate when compared with the age-adjusted incidence rates of breast cancer, as cited earlier.
[Table fragment: 'Unique to African Americans'; columns 'Breast cancer' and 'Oestrogen receptor'; footnote '*Genes-to-Systems Breast Cancer (G2SBC) database']
Although the ideal situation would be to have a larger study, our approach may serve as a solution to a situation in which only a small number of microarrays is available. Additionally, the biomarkers identified should be confirmed biologically using real-time polymerase chain reaction. Another way to tackle the small sample size problem would be to perform meta-analyses. A meta-analysis can be performed on high-throughput data by pooling across different datasets and platforms to form a larger sample. One such example can be found in the paper by Ochs-Balcom et al., in which the authors looked at the association of breast cancer with a particular gene of interest. 33 While such efforts help to increase the power in discrimination, attention needs to be paid to ensure that the results are not due to batch effects.

Conclusions
Breast cancer tumours in African Americans are known to be more aggressive in nature than in the general population. 8,9,14 Few studies have been conducted to identify genes that are good at distinguishing ER+ and ER- patients among African-American women. New strategies for targeted screening and preventive measures can be employed with the identification of biomarkers that help to determine the risk associated with aggressive breast cancer in African-American women. Other factors, such as socioeconomic status or cultural background, may also contribute to higher mortality rates among African Americans, and further research examining the impact of these factors deserves attention. 34 There have been efforts to improve breast cancer screening, which can help to diagnose patients at earlier stages, but a weaker association of population screening rates with early diagnosis has been seen in African Americans compared with Caucasians. 35 Researchers have also suggested various strategies to improve patient participation in breast cancer clinical studies. 36,37 Without efforts to improve enrolment in cancer genetics registries and to provide high-quality prevention and screening, the goal of eliminating ethnic disparities in breast cancer cannot be achieved. 38,39 In this paper, we have presented the use of ER status as the binary outcome for classification. Like ER status, progesterone receptor status and Her2/neu status may also help to assess breast cancer risk and determine treatment options for patients. Given the heterogeneity of breast cancer, ethnic variations can, in part, be explained by differences in molecular and genetic clues. A recent study presented illustrative examples of how genomics could help to eliminate ethnic health disparities. 40 For example, gene expression data have provided us with new understanding of biological pathways to help to address ethnic disparities and other differences among breast cancer patients.
To facilitate better use of these high-throughput data, gene expression data uploaded to public repositories should contain corresponding ethnicity information. Apart from gene expression data, there is also a need to collect genotypic and phenotypic information to better understand and assess the risk for African-American and Caucasian patients with breast cancer in genetic association studies. 2 Answers to genetic differences across ethnicities and other risk factors influencing breast cancer incidence and survival may be elucidated with large-scale data from research efforts such as the Carolina Breast Cancer Study and the Clinical Breast Care Project. 15,41

Appendix 1
Similarly, shrinkage-based DQDA (SDQDA) is defined as

$$\arg\min_{k} \left( \sum_{j=1}^{p} (x_j - \mu_{kj})^2 \, \sigma^{-2}_{kj}(\hat{\alpha}^{*}_{k}) - \sum_{j=1}^{p} \ln \sigma^{-2}_{kj}(\hat{\alpha}^{*}_{k}) - 2 \ln \pi_k \right)$$

Regularised shrinkage-based diagonal discriminant analysis
Regularisation techniques as in [44] give rise to regularised discriminant analysis. To achieve this, we replace $\sigma^{-2}_{j}(\hat{\alpha})$ with a weighted version of SDQDA and SDLDA. For more details regarding this method, please refer to Pang et al. 27 R code for RSDDA is available from the authors upon request.