Functional constraints on the constitutive androstane receptor inferred from human sequence variation and cross-species comparisons

Members of the NR1I subfamily of nuclear receptors play a role in the transcriptional activation of genes involved in drug metabolism and transport. NR1I3, the constitutive androstane receptor (CAR), mediates the induction of several genes involved in drug response, including members of the CYP3A, CYP2B and UGT1A subfamilies. Large inter-individual variation in drug clearance has been reported for many drug metabolising enzyme genes. Sequence variation at the CAR locus could potentially contribute to variation in downstream targets, as well as to the substantial variation in expression level reported. We used a comparative genomics-based approach to select resequencing segments in 70 subjects from three populations. We identified 21 polymorphic sites, one of which results in an amino acid substitution. Our study reveals a common haplotype shared by all three populations which is remarkably similar to the ancestral sequence, confirming that CAR is under strong functional constraints. The level and pattern of sequence variation is approximately similar across populations, suggesting that interethnic differences in drug metabolism are not likely to be due to genetic variation at the CAR locus. We also identify several common non-coding variants that occur at highly conserved sites across four major branches of the mammalian phylogeny, suggesting that they may affect CAR expression and, ultimately, the activity of its downstream targets.


Introduction
The constitutive androstane receptor (CAR, NR1I3) belongs to the NR1I subfamily of nuclear receptors (NRs), members of which play a role in transcriptional activation of genes involved in drug metabolism and transport. Other members of the NR1I subfamily include the pregnane X receptor (PXR, NR1I2) and the vitamin D receptor (VDR, NR1I1). CAR is expressed in the liver and intestine and, through heterodimerisation with the retinoic acid receptor (RXR), mediates the induction of important drug metabolising enzyme (DME) genes such as CYP3A4, CYP2B6 and UGT1A1, 1 as well as drug transport genes including multidrug resistance-associated protein 2 (MRP2). 2 In addition, CAR enhances the metabolic clearance of endogenous bile acids and bilirubin, protecting against toxicity, 3,4 and has recently been implicated in thyroid hormone regulation. 5 By contrast with PXR, which is found in a variety of species including fish -and is able to bind a diverse variety of exogenous and endogenous ligands, including bile salts, 6 CAR has only been identified in mammals and is activated by a much smaller set of ligands. 7 Interestingly, only one xenosensor gene has been identified in non-mammalian species, such as chicken 8 and pufferfish. 9 The protein sequences of the chicken and pufferfish xenosensor genes are roughly equally related to those of mammalian PXR and CAR sequences. 10 Handschin et al. used several experimental approaches to confirm that the chicken has only one xenosensor gene, CXR, but were unable to identify additional genes, suggesting that CXR represents the ancestral gene that diverged into CAR and PXR in response to different environmental and nutritional factors. 10 CAR genes have been sequenced from a total of eight mammals, including the partial sequences of cow and dog analysed in this study. CAR has the typical nuclear hormone receptor organisation of an amino-terminal DNA binding domain (DBD) and a carboxy-terminal ligand binding domain (LBD); however, in comparing CAR sequences among mammals, a striking feature is high cross-species sequence divergence in the LBD. 11,12 The LBD of CAR shares amino acid identities of only 74 -79 per cent between human and rodent sequences, which are unusually low when compared with other NRs. 11,12 The variation in the LBD of CAR is more striking when DNA sequences are analysed, in particular by comparing the rate of non-synonymous (resulting in an amino acid change) and synonymous (not resulting in an amino acid change) nucleotide substitution rates. The rates of non-synonymous substitution normalised to the synonymous substitution rate, d N /d S (v), for CAR and PXR are 5.6 and 4.0 times higher than the average for other NR genes, respectively. 12 The elevated v ratios of the CAR and PXR LBDs suggest adaptive evolution 13 and may be explained in part by the biological role played by CAR and PXR in the metabolism of different xenobiotic and environmental substances, 11,12 although the nature of such ligands is currently unknown. Relaxation of functional constraints may also explain the elevated v values at CAR and PXR, but this seems less plausible, given their critical role in activating enzyme genes involved in the metabolism of important endogenous and exogenous compounds.
The role of NRs (especially PXR and CAR) in the regulation of genes involved in drug metabolism is of particular importance to clinical pharmacology. Substantial interindividual variation in mRNA expression level has been reported for CAR. 14,15 The hepatic enzyme CYP3A4which is believed to play a significant role in adult drug metabolism -exhibits wide variation in gene expression and activity, only partially explained by known genetic variants. 16 For CYP2B6, which is also regulated by CAR, large interindividual variabilities in mRNA level and activity have been reported. 17,18 Strong positive correlations in mRNA levels have been observed between CAR and CYP3A4, 19 as well as CAR and CYP2B6, 14,15 suggesting that the interindividual variability observed for some DME genes could be associated with variability in upstream regulators such as CAR. A better understanding of sequence variation in regulatory genes such as CAR, PXR or RXR, could conceivably explain increased or decreased expression and/or activity profiles of DME genes.
We characterised patterns of genetic variation in three ethnically diverse human population samples through a resequencing survey of CAR coding regions and conserved non-coding sequences (CNSs). Our study reveals a common haplotype shared by all three populations which is strikingly similar to the ancestral sequence, confirming that CAR is under strong functional constraints. In addition, the level and pattern of sequence variation is approximately similar across populations, suggesting that interethnic differences in drug response are not due to genetic variation in CAR. We also identify several common non-coding variants that occur at highly conserved sites across four major branches of the mammalian phylogeny, suggesting that they may affect CAR expression and, in turn, the activity of its downstream targets.

Re-sequencing survey design
Human primer sequences were designed based on the reverse complement of Genbank accession number AL509714. CAR human genomic sequence (AL509714) was aligned with orthologous chimpanzee (AADA01309121), dog (AAEX01049487 and AAEX01049488), cow (Scaffold 30573, 10th December, 2004), rat (AB105071) and mouse (AC084821) sequences using PipMaker. 20 All exons were included in the resequencing survey, along with intronic CNSs. The human genomic sequence was analysed using Cluster Buster 21 in order to predict clusters of the following liver-enriched transcription factors: HNF1a, HNF4, PXR/ CAR, OCT-1, PPAR/RXR, CEBP, HNF3b, HNF4/ COUP-TF. One region predicted by Cluster Buster with a high probability falls approximately 11 kilobases (kb) upstream of CAR and was included in the survey.

Population samples
Human population samples included 24 European individuals from the Centre d'Etude du Polymorphisme Humain families, 23 individuals from the Human Variation Panel of African Americans and 23 individuals from the Human Variation Panel of the Han People of Los Angeles. All DNA samples were obtained from the Coriell Cell Repository. Sample information can be found in PharmGKB (http://www.pgkb.org), as well as at the Di Rienzo laboratory website (http://genapps. uchicago.edu/labweb/programss.html). One Western chimpanzee (Pan troglodytes verus) was sequenced for use as an outgroup.
This study was carried out in accordance with the Declaration of Helsinki (2000) of the World Medical Association and was approved by the Institutional Review Board of the University of Chicago.

Data analysis
Summary statistics of DNA sequence variation were calculated using SLIDER. Haplotypes were inferred using PHASE2 23,24 in each population sample separately. F ST , a measure of allele frequency variation among samples, was calculated between all pairwise combinations of populations for each polymorphic site. 25 Individual transcription factor binding sites were predicted using the liver-specific profiles available with Match 1.0. 26 Sorting Intolerant From Tolerant (SIFT) 27 , version 2 was used to predict the consequence of the single amino acid substitution. Given an amino acid sequence, SIFT searches for and performs alignments of similar sequences, then calculates the probabilities for amino acid changes affecting protein function based on cross-species conservation at each position. ESEfinder was used to predict the locations of exon splice enhancers. 28 Human sequence comprising the entire resequenced region was aligned to the genomic sequences from the five mammalian species listed above using Multi-Lagan. 29 Phylogenetic analysis by maximum likelihood Sequences were aligned with Clustal X. Estimation of v ratios was carried out by maximum likelihood using a codon-based substitution model in Phylogenetic Analysis by Maximum Likelihood (PAML) version 3.13. 30 The input to PAML is a treefile of the phylogeny of the sequences to be studied and a file with aligned sequences. The phylogeny is based on the known phylogenetic relationships between the species to be studied, determined by a consensus of morphological and molecular data.
PAML can determine estimates of v for models of varying complexity. The most commonly applied models are as follows (the PAML model numbers are shown in parentheses; 'sites' refers to codons): 31,32 model M0 (null model with a single v ratio among all sites); M3 ('discrete' model, with two or more categories of sites with the v ratio free to vary for each site); M7 ('beta model', ten categories of sites with ten v ratios in the range 0 -1 taken from a discrete approximation of the beta distribution); and M8 ('beta plus v model, ten categories of sites from a beta distribution as in M7 plus an additional category of sites with a v ratio that is free to vary from 0 to greater than 1). PAML estimates the v ratios that are allowed to vary in these models, as well as the proportion of sites (codons) with each ratio.
Of the PAML models listed above, only M3 and M8 can detect positive selection (ie v . 1). Each PAML model generates a log-likelihood, indicating how well the models fit the input data. Some PAML models are 'nested' within each other (eg M0 within M3, M7 within M8). In those cases, twice the log-likelihood difference between the two models is compared with a x 2 distribution with degrees of freedom equal to the difference in degrees of freedom between the two models; p values for sites potentially under positive selection are obtained using a Bayesian approach in PAML. 33 The accuracy and power of PAML models increases with more sequences and longer-length sequences. 34 Simpler PAML models are preferred unless a more complex model fits the observed data significantly better.
Analysis was performed on programslicly available coding sequences, either from cDNA sequences or predicted from genomic DNA sequences. Four other genes in the NR1 family of NRs were chosen for comparison with CAR: the gene encoding thyroid receptor-a (TRa), the gene encoding farnesoid X receptor (FXR), VDR and PXR. These genes all have sequence data for four or more mammalian species. VDR and PXR are the genes most closely related to CAR; these three genes are classified in the NR1I subfamily. Separate PAML analyses were performed on datasets of full-length sequence and restricted to either DBD or LBD, the two major domains of NRs. Information regarding all sequences used for PAML analysis can be found in Supplementary Table 1.

Survey design and summary statistics
The design of the resequencing survey was based in part on cross-species sequence conservation, as shown in Figure 1. The total evolutionary time spanned by the six mammalian species represented in the alignment is greater than 370 million years. 35 Thus, elements showing a high degree of sequence similarity across this set of species are more likely to be due to active conservation resulting from functional constraints (eg regulation of CAR expression) rather than limited divergence time. The predicted gene located downstream of the CAR locus (FLJ12770; Genbank accession NM_032174) is likely to account for some of the increased sequence identity flanking CAR exon 9 ( Figure 1). BLASTing the mRNA sequences of the two genes against one another indicates that the 3 0 ends of the oppositely transcribed genes do in fact overlap, although the coding regions do not. We also used predictions of clusters of liver entyme enriched transcription factor binding sites to guide our selection of genomic segments for resequencing. One cluster was predicted with high probability approximately 11.4 kb upstream of the CAR transcription start site; a 364 base pair (bp) segment encompassing this region was included in our survey. A total of 4.1 kb were surveyed in each of three human population samples (African Americans, European Americans and the Han People of Los Angeles), including 1,051 bp of coding sequence and 3,053 bp of CNS. The same segments were sequenced in one chimpanzee, revealing that human -chimpanzee divergence at CAR is relatively low (0.76 per cent) compared with the genome-wide average of 1.24 per cent. 36 Summary statistics of sequence variation at the CAR locus are shown in Table 1. Levels of polymorphism are summarised by nucleotide diversity (p) -based on the mean number of pairwise differences between samples -and u W , based on the number of polymorphic sites and sample size. 39 The frequency spectrum of polymorphic sites at a given locus is captured by the Tajima's D statistic. 40 An excess of rare variants is indicated by a negative Tajima's D value, while a positive value signals an excess of intermediate frequency variants. The Tajima's D values observed for the non-African populations are not significantly different from zero and indicate that the spectrum of allele frequencies in these populations is consistent with the neutral-equilibrium expectation. The more negative Tajima's D value observed in the African-American sample may be reflective of recent population growth or admixture between African and European populations. 41 Many standard neutrality tests are based on the assumption of populations at equilibrium. Most human populations, however, do not fit these assumptions, making a comparison of our results with an empirical distribution a more appropriate measure of deviations from neutrality. To this end, we compared our results with the resequencing data of the UW-FHCRC Variation Discovery Resource (Seattle SNP). The European and African-American samples from the Seattle SNP dataset are the same as those used in our study, resulting in a direct comparison; however, no Asian data are available from this project. Our results for nucleotide diversity (p and u W ), Tajima's D, and u W normalised by human -chimpanzee divergence (to account for differences in the neutral mutation rate across loci) were all compared with the distribution of these values from 159 genes in the Seattle SNP dataset -the results are shown in Table 1. Although the Tajima's D value in the African-American sample is at the low end of the distribution, most of the CAR statistics are well within the range observed in the larger dataset, suggesting that it evolved largely according to neutral expectations.

Coding and non-coding sequence variation
We identified a total of 21 polymorphic sites, including one singleton 24 bp indel. It is interesting that six out of 21 (29 per cent) single nucleotide polymorphisms (SNPs) occur within 374 bp in intron 2, and that five of the six disrupt predicted transcription factor binding sites ( Table 2). Of the 21 SNPs, 17 fell into non-coding sequence and four into coding, of which three were synonymous changes and one was non-synonymous. The non-synonymous SNP, Arg97Trp, is located in the , 20 amino acids comprising the linker region between the DBD and LBD and results in an intolerant change, as predicted by SIFT. 27 The Arg97Trp SNP has a SIFT score of 0.00, although, since the amino acid alignment could contain only mammalian sequences, there was insufficient power to assign a quality score based on evolutionary conservation. A predicted exon splice enhancer is also disrupted by the nonsynonymous SNP 28 (data not shown).
The ancestral allele at each polymorphic site was inferred by comparison to the chimpanzee sequence. Complete conservation of the ancestral allele across the three non-primate lineages included in our study (carnivore, artiodactyl and rodent) was observed for eight of the 21 SNPs (excluding sites where the position was missing in any lineage) ( Table 2). The ancestral haplotype is observed at high frequency in the African-American sample and is present in the Asian sample. Functional constraints on the constitutive androstane receptor Review PRIMARY RESEARCH  Five SNPs have a combined allele frequency of 10 per cent or higher across all populations, while the remaining 16 are rare. Four of the five common SNPs fall in positions that are conserved across all three non-primate lineages. If a gene is evolving neutrally, the ratio of within-species polymorphism to between-species divergence is expected to be similar across functional classes of sites -for example, synonymous and non-synonymous sites; deviation from this expectation can be detected using the McDonald -Kreitman (MK) test. 42 Any two distinct classes of sites can be assessed in this format, where skews in the ratio of levels of polymorphism and divergence are informative about the type of selection acting on the locus. The MK test applied to synonymous and non-synonymous sites was not significant. Next, we applied the test to entire surveyed segments, including coding and non-coding regions, by partitioning the sites into two classes defined by cross-species conservation. In this context, sequence conservation is taken as a measure of the probability of a site being functionally important. Sites were classified as conserved only if the ancestral allele was conserved across the three non-primate lineages; all other site configurations were considered non-conserved. For the classification of fixed differences, the human -chimpanzee alignment was used to locate all divergent sites; these were called either conserved or non-conserved, based on the same criteria as described for the polymorphic sites. We constructed a 2 £ 2 table with all polymorphic sites and observed no significant deviations from neutral expectation. Because slightly deleterious polymorphic sites tend to be rare, 43 we repeated the test by partitioning the polymorphic sites based on their frequency. As shown in Table 3, a significant deviation was observed in the distribution of common polymorphisms ( p ¼ 0.02), suggesting different selective pressures on rare versus common amino acid variants.

Haplotype structure and interpopulation differentiation
The inferred haplotypes shown in Figure 2 reveal a number of common haplotypes, but no defined structure overall at this locus. One haplotype, which is only one step away from the ancestral, is the most commonly observed among non-African samples and occurs at high frequency in the African- Table 3. McDonald -Kreitman test comparing levels of human polymorphism and divergence between human and chimpanzee at conserved versus non-conserved sites across distantly related mammals.

Non-conserved a Conserved
Rare polymorphic sites Common polymorphic sites a If the allele in chimpanzee (ancestral) was identical to that found in the same position in rodent, artiodactyl and carnivore, the site was considered conserved across three lineages. b The singleton indel was not included in this analysis. For Rare polymorphic sites, Fisher's exact test (FET) p ¼ 0.71; for common polymorphic sites, FET p ¼ 0.02; For rare polymorphic sites, polymorphisms identified in the resequencing survey were classified as either rare (minor allele frequency ,10 per cent in the combined sample) or common (minor allele frequency $10 per cent in the combined sample). Functional constraints on the constitutive androstane receptor Review PRIMARY RESEARCH American population. In all population samples, the vast majority of haplotypes (. 90 per cent) contain two or fewer changes from the ancestral. The strong constraint, as well as similar nucleotide diversity levels and haplotype structure across populations, suggests that the forces shaping diversity at this locus are not population specific.
The F ST statistic, which measures variation in allele frequency between populations, is useful for quantifying interpopulation differentiation. 44 We estimated F ST per site between pairs of population samples. At the CAR locus, only one SNP identified in this study has an F ST value greater than the 0.123 average estimated by Akey et al. 45  Phylogenetic analysis by maximum likelihood of CAR genes Previous phylogenetic investigations of CAR genes have only looked at two-sequence comparisons in calculating v ratios. 12,46 We performed phylogenetic analysis on amino acid sequences of several NRs in order to detect deviations in the rate of amino acid substitutions using PAML. Table 4 shows the results of PAML analysis of sequence data for CAR and four other genes in the NR1 family which have complete sequence data for four or more mammals: TRa, NR1A1; FXR, NR1H4; VDR, NR1I1; and PXR, NR1I2. A total of three analyses were performed for each gene: full-length sequence, DBD only and LBD only.
The analyses for the full-length sequence, DBD and LBD for the TRa, FXR and VDR genes show low v ratios, consistent with strong purifying (negative) selection as the dominant evolutionary force for these genes. The CAR and PXR analyses, for the full-length sequence and LBD show a minority of codons (sites) with v ratios approaching one or, in the case of PXR, exceeding one, however (results in bold in Table 4). By contrast, the v ratios for the DBD are low for CAR and PXR. These results illustrate that CAR genes have an elevated v ratio across the mammalian species for which CAR sequence data are available, compared with other NR genes. In this regard, CAR is similar to PXR, a closely related NR that also responds to diverse exogenous and endogenous ligands.

Discussion
Our survey of sequence variation at the CAR locus in three ethnically diverse human populations identified a number of polymorphic sites which are likely to affect function, including four common polymorphisms at highly conserved sites across distantly related mammalian species and one amino acid substitution. Most of the SNPs identified are rare and population specific, while the intermediate frequency SNPs tend to be shared by all three populations at similar frequencies.
These results suggest that CAR evolved under strong functional constraints and that interethnic variability in drug response is not likely to result from genetic variation at this locus. As an upstream regulator of many genes involved in drug metabolism and detoxification, as well as a regulator of bilirubin and thyroid hormone metabolism, CAR may be more likely to be strongly constrained and may not be as subject to interpopulation differences. The role of CAR in bile acid homeostasis -a mechanism unlikely to change across populations -as well as its obligate heterodimerisation with RXR, corroborate the idea that this gene is under strong evolutionary constraint. We observed only one non-synonymous variant in our survey of 70 individuals, suggesting that amino acid polymorphisms do not contribute to the substantial interindividual variability reported for CAR. 14,15 Alternative splice forms may underlie considerably more variation with regard to changes in expression pattern, expression level, protein binding and function. Many nuclear hormone receptors, including CAR, 47,48 have alternate splice forms, a mechanism which presumably offers greater complexity for this class of genes. No nucleotide variation was observed in any of the reported sequences spliced into alternate forms of CAR. Alternatively, variation in the expression level of CAR may be related to the collective effects of multiple rare polymorphisms; combinations of common and rare variation have recently been demonstrated to contribute to functional variation 49 and may play an important role in clinical drug response.
The results of a resequencing survey at PXR revealed similar findings with regard to human variation. 50 In this study, coding regions, as well as the 5 0 and 3 0 untranslated regions, the promoter and 50 bp of flanking intronic sequence, were surveyed in 170 chromosomes; only three non-synonymous SNPs were identified and two were found at low frequencies. None of the non-synonymous SNPs were located in the LBD. The diversity of the CAR and PXR LBDs between mammalian species, as evidenced by the PAML results, contrasts with the PXR and CAR resequencing studies showing very low nucleotide diversity in the CAR coding region. These results suggest that PXR and CAR may have critical roles in humans that do not vary across ethnic groups but that differ with regard to functions and/or ligands across species.
The disproportionately large number of polymorphisms located within 374 bp of intron 2 is striking and deserves further investigation. All of these SNPs fall in positions that are conserved across at least two non-primate lineages, suggesting a high level of conservation in this region overall. Estimates of diversification times of placental mammals suggest that rodents, carnivores and artiodactyls shared a common ancestor approximately 90 million years ago, leading to a total common ancestry spanning nearly 360 million years, 35 highlighting the potential importance of sequences conserved across four deep branches of the mammalian phylogeny and over such a length of evolutionary time. Five of the six SNPs disrupt predicted liver-enriched transcription factor binding sites. The results of this analysis suggest that intron 2 may contain regulatory elements conserved across several mammalian lineages.
We observe a skewed distribution of common polymorphisms relative to fixed substitutions between human and chimpanzee at sites highly conserved across distantly related mammals. This can be interpreted in several ways. One possibility is that we observe a deficit of polymorphism in non-conserved sequence, which might indicate increased evolutionary constraints in humans relative to the other mammals. Alternatively, the deviation may be due to an excess of fixed substitutions between human and chimpanzee at non-conserved sites. Perhaps a more likely explanation is that the observed deviation is due to an excess of common polymorphisms at conserved sites, suggesting that they are maintained at intermediate frequencies across populations. This implies a functional role for these SNPs. While these interpretations are highly speculative, they suggest that it may be interesting to investigate these common variants through functional assays and association studies.
Overall, our comparative genomics-based survey of sequence variation at the CAR locus reveals a history largely characterised by strong evolutionary constraints between species and across three population samples, with most F ST values falling well below the genome-wide average. Interestingly, however, there is a suggestion in the phylogenetic and the population genetics analyses that a small portion of sites might have evolved adaptively. This, coupled with the finding of a relatively large number of polymorphic sites in a small conserved non-coding region in intron 2, provides further motivation for functional studies of CAR variation.