
Fine-scale population structure of Malays in Peninsular Malaysia and Singapore and implications for association studies


The fine-scale population structure of Malays, the major population in Malaysia, has not been well studied. This may have important implications for both evolutionary and medical studies. Here, we investigated the population sub-structure of Malays using 431 samples collected from all states of Peninsular Malaysia and from Singapore. We identified two major clusters of individuals corresponding to the north and south of Peninsular Malaysia. On an even finer scale, the genetic coordinates of the geographical Malay populations correlate with latitude (R2 = 0.3925; P = 0.029). This finding is further supported by the pairwise FST of Malay sub-populations, of which the north and south regions showed the highest differentiation (FST [North–south] = 0.0011). The collective findings therefore suggest that the Malays are more heterogeneous than previously expected, even within a small geographical region, possibly due to factors such as different genetic origins and geographical isolation. Such sub-structure could result in spurious associations, as demonstrated in our analysis. We suggest that caution be taken at the study design stage and when interpreting association signals in the disease mapping studies that are expected to be conducted in Malay populations in the near future.


Malaysia, a multi-ethnic, multi-lingual, multi-cultural and multi-religious country, is located at the crossroads of Southeast Asia. It is separated by the South China Sea into two land masses, namely Peninsular Malaysia and East Malaysia (located on the island of Borneo). Malaysia has a total population of about 30 million people, of which approximately 26 million live in Peninsular Malaysia. Among the major populations in Peninsular Malaysia, the Malays are the largest ethnic group and make up 63% of the total population, followed by Chinese, Indians and other minority ethnic groups.

Many Malays are of Malayo-Polynesian (Austronesian) origin and are culturally and historically heterogeneous [1]. The Malays from the west coast of Peninsular Malaysia are historically linked to Sumatera across the Straits of Malacca, while those from the south are thought to have migrated from Jawa, Sulawesi and other parts of Indonesia [2]. The Malays from the north of the Peninsula have a closer affinity to the Malay Muslims of southern Thailand due to geographical proximity. The history of Singapore was inseparable from that of Peninsular Malaysia from the first century until 1965, when Singapore became an independent republic. It is therefore very likely that the Malays of Singapore share a similar history of origin with those from the southern part of Peninsular Malaysia [3].

Earlier studies indicated potential genetic sub-structure among the different groups of Malays from Peninsular Malaysia [2, 4, 5], which could be attributed to the migration history of the respective sub-groups. However, the fine-scale sub-structure of the Malay population remains poorly described, particularly because previous studies were based on very limited sample sizes. Such sub-structure can confound genetic association studies, in particular genome-wide association studies (GWAS), leading to spurious association signals [6]. As the Malays are one of the major populations in Southeast Asia, characterizing their population sub-structure is crucial for designing, analyzing and interpreting any genetic association study in this region.

In this study, we showed that the genetic diversity and population sub-structure of the Malays from Peninsular Malaysia are correlated with geographical latitude. Notably, the main differentiation was observed between populations from the north and south of Peninsular Malaysia. In addition, simulation analyses revealed that genetic association results are greatly affected by population sub-structure, suggesting that consideration of population stratification at the study design stage and careful interpretation of association signals are necessary when mapping complex diseases in Malay populations.


Population substructure

We first compared the genetic diversity of the Peninsular Malays on a global scale with 6 populations from HapMap3: YRI, CHB, JPT, CEU, MEX and GIH. The PC plot indicated that the Malays clustered closely with the East Asian populations, as expected, and showed rather small genetic diversity. Several Malay individuals from northern Peninsular Malaysia (PMM) showed closer affinity to the South Asian population (GIH) (Fig. 1a). We then performed PCA for the Peninsular and Singapore Malays, which revealed a seemingly homogeneous cluster (Additional file 1: Figure S1). However, some differentiation was observed corresponding to three geographical regions (north, center and south), although samples from the center region were scattered (Fig. 1b). We subsequently excluded the samples from the center region (Pahang and Selangor) and re-ran smartPCA. Two clusters were observed, representing the north and south regions, respectively (Fig. 1c).

Fig. 1

Principal Component Analysis (PCA). (a) Global PCA including populations from HapMap3. GIH, Gujarati Indians in Houston; CEU, Northern and Western European from the CEPH collection; YRI, Yoruba in Ibadan, Nigeria; CHB, Chinese in Beijing; JPT, Japanese in Tokyo; MEX, Mexican ancestry in Los Angeles; MAS, Metropolitan Malays from Singapore; PMM, Malays from Peninsular Malaysia. The Malay populations are of East Asian descent. (b) PCA plot including samples categorized into North vs Centre vs South; (c) PCA plot including only North vs South. Symbols in red represent the northern region; symbols in blue represent the southern region. Several outliers were excluded from the PCA plot

In the ADMIXTURE analysis, a significant difference was observed between the Malays from the north and south in the major component, with 57% and 65% in the north and south, respectively (P < 0.0001; Fig. 2). At K = 3, the newly appearing component (denoted in green) was slightly higher in the central Malays than in the south Malays (6.8% vs 3%; P = 0.0415).

Fig. 2

ADMIXTURE analysis of the Malay populations classified according to region. The bottom plots, expressed as percentages (Y-axis), indicate the average ADMIXTURE values for each region

Correlation of genetic and geographic coordinates

Given that both PC1 and the ADMIXTURE analysis showed significant differences between northern and southern Malay samples, we then investigated whether the genetic diversity among these Malay sub-structures in Peninsular Malaysia was related to geographical coordinates. Average PC1 values of southern Malay samples (corresponding to Fig. 1b) were generally less than 0 (except for Johor), whilst all geographically defined northern regions had PC1 > 0 (Fig. 3). When we compared PC1 with the geographical latitude of the sample locations, a significant correlation was observed (R2 = 0.3925; P = 0.029; Fig. 4). Peninsular Malaysia is divided geographically into a west coast and an east coast by the Titiwangsa Range. We therefore asked whether the genetic diversity could be attributed to geographical longitude as well. Analysis of PC1 against geographical longitude, however, showed no significant correlation (R2 = 0.0066; P = 0.7924; Additional file 1: Figure S2). We also evaluated whether the genetic diversity was related to the geographical distance between two populations, but found no significant correlation between the FST between populations and the geographical distance between them (R2 = 0.01918; P = 0.1385; Additional file 1: Figure S3).

Fig. 3

Average PC1 values of the Malay sub-populations from Peninsular Malaysia and Singapore. Standard error of each population is indicated. The PC1 values correlated well to the geographical locations of each population except for Johor

Fig. 4

Correlation between PC1 and latitude coordinate (P = 0.029)

Genetic differentiation between northern and southern Malays

The regional FST values indicated the highest regional differentiation between the north and the south after 1,000 bootstrap replicates (FST = 0.001; CI = 2.07E-04) (Table 1; Additional file 2: Table S1). To further identify the genomic regions that are highly differentiated between northern and southern Peninsular Malays, we computed the FST values of the 41,400 SNPs between northern and southern samples, and identified 428 SNPs in the top 1% of FST values (Additional file 2: Table S2), of which 80 (0.1%) had an FST value >0.05 (Table 2). The SNP with the highest FST value was rs4149264, residing in the candidate gene ABCA1, a major gene responsible for high-density lipoprotein cholesterol (HDL-c) synthesis. Another highly differentiated SNP, rs4148475, is located in the candidate gene ABCC4. This gene is a member of the superfamily of ATP-binding cassette (ABC) transporters, which may play a role in cellular detoxification [7]. A missense variant, rs1056836, is one of the four highly differentiated SNPs; it leads to a change of valine to leucine in the candidate gene CYP1B1 and had a minor allele frequency of 0.48 and 0.19 in northern and southern Peninsular Malays, respectively (FST = 0.2037). This candidate gene is involved in drug metabolism and in the synthesis of cholesterol, steroids and lipids. It has been found to play a role in susceptibility to glaucoma [8, 9]. We performed an enrichment analysis with DAVID by including the top 1% of highly differentiated variants, but identified no significant enrichment after Benjamini correction (Additional file 2: Table S3).

Table 1 Pairwise FST bootstrap values of the Malay between the 3 regions of Peninsular Malaysia
Table 2 Top 0.1% of SNPs that are highly differentiated between the Malays from the northern and southern regions of Peninsular Malaysia (total SNPs = 42,633)

We observed that 1,666 SNPs had different minor alleles in the north and south Malays, and their allele frequencies in Malays were compared with those in South Asians (GIH) and East Asians (CHB) (Additional file 2). Although not substantial, differences in allele frequencies were observed between the South and East Asians, as well as between the Malays and both South and East Asians. Notably, rs1126965, located in the candidate gene SLC39A11, had an alternative allele frequency of 0.8178 in the northern Malays and 0.4965 in the southern Malays. This gene has been reported to be associated with liver enzyme levels and smoking initiation [10, 11]. Whether or not this gene is under positive selection in the Malays, however, remains to be investigated. We subsequently assessed whether these SNPs play a role in phenotypic association, and found that 19 of them were reported in the GWAS catalogue (Additional file 4).

To evaluate the potential effect of population sub-structure on a disease association study, a series of computer simulations was carried out with PLINK following a case–control GWAS design (Additional file 2: Table S4). The GWAS simulations revealed that the effects on the false positive rate and statistical power were greater than expected [12].


We demonstrated in this study that the Malays from Peninsular Malaysia and Singapore are genetically sub-structured. Although genetic correlation with geographical latitude had been previously reported in Chinese populations [12, 13], it is surprising to find such differentiation among Malay populations within a region as small as Peninsular Malaysia and Singapore (~800 km from north to south). In addition, the FST between the north and south Malays was similar to that reported earlier between northern and southern Han Chinese (FST = 0.0011) [12] but lower than that within Europeans (FST = 0.0033) [14]. However, we observed higher diversity within the sub-structures of the Malays. For instance, the FST between two northern Peninsular Malay populations, from Kedah and Kelantan, was 0.017 (Table S4), which is in line with the finding of a recent study [4]. This suggests higher heterogeneity among Malays than previously expected, possibly due to recent migration and gene flow from the surrounding populations in this region.

The Pahang Malays were found to have a closer affinity to the north, although they were classified as belonging to the central region in this study. This is likely because the samples were collected from Federal Land Development Authority (FELDA) settlers in Pahang state, the majority of whom originated from Kelantan. On a separate note, Selangor, the most developed and most populated state of Malaysia, is where the metropolitan city of Kuala Lumpur is located. PCA revealed that samples from this population were scattered across both the north and south regions (Fig. 1b & c). We believe that urbanization has likely blurred the boundaries. Similar findings were reported by Xu et al. (2009) [12], where populations from metropolitan areas showed a more complicated composition with multiple ancestral origins compared with those from the rest of the area.

Essentially, identifying a panel of ancestry informative markers (AIMs) would be an ideal strategy to correct for population stratification in future genetic association studies [15]. However, the SNP coverage and the sample size in the current study are insufficient for this purpose. The SNPs that are highly differentiated between the north and south Malays could be due to genetic drift or, to a lesser extent, natural selection. These SNPs, however, may be considered a putative set of AIMs for the Malay populations. The candidate gene ABCA1 is a major gene that plays an important role in high-density lipoprotein cholesterol (HDL-c) synthesis and cholesterol transport [16]. Whilst we suspect that genetic drift is the likely cause, the reason this gene is highly differentiated between northern and southern Malays remains to be investigated. Caution should be taken, though, when positive signals for HDL-c and ABCA1 are identified in genetic association studies of Malays.

We acknowledge several limitations in this study. Sample sizes from several locations were small, and might therefore have produced outliers that confounded the correlation between genetic differentiation and geographical coordinates. In addition, self-reported ancestry might have confounded the analysis when individuals were assigned to their respective state of origin. However, the number of samples, covering all states in Peninsular Malaysia (and Singapore), and the number of markers used in our study are larger than in previous reports, and thus provide further insights into the genetic structure of the Malays in Peninsular Malaysia. Notably, we revealed a close relationship between genetic and geographical coordinates in the Malay populations. In addition, our results indicate the extent to which admixture in Southeast Asia could influence population stratification and thus affect genetic association studies. We therefore call for attention to alternative strategies for disease mapping in genetically complex populations, particularly those from Southeast Asia.


In summary, we revealed that the population sub-structure of the Malays is correlated with latitude. The genetic diversity of the Malays is more heterogeneous than previously expected, and we showed that such population sub-structure, even within a small geographical region, may lead to spurious signals in disease-based genetic association studies. Caution should therefore be taken when designing such studies.


Population and samples

A total of 431 Malay samples were included in this study. These samples were from self-identified Malays from Peninsular Malaysia, 116 of which were genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0, whilst the remaining samples were genotyped with the Illumina 660W (sample size, N = 90) and Illumina Omni Express (N = 119). An additional 17 Malay samples from Kelantan genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 [17], and 89 samples of metropolitan Malays from Singapore (SGVP) [3], were also included in this analysis. The studies were approved by the research and ethics committees of Universiti Teknologi MARA and Universiti Sains Malaysia, and the design of this study followed the Helsinki Declaration 1975, as revised in 2000. The collected samples covered all 11 states of Peninsular Malaysia (Fig. 5), which were divided into 3 geographical regions for the purpose of this study, namely the North, South and Centre regions, according to their respective latitude coordinates. The number of samples and their geographical locations are listed in Table 3. Six selected populations involving 805 samples from the International HapMap Project 3 (HapMap 3) [18] were included in the analysis to characterize the genetic variation of the Malays on a global scale: YRI, GIH, CEU, CHB, JPT and MEX.

Fig. 5

The geographical map of Peninsular Malaysia. The sampling locations are shown in red dots

Table 3 Regional categorization of the Peninsular Malaysia states according to geographical location and final number of samples included after QC

Data assemblage

Data QC and assemblage were carried out with PLINK 1.07. Datasets from each platform were first filtered to remove individuals with a missing rate >10%, and SNPs with a missing rate >10%, minor allele frequency (MAF) <0.05, or Hardy-Weinberg Equilibrium (HWE) P < 0.002. The filtered datasets were then merged, yielding 42,633 SNPs shared among all 402 Malay samples. The dataset was further pruned by removing SNPs with r2 > 0.8, leaving a total of 41,400 SNPs for further analyses.
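The filtering thresholds above can be illustrated with a minimal Python sketch. This is not the PLINK implementation: the function names are hypothetical, genotypes are assumed to be coded as 0/1/2 copies of one allele with `None` for a missing call, the HWE check is a simple chi-square goodness-of-fit test with 1 degree of freedom rather than an exact test, and the order of the filters (individuals first, then SNPs) is an assumption.

```python
import math

def call_rate_ok(calls, max_missing=0.10):
    """True if the fraction of missing calls (None) does not exceed max_missing."""
    missing = sum(1 for g in calls if g is None)
    return missing / len(calls) <= max_missing

def maf(calls):
    """Minor allele frequency from 0/1/2 genotype codes, ignoring missing calls."""
    obs = [g for g in calls if g is not None]
    p = sum(obs) / (2 * len(obs))
    return min(p, 1 - p)

def hwe_pvalue(calls):
    """Chi-square (1 df) goodness-of-fit P value against Hardy-Weinberg proportions."""
    obs = [g for g in calls if g is not None]
    n = len(obs)
    p = sum(obs) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [obs.count(0), obs.count(1), obs.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df
    return math.erfc(math.sqrt(chi2 / 2))

def qc_filter(genotypes, max_missing=0.10, min_maf=0.05, min_hwe_p=0.002):
    """genotypes: one list of SNP calls per individual.
    Returns (indices of kept individuals, indices of kept SNPs)."""
    keep_ind = [i for i, row in enumerate(genotypes)
                if call_rate_ok(row, max_missing)]
    keep_snp = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_ind]
        if not call_rate_ok(column, max_missing):
            continue  # SNP missing rate > 10%
        if maf(column) < min_maf:
            continue  # rare or monomorphic SNP
        if hwe_pvalue(column) < min_hwe_p:
            continue  # deviates from Hardy-Weinberg proportions
        keep_snp.append(j)
    return keep_ind, keep_snp
```

The sketch omits the merging of platforms and the r2-based pruning step, both of which depend on details not given here.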

Analysis of population structure

Principal Component Analysis (PCA) was first carried out using smartPCA in the EIGENSOFT package (ver 4.0). The genetic components of the Malay populations were inferred with ADMIXTURE ver 1.22 (Alexander et al., 2009) [19], using the 41,400 SNPs overlapping across all samples.
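As a rough illustration of what a genotype PCA computes, the sketch below standardizes each SNP by its allele frequency and extracts the leading axis of the individual-by-individual similarity matrix by power iteration. It is a simplified stand-in and not the smartPCA algorithm: the function names are hypothetical, complete 0/1/2 genotype data are assumed, and only the first component is computed.

```python
import math

def standardize(genotypes):
    """Center each SNP at 2p and scale by sqrt(2p(1-p)).
    Monomorphic SNPs are dropped. Returns an individuals x SNPs matrix."""
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2 * n)
        if 0 < p < 1:
            sd = math.sqrt(2 * p * (1 - p))
            columns.append([(g - 2 * p) / sd for g in col])
    return [list(r) for r in zip(*columns)]

def first_pc(genotypes, iterations=200):
    """Leading eigenvector of the individual similarity matrix (unit length)."""
    z = standardize(genotypes)
    n, m = len(z), len(z[0])
    # Similarity between individuals i and j, averaged over SNPs
    k = [[sum(a * b for a, b in zip(z[i], z[j])) / m for j in range(n)]
         for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

With two groups of individuals that differ systematically in allele frequency, the entries of the returned vector take opposite signs in the two groups, which is the kind of separation along PC1 described for the northern and southern samples.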

Latitude-PC correlation

Pearson’s correlation coefficient was calculated to evaluate the relationship between the genetic coordinates (PC values) and the geographic latitudes.
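A minimal sketch of this calculation is given below. The function names are hypothetical, and the P value itself is not computed because it requires the t distribution, which is not in the Python standard library; the t statistic with n - 2 degrees of freedom is returned instead.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Illustration only: R2 = 0.3925 is the reported value, while n = 12
# sampling locations (11 states plus Singapore) is an assumption.
r = math.sqrt(0.3925)
t = t_statistic(r, 12)  # roughly 2.54 on 10 degrees of freedom
```

In practice `pearson_r` would be applied to the average PC1 value and the latitude of each sampling location.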

Pairwise FST

Unbiased estimates of FST were calculated according to Weir and Hill (2002) [20], with confidence intervals estimated by bootstrapping with 1,000 replications.
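The bootstrap procedure can be sketched as follows. Note that this sketch swaps in a simpler Hudson-style ratio-of-averages estimator for two populations, not the Weir and Hill estimator used in the study. The function names, the resampling of SNPs (rather than individuals) and the percentile interval are all assumptions made for illustration.

```python
import random

def hudson_terms(p1, p2, n1, n2):
    """Numerator and denominator of a Hudson-style FST for one SNP.
    p1, p2: sample allele frequencies; n1, n2: sampled chromosomes."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den

def fst(freqs, n1, n2):
    """Ratio-of-averages FST over a list of (p1, p2) pairs."""
    terms = [hudson_terms(p1, p2, n1, n2) for p1, p2 in freqs]
    return sum(t[0] for t in terms) / sum(t[1] for t in terms)

def bootstrap_ci(freqs, n1, n2, replicates=1000, alpha=0.05, seed=1):
    """Percentile confidence interval from resampling SNPs with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        sample = [rng.choice(freqs) for _ in freqs]
        estimates.append(fst(sample, n1, n2))
    estimates.sort()
    lower = estimates[int(replicates * alpha / 2)]
    upper = estimates[int(replicates * (1 - alpha / 2)) - 1]
    return lower, upper
```

Summing numerators and denominators across SNPs before dividing keeps rare variants from dominating the estimate; the sample-size correction terms mean the estimate can be slightly negative when two populations have identical frequencies.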

GWAS simulation

Simulations on genome-wide association study (GWAS) were performed using PLINK 1.07, following the procedure of Xu et al. (2009) [12].
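The effect that such simulations probe can be shown with a small, self-contained sketch. It is not the PLINK procedure referred to above: it draws allele frequencies for two subpopulations from a Balding-Nichols (beta) model, samples cases and controls in chosen proportions from each subpopulation, and counts how often an allelic chi-square test rejects at the 5% level for SNPs with no true effect. All function names and parameter values are hypothetical; in particular, FST = 0.01 is deliberately larger than the value reported between northern and southern Malays so that the inflation is visible with few samples.

```python
import random

CHI2_CRIT_1DF_05 = 3.841  # upper 5% point of chi-square with 1 df

def subpop_freqs(rng, fst):
    """Two subpopulation allele frequencies under a Balding-Nichols model."""
    p = rng.uniform(0.1, 0.9)  # ancestral allele frequency
    a = p * (1 - fst) / fst
    b = (1 - p) * (1 - fst) / fst
    return rng.betavariate(a, b), rng.betavariate(a, b)

def allele_count(rng, n_alleles, p):
    """Number of alternative alleles among n_alleles independent draws."""
    return sum(1 for _ in range(n_alleles) if rng.random() < p)

def group_counts(rng, n_people, frac_pop1, p1, p2):
    """(alternative allele count, total alleles) for a case or control group."""
    from_pop1 = round(n_people * frac_pop1)
    alt = (allele_count(rng, 2 * from_pop1, p1)
           + allele_count(rng, 2 * (n_people - from_pop1), p2))
    return alt, 2 * n_people

def allelic_chi2(a_alt, a_tot, b_alt, b_tot):
    """Pearson chi-square for the 2 x 2 table of alleles by group."""
    table = [[a_alt, a_tot - a_alt], [b_alt, b_tot - b_alt]]
    rows = [a_tot, b_tot]
    cols = [a_alt + b_alt, a_tot + b_tot - a_alt - b_alt]
    total = a_tot + b_tot
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

def false_positive_rate(case_frac_pop1, control_frac_pop1, fst=0.01,
                        n_snps=500, n_per_group=200, seed=1):
    """Fraction of null SNPs declared significant at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_snps):
        p1, p2 = subpop_freqs(rng, fst)
        ca, ct = group_counts(rng, n_per_group, case_frac_pop1, p1, p2)
        na, nt = group_counts(rng, n_per_group, control_frac_pop1, p1, p2)
        if allelic_chi2(ca, ct, na, nt) > CHI2_CRIT_1DF_05:
            hits += 1
    return hits / n_snps
```

Calling `false_positive_rate(0.5, 0.5)` models cases and controls drawn evenly from both subpopulations, whereas `false_positive_rate(1.0, 0.0)` models the extreme design in which all cases come from one subpopulation and all controls from the other; the second is expected to give a much higher rate of spurious signals.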



Genome wide association


Federal land development authority


Single nucleotide polymorphism


Ancestry informative markers


High density lipoprotein cholesterol


HapMap 3:

HapMap Project 3


Unadjusted P value


Genomic control


Bonferroni correction


False discovery rate


  1. Bellwood P, Gamble C, Le Blanc SA, Pluciennik M, Richards M, Terrell JE. First Farmers: the Origins of Agricultural Societies, by Peter Bellwood. Malden (MA): Blackwell, 2005; ISBN 0-631-20565-9 hardback £60; ISBN 0-631-20566-7 paperback £17.99, xix+360 pp., 59 figs., 3 tables. Cambridge Archaeological Journal 2007:87

  2. Hatin WI, Nur-Shafawati AR, Zahri M-K, Xu S, Jin L, Tan S-G, et al. Population genetic structure of peninsular Malaysia Malay sub-ethnic groups. PLoS One. 2011;6:e18312.

  3. Teo Y-Y, Sim X, Ong RTH, Tan AKS, Chen J, Tantoso E, et al. Singapore Genome Variation Project: a haplotype map of three Southeast Asian populations. Genome Res. 2009;19:2154–62.

  4. Hatin W, Nur-Shafawati A, Etemad A, Jin W, Qin P, Xu S, et al. A genome wide pattern of population structure and admixture in peninsular Malaysia Malays. Hugo J. 2014;8:5.

  5. Consortium HP-AS. Mapping human genetic diversity in Asia. Science. 2009;326:1541–5.

  6. Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J, Stefánsson K. An Icelandic example of the impact of population structure on association studies. Nat Genet. 2005;37:90–5.

  7. Nandi S, Ma L, Denis M, Karwatsky J, Li Z, Jiang X-C, et al. ABCA1-mediated cholesterol efflux generates microparticles in addition to HDL through processes governed by membrane rigidity. J Lipid Res. 2009;50:456–66.

  8. Micheal S, Ms HA, Frcs SNZ, Bs BB, Fcps MA, Fcps FA, et al. Identification of novel CYP1B1 gene mutations in patients with primary congenital and primary open-angle glaucoma. Clin Experiment Ophthalmol. 2015;43(1):31–9.

  9. Bouyacoub Y, Ben Yahia S, Abroug N, Kahloun R, Kefi R, Khairallah M, et al. CYP1B1 gene mutations causing primary congenital glaucoma in Tunisia. Ann Hum Genet. 2014;78:255–63.

  10. Park T-J, Hwang J-Y, Go MJ, Lee H-J, Jang HB, Choi Y, et al. Genome-wide association study of liver enzymes in Korean children. Genomics Inform. 2013;11:149–54.

  11. Argos M, Tong L, Pierce BL, Rakibuz-zaman M, Islam T, Rahman M, et al. Genome-wide association study of smoking behaviours among Bangladeshi adults. J Med Genet. 2014;51:327–33.

  12. Xu S, Yin X, Li S, Jin W, Lou H, Yang L, et al. Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am J Hum Genet. 2009;85:762–74.

  13. Suo C, Xu H, Khor C-C, Ong RT, Sim X, Chen J, et al. Natural positive selection and north–south genetic diversity in East Asia. Eur J Hum Genet. 2012;20:102–10.

  14. Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, et al. Correlation between Genetic and Geographic Structure in Europe. Curr Biol. 2008;18:1241–8.

  15. Qin P, Li Z, Jin W, Lu D, Lou H, Shen J, et al. A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese. Eur J Hum Genet. 2014;22(2013):248–53.

  16. Wu Y, Waite LL, Jackson AU, Sheu WH-H, Buyske S, Absher D, et al. Trans-ethnic fine-mapping of lipid loci identifies population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet. 2013;9:e1003379.

  17. Deng L, Hoh BP, Lu D, Fu R, Phipps ME, Li S, et al. The population genomic landscape of human genetic structure, admixture history and local adaptation in Peninsular Malaysia. Hum Genet. 2014;133:1169–85.

  18. Altshuler DM, Gibbs RA, Peltonen L, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467(7311):52–8.

  19. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–64.

  20. Weir BS, Hill WG. Estimating F-statistics. Annu Rev Genet. 2002;36:721–50.



The authors would like to acknowledge the medical staff of Hospital Kota Bharu, Hospital Sungai Buloh, and Hospital Selayang, as well as the staff and scientists of the PURE-REDISCOVER Study who assisted in the sample collection. The authors also thank all subjects who participated in this study. B.P. Hoh is supported by the Chinese Academy of Sciences Fellowship for Visiting Scientists from the Developing Countries (2914FFSA0009). The computational analyses of this study were supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) (XDB13040100), and by the National Science Foundation of China (NSFC) grants (91331204 and 31171218). The experimental work of this study was supported by the Long-term Research Grant Scheme (LRGS) for Infectious Disease 2011 [600-RMI/LRGS 5/3(3/2011)] funded by the Ministry of Education Malaysia, and the Long-term Research Grant Scheme (LRGS) for the REsponDing to IncreaSing CardiOVascular disease pRevalence (REDISCOVER) [600-RMI/LRGS 5/3(2/2011)] funded by the Ministry of Education Malaysia. We also would like to acknowledge the contributions of the staff of the Human Genome Center, Universiti Sains Malaysia (USM). This study was also supported by USM Apex grant: 1002 / PPSP / 910343 and USM short term grant: 304 / PPSP / 61311034. S.X. is a Max-Planck Independent Research Group Leader and a member of the CAS Youth Innovation Promotion Association. S.X. also gratefully acknowledges the support of the National Program for Top-notch Young Innovative Talents of the “Wanren Jihua” Project.

Author information



Corresponding authors

Correspondence to Boon-Peng Hoh or Shuhua Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BPH and SX conceived and designed the study; BPH and LD performed the data analysis; MJJA, ZZ, MNH, NSAR, WIH, IE, ABZ and YK were involved in sample collection; BPH, SX and YK funded the experimental and computational costs. BPH, SX and LD prepared and wrote the manuscript. All authors have read and approved the manuscript.

Additional files

Additional file 1: Figure S1.

PCA plot including samples from the 11 states of Peninsular Malaysia. Figure S2. Correlation between PC1 and longitude. Figure S3. Correlation between geographical distance and FST.

Additional file 2: Table S1.

Regional pair-wise FST of the populations collected from different states in Malaysia. Table S2. Top 1% SNPs that are highly differentiated between the Malays from northern and southern region of Peninsular. Table S3. Gene ontology and enrichment analysis of the candidate genes underlying the top 0.1% highly differentiated SNPs between the Malays from the north and south Peninsular Malaysia. Table S4. Simulation analysis for GWAS before and after removing 1,666 SNPs with different minor allele present between the north and south Malays.

Additional file 3:

SNPs with different minor allele between northern and southern Malays.

Additional file 4:

GWAS annotation.

Rights and permissions

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Hoh, BP., Deng, L., Julia-Ashazila, M.J. et al. Fine-scale population structure of Malays in Peninsular Malaysia and Singapore and implications for association studies. Hum Genomics 9, 16 (2015).



  • Malay
  • Population sub-structure
  • FST
  • Latitude-PC correlation
  • GWAS simulation