Open Access

KinSNP software for homozygosity mapping of disease genes using SNP microarrays

  • El-Ad David Amir1,
  • Ofer Bartal1,
  • Efrat Morad2,
  • Tal Nagar4,
  • Jony Sheynin3,
  • Ruti Parvari3, 4 and
  • Vered Chalifa-Caspi4Email author
Human Genomics20104:394

https://doi.org/10.1186/1479-7364-4-6-394

Received: 1 August 2010

Accepted: 28 August 2010

Published: 28 August 2010

Abstract

Consanguineous families affected with a recessive genetic disease caused by homozygotisation of a mutation offer a unique advantage for positional cloning of rare diseases. Homozygosity mapping of patient genotypes is a powerful technique for the identification of the genomic locus harbouring the causing mutation. This strategy relies on the observation that in these patients a large region spanning the disease locus is also homozygous with high probability. The high marker density in single nucleotide polymorphism (SNP) arrays is extremely advantageous for homozygosity mapping.

We present KinSNP, a user-friendly software tool for homozygosity mapping using SNP arrays. The software searches for stretches of SNPs which are homozygous to the same allele in all ascertained sick individuals. User-specified parameters control the number of allowed genotyping 'errors' within homozygous blocks. Candidate disease regions are then reported in a detailed, coloured Excel file, along with genotypes of family members and healthy controls. An interactive genome browser has been included which shows homozygous blocks, individual genotypes, genes and further annotations along the chromosomes, with zooming and scrolling capabilities. The software has been used to identify the location of a mutated gene causing insensitivity to pain in a large Bedouin family. KinSNP is freely available from http://bioinfo.bgu.ac.il/bsu/software/kinSNP.

Keywords

autozygosity consanguinity genetic mapping homozygosity SNP microarrays software

Introduction

The availability of high-density mapping microarrays, bearing sufficient probes to analyse 10,000-1,000,000 single nucleotide polymorphisms (SNPs) in one assay, offers an efficient alternative to traditional, microsatellite-based, genome-wide linkage scans [1]. Accordingly, several public linkage software programs have been developed, such as Allegro, EasyLinkage and dChip, which can handle the high number of SNPs in these arrays. All of these have been successfully used for mapping Mendelian disorders [24].

Genotyping individuals from large, multiply-consanguineous families of isolated populations offers a unique advantage for positional cloning of rare diseases [5, 6]. A mutation occurring in a founder may be rapidly inherited by numerous individuals in the population, and the offspring of consanguineous parents will have a high probability of inheriting two copies of the mutated chromosomal segment and thus expressing the disease. Classical linkage analysis of SNP arrays in these studies is highly problematic, however, due to the computational load needed to deal with both the large number of SNPs and the family structure. The presence of complex inbreeding loops makes it extremely difficult to reconstruct genetic data from untyped individuals. Current linkage programs designed to deal with this challenge (eg SuperLink [7] and SimWalk2 [8]) are not suited for the large number of markers present in SNP arrays. The latter programs may be used if adjacent SNPs are grouped and regarded as one marker (eg Naveed et al. [9]); however, this causes great loss of the information available in the high-density array.

A powerful alternative to linkage analysis, which has long been used in studies of autosomal recessive diseases in consanguineous pedigrees, is homozygosity (or autozygosity) mapping [10, 11]. This strategy relies on the fact that in a child of a consanguineous marriage affected with a homozygous recessive genetic disease mutation, a large region spanning the disease locus is probably also homozygous. Thus, searching for regions that are consistently homozygous by descent in all affected individuals would be expected to provide a suitable strategy for mapping the recessive gene mutation. Moreover, the use of SNP arrays is extremely advantageous for homozygosity mapping, compared with microsatellites, because of the much higher density of markers they comprise. For economic reasons, a common strategy is to perform homozygosity mapping using SNP arrays for a small number of sick individuals, with or without their healthy parents or siblings. Once candidate regions are found, they are then further analysed using microsatellite markers in a large number of family members, where all of the affected and none of the unaffected members should be homozygous to the same allele [1219]. In the future, SNP typing or targeted sequencing could be used for fine-mapping. Subsequently, linkage logarithm of odds (LOD) scores are calculated using genetic programs such as SuperLink or SimWalk2.

The ever-increasing number of markers in the SNP arrays makes it unfeasible to locate homozygous regions by manual inspection of the genotype file, as used to be done with microsatellite data. At the time of writing, only two public software programs, ExcludeAR [20] and AutoSNPa [21], were available for automated homozygosity analysis of SNP arrays.

We present KinSNP, a new software tool for homozygosity mapping in SNP arrays from consanguineous families, which provides additional capabilities compared with existing software while simplifying program usage. KinSNP can be run on any computer through intuitive menus and dialogues. It directly accepts Affymetrix files as input, whereas files from Illumina or other platforms may be easily converted to the KinSNP format. The user has considerable control over the parameters that define the reported homozygous regions, allowing, for example, various degrees of stringency for tolerating genotyping errors. KinSNP results are provided both as coloured, linked Excel files and as an interactive genome browser, on which results from multiple analyses may be visualised simultaneously.

In the present study, KinSNP software was used to identify the location of the mutated gene causing insensitivity to pain (IP) in a large Bedouin family.

Results

Outline of KinSNP

KinSNP was developed to identify runs of shared homozygosity (ROSHs) in SNP microarrays. The software was initially designed to mimic the data-processing strategy traditionally used by geneticists to manually locate homozygous regions in microsatellite-based data. One of our first microarray-based studies, aimed at mapping IP in an inbred Bedouin family using four Affymetrix 250K Nsp mapping arrays (Figure 1), revealed a 16-megabase (Mb)-long ROSH. The ROSH was further confirmed by typing a larger set of family members using microsatellites, and the causative mutation was identified through sequencing in an already known IP gene, SCN9A[22]. We use that analysis here to demonstrate KinSNP operation and output. Since only one healthy relative was genotyped in the study, individuals from the same population studied for other diseases in our group were added as healthy controls for IP. This inspired us to enable KinSNP to receive all of a laboratory's or a core facility's microarray genotypes at once, and be subsequently used iteratively to look for homozygous regions in any selected set of sick individuals with any combination of analysis parameters. Program execution is carried out through an intuitive guide, which leads the user through the analysis workflow (Figure 2, left-hand panel). Eventually, the candidate disease regions are provided both as a textual, sorted list (Figure 3), as well as a visual representation in the context of the entire genome (Figures 4, 5).
Figure 1

Insensitivity to pain (IP) family tree. The numbered individuals were genotyped using Affymetrix SNP microarrays.

Figure 2

KinSNP user interface. Further details are provided in the text.

Figure 3

Excel output: Summary of the ROSHs fitting the user definition. ROSH ('block') positions are indicated in number of bases from pter. ROSH ('block') names are hyperlinks to block start positions in the chromosome files (Figure 4). Block lengths are indicated as number of SNPs, physical length (number of bases), and deCODE, Marshield and SLM1 genetic distances (cM).

Figure 4

Excel output: Chromosome view of genotypes and ROSHs. Shown is the start of the 16-Mb-long ROSH found in chromosome 2. Genotype calls having lower confidence (higher values) than the user-specified cut-off are displayed with a lighter colour.

Figure 5

IGB view of KinSNP results. The IP causative gene, SCN9A, is marked in red. Note that the sisters indicated as IP1 and IP2 share a longer ROSH than their half-sister IP3, whereas their mother, H1, is heterozygous along the region.

KinSNP analysis workflow

The analysis of microarray data often requires stepwise formatting and combining of multiple files in various formats. To simplify this step, KinSNP requires only two input files: (i) an array-specific annotation file, containing SNP identifications accompanied by their genetic and physical positions and (ii) a genotype file, containing SNP genotype and confidence calls from all typed individuals. Instructions on the exact format of these simple tab-delimited text files are available on the KinSNP website. KinSNP is also able to directly accept files produced by Affymetrix genotype calling programs (GTYPE, Genotyping Console), as well as Affymetrix annotation files (downloadable from http://www.affymetrix.com/support/technical/byproduct.affx?cat=software). Illumina data may be converted for use by KinSNP through a dedicated script available at the KinSNP website.

Once the input files are loaded into KinSNP, the user is allowed to associate a name, a disease state, a family and an ethnic group to each array included in the genotype file. This step of defining the 'population' accomplishes the one-time 'project' description, after which the user may perform any number of specific homozygosity analyses on any of the diseases and subsets of individuals.

The parameters for the homozygosity analysis are specified in the 'Analysis Settings' window shown in detail in Figure 2. The user should specify the studied disease, choose the specific chips (individuals) in which to look for shared homozygous regions and denote which other individuals to show in the output as controls for manual inspection. The reported ROSHs can then be sorted according to one of several possible length units, such as number of SNPs, physical length and genetic recombination length according to deCODE, Marshfield or SNP Linkage Map (SLM1) genetic maps. (Information about these maps and their extrapolation to obtain SNP positions is available in the README file that accompanies Affymetrix annotation files.) In addition, the user can specify a length threshold above which ROSHs will be reported because this may differ between homozygosity projects. The more related the individuals and the fewer there are, the longer the ROSHs that can be expected to be found. For example, brothers are expected to share longer ROSHs than cousins; a longer ROSH is expected to be shared by two patients than by three patients etc. Additionally, the more recent the mutation and the higher the consanguinity, the longer the ROSHs expected. The parameters 'Confidence cutoff ' and 'Skip AB with this many adj. homozygous' control the number of allowed errors within ROSHs and are explained in the next section. For faster computation, the user may limit the analysis to specific chromosomes only.

After pressing 'OK', the user is asked to specify output file names and locations, and analysis may be started.

Homozygosity analysis algorithm

First, KinSNP merges positional information from the annotation file with the genotype data and sorts the genotypes by SNP physical locations. Starting with the first SNP, the algorithm checks whether all sick individuals are homozygous to the same allele; if not, it proceeds to the next SNP. If the individuals share homozygosity, a new ROSH is started. The ROSH is extended as long as homozygosity is shared. Under the most stringent parameters, once a heterozygous SNP or a non-shared homozygous SNP for the other allele are encountered, the ROSH is terminated and stored. The algorithm then continues to proceed along the chromosome until it encounters another SNP with shared homozygosity. It then starts a new ROSH and so forth, until the end of the chromosome is reached.

A more permissive operation of KinSNP allows for occasional non-shared alleles within ROSHs. The degree of permissiveness may be heuristically adjusted via two parameters in the KinSNP Analysis Settings window, which take into account the call confidence and the scarcity of the non-shared allele.

The first parameter, 'Confidence cutoff ', makes use of the confidence values attributed to each of the genotype calls. To understand this parameter, let us first explain the 'NoCall' genotypes. By default, SNP calls designated as 'NoCall' in the KinSNP input file are ignored during the process of ROSH identification. Thus, a 'NoCall' at a certain SNP will be treated as AA or as BB if this will cause the SNP to be considered homozygous among all individuals selected for analysis. The designation of SNPs as 'NoCalls' is usually applied by the genotype calling program (eg Affymetrix's Genotyping Console) if the SNP confidence value did not pass a certain cut-off. KinSNP enables the user to change that cut-off by specifying another confidence threshold above which SNP calls will be effectively regarded as 'NoCalls' in ROSH definition. Unlike 'NoCalls', however, these SNPs will be shown in KinSNP Excel output files with their original genotype call, though coloured with a lighter colour than the others (see for example Figure 4).

The second parameter, 'Skip AB with this many adjacent homozygous', was introduced as we found that heterozygous SNPs -- even with high confidence calls -- sometimes 'break' genuine homozygous blocks, probably due to erroneous reading. This parameter enables one to ignore heterozygous SNP calls which are surrounded by a 'window' of a user-specified number of SNPs which all apply to the criteria of the ROSH.

From our experience, the two parameters, as well as the different colouring of low-confidence genotypes, not only improve the homozygosity analysis, but also enable inclusion in the analysis of low-quality arrays (eg <90 per cent call rate) which otherwise could not be analysed.

Viewing KinSNP results

KinSNP output may be viewed either as a set of stylised Excel files or in an interactive genome browser. The Excel files enable the user to further sort, mark and annotate the results using the standard Excel interface. The browser enables scrolling and zooming along the chromosomes, thus visually identifying recombination patterns flanking the reported ROSHs and estimating their likelihood as disease-related regions. We will demonstrate KinSNP output by describing the IP disease gene mapping.

Excel output

KinSNP creates a summary file which outlines all the found ROSHs longer than the user-specified threshold, along with their chromosomal position, number of consecutive SNPs and physical and genetic lengths (Figure 3). After sorting the ROSHs according to the desirable length unit, the IP analysis revealed a ~11-17 centimorgan (cM) (~16 Mb)-long block comprising 1,455 consecutive SNPs in chromosome 2. Clicking the active link in the block name opens the chromosome-specific Excel file (Figure 4), right at the top of the relevant block. In this file, genotypes of all SNPs of the chromosome are shown in the order of the SNPs' physical positions, accompanied by the SNP annotation retrieved from the annotation input file. Genotype calls from the selected sick individuals appear first, with reported ROSHs delineated by black vertical bars. Genotypes of control individuals are shown further to the right. Heterozygous SNPs are coloured in green and homozygous SNPs are coloured in red (AA) or blue (BB), respectively, such that the allele in lower alphabetical order is the A allele and the other allele is the B allele. Low-confidence calls, based on user-specified confidence cut-off, are coloured in light green, orange or light blue, for AB, AA or BB, respectively.

The Excel display allows the user to judge how informative the ROSHs are by comparing the degree of homozygosity of the ROSHs in the sick individuals with the degree of heterozygosity in the control individuals. SNPs or blocks which are homozygous across both sick and healthy individuals should be considered non-informative. Note that one bordering non-ROSH SNP (eg SNP_A-1821170 in Figure 3) is included at each end of the reported ROSH, since the disease-causing mutation may in principle be located between that SNP and the extreme ROSH SNP. Users may also scroll above and below the reported ROSHs to see whether several adjacent ROSHs can actually be merged, and to get an impression of the recombination patterns flanking the ROSHs. These tasks are more easily performed using the included Integrated Genome Browser (IGB), however.

The Integrated Genome Browser

In the IP analysis shown here, one ROSH was found to stand out from all the others, being much longer in both physical and genetic length, as well as having a greater number of SNPs. When there are several possible ROSHs, however, additional means are required to prioritise the ROSHs as candidate regions harbouring the disease-causing mutation. One way is to consider the degree of kinship among the ascertained individuals, and to check whether closer relatives indeed share longer ROSHs than those shared by all patients together. This is achieved by providing an informative visualisation of the ROSHs, along with genotypes of sick and control individuals, in a more compact and yet resizable view of the chromosomes. To this end, we have integrated with KinSNP a slightly simplified version of Affymetrix's IGB http://www.affymetrix.com/support/developer/tools/download_igb.affx, whose source code is covered by the common public licence. By choosing KinSNP's 'View Analysis Results' option (Figure 2, left-hand panel), the IGB opens right away, with an option to add tracks for the reported blocks and for coloured genotypes of the analysed individuals (Figure 5). University of California Santa Cruz (UCSC) gene predictions based on data from RefSeq, Genbank and UniProt are displayed by default, but additional genome annotations may be easily loaded to IGB from a variety of public and private databases, including, for example, RefSeq and Genbank mRNAs, GenScan gene predictions, microRNAs, Affymetrix probes from various array types, etc. The major advantage of the IGB for viewing KinSNP results is the possibility of navigating easily along the chromosomes and of zooming in and out of areas of interest. For example, in the IP ROSH hosting the SCN9A gene (Figure 5), it is possible to see the longer ROSH shared by the sisters IP1 (217) and IP2 (208) (on the left-hand side), overlapping with the shorter homozygous stretch shared with their half-sister IP3 (207). Once again, although the present KinSNP computed analysis does not consider the family tree in prioritising the reported blocks, the user may use knowledge of kinship between the ascertained individuals in combination with the IGB visualisation to estimate the reliability of each of the reported ROSHs.

The information which IGB uses to draw the block positions (Figure 5, 'ROSHs' track, shown as light blue rectangles) is read from a special text file created by KinSNP and stored in the same local directory as the Excel files. This file has a standard format, called a general feature format (GFF), and is recognised by its '.gff ' extension. Once the IGB is open (often with the results of the most recent analysis), it is possible to load on top of it additional .gff files created in previous analyses. For example, if patients from two families were initially analysed separately and then all together, or if closely related patients were analysed and then more distant ones were added, computed blocks from all analyses may be displayed together in the same IGB view and compared visually. This option is also relevant when KinSNP is executed iteratively with different analysis parameters, such as minimal block lengths or confidence cut-off.

Finally, it is possible to load computational results from other genome-related experiments, expressed as binary or numerical values attributed to individual genomic positions or intervals, into the IGB on top of any existing display. These may include, for example, gene expression data, copy number variation data, measures of conservation between different genomes, etc. This allows straightforward integration of homozygosity analysis results with related custom or public experimental data.

Discussion

KinSNP software greatly facilitates homozygosity analysis of high-throughput SNP genotypes from consanguineous families, and may serve as an efficient starting point for further microsatellite-based fine mapping of candidate regions using classical linkage analysis. Moreover, the possibility of identifying homozygous blocks could be useful for the identification of a de novo mutation occurring in a recent generation in the general population. In such a case, homozygotisation of the mutation may occur in a patient owing to a founder effect, even if kinship is unknown to the family.

Unlike the two existing equivalent software tools, ExcludeAR and AutoSNPa, reported ROSHs are defined by customised criteria, which consider genotype call confidence and allow for sporadic heterozygotes within specified-length windows of shared homozygosity. Additionally, block sizes are specified by physical length, distances according to several genetic maps and number of SNPs. Data import is easy and, for Affymetrix arrays, does not require any prior manipulation of the genotype or annotation files. KinSNP output presents the ROSHs within the context of entire chromosomes, using Excel files as well as an interactive genome browser. In future versions of KinSNP, we plan to provide additional visualisations -- such as continuous plots along the chromosomes designating the degree of shared homozygosity in each position, and its comparison with that of the healthy controls.

Declarations

Acknowledgements

This research was supported by a research grant from the National Institute for Biotechnology in the Negev.

Authors’ Affiliations

(1)
Department of Computer Science, Ben-Gurion University of the Negev
(2)
Department of Life Sciences, Ben-Gurion University of the Negev
(3)
Department of Developmental Genetics and Virology, Ben-Gurion University of the Negev
(4)
National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev

References

  1. Gibbs JR, Singleton A: Application of genome-wide single nucleotide polymorphism typing: Simple association and beyond. PLoS Genet. 2006, 2: e150-10.1371/journal.pgen.0020150.PubMed CentralView ArticlePubMedGoogle Scholar
  2. Gudbjartsson DF, Thorvaldsson T, Kong A, Gunnarsson G, et al: Allegro version 2. Nat Genet. 2005, 37: 1015-1016. 10.1038/ng1005-1015.View ArticlePubMedGoogle Scholar
  3. Hoffmann K, Lindner TH: easyLINKAGE-Plus--Automated linkage analyses using large-scale SNP data. Bioinformatics. 2005, 21: 3565-3567. 10.1093/bioinformatics/bti571.View ArticlePubMedGoogle Scholar
  4. Leykin I, Hao K, Cheng J, Meyer N, et al: Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data. BMC Genet. 2005, 6: 7-PubMed CentralView ArticlePubMedGoogle Scholar
  5. Sheffield VC, Stone EM, Carmi R: Use of isolated inbred human populations for identification of disease genes. Trends Genet. 1998, 14: 391-396. 10.1016/S0168-9525(98)01556-X.View ArticlePubMedGoogle Scholar
  6. Jorde LB, Watkins WS, Kere J, Nyman D, et al: Gene mapping in isolated populations: New roles for old friends?. Hum Hered. 2000, 50: 57-65. 10.1159/000022891.View ArticlePubMedGoogle Scholar
  7. Fishelson M, Geiger D: Exact genetic linkage computations for general pedigrees. Bioinformatics. 2002, 18 (Suppl 1): S189-S198. 10.1093/bioinformatics/18.suppl_1.S189.View ArticlePubMedGoogle Scholar
  8. Sobel E, Lange K: Descent graphs in pedigree analysis: Applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet. 1996, 58: 1323-1337.PubMed CentralPubMedGoogle Scholar
  9. Naveed M, Nath SK, Gaines M, Al-Ali MT, et al: Genomewide linkage scan for split-hand/foot malformation with long-bone deficiency in a large Arab family identifies two novel susceptibility loci on chromosomes 1q42.2-q43 and 6q14.1. Am J Hum Genet. 2007, 80: 105-111. 10.1086/510724.PubMed CentralView ArticlePubMedGoogle Scholar
  10. Lander ES, Botstein D: Homozygosity mapping: A way to map human recessive traits with the DNA of inbred children. Science. 1987, 236: 1567-1570. 10.1126/science.2884728.View ArticlePubMedGoogle Scholar
  11. Mueller RF, Bishop DT: Autozygosity mapping, complex consanguinity, and autosomal recessive disorders. J Med Genet. 1993, 30: 798-979. 10.1136/jmg.30.9.798.PubMed CentralView ArticlePubMedGoogle Scholar
  12. Broman KW, Weber JL: Long homozygous chromosomal segments in reference families from the Centre d'Etude du Polymorphisme Humain. Am J Hum Genet. 1999, 65: 1493-1500. 10.1086/302661.PubMed CentralView ArticlePubMedGoogle Scholar
  13. Chandler KE, Del Rio A, Rakshi K, Springell K, et al: Leucodysplasia, microcephaly, cerebral malformation (LMC): A novel recessive disorder linked to 2p16. Brain. 2006, 129 (Pt 1): 272-277.PubMedGoogle Scholar
  14. Gutierrez-Roelens I, Sluysmans T, Jorissen M, Amyere M, et al: Localization of candidate regions for a novel gene for Kartagener syndrome. Eur J Hum Genet. 2006, 14: 809-815. 10.1038/sj.ejhg.5201631.View ArticlePubMedGoogle Scholar
  15. Chiang AP, Beck JS, Yen HJ, Tayeh MK: Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubi-quitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proc Natl Acad Sci USA. 2006, 103: 6287-6292. 10.1073/pnas.0600158103.PubMed CentralView ArticlePubMedGoogle Scholar
  16. Melin M, Entesarian M, Carlsson G, Garwicz D, et al: Assignment of the gene locus for severe congenital neutropenia to chromosome 1q22 in the original Kostmann family from Northern Sweden. Biochem Biophys Res Commun. 2007, 353: 571-575. 10.1016/j.bbrc.2006.12.086.PubMed CentralView ArticlePubMedGoogle Scholar
  17. Morgan NV, Pasha S, Johnson CA, Ainsworth JR, et al: A germline mutation in BLOC1S3/reduced pigmentation causes a novel variant of Hermansky-Pudlak syndrome (HPS8). Am J Hum Genet. 2006, 78: 160-166. 10.1086/499338.PubMed CentralView ArticlePubMedGoogle Scholar
  18. Smith UM, Consugar M, Tee LJ, McKee BM, et al: The transmembrane protein meckelin (MKS3) is mutated in Meckel-Gruber syndrome and the wpk rat. Nat Genet. 2006, 38: 191-196. 10.1038/ng1713.View ArticlePubMedGoogle Scholar
  19. Stoetzel C, Muller J, Laurier V, Davis EE, et al: Identification of a novel BBS gene (BBS12) highlights the major role of a vertebrate-specific branch of chaperonin-related proteins in Bardet-Biedl syndrome. Am J Hum Genet. 2007, 80: 1-11. 10.1086/510256.PubMed CentralView ArticlePubMedGoogle Scholar
  20. Woods CG, Valente EM, Bond J, Roberts E: A new method for autozygosity mapping using single nucleotide polymorphisms (SNPs) and EXCLUDEAR. J Med Genet. 2004, 41: e101-10.1136/jmg.2003.016873.PubMed CentralView ArticlePubMedGoogle Scholar
  21. Carr IM, Flintoff KJ, Taylor GR, Markham AF, et al: Interactive visual analysis of SNP data for rapid autozygosity mapping in consanguineous families. Hum Mutat. 2006, 27: 1041-1046. 10.1002/humu.20383.View ArticlePubMedGoogle Scholar
  22. Cox JJ, Reimann F, Nicholas AK, Thornton G, et al: An SCN9A channelopathy causes congenital inability to experience pain. Nature. 2006, 444: 894-898. 10.1038/nature05413.View ArticlePubMedGoogle Scholar

Copyright

© Henry Stewart Publications 2010

Advertisement