Outline of KinSNP
KinSNP was developed to identify runs of shared homozygosity (ROSHs) in SNP microarrays. The software was initially designed to mimic the data-processing strategy traditionally used by geneticists to manually locate homozygous regions in microsatellite-based data. One of our first microarray-based studies, aimed at mapping IP in an inbred Bedouin family using four Affymetrix 250K Nsp mapping arrays (Figure 1), revealed a 16-megabase (Mb)-long ROSH. The ROSH was further confirmed by typing a larger set of family members using microsatellites, and the causative mutation was identified through sequencing in an already known IP gene, SCN9A[22]. We use that analysis here to demonstrate KinSNP operation and output. Since only one healthy relative was genotyped in the study, individuals from the same population studied for other diseases in our group were added as healthy controls for IP. This inspired us to enable KinSNP to receive all of a laboratory's or a core facility's microarray genotypes at once, and be subsequently used iteratively to look for homozygous regions in any selected set of sick individuals with any combination of analysis parameters. Program execution is carried out through an intuitive guide, which leads the user through the analysis workflow (Figure 2, left-hand panel). Eventually, the candidate disease regions are provided both as a textual, sorted list (Figure 3), as well as a visual representation in the context of the entire genome (Figures 4, 5).
KinSNP analysis workflow
The analysis of microarray data often requires stepwise formatting and combining of multiple files in various formats. To simplify this step, KinSNP requires only two input files: (i) an array-specific annotation file, containing SNP identifications accompanied by their genetic and physical positions and (ii) a genotype file, containing SNP genotype and confidence calls from all typed individuals. Instructions on the exact format of these simple tab-delimited text files are available on the KinSNP website. KinSNP is also able to directly accept files produced by Affymetrix genotype calling programs (GTYPE, Genotyping Console), as well as Affymetrix annotation files (downloadable from http://www.affymetrix.com/support/technical/byproduct.affx?cat=software). Illumina data may be converted for use by KinSNP through a dedicated script available at the KinSNP website.
Once the input files are loaded into KinSNP, the user is allowed to associate a name, a disease state, a family and an ethnic group to each array included in the genotype file. This step of defining the 'population' accomplishes the one-time 'project' description, after which the user may perform any number of specific homozygosity analyses on any of the diseases and subsets of individuals.
The parameters for the homozygosity analysis are specified in the 'Analysis Settings' window shown in detail in Figure 2. The user should specify the studied disease, choose the specific chips (individuals) in which to look for shared homozygous regions and denote which other individuals to show in the output as controls for manual inspection. The reported ROSHs can then be sorted according to one of several possible length units, such as number of SNPs, physical length and genetic recombination length according to deCODE, Marshfield or SNP Linkage Map (SLM1) genetic maps. (Information about these maps and their extrapolation to obtain SNP positions is available in the README file that accompanies Affymetrix annotation files.) In addition, the user can specify a length threshold above which ROSHs will be reported because this may differ between homozygosity projects. The more related the individuals and the fewer there are, the longer the ROSHs that can be expected to be found. For example, brothers are expected to share longer ROSHs than cousins; a longer ROSH is expected to be shared by two patients than by three patients etc. Additionally, the more recent the mutation and the higher the consanguinity, the longer the ROSHs expected. The parameters 'Confidence cutoff ' and 'Skip AB with this many adj. homozygous' control the number of allowed errors within ROSHs and are explained in the next section. For faster computation, the user may limit the analysis to specific chromosomes only.
After pressing 'OK', the user is asked to specify output file names and locations, and analysis may be started.
Homozygosity analysis algorithm
First, KinSNP merges positional information from the annotation file with the genotype data and sorts the genotypes by SNP physical locations. Starting with the first SNP, the algorithm checks whether all sick individuals are homozygous to the same allele; if not, it proceeds to the next SNP. If the individuals share homozygosity, a new ROSH is started. The ROSH is extended as long as homozygosity is shared. Under the most stringent parameters, once a heterozygous SNP or a non-shared homozygous SNP for the other allele are encountered, the ROSH is terminated and stored. The algorithm then continues to proceed along the chromosome until it encounters another SNP with shared homozygosity. It then starts a new ROSH and so forth, until the end of the chromosome is reached.
A more permissive operation of KinSNP allows for occasional non-shared alleles within ROSHs. The degree of permissiveness may be heuristically adjusted via two parameters in the KinSNP Analysis Settings window, which take into account the call confidence and the scarcity of the non-shared allele.
The first parameter, 'Confidence cutoff ', makes use of the confidence values attributed to each of the genotype calls. To understand this parameter, let us first explain the 'NoCall' genotypes. By default, SNP calls designated as 'NoCall' in the KinSNP input file are ignored during the process of ROSH identification. Thus, a 'NoCall' at a certain SNP will be treated as AA or as BB if this will cause the SNP to be considered homozygous among all individuals selected for analysis. The designation of SNPs as 'NoCalls' is usually applied by the genotype calling program (eg Affymetrix's Genotyping Console) if the SNP confidence value did not pass a certain cut-off. KinSNP enables the user to change that cut-off by specifying another confidence threshold above which SNP calls will be effectively regarded as 'NoCalls' in ROSH definition. Unlike 'NoCalls', however, these SNPs will be shown in KinSNP Excel output files with their original genotype call, though coloured with a lighter colour than the others (see for example Figure 4).
The second parameter, 'Skip AB with this many adjacent homozygous', was introduced as we found that heterozygous SNPs -- even with high confidence calls -- sometimes 'break' genuine homozygous blocks, probably due to erroneous reading. This parameter enables one to ignore heterozygous SNP calls which are surrounded by a 'window' of a user-specified number of SNPs which all apply to the criteria of the ROSH.
From our experience, the two parameters, as well as the different colouring of low-confidence genotypes, not only improve the homozygosity analysis, but also enable inclusion in the analysis of low-quality arrays (eg <90 per cent call rate) which otherwise could not be analysed.
Viewing KinSNP results
KinSNP output may be viewed either as a set of stylised Excel files or in an interactive genome browser. The Excel files enable the user to further sort, mark and annotate the results using the standard Excel interface. The browser enables scrolling and zooming along the chromosomes, thus visually identifying recombination patterns flanking the reported ROSHs and estimating their likelihood as disease-related regions. We will demonstrate KinSNP output by describing the IP disease gene mapping.
Excel output
KinSNP creates a summary file which outlines all the found ROSHs longer than the user-specified threshold, along with their chromosomal position, number of consecutive SNPs and physical and genetic lengths (Figure 3). After sorting the ROSHs according to the desirable length unit, the IP analysis revealed a ~11-17 centimorgan (cM) (~16 Mb)-long block comprising 1,455 consecutive SNPs in chromosome 2. Clicking the active link in the block name opens the chromosome-specific Excel file (Figure 4), right at the top of the relevant block. In this file, genotypes of all SNPs of the chromosome are shown in the order of the SNPs' physical positions, accompanied by the SNP annotation retrieved from the annotation input file. Genotype calls from the selected sick individuals appear first, with reported ROSHs delineated by black vertical bars. Genotypes of control individuals are shown further to the right. Heterozygous SNPs are coloured in green and homozygous SNPs are coloured in red (AA) or blue (BB), respectively, such that the allele in lower alphabetical order is the A allele and the other allele is the B allele. Low-confidence calls, based on user-specified confidence cut-off, are coloured in light green, orange or light blue, for AB, AA or BB, respectively.
The Excel display allows the user to judge how informative the ROSHs are by comparing the degree of homozygosity of the ROSHs in the sick individuals with the degree of heterozygosity in the control individuals. SNPs or blocks which are homozygous across both sick and healthy individuals should be considered non-informative. Note that one bordering non-ROSH SNP (eg SNP_A-1821170 in Figure 3) is included at each end of the reported ROSH, since the disease-causing mutation may in principle be located between that SNP and the extreme ROSH SNP. Users may also scroll above and below the reported ROSHs to see whether several adjacent ROSHs can actually be merged, and to get an impression of the recombination patterns flanking the ROSHs. These tasks are more easily performed using the included Integrated Genome Browser (IGB), however.
The Integrated Genome Browser
In the IP analysis shown here, one ROSH was found to stand out from all the others, being much longer in both physical and genetic length, as well as having a greater number of SNPs. When there are several possible ROSHs, however, additional means are required to prioritise the ROSHs as candidate regions harbouring the disease-causing mutation. One way is to consider the degree of kinship among the ascertained individuals, and to check whether closer relatives indeed share longer ROSHs than those shared by all patients together. This is achieved by providing an informative visualisation of the ROSHs, along with genotypes of sick and control individuals, in a more compact and yet resizable view of the chromosomes. To this end, we have integrated with KinSNP a slightly simplified version of Affymetrix's IGB http://www.affymetrix.com/support/developer/tools/download_igb.affx, whose source code is covered by the common public licence. By choosing KinSNP's 'View Analysis Results' option (Figure 2, left-hand panel), the IGB opens right away, with an option to add tracks for the reported blocks and for coloured genotypes of the analysed individuals (Figure 5). University of California Santa Cruz (UCSC) gene predictions based on data from RefSeq, Genbank and UniProt are displayed by default, but additional genome annotations may be easily loaded to IGB from a variety of public and private databases, including, for example, RefSeq and Genbank mRNAs, GenScan gene predictions, microRNAs, Affymetrix probes from various array types, etc. The major advantage of the IGB for viewing KinSNP results is the possibility of navigating easily along the chromosomes and of zooming in and out of areas of interest. For example, in the IP ROSH hosting the SCN9A gene (Figure 5), it is possible to see the longer ROSH shared by the sisters IP1 (217) and IP2 (208) (on the left-hand side), overlapping with the shorter homozygous stretch shared with their half-sister IP3 (207). Once again, although the present KinSNP computed analysis does not consider the family tree in prioritising the reported blocks, the user may use knowledge of kinship between the ascertained individuals in combination with the IGB visualisation to estimate the reliability of each of the reported ROSHs.
The information which IGB uses to draw the block positions (Figure 5, 'ROSHs' track, shown as light blue rectangles) is read from a special text file created by KinSNP and stored in the same local directory as the Excel files. This file has a standard format, called a general feature format (GFF), and is recognised by its '.gff ' extension. Once the IGB is open (often with the results of the most recent analysis), it is possible to load on top of it additional .gff files created in previous analyses. For example, if patients from two families were initially analysed separately and then all together, or if closely related patients were analysed and then more distant ones were added, computed blocks from all analyses may be displayed together in the same IGB view and compared visually. This option is also relevant when KinSNP is executed iteratively with different analysis parameters, such as minimal block lengths or confidence cut-off.
Finally, it is possible to load computational results from other genome-related experiments, expressed as binary or numerical values attributed to individual genomic positions or intervals, into the IGB on top of any existing display. These may include, for example, gene expression data, copy number variation data, measures of conservation between different genomes, etc. This allows straightforward integration of homozygosity analysis results with related custom or public experimental data.