Integrated analysis of genetic data with R

Genetic data are now widely available. There is, however, an apparent lack of concerted effort to produce software systems for statistical analysis of genetic data compared with other fields of statistics. It is often a tremendous task for end-users to tailor existing programs to particular data, especially when genetic data are analysed in conjunction with a large number of covariates. Here, R (http://www.r-project.org), a free, flexible and platform-independent environment for statistical modelling and graphics, is explored as an integrated system for genetic data analysis. An overview of some packages currently available for analysis of genetic data is given. This is followed by examples of package development and practical applications. With clear advantages in data management, graphics, statistical analysis, programming, internet capability and use of available codes, it is a feasible, although challenging, task to develop R into an integrated platform for genetic analysis; this will require the joint efforts of many researchers.


Introduction
With the success of genome projects in human and other species, vast quantities of genetic data are now available and increasingly used. These include the HapMap (http://www.hapmap.org) and the BioBank (eg UK BioBank, http://www.ukbiobank.ac.uk) projects; others are envisaged [1]. They generate large datasets, which are used for localisation of disease-predisposing genes, for drug discovery and for better understanding of human population history and interaction with the environment.
Meanwhile, these data and their increasing use pose immense challenges for statisticians and have provoked a bewildering array of new algorithms and relevant software (for example, in phasing algorithms [2,3]). There is an apparent lack of coordination of such endeavours, however, compared with other fields of statistics, where appropriate tools are well established. For human genetics, the focus of research has been on the genetic dissection of complex traits such as schizophrenia, diabetes and cardiovascular diseases. The research paradigms and tools largely fall into several categories; namely, segregation analysis, linkage analysis (including allele-sharing methods), association studies and experimental crosses for mapping polygenic traits, that is, mapping of quantitative trait loci (QTLs) [4]. The focus has further shifted to genetic association studies exploring genomic structure and incorporating more information regarding human population history, microarray analysis using gene expression data and proteomics, among others. The vast genomic data narrow the gap between the original definitions of genetic mapping and sequence analysis in the human genome project, followed by a similar trend for analytical tools. These analytical tools now appear awkward and require updating.
There are hundreds of programs and utilities for linkage and association analysis. Some of them are described here. Since most of them are listed at the Rockefeller University (http://linkage.rockefeller.edu) and the UK Human Genome Mapping Project Resource Centre (http://www.hgmp.mrc.ac.uk), the full names and references for these software programs are not given here. This paper is placed in the context of previous reviews on linkage analysis [5] and haplotype phase inference [2]. It is notable that Salem et al's [3] survey contained a total of 43 software programs for phasing and association analysis of unrelated individuals. The number would have been much larger if topics such as data from experimental designs using animals, phylogenetic analysis and microarray data analysis had been included.
Computer programming for linkage analysis began with the first Fortran program, LIPED, developed by Ott [6,7]. In the 1980s, the celebrated book Methods in Genetic Epidemiology [8] described a variety of computer programs, including PATHMIX for path analysis of nuclear family data, POINTER for complex segregation analysis and LIPED. The Pascal program LINKAGE was written in the early 1980s and included a number of subprograms, ILINK, MLINK and LINKMAP, which in turn had counterparts adapted for three-generation Centre d'Etude du Polymorphisme Humain families. These programs are still widely used, but require intense training. Based on these original programs, a number of other programs have been written; for example, SLINK, FASTSLINK, FASTLINK, MFLINK, FASTMAP, ERPA, ESPA and a variety of linkage utility programs. Added to the analysts' learning set are packages such as MENDEL, SIMLINK, SIMWALK, VITESSE, PAP, SAGE, SOLAR, SUPERLINK, SPLINK and ASPEX. Unfortunately, these are not exclusive; a number of programs based on the Lander-Green-Kruglyak algorithm have also been developed, for example, MAPMAKER, GENEHUNTER, GENEHUNTER-PLUS, GENEHUNTER-IMPRINTING, GENEHUNTER-TWOLOCUS, GENEHUNTER-SAD, ALLEGRO and MERLIN. There are also programs based on Bayesian methods, such as MORGAN. Furthermore, efforts have been devoted to developing tools to facilitate analysis; for example, GLUE, QUICKLINK and easyLINKAGE. Popular programs for phasing and association analysis include ARLEQUIN, PHASE, EHPLUS, SNPHAP, PLEM and CHAPLAN for unrelated individuals and QTDT, FBAT, TRANSMIT and UNPHASED for family-based association tests. There are also Bayesian counterparts such as HAPLOTYPER and BLADE.
Some features of these programs are worthy of note. First, they were written in many computer languages, ranging from C, C++, Fortran, Pascal, Java and Perl to Stata, SAS and S-PLUS; some are available only in compiled form and are tested under specific computer systems. Secondly, they require data in specific formats, often from the programmers' own perspective and not conforming to any standard, and it is often rather cumbersome to reuse output from these programs. Some include primitive parsing and a few have graphical capability. Thirdly, in the analysis of data from a large project, it is often necessary to write some customised utilities for these programs. The range of skills required for the different languages and tools largely demands a professional in computing or an applied field. These features often lead to redundant work, poor maintenance and lack of validity checks. Consequently, it is difficult for practical data analysts to keep track of so many software programs, and thus many smaller programs are sometimes 'lost', even though they would be very useful if only people knew about them.
The features of good software systems for genetic data analysis have been described [9] and were reiterated in the recent Genetic Analysis Workshop (http://www.gaworkshop.org), where some software proved to be inadequate for datasets from both real study and simulation. To a large extent, the authors believe that this is due to the lack of a general but satisfactory platform for statistical geneticists. Excellent theoretical work often does not have a good companion program. While there is always a motivation to provide one, the effort of development is often too great. The ideal development platform should run across computer systems and have facilities for data management, graphics, established algorithms and clear documentation, provide a graphical user-interface (GUI) and accept batch jobs. The language should be powerful and flexible, but easy enough to track errors and modify or extend the source codes. Furthermore, it is essential to be able to retrieve and send information from the internet, given that large genetic data and programs are publicly available. Finally, as much of the code for numerical analysis and other routines has been available for decades, typically in Pascal, Fortran, C/C++, it should ideally be possible to re-use this.
The above features would be impossible to achieve by single programmer(s) or group(s); however, with the recent development of general computing, such a platform now exists. Note how these features are reminiscent of the open source initiatives led by the Linux operating system. In the following sections, the authors first describe features of R through a brief introduction, and then give a survey of packages to illustrate the range of tools available. This is followed by an exposition through example packages. They also provide examples and comparisons with other platforms. They suggest that R could potentially serve as an integrated platform for genetic data analysis.

A brief overview of R
According to the Comprehensive R Archive Network (CRAN), R is '"GNU S", a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc' (http://cran.r-project.org).
A brief history is contained in the frequently asked questions (R-FAQ) at CRAN: 'The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language "S". S is a very high level language and an environment for data analysis and graphics. In 1998, the Association for Computing Machinery (ACM) presented its Software System Award to John M. Chambers, the principal designer of S, for the S system, which has forever altered the way people analyze, visualize, and manipulate data... S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.'
The master site for CRAN is maintained in Austria and is mirrored by other sites worldwide. The R system is greatly enhanced by a variety of tools with excellent documentation. These tools are organised as base and contributed packages, which now number well over 500. Similarly to S-PLUS, an R package is a collection of object(s), dataset(s) or function(s) for specific tasks. As of the latest version (2.1.1), the R distribution comes with the following packages: base R functions (base); base R datasets (datasets); formally defined methods and classes for R objects and programming tools (methods); devices and functions for graphics (grid, grDevices, graphics); interface and language bindings to Tcl/Tk GUI elements (tcltk); tools for package development and administration (tools); utilities (utils); and R statistical functions (stats, stats4, splines). The recommended packages include bootstrap (boot); cluster analysis (cluster); interface to other statistical packages (foreign); lattice graphics (lattice); linear models and smoothing (mgcv, nlme, KernSmooth); recursive partitioning (rpart); survival analysis (survival); and functions and datasets to support the book Modern Applied Statistics with S [10] (VR). By default, these packages are installed with the R system. By contrast, contributed packages come from other users and require the install.packages() command for installation and the library() command to load.
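The distinction between packages shipped with R and contributed packages can be checked from within a session. The following is a minimal sketch: load_or_install is a hypothetical helper written for this illustration (it is not part of R), and the actual installation step is disabled by default because it assumes network access and a configured CRAN mirror.

```r
# Packages of "base" priority are distributed with R itself
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Hypothetical helper (an assumption of this sketch): attach a package,
# optionally installing it from CRAN first if it is not yet available
load_or_install <- function(pkg, install = FALSE) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!install) {
      message("package '", pkg, "' is not installed")
      return(invisible(FALSE))
    }
    install.packages(pkg)  # needs network access and a CRAN mirror
  }
  library(pkg, character.only = TRUE)  # attach the package to the session
  invisible(TRUE)
}

# A base package is always present, so nothing is downloaded here
load_or_install("stats")
```

For a contributed package one would call, for example, load_or_install("genetics", install = TRUE); whether that succeeds depends on the local setup.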
The R system is available for most computer systems, including Unix, Linux, Windows and MacOS X. With the R system is an object-orientated programming language, a powerful tool for organising the representation of information (classes) and the actions that are applied to these representations (methods). It now supports the S4 class system [11], which is distinguished from the S3 class system [12] and allows for object-orientated programming within an interactive environment, consistent validity checks and multiple method dispatch. In addition, R has a flexible graphical facility and is able to read data in a number of formats, including dBase, Stata, SPSS and SAS. It also includes a linear algebra package (LAPACK, http://www.netlib.org/lapack/). As R was developed using the model of S, from an institution where the C/C++ language was born, it interfaces natively with C/C++ and Fortran programs. Furthermore, it can be run both through a GUI and in batch mode, which allows new and experienced users to customise it to their own needs. Tcl/Tk is now part of the system, which may be used to create a user-defined GUI. There are also packages which provide an interface to the common gateway interface (CGI) and generate HTML/XML outputs. A closely related project is Omega, '...a joint project with the goal of providing a variety of open-source software for statistical applications' (http://www.omegahat.org), which aims to provide facilities to communicate between R and other applications such as Matlab, Perl and Python. The packages RMySQL and RODBC are useful for connecting to the MySQL database system and to Open Data-Base Connectivity (ODBC) data sources.
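To make the S4 features mentioned above more concrete, the following minimal sketch defines a class with a validity check and a generic function with one method. The class SnpGenotype and the generic alleleFreq are hypothetical names invented for this illustration; they are not taken from any of the packages discussed in this paper.

```r
library(methods)

# Hypothetical S4 class: unphased genotypes at a single marker,
# stored as two parallel vectors of allele labels
setClass("SnpGenotype",
  representation(a1 = "character", a2 = "character"),
  validity = function(object) {
    if (length(object@a1) != length(object@a2))
      return("a1 and a2 must have the same length")
    TRUE
  })

# A generic function and a method dispatched on the class of its argument
setGeneric("alleleFreq", function(x) standardGeneric("alleleFreq"))
setMethod("alleleFreq", "SnpGenotype", function(x) {
  alleles <- c(x@a1, x@a2)
  table(alleles) / length(alleles)  # relative allele frequencies
})

# Four individuals; the validity function runs when the object is created
g <- new("SnpGenotype",
         a1 = c("A", "A", "G", "A"),
         a2 = c("A", "G", "G", "G"))
alleleFreq(g)
```

Supplying vectors of unequal length to new() triggers the validity function and stops with an error, which is the kind of consistent checking that S3 classes do not enforce.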
More information about R, including documentation and recommended reading, is available from CRAN.
ape. Analyses of Phylogenetics and Evolution: provides functions for reading and plotting phylogenetic trees in parenthetic format (standard Newick format), analyses of comparative data in a phylogenetic framework, analyses of diversification and macroevolution, computing distances from allelic and nucleotide data, reading nucleotide sequences from GenBank via the internet, and several tools such as Mantel's test, computation of minimum spanning tree or the population parameter theta based on various approaches.
bqtl. QTL mapping toolkit for inbred crosses and recombinant inbred lines. Includes maximum likelihood and Bayesian tools.
genetics. Classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, estimating and testing for Hardy-Weinberg disequilibrium, estimating and testing for linkage disequilibrium.
hapassoc. A package used for likelihood inference of trait associations with haplotypes and other covariates in generalised linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs [14].
haplo.score. A suite of routines that can be used to compute score statistics to test associations between haplotypes and a wide variety of traits, including binary, ordinal, quantitative and Poisson [15]. These methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The methods provide several different global and haplotype-specific tests for association, as well as adjustment for non-genetic covariates and computation of simulation p-values (which may be needed for sparse data).
haplo.stats. A suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes [16]. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be co-dominant (ie one-to-one correspondence between their genotypes and their phenotypes), and the measurements of genetic markers are referred to as genotypes. The main functions in haplo.stats are: haplo.em, haplo.glm and haplo.score. The haplo.score function is an extension of an earlier function in the haplo.score package.
hierfstat. Estimation of hierarchical F-statistics from haploid or diploid genetic data with any number of levels in the hierarchy, and tests for the significance of each F and variance components [17].
hwde. Fits models for genotypic disequilibria, as described by Weir and Wilson [18] and Huttley and Wilson [19]. Contrast terms are available which account for the first-order interactions between loci.
ldDesign. A package for design of experiments for association studies for detection of linkage disequilibrium. Uses an existing deterministic power calculation for detection of linkage disequilibrium between a biallelic QTL and a biallelic marker, together with the Spiegelhalter and Smith Bayes factor, to generate designs with power to detect effects with a given Bayes factor [20].
LDheatmap. A package to create a heat map (a false colour image with a dendrogram added to the left side and to the top) of linkage disequilibrium involving SNPs, using both r and D′.
PHYLOGR. Manipulation and analysis of phylogenetically simulated datasets (as obtained from PDSIMUL in package PDAP) and phylogenetically-based analyses using GLS.
qtlDesign. Tools for the design of QTL experiments [21].
R/gap. An integrated package for genetic data analysis of both population and family data. It contains functions for sample size calculations of both population- and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. The functions included are: hwe, hwe.hardy for Hardy-Weinberg equilibria involving SNPs and highly polymorphic microsatellite markers; s2k, gcontrol for single-locus association analysis of polymorphic markers and genomic control [22,23]; genecounting, gcp for haplotype analysis of all chromosomes and missing data [24] and permutation tests; tbyt, kbyl for linkage disequilibrium statistics for SNPs and multiallelic markers; htr, hap.score for extracting haplotype information for haplotype trend regression analysis and regression incorporating covariates based on conditional regression, as implemented in the haplo.score package [15]. For family data, it includes family plotting through graphviz (pedtodot), exact probability of familial clustering of disease (pfc and pfc.sim) [25], and kinship calculation, which involves the genetic index of familiality (gif) and a simple kinship calculation (kin.morgan). Currently, it is bundled with an experimental version of POINTER and PATHMIX [8].
rmetasim. An interface between R and the metasim simulation engine [26]. Facilitates the use of the metasim engine to build and run individual-based population genetics simulations.
R/qtl. Analysis of experimental crosses to identify QTLs [27].
The following packages are not available from CRAN, but conform to the R standard:
happy. An R interface into the C package HAPPY for fine-mapping QTL in heterogeneous stocks [28], which is an advanced intercross between (usually eight) founder inbred strains of mice suitable for fine-mapping QTL. The happy package is an extension of the original C program happy; it uses the C code to compute the probability of descent from each of the founders, at each locus position, but happy allows a much richer range of models to be fitted to the data.
tdthap. Transmission/disequilibrium tests (TDT) for extended haplotypes, according to Clayton and Jones [29].
popgen. A package which implements a variety of statistical and population genetic methodology [30].
An ld2 function for two-locus log-linear models is available from gllm, a package of routines for log-linear models of incomplete contingency tables [31], including some latent class models via expectation maximisation (EM) and Fisher scoring approaches. Basic tools for applied epidemiology are implemented in the general-purpose package epitools, and the visualizing categorical data [32] (vcd) package includes Woolf's test for homogeneity on 2 × 2 × k tables over strata (ie if the log odds ratios are the same in all strata). The locfdr package is for computation of local false discovery rates [33]. The rmeta package contains many functions for meta-analysis which would be appropriate for the genetic analysis setting, while the BradleyTerry package can be used for TDT analysis. A potentially useful package for genome-wide association analysis is evd, implementing extreme value distributions. R functions associated with specific papers include link/tdt [34,35], EHP [36], tdtexact [37] and htpower/Nstage [38,39]. A number of R programs, including those for methods of genomic controls, are available from the University of Pittsburgh computational genetics lab (http://wpicr.wpic.pitt.edu/wpiccompgen/). They use the familiar format of input/output files, but are somewhat informal compared with many packages on CRAN.
Many packages for microarray data analysis are available from CRAN and the Bioconductor project (http://www.bioconductor.org); for example, affy for Affymetrix, marray and arrayMagic for cDNA data processing and packages for extracting signals from the scanner (Spot), for gene annotation and delineating biological pathways (annotate). Unlike genotype data, gene expression data, after data pre-processing including normalisation, can generally be analysed using the recommended packages installed with R for standard statistical analysis. Bioconductor additionally provides packages for adjusting for multiple testing (multtest), which is a typical issue in analysing high-dimensional microarray data. Taking advantage of the extensive graphical abilities of R, the package geneplotter allows users to associate microarray expression data with chromosomal location and to visualise their data using whole genome or single chromosome plots. The package Rgraphviz can be used for laying out biological pathways.
The numerous packages available may appear daunting, but a recent feature of R is the so-called CRAN task views, which allow users to browse packages by topic and provide tools to automatically install all packages for special areas of interest. A version for genetic analysis has been developed by Gregor Gorjanc (http://www.bfro.uni-lj.si/MR/ggorjan/software/R/Genetics.html) and will be available soon.

Example applications
In this section, some examples are given to illustrate the development and use of the R packages described above.
Example 1. Haplotype frequency estimation including haplotype association with case-control data. A number of computer programs have been written by one of the authors (J.H.Z.) for this purpose: 2LD [40], EH+ [41], fastEH+ [42], GENECOUNTING and HAP [24,43]. They have now been integrated into functions available from R/gap, so that haplotype frequencies can be estimated using the EM algorithm [44], including data on Chromosome X, to serve as input for tbyt or kbyl to obtain linkage disequilibrium measures such as D′ and r² and a linkage disequilibrium heat map. Instead of calling the executable files with utilities such as LDSHELL [45], a simple loop is sufficient to run a sliding windows analysis and for estimation using data from several populations. Furthermore, haplotype assignment can be read into R for haplotype trend regression [46] of cross-sectional or longitudinal data. In addition, some well-known datasets [47,48] can be stored in compact form with detailed documentation and retrieved when needed.
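As an illustration of the linkage disequilibrium measures named in Example 1, the following base R sketch computes D, D′ and r² for two biallelic markers from haplotype frequencies that are assumed to be already estimated (for instance by an EM algorithm). The function ld_measures and the example frequencies are hypothetical and written for this illustration; this is not the implementation of tbyt or kbyl in R/gap. D′ is reported with the sign of D, and monomorphic markers (which would give a zero denominator) are not handled.

```r
# h: haplotype frequencies in the order AB, Ab, aB, ab (must sum to 1)
ld_measures <- function(h) {
  h <- unname(h)
  stopifnot(length(h) == 4, abs(sum(h) - 1) < 1e-8)
  pA <- h[1] + h[2]              # frequency of allele A at marker 1
  pB <- h[1] + h[3]              # frequency of allele B at marker 2
  D  <- h[1] - pA * pB           # departure from independence
  Dmax <- if (D >= 0) {
    min(pA * (1 - pB), (1 - pA) * pB)
  } else {
    min(pA * pB, (1 - pA) * (1 - pB))
  }
  c(D = D,
    Dprime = D / Dmax,
    r2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)))
}

# Made-up haplotype frequencies for two populations
pops <- list(pop1 = c(0.40, 0.10, 0.10, 0.40),
             pop2 = c(0.25, 0.25, 0.25, 0.25))

# One row per population, one column per measure
res <- t(sapply(pops, ld_measures))
print(res)
```

The same sapply call could be applied to a list of adjacent marker pairs, which is one way to express the sliding windows analysis mentioned above as a simple loop.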
Example 2. A collaborative study on genetics of alcoholism (COGA) data from the Genetic Analysis Workshop 14 (GAW14). The microsatellite markers are given in fixed ASCII format. In previous analysis [49], C utility programs had to be written to read the marker data in allele size. Now, one can use read.fortran to read such formatted data. One can also use the genetics package to test for Hardy-Weinberg equilibrium, and pedigree diagrams can be drawn all in one go for the 143 pedigrees involved (see http://www.ucl.ac.uk/~rmjdjhz/r-progs.htm). The use of the kinship package for the mixed-effects Cox model of alcoholism in extended pedigrees, including family relationship and with microsatellite markers, has been reported [50].

Example 3. Log-linear models for genotype data. The R package, hwde, has provided an example from Huttley and Wilson [19]; see the detailed information given in the package vignette.

Example 5. Open database connection. RODBC implements ODBC with compliant databases when drivers exist on the host system. The following is an example for reading all columns of tblOutput in a Microsoft Access database aedata.mdb. The end result is a data frame called tblOutput.

    # load the library and connect to Microsoft Access
    library(RODBC)
    c2 <- odbcConnectAccess("c:/aesop/mdb/aedata.mdb")
    # select one table from the database
    tblOutput <- sqlQuery(c2, "select * from tblOutput")
    # a data.frame
    class(tblOutput)
    # close the connection when finished
    odbcClose(c2)

This shows that R is able to make queries using structured query language (SQL) to a formal database system, so that marker information from genome-wide linkage and association studies can be organised and retrieved in a similar fashion, and synchronised updates and communications are possible.
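Returning to the fixed-format marker data of Example 2, the following minimal sketch shows how read.fortran from the utils package reads fixed-width records using Fortran-style format codes. The record layout (a four-character pedigree ID, a three-digit individual ID and two three-digit allele sizes) and the data values are assumptions made up for this illustration; they are not the actual GAW14 file layout.

```r
# Hypothetical fixed-width records (not the real COGA layout):
# columns 1-4 pedigree ID, 5-7 individual ID, 8-10 and 11-13 allele sizes
records <- c("F001001120124",
             "F001002118124",
             "F002001120120")

# Write the records to a temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".dat")
writeLines(records, tmp)

# "A4" = 4-character field, "I3" = 3-digit integer, "2I3" = two such integers
markers <- read.fortran(tmp, format = c("A4", "I3", "2I3"))
names(markers) <- c("ped", "id", "a1", "a2")
unlink(tmp)

print(markers)

# Tabulate allele sizes across both allele columns
table(c(markers$a1, markers$a2))
```

Once the alleles are in a data frame like this, they can be passed on to functions for genotype handling or Hardy-Weinberg testing without any external utility programs.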

Comparison and integration with other software systems
As R has many functions available in a single environment, minimum effort is needed to write programs for data handling. One can, then, concentrate on the statistical algorithm and analysis. This is clearly advantageous over stand-alone programs. The need to integrate internet capability within the data analytical system is also essential, given that data on several international projects are available from the internet.
Some researchers prefer to analyse data from large genetic studies using a hybrid of Perl or other scripts with programs written in C/C++; however, these programs are more targeted at computing professionals, having a relatively smaller statistical component. It may be more difficult to re-use codes written for such purposes. Program development may be more time-consuming, especially when analysis involving both genetic and environmental factors is required [51]. Even so, it is possible to use R as an independent program for such purposes. Likewise, compiled R packages for specific computer systems, but not the source code, can be distributed if necessary.
A large number of programs with a GUI have been developed recently. A notable example in multilevel modelling is the MIXOR/MIXREG and associated programs for longitudinal data analysis. With few exceptions, such as UNPHASED and JPAP, the source codes are largely unavailable, so it is sometimes difficult to assess the validity of the programs. Users should therefore remain familiar with a variety of implementations. They will encounter the usual problems of idiosyncratic data formats and source codes that are difficult to reuse. An alternative would be to use Java as an interface to the stand-alone programs, in order to run them in batch mode. The documentation associated with individual functions, however, is often poorer than that in R. In this regard, a different interface is provided by Rweb [52]. It provides a simple text entry form that returns output and graphs, a more sophisticated Javascript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyse internet-accessible datasets if a URL is provided. It has also been shown that Perl can be used within R (see the gregmisc package).
Software development could be based on other environments, for example, Stata, SAS and S-PLUS, including some corporate efforts, such as SAS/GENETICS. The R package foreign provides commands to read and write dBase, Stata, SPSS and SAS xport files or access to Microsoft Excel/Access via ODBC, whereas data transformations between Stata and other applications require Stat/Transfer. Most programs written in R can be used with little alteration under S-PLUS. R has a clear advantage on graphics, and it is easier to incorporate routines written in C/C++/Fortran. Unlike SAS, it does not require a separate module for matrix operations.
A final note is given here regarding feedback that the authors received when developing R/gap and kinship, so as to show the benefit of the collaborative work that R encourages. The hwe.hardy function in R/gap was originally designed to accept only the full array containing the genotype counts, but was later extended according to a recommendation to use the genotype objects created by the genetics package. The C output format, %lf, was not supported by the American National Standards Institute (ANSI) standard and was subsequently changed following the advice of the R core development team. A compiling error with emx.f in the original POINTER program was also pointed out and later fixed. The kinship package was ported directly from S-PLUS. Extensive efforts were required for debugging; however, this has been greatly facilitated by the package debug from CRAN. There was also a problem with MacOS X in kinship, but this was subsequently fixed according to suggestions.

Discussion
The authors have described both the motivation and prospects for using R as an integrated environment for genetic data analysis. While a formal presentation of R and comparison with other systems might have been given, the description has been deliberately kept informal. The following recaps the features of the R system.
First, it provides a flexible, integrated environment for statistical computing using an object-orientated programming language. It provides standard formats for data input, documentation and an interface to general statistical packages such as Stata, SPSS, SAS, S-PLUS and databases such as dBase, Microsoft Access/Excel, Oracle (http://www.oracle.com) and MySQL (http://www.mysql.com). Above all, the R system is now a collaborative work, involving many people, and is available on most computer systems. Secondly, the environment can be greatly enhanced by contributed packages, which can either be implemented in the native R language or as a hybrid with external languages such as C/C++/Fortran/Perl. This allows for the easy incorporation of rich collections of algorithms and programs that have already been developed over the years. Packages can also be usefully incorporated from other areas of research. For example, packages for operations research, statistics in psychology, social network analysis, neuroimaging and spatial disease mapping are available in the same repository. Thirdly, standard datasets or benchmarks can be included as native objects in a package; these are ideal for evaluating new analytical methods. Fourthly, the functions and data in a package can serve for a variety of analyses. In haplotype analysis, for example, this could include estimation of haplotype frequencies, assignment of possible haplotypes, a linkage disequilibrium heat map and conditional and joint analysis with environmental factors, among others.
Given that the development of the R system is relatively recent, the wide range of tools available is impressive. The comprehensive and powerful features of R in data management, graphics and standard statistical analysis are making it a very useful platform for microarray data pre-processing, visualisation and advanced statistical analysis. There is also a rich array of packages for the analysis of population data, phylogenetic analysis and the analysis of quantitative traits from experimental designs, although there is still a relative shortage of packages for the calculation of identity-by-descent and therefore for the analysis of discrete traits or QTLs in human pedigrees. Packages for complex segregation analysis and path analysis are still experimental. Given the ease of creating packages from code that is already available, however, the authors expect that this situation will soon change.
Two important points should be made here. First, it should be pointed out that the use of R should not block the development of stand-alone programs. Secondly, a distinction should be made between potential and reality. The authors have come across arguments that implementations are trivial and that computer programming, including R programming, by statisticians is by default straightforward. This may not be the case, however, and a more thoughtful approach is necessary. Often, software is cursorily written and poorly documented, with no consideration for generality and use of examples, and consequently is hardly of any practical value. Fortunately, with the help of the R core development team, it is possible to produce industry-standard applications. The authors note that, at the time of writing, a special issue of the Journal of Statistical Software (http://www.jstatsoft.org) has been devoted to the transition of packages in XLISP-STAT to R. Given the current situation in genetic data analysis, it is now time for action.
In summary, the authors believe that R can potentially serve as an integrated platform for analysis of genetic data. While the packages currently available in R are limited, it is expected that its rich features will increasingly attract more developers and users. Further attention by theoretical and applied geneticists to software development and analysis will be very rewarding in the long term.