Our strategy has been to utilise public domain software and a modular approach. In conjunction with common data interchange formats, this provides a robust setup that can be easily adapted and adopted by other groups studying complex disease. Essentially, we have achieved an integration of genomics and genetics underpinned by an integration of the workflows of genome informatics, data management and laboratory experiments and reagents.
Traditionally, individual researchers focus on their own regions or single genes and hold their own data. This makes data archiving, integration and mining impractical. In the DIL, all generated data are acquired and stored centrally in the relevant database. While all the databases are centralised, we believe that the best curation of the data is performed by the scientists. Therefore, user-friendly web or VB front-ends, together with auditing strategies, are provided; this allows the user to alter their own data in a responsible but reversible manner.
MySQL is used as the database of choice; Oracle(TM) was considered but the cost was prohibitive. The performance and ease of administration of MySQL have been very good. Some design limitations have, however, led to a substantial effort in data manipulation and off-line checking to emulate Oracle's transaction handling, form triggers, logging and referential integrity checks.
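As an illustration of the kind of off-line referential integrity check mentioned above, the following is a minimal sketch only. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the function name and the table and column names in the usage note are hypothetical; they are not taken from the DIL schema.

```python
import sqlite3


def find_orphans(conn, child_table, child_column, parent_table, parent_column):
    """Return child values that have no matching row in the parent table.

    This emulates a foreign-key check in a database that does not enforce
    one. Table and column names are interpolated into the SQL, so they must
    come from trusted configuration, never from user input.
    """
    query = (
        f"SELECT DISTINCT c.{child_column} "
        f"FROM {child_table} AS c "
        f"LEFT JOIN {parent_table} AS p "
        f"ON c.{child_column} = p.{parent_column} "
        f"WHERE c.{child_column} IS NOT NULL "
        f"AND p.{parent_column} IS NULL "
        f"ORDER BY c.{child_column}"
    )
    return [row[0] for row in conn.execute(query)]
```

Called as, for example, find_orphans(conn, "genotype", "sample_id", "sample", "sample_id"), it would list the sample identifiers referenced by genotype records that are missing from the sample table. A scheduled job could log any such identifiers for curation.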
The genotyping, sample and freezer management databases have simple design goals and schemas, aiming to capture large volumes of essentially similar data. Standards are yet to emerge for the sensible design of blood and/or genotyping databases. The rising interest in research governance will probably change that, as medical scientists become obliged to demonstrate ethical and accountable working practices.
The feature database structure has been designed to be as simple as possible and relies on flexibility to overcome the dual problems of complexity and increasing data and data types. This flexibility comes at the cost of query speed. To address this, work is ongoing to adopt data transformation techniques in order to build an EnsMart-style database to allow fast, complex read-only queries.
Manual annotation of genes of interest is essential to exclude false-positive and false-negative predictions of genes or parts of genes, especially given the multiplicity of splice variants for many genes. While the Ensembl predictions are becoming increasingly accurate, they still remain predictions. Endeavours such as Vega by the Sanger Institute's Havana group also improve the accuracy of the available gene structures. This will not fully replace local verification of annotation, but it will help to speed up local annotation.
With each new version of the genome build and Ensembl, all genome mappings have to be updated. The most time-consuming task is downloading and installing the Ensembl data locally. The remapping itself takes a relatively short time, but its speed could be improved by better heuristics, such as checking whether the regions of interest have changed chromosomal coordinates and/or have a different sequence length. If the coordinates and length are the same, the sequence should be the same, and no remapping would have to be performed.
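The proposed heuristic can be sketched as follows. This is an illustrative outline rather than the remapping software itself; the Region record and the function name are hypothetical, and the sketch assumes that the coordinates and sequence length of each region of interest are available for both builds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Location of one region of interest in a given genome build (hypothetical)."""
    chromosome: str
    start: int
    end: int
    length: int  # sequence length in bases


def regions_needing_remap(old_build, new_build):
    """Return the sorted names of regions that differ between two builds.

    Both arguments map a region name to a Region. A region is skipped only
    if its chromosome, coordinates and length are all unchanged; a region
    that is missing from the new build is also reported.
    """
    changed = []
    for name, old in old_build.items():
        new = new_build.get(name)
        if new is None or new != old:
            changed.append(name)
    return sorted(changed)
```

Identical coordinates and length make an unchanged sequence likely but do not guarantee it, so a cautious variant could additionally compare a checksum of the sequence before skipping a region.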
The advantage of a modular system is that other genome viewers/editors can easily be adopted, provided that common data types, such as GFF or DAS, are used. The system is also extremely flexible, thus allowing the straightforward addition of new features such as a local locuslink.
The ability to add plug-ins to Gbrowse makes this system very powerful. Three types of plug-in exist: finders, dumpers and annotators. The dumper plug-in, for example, takes features from a display and allows them to be written as text. This can be used to take all the local SNPs and display summary statistics for them. Calculating summary statistics for all our SNPs, however, takes too long to be performed dynamically. Work is therefore in progress to store the derived data, such as QC/QA and statistical data, in a data warehouse using the EnsMart data model. The finder and annotator plug-ins can be used to find information of a certain type in the in-house database: an annotator plug-in returns fine localisation within a single region (for example, all SNPs with P-values < 0.005), whereas a finder plug-in works more globally (for example, finding all manually curated genes in all the regions). This system can also be used to attach biological experimental data to genes.
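The example query above (all SNPs with P-values < 0.005) can be illustrated outside Gbrowse with a short filter over GFF records. This is a sketch under stated assumptions, not a Gbrowse plug-in: it assumes nine tab-separated GFF columns with key=value attributes separated by semicolons, a feature type of "SNP" and an attribute named "pvalue", all of which are hypothetical choices for illustration.

```python
def parse_gff_line(line):
    """Split one nine-column GFF line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }


def significant_snps(gff_lines, threshold=0.005):
    """Return SNP features whose (hypothetical) pvalue attribute is below threshold."""
    hits = []
    for line in gff_lines:
        # Skip blank lines and comment or directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        feature = parse_gff_line(line)
        if feature["type"] != "SNP":
            continue
        pvalue = feature["attributes"].get("pvalue")
        if pvalue is not None and float(pvalue) < threshold:
            hits.append(feature)
    return hits
```

SNPs without a pvalue attribute are silently skipped here; whether that is appropriate would depend on how the association results are stored.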
Our strategy of resequencing exons, together with the 3 kb region 5' of the first exon and the 3 kb region 3' of the last exon, will find some variants located in regulatory sequences. Some regulatory sequences, however, may be located in introns further than 3 kb away from the gene start and end. A public domain SNP and haplotype map of the genome is being constructed, which will greatly facilitate scanning of complete regions or chromosomes, rather than the stopgap measure of interrogating the approximately 5 per cent of the genome containing exons and conserved sequences.
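The resequencing targets described above can be derived from exon coordinates as in the following sketch. The function name, the 1-based inclusive coordinate convention and the labels are assumptions made for illustration; the 3 kb flank size is taken from the text.

```python
FLANK = 3000  # 3 kb on either side of the gene, as described in the text


def resequencing_targets(exons, strand, flank=FLANK):
    """Return (start, end, label) targets for one gene, in genomic order.

    exons  -- list of (start, end) pairs, 1-based inclusive (an assumption)
    strand -- "+" or "-", which decides which flank is 5' and which is 3'
    """
    if not exons:
        raise ValueError("at least one exon is required")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    exons = sorted(exons)
    gene_start = exons[0][0]
    gene_end = exons[-1][1]
    # On the minus strand the first exon is the rightmost one, so the
    # flank labels swap sides.
    if strand == "+":
        left_label, right_label = "5' flank", "3' flank"
    else:
        left_label, right_label = "3' flank", "5' flank"
    targets = []
    left_start = max(1, gene_start - flank)  # clip at the chromosome start
    if left_start <= gene_start - 1:
        targets.append((left_start, gene_start - 1, left_label))
    targets.extend((start, end, "exon") for start, end in exons)
    # The right flank is not clipped because the chromosome length is not
    # known to this function.
    targets.append((gene_end + 1, gene_end + flank, right_label))
    return targets
```

Introns are deliberately absent from the returned targets, which makes explicit why regulatory sequences lying in introns, or beyond the 3 kb flanks, are missed by this strategy.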
Comparative genomics, where genomes from different species are used to identify highly conserved sequences, has become a powerful tool for identifying potential regulatory elements [38, 39]. We are currently testing different programs such as BLASTZ, LAGAN, MLAGAN and WU-BLAST [24, 40, 41] to integrate the detection of conserved blocks into our research. The calculated conserved blocks and pairwise percentage identity plots can also be integrated with Gbrowse.
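To make the notion of a pairwise percentage identity plot concrete, the sketch below scores sliding windows over two sequences that have already been aligned (for example, by one of the programs above). It is not the method used by any of those programs. The function names, the window size, the step and the identity threshold are illustrative assumptions, as is the decision to ignore alignment columns that contain a gap.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of matching bases over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def conserved_windows(aligned_a, aligned_b, window=100, step=50, min_identity=70.0):
    """Return (offset, identity) for each window at or above min_identity.

    Offsets are 0-based positions in the alignment, not genomic coordinates.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for offset in range(0, len(aligned_a) - window + 1, step):
        identity = percent_identity(
            aligned_a[offset:offset + window],
            aligned_b[offset:offset + window],
        )
        if identity >= min_identity:
            hits.append((offset, identity))
    return hits
```

The (offset, identity) pairs could then be converted to genomic coordinates and written out as features for display alongside other annotation.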
The development of the integrated infrastructure has taken two years with a team of seven developers and three systems research staff. This is not full-time development; our development work is driven by the science, ie enabling scientists to make discoveries about T1D. Certain elements of the system, such as the feature database, BMS, genotype database, Gbrowse and the Ensembl extraction process, could easily be deployed in other laboratories studying complex disease, but other elements, such as the remapping strategy and software, would require a certain degree of recoding to work independently of our hardware setup. The hardware requirements depend on the size of the study and on how much data storage and remapping is required. Work is currently under way to provide an automatic system-independent installation of the modules and their linking software. In addition, we work closely with other groups working on similar projects, such as the Institute of Systems Biology.