From the very beginning, the core GeneCards features included two important components: the capability to view integrated details about a gene in 'card' format and a full text-based search engine. GeneCards has evolved by constantly adding new data sources and data types (eg protein expression and gene networks), revamping the search engine to improve results and performance, and expanding the original gene-centric dogma to encompass sets of genes.
Currently, GeneCards automatically mines over 90 sources in an offline process and constructs a consolidated gene list. First, the complete current snapshot of the HUGO Gene Nomenclature Committee (HGNC)-approved symbols is used as the core gene list. Next, human Entrez Gene entries that are different from the HGNC genes are added. Finally, human Ensembl records are matched against the emerging gene list via GeneLoc's exon-based unification algorithm; those that are not found to be equivalent to others in the set are included as novel Ensembl-based GeneCards gene entries. These primary sources provide annotations for aliases, descriptions, previous symbols, gene category, location, summaries, paralogues and non-coding RNA (ncRNA) details. Once the gene list is in place with these significant annotations, over 90 data sources (including those noted above and others[4–9]) are mined for thousands of additional descriptors.
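The three-step construction of the gene list can be illustrated with a minimal Python sketch. It is not the GeneCards implementation: the `GeneRecord` type, the `build_gene_list` function and the `shares_exon` test are hypothetical names introduced here. In particular, `shares_exon` is only a stand-in for GeneLoc's exon-based unification algorithm, whose details are not given in this section, and the sketch assumes that an Entrez Gene entry is 'different' when it carries no symbol already on the list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Exon = Tuple[str, int, int]  # hypothetical (chromosome, start, end) triple


@dataclass(frozen=True)
class GeneRecord:
    source: str                      # "HGNC", "Entrez" or "Ensembl"
    identifier: str                  # source-specific identifier
    symbol: Optional[str] = None     # approved symbol, when the source has one
    exons: FrozenSet[Exon] = frozenset()


def shares_exon(a: GeneRecord, b: GeneRecord) -> bool:
    """Stand-in equivalence test (assumption): two records are treated as
    the same gene if they have at least one identical exon."""
    return bool(a.exons & b.exons)


def build_gene_list(hgnc: List[GeneRecord],
                    entrez: List[GeneRecord],
                    ensembl: List[GeneRecord]) -> List[GeneRecord]:
    # Step 1: the HGNC snapshot is the core gene list.
    genes = list(hgnc)
    known_symbols = {g.symbol for g in genes if g.symbol is not None}

    # Step 2: add Entrez Gene entries not already covered by an HGNC symbol.
    for rec in entrez:
        if rec.symbol is None or rec.symbol not in known_symbols:
            genes.append(rec)
            if rec.symbol is not None:
                known_symbols.add(rec.symbol)

    # Step 3: add Ensembl records with no equivalent in the emerging list.
    for rec in ensembl:
        if not any(shares_exon(rec, g) for g in genes):
            genes.append(rec)
    return genes
```

The order of the steps matters in this sketch: each later source contributes only records that the earlier sources have not already supplied, so the HGNC-approved entry always takes precedence when several sources describe the same gene.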
The data for each gene are collected into a text file which is used to display the web-card. In addition to the legacy text file format, the complex data model of GeneCards version 3 is stored in relational databases. One database ('by resource') stores the data largely in the originally mined architecture, and another database ('by function') supports the website and has over 130 tables and views, with an average volume of hundreds of thousands of records. The largest table has over 6.5 million rows. This compendium is modelled into 40 entities, with hundreds of hierarchical relationships. The introduction of the relational database enables the execution of complex queries in the advanced search mode and sophisticated functionalities for sets of genes. The 'by function' data model is strongly influenced by the organisation of information in sections on the web-card (eg first descriptions, then integrated locations, followed by all disorders and so on), an organisation based on integrated scientific logic, which also keeps track of originating sources of information.
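As a rough illustration of how a relational model can answer a section-specific query while keeping track of the originating source, the following Python sketch uses an in-memory SQLite database. The schema is purely hypothetical: the table names (`gene`, `source`, `disorder`), the column names and the placeholder rows are assumptions made for this example and do not reflect the actual GeneCards tables, which are not described here.

```python
import sqlite3

# Hypothetical three-table schema: each disorder annotation points to the
# gene it describes and to the source it was mined from.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT NOT NULL);
CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE disorder (
    gene_id INTEGER REFERENCES gene(id),
    source_id INTEGER REFERENCES source(id),
    name TEXT NOT NULL
);
-- Placeholder rows, not real annotations.
INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B');
INSERT INTO source VALUES (1, 'source_1'), (2, 'source_2');
INSERT INTO disorder VALUES
    (1, 1, 'disorder_x'),
    (1, 2, 'disorder_y'),
    (2, 2, 'disorder_x');
""")


def genes_with_disorder(connection, disorder_name):
    """Return (gene symbol, originating source) pairs for one disorder."""
    cursor = connection.execute(
        """
        SELECT g.symbol, s.name
        FROM disorder AS d
        JOIN gene AS g ON g.id = d.gene_id
        JOIN source AS s ON s.id = d.source_id
        WHERE d.name = ?
        ORDER BY g.symbol
        """,
        (disorder_name,),
    )
    return cursor.fetchall()
```

Calling `genes_with_disorder(conn, 'disorder_x')` returns every gene annotated with that placeholder disorder together with the source that reported it, which is the kind of joined, provenance-aware result that a flat text file per gene cannot easily provide.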
The GeneCards search is made possible by Lucene-based Solr technology,[11, 12] coupled with our original database crawler, enabling new levels of meta-annotation for field-specific dissections. In GeneCards Version 3, the search also introduces new features, including stemming (using the grammatical root along with its inflections) and proximity relations for multi-word searches (using the distance between found instances of each searched word, for relevance). Users can home in on their most desired results by viewing 'minicards' and examining expanded annotations on their chosen GeneCards gene.
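The two search features named above, stemming and proximity, can be conveyed with a toy Python sketch. It does not reproduce the Lucene or Solr implementation: `toy_stem` is a deliberately naive suffix stripper rather than a real stemming algorithm, and `min_span` and `proximity_score` are hypothetical helpers that assume relevance should grow as the matched words occur closer together.

```python
import re
from itertools import product


def toy_stem(word):
    """Naive stemmer (assumption): strip a few common English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words such as "express" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def min_span(text, query):
    """Smallest token distance covering one hit for every query word,
    or None if any query word has no stemmed match in the text."""
    tokens = [toy_stem(t) for t in re.findall(r"[A-Za-z0-9]+", text)]
    positions = []
    for term in query.split():
        stem = toy_stem(term)
        hits = [i for i, tok in enumerate(tokens) if tok == stem]
        if not hits:
            return None
        positions.append(hits)
    # Exhaustive over hit combinations; adequate for short illustrative texts.
    return min(max(combo) - min(combo) for combo in product(*positions))


def proximity_score(text, query):
    """Hypothetical relevance: 0 for no match, higher for closer terms."""
    span = min_span(text, query)
    return 0.0 if span is None else 1.0 / (1.0 + span)
```

With these definitions, the query 'kinase binding' matches the text 'Protein kinases bind receptors' through the shared stems 'kinase' and 'bind', and it scores higher there than in a text where the two words are separated by several others.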
More specialised capabilities that exploit the wealth of the GeneCards data are available from the GeneCards Suite: GeneNote and GeneAnnot for transcriptome analyses, GeneLoc for genomic locations and markers, GeneALaCart for batch queries and GeneDecks for finding functional partners and for gene set distillations [4, 7, 13, 14].
The GeneCards project's instantiation of data management planning, implementation, releases and versioning, with examples of its sources, technologies, data models, presentation needs, de novo insights, algorithms,[14–16] quality assurance, user interfaces and data dumps, is described in detail by Mayer et al. Over the years the life cycle has included project planning phases followed by implementation, development and semi-automated quality assurance, and deployment approximately three times a year, cycling back into new planning phases for subsequent revisions. Technologies used include Eclipse, Apache, Perl, XML, PHP, Propel, Java, R and MySQL. This platform enables user capacities that allow targeted searches, including search 'by section'. Importantly, because GeneCards mines from so many sources, each specific search amounts to obtaining knowledge from judiciously selected excerpts from many of these sources.