Editorial

Recent technological advances have drastically increased the number of experiments or assays one can perform with a small team of researchers and a manageable set of instruments. The resulting reduction in time and cost makes it feasible to perform large-scale studies. While researchers can produce enormous amounts of data over a short period of time with funding from an average research grant, extracting useful information from the data is a challenge, the difficulty of which most large-scale studies underestimate. Usually understaffed, and with an inadequate budget, the analysis teams often lack the proper analytical tools, yet have the unenviable task of drawing conclusions from data without a full understanding of the conditions under which the data were obtained or the tools used today. Unless we pay more attention to data analysis, we will soon drown in a sea of conflicting data and miss the biological pattern and information hidden within them.

How to extract signal out of noise is a problem that physicists and engineers have been grappling with for a long time. The fact that a large group of people at the airport can communicate with specific individuals around the world by mobile phones and wireless internet connections is proof that the engineers have solved the problem of extracting good signal out of noise in the area of global communications. The biological world, however, especially in the study of humans, is a lot more complicated. Unlike the world of communications, where one is extracting the signal from a narrow band of frequencies in real time, human studies are mostly performed with data generated from subjects on one or, at most, a handful of occasions. To complicate matters even further, these 'snapshots' are usually taken without much consideration for the environment in which the subjects find themselves.

The seriousness of the data analysis problem is seen in all three of the areas on which this issue of Human Genomics focuses.
In human genomics, the reference human genome sequence will take years to fully annotate. In fact, there are still a significant number of places in the reference sequence where mistakes in sequence assembly are found. These mistakes were made because the automated assembly software could not handle low-copy duplications in the genome, especially when they were very close to each other. The misassembled sequences can only be corrected when experts carefully analyse the genome sequence in their regions of interest. Similarly, as single nucleotide polymorphisms (SNPs) in their hundreds of thousands are being genotyped for genetic association studies, confounding characteristics such as differences in population substructure between cases and controls must be taken into account; otherwise, spurious associations will result or true associations may be missed.

In proteomics, advances in protein separation and mass spectrometry make it possible to obtain protein profiles of biological materials in terms of both the specific proteins present and their relative abundance. Even with a good catalogue of proteins found in a specimen, however, the fact that most tissue specimens consist of a mixture of different types of cells, and that they are obtained under slightly different physiological conditions, creates a great deal of additional noise in the system. It is no surprise, therefore, that proteomic data can only be analysed in a qualitative and superficial way.

In gene expression profiling, probably the most mature form of large-scale studies in the genomics field, intriguing patterns of expression are seen when one compares different tissues. Once again, heterogeneity in the tissues studied and differences in the conditions under which the samples are obtained affect the gene expression patterns in significant ways. Even so, in some cases, one can predict the prognosis of a patient by looking at the expression pattern of cancerous tissue.
There is, however, still great uncertainty in the predictive value of gene expression profiling as a diagnostic tool. Much like the case of proteomics, these patterns will not lead to a deeper understanding of the biological pathways involved in a disease without careful cell biology studies.

Given these difficulties, how does one take advantage of the amazing technologies available in the fields of genomics and proteomics? I believe that the field must refrain from generating data for its own sake. Instead, one must face the problem head on and take the following actions.

First, one must define the specimens under study with a great deal more detail, so that one is comparing different specimens that are appropriately grouped. For example, instead of labelling DNA samples simply as being from patients having a certain disease or from a group of 'normal controls', one should define the specimens further with as much phenotypic and demographic information as possible, including, at the very least, carefully defined clinical diagnoses, key laboratory findings, major clinical features, age at disease onset, sex and ethnic origins of the four grandparents (including their places of birth and self-described ancestry). Instead of labelling tissue samples for RNA or protein studies simply as 'tumour', 'tissue with active disease' or 'normal tissue', one should include not only the phenotypic and demographic information needed for DNA samples, but also information on the conditions under which the tissue was obtained. In some cases, it is important to note whether a specimen is obtained under fasting conditions or shortly after a meal, in the morning or in the evening, while the person has been at rest or after a period of activity. As one carefully controls for the 'background' of the DNA or tissue specimens, noise is greatly reduced and the resultant signals are more easily identified.
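The danger posed by uncontrolled population substructure can be made concrete with a small numerical sketch. The counts below are hypothetical and chosen purely for illustration: within each of two subpopulations, a marker allele is independent of disease status, yet pooling the groups produces an apparent association (Simpson's paradox).

```python
# Hypothetical counts illustrating population stratification: within each
# subpopulation the marker allele is independent of disease status, yet the
# pooled data show a spurious association.

def odds_ratio(carrier_cases, noncarrier_cases,
               carrier_controls, noncarrier_controls):
    """Odds ratio from a 2x2 table of allele-carrier counts."""
    return (carrier_cases * noncarrier_controls) / \
           (noncarrier_cases * carrier_controls)

# Subpopulation A: disease is common and the allele is common (60 per cent
# carriers among both cases and controls -- no association within the group).
or_a = odds_ratio(48, 32, 12, 8)

# Subpopulation B: disease is rare and the allele is rare (20 per cent
# carriers among both cases and controls -- again no association).
or_b = odds_ratio(4, 16, 16, 64)

# Pooling the two groups without accounting for substructure:
or_pooled = odds_ratio(48 + 4, 32 + 16, 12 + 16, 8 + 64)

print(f"Stratum A OR: {or_a:.2f}")      # 1.00
print(f"Stratum B OR: {or_b:.2f}")      # 1.00
print(f"Pooled OR:    {or_pooled:.2f}") # 2.79 -- spurious
```

Recording ancestry at the level of the four grandparents, as suggested above, is one practical way to define the strata needed to avoid this artefact; stratified analyses (for example, a Cochran-Mantel-Haenszel test) then combine evidence across the strata rather than pooling them blindly.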
Secondly, the accuracy of the data must be determined by periodic validation of the molecular methods and by inclusion of proper controls in the study. Whether one is performing DNA sequencing, genotyping of genetic markers, determining the global gene expression patterns of a tissue or profiling the protein content of a cell type, quality control must be done consistently throughout the course of a study. Knowing the degree of uncertainty in the data, and taking this into account during data analysis, enhances the power of the analysis and strengthens the conclusions derived from the results. Testing duplicate samples at various time points during the study, repeating a subset of experiments using a different platform and looking for consistency of the data based on family or other sample relationships are some of the ways one can determine the quality of the data.
Thirdly, it is essential to develop analytical tools that are appropriate for the data, so that they can be applied properly. In many cases, standard statistical methods cannot be used for genomic, gene expression or proteomic data obtained from a heterogeneous group of individuals. Because biological data are so complex, and one cannot control with great precision the conditions under which the samples are obtained, the assumptions of standard statistical tools regarding the data properties cannot be met. Without proper understanding of the conditions required for a statistical method to be valid, one can apply the wrong test for a dataset and obtain erroneous results. Biostatisticians are becoming more familiar with the explosion of genomic, gene expression and proteomic data being produced and, in due time, an appropriate set of analytical methods will be available for all to use.
Until these practices are standard in the field, one has to take the results and conclusions of large-scale studies with a healthy dose of scepticism. Journal reviewers must insist that those who want to publish the results of genomic, gene expression or proteomic studies address the questions of phenotypic, population and specimen heterogeneity, data quality and the rationale for their choice of analytical method for their data. When these issues are properly addressed, the quality of publications in our field will drastically improve, the number of studies showing conflicting results will decrease and the day when we will decipher the mysteries of the biological patterns contained in our genome and proteome will arrive much sooner.

P. Y. Kwok
Editorial Board Member
Human Genomics