Genomic and biosocial research data about humans continue to proliferate, bringing with them questions about who should be able to store, hold and use which data and samples, for which purposes and with what safeguards. The research data landscape is changing with new forms of research data becoming available (e.g. next-generation sequencing, epigenetic and other ‘omics’ data [1, 2]) and existing data being increasingly accessible for research (e.g. via linkage of administrative, environmental or healthcare data to research data [3]). This complexity will only increase as new informatics technologies that enable citizen-generated data (e.g. social media, direct-to-consumer genetic testing and wearable sensors) become available to combine with research resources in novel ways.
Open science and drivers for data sharing
Expectations surrounding how research data should be shared are based on international commitments to using, and re-using, publicly funded research and its outputs for the public good [4]. Data sharing is also supported by research funding agencies internationally through their policies on open science (https://www.nwo.nl/en/policies/open+science/data+management, http://www.allianzinitiative.de/en/archive/research-data/principles.html, [5,6,7,8,9,10]). Scientific benefits of data sharing are seen to include verification, replication and the ability to pool analyses, as well as potential cost savings [9,10,11,12,13]. Some of the drive for shared data use comes from patients and publics themselves: already, many rare disease patients advocate for greater sharing of genomic and clinical data; the quantified-self movement promotes and facilitates sharing of data from wearables [14, 15]; and citizen scientists demand access to data from clinical and other randomised controlled trials and have filed legal claims to realise those demands [16]. International policy positions research data and samples as a public good which can only be fully realised by their wide and appropriate use [17], though some types of research data, such as those generated by commercial companies (e.g. pharmaceutical companies), sit outside this definition. Indeed, some have gone even further to argue that not sharing research data is a breach of participant, patient and public rights and is itself a harm [18]. While the open science movement advocates unfettered access to research practices and the data produced by them, when those data are individual-level human data, access and re-use must be managed within relevant legal and ethical requirements, taking account of specific agreements together with the reasonable expectations of the research participants who provided those data and samples.
These requirements, agreements and expectations are embodied in the consents that study participants, or their guardians if they are minors, give at the outset of the study and, often, for new collections or sub-studies.
Data sharing, longitudinal studies and the consent process
To date, the majority of genomic and biosocial research data made available for sharing in the UK have been derived originally from publicly funded research studies: longitudinal studies [19,20,21] (e.g. birth cohorts, household panel studies, other population and disease-based biobanks, clinical trials); survey and other social science datasets; and case-control studies, including case series collected/generated for research purposes and typically compared to a research-generated control group (e.g. Wellcome Trust Case Control Consortium [22]).
In longitudinal studies, the focus of this paper, it is not possible to foresee all possible research uses at the outset of a study. Data and samples may be used long after their collection, sometimes decades afterwards. For studies whose raison d’être is the provision of a scientific resource for which public good is the intended outcome and impact, renewed consent for each and every new research use would not only be unwieldy but would impede that aim. Instead, broad consent (within specific stipulations reflecting contemporary social values and scientific practice) is sought for future uses of the data and samples collected. In the UK, the Human Tissue Act 2004 (https://www.legislation.gov.uk/ukpga/2004/30/contents) and the Health Research Authority prescribe the principles and contents of the information given to study participants in consent and participant information documentation for different forms of data and samples [23, 24].
But consent is a process, one which does not finish with the receipt of information and the signing of a form. Ethical approval for studies using broad consent includes mechanisms to ensure that those consents are respected and the expectations inherent in them are maintained, for example, explicitly stating which bodies can approve data and sample access. Appropriate, responsible data management and governance of the uses of those data and samples are foundational to this ongoing consent process in longitudinal studies and an important source of their trustworthiness in the eyes of individuals who have contributed data and continue to permit its storage and use. While trust is crucial for ensuring public support for research generally, for longitudinal studies, this requirement is particularly acute as ongoing research participation is vital to the longevity, productivity and impact of those studies. Research participants quite reasonably expect that their privacy (including confidentiality) will be protected and that uses of the data and samples they contribute will fall within agreed and anticipated parameters.
Data Access Committees
While Research Ethics Committees (RECs, also called Institutional Review Boards or Research Ethics Boards) interpret the ethical, legal and social frameworks within which studies are conducted and determine how data and samples may be used, data access permissions in the UK are governed by a hierarchy of regulatory processes and bodies depending on the data’s sensitivity and potential disclosivity (the ease and precision with which individuals may be identified). Light-touch administrative processes manage low-dimensional data with a low risk of disclosure or other participant harms. Data Access Committees (DACs) manage more complex or higher-risk data access. Key assessment criteria for determining the release, and access route, of such data include (1) consistency with the original ethical approval, including consents, information materials and the stated aim of the data collection, together with the risk of identifying individual participants; (2) issues that may directly concern individual participants, such as the risk of identifying a previously unknown disease or strong disease determinant which may warrant clinical action, or the risk of bringing a long-standing study into disrepute; and (3) the risk of unreasonably damaging the intellectual property of the project, or otherwise undermining the effort invested by its investigators or data custodians (often more than one generation of researchers).
DACs may consider any risks which may impact the individual participant or otherwise breach their expectations, thereby provoking them to withdraw from a study. In addition to the above criteria, some DACs assess the quality of the science or potential public benefit for all applications, and nearly all committees assess the quality and potential benefit of the science when there is a request for use of a finite resource (e.g. blood samples). Across the research/scientific community in the genomic, health and social sciences, there is a strong consensus that the application of all such criteria (when applied in a proportionate and transparent manner) is good for society and ‘good for science’ [25].
Data access governance in the UK
In the UK, access to data and samples—the ‘resources’ of longitudinal and other research studies—is operationalised by a networked series of independent data repositories and independent data governance infrastructures, though some longitudinal studies successfully operate in-house or partially in-house governance and data issue mechanisms. Governance of access permissions (ethics and policy oversight) and governance of data issue (technical governance) may be managed jointly or by separate infrastructures. Each of these governance infrastructures is designed to ethically, legally, efficiently and securely manage access (permissions and/or distribution) to research resources of varied levels of complexity and sensitivity. For many UK longitudinal studies, data can be accessed online by bona fide researchers and are governed by end-user licences. Access decisions may also be made based on algorithms and straightforward rules for legitimate access. These are generally rapid mechanisms for access, notwithstanding any additional time required for online training upon registration to a data repository (e.g. UKDA training in data management [24]) or additional administrative processes for managing cost recovery. The data released by these mechanisms are those for which there is deemed to be a low risk of individual disclosure and which are considered not to raise additional ethical issues. Some data will only ever be made available in secure privacy-protected settings—where researchers travel to the data location (e.g. secure air-gapped data centres) or send the analysis to the data (e.g. DataSHIELD) and receive back only the analytical outcomes [26]. Use of finite resources such as blood, urine or other biological samples will always require additional oversight for legal (e.g. Human Tissue Act 2004 requirements), ethical and scientific reasons and because each use of samples necessarily precludes future uses of those samples and therefore must be judged more carefully.
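The rule-based route described above, in which low-risk data are released under an end-user licence and everything else receives further oversight, could be sketched as follows. This is an illustrative sketch only, not a description of any repository's actual system: the `AccessRequest` fields, the two-level risk scale and the route names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    """Hypothetical summary of a data or sample access application."""
    disclosure_risk: str         # "low" or "high" (assumed two-level scale)
    raises_ethical_issues: bool  # e.g. potential incidental findings
    uses_finite_samples: bool    # e.g. blood or urine samples


def triage(request: AccessRequest) -> str:
    """Return an assumed access route for a request.

    Only requests that are low risk on every dimension are routed to the
    rapid licence-based mechanism; all others go to committee review.
    """
    # Finite resources always need additional oversight.
    if request.uses_finite_samples:
        return "committee_review"
    # Disclosive data or additional ethical issues also need oversight.
    if request.disclosure_risk != "low" or request.raises_ethical_issues:
        return "committee_review"
    return "end_user_licence"


# Example: a request for low-risk survey data with no samples involved.
route = triage(AccessRequest("low", False, False))
```

The sketch only decides which route a request takes. It does not attempt to make the access decision itself for the higher-risk cases.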
Classes of research data which are potentially sensitive, insofar as they are disclosive and/or raise particular ethical concerns (e.g. incidental/secondary findings), require additional access oversight. Beyond the character of the data themselves, some research uses of data are also potentially controversial and may thereby raise ethical issues and potential harms for research participants themselves, or for the longitudinal studies of which they are a part. Some types of research are particularly likely to raise issues of potential participant and/or study harms, e.g. those involving potentially stigmatising issues, such as mental health, sexuality, criminality and certain diseases, or those in which researchers are perceived to make a commercial gain. Particular forms or combinations of data also raise risks of disclosure. That individuals can be identified within a genetic dataset has been demonstrated methodologically [27, 28], but only where there is a reference sample/sequence available [29]. For most practical applications identification is unlikely, though this may change with increasing availability of commercial genotyping. Identification of individuals via phenotype data, though less certain than genetic identification, is easier to enact if large numbers or certain classes of variables are combined in analyses (e.g. place of residence if defined too narrowly, minority ethnicity, disability, rare disease status) or, as in longitudinal studies, repeated measures are available. Research questions which further combine genotypic and phenotypic data and/or samples may increase the potential for disclosure of individual identity. Whether the important issue for disclosure is one of identifying the individual per se (i.e. that a particular individual belongs to a longitudinal study) or whether it is about identifying the individual and information about them is an empirical question that requires further examination.
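The way combining phenotype variables raises disclosure risk can be illustrated with a small sketch. It counts how many records share each combination of values and reports the smallest group, a common informal indicator of how easily someone could be singled out. The records, variable names and function name below are invented for illustration and do not come from any study.

```python
from collections import Counter


def smallest_group_size(records, variables):
    """Size of the smallest group of records sharing values on `variables`."""
    counts = Counter(tuple(r[v] for v in variables) for r in records)
    return min(counts.values())


# Hypothetical toy dataset of eight participants.
records = [
    {"region": "North", "ethnicity": "A", "rare_disease": False},
    {"region": "North", "ethnicity": "A", "rare_disease": False},
    {"region": "North", "ethnicity": "B", "rare_disease": False},
    {"region": "North", "ethnicity": "B", "rare_disease": False},
    {"region": "South", "ethnicity": "A", "rare_disease": False},
    {"region": "South", "ethnicity": "A", "rare_disease": True},
    {"region": "South", "ethnicity": "B", "rare_disease": False},
    {"region": "South", "ethnicity": "B", "rare_disease": False},
]

one_var = smallest_group_size(records, ["region"])                 # 4
two_vars = smallest_group_size(records, ["region", "ethnicity"])   # 2
three_vars = smallest_group_size(
    records, ["region", "ethnicity", "rare_disease"]
)                                                                  # 1
```

In this toy example each added variable shrinks the smallest group, until one participant is unique on the combination of region, ethnicity and rare disease status.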
Where there are higher risks of disclosure or the proposed research raises other potential harms, online registration and end-user licensing are not sufficient safeguards. Decisions about complex societal values are ultimately not amenable to a simple algorithmic approach, whether derived heuristically or, for example, via machine learning. We argue that a human-mediated review and decision-making process is then required. Moreover, we argue that study participants must be central to this decision-making, seeing through to its conclusion the ongoing consent dialogue initiated in consent and participant information procedures and documents.