Electronic medical records (EMRs) hold diverse clinical information about large populations. When this information is coupled with genetic data, it has the potential to reveal unprecedented associations between genes and diseases. The incorporation of these discoveries into healthcare practice offers the hope of improving healthcare through personalized treatments. However, the availability of such data for widespread research activities depends on the protection of subjects’ privacy. Current technological methods for privacy preservation are outdated and cannot protect genomic and longitudinal (EMR) data.
Access mechanisms and privacy
Data sharing mechanisms fall into two broad categories: open access and controlled access. While both have been widely used to regulate genomic data sharing, open-access datasets have been used in many more studies per year [29]. Open-access models operate either under a mandate from participants (who want to publish their genomic data on public platforms) or under the assumption that the shared data is de-identified and possibly aggregated [30]. However, multiple independent studies have shown that the risk of re-identification is real: it is possible to learn the identities of people who participate in research studies by matching their data with publicly available data [31]. In a recent study [32], the authors showed that they could infer the identities of 50 anonymous male subjects whose Y chromosomes had been sequenced as part of the 1000 Genomes Project. The researchers were able to discover not only the identities of these anonymized research participants but also those of their family members, using publicly available pedigrees. In response to this study, the NIH removed the age information from the project’s database. In other recent studies [33, 34], the authors reported that they could confirm whether a person participated in a genome-wide association study by using information from the person’s DNA sample, “even if the study reported only summary statistics on hundreds or thousands of participants” [31]. In response, the NIH shifted to a controlled-access mechanism. In fact, most human genome projects currently use controlled-access mechanisms.
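To make the summary-statistics attack more concrete, the following is a minimal sketch of one well-known membership statistic (the distance measure of Homer et al., 2008). It is an assumption that this resembles the approach in [33, 34]; the function name, the toy genotypes, and the flat reference frequencies are all hypothetical. For each SNP, the statistic compares how far a person's allele dosage is from a public reference frequency versus the frequency published by the study.

```python
from typing import Sequence


def membership_score(person: Sequence[float],
                     reference_freq: Sequence[float],
                     study_freq: Sequence[float]) -> float:
    """Sum over SNPs of |y - reference| - |y - study|.

    A positive score means the person's genotype sits closer to the
    published study frequencies than to the reference population,
    which hints at participation.
    """
    return sum(abs(y - r) - abs(y - s)
               for y, r, s in zip(person, reference_freq, study_freq))


# Toy data (hypothetical): allele dosage per SNP, coded 0, 0.5, or 1.
participant_a = [1.0, 0.0, 1.0, 0.0]
participant_b = [1.0, 0.0, 0.5, 0.5]
outsider = [0.0, 1.0, 0.0, 1.0]

# Summary statistics a study might publish: per-SNP mean over participants.
study_freq = [(a + b) / 2 for a, b in zip(participant_a, participant_b)]
# Assumed public reference panel with frequency 0.5 at every SNP.
reference_freq = [0.5, 0.5, 0.5, 0.5]

score_in = membership_score(participant_a, reference_freq, study_freq)
score_out = membership_score(outsider, reference_freq, study_freq)
print(score_in, score_out)
```

In this toy case the participant scores above zero and the outsider below zero. A realistic attack would need many thousands of SNPs and a significance test rather than a sign check.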
The personal information derived from genomic data (and EMR data) can be very damaging to the participants. It can be used against them to limit insurance coverage, to guide employment decisions, or to stigmatize them socially. In [35], the authors report on a case of genetic discrimination by a railroad company. The case occurred in 2002, when the company forced its employees to undergo a genetic test; employees who refused to participate in the test were threatened with disciplinary action. The company was later forced (in an out-of-court settlement) to compensate 36 of its employees. That is hardly a consolation: had such genetic data been obtained from online sources or through a breach, the company might have been able to get away with its discriminatory practices.
Regulations
In many countries, the use of sensitive human-subject data for research purposes has been studied extensively from a legal perspective. The resulting legislation aims to ensure that private information is properly used and adequately protected when disclosed for research purposes [36, 37]. Legislation such as the Common Rule [36], the Health Insurance Portability and Accountability Act (HIPAA) [38], and the EU Data Protection Directive [39] generally permits data sharing under one of the following guidelines:
G1. For the use of identifiable data, approval from an Institutional Review Board (IRB) is required. To approve data requests, IRBs require:
   a. Informed consent from the participants for the specific data use, or
   b. When consent is deemed impractical, IRBs can grant data access if the study accrues more benefit than risk. Such a decision requires a thorough and lengthy evaluation of each data access request by the IRB.
G2. For adequately de-identified data, researchers can be exempt from IRB approval. The adequacy of the de-identification is generally established by the IRB or by pre-approved policies such as the United States HIPAA privacy rule [37].
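The decision logic of the guidelines above can be summarized in a short sketch. This is only an illustration under stated assumptions: the class, the field names, and the fallback label are hypothetical, and in practice each boolean is the outcome of a human review rather than a stored flag.

```python
from dataclasses import dataclass


@dataclass
class DataRequest:
    # All fields are hypothetical stand-ins for the outcome of a review.
    adequately_deidentified: bool  # judged by the IRB or a pre-approved policy
    has_specific_consent: bool     # informed consent covers this data use
    consent_practical: bool        # whether obtaining consent is feasible
    benefit_outweighs_risk: bool   # result of the IRB's evaluation


def access_path(req: DataRequest) -> str:
    """Return which guideline, if any, would permit sharing."""
    if req.adequately_deidentified:
        return "G2: exempt from IRB approval"
    if req.has_specific_consent:
        return "G1.a: IRB approval based on informed consent"
    if not req.consent_practical and req.benefit_outweighs_risk:
        return "G1.b: IRB approval based on benefit-risk evaluation"
    # Assumption: no other sharing path exists under these guidelines.
    return "no sharing path"


print(access_path(DataRequest(False, False, False, True)))
```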
Guideline G2 depends on the availability of robust de-identification techniques. Because current techniques are outdated and unable to deal with genetic and EMR data (as evident from the privacy breaches cited earlier), G2 cannot be adopted. The Vanderbilt genome project is the only project we are aware of that was ruled by the Vanderbilt IRB to be “non-human subject data” because it was deemed to be properly de-identified. However, given the potential impact of the project on the community, guidelines adhering to G1.b were enforced.
Guideline G1.a requires informed consent from participants. The problem with this requirement is that data collectors have to forecast all possible uses of the data and create a comprehensive consent form detailing the benefits and risks of every use, which is not easily achievable. In fact, most biobanks collect consent in the form of opt-in/opt-out [19]. The challenges in implementing proper informed consent are discussed in depth later in this section.
Almost all existing biomedical data warehouses that house (non-aggregate) genetic data coupled with EMR data follow guideline G1.b. These warehouses lightly de-identify their data and regulate investigators’ access to the data through an IRB [18, 19, 40]. Only researchers whose studies involve less risk than benefit are allowed access to the requested data, and only after they pass a thorough identity check. However, IRB procedures are extensive and can obstruct timely research and discoveries [41, 42, 43]. Studies of platforms that rely on an IRB for all data accesses reveal unsatisfied users: the application process is strenuous, and approvals take a long time, often delaying project initiation significantly [43, 44].
In Qatar, for example, access to locally collected biomedical data is governed by the QSCH “guidelines, regulations and policies for research involving human subjects”, which adhere to guideline G1.b. A recently formed IRB will regulate all access to the research data and services by research institutes both within and outside Qatar.
With such massive mandates, a principal feature for IRBs is to have the capacity to foster timely research and discoveries. Data application processes and approvals should be smooth and should not delay project initiation significantly. Thus, the traditional “IRB-based” data sharing will produce unsatisfied users.
Methods under investigation
The inadequacy of current de-identification methods and the delays in IRB processes prompted privacy experts to seek new solutions. Rapid progress is taking place in privacy research in the biomedical area, driven by the need to protect and benefit from the large biomedical data warehouses being built worldwide. The novel methods can be divided into two main categories, legislative and technical:
(i) Legislative: Legislative methods define privacy rights and responsibilities. Research in this area aims to understand and define individuals’ privacy perspectives and expectations and to update the policies and laws that govern data sharing. Genetic data introduces a difficult and unique regulatory situation (with respect to data collection laws and data sharing laws) that is not found with other types of health data [16]. So, until effective privacy protection solutions are made into law, scientists and civil rights advocates are calling for the adoption of anti-genetic-discrimination laws to mitigate the effect of genetic data breaches. An example is the Genetic Information Nondiscrimination Act (GINA), adopted by the US government in 2008. GINA forbids discrimination by insurers or employers on the basis of genetic information. The problem with such regulations is that they are enforced only when discrimination on the basis of genetic information is proven, which necessitates the difficult task of proving malicious intent.
(ii) Technical: Technical controls aim to create data sharing systems and methods that fulfill the requirements specified in privacy legislation. Current technical approaches to privacy, such as de-identification, are not effective in the genomic context: the genome is itself an identifier and as such cannot (yet) be de-identified while retaining its utility. Innovative methods are therefore needed to deal with our new data realities. We classify current research in privacy-preserving mechanisms into three categories: process-driven mechanisms, risk-aware systems, and consent-based systems. In process-driven mechanisms, such as differential privacy and cryptographic techniques, the dataset is held by a trusted server, users query the data through the server, and privacy is built into the algorithms that access the data. Risk-aware systems aim to speed up IRB processes through partial or full automation, and consent-based systems aim to empower participants by allowing them to control how and by whom their data can be used. The latter is being done through the introduction of novel dynamic consent mechanisms.
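As an illustration of the process-driven idea, where a trusted server answers queries and privacy is built into the answering algorithm, here is a minimal sketch of a counting query protected with the Laplace mechanism from differential privacy. The class name, the record layout, and the choice of epsilon are hypothetical, and the sketch does not track a privacy budget across queries, which a real deployment would need to do.

```python
import random
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, bool]


class TrustedServer:
    """Holds the raw records and releases only noisy counts (toy sketch)."""

    def __init__(self, records: Iterable[Record],
                 rng: Optional[random.Random] = None) -> None:
        self._records = list(records)
        # Note: `random` is not a cryptographic generator; this is a demo.
        self._rng = rng if rng is not None else random.Random()

    def noisy_count(self, predicate: Callable[[Record], bool],
                    epsilon: float) -> float:
        """Answer a counting query with Laplace noise of scale 1/epsilon.

        Adding or removing one record changes a count by at most 1,
        so the query has sensitivity 1.
        """
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        true_count = sum(1 for r in self._records if predicate(r))
        # The difference of two exponentials with rate epsilon is a
        # Laplace sample with scale 1/epsilon.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Hypothetical records: whether each subject carries a given variant.
server = TrustedServer([{"carrier": True}] * 40 + [{"carrier": False}] * 60)
print(server.noisy_count(lambda r: r["carrier"], epsilon=0.5))
```

Smaller values of epsilon add more noise and give stronger privacy at the cost of accuracy; the researcher never sees the underlying records.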
In what follows, we briefly describe recent efforts within each of the three technical categories.
Dynamic consent
Consent-based mechanisms provide data subjects with control over who can access their stored data/specimens, for what purposes, and for how long. Thus, a researcher requesting access to data will receive the data records for which the consent is fulfilled.
The current (mostly paper-based) consent process is static and locks consent information to a single time point (typically during sample collection) [45], requiring all future data uses to be specified at the time of initial consent. This is not feasible with current (multi-purpose and evolving) biomedical data warehouses. The current process also has to limit the amount of information conveyed to participants (i.e., the educational program) so that their consent remains informed, since individuals can only absorb limited information at any one time. Re-contacting participants to obtain additional consent and/or to provide additional educational material is arduous, time-consuming, and expensive. Moreover, it can have a negative impact on the participants and on the enterprise.
Active research is underway to overcome this problem. It attempts to make consent dynamic, so that participants and data holders can more easily provide and update consent information on a continuing basis. The authors of [46] are working on ways to represent and manage consent information. They focus on defining the different dimensions of a consent. Such dimensions include (i) the characteristics of the institutions that can access the patient’s data, (ii) the level of detail that each institution can access, and (iii) the type of research allowed on the data (all possible uses of the data). The authors’ approach is to codify the different consent dimensions. The benefit of the codification “is to provide a common language to capture consented uses of data and specimens” and to “select those data for the investigator’s study that are compliant with the subjects’ consented uses and the investigator’s permissions.” Thus, given a particular study, the characteristics of the study could be matched against the subjects’ codified consent to determine the data subset that conforms. In [47, 48], the authors discuss several challenges in designing dynamic consent, in particular participants’ consent withdrawal and its implications. It is worth noting that some commercial sequencing companies, such as 23andMe [49], already provide a limited form of dynamic consent through secure online portals. Such systems allow users to fill in and change their consent information at will.
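The matching step described above, where a study's characteristics are compared against each subject's codified consent, might look roughly like the following sketch. The encoding is an assumption made for illustration and is not the codification proposed in [46]: the three fields mirror the three consent dimensions, and the ordering of detail levels is hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Hypothetical ordering, from least to most detailed data.
DETAIL_LEVELS = ("aggregate", "deidentified", "identifiable")


@dataclass(frozen=True)
class Consent:
    subject_id: str
    institution_types: FrozenSet[str]  # (i) who may access the data
    max_detail: str                    # (ii) most detailed level allowed
    research_types: FrozenSet[str]     # (iii) permitted kinds of research


@dataclass(frozen=True)
class Study:
    institution_type: str
    detail_needed: str
    research_type: str


def permits(consent: Consent, study: Study) -> bool:
    """True if the study satisfies every dimension of the consent."""
    return (study.institution_type in consent.institution_types
            and DETAIL_LEVELS.index(study.detail_needed)
            <= DETAIL_LEVELS.index(consent.max_detail)
            and study.research_type in consent.research_types)


def eligible_subjects(consents: Iterable[Consent], study: Study) -> List[str]:
    """Select the subjects whose codified consent covers the study."""
    return [c.subject_id for c in consents if permits(c, study)]


consents = [
    Consent("s1", frozenset({"academic"}), "deidentified",
            frozenset({"cancer", "diabetes"})),
    Consent("s2", frozenset({"academic", "commercial"}), "aggregate",
            frozenset({"cancer"})),
]
study = Study("academic", "deidentified", "cancer")
print(eligible_subjects(consents, study))
```

Because consent is stored as data rather than on paper, a participant's update changes the result of the next query without any re-contact, which is the point of making consent dynamic.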
Additional aspects that need to be resolved are consent withdrawal, continuous participant education, and the cultural aspect of the consent:
- Consent withdrawal: The ability to withdraw is an essential motivator for research participation; thus, research participants must be allowed to withdraw their participation at any time without penalty. However, withdrawal is complicated by the fact that participants’ samples and data may already have been shared with other research organizations. Current best practices recommend that any leftover specimens be discarded and that medical data no longer be updated or used, but that shared samples and data do not necessarily need to be revoked [50]. It is important for the consent process to highlight these issues and to make sure that participants understand the limitations of consent withdrawal. Additionally, the different forms of withdrawal should be investigated further to understand their impact on the willingness to participate and to update best practices accordingly.
- Continuous participant education: Biomedical science is complex and evolving very fast, which warrants continuous participant education.
- Cultural aspect: The purpose of informed consent is to give individuals the right of self-determination based on a complete understanding of the risks and benefits of research participation, without any interference or control by others. However, the right of self-determination is deeply affected by culture (some communities value the relationship with family members and turn to them for support when making critical decisions), and thus consent should be adapted to the specifics of the underlying culture in terms of information sharing and disclosure [51].
Risk-aware access control
The risk of granting data access to a user depends on the characteristics of the request. For example, as stated in [52], “access to highly sensitive data at the data-holder’s location by a trusted user is inherently less risky than providing the same user with a copy of the dataset. Similarly, access to de-identified clinical data from a secure remote system is inherently less risky than access to identifiable data from an unknown location.” Risk-aware access control tries to quantify the risk posed by a data request and to apply mitigation measures on the data to counter the posed risk.
Risk-aware access control has received growing attention in the past few years. Several studies have attempted to quantify or model privacy risk, from both the participants’ perspective and the data holder’s perspective. In [53], Adams attempts to model users’ perceptions of privacy in multimedia environments. He identifies three factors that determine users’ perceptions of privacy: information sensitivity (the user’s perception of the sensitivity of the released information), information receiver (the level of trust the user has in the information recipient(s)), and information usage (the costs and benefits of the perceived usages). Lederer [54] uses Adams’ model as a framework for conceptualizing privacy in ubiquitous computing environments, in addition to the Lessig model [55] for conceptualizing the influence of societal forces on the understanding of privacy. These efforts concentrate on privacy quantification from the participant’s perspective rather than the data holder’s.
Barker et al. [56] introduce a four-dimensional model for privacy: purpose (data uses), visibility (who will access the data), granularity (data specificity), and retention (the time data is kept in storage). Barker et al.’s model was later used by Banerjee et al. [57] to quantify privacy violations. Along the same lines, and in multiple consecutive studies [58, 59], El Emam et al. defined three criteria that contribute to privacy risk: users’ motives, the sensitivity of the requested data, and the security controls employed by the data requestor. The authors state that, according to their long experience in private data sharing [58, 60, 61, 62], these are the main criteria used (informally) by data holders.
Recently, in [52, 63], the authors defined a conceptual risk-based access model for a biomedical data warehouse; the model defines the risk posed by data requests using four dimensions:
1. Data sensitivity, or the extent of privacy invasion that would result from inappropriate disclosure of the requested data,
2. Access purpose, or the uses for which the data was requested,
3. Location of the investigator’s institution, which is critical for checking the privacy legislation (if any) that applies at the data requester’s end and whether the same laws are enforceable, and
4. User risk, which measures:
   a. The ability of the user’s institution (the research institution with which the user is affiliated) to secure the data. This is evaluated by looking at the privacy practices followed and enforced within its headquarters, and
   b. The risk associated with the particular user/requestor, measured by tracking whether the user has caused any past inconveniences.
Once calculated, the risk is fed into an access control decision module. The decision module imposes mitigation measures to counter the posed risk. The defined data-sharing mechanism would impose more mitigation measures on requests of higher sensitivity. The mitigations could manifest as reductions in the granularity of the data (de-identification) and/or as restrictions on when and how a user can access the data. The implementation of this model still requires significant efforts toward (i) assigning sensitivities to the different data attributes, (ii) assigning a score to institutions’ privacy and security practices (such as certifications), and (iii) creating universal user records for storing data breach information.
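The flow from risk calculation to mitigation can be sketched as follows. This is a toy illustration only: the model in [52, 63] is conceptual, so the equal weighting, the numeric thresholds, and the mitigation labels below are all hypothetical assumptions, and each dimension is assumed to have already been scored on a 0 to 1 scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRisk:
    # Each dimension is assumed to be pre-scored in [0, 1]; higher is riskier.
    data_sensitivity: float
    purpose_risk: float
    location_risk: float
    user_risk: float

    def score(self) -> float:
        """Combine the four dimensions with equal weights (an assumption)."""
        parts = (self.data_sensitivity, self.purpose_risk,
                 self.location_risk, self.user_risk)
        for p in parts:
            if not 0.0 <= p <= 1.0:
                raise ValueError("each dimension must lie in [0, 1]")
        return sum(parts) / len(parts)


def mitigations(risk: RequestRisk) -> List[str]:
    """Decision module: riskier requests receive more mitigation measures."""
    s = risk.score()
    if s < 0.3:    # hypothetical threshold
        return []
    if s < 0.6:    # hypothetical threshold
        return ["reduce data granularity"]
    return ["reduce data granularity", "restrict access to on-site use"]


print(mitigations(RequestRisk(0.5, 0.5, 0.5, 0.5)))
```

The two mitigation labels correspond to the two kinds of measures named above: reductions in data granularity and restrictions on when and how the data can be accessed.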
The issue of assigning sensitivity to data attributes is gaining more consideration. In [64], the authors define a method to detect privacy-sensitive DNA segments in an input stream. In [65], the authors present a privacy test to distinguish degrees of sensitivity within different attributes recognized as sensitive.
Secure multiparty computation
Secure multiparty computations (SMCs) are an attractive approach that allows a researcher to run a function on data owned by multiple parties (each holding a fraction of the data to be analyzed). The calculation is carried out on the overall dataset without any party having to reveal any of its own raw data. Such a scenario can be particularly useful for cross-institutional (or even cross-country) studies, especially when no single site has enough data to conduct the study in question (for example, studies on rare diseases).
Figure 3 illustrates the SMC concept. In the figure, a researcher wants to run a computation f over the private inputs of three remote databases (data1, data2, data3) while keeping these inputs private. The different parties are allowed to exchange messages with each other and with the researcher. However, such messages are encrypted so as to prevent the different parties from learning any private information through interaction.
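A minimal sketch of the idea, for the simplest possible function f (a sum), is given below. It uses additive secret sharing rather than encrypted messages, which is one of several SMC building blocks; the modulus, the function names, and the three example counts are hypothetical. All parties are simulated in one process, and the sketch assumes parties that follow the protocol and do not collude.

```python
import secrets
from typing import List

# Public modulus agreed on by all parties; it must exceed the true sum.
MODULUS = 2**61 - 1


def share(value: int, n_parties: int) -> List[int]:
    """Split a private value into n random shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs: List[int]) -> int:
    """Compute the sum of all inputs without any party revealing its own."""
    n = len(private_inputs)
    # distributed[i][j] is the share that party i sends to party j.
    distributed = [share(v, n) for v in private_inputs]
    # Each party j adds up the shares it received and publishes only
    # that partial sum, which looks random on its own.
    partials = [sum(distributed[i][j] for i in range(n)) % MODULUS
                for j in range(n)]
    # The researcher combines the published partial sums.
    return sum(partials) % MODULUS


# Hypothetical case counts held by data1, data2, and data3.
print(secure_sum([12, 7, 30]))
```

Each individual share is uniformly random, so no single message reveals a site's count, yet the shares cancel out to give the exact total.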
SMC is gaining more popularity in the biomedical domain. SMCs are supported by robust mathematical proofs demonstrating their ability to securely protect privacy and thus proving their ability to support data sharing without fear of privacy abuse. In [66, 67], the authors designed a secure linear regression using homomorphic encryption for a multi-hospital quality improvement study. In [68], a secure genome-wide association study (GWAS) was designed using homomorphic encryption, and in [69], a GWAS protocol was designed using secret sharing. In [70], the authors use garbled circuits to perform metagenomics analysis.
In general, protocols for secure computation have achieved outstanding results; it has been shown that any function (no matter how complex) can be computed securely. Efficiency, however, is the major drawback of these computations; they are much more complex than regular protocols (which do not provide any security) [71]. The complexity is driven by the extensive message passing between the involved parties as well as by the cryptographic functions employed. Recently, the authors of [72] presented a fast and secure computation for linear regression over distributed data based on secure matrix multiplication, and the authors of [73] designed another efficient secure multiparty linear regression protocol based on mathematical results in estimation theory. It remains to be seen whether these methods are generalizable to other estimators.
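One reason linear regression lends itself to efficient secure protocols is that the least-squares estimate depends on the data only through sums that decompose across sites. The sketch below illustrates this decomposition for a single predictor; it is not the protocol of [72] or [73], the function names are hypothetical, and the plain addition in `combine` stands in for the step that a real protocol would carry out securely.

```python
from typing import List, Sequence, Tuple

# Per-site summary: n, sum of x, sum of y, sum of x*x, sum of x*y.
Stats = Tuple[float, float, float, float, float]


def local_stats(xs: Sequence[float], ys: Sequence[float]) -> Stats:
    """Computed privately at each site; raw records never leave the site."""
    return (float(len(xs)), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys)))


def combine(stats_list: List[Stats]) -> Stats:
    """Element-wise sum; a real protocol would aggregate this securely."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*stats_list))
    return (n, sx, sy, sxx, sxy)


def fit_line(stats: Stats) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two hypothetical sites whose pooled data lie on y = 2x + 1.
site1 = local_stats([0, 1], [1, 3])
site2 = local_stats([2, 3], [5, 7])
print(fit_line(combine([site1, site2])))
```

The pooled fit is identical to what a single analyst would obtain with access to all records, which is why the cost of a secure version is dominated by the aggregation step rather than by the regression itself.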