How to manage privacy in a database used for biomedical research purposes under the GDPR

Guest post by Iker Ezkerra Elizalde from BIOEF

Our personal data are highly valuable, especially health-related data. They contain sensitive private information of which we are the legitimate owners, and they are also valuable to businesses and other organisations, which can use them to build tools that make their services and business activities more efficient.

These two factors make users wary when a business or other organisation wants to hold their personal data for biomedical research, whether or not this is done transparently.

Large tech companies and other organisations usually state, in the terms and conditions for using their services, that people’s data will be processed anonymously, but this tends to raise further questions for users. What exactly does this way of processing our data involve? How can we be sure that the data are really safe? What does the law say about this?

In Europe, the General Data Protection Regulation (GDPR) sets out a series of provisions covering proactive responsibility, obtaining consent after providing clear and understandable information, privacy by design and by default, risk assessment, impact assessment, and the application of safeguards such as anonymisation or pseudonymisation.

What is anonymisation and how should it be carried out?

An anonymisation process is, by definition, irreversible: it must not be possible to deduce a person’s identity from the data attached to a record that is supposed to be anonymous. In practice, however, data analysis techniques can often re-identify a person from supposedly anonymised data using only three or four pieces of information about them. Similarly, if a business or other organisation provides you with services, analyses your behaviour, and subsequently transforms the related information into disaggregated data for statistical or general behavioural analysis, it still knows who you are, and hence the data have not been anonymised.

As we have just seen, for a business or other organisation to ensure that our data will be processed anonymously, our identity must be completely unlinked from the information being processed. This is why the Spanish Data Protection Agency defines k-anonymity as a property of anonymised data that quantifies the extent to which the anonymity of data subjects is preserved in a data set from which identifiers have been deleted. It serves as a measure of the risk that external agents could obtain personal information from the anonymised data.

For example, if an anonymised data set containing “date of birth” and “postcode” includes the oldest person in each province, those records are easily identifiable within this theoretically de-identified data set.
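To make this concrete, here is a minimal sketch (a pandas example with invented records, not MIDAS data) of how the k of a data set can be measured: it is simply the size of the smallest group of records that share the same combination of quasi-identifier values.

```python
import pandas as pd

# Invented records, purely for illustration.
records = pd.DataFrame({
    "date_of_birth": ["1954-07-19", "1954-11-01", "1954-03-05",
                      "1987-05-23", "1987-09-30", "1987-01-12"],
    "postcode":      ["48001", "48003", "48015", "20018", "20009", "20014"],
})

def k_of(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    # k = size of the smallest group of records sharing the same
    # combination of quasi-identifier values.
    return int(df.groupby(quasi_identifiers).size().min())

# Every record is unique on these attributes, so k = 1: each person is
# trivially re-identifiable, like the oldest person in each province.
print(k_of(records, ["date_of_birth", "postcode"]))
```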

This problem can be addressed by k-anonymisation, which allows us to provide scientific guarantees that, in a modified version of person-specific structured data, the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful.
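Continuing the invented example above, a minimal sketch of the idea (the generalisation steps are assumptions chosen for illustration, not the ones used in MIDAS): generalising the date of birth to the year and the postcode to its two-digit province prefix merges previously unique records into larger groups, raising k.

```python
import pandas as pd

records = pd.DataFrame({
    "date_of_birth": ["1954-07-19", "1954-11-01", "1954-03-05",
                      "1987-05-23", "1987-09-30", "1987-01-12"],
    "postcode":      ["48001", "48003", "48015", "20018", "20009", "20014"],
})

def k_of(df, quasi_identifiers):
    return int(df.groupby(quasi_identifiers).size().min())

# Generalise the quasi-identifiers: keep only the year of birth and the
# two-digit province prefix of the postcode.
generalised = records.assign(
    date_of_birth=records["date_of_birth"].str[:4],  # "1954-07-19" -> "1954"
    postcode=records["postcode"].str[:2],            # "48001" -> "48"
)

print(k_of(records, ["date_of_birth", "postcode"]))      # 1 before generalisation
print(k_of(generalised, ["date_of_birth", "postcode"]))  # 3 after generalisation
```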

Our experience in the MIDAS project

In the MIDAS project, we have managed a database containing data on some 150 clinical variables concerning approximately 900,000 patients. After defining the variables, a risk assessment was performed: data such as name and record number were deleted, and entries were grouped by certain attributes such as low-prevalence conditions, address and assigned physician. Subsequently, we assessed the level of privacy achieved using the ARX Data Anonymization Tool. The results indicated that certain variables needed to be transformed to ensure a sufficiently deep anonymisation.
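Our assessment was carried out with ARX itself; purely to illustrate the kind of measure involved, the following pandas sketch (hypothetical column names, not the real MIDAS variables) computes the per-record re-identification risk under the prosecutor model, i.e. one divided by the size of the group of records sharing the same quasi-identifier values.

```python
import pandas as pd

def reidentification_risk(df: pd.DataFrame, quasi_identifiers: list[str]) -> pd.Series:
    # Prosecutor-model risk per record: 1 / (size of the group of records
    # that share this record's quasi-identifier values).
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return 1.0 / group_sizes

# Invented example: two patients share the same quasi-identifier values
# (risk 0.5); one is unique (risk 1.0) and would need further
# generalisation or suppression.
patients = pd.DataFrame({
    "birth_year":      ["1954", "1954", "1987"],
    "postcode_prefix": ["48", "48", "20"],
})
print(reidentification_risk(patients, ["birth_year", "postcode_prefix"]))
```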

The data were then transformed through k-anonymisation, following the recommendations of the Spanish Data Protection Agency.

The k-anonymisation method seeks to solve the following problem: “Given person-specific field-structured data, produce a modified version of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful”.

In this way, after applying various anonymisation techniques and k-anonymising the original data set (with k = 10), we obtained an acceptable level of risk for the type of data in question.
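The generalisation hierarchies and suppression were handled by ARX; as a purely illustrative sketch of the final check (again with hypothetical column names rather than the real MIDAS variables), the following drops any record whose quasi-identifier combination is shared by fewer than k = 10 rows, so that whatever remains satisfies k-anonymity.

```python
import pandas as pd

K = 10  # the k chosen for the MIDAS data set

def enforce_k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str], k: int = K) -> pd.DataFrame:
    # Count how many records share each combination of quasi-identifier
    # values, then suppress (drop) every record in a group smaller than k.
    sizes = df.groupby(quasi_identifiers).size().rename("group_size").reset_index()
    merged = df.merge(sizes, on=quasi_identifiers, how="left")
    return merged[merged["group_size"] >= k].drop(columns="group_size")

# Usage with hypothetical generalised variables:
# anonymised = enforce_k_anonymity(patients, ["birth_year", "postcode_prefix", "sex"])
```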

Conclusions

Until recently, deleting or masking identifying attributes may have seemed sufficient to ensure the anonymity of the subjects in a study, but we have seen that common fields in data from different sources, suitably grouped and cross-referenced, can become quasi-identifiers that compromise people’s privacy.

Given this, we conclude that the anonymisation process should go beyond the routine, passive application of commonly used rules. Applying the principle of proactive responsibility, the data controller should assess the risk of re-identification that remains after the anonymisation processes used, choosing the quasi-identifying attributes appropriately so as to reduce the chance that cross-referencing them with data from external sources poses a risk to the rights and freedoms of data subjects.

To achieve all this, during the phases of conceiving and designing personal data processing, we should analyse the data to determine, as accurately as possible, adequate ranges for generalisation and deletion, within reasonable limits that maintain the analytical quality of the data. At the same time, we should analyse and appropriately balance the risks to the rights and freedoms of individuals against the legitimate interests, including the societal benefits, of conducting the processing. Through these two analyses, we should reach a balance between the societal benefits of the data processing and its costs to the rights and freedoms of the data subjects.

There are various techniques for protecting personal data that seek to reduce the privacy threats associated with de-anonymisation. K-anonymisation is one of them: it focuses on preventing the re-identification of a specific person within a group, through the generalisation of quasi-identifying attributes and the deletion of outliers. Nonetheless, it does not guarantee that sensitive information about an individual who is known to be in the group cannot be inferred.
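As a closing illustration (invented data, a pandas sketch rather than any tool we used): a group can satisfy k-anonymity and still be homogeneous in a sensitive attribute, so that merely knowing someone belongs to the group reveals that attribute. Measuring the number of distinct sensitive values per group (distinct l-diversity) exposes the problem.

```python
import pandas as pd

# An invented group that is 3-anonymous on the quasi-identifiers but
# homogeneous in the sensitive attribute: anyone known to be in the group
# can be inferred to have the same diagnosis.
group = pd.DataFrame({
    "birth_year":      ["1954", "1954", "1954"],
    "postcode_prefix": ["48", "48", "48"],
    "diagnosis":       ["diabetes", "diabetes", "diabetes"],
})

def l_diversity(df: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> int:
    # Smallest number of distinct sensitive values in any group of records
    # sharing the same quasi-identifier values ("distinct" l-diversity).
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

print(l_diversity(group, ["birth_year", "postcode_prefix"], "diagnosis"))  # 1 -> inference possible
```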