DataSHIELD: taking the analysis to the data, not the data to the analysis

Amadou Gaye, Yannick Marcon, Julia Isaeva, Philippe LaFlamme, Andrew Turner, Elinor M Jones, Joel Minion, Andrew W Boyd, Christopher J Newby, Marja-Liisa Nuotio, Rebecca Wilson, Oliver Butters, Barnaby Murtagh, Ipek Demir, Dany Doiron, Lisette Giepmans, Susan E Wallace, Isabelle Budin-Ljøsne, Carsten Oliver Schmidt, Paolo Boffetta, Mathieu Boniol, Maria Bota, Kim W Carter, Nick deKlerk, Chris Dibben, Richard W Francis, Tero Hiekkalinna, Kristian Hveem, Kirsti Kvaløy, Sean Millar, Ivan J Perry, Annette Peters, Catherine M Phillips, Frank Popham, Gillian Raab, Eva Reischl, Nuala Sheehan, Melanie Waldenberger, Markus Perola, Edwin van den Heuvel, John Macleod, Bartha M Knoppers, Ronald P Stolk, Isabel Fortier, Jennifer R Harris, Bruce HR Woffenbuttel, Madeleine J Murtagh, Vincent Ferretti, Paul R Burton

Research output: Contribution to journal, Article

50 Citations (Scopus)

Abstract

Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data. Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC. Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach. 
DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property: the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
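
The AC/DC pattern described in the abstract can be illustrated with a minimal sketch. This is not DataSHIELD code (the actual implementation is based on R and Opal); it is a hypothetical Python illustration in which each data computer keeps its individual-level values private and returns only aggregate sums, which the analysis computer combines into a pooled mean and variance. The class and function names, and the `MIN_ROWS` threshold used as a simple disclosure check, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical disclosure threshold for this sketch; not a DataSHIELD setting.
MIN_ROWS = 5


@dataclass(frozen=True)
class Summary:
    """Non-disclosive aggregate returned by one data computer (DC)."""
    n: int
    total: float
    total_sq: float


class DataComputer:
    """Holds individual-level data locally and only ever returns aggregates."""

    def __init__(self, values: Sequence[float]) -> None:
        self._values = list(values)  # individual-level data stays here

    def summarise(self) -> Summary:
        # Refuse to answer when the aggregate could identify individuals.
        if len(self._values) < MIN_ROWS:
            raise PermissionError("too few records to summarise safely")
        return Summary(
            n=len(self._values),
            total=sum(self._values),
            total_sq=sum(v * v for v in self._values),
        )


def pooled_mean_and_variance(dcs: List[DataComputer]) -> Tuple[float, float]:
    """Analysis computer (AC): combine per-DC summaries into pooled estimates."""
    summaries = [dc.summarise() for dc in dcs]
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


if __name__ == "__main__":
    study_a = DataComputer([1.0, 2.0, 3.0, 4.0, 5.0])
    study_b = DataComputer([6.0, 7.0, 8.0, 9.0, 10.0])
    print(pooled_mean_and_variance([study_a, study_b]))
```

The pooled result equals what a central analysis of the combined rows would give, yet only three numbers per study cross the network. Iterative procedures such as fitting a generalized linear model would repeat this exchange once per iteration.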
Language: English
Pages: 1929-1944
Number of pages: 16
Journal: International Journal of Epidemiology
Volume: 43
Issue number: 6
Early online date: 26 Sep 2014
DOIs: https://doi.org/10.1093/ije/dyu188
Publication status: Published - 1 Dec 2014

Fingerprint

Intellectual Property
Research Personnel
Research
Databases
Social Sciences
Privacy
Confidentiality
Sample Size
Developing Countries
Delivery of Health Care
Datasets

Keywords

  • biomedical research
  • computational biology
  • computer security
  • confidentiality
  • databases, factual
  • datasets as topic
  • Great Britain
  • humans
  • information storage
  • information retrieval

Cite this

Gaye, A., Marcon, Y., Isaeva, J., LaFlamme, P., Turner, A., Jones, E. M., ... Burton, P. R. (2014). DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology, 43(6), 1929-1944. https://doi.org/10.1093/ije/dyu188
Gaye, Amadou ; Marcon, Yannick ; Isaeva, Julia ; LaFlamme, Philippe ; Turner, Andrew ; Jones, Elinor M ; Minion, Joel ; Boyd, Andrew W ; Newby, Christopher J ; Nuotio, Marja-Liisa ; Wilson, Rebecca ; Butters, Oliver ; Murtagh, Barnaby ; Demir, Ipek ; Doiron, Dany ; Giepmans, Lisette ; Wallace, Susan E ; Budin-Ljøsne, Isabelle ; Oliver Schmidt, Carsten ; Boffetta, Paolo ; Boniol, Mathieu ; Bota, Maria ; Carter, Kim W ; deKlerk, Nick ; Dibben, Chris ; Francis, Richard W ; Hiekkalinna, Tero ; Hveem, Kristian ; Kvaløy, Kirsti ; Millar, Sean ; Perry, Ivan J ; Peters, Annette ; Phillips, Catherine M ; Popham, Frank ; Raab, Gillian ; Reischl, Eva ; Sheehan, Nuala ; Waldenberger, Melanie ; Perola, Markus ; van den Heuvel, Edwin ; Macleod, John ; Knoppers, Bartha M ; Stolk, Ronald P ; Fortier, Isabel ; Harris, Jennifer R ; Woffenbuttel, Bruce HR ; Murtagh, Madeleine J ; Ferretti, Vincent ; Burton, Paul R. / DataSHIELD : taking the analysis to the data, not the data to the analysis. In: International Journal of Epidemiology. 2014 ; Vol. 43, No. 6. pp. 1929-1944.
@article{b3c135929d9d4bb08e467da68f9c6eae,
title = "DataSHIELD: taking the analysis to the data, not the data to the analysis",
keywords = "biomedical research, computational biology, computer security, confidentiality, databases, factual, datasets as topic, Great Britain, humans, information storage, information retrieval",
author = "Amadou Gaye and Yannick Marcon and Julia Isaeva and Philippe LaFlamme and Andrew Turner and Jones, {Elinor M} and Joel Minion and Boyd, {Andrew W} and Newby, {Christopher J} and Marja-Liisa Nuotio and Rebecca Wilson and Oliver Butters and Barnaby Murtagh and Ipek Demir and Dany Doiron and Lisette Giepmans and Wallace, {Susan E} and Isabelle Budin-Lj{\o}sne and Schmidt, {Carsten Oliver} and Paolo Boffetta and Mathieu Boniol and Maria Bota and Carter, {Kim W} and Nick deKlerk and Chris Dibben and Francis, {Richard W} and Tero Hiekkalinna and Kristian Hveem and Kirsti Kval{\o}y and Sean Millar and Perry, {Ivan J} and Annette Peters and Phillips, {Catherine M} and Frank Popham and Gillian Raab and Eva Reischl and Nuala Sheehan and Melanie Waldenberger and Markus Perola and {van den Heuvel}, Edwin and John Macleod and Knoppers, {Bartha M} and Stolk, {Ronald P} and Isabel Fortier and Harris, {Jennifer R} and Woffenbuttel, {Bruce HR} and Murtagh, {Madeleine J} and Vincent Ferretti and Burton, {Paul R}",
note = "{\copyright} The Author 2014; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.",
year = "2014",
month = "12",
day = "1",
doi = "10.1093/ije/dyu188",
language = "English",
volume = "43",
pages = "1929--1944",
journal = "International Journal of Epidemiology",
issn = "0300-5771",
number = "6",

}

Gaye, A, Marcon, Y, Isaeva, J, LaFlamme, P, Turner, A, Jones, EM, Minion, J, Boyd, AW, Newby, CJ, Nuotio, M-L, Wilson, R, Butters, O, Murtagh, B, Demir, I, Doiron, D, Giepmans, L, Wallace, SE, Budin-Ljøsne, I, Oliver Schmidt, C, Boffetta, P, Boniol, M, Bota, M, Carter, KW, deKlerk, N, Dibben, C, Francis, RW, Hiekkalinna, T, Hveem, K, Kvaløy, K, Millar, S, Perry, IJ, Peters, A, Phillips, CM, Popham, F, Raab, G, Reischl, E, Sheehan, N, Waldenberger, M, Perola, M, van den Heuvel, E, Macleod, J, Knoppers, BM, Stolk, RP, Fortier, I, Harris, JR, Woffenbuttel, BHR, Murtagh, MJ, Ferretti, V & Burton, PR 2014, 'DataSHIELD: taking the analysis to the data, not the data to the analysis' International Journal of Epidemiology, vol. 43, no. 6, pp. 1929-1944. https://doi.org/10.1093/ije/dyu188

DataSHIELD : taking the analysis to the data, not the data to the analysis. / Gaye, Amadou; Marcon, Yannick; Isaeva, Julia; LaFlamme, Philippe; Turner, Andrew; Jones, Elinor M; Minion, Joel; Boyd, Andrew W; Newby, Christopher J; Nuotio, Marja-Liisa; Wilson, Rebecca; Butters, Oliver; Murtagh, Barnaby; Demir, Ipek; Doiron, Dany; Giepmans, Lisette; Wallace, Susan E; Budin-Ljøsne, Isabelle; Oliver Schmidt, Carsten; Boffetta, Paolo; Boniol, Mathieu; Bota, Maria; Carter, Kim W; deKlerk, Nick; Dibben, Chris; Francis, Richard W; Hiekkalinna, Tero; Hveem, Kristian; Kvaløy, Kirsti; Millar, Sean; Perry, Ivan J; Peters, Annette; Phillips, Catherine M; Popham, Frank; Raab, Gillian; Reischl, Eva; Sheehan, Nuala; Waldenberger, Melanie; Perola, Markus; van den Heuvel, Edwin; Macleod, John; Knoppers, Bartha M; Stolk, Ronald P; Fortier, Isabel; Harris, Jennifer R; Woffenbuttel, Bruce HR; Murtagh, Madeleine J; Ferretti, Vincent; Burton, Paul R.

In: International Journal of Epidemiology, Vol. 43, No. 6, 01.12.2014, p. 1929-1944.

TY - JOUR

T1 - DataSHIELD: taking the analysis to the data, not the data to the analysis

T2 - International Journal of Epidemiology

AU - Gaye, Amadou

AU - Marcon, Yannick

AU - Isaeva, Julia

AU - LaFlamme, Philippe

AU - Turner, Andrew

AU - Jones, Elinor M

AU - Minion, Joel

AU - Boyd, Andrew W

AU - Newby, Christopher J

AU - Nuotio, Marja-Liisa

AU - Wilson, Rebecca

AU - Butters, Oliver

AU - Murtagh, Barnaby

AU - Demir, Ipek

AU - Doiron, Dany

AU - Giepmans, Lisette

AU - Wallace, Susan E

AU - Budin-Ljøsne, Isabelle

AU - Schmidt, Carsten Oliver

AU - Boffetta, Paolo

AU - Boniol, Mathieu

AU - Bota, Maria

AU - Carter, Kim W

AU - deKlerk, Nick

AU - Dibben, Chris

AU - Francis, Richard W

AU - Hiekkalinna, Tero

AU - Hveem, Kristian

AU - Kvaløy, Kirsti

AU - Millar, Sean

AU - Perry, Ivan J

AU - Peters, Annette

AU - Phillips, Catherine M

AU - Popham, Frank

AU - Raab, Gillian

AU - Reischl, Eva

AU - Sheehan, Nuala

AU - Waldenberger, Melanie

AU - Perola, Markus

AU - van den Heuvel, Edwin

AU - Macleod, John

AU - Knoppers, Bartha M

AU - Stolk, Ronald P

AU - Fortier, Isabel

AU - Harris, Jennifer R

AU - Woffenbuttel, Bruce HR

AU - Murtagh, Madeleine J

AU - Ferretti, Vincent

AU - Burton, Paul R

N1 - © The Author 2014; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.

PY - 2014/12/1

Y1 - 2014/12/1

KW - biomedical research

KW - computational biology

KW - computer security

KW - confidentiality

KW - databases, factual

KW - datasets as topic

KW - Great Britain

KW - humans

KW - information storage

KW - information retrieval

UR - http://ije.oxfordjournals.org/content/43/6/1929

U2 - 10.1093/ije/dyu188

DO - 10.1093/ije/dyu188

M3 - Article

VL - 43

SP - 1929

EP - 1944

JO - International Journal of Epidemiology

JF - International Journal of Epidemiology

SN - 0300-5771

IS - 6

ER -

Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology. 2014 Dec 1;43(6):1929-1944. https://doi.org/10.1093/ije/dyu188