Securely linking and analysing scientific data with ODISSEI Data Facility
Collaboration between CBS, SURF, ODISSEI, and Netherlands Twin Register
28 JUN 2018
SURF is working with Statistics Netherlands (CBS) on a virtual IT environment that allows researchers to analyse linked data in a high-performance computing environment: the ODISSEI Data Facility (ODF).
The ODISSEI Data Facility is being developed especially for ODISSEI, a collaboration between CBS and the Netherlands Organisation for Scientific Research (NWO) that promotes a research data infrastructure for the social sciences in the Netherlands. SURFsara and the research group of biological psychologist Dorret Boomsma (VU) launched a Proof of Concept (PoC) for the ODF in 2017. We examined whether a data and computing environment could be set up within SURFsara that meets the legal, technical and security requirements of CBS, so that confidential data could be placed in that environment for secure analysis.
Netherlands Twin Register
We specifically examined whether it is possible to establish a secure link within the ODF between CBS microdata and the very large genetic dataset of the Netherlands Twin Register (NTR), which is already stored with SURFsara. The first link between NTR and CBS data has now been established, thanks to the efforts of the researchers of the Boomsma working group.
Virtual work environment
Special attention was paid to security. As a precursor to the future ODISSEI Data Facility, a completely virtual environment was created in the PoC. The ODF is part of the CBS network and must therefore comply with the strict CBS security requirements, which stipulate, among other things, that user access is only possible via the CBS portal. Furthermore, a data leak from the ODF must never go unnoticed. Because the national supercomputer Cartesius is always in use by other researchers, it cannot as a whole be part of the ODF environment. Virtualisation based on PCOCC software (developed at the CEA in France) was therefore chosen.
PCOCC is used to create a virtual computing environment on one or more HPC nodes in what is called a sandbox: a separate environment shielded from the outside world. The virtual nodes retain all the characteristics of a physical Cartesius node, such as connections with other nodes via InfiniBand and access to (Lustre) storage and data in the central HPC environment. SURFsara has configured the virtual environment so that data exchange is only possible between the node and the virtual machine used for the analysis.
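The data-exchange rule described above can be pictured as an allow-list. The Python sketch below is illustrative only and does not represent the actual SURFsara or PCOCC configuration: the endpoint names, the `Transfer` type and the `audit` helper are hypothetical. It permits exchange only between the host node and the analysis virtual machine, and returns every blocked attempt so that none goes unnoticed.

```python
from dataclasses import dataclass

# Hypothetical endpoint names, chosen purely for illustration.
HOST_NODE = "hpc-node"
ANALYSIS_VM = "analysis-vm"

# The only permitted exchanges: node <-> analysis VM, in either direction.
ALLOWED_FLOWS = frozenset({
    (HOST_NODE, ANALYSIS_VM),
    (ANALYSIS_VM, HOST_NODE),
})


@dataclass(frozen=True)
class Transfer:
    """A requested data exchange between two endpoints."""
    source: str
    destination: str


def is_allowed(transfer: Transfer) -> bool:
    """Return True only if the (source, destination) pair is on the allow-list."""
    return (transfer.source, transfer.destination) in ALLOWED_FLOWS


def audit(transfers):
    """Split transfers into permitted and blocked lists.

    Blocked transfers are returned rather than dropped, so that an
    attempted leak can be reported instead of passing silently.
    """
    permitted = [t for t in transfers if is_allowed(t)]
    blocked = [t for t in transfers if not is_allowed(t)]
    return permitted, blocked


if __name__ == "__main__":
    requests = [
        Transfer(HOST_NODE, ANALYSIS_VM),
        Transfer(ANALYSIS_VM, "internet"),
    ]
    permitted, blocked = audit(requests)
    print("permitted:", permitted)
    print("blocked:", blocked)
```

In a real deployment such a rule would be enforced at the network and virtualisation layer rather than in application code. The sketch only makes the policy explicit.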