The data explosion
The volume of stored data is growing rapidly across all fields of science, as is the number of connections between data objects. Publications, which are themselves data objects, are supported by analysed data, which is in turn based on raw data. Data corresponding to a specific publication may be housed in various data centers and recorded on various types of media. Storage locations are subject to change as well. This makes it increasingly difficult to guarantee the findability of, and access to, the data. At the same time, access is becoming ever more vital due to the reproducibility requirements of research and the reuse of scientific information.
Persistent identifiers: ISBN numbers for data
To resolve this issue, a coding system for data has been developed: persistent identifiers (PIDs). PIDs are comparable to the ISBN numbers applied to books. Just as an ISBN number provides a permanent, citable reference to a certain book, PIDs do the same for data. PIDs allow us to find data and refer back to it as well. One of the most important functions of a PID is its role as a fixed reference to underlying data, no matter where the latter is located. Any researcher consulting a PID must be able to trust that he or she will find the underlying data. This applies even if the storage location or physical form has been altered.
SURF offers researchers the opportunity to register their collected data and to make it accessible through the use of PIDs. This is done as follows:
- SURF uses the Handle software provided by the Corporation for National Research Initiatives (CNRI) as a structural foundation. This software uses a hierarchical model resembling DNS: a reference at the top level determines where each PID is located.
- PIDs consist of a prefix and a suffix. The prefix is the first piece of the persistent identifier and can be requested directly from SURF by contacting us via email@example.com. The prefix belongs to the applicant. An applicant or institution may only submit PIDs that begin with their own individual prefix. As many unique suffixes as desired may be listed under a single prefix.
- SURF acts as a host for the PIDs. The PIDs are then replicated internally at SURF, as well as externally in the context of the EPIC consortium.
- The prefix can be used to create, modify, search for and delete PIDs. This is done through an HTTPS RESTful API.
- The so-called PID resolver is an application that allows the user to determine the location of data, or to request the data object itself, based on a PID. The PID resolver is accessible via an HTTP interface. This makes it possible to resolve PIDs in a browser, or through a URL, at http://hdl.handle.net. Because the PIDs are replicated, the resolver always works with one of the three identical copies of a PID.
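As an illustration of the prefix/suffix structure and the resolver described above, here is a minimal Python sketch. It assumes the usual Handle convention that the prefix and suffix are separated by the first slash. The PID 21.T12345/example-0001 and the function names are made up for this example and are not part of any SURF interface. The last function needs network access and simply follows the resolver's redirect.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Resolver address as given in the text above.
RESOLVER_BASE = "http://hdl.handle.net"


def split_pid(pid: str) -> tuple[str, str]:
    """Split a PID into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix PID: {pid!r}")
    return prefix, suffix


def resolver_url(pid: str) -> str:
    """Build the URL at which the resolver can be asked about this PID."""
    prefix, suffix = split_pid(pid)
    # Percent-encode characters that are not safe in a URL path;
    # slashes inside the suffix are kept as they are.
    return f"{RESOLVER_BASE}/{quote(prefix)}/{quote(suffix, safe='/')}"


def resolve_location(pid: str, timeout: float = 10.0) -> str:
    """Follow the resolver's redirect and return the final URL.

    Needs network access, and only succeeds for a PID that really exists.
    """
    with urlopen(resolver_url(pid), timeout=timeout) as response:
        return response.geturl()


if __name__ == "__main__":
    # Hypothetical PID, used only to show the URL layout.
    print(resolver_url("21.T12345/example-0001"))
```

Pasting the printed URL into a browser has the same effect as calling resolve_location: the resolver looks up the PID and forwards the request to the location currently registered for it.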
The PID service is especially relevant to research projects involving vast amounts of collected data that is used by multiple parties. An example is the collaboration with KNMI/ORFEUS regarding seismic data.
Do it yourself or let SURF help
A system administrator at your institution can create and modify the PIDs accompanying a data project. Doing so will require a certain amount of programming ability. You can also ask SURF to create PIDs for you, although we will only be able to assist in this matter if the data is also stored in SURF systems, such as the Data Archive. This is because SURF has no control over data stored in our clients' systems.
The client retains responsibility for the integrity of the PIDs and the corresponding data objects. This responsibility is especially salient when data is being relocated. In such cases the data manager must ensure the PID reference is altered to reflect the new location.
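The relocation responsibility can be pictured with a small toy model in Python. This is only a sketch of the idea, not the SURF API: the class and method names, the example PID and the example URLs are all invented for illustration. The point it shows is that the PID stays fixed while the data manager updates the location it refers to.

```python
from dataclasses import dataclass


@dataclass
class PidRecord:
    pid: str       # fixed identifier; never changes after registration
    location: str  # current URL of the data object; may change


class PidRegistry:
    """In-memory stand-in for a PID service (illustration only)."""

    def __init__(self) -> None:
        self._records: dict[str, PidRecord] = {}

    def register(self, pid: str, location: str) -> None:
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = PidRecord(pid, location)

    def relocate(self, pid: str, new_location: str) -> None:
        # The data manager's task after moving the data:
        # same PID, new location. Raises KeyError for an unknown PID.
        self._records[pid].location = new_location

    def resolve(self, pid: str) -> str:
        return self._records[pid].location


if __name__ == "__main__":
    registry = PidRegistry()
    registry.register("21.T12345/example-0001", "https://old.example.org/data/1")
    # The data moves to another system; the PID itself is untouched.
    registry.relocate("21.T12345/example-0001", "https://new.example.org/archive/1")
    print(registry.resolve("21.T12345/example-0001"))
```

If the relocate step is forgotten, the PID keeps pointing at the old location, which is exactly the situation the data manager has to prevent.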
Support & consultancy
PID users can always count on us for support. We can help you create PIDs, for example, and offer advice on maximizing the findability of your data.
Our helpdesk is available by telephone and email, but can also assist you in person. If you have any questions or want to report a problem, please send an email to firstname.lastname@example.org or phone +31-208001400. The helpdesk is available during office hours (9:00–17:00).
For advice on more specific topics, such as designing your data infrastructure, please contact our consultants.