• 2017-06-07
  • Article

Research datasets are seldom used in their raw form or in isolation, even by the researchers who captured them for a particular experiment. At minimum, a raw dataset will need some processing, perhaps calibration or data cleansing, before it can be analysed. In experiments at photon and neutron facilities, for example, which are typically directed at understanding the structure of materials, data reduction and analysis are distinct phases, and both can be very time-consuming. In general, processed datasets will be created from raw data, and in many fields data will be combined with other datasets that originated elsewhere; studies of the natural environment are a case in point.

As a result, a number of issues arise under the general heading of “workflows and provenance”. A workflow captures and represents the stages of processing through which datasets have passed in a particular study, leading to the final published results, while provenance refers to evidence of the origins of a particular dataset, giving confidence in its authenticity, validity and reusability. These are two sides of the same coin. Workflows and provenance are key to reproducibility in research: it is important to be able to reproduce not only an experiment but also the processing of the data that led to the conclusions.
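
As a rough illustration of how a workflow record can double as provenance evidence, the sketch below logs each processing stage together with checksums of its input and output. The class names, the `clean` and `calibrate` steps, and the source label are all hypothetical; this is not a representation proposed by any RDA group, only a minimal example of the idea.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Step:
    """One processing stage: its name plus checksums of input and output."""
    name: str
    input_sha256: str
    output_sha256: str


@dataclass
class WorkflowLog:
    """Ordered record of the stages a dataset has passed through."""
    source: str  # hypothetical label for where the raw data came from
    steps: List[Step] = field(default_factory=list)

    def run(self, name: str, func: Callable[[bytes], bytes], data: bytes) -> bytes:
        """Apply one stage and record what went in and what came out."""
        result = func(data)
        self.steps.append(Step(name, sha256_hex(data), sha256_hex(result)))
        return result

    def is_unbroken(self) -> bool:
        """True if every stage consumed exactly what the previous one produced."""
        return all(
            prev.output_sha256 == nxt.input_sha256
            for prev, nxt in zip(self.steps, self.steps[1:])
        )


def clean(data: bytes) -> bytes:
    """Illustrative cleansing step: drop spaces and surrounding whitespace."""
    return data.replace(b" ", b"").strip()


def calibrate(data: bytes) -> bytes:
    """Illustrative calibration step: scale every value by an assumed factor of 2."""
    values = [float(v) * 2.0 for v in data.decode().split(",")]
    return ",".join(f"{v:.1f}" for v in values).encode()


log = WorkflowLog(source="hypothetical-run-001")
raw = b" 1.0, 2.0 ,3.0 \n"
cleaned = log.run("clean", clean, raw)
calibrated = log.run("calibrate", calibrate, cleaned)

for step in log.steps:
    print(step.name, step.input_sha256[:8], "->", step.output_sha256[:8])
```

The same log serves both purposes: read forwards it describes the workflow, and read backwards from the final output it gives a chain of evidence leading to the raw data.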

Representing workflows and provenance can help to satisfy a number of needs. Locating datasets for reuse requires not only finding potential sources of data but also understanding its suitability: its origins, how it has been processed, its limitations and context. This is especially true when reusing data across communities. What is it that makes data fit for reuse, and are there different levels of usability? How can the identity of a dataset be maintained, and how can dynamic datasets be handled? As the Research Data Provenance Interest Group puts it, summarising the key questions: “Where did it come from? Who modified it? Is this copy the same as the copy I deposited? In what way is it the same? How do I resolve discrepancies or anomalies?”

Such questions are of interest not only to researchers seeking to reuse data, but also to repositories concerned with best practices in making their data reusable.
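
One of the questions quoted above, whether a copy is the same as the one deposited, has a simple answer in the narrowest sense: compare checksums. The sketch below assumes the depositor recorded a SHA-256 digest at deposit time; the function names and file names are illustrative. It only establishes bit-for-bit identity and says nothing about the other senses of "the same", such as equivalent content in a different file format.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_deposit(path: Path, deposited_sha256: str) -> bool:
    """True if the file is bit-for-bit identical to the deposited original."""
    return file_sha256(path) == deposited_sha256


# Demonstration with throwaway files standing in for a deposit and two copies.
with tempfile.TemporaryDirectory() as tmp:
    deposited = Path(tmp) / "deposited.csv"
    copy = Path(tmp) / "copy.csv"
    altered = Path(tmp) / "altered.csv"

    deposited.write_bytes(b"t,value\n0,1.5\n1,1.7\n")
    copy.write_bytes(deposited.read_bytes())
    altered.write_bytes(b"t,value\n0,1.5\n1,1.8\n")  # one value changed

    recorded = file_sha256(deposited)  # digest noted at deposit time
    copy_ok = matches_deposit(copy, recorded)
    altered_ok = matches_deposit(altered, recorded)

print("copy matches deposit:", copy_ok)
print("altered file matches deposit:", altered_ok)
```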

Relevant RDA groups

Brokering Framework WG

The Brokering IG/WGs work on specifications for a middleware layer and services that can mediate where heterogeneous resources have to be brought together. Mapping between heterogeneous metadata sets is one of the frequently occurring challenges.

Brokering IG

The Brokering IG/WGs work on specifications for a middleware layer and services that can mediate where heterogeneous resources have to be brought together. Mapping between heterogeneous metadata sets is one of the frequently occurring challenges.

Data Description Registry Interoperability WG

The Data Description Registry Interoperability (DDRI) WG addresses the problem of cross-platform discovery by connecting datasets together on the basis of co-authorship or other collaboration models such as joint funding and grants. The suggested solution compiles simple enabling infrastructures based on existing open protocols and standards with a flexible and extensible approach that allows registries to opt-in and enables any third-party to create particular global views of research data.

Data Discovery Paradigms IG

Given the increasing number of research data repositories, and the need for cross-disciplinary data discovery, the Data Discovery Paradigms IG aims to identify common elements and shared issues that support users in discovering research data regardless of its location or the manner in which it is stored, described and exposed. These could include a registry of data search engines, common test datasets, usage metrics, and a collection of data search use cases and competency questions. Metadata and its optimal use are therefore the focus of this group.

Data in Context IG

Data Objects (DOs), which include large collections, are created in a particular context (persons, projects, institutions, etc.). The Data in Context IG aims to work out principles for including context knowledge in the description of DOs, which could help with reuse and also allow links to be drawn between various types of entities.

Data Versioning IG

The Data Versioning IG recognises that many datasets are continually being extended with new data, and that in some cases existing data is revised due to enhanced processing capabilities. This means that it becomes important to clearly identify which version of a dataset was used to derive some research result. It is also necessary to be able to specify extracts (such as slices through space or time) on which the processing was done. The aim is to establish technical bases and good practice for identification of versions of dynamic data.
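
To make the idea concrete, a reference to the data behind a result can carry three things: which dataset, which version, and which extract. The sketch below is an assumption-laden illustration, not a scheme proposed by the Data Versioning IG: the `DatasetExtract` class, the identifier, and the label syntax are all hypothetical, and only a time slice is modelled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatasetExtract:
    """Immutable reference to a version of a dataset and an optional time slice."""
    dataset_id: str                 # hypothetical identifier, not a real PID
    version: str                    # version label assigned by the data provider
    start: Optional[date] = None    # inclusive start of the time slice, if any
    end: Optional[date] = None      # inclusive end of the time slice, if any

    def label(self) -> str:
        """Build a human-readable reference string (illustrative syntax only)."""
        text = f"{self.dataset_id}@{self.version}"
        if self.start or self.end:
            start = self.start.isoformat() if self.start else ""
            end = self.end.isoformat() if self.end else ""
            text += f"?time={start}/{end}"
        return text


# The extract a hypothetical study was based on.
used = DatasetExtract(
    "example-sea-temperature", "2.1", date(2010, 1, 1), date(2015, 12, 31)
)
# The same slice taken from a later revision is a different reference.
revised = DatasetExtract(
    "example-sea-temperature", "2.2", date(2010, 1, 1), date(2015, 12, 31)
)

print(used.label())
print(used == revised)
```

Making the reference immutable and comparable by value means two results can be checked for whether they were derived from exactly the same version and slice.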

Libraries for Research Data IG

Libraries have expanded on their traditional roles and developed new services in the digital environment, not just facilitating the research process but becoming active participants in it. These services include providing access to and preservation of research data, as well as advising and supporting researchers in the management of research data.

Libraries have a successful history of collaboration and interoperable solutions, something that is increasingly vital in an environment of evolving software and data management products, mobile researchers, and volatile repositories. Maintaining continued long-term access to scholarly assets is essential, and RDA offers a venue for librarians to share their skill sets and expertise in this regard with members of other groups such as the Domain Repositories Interest Group, the Metadata Working Group, and the Data Publishing Interest Group. Librarians in turn can learn from best practice developed in other fields and bring it back to the library community. RDA also offers the opportunity to share the principles and practices of librarians experienced in the stewardship of data with domain-specific groups seeking to develop local solutions to often universal problems within data management.

The objectives of the Libraries for Research Data Interest Group include development of strategies to embed data management services at academic and research institutions, identification of sustainable organisational business models for libraries in support of research data management (RDM), and the promotion of best practice and interoperability of library infrastructures with domain repositories and other RDM initiatives. Working groups will be formed with reference to specific, short-term activities identified by the Interest Group.

Provenance Patterns WG

The Provenance Patterns WG recognises the centrality of tracking the provenance of research data. The aim of the group is to identify and recommend best practices for provenance representation and management.

Reproducibility IG

The Reproducibility IG seeks to advance and enable reproducibility in research based on or producing datasets. Recommendations need to be made to overcome the current situation where too often research results based on data cannot be reproduced. One important pillar amongst others is to develop and/or adopt suitable metadata standards describing data and code that are involved in creating research results and encourage their usage.

Research Data Provenance IG

The Research Data Provenance IG analyses the requirements for provenance metadata relevant to later data reuse.

WDS/RDA Assessment of Data Fitness for Use WG

The WDS/RDA Assessment of Data Fitness for Use WG tackles the question of understanding and assessing data quality, which is a multifaceted property depending not only on the data itself but on supplementary information such as annotations as well as curation, citability and the like. The aim is to develop criteria or metrics that represent fitness for use, and to develop ways of communicating it for individual datasets.