Open Data – Open Science

In addition to open access to publications, open sharing of primary data generated during research is increasingly being adopted. In experimental biology, examples of such data include gene and genome sequences, details of protein 3D structure, gene expression data from transcriptomic experiments, results of chemical analyses of biological samples, microscopic images, and more.

Open sharing of research data facilitates scientific and economic progress. It allows other scientists to use the data in their projects and encourages collaboration between teams.

It also benefits the authors of the data themselves. Their results are stored securely and in a structured way on a publicly accessible website (effectively “in the cloud”) for later use. Open sharing also increases the credibility of their work: anyone can analyse the data and verify that they were obtained properly and that the resulting publication is not based on fraud or questionable methodology.

For a useful overview of topics related to open data, see e.g. the PDF document from the National Technical Library’s webinar Open Science and Citizen Science (pages 103–166, partially in Czech, partially in English).

Where and how to publish data?

Open data should follow the FAIR principles, discussed in more detail below. The data must be published in a repository, a specialised digital storage site that meets certain requirements for accessibility, trustworthiness, and so on.

Registries and search engines such as re3data.org and OpenDOAR can help you find a suitable repository. You can use either subject-specific repositories (e.g. arXiv, ELIXIR Deposition Databases, Europe PMC for biology) or universal ones (e.g. Dryad and Zenodo). There is also the ASEP data repository, which is available to authors from the Czech Academy of Sciences.

Published data should be easy to find both manually and by computer. They therefore need to be accompanied by metadata, which provide further details about the data. Metadata usually include the names of the authors, keywords, information on project funding, bibliographic data on related scientific publications, the methodology used to obtain the data and, in biology, for example the species of the experimental organism.
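As a rough illustration, a metadata record covering the items listed above could be sketched in Python as follows. The field names and all values are hypothetical placeholders and do not come from any particular metadata standard; a real repository will prescribe its own schema.

```python
import json

# Hypothetical metadata record; field names and values are illustrative only.
metadata = {
    "title": "Example transcriptomic dataset",
    "authors": ["Doe, Jane", "Novak, Jan"],
    "keywords": ["gene expression", "RNA-seq"],
    "funding": "Example funder, project number PLACEHOLDER",
    "related_publications": [],
    "methodology": "Placeholder description of how the data were obtained",
    "organism": "Arabidopsis thaliana",
}


def to_json(record: dict) -> str:
    """Serialise the record to JSON so that software can read it as well."""
    return json.dumps(record, indent=2, ensure_ascii=False, sort_keys=True)


print(to_json(metadata))
```

Storing the record in a structured format such as JSON, rather than in free text, is what allows a computer to search it.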

There are various general and discipline-specific standards for metadata structure. For basic orientation, you can use the National Technical Library’s document Obecné doporučení pro metadatový popis výsledků výzkumu, zejména publikací a dat (General recommendation for the metadata description of research results, especially publications and data; in Czech). An overview of metadata standards used in biology can be found in the Digital Curation Centre database.

Searchability is aided by so-called persistent identifiers – codes that uniquely identify a document, dataset, person, etc. DOI (Digital Object Identifier) is typically used for scientific articles and datasets, ORCID (Open Researcher and Contributor ID) for authors, and ROR for research organisations.

The advantage of persistent identifiers is that they always remain the same, even if, for example, the web address of the online version of an article changes or a person changes their name after marriage.
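To make this more concrete, the following minimal Python sketch turns a DOI and an ORCID iD into stable resolver links and checks the final character of an ORCID iD, which is a checksum computed with the ISO 7064 MOD 11-2 algorithm. The function names are hypothetical. The DOI is the one cited at the end of this page, and the ORCID iD in the usage example is the sample iD used in ORCID's own documentation.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the checksum of an ORCID iD such as 0000-0002-1825-0097."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


def doi_url(doi: str) -> str:
    """Build the resolver link for a DOI; the link stays valid if the target page moves."""
    return "https://doi.org/" + doi


def orcid_url(orcid: str) -> str:
    """Build the link to the ORCID record."""
    return "https://orcid.org/" + orcid


print(doi_url("10.5281/zenodo.3405141"))
print(is_valid_orcid("0000-0002-1825-0097"))
```

The checksum only catches typing errors; it does not confirm that the iD is actually registered to a person.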

How to set up data access?

Open data must have a clearly defined licence, which tells users how they are authorised to use the data. Creative Commons licences are probably the most commonly used. Ideally, the data should be published under the Creative Commons CC BY 4.0 licence, which allows redistribution, modification and other use, with the only requirement being that the author is credited.

Nevertheless, in justified cases it is appropriate to keep the data partially or fully closed. For example, this could mean not allowing access to everybody without restrictions, but requiring interested persons to identify themselves beforehand or ask the authors for permission. Reasons for doing so may include trade secrets, protection of intellectual property, commercial use of the data by the authors, national security, etc. However, grantmakers usually require that project applicants explain the reasons for any data closure in their Data Management Plan.

Data Management Plan

When planning your project, you should already think about how you will acquire, process, store and publish research data. A good methodology for these activities makes your work more efficient, helps to protect information and increases the scientific value of the data.

Some funders already require a written Data Management Plan (DMP). It is a part of the grant application and should be continuously updated during the project if there are significant changes relevant to the data produced. However, it is useful to prepare a similar plan even for projects where it is not mandatory.

The preparation of your Data Management Plan can be made easier by various online resources. DMPs Online – Public DMPs offers publicly available plans that you can use for inspiration. The Horizon Europe programme has its own plan template. On the ARGOS website, you can browse shared plans from grant proposals or create your own using an online tool (after registration). Scientists from the Czech Academy of Sciences can use the online service AV CR FAIR Wizard to create a plan.

FAIR Principles

Well-managed and organised data that allow easy sharing, further scientific use and collaboration between researchers should follow the so-called FAIR principles (Findable – Accessible – Interoperable – Reusable). According to these principles, data and associated metadata must be:

Findable

It should be possible for others to discover your data. Rich metadata should be available online in a searchable resource, and the data should be assigned a persistent identifier.

– A persistent identifier is assigned to your data.

– Rich metadata describe your data.

– The metadata are online in a searchable resource, e.g. a catalogue or data repository.

– The metadata record specifies the persistent identifier.
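Part of the Findable checklist can be checked automatically. The sketch below, written under the assumption that the metadata record is available as a Python dictionary, reports a missing persistent identifier and missing descriptive fields. The field names and the choice of required fields are hypothetical, and the check cannot tell whether the record is actually indexed in a searchable resource.

```python
# Hypothetical set of descriptive fields treated as the minimum for "rich" metadata.
REQUIRED_FIELDS = ("title", "authors", "keywords", "description")


def findable_issues(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    issues = []
    if not record.get("identifier"):
        issues.append("no persistent identifier in the metadata record")
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        issues.append("missing descriptive metadata: " + ", ".join(missing))
    return issues


record = {
    "identifier": "10.5281/zenodo.3405141",
    "title": "How FAIR are your data?",
    "authors": ["Jones, Sarah", "Grootveld, Marjan"],
    "keywords": [],
    "description": "Checklist",
}
print(findable_issues(record))
```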

Accessible

It should be possible for humans and machines to gain access to your data, under specific conditions or restrictions where appropriate. FAIR does not mean that data need to be open! There should be metadata, even if the data aren’t accessible.

– Following the persistent ID will take you to the data or associated metadata.

– The protocol by which data can be retrieved follows recognised standards, e.g. HTTP.

– The access procedure includes authentication and authorisation steps, if necessary.

– Metadata are accessible, wherever possible, even if the data aren’t.
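As an illustration of machine access over a standard protocol, the following Python sketch prepares an HTTP request that asks the doi.org resolver for metadata in CSL JSON format instead of the human-readable landing page. It relies on the content negotiation offered for DOIs registered with agencies such as Crossref and DataCite; whether a given DOI supports it is an assumption to verify. The function names are hypothetical, and the usage example only builds the request without sending it.

```python
import json
import urllib.request

# Media type for citation metadata in CSL JSON, used in DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str) -> urllib.request.Request:
    """Prepare a GET request for machine-readable metadata of the given DOI."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": CSL_JSON},
    )


def fetch_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Send the request and parse the JSON response (requires network access)."""
    with urllib.request.urlopen(metadata_request(doi), timeout=timeout) as response:
        return json.load(response)


request = metadata_request("10.5281/zenodo.3405141")
print(request.full_url)
```

If the data themselves are restricted, a repository can still answer such a request with the metadata, which is what the last point of the checklist asks for.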

Interoperable

Data and metadata should conform to recognised formats and standards to allow them to be combined and exchanged.

– Data are provided in commonly understood and preferably open formats.

– The metadata provided follow relevant standards.

– Controlled vocabularies, keywords, thesauri or ontologies are used where possible.

– Qualified references and links are provided to other related data.
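Two of these points can be sketched in a few lines of Python: checking values against a controlled vocabulary and exporting a table to CSV, an open text format. The vocabulary, the column names and the sample rows are hypothetical; in practice the allowed terms would come from a community ontology or thesaurus.

```python
import csv
import io

# Hypothetical controlled vocabulary of organism names.
ALLOWED_ORGANISMS = {"Arabidopsis thaliana", "Nicotiana tabacum"}


def unknown_terms(rows, field, vocabulary):
    """Return the sorted values of `field` that are not in the vocabulary."""
    return sorted({row[field] for row in rows if row[field] not in vocabulary})


def to_csv(rows, fieldnames):
    """Write the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"sample": "S1", "organism": "Arabidopsis thaliana"},
    {"sample": "S2", "organism": "A. thaliana"},
]
print(unknown_terms(rows, "organism", ALLOWED_ORGANISMS))
print(to_csv(rows, ["sample", "organism"]))
```

Here the abbreviated name in the second row is flagged, because two datasets can only be combined reliably when both use the same terms for the same thing.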

Reusable

Lots of documentation is needed to support data interpretation and reuse. The data should conform to community norms and be clearly licensed so others know what kinds of reuse are permitted.

– The data are accurate and well described with many relevant attributes.

– The data have a clear and accessible data usage license.

– It is clear how, why and by whom the data have been created and processed.

– The data and metadata meet relevant domain standards.

Source: Jones S., Grootveld M. (2017): How FAIR are your data? Zenodo. DOI: 10.5281/zenodo.3405141. License CC BY 4.0, authors Sarah Jones and Marjan Grootveld, EUDAT.