How do I open my data?

Moderator: Pasi Kolari, University of Helsinki
Reporters: Daniel Schraik, Aalto University, and Junko Sugano, University of Helsinki

Opening or publishing data does not simply mean uploading files to an FTP server and leaving users to work out for themselves how to find, retrieve, interpret, and use the data. Opening data means making an effort to create suitable metadata that is readable by both humans and machines, taking care of legal aspects such as licensing, considering the ethical and privacy aspects of the data, and maintaining clean, structured data for the user.

Open data should consist of a persistent identifier (PID), metadata, and the “actual” datasets. They should be accessible online in a data repository. Such a repository may be provided by one’s institution, be one commonly used in one’s field, or be a general-purpose data repository (e.g. B2SHARE, Zenodo). A list of repositories can be found at re3data.org. It is important to select a repository that offers licensing options and searchable, machine-readable metadata. Various choices for licensing exist, but Creative Commons is likely to be the most suitable. Another option is to publish the metadata on a metadata platform such as etsin.avointiede.fi, including instructions on how to obtain the data.

A key aspect of open data is documentation. Together with metadata, it is needed to guide users on how to obtain, analyze, and interpret the data, and to credit the authors of the data correctly. The right moment to start writing documentation is at the beginning of the study or, at the latest, right after the end of data collection. Do not wait until the study is over, as writing purely from memory is often insufficient!

The data that should be opened is, simply speaking, all data that is necessary to reproduce one’s results or might otherwise be useful to the user. In addition to field records, this may include the source code used in the analyses. Ideally, everything should be opened. However, this guideline should not always be taken too literally. Common sense suggests that dead ends in the analysis or other potentially confusing data may be left out of the published data, as long as they are still mentioned in the documentation.

Some researchers see a risk that others might publish a very similar study if they open their data early. This risk is, in most cases, rather low. If there is reason to believe that the risk is high, there is the option to publish only the metadata, or to open the data under an embargo. The right moment for opening data is early in the study. It is possible to maintain multiple versions of the data.

Metadata is the key to finding open data. It contains a technical part that describes the data, i.e. the file hierarchy, file types, the contents, and their interrelations. The other key part is the administrative metadata, which contains information on authorship, data ownership, distributor(s), contact person(s), licence and copyright information, and contributors. This part is crucial for scientists to be able to receive credit for their data, and for data users to be able to contact the data providers.
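
As a rough illustration of the two parts of metadata described above, the following Python sketch builds a small machine-readable record and checks whether the administrative part is complete. It is only a sketch under assumptions: the field names, the example values, and the helper functions build_metadata and missing_admin_fields are hypothetical and do not follow any particular metadata standard or repository schema.

```python
import json

# Hypothetical field names for the administrative part; these are
# illustrative only and not taken from any metadata standard.
REQUIRED_ADMIN_FIELDS = (
    "authors",
    "owner",
    "distributors",
    "contacts",
    "licence",
    "copyright",
    "contributors",
)


def build_metadata(technical, administrative):
    """Combine the technical and administrative parts into one record."""
    return {"technical": technical, "administrative": administrative}


def missing_admin_fields(record):
    """Return the administrative fields that are absent or empty."""
    admin = record.get("administrative", {})
    return [field for field in REQUIRED_ADMIN_FIELDS if not admin.get(field)]


# Example record with made-up values; "contacts" is left out on purpose
# to show how the completeness check reports it.
record = build_metadata(
    technical={
        "file_hierarchy": ["data/field_records.csv", "code/analysis.py"],
        "file_types": {"csv": "field records", "py": "analysis source code"},
        "contents": "Example field measurements and the code used to analyse them",
        "relations": "analysis.py reads field_records.csv",
    },
    administrative={
        "authors": ["A. Researcher"],
        "owner": "Example University",
        "distributors": ["Example Data Repository"],
        "licence": "CC BY 4.0",
        "copyright": "Example University",
        "contributors": ["B. Assistant"],
    },
)

# Serialise to JSON so the record is readable by machines as well as humans.
metadata_json = json.dumps(record, indent=2, ensure_ascii=False)
missing = missing_admin_fields(record)

print(metadata_json)
print("Missing administrative fields:", missing)
```

In this example the check reports the missing contact person, which is exactly the kind of gap that would prevent a data user from reaching the data providers. A real repository or metadata platform will define its own required fields, so the list above should be replaced with the one it prescribes.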