Data Management Plan

Research data management covers all activities that ensure high-quality research data: organizing and documenting data, information security, and long-term preservation. In short, it is part of conducting ethical research.

When planning a research study, it is important to think through and document how data will be collected and handled during the study, and where research data will be stored after the project ends. A data management plan (DMP) supports this planning; it is useful to the research team and is required by many funders when submitting applications.

A Data Management Plan (DMP) is an official document that describes how research data will be handled throughout the research project and after its completion. The DMP is structured, systematic, and a living document that is continuously updated during the project. It follows the logical research data lifecycle and describes all its aspects.

Tools for creating a DMP:

Several countries have localized and adapted DMPonline to their needs. These adaptations are useful for collaborative projects:

  • DMPTuuli: localized DMPonline for Finland
  • DMPTool: adapted DMPonline for the USA
  • RDMO: Research Data Management Organiser, Germany

The following guide follows exactly the structure and sections of DMPonline:

Data Collection and Organization

Possible sources of data: collecting data yourself, (re)using your own previously collected data, using public open data (e.g., the Estonian Open Data Portal), (re)using data collected by others (see the repository registry Re3data, or data registries such as Mendeley Data and DataCite Commons), or purchasing data.

What to keep in mind?

If data is reused or purchased, which version is being used?

What happens if the data author uploads a new version?

Keep a copy of the version you use, together with its documentation, on your own server.

Check copyrights, licenses, and restrictions (access, reuse).

Check data machine-readability and interoperability with the planned information system.
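One way to pin the exact dataset version you use, as asked above, is to record a checksum when you download it. The sketch below is illustrative and not part of any cited guideline; the function name is invented:

```python
# Hypothetical sketch: record a SHA-256 checksum of a downloaded dataset,
# so you can later verify you are still analysing the same version.
import hashlib

def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in chunks to handle large data."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Storing the digest in your documentation makes it easy to detect when the data author has uploaded a new version.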

Data types (experimental, observational, survey data, audio-video, etc.).

How will new data be integrated with existing data?

Which data deserve long-term preservation?

If some datasets are subject to copyright or intellectual property rights, show that you have permission to use the data.

Name the data formats used and justify them.

Use open formats.

Use standard formats.

Use machine-readable formats.

Check whether the format allows automatic addition of metadata.

Check whether repositories support the chosen formats.

Recommended data formats: see the "File Formats" chapter of the Open Data Handbook.
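As a small illustration of saving data in an open, machine-readable format, the following sketch writes records as UTF-8 CSV using only the Python standard library; the function name and field layout are invented examples:

```python
# Hypothetical sketch: save tabular data in an open, machine-readable
# format (CSV, UTF-8) using only the standard library.
import csv

def save_as_csv(rows: list, path: str) -> None:
    """Write a list of uniform dict records to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```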

Estimate the data volume at the end of the project.
Many subsequent decisions and costs related to data management depend on this: storage, access, backup, data exchange, hardware and software, and technical support.
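A rough volume estimate can be obtained by summing file sizes in the project folder. The sketch below is an illustrative helper, not a prescribed method:

```python
# Hypothetical sketch: estimate the current data volume of a project folder,
# as a starting point for projecting storage and backup costs.
import os

def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, including subfolders."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double-counting
                total += os.path.getsize(path)
    return total
```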

Are there standard procedures and methods? Name them and provide links.

Are there data standards? Name them and provide links.

How is data quality ensured (availability, integrity, confidentiality)?

How are errors handled (input errors, problematic values)?
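Error handling can start with a simple validation pass over incoming records. The sketch below is a hypothetical example; the field names (id, age) and plausibility rules are invented:

```python
# Hypothetical sketch of a validation pass: flag input errors and
# problematic values before analysis.
def validate_record(record: dict) -> list:
    """Return a list of problems found in one data record (empty = clean)."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append(f"implausible age: {age}")
    return problems
```

Running such checks at data entry, and logging what was flagged, documents how problematic values were treated.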

Whenever possible, use open-source software:

It keeps hardware and software costs low.

It is compatible with other open-source software products.

It is developed and supported by a large community (higher quality, security, and updates; unfortunately, sometimes limited documentation and support).

Software must allow all analyses to be reproduced.

If you create new software yourself, document it.

Technical support for your own software: can it be provided in the future?

Version control system: Git.

Cloud-based code repository: GitHub.

Open-source licenses: Choose an open source license.

Be systematic and consistent!

File naming: keep names simple and logical; either avoid abbreviations or use only standard ones (countries, languages, units of measurement, methods).

Use abbreviations consistently in one language (e.g., MRT or MRI?).

File organization (options: project name, time, place, collector, material type, format, version).

Folder structure should be hierarchical, simple, logical, and short.

How version control is managed and what problems may arise from uploading new versions.

Copying files to multiple locations is not good practice: keep them in one place and create shortcuts.

Adding metadata (who is responsible, when is it added).

Article: Data Organization in Spreadsheets.
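A consistent naming scheme like the one described above can be enforced in code. The following sketch is a hypothetical example that combines project, collection date, place, and version into a sortable file name:

```python
# Hypothetical sketch: build file names from standard parts (project, date,
# place, version) in a fixed order, so names sort logically and stay consistent.
import re
from datetime import date

def standard_filename(project: str, collected: date, place: str,
                      version: int, ext: str) -> str:
    def slug(text: str) -> str:
        # lowercase, replace anything that is not a-z or 0-9 with a hyphen
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{slug(project)}_{collected.isoformat()}_{slug(place)}_v{version:02d}.{ext}"
```

For example, `standard_filename("Bird Survey", date(2024, 5, 1), "Tartu", 3, "csv")` yields `bird-survey_2024-05-01_tartu_v03.csv`, which sorts chronologically within a project.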

Data Documentation and Metadata

An excellent guide for data documentation: Siiri Fuchs & Mari Elisa Kuusniemi. (2018, December 4). Making a research project understandable – Guide for data documentation (Version 1.2). Zenodo. DOI: http://doi.org/10.5281/zenodo.1914401

A README text file should be provided together with the data files. The README.txt file gives information about the dataset and enables correct interpretation of the data both by yourself and by other researchers after the data is shared or published. Create one README.txt file for each dataset and always name it README.txt or README.md (Markdown), not LOEMIND, readme, ABOUT, etc.

The README.txt file should definitely include the following information:

  • dataset title
  • short description of the dataset (abstract)
  • file structure and relationships between files
  • data collection methods
  • software used (including versions)
  • standards applied
  • specific information about the data (units of measurement, explanations of abbreviations and codes, etc.)
  • possibilities and restrictions for data reuse
  • contact details of the person who uploaded the dataset

Guide for creating a README.txt file.
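The README fields listed above can be captured in a small template. This sketch is illustrative; the field labels follow the list above, and fields left unfilled are visibly marked TODO:

```python
# Hypothetical sketch: generate a minimal README.txt skeleton from the
# fields a dataset README should contain.
README_TEMPLATE = """\
Title: {title}
Description: {description}
File structure: {files}
Collection methods: {methods}
Software (with versions): {software}
Standards applied: {standards}
Units, abbreviations, codes: {codes}
Reuse conditions: {reuse}
Contact: {contact}
"""

def make_readme(**fields: str) -> str:
    """Fill the template; missing fields stay visibly marked as TODO."""
    defaults = {k: "TODO" for k in
                ("title", "description", "files", "methods", "software",
                 "standards", "codes", "reuse", "contact")}
    defaults.update(fields)
    return README_TEMPLATE.format(**defaults)
```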

Administrative metadata about the project (ID, funder, PI, rights and licenses).

Technical metadata (about hardware and software, instruments, tools, access rights).

Descriptive metadata (authors, title, short description, content description).

DataCite metadata framework (mandatory, recommended, optional) on the DataCite Estonia consortium page.

Metadata standards define which fields need to be completed: Directory of Metadata Standards.
Universal metadata standards: Dublin Core (used in DataDOI), Schema.org, DCAT, DataCite Metadata Schema.

Controlled vocabularies and classifications for metadata specify what to write in these fields using standard terminology. BARTOC (Basel Register of Thesauri, Ontologies & Classifications).

Examples:

Estonian Subject Thesaurus
AGROVOC Thesaurus
Mammal Species of the World
JACS Education Subject Classifications
GeoNames
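As an illustration of a universal metadata standard, the sketch below describes a dataset with the 15 Dublin Core elements as plain key-value pairs; all values, including the DOI, are invented examples:

```python
# Hypothetical sketch: a dataset description using the 15 Dublin Core
# elements. All values are invented example data.
dublin_core_record = {
    "title": "Bird Survey 2024, Tartu County",
    "creator": "Example Research Group",
    "subject": "ornithology; population monitoring",
    "description": "Counts of breeding birds at 40 fixed observation points.",
    "publisher": "University of Tartu",
    "contributor": "Field volunteers",
    "date": "2024-06-30",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.0000/example",   # invented placeholder DOI
    "source": "",
    "language": "en",
    "relation": "",
    "coverage": "Tartu County, Estonia; 2024",
    "rights": "CC BY 4.0",
}
```

Keeping metadata in such a structured form makes it straightforward to export into repository deposit forms or machine-readable records later.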

Research Ethics and Legal Compliance

Estonian Research Council: “Guidelines for Addressing Ethical Issues in a Personal Research Grant Application.”

Information should be provided if the study involves:
  • people
  • personal data
  • human embryos and/or fetuses
  • human cells and/or tissues
  • animals
  • genetic resources
  • low-income countries
  • environmental health and safety
  • potential misuse of research results
  • other ethical issues

State that research ethics and the researcher's professional ethics will be followed.

Examples of some documents:

Research integrity
Ethics in Social Science and Humanities
A Code of Ethics for Folklore Studies
Personal data protection: GDPR, Estonian Personal Data Protection Act

Is it necessary to apply for ethics committee approval?

Who is responsible?

Here, describe whether the project collects personal data and how it is processed in accordance with the General Data Protection Regulation (GDPR) and the Estonian Personal Data Protection Act.

Who owns the data (personal rights and property rights)?
Data always has an owner, even when it is open data.

How is the data licensed?

Creative Commons licenses.

Excerpts from the copyright guidelines prepared by the University of Tartu lawyer Reet Adamsoo, which may and should be used when drafting a data management plan:

Data belong to the University of Tartu. The proprietary rights to the grant results, including data, are transferred to the university by the grant executors through an employment contract (academic staff) or another written document (intellectual property transfer agreement).

Data are published under the Creative Commons license CC-BY 4.0.

A third party whose data have been used to create the grant results may impose restrictions on the use of the data. In such cases, these restrictions must be taken into account when licensing the data, i.e., a license for data use can only be granted to the extent of rights permitted by the third party (i.e., the scope of rights that the university has obtained from third parties).

If the university or a third party whose data were used to create the grant results wishes to file a patent or utility model application to protect the results, the publication of the data must be postponed until the relevant application has been submitted.

Guide for data protection in research.

Secure Data Storage During Research

The goal is to maintain the technical and substantive quality of the data: availability (accessibility and reachability); integrity (accuracy, completeness, and timeliness); confidentiality (accessible only to authorized persons or systems, key management, retention of log files).

Storage: cloud environments, central servers, servers for sensitive data, computer hard drive, external hard drive, mobile devices.

Files containing personal data must not be stored in cloud environments whose headquarters' legal address is outside the European Union (e.g., Dropbox, Google).

Backup: creating a copy of the current state of data and/or programs that, after a security incident, allows restoration to that known state. Consider how often backups are made, how many copies are kept, and whether the process is automated. Preserve and back up the master file. Follow the 3-2-1 rule: three copies, on two different types of storage media, one of which is off-site. Decide who is responsible for backups, especially for mobile devices.
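The 3-2-1 rule can be sketched in code: keep the master file plus two backup copies, one of them off-site. In this illustrative example the "off-site" location is simply a second target folder, and the function name is invented:

```python
# Hypothetical sketch of the 3-2-1 rule: the working master copy plus two
# backups; in practice one target would be off-site storage.
import os
import shutil

def backup_321(master: str, local_backup_dir: str, offsite_backup_dir: str) -> list:
    """Copy the master file to two backup locations; return all three paths."""
    copies = [master]
    for target_dir in (local_backup_dir, offsite_backup_dir):
        os.makedirs(target_dir, exist_ok=True)
        # copy2 preserves file timestamps along with the content
        copies.append(shutil.copy2(master, target_dir))
    return copies
```

In a real setup this step would be automated (e.g., scheduled), and the off-site copy would live on separate infrastructure.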

A risk analysis is recommended. What happens if: IT systems fail, power outages occur, water or fire damage happens, a device is lost or stolen, malware is detected on devices, a team member leaves or dies, etc.

Risk assessment (likelihood and impact).

Risk evaluation: threats and their probability, vulnerabilities, measures.

Information security standard ISO/IEC 27001.

University of Tartu IT Helpdesk.

University of Tartu cybersecurity guidelines.

Data storage and backup options at the University of Tartu.

Who is responsible?

Management of access rights (whether everyone has the same rights, rights for contractual partners, rights for temporary staff).

Retention of log files.

Pseudonymization, encryption, key management.
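Pseudonymization can be implemented with a keyed hash (HMAC): the same identifier always maps to the same pseudonym, but the mapping cannot be reversed without the secret key, which is why key management matters. A minimal sketch, with an invented function name:

```python
# Hypothetical sketch: pseudonymize identifiers with a keyed hash (HMAC).
# The secret key must be managed separately from the data; whoever holds
# the key can re-link pseudonyms to identifiers.
import hashlib
import hmac

def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a stable pseudonym; irreversible without the secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```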

Data exchange, personal data, third countries.

Organizational and physical security: training for new employees, potential issues with departing employees, internal work regulations, fire safety, locking doors.

Responsible persons.

Long-Term Preservation of Data

Which data have long-term value? Their preservation and sharing for reuse.

Preparing data for sharing, FAIR data.

Choice of repository.

The data have a persistent identifier (DOI). See DataCite Estonia.

Metadata are in the DataCite registry.

Standard metadata, e.g., Dublin Core.

Machine-readable metadata.

Data and their metadata are in separate files, because data may be closed while metadata must be open. Files are linked to each other.

Keywords and subject terms.

Version control.

Repository where the data will be stored.

Which data are open access, i.e., open data.

Which data remain closed and for what reason.

Metadata must be open even if the data are not (exceptions, e.g., location data of rare species).

Technical metadata: required software (version), instrument specifications, software tools.

Are there encrypted data.

Authentication, from whom to request access rights.

Is it necessary to create a user account that is linked to certain conditions.

Mainly the responsibility of the repository.

Which data and metadata standards, controlled vocabularies, and taxonomies are used.

Descriptions of data types and data formats: if they are not standard, how interoperability is ensured.

Linking with other data, metadata, and specifications.

Correct citation of the datasets used.

Always include a citation format for your dataset.
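A citation string in the common "Creator (Year): Title. Publisher. Identifier" pattern can be assembled as follows; this sketch is illustrative, and the example values in the usage note are invented:

```python
# Hypothetical sketch: assemble a dataset citation in the common
# "Creator (Year): Title. Publisher. Identifier" pattern.
def dataset_citation(creator: str, year: int, title: str,
                     publisher: str, doi: str) -> str:
    """Build a human-readable citation with a resolvable DOI link."""
    return f"{creator} ({year}): {title}. {publisher}. https://doi.org/{doi}"
```

For example, `dataset_citation("Tamm, A.", 2024, "Bird Survey", "University of Tartu", "10.0000/example")` produces a single citation line ending in a resolvable DOI URL.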

Data exchange standards.

Partly the responsibility of the repository.

Add a README.txt file.

Specify whether these are raw data, cleaned data, or processed data.

Embargo period, justification.

Licenses: Creative Commons licenses 3.0 Estonia.

Citation: DataCite citation formatter.

Standard metadata, which (subject-specific) standards have been used.

Identification of data provenance (who collected it, where, for what purpose, where it is published, DOIs).

Which software version was used.

How long will data availability for reuse be guaranteed.

Ensuring data quality (availability, integrity, confidentiality).

Recommendations on who might need the data (in the README.txt file).

Data Sharing

Will the data be shared in a repository, as supplementary data alongside the article, or as a separate article in a data journal.

In which repository will the data be stored.

Who could benefit from these data.

How will you share your data (are they open data or must they be requested, under what conditions).

When will you share (continuously, after publication, after the embargo ends).

Is the data linked to the publication.

Link the data to your ORCID account.

Which data are open access, i.e., open data.

Which data remain closed and for what reason.

Are there encrypted data.

How is authentication carried out.

Who decides on access rights and signs the agreements.

Contact details of the data owner (think long-term!).

Responsibilities and Data Management Costs

By position:

  • Principal Investigator (PI): data management policy, drafting the data management plan, contracts, costs, training
  • Researchers: following and updating the data management plan, data management, raising issues
  • Data manager: training, consulting, information security, preservation, backup, hardware and software
  • Laboratory assistants and support staff: according to the tasks assigned to them

By workflow: who is responsible for data collection, documentation, metadata creation, information security, etc.

Example: TU Delft RD Policy.

Costs are mainly related to labor, hardware, and software.

Guidelines, training, retraining, legal and/or DPO consultation, translation service.

APC (article processing charge).

Data collection: data purchase, transcription of recorded interviews.

Digitization and OCR: hardware and software, labor.

Software development or software purchase, usage licenses.

Hardware: computers, servers, instruments, fieldwork equipment.

Data analysis: hardware and software, outsourced services, HPC.

Data storage and backup: projected data volume, 3-2-1 rule.

Long-term data preservation: preparation for sharing (formatting), anonymization, repository storage.

Partner meetings, conferences.

Project data manager.

A general guideline: about 5% of the project budget.

More detailed information about open data and data management plans can be found in the open materials course prepared by the University of Tartu Library: “Research Data Management and Publishing”.

Contact:
Tiiu Tarkpea, Senior Specialist for Research Data
Phone: 737 5728
Email: tiiu.tarkpea@ut.ee