Dataset Preparation

One of the most important considerations when preparing a dataset is how to make it FAIR (Findable, Accessible, Interoperable and Reusable; see FAIR Principles). FAIR data is increasingly stipulated by funding bodies, and it also improves the dataset you are creating by making it more likely to be re-used. A checklist for assessing the FAIRness of your data is available: Jones, S. & Grootveld, M. (2017, November). How FAIR are your data? DOI:10.5281/zenodo.1065990.

[Figure: FAIR Data Principles. Source: "FAIR Principles", The Open Science Training Handbook. Used under a CC-Zero 1.0 licence.]

To prepare a dataset for sharing, several stages need to be carried out:

  • Preparing your data files
  • Documenting your data and files
  • Depositing in a data repository
  • Updating datasets

Preparing your data files

Interoperability is one of the four pillars of FAIR data and there are several measures that can be used to ensure highly interoperable data.

Enhancing Interoperability & Reusability
  • Avoid image-only datasets for quantitative results. Plotted figures can be part of the dataset but should not be the whole dataset.
  • Use machine-readable formats for datasets
    • Use preferred formats where possible (see Preferred File Formats)
    • Structured data is preferable - easily defined and understandable
    • Use sensible variable names, e.g. force, torque are better than x, y
    • Avoid spaces or special characters in variable names (see File Naming Recommendations)
    • Don’t use commas as decimal separators in numbers, as they can easily be mistaken for field delimiters in CSV files, rendering the data unusable.
      • Format numbers as 675454453.00 or 0.00007654
  • Store dataset metadata in a file format that is both machine-readable and human-readable; never use proprietary formats for metadata files. Suitable formats include:
    • json
    • ascii text
    • yaml / yml
    • toml
    • xml
    • csv
  • For visualisation of datasets and possible sharing of DEM data, file formats such as .vtk/.vtu are well supported
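
The recommendations above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the column names, the fixed eight-decimal precision, the file name and the metadata keys are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical column names: descriptive, with units, no spaces or special characters.
FIELDNAMES = ["time_s", "force_N", "torque_Nm"]


def rows_to_csv_text(rows):
    """Serialise a list of dicts to CSV text with '.' as the decimal separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # The 'f' format spec is not locale-aware, so it always emits '.'.
        # Eight decimal places is an assumed precision for this example.
        writer.writerow({name: f"{row[name]:.8f}" for name in FIELDNAMES})
    return buffer.getvalue()


def metadata_to_json_text(metadata):
    """Serialise dataset metadata to indented, human-readable JSON."""
    return json.dumps(metadata, indent=2, sort_keys=True)


rows = [
    {"time_s": 0.0, "force_N": 675454453.0, "torque_Nm": 0.00007654},
    {"time_s": 0.5, "force_N": 675454460.25, "torque_Nm": 0.00007701},
]
metadata = {
    "title": "Example force and torque measurements",  # placeholder value
    "data_file": "force_torque_measurements.csv",      # placeholder value
    "columns": {"time_s": "s", "force_N": "N", "torque_Nm": "N m"},
}

csv_text = rows_to_csv_text(rows)
json_text = metadata_to_json_text(metadata)
```

Writing `csv_text` and `json_text` to two files side by side gives a structured data file plus a metadata file that is both machine-readable and human-readable. Choose a precision that suits the data rather than the fixed value assumed here.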

Documenting your data and files

Dataset Accompanying Metadata
The README File - Putting the R in FAIR

The README file is an important file that is distributed with your dataset. It conveys key information about the dataset and its structure so that your dataset can easily be re-used. It will typically contain a description of the dataset, and will also provide information on how best to use the data, what tools were used in the preparation of the dataset, its provenance (when and where it was collected/generated) and the licence under which it is being shared.

The README file should be in plain-text format and should be well organised and human-readable. Never use a proprietary format, although PDF can be acceptable. Markdown files (*.md) have become the default format for README files in many repositories because they are human-readable and also support syntax that can easily be rendered online or converted to PDF.

Recommended content for a README file:

  • Title for dataset
  • Keywords
  • Investigator / Contact person
  • Collection/Generation Timeframe
  • Methods of collection/generation
  • Deviations from standards
  • Description of dataset:
    • Filename(s) - this can sometimes be created as a separate index file
    • File structure
    • etc.
  • License

You may also want to include instructions on how to cite your dataset (the DOI and/or path to the repository) within the README file.
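
As a rough sketch, a Markdown README covering the recommended content could be generated from a small dictionary. The function name `build_readme`, the dictionary keys and all of the example values below are hypothetical placeholders; adapt the sections to your own dataset.

```python
def build_readme(info):
    """Return Markdown README text covering the recommended sections."""
    # One bullet per data file, so the file listing doubles as a simple index.
    file_lines = "\n".join(
        f"- `{name}`: {description}" for name, description in info["files"].items()
    )
    sections = [
        f"# {info['title']}",
        f"**Keywords:** {', '.join(info['keywords'])}",
        f"**Contact:** {info['contact']}",
        f"**Collection timeframe:** {info['timeframe']}",
        f"## Methods\n\n{info['methods']}",
        f"## Deviations from standards\n\n{info['deviations']}",
        f"## Description of dataset\n\n{file_lines}",
        f"## Licence\n\n{info['licence']}",
        f"## How to cite\n\n{info['citation']}",
    ]
    return "\n\n".join(sections) + "\n"


# All values below are placeholders for illustration only.
example_info = {
    "title": "Example force and torque measurements",
    "keywords": ["force", "torque", "example"],
    "contact": "Name of contact person, contact address",
    "timeframe": "Start date to end date",
    "methods": "Describe how the data were collected or generated.",
    "deviations": "None known.",
    "files": {"force_torque_measurements.csv": "time, force and torque columns"},
    "licence": "CC-BY-4.0",
    "citation": "Add the DOI and/or repository path once the dataset is deposited.",
}

readme_text = build_readme(example_info)
```

The resulting text can be saved as `README.md` alongside the data files. Keeping the README inputs in a dictionary also makes it straightforward to reuse the same values in a machine-readable metadata file.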

Depositing in a data repository

Once the dataset is prepared, the next stage is sharing your data publicly. There are many ways of doing this.

Choosing a licence for your dataset

It is important that any data being shared is provided with a valid licence that tells the user what they are allowed to use the data for. Licensing considerations for software and datasets are quite different, and for this reason typical open source software licences may not be suitable for sharing datasets.

Licence types for Machine Learning (ML) models also fall into a separate category and should be considered different to software and data licences.

Typical suitable licences for data:

  • Creative Commons (default on Zenodo)
    • Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA-4.0)
    • Creative Commons Attribution 4.0 (CC-BY-4.0)
  • Open Data Commons
    • Open Data Commons Attribution License (ODC-By-1.0)
    • Open/Non-Commercial Government Licence
  • Public Domain

Licensing can be complicated; if you need assistance with selecting an appropriate licence, you should speak to an expert within your organisation. Your funding provider or organisation may also have its own requirements on which licence can be used for sharing data.
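
One way to make the chosen licence machine-readable is to record its short identifier in the dataset metadata and check it before deposit. The sketch below is only an illustration: the allow-list contains just the identifiers named above, and the function name `check_data_licence` and the `licence` metadata key are assumptions. It is not a substitute for advice from your organisation or funder.

```python
# Identifiers taken from the list of typical data licences above.
DATA_LICENCES = {
    "CC-BY-SA-4.0": "Creative Commons Attribution Share-Alike 4.0",
    "CC-BY-4.0": "Creative Commons Attribution 4.0",
    "ODC-By-1.0": "Open Data Commons Attribution License",
}


def check_data_licence(metadata):
    """Return the full licence name, or raise ValueError if it is missing or not listed."""
    licence_id = metadata.get("licence")  # hypothetical metadata key
    if licence_id is None:
        raise ValueError("Dataset metadata does not declare a licence.")
    if licence_id not in DATA_LICENCES:
        raise ValueError(
            f"'{licence_id}' is not in the list of data licences; "
            "check whether it is suitable for datasets."
        )
    return DATA_LICENCES[licence_id]


licence_name = check_data_licence({"licence": "CC-BY-4.0"})
```

A software licence identifier such as `MIT` would raise a `ValueError` here, prompting a second look at whether the licence is appropriate for data.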