How to publish UNIS data
The most important product of research is knowledge. At UNIS, most of that knowledge is derived from in-situ observations of nature during field work. To preserve this information, UNIS has teamed up with the Svalbard Integrated Arctic Earth Observing System (SIOS) to make scientific data available and easily accessible according to the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
Here you will find a step-by-step procedure to go from digital measurements to data availability via the SIOS data portal.
Data Collection, Preparation and Quality Control
Think ahead
It is a good idea to have publishing in mind from the moment acquisition takes place, so make sure you record where and when the data were obtained, the equipment used, conditions, etc., and whether the data can be classified as open (UNIS Data Classification). Also make sure your observations are properly calibrated and have proper units. Erroneous data should not be archived, and uncalibrated datasets should be clearly labelled as such in the metadata.
Consistent and structured files
Converting customised files to a FAIR-compliant data format can be time-consuming. However, time can be saved by ensuring that the files are well structured and populated in a consistent way.
- Be consistent in how you enter values.
- Don’t mix numbers and characters in the same column.
- Have a convention for missing values: either leave the cells blank or use a fill value that you document elsewhere in your file (see the sketch below).
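If you document a fill value, data users can handle it programmatically when reading your files. Below is a minimal sketch in Python, assuming pandas and a hypothetical file measurements.csv that uses -9999 as its documented fill value:

```python
# A minimal sketch: reading a CSV whose documented fill value is -9999.
# The file name and fill value are illustrative assumptions.
import pandas as pd

df = pd.read_csv("measurements.csv", na_values=[-9999])
print(df.isna().sum())  # number of missing values per column
```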
Use templates
If an individual uses the same template for multiple data collections, they can write some code to automate converting their data. If communities develop templates, code can be shared or software can be developed for converting the templates. Data in the templates themselves can also be shared and more easily understood by other members of the community. Both data sharing and conversion to CF-NetCDF or Darwin Core Archives can be further simplified if the CF conventions and/or Darwin Core terms are considered when designing templates.
If you will later be creating a Darwin Core Archive or CF-NetCDF file and you like to work with spreadsheets, consider using the Nansen Legacy template generator.
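To illustrate the point about automation, here is a minimal Python sketch assuming pandas and a hypothetical template.xlsx with a sheet named Data; the cleaning steps are placeholders for whatever your own template requires:

```python
# A minimal sketch: automating conversion from a spreadsheet template.
# File name, sheet name and cleaning steps are illustrative assumptions.
import pandas as pd

data = pd.read_excel("template.xlsx", sheet_name="Data")  # requires openpyxl
data.columns = [c.strip().lower() for c in data.columns]  # normalise headers
data = data.dropna(how="all")                             # drop empty rows
data.to_csv("converted.csv", index=False)
```

Because everyone using the template fills it in the same way, the same script works for every data collection based on it.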
Key features of the Nansen Legacy template generator
- Separate configurations to help you create either a Darwin Core Archive or CF-NetCDF file
- Select from a full list of Darwin Core terms and CF standard names to use as column headers
- Outlines required and recommended data and metadata terms
- Descriptions for each term as notes when you select a cell
- Cell restrictions to prevent you from entering invalid values
If the template generator helps you and your work, consider citing it just like you would a paper or data publication.
Luke Marsden, & Olaf Schneider. (2023). SIOS-Svalbard/Nansen_Legacy_template_generator: Nansen Legacy template generator (v1.01). Zenodo. https://doi.org/10.5281/zenodo.7993322
Suitable Data Formats
Why choice of data format matters
As a scientific community, we are now quite good at publishing our data in a way that makes them both findable and accessible. However, FAIR data must also be interoperable and reusable. Central to the FAIR principles is the requirement that data and metadata be fully readable and understandable by machines, a point emphasized throughout by Wilkinson et al. (2016). This machine-readability is crucial for building efficient and effective services on top of data at scale. Examples of services include the integration of multiple datasets, on-the-fly visualization of datasets, the development of monitoring systems and forecasting. This is increasingly important in the age of big data.
Some particularly noteworthy services that deserve attention are:
- Destination Earth: A project to develop a digital twin of the Earth.
- Global Biodiversity Information Facility (GBIF): An international network and data infrastructure that provides open access to data about all types of life on Earth, enabling research and informed decision-making in biodiversity conservation.
- Copernicus: The European Union’s Earth observation programme, whose Sentinel satellites provide comprehensive remote-sensing data for environmental monitoring, climate change analysis, and disaster management.
These projects exemplify the potential of FAIR data in enabling advanced research, integrated environmental monitoring, and comprehensive data-driven decision-making systems.
We should want the data we collect to contribute to services that benefit society. But to be effectively integrated into such services, data must comply with the FAIR principles.
Should all data be made FAIR-compliant?
This is a question that sparks some debate. Whilst it is relatively simple to publish certain data in FAIR-compliant data formats (e.g. time series of data, vertical profiles, biodiversity data), for some data it is more complicated (e.g. data from complex experiments, qualitative data). It is up to the data management community to make this process easier for scientists by providing support and developing tools to simplify the data transformation.
The UNIS data policy mandates that “All data are published in a self describing form … where possible. Where this is not possible a detailed product manual shall be linked to the dataset”.
We acknowledge that this is not always possible or practical. Let’s elaborate on this policy here.
- Data that can be published in a FAIR-compliant format should be, wherever plausible (i.e. it can be done within an acceptable time frame by an experienced user – new users should expect this to take more time).
- For complex datasets, if there are certain important variables that can be published in FAIR-compliant formats, these variables should be.
- Some data would take an excessive amount of time to publish in a FAIR-compliant data format, and the required investment simply isn’t worth it. In some cases the data don’t fit well into such formats; in others, suitable data formats are not available. Where that is true, we should aim for human-readable data.
If you are not sure whether your data should be published in a FAIR-compliant data format, email Luke Marsden.
What Makes a Data Format FAIR-compliant
To be considered FAIR-compliant, a data format must adhere to the following principles:
- Software Independence: Data formats should not be tied to specific software. FAIR-compliant formats must be accessible using various software applications across different operating systems.
- Inclusion of Metadata: All necessary metadata required to understand and utilise the data should be embedded within the file.
- Use of Controlled Vocabularies: Both data and metadata terms should be derived from controlled vocabularies. A controlled vocabulary is a well-documented glossary of terms that is accessible online and actively maintained. Each term should include:
- A clear description
- Instructions on how to use the term
- A unique identifier to distinguish it from other terms
- Standardised Structure: There should be defined conventions for the placement of data and metadata within the file. All files following these conventions should organise their data and metadata in the same way. The specific conventions adhered to by the dataset should be explicitly stated.
- Machine Readability/Interoperability: The aforementioned points are crucial in reducing the variability in file creation, thereby enhancing interoperability. This ensures that software or services can be developed to work with all datasets following the same conventions.
- Documentation: The data format and conventions should be well-documented, with comprehensive guidelines and examples provided to facilitate its use by others. Good documentation ensures that users can correctly interpret and utilise the data.
- Versioning: It is essential to maintain version control for data formats to track changes and updates over time. Additionally, data formats should ensure that past versions remain supported by future versions to guarantee long-term usability.
SIOS publishes interoperability guidelines that give details on suitable and less suitable data formats. In the sections that follow, we present a few data formats suitable for UNIS data and point to resources available to help you create them.
Please note that this list might not cover all suitable FAIR-compliant data formats. If you think that something is missing, please let us know. We also acknowledge that there may not be a suitable FAIR-compliant data format in all cases.
CF-NetCDF
Designed to facilitate the creation, access, and sharing of array-based scientific data with any number of dimensions.
Note that NetCDF files themselves are not necessarily FAIR-compliant unless they adhere to the Climate and Forecast (CF) conventions and the Attribute Convention for Data Discovery (ACDD).
It is not recommended to combine data from several stations in a single CF-NetCDF file. SIOS discuss the issue of granularity in their Granularity Perspectives document.
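As an illustration, here is a minimal Python sketch of writing a small CF/ACDD-style NetCDF file with xarray; the variable, values and attributes are illustrative only, and the attribute list is heavily abbreviated compared with what the conventions recommend:

```python
# A minimal sketch of a CF/ACDD-style NetCDF file (illustrative values only).
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2024-01-01", periods=24, freq="h")
temperature = 270 + 5 * np.random.rand(24)  # dummy hourly values in kelvin

ds = xr.Dataset(
    data_vars={
        "air_temperature": (
            "time",
            temperature,
            {"standard_name": "air_temperature", "units": "K"},  # CF variable attributes
        ),
    },
    coords={"time": times},
    attrs={  # ACDD discovery metadata (abbreviated)
        "title": "Hourly air temperature at a single station",
        "summary": "Illustrative example only.",
        "Conventions": "CF-1.8, ACDD-1.3",
    },
)
ds.to_netcdf("air_temperature_example.nc")
```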
Examples of data that can be published in a CF-NetCDF file
- Meteorological data: such as temperature, pressure, and humidity measured at different altitudes and times
- Oceanographic data: such as sea temperature, salinity, or the concentration of chlorophyll a or different nutrients.
- Model outputs: consisting of multi-dimensional data arrays representing various environmental parameters, perhaps over time.
- Environmental science: such as air quality, hydrology, and soil moisture
- Point clouds/DEMs/DTMs
- Some geological data: such as sedimentary logs, borehole measurements
There is an ongoing effort to promote the use of NetCDF files across more disciplines – a movement quickly gaining traction. NetCDF files are suitable for any array-oriented scientific data with any number of dimensions. Using the same data format across multiple disciplines facilitates cross-disciplinary collaboration; scientists from different fields can more easily access and understand each other’s data without being impeded by unfamiliar formats.
We may also see the CF conventions being applied in other file formats (e.g. JSON) in the future.
Training materials
- Introduction to CF-NetCDF: A general introduction to what CF-NetCDF files are and how to use them. This resource discusses different tools and software that can be used to work with NetCDF files, including some that don’t require you to use any programming languages.
- Working with NetCDF files in Python: A written tutorial series on how to use and create CF-NetCDF files using Python. Suitable for beginners who are relatively new to Python.
- Working with NetCDF files in R: A written tutorial series on how to use and create CF-NetCDF files using R programming. Suitable for beginners who are relatively new to R.
Validating that your file is FAIR-compliant
There are validators you can use to ensure that your files comply with the CF and ACDD conventions before you publish them (a simple pre-check sketch follows this list). For example:
- https://compliance.ioos.us/index.html
- https://sios-svalbard.org/dataset_validation/form (need a SIOS account)
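Before running a full validator, you can catch the most common omissions yourself. Below is a minimal Python sketch, assuming the netCDF4 library and a hypothetical file my_dataset.nc; the attributes checked are only a small subset of what ACDD recommends:

```python
# A minimal pre-check before running a full CF/ACDD validator.
# The file name is an assumption; the attribute list is a small subset of ACDD.
from netCDF4 import Dataset

REQUIRED_GLOBAL = ["title", "summary", "keywords", "Conventions"]

with Dataset("my_dataset.nc") as nc:
    missing = [a for a in REQUIRED_GLOBAL if a not in nc.ncattrs()]
    if missing:
        print("Missing global attributes:", ", ".join(missing))
    for name, var in nc.variables.items():
        if "units" not in var.ncattrs():
            print(f"Variable {name!r} has no units attribute")
```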
Peer review is also helpful. You are welcome to email Luke Marsden.
Darwin Core Archive
Darwin Core is a data standard originally developed for biodiversity informatics, though it has expanded to be useful for various types of data associated with one organism or a list of organisms. Darwin Core includes:
- Darwin Core terms: A controlled vocabulary of terms that are used for data and metadata within the archive. Each term can be a column header in the CSV files within the Darwin Core Archive (see the sketch below).
- Darwin Core Archive: A more-or-less FAIR-compliant data format.
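To make the column-header idea concrete, here is a minimal Python sketch of an occurrence table using Darwin Core terms as headers; pandas is assumed and the single record is entirely illustrative:

```python
# A minimal sketch: an occurrence table with Darwin Core terms as headers.
# The record is illustrative only.
import pandas as pd

occurrence = pd.DataFrame(
    {
        "occurrenceID": ["urn:uuid:0001-example"],  # persistent, globally unique
        "eventDate": ["2024-07-15"],                # ISO 8601 date
        "decimalLatitude": [78.22],
        "decimalLongitude": [15.65],
        "scientificName": ["Boreogadus saida"],
        "basisOfRecord": ["HumanObservation"],
    }
)
occurrence.to_csv("occurrence.csv", index=False)
```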
Examples of data that should be published in a Darwin Core Archive
- Biodiversity data
- Measurements or facts related to organisms (e.g. colour, leaf size, height)
- Fossil specimens
- Some experimental data
- Species lists derived from DNA data
We advise that you publish measurements of the physical environment (e.g. soil moisture content if quantitative, air temperature, wind speed) separately in CF-NetCDF files. These data are useful to people who are not necessarily interested in your “biological” data. Each publication can reference the other in the metadata so that someone interested in the data can see that they are related and how. You can even publish both files in a combined data collection with one citation if you wish.
Where to publish a Darwin Core Archive
The Global Biodiversity Information Facility (GBIF) hosts the largest collection of biodiversity data and is part of a larger network of services that share data with each other. This network includes the Ocean Biodiversity Information System (OBIS), Living Norway, iNaturalist and more. Since we should want our data to contribute to this network, we have several options for where to publish our Darwin Core Archives. The best two options for UNIS data are:
- The Norwegian Marine Data Centre (NMDC): The data centre of the Institute of Marine Research (IMR). Data published with NMDC are shared with OBIS and therefore with GBIF, and are also made available via the SIOS data portal. Contact datahjelp@hi.no to start the process. Marine data only – preferred option.
- GBIF Norway: The Norwegian participant node of GBIF. Contact helpdesk@gbif.no to start the process. Luke Marsden also has admin credentials to their Integrated Publishing Toolkit at the time of writing. Marine or terrestrial data.
Three steps for creating and publishing your Darwin Core Archive:
- Use the Darwin Core configuration of the Nansen Legacy template generator to prepare your data. It allows you to create structured spreadsheets containing separate sheets for the core and each extension. It provides you with requirements and recommendations for which terms you should include, as well as descriptions for each term.
- Send this completed template to NMDC or GBIF Norway. You can send this to Luke Marsden first/at the same time if you wish.
- They will help you to publish your data. This might involve using their node of the Integrated Publishing Toolkit. This software (with a user interface) can be used to convert your template into a Darwin Core Archive, include all your metadata, and publish the archive. The Integrated Publishing Toolkit includes a validator to check that your data are okay before publishing.
More information on each of these steps is provided in this document.
Human-readable data
Data that can’t plausibly be published in FAIR-compliant data formats should be published in a way that is human-readable and understandable. This means that people who are not experts in your field should be able to understand your data using only the files that you have published.
You should consider the following points when preparing your data:
- Use descriptive names for your variables. Don’t use abbreviations. Bonus points if you take terms from controlled vocabularies and refer to the URI of the term used in your publication, e.g. http://vocab.nerc.ac.uk/collection/P07/current/CFSN0023/ – clicking on this URI will give you a description for the term. You can search for terms in the NERC Vocabulary Server (see the sketch after this list).
- Don’t rely on cell formatting (colours, bold font, merged cells) in software like Excel to make your data understandable. This formatting is lost when data are loaded into other software like R, Python, Matlab etc.
- Use software-independent files. Someone should not have to download (and maybe pay for) extra software to read your data. CSV files are preferred to XLSX files.
- Using templates can help with consistency between datasets. Community templates can facilitate sharing of data across and beyond communities.
- You should publish a README file in PDF format alongside your publication that includes:
- Metadata related to each data file published (how they were collected, where, when, by whom, and what instruments or equipment were used). Consider following the Arctic Data Centre’s requirements for discovery metadata.
- A paragraph (or a few) describing your data. Don’t assume that the data user has or will read your paper.
- A description of what is included in each file published and how each file is formatted.
- Descriptions and units for each column header/variable name used. If applicable, include the URI that provides a link to a description of the term in a controlled vocabulary.
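As mentioned in the first point above, controlled-vocabulary URIs can be resolved programmatically as well as in a browser. Here is a minimal Python sketch using the NERC Vocabulary Server URI from that example; it assumes the requests library and that the server supports content negotiation for RDF:

```python
# A minimal sketch: resolving a controlled-vocabulary term URI.
# Assumes the requests library; RDF/XML content negotiation is assumed.
import requests

uri = "http://vocab.nerc.ac.uk/collection/P07/current/CFSN0023/"
response = requests.get(uri, headers={"Accept": "application/rdf+xml"}, timeout=30)
response.raise_for_status()
print(response.text[:500])  # the term's definition is embedded in the RDF
```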
Verification and Review
Perform an internal review of your dataset before final submission. Just as with any other publication, all authors should have a chance to review the dataset and grant their approval before it is published.
Also, please make sure to run your data file through a validator if possible to ensure that it adheres to the conventions that you are stating that it should. See the section above for validators for specific file formats.
Selecting a Data Centre
What makes a good data centre?
There are hundreds of data centres. Where should you publish your data? A good data centre should:
- Provide your dataset with a DOI, unique to your dataset.
- Ensure that your data will remain available through time. Data centres can achieve this by running routine checks that open and close files to verify they have not been lost or corrupted. Some data centres (e.g. Zenodo) don’t offer this service.
- Make your data as easy to find as possible.
Let’s elaborate on that last point. Data are important in their own right, independent of any associated paper publication. It can be surprising which datasets get used again. Furthermore, you should not consider your dataset in isolation. Your data are your contribution to a much larger collection of similar data that someone might want to use altogether. This is particularly relevant in the age of big data.
A good data centre should require you to provide thorough discovery metadata (metadata that helps someone find the data through a search engine – e.g. when and where the data were collected, by whom, some keywords). When assessing a data centre, you can test this yourself. How easy is it for you to speculatively find data from a certain region, collected in a certain year, for example?
There are hundreds of data centres. It is impractical for a potential data user to search through all of them. Data access portals aim to make data from different data centres available through a single searchable catalogue. According to the UNIS data management plan, all UNIS data should be made available via the SIOS data access portal, which is a catalogue of data relevant to Svalbard. SIOS does not host any data themselves; they harvest metadata from contributing data centres to provide links to the data.
Other data portals also exist. If your data are published with a data centre that contributes to SIOS, they might also be available via other data portals.
Data centres that contribute to SIOS
The best and easiest way to make your data available via the SIOS data access portal is to publish to data centres that contribute to SIOS. These are listed in the section “Allocation of resources” of the SIOS data management plan. Of the data centres that contribute to SIOS, Norwegian data centres should be prioritised for UNIS data.
NIRD Research Data Archive
- How to start process: Data collection form on their site. For datasets too large to upload, begin the data collection form and email them to arrange data transfer.
- Email address: archive.manager@norstore.no
- How to make data available via SIOS: Metadata collection form (see this section)
- Use for: Any data
Norwegian Marine Data Centre (IMR)
- How to start process: Email them
- Email address: datahjelp@imr.no
- How to make data available via SIOS: Automatic
- Use for: Any marine data
Norwegian Polar Data Centre (NPI)
- How to start process: Email them
- Email address: data@npolar.no
- How to make data available via SIOS: Automatic
- Use for: Any data
Arctic Data Centre (MET)
- How to start process: Email them
- Email address: adc-support@met.no
- How to make data available via SIOS: Automatic
- Use for: Any data that contributes to the objectives of MET (e.g. meteorology, physical oceanography, sea ice, remote sensing, air pollution, models and climate analysis).
EBAS (NILU)
- How to start process: Email them
- Email address: ebas@nilu.no
- How to make data available via SIOS: Automatic
- Use for: Atmospheric composition data
Exceptions – publishing your data elsewhere
Data can be published to the data centres listed above in most cases. However, some data services are the established default for certain types of data. Crucially, people actively use these services to look for data for research or for making data-driven decisions. We want UNIS to contribute data to these services.
If you feel we have missed something, please email us.
Global Biodiversity Information Facility (GBIF)
See the section on Darwin Core Archives above.
Sequence data
The International Nucleotide Sequence Database Collaboration (INSDC) is a global collaboration of independent governmental or non-profit organisations that manage nucleotide sequence databases.
Participating databases include the European Nucleotide Archive (ENA) and GenBank.
Making your data available via SIOS
If you publish your data to a data centre that SIOS harvests from, your data will automatically be made available via the SIOS data access portal – though if in doubt, mention this when emailing the data centre to send them your data.
Data published to a data centre that does not contribute to the SIOS data access portal must be linked manually using a metadata collection form hosted by SIOS. Please note that SIOS are not able to create services on top of data linked manually to SIOS (e.g. visualisation, aggregation). Only a link to the data will be provided.
NIRD Research Data Archive
Note that at the time of writing, NIRD RDA is not being harvested by SIOS. This is likely to change in the near future.
There will soon be a way of publishing CF-NetCDF files to NIRD RDA through NorDataNet (https://www.nordatanet.no). This will:
- Retrieve most of the discovery metadata from the CF-NetCDF file so you don’t have to enter it again.
- Make the data available via SIOS
To see whether this tool is now available:
- Visit https://www.nordatanet.no
- Go to Submit data in the navigation bar at the top
- Select Submit data as NetCDF/CF
If this service is now available, you will be prompted to upload your data. If not, you will be redirected to another page.
Citing your data (or someone else’s)
Include the DOI in your publications and share it with collaborators. You should include the citation for your data in your list of references, just as you would any other publication.
Some journals require you to include a data availability statement. This is fine, but this should be as well as (not instead of) including the citation in your list of references.
Properly citing datasets is important for several reasons:
- Credit to data creators: Provides proper credit to the data providers in a way that academia recognises.
- Tracking data use: Some data centres scan through the references of publications for the DOIs of the datasets they host to provide statistics on data use on the landing page of the dataset. Tracking the impact and reach of datasets provides feedback to data creators.
- Easier to find the data: Providing the full citation makes it easier for someone to find the dataset, which is crucial for replicating a study or reusing the data for other purposes.
- Legal and ethical responsibility: Fulfills legal and ethical obligations to acknowledge the original data sources.
- Supporting data sharing initiatives: Encourages the culture of data sharing by showing that datasets are valuable and acknowledged.