December 1, 2021
One of the key principles of citizen science, compiled by the European Citizen Science Association, highlights the importance of making citizen science project data and metadata publicly available, preferably in an open-access format.
That’s precisely what we have been aiming for ever since the launch of the OSDG Community platform in March 2021.
With the launch of the OSDG Community Dataset (OSDG-CD), we mark a significant milestone in the journey of open-source research, as it is our first publicly available project result.
OSDG-CD is the direct contribution of hundreds of citizen scientists from all over the world who chose to give their time and effort to improve the understanding of the Sustainable Development Goals (SDGs). We are sincerely thankful to each and every volunteer who joined this community effort.
The dataset, available in Zenodo’s multi-disciplinary open repository, is accessible to any researcher or expert that wishes to gather insights into the SDGs using machine learning (ML) or ontology-based approaches.
The OSDG Community platform and OSDG-CD are primarily based on publicly available texts such as publications, reports and other written data sources. You can learn more about the dataset’s structure and the methodology behind the exercise by visiting the Zenodo repository or by following the dataset’s Digital Object Identifier (DOI): zenodo.org/record/5550238.
We made sure to anonymise the data and remove all instances of personal data. The OSDG-CD is a flat tabular dataset with the following data columns:
doi – Digital Object Identifier (DOI) of the source document
text_id – a unique text identifier
text – a text excerpt from the document
sdg – the specific SDG the text is validated against and that volunteers were asked to consider
labels_negative – the number of citizen scientists that rejected the suggested SDG label
labels_positive – the number of citizen scientists that accepted the suggested SDG label
agreement – agreement score.
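To make the column layout concrete, here is a minimal Python sketch of how such a flat table could be read and filtered down to texts that volunteers mostly accepted. It is an illustration under stated assumptions rather than the OSDG team's own code: the tab delimiter, the sample rows, the helper names load_rows and confident_positives, and the thresholds min_agreement and min_labels are all hypothetical, so adjust them to the file you actually download from Zenodo.

```python
import csv
import io

# Hypothetical sample rows that mimic the column layout described above.
# The values are invented for illustration and are not taken from OSDG-CD.
# The tab delimiter is also an assumption; change it to match the real file.
SAMPLE = (
    "doi\ttext_id\ttext\tsdg\tlabels_negative\tlabels_positive\tagreement\n"
    "10.0000/example.1\tid-001\tAccess to clean water remains limited.\t6\t1\t8\t0.78\n"
    "10.0000/example.2\tid-002\tThe committee met on Tuesday.\t4\t5\t4\t0.11\n"
    "10.0000/example.3\tid-003\tSolar capacity doubled in rural areas.\t7\t0\t6\t1.0\n"
)


def load_rows(handle, delimiter="\t"):
    """Read the flat table into a list of dicts and cast the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["sdg"] = int(row["sdg"])
        row["labels_negative"] = int(row["labels_negative"])
        row["labels_positive"] = int(row["labels_positive"])
        row["agreement"] = float(row["agreement"])
        rows.append(row)
    return rows


def confident_positives(rows, min_agreement=0.6, min_labels=3):
    """Keep texts where accepts outnumber rejects and agreement is high.

    Both thresholds are arbitrary illustrative defaults.
    """
    kept = []
    for row in rows:
        total = row["labels_positive"] + row["labels_negative"]
        if (
            total >= min_labels
            and row["labels_positive"] > row["labels_negative"]
            and row["agreement"] >= min_agreement
        ):
            kept.append(row)
    return kept


rows = load_rows(io.StringIO(SAMPLE))
selected = confident_positives(rows)
for row in selected:
    print(row["text_id"], row["sdg"], row["agreement"])
```

To run the same filter on a downloaded copy, open the file with open(path, newline="", encoding="utf-8") and pass the handle to load_rows in place of the in-memory sample.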
We make this data public in the hope of inspiring and enabling researchers to discover new insights into the SDGs and meaningful connections among them. As researchers and curators of the project, we would really like to hear about your insights and discoveries. Please consider informing us about your outputs, be it a research paper, an ML model, a blog post, or just an interesting observation.
We aim to update the OSDG-CD every quarter, provided there are notable changes in the number of labels.
The OSDG Community platform is our ambitious attempt to create a large and accurate source of textual information on the SDGs.
If you would like to learn more about the OSDG approach from a technical standpoint or find examples of text classification using OSDG-CD in practice, please visit our GitHub repository. If you have a technical background or an interest in data science, we welcome your contributions to the OSDG Labelling Tool. If you have any questions, do not hesitate to reach out to our team.
If you would like to find out more about the OSDG approach to open-source classification of text data to UN SDGs, refer to the article developed by Lukas Pukelis, Nuria Bautista Puig, Mykola Skrynik, and Vilius Stanciauskas, entitled “OSDG -- Open-Source Approach to Classify Text Data by UN Sustainable Development Goals (SDGs)”.
If you plan to use the dataset in a research paper, please cite it as follows:
OSDG, UNDP IICPSD SDG AI Lab, & PPMI. (2021). OSDG Community Dataset (OSDG-CD) (2021.09) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5550238.