Joint statement by CETAF, SPNHC and BHL on DATA within scientific publications: clarification of [non]copyrightability

The EU and other states have made legislative efforts to clarify data mining in copyrightable works, but the situation remains obscure and confusing, especially in a globalised field where international legislation can contribute to opacity. The present paper aims to assert a common position of three communities representing biodiversity sciences and data specialists on this issue and to propose common best practice guidelines so that they become universally accepted rules. As scientific data users, we take the standpoint that scientific data are not copyrightable and, furthermore, that they can be accessed, shared and reused freely.


Introduction
This paper is the outcome of a workshop organised in October 2022 during the annual meeting of TDWG, the Biodiversity Information Standards organisation, held in Sofia, Bulgaria. The workshop was jointly organised by members of the Biodiversity Heritage Library (BHL), the e-Publishing working group of the Consortium of European Taxonomic Facilities (CETAF) and the Society for the Preservation of Natural History Collections (SPNHC) and supported by the Biodiversity Community Integrated Knowledge Library (BiCIKL; Penev et al. 2022) project. The focus of the workshop was on the legal and contractual rules governing data within copyrighted works. The goals of its recommendations are to empower the biodiversity sciences and data community, including publishers, authors and users, to use appropriate legal and contractual licences and language that will allow data to be reused. A more in-depth discussion is provided by the authors (Buschbom in press). Building on this, the aim is to develop a common vision and a way forward that will allow and accelerate the extraction and reuse of data contained within publications, both legacy and prospective.
Clarifying the legal, ethical and socio-cultural contexts of FAIR (Findable, Accessible, Interoperable and Reusable) data (Wilkinson et al. 2016), we recommend a set of best practices that provide legal clarity, as well as attribution, transparency and accountability for the extraction and reuse of often high quality and information-rich biodiversity data from copyrighted works, specifically scholarly publications. Such data can be integrated into the body of the publication itself, for example, in the form of free text, tables, images or identification keys, or attached to it as supplementary datasets.
The proposed set of recommendations builds on existing frameworks, for example, the Bouchout Declaration on Open Biodiversity Knowledge Management (Anonymous 2014), the "GEO Statement on Open Knowledge" (Group on Earth Observations 2021), the "Recommendation on Open Science" (UNESCO 2021), the "Recommendation of the Council on Enhancing Access to and Sharing of Data" (OECD 2021) and the CARE principles (Collective benefit, Authority to control, Responsibility, Ethics; Carroll et al. 2020). This set of recommendations considers existing discussions of copyright-associated questions in scientific contexts (e.g. Watanabe 2018; European Commission, Directorate-General for Research and Innovation and Angelopoulos 2022). The proposed recommendations reinforce existing best practice guidelines (Ball 2014; Patterson et al. 2014; Egloff et al. 2016, Egloff et al. 2017; Bénichou et al. 2018, Bénichou et al. 2021, Benichou et al. 2022) in use by the biodiversity sciences and informatics community and adapt them to the evolving legal landscape and changing global policy contexts of the ongoing digital transformation.

Description of the problem
Currently, most small publishers, specifically institutional or learned society journals in the natural sciences sector, express concerns related to copyright and are uncertain if they are allowed to share data contained within a published paper without a clear statement from the author. Similarly, many authors are unaware of whether or not they retain copyright for their text and data in publications. Finally, legal uncertainty and cumbersome, even unmanageable, procedures for extracting data from publications widely persist, negatively affecting the productivity of biodiversity scientists and data managers who are interested in, and dependent on, the reuse of data published in scholarly publications and digital infrastructures. Unclear rights and obligations form a substantial obstacle to the effective interlinking of data and, thus, to scientists' and data managers' work.
While scientific publications, by default, are works protected by copyright, scientific data are not copyrightable. Their form is dictated by applicable standards, technical capacity and scientific good practices, which means that data in themselves are neither the result of creative choices nor expressive elements of a work made by the author(s). Furthermore, the copyright protection of a publication refers to the work, not to the data contained in it (499 U.S. 340 1991, Feist vs. Rural, U.S. Supreme Court 1991; Gervais 2019).
Liberating data from existing publications therefore means, from a copyright point of view, extracting unprotected data from protected works, often referred to as text and data mining. We understand text and data mining as "any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes, but is not limited to patterns, trends and correlations", as defined in Art. 2 n. 2 EU Directive 790/2019 (European Parliament and Council 2019). As this automated procedure includes the reuse of the protected work (as do some manual approaches), access to and reuse of the work requires authorisation. This authorisation can be given by contractual licence or by legal licence. Legal licences can be compulsory (i.e. they are applicable even where the parties concerned have stipulated otherwise) or subsidiary (i.e. they are only applicable as far as the parties have not stipulated otherwise).
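The distinction between compulsory and subsidiary legal licences can be restated as a minimal sketch. The names used here (`LegalLicence`, `legal_licence_applies`) are hypothetical and only encode the two definitions given above; the sketch does not model any particular national law.

```python
from enum import Enum


class LegalLicence(Enum):
    """The two kinds of legal licence distinguished in the text."""
    COMPULSORY = "compulsory"
    SUBSIDIARY = "subsidiary"


def legal_licence_applies(kind: LegalLicence,
                          parties_stipulated_otherwise: bool) -> bool:
    """Return True if the legal licence applies (illustrative only).

    A compulsory licence applies even where the parties concerned have
    stipulated otherwise; a subsidiary licence applies only as far as
    the parties have not stipulated otherwise.
    """
    if kind is LegalLicence.COMPULSORY:
        return True
    return not parties_stipulated_otherwise
```

Under this reading, a contractual clause excluding extraction changes the outcome only where the legal licence is subsidiary.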
The EU Directive 790/2019 has introduced two legal licences referring to text and data mining: Art. 3 obliges every Member State to introduce into its national copyright law a compulsory legal licence for text and data mining for the purposes of scientific research conducted by recognised research organisations and cultural heritage institutions. Art. 4 obliges them to introduce a subsidiary legal licence for any form of text and data mining for any other purpose.
As a result, copyright legislation currently presents a legal divide: in the EU, extracting data from publications for the purposes of scientific research is allowed by law. This authorisation prevails over any contractual agreement and also over any licences attached to the work (for example, CC licences). In the US, the same procedure may require a contractual licence, unless the conditions for "fair use" are satisfied. In the rest of the world, the legislation differs from country to country.
In Switzerland, extracting data from publications has been allowed by legal licence since a revision of the Swiss copyright law in 1992 (SR 231.1 1992). This is why Plazi has based its extraction workflow in Switzerland. Systematic extraction of taxonomic data from scientific publications started in 2009. Since 2013, the extracted data have been deposited in the Biodiversity Literature Repository in Zenodo, a general-purpose open repository developed under the European OpenAIRE programme and operated by CERN (Conseil européen pour la Recherche nucléaire). There has never been any dispute referring to an alleged copyright infringement.
Beyond copyright, it is good scientific practice to attribute extracted data to the source of extraction (Wilkinson et al. 2016; EOSC 2023). Once legally extracted, data can be reused freely. Some restrictions may apply from other protection schemes such as those concerning the protection of national security, the right of privacy and the protection of endangered species. However, we would point out that attribution and credit should not be confused with copyright. From a copyright point of view, extracted data can be reused worldwide without further authorisation.
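The practice of attributing extracted data to their source can be illustrated with a small sketch. The record layout and the names `ExtractedDatum`, `source_citation`, `source_locator` and `attribution_line` are assumptions made for illustration only; they do not reproduce the format of any existing repository or extraction workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedDatum:
    """One extracted datum kept together with its provenance (hypothetical layout)."""
    value: str            # the extracted datum itself
    source_citation: str  # the publication from which it was extracted
    source_locator: str   # hypothetical: page, table or figure in the source


def attribution_line(datum: ExtractedDatum) -> str:
    """Format the datum with its source, as a matter of scientific credit."""
    return (f"{datum.value} (extracted from "
            f"{datum.source_citation}, {datum.source_locator})")


# Placeholder values only; no real publication is referenced here.
example = ExtractedDatum(
    value="<extracted datum>",
    source_citation="<citation of the source publication>",
    source_locator="<page or table>",
)
print(attribution_line(example))
```

Keeping the source alongside each datum records attribution and credit without implying any copyright in the datum itself.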
As with existing legacy publications and data contained within them, it is important for authors and publishers to be aware of the legal situation and the differentiation between the copyright concerning the publication as a whole and copyright of the data within it, as these are matters that are independent of each other.
Thus, journal articles and books as a whole are and remain assets protected by copyright laws and regulations. Therefore, the business foundation of publishers and the business intelligence represented by their portfolios are not affected by the recommendations presented below. These consider solely the scientific data present in the publications.
Experiences with existing publications and data contained in them demonstrate that they often do not have clear copyright and licence information enabling and supporting reuse associated with them. This can require intense background research for each publication about which data are to be used within a research, digitisation or data interlinking infrastructure project. At the end of such inquiries into the legal status, it is not uncommon that questions and uncertainties still remain.
Even if the legal conditions associated with publications and data within them are easily accessible and clearly stated, specifically in investigations utilising many resources from multiple, divergent scientific backgrounds and including various data types, the individual source publications and their data might fall under a wide range of (national) copyright contexts and licence statements implicating the rights and obligations of publishers, authors and users. This creates a patchwork of distinct and divergent conditions, which are difficult to navigate for researchers assembling large datasets.
Looking forward, a solution to the current often ambiguous and patchy situation in the publishing landscape is to explicitly designate scientific data within publications as open and freely reusable, which will result in harmonisation and increased availability of machine-actionable data.

Recommendations
The proposed set of recommendations focuses on the copyright law aspects and scientific best practice norms for accessing and reusing data from scholarly works. The recommendations clarify and adapt existing best practice guidelines in use by the biodiversity sciences and informatics community to the evolving legal landscape and changing global policy contexts for digital information, as well as data needs for answering today's challenges. As societies and associations, we recommend that:
1. authors and publishers make copyrighted publications as accessible as possible by waiving copyright (CC0) or publishing with a CC-BY licence;
2. authors and publishers explicitly state that they consider scientific data as not copyrightable. Best practice is to set the contents of their publications, be it data, drawings, media objects etc. (see the Blue List below), into the public domain by attaching a public domain mark that provides certainty about their reusability;
3. publishers use a publishing technique supporting automatic text and data mining (Agosti et al. 2022);
4. authors state as clearly and comprehensively as possible the provenance of their data, the authors of previous works cited and, for works having more than one author, the respective contributions of all co-authors.
It is best practice in scientific communities to work on the basis of scientific norms that exist independently of the legal realm with its laws, regulations, licences and agreements. These scientific norms exist in the form of well-established best practice approaches to scientific processes and an overarching community code of conduct. Our practices and codes state that data are not owned, but represent a common achievement, to be made openly and freely accessible and available, and to be shared and reused for fostering scientific inquiry and progress as contributions to the public good (Kalkman et al. 2019; Salwén 2021). Data sharing and its associated comprehensive attribution form an important component of the unwritten though widely agreed norms, practices and codes that are in place for fostering transparency, reproducibility and accountability.
As a wider scientific community, it is important to reiterate that the data contained in a scientific publication are freely extractable and reusable. This holds true, in particular, for those parts of the text that form the basis of a taxonomic treatment, as formerly described in the Blue List established by Patterson et al. (2014):
[...]
19. Photographs (or other image or series of images) by a person or persons using a recording device, such as a scanner or camera, whether or not associated with light- or electron-microscopes, using X-rays, acoustics, tomography, electromagnetic resonance or other electromagnetic sources, of whole organisms, groups, colonies, life stages especially from dorsal, lateral, anterior, posterior, apical or other widely used perspectives and designed to show overall aspect of organism*;
20. Photographs (or other image or series of images) by a person or persons using a recording device, such as a camera associated with light- or electron-microscopes, using X-rays, acoustics, tomography, electromagnetic resonance images or other electromagnetic sources, of parts of organisms, such as, but not limited to appendages, mouthparts, anatomical features, ultrastructural features, flowers, fruiting bodies, foliage, intra-organismic and inter-organismic connections, of compounds and analyses of compounds extracted from organisms that demonstrate the characteristics of an individual or taxon and/or allow comparison amongst individuals/taxa;
21. Photographs (or other images or series of images) of whole organisms, groups, colonies, life stages, parts of organisms made by camera or scanner or comparable devices using automated procedures;
22. Drawings of organisms or parts of organisms made by a person or persons to demonstrate the characteristics of an individual/taxon or to allow comparisons amongst taxa;
23. Graphical/diagrammatic representation (such as, but not limited to, scatter plots with or without trend lines, histograms or pie charts) of quantifiable features of one or more individuals or taxa for the purposes of showing the characteristics or allowing comparison of individuals or taxa and made by a person or persons.