Corresponding author: Daniel Mietchen (
Academic editor:
Knowledge workers like researchers, students, journalists, research evaluators or funders need tools to explore what is known, how it was discovered, who made which contributions, and where the scholarly record has gaps. Existing tools and services of this kind are not available as Linked Open Data, but Wikidata is. It has the technology, active contributor base, and content to build a large-scale knowledge graph for scholarship, also known as WikiCite. Scholia visualizes this graph in an exploratory interface with profiles and links to the literature. However, it is just a working prototype. This project aims to "robustify Scholia" with back-end development and testing based on pilot corpora. The main objective at this stage is to attain stability in challenging cases such as server throttling and handling of large or incomplete datasets. Further goals include integrating Scholia with data curation and manuscript writing workflows, serving more languages, generating usage stats, and documentation.
The project is funded by the Alfred P. Sloan Foundation under grant number G-2019-11458.
Data Science Institute, University of Virginia
Information on ethics or security was not required but we plan to explore these issues nonetheless.
This document represents a slightly edited version of the original proposal that was submitted to the Alfred P. Sloan Foundation on February 8, 2019. The original proposal then underwent peer review, as a result of which the proposed project "Robustifying Scholia" was funded. We plan to document this review process as well.
In comparison to that original proposal, the current version differs mainly in that it has an abstract (taken from the cover letter) as well as more space for references, tables, and figures.
This project seeks to strengthen the infrastructure behind using Wikidata (
Linked Open Data is a key element in putting the idea of a semantic web into practice (
While the marketplace offers a range of services with overlapping functionality, none of the available options are community-curated, most use proprietary code, and those few that are based on free and open-source software make use of non-open data, which impedes further aggregation and reuse. We strive to apply Wikimedia values of openness to this problem by developing Scholia as an entirely free and open alternative to the competition, focusing on collaborative curation rather than seeking to capture and contain user contributions.
The major problem with Scholia at the moment is that it is a beta prototype which works, but has not had the necessary development to make its infrastructure ready for Wikimedia-scale use. In this project, we seek to develop Scholia into version 1.0 for pilot communities in anticipation of the approaching opportunity to adapt it for use on a global scale for all areas of research (
Table
This project seeks to assemble key data corpora for narrow use cases in
Competing products either restrict data export or require software installation to present data visualizations of the kind that Scholia provides in any modern browser using just JavaScript and Wikidata.
Products which perform comparable functions to Scholia
In comparison, Scholia commits entirely to being free and open regarding both its software (which is available under the
Wikipedia has defined minimum standards for access to information. Similarly, Scholia will not compete with the most costly features of alternative products, but instead seeks to raise the world's minimal expectations regarding knowledge discovery, research assessment, and related activities.
The proposers are qualified because they include the three people who established Scholia as a working beta tool, have the academic background to credibly share Scholia with research institutions, and already have insider credibility as community members within the larger
Complementary to Scholia itself, the ecosystem of media around the tool includes the team's own documentation of
The
The Scholia project follows the Wikimedia custom of agile development, in which each improvement is an incremental change that makes a permanent impact on both the function and the published record of the product. The changes this project proposes are independent of each other, and will be addressed in parallel and deployed into the working tool without disrupting its user base.
The goal of such infrastructure development is to create channels for a community-crowdsourced feedback loop of contribution of content, comments, criticism and improvements. As with any Wikimedia project, the large majority of the content development and labor which improves Scholia comes from users with no direct relationship to the core team of developers. Each of the individual plans for development seeks to lower a barrier which users must overcome to participate in the workflow of using Scholia and contributing content to develop it further.
1. Back-end development and testing with
This project requires systematic testing of Scholia
While these technical test sets may be of limited use or interest outside the Scholia development team, the systematic testing of Scholia’s limits can also help identify circumstances where the tool works well, and in conjunction with usage information, we can then start to build pilot datasets like the
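One of the challenging cases this project names is server throttling. The following is a minimal sketch of how a client could retry a throttled request with exponentially growing waits. It is not Scholia's implementation: `ThrottledError`, `call_with_backoff`, and `make_flaky_request` are hypothetical names introduced here for illustration, and the stand-in request makes no network calls.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical error raised when a server asks the client to slow down."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
    raise ValueError("max_attempts must be at least 1")


def make_flaky_request(failures: int) -> Callable[[], str]:
    """Build a stand-in request that is throttled `failures` times, then succeeds."""
    remaining = {"count": failures}

    def request() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ThrottledError("too many requests")
        return "query results"

    return request


# Record the waits instead of sleeping, so the example runs instantly.
waits = []
result = call_with_backoff(make_flaky_request(2), sleep=waits.append)
```

Passing the `sleep` function as a parameter keeps the retry logic testable without real delays, which matters when such cases are exercised systematically.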
2. Redesign of the Scholia user interface for better
The existing beta version of Scholia is functional but needs to be brought in line with Wikimedia standards for usability at a global scale, using accessible design whenever possible. Wikimedia projects already have multilingual infrastructure into which we can integrate Scholia to share its language interface with other Wikimedia translation efforts, particularly through Wikidata.
This project will seek review from a user experience professional to ensure that Scholia meets contemporary standards for usability, including accessibility in design.
3. Improving integration with WikiCite curation workflows, e.g. around
When information is lacking in Wikimedia projects, signals such as "citation needed" invite users to contribute, enrich, and critique the available information. In the Wikimedia way, as Scholia presents and visualizes information, it also identifies content gaps.
Currently, Scholia is primarily a tool for visualization of Wikidata content. Separately, the WikiCite project has tools for data curation on Wikidata, such as the
4. Enhancing Wikidata-based reference management for scholarly writing workflows
Besides the visualizations, Scholia has a number of additional functionalities, e.g. for entity recognition or reference management. The latter is due to a Python library that processes
This mechanism, while functional, needs to be made more comprehensive and robust. If that is achieved and its usage scaled up, this would provide a compelling way for the community of BibTeX users to share the curation of their metadata through Wikidata.
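To make the idea of Wikidata-based reference management more concrete, here is a minimal sketch under a stated assumption: that citation keys in a LaTeX source are Wikidata item identifiers (such as `Q12345`). This is an illustration only and does not describe the behaviour of Scholia's library. The functions `extract_wikidata_keys` and `to_bibtex`, the placeholder identifiers, and the sample metadata record are all hypothetical.

```python
import re

# Matches \cite, \citep, \citet, ... with optional [..] arguments (illustrative pattern).
CITE_PATTERN = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Assumption: a citation key that looks like a Wikidata item identifier.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def extract_wikidata_keys(latex_source: str) -> list:
    """Return the distinct Q-identifier citation keys in order of first use."""
    keys = []
    for match in CITE_PATTERN.finditer(latex_source):
        for key in match.group(1).split(","):
            key = key.strip()
            if QID_PATTERN.fullmatch(key) and key not in keys:
                keys.append(key)
    return keys


def to_bibtex(qid: str, record: dict) -> str:
    """Format a metadata record as a BibTeX entry keyed by the Q-identifier."""
    fields = ",\n".join(f"  {name} = {{{value}}}" for name, value in record.items())
    return f"@article{{{qid},\n{fields}\n}}"


# Placeholder identifiers and metadata, not real Wikidata content.
sample_source = (
    r"Prior work \cite{Q11111111} and a survey "
    r"\citep[see][]{Q22222222, localkey} motivate this."
)
keys = extract_wikidata_keys(sample_source)
entry = to_bibtex(keys[0], {"title": "An example title", "year": "2019"})
print(entry)
```

In a real workflow the metadata record would be looked up on Wikidata rather than typed in by hand, so that corrections made there flow back into every manuscript that cites the item.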
We currently have no plans to expand this functionality beyond TeX/LaTeX for other writing environments, but a related JavaScript library,
5. Establishing metrics to generate usage stats for Scholia pages and key bibliographic properties and items
We already publish some basic metrics for usage statistics of Scholia-related Wikidata properties (cf. Table
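As a rough sketch of what page-level usage statistics could look like, the example below counts requests per profile type. It assumes, for illustration only, that Scholia page paths have the form `/<aspect>/<Wikidata Q-identifier>`; the function `count_views_by_aspect` and the sample paths are hypothetical and are not taken from actual traffic logs.

```python
from collections import Counter
from urllib.parse import urlparse


def count_views_by_aspect(requested_urls):
    """Count requests per profile type, ignoring paths that are not profiles."""
    counts = Counter()
    for url in requested_urls:
        parts = [part for part in urlparse(url).path.split("/") if part]
        # Assumed profile path shape: /<aspect>/<Q-identifier>
        if len(parts) >= 2 and parts[1].startswith("Q") and parts[1][1:].isdigit():
            counts[parts[0]] += 1
    return counts


# Hypothetical request paths, standing in for parsed log lines.
sample_requests = [
    "/author/Q111",
    "/topic/Q222",
    "/author/Q333",
    "/static/scholia.js",
    "/",
]
views = count_views_by_aspect(sample_requests)
print(views.most_common())
```

Any real metric would also need to separate human visits from web crawlers, which this sketch does not attempt.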
6. Improving
Documentation of Scholia's content, code, and user instructions matters beyond Scholia itself: it models and teaches the concept of openness in general. By prioritizing documentation, we also establish a historical record of values. Universities, libraries, and the public should expect and protect the level of transparency, accessibility, and openness for which we are setting a standard. Our documentation priorities include routine usage instructions, lay and accessible interpretations of visualizations, notes on the underlying queries and data sources, and statements about gaps and biases.
The main outputs of this project are technical improvements to Scholia, which will result in the release of version 1.0, as a Wikimedia-based tool for presenting scholarly profiles based on Linked Open Data from Wikidata. Users can view these profiles to gain insights of general or specialist interest, answer questions, and guide future research.
If we are successful to the limits of our dreams, then the Wikidata community will play a significant role in shaping the Data Science Revolution in the context of scholarly research, knowledge discovery and research evaluation, and Scholia will be used routinely by students, researchers, journalists and many others around the globe and across disciplines. If we fall short, then we will learn about technical, social and other factors that need to be addressed before such dreams can become reality, and identify some niches where Scholia might find more favourable conditions.
Scholia is a tool which people are using already. Building on the growing awareness of Wikidata in
For the personal profile use case (which
Links to Scholia pages have also been integrated into a number of other Wikimedia tools, e.g.
Scholia has been in beta testing since late 2016. In that time, various pilot communities have flagged technical problems and made feature requests in the tool's GitHub repository and elsewhere. The Scholia team seeks to resolve issues to end beta testing, release Scholia 1.0, and plan for the next stage of the tool's integration into the Wikimedia platforms.
The team seeks to be cautious in this project: rather than funding any major new visible features, it will make Scholia's back-end infrastructure stable, well designed, well documented, and orderly for others to test and examine. On the front-end, the priority is conforming to Wikimedia standards for accessibility and internationalization.
All of the staff allocations (cf. Table
Scholia has never had dedicated financial support for development. The team has no plans to seek additional funding for any of the features for which this project seeks funding. Scholia originated as an unfunded side project of the core team, building on the general reuse of open software and open data and on crowdsourced Wikimedia community engagement.
The most valuable existing support which the project has is social and technical integration with the Wikimedia platforms and the Wikimedia community. However, part of the writing for this proposal has been accomplished by Daniel Mietchen and Lane Rasberry on staff time at the Data Science Institute of the University of Virginia.
Data in this project will have primary publication in Wikidata and is dependent on the infrastructure of that project. Scholia code is hosted
This project has not received previous Sloan funding. The organizers of this project have no previous Sloan grants. In the Wikimedia ecosystem, Sloan has funded other projects and the Wikimedia Foundation itself with indirect connections to this Scholia project. The difference is that the Wikimedia Foundation is a platform which provides a space for community organization, and Scholia is one of the community projects which operates within both the Wikimedia platforms and the general public commons.
In 2017, Sloan
Annually from 2016 to 2018, Sloan sponsored the Wikimedia Foundation to organize the WikiCite conference, an event for about 100 people to discuss the curation of source metadata in Wikidata. Our Scholia project depends on WikiCite data and seeks to contribute more data to the collection discussed at that conference. However, the Wikimedia Foundation and its conference are separate and independent from this proposal and from Scholia's tool and content development, although this proposal's principal investigator Daniel Mietchen has been an organizer for the WikiCite conference.
Principal investigator Daniel Mietchen and project advisor Dario Taraborelli have a conflict of interest for being co-organizers of the Sloan-sponsored WikiCite conference. The other core organizers and the grantee institution have no conflict of interest.
Scholia is a free and open community project. Anyone may participate as an end user or by registering a Wikimedia account and joining discussions. In Wikimedia projects, there is no identified precedent of a conflict of interest problem in a project comparable to Scholia. The Scholia team will watch for conflict of interest as is customary in Wikimedia projects, and will report any issues which arise, but does not anticipate undue participation of any conflicted stakeholders despite the openness of the project.
The Scholia team expects to have the participation of individuals who make purchase decisions for knowledge discovery tools for their institutions. The Scholia team does not anticipate that seeking input and participation from people at this level will raise the challenges of a conflict of interest. Such people could include scholarly communication librarians and human resources teams at research institutes.
Because Wikidata ingests external data, it also ingests the bias of the sources of that data and its environment. Visualizations like those provided by Scholia can help identify such biases. Fig.
This project starts with two mandates for diversity: that of the Wikimedia community and that of the University of Virginia as project host. These mandates require diversity in the appointment of funded positions, in the recruitment of community engagement for the development of the project, and in the choice of pilot communities who will be among the first beneficiaries of project outcomes.
In funded roles, the project starts with 3 investigators in 3 different countries who will develop the tool in their 3 different native languages and additionally English. Later appointments will seek other dimensions of diversity. The project already has collaborations with established Wikimedia community organizations to provide user feedback during development, including groups organized by country, organizations for gender and racial diversity, or inclusivity of specific academic fields.
Scholia itself is a tool which can identify academic accomplishments of minority groups and their members. For example, scientists who are ethnic minorities, women, or LGBT+ often seek to publicly identify themselves to be counted and encourage more people of their demographic to join the sciences. Some queries which Scholia's native infrastructure can accomplish, but which are too progressive and provocative for competing products, include "ratios of scientists by ethnicity at a given university" (
This project seeks to be a model of Wikimedia openness in all information product outputs. Every information product which this project creates will be aligned with the Wikimedia ideal of free media and have compatibility with the appropriate Wikimedia project licenses, which are CC0 for data, CC BY or CC BY-SA for most media and text, and
This project will present datasets, software, documentation, and the published text of online community discussion as part of the primary goal of developing Scholia as an online tool for exploring the Wikidata knowledge graph of WikiCite data. We will put data produced in this project into the Wikidata platform which offers various format options for anyone to export their own copy of the content. Beyond applying open licenses to the primary information products, this project additionally seeks to be open in development, community participation, and public discussion around the project. These processes and conversations will also happen in the open in ways that create media records with open licenses which anyone can access or scrutinize.
To increase accessibility to information products beyond the Wikimedia platforms, we will mirror the publication of some products in more traditional spaces. Examples of additional distribution plans include using GitHub as a code repository for this project and Zenodo for archival copies to make these resources more accessible.
This project will reuse code and content whenever possible, always with a Wikimedia-compatible open license. The policies which best describe the constraints on this project are the Wikimedia policies on openness, such as their
Everything produced by this project will be available online for anyone to access, export, remix, or reuse at no cost.
The authors would like to acknowledge the WikiCite community's contributions to bibliographic and related data in Wikidata and to Scholia's documentation and code.
Robustifying Scholia: paving the way for knowledge discovery and research assessment through Wikidata
All four authors were involved in conceptualizing the project. LR and DM wrote it up.
Screenshots of Scholia with examples of the kinds of visualizations it provides. Such individual visualization panels are then combined in a predefined way into profiles for authors, topics, organizations, works, events, locations or other units of interest. The examples are taken from the Scholia
"Topic scores" panel for an
"Number of publications per year" panel for an
"Co-author graph" panel for a
"Locations" panel for an
"Timeline" panel for an
"Co-occurring topics" panel for a
Visualization of Scholia traffic logs (note the semi-logarithmic scale). This data has not been analyzed yet, so cannot be used to draw any conclusions other than that usage grows. Multiple contributing factors seem likely, including web crawlers, generic growth of Wikidata and WikiCite content, increased interlinking both within Wikidata and between Wikidata and other websites, especially Wikimedia projects, as well as WikiCite or Wikidata outreach activities, which often
Map of geographical bias in Wikidata (by
Sample Scholia Visualizations. Scholia creates scholarly profiles by presenting the output of standardized sets of queries over the Wikidata corpus. Popular query sets include researchers, topics, and institutions. In some specific cases, e.g. to visualize large coauthor networks or the entire academic output of large universities, the current query results do not compute or render properly due to technical limitations which this project is to address. In the submitted version of the proposal, the table included a miniature version of the images in Fig.
Visualization | Example |
---|---|
Topics about which a journal, researcher, or institution publish most often | Fig. |
Counts of publications from a person or organization by year | Fig. |
Networks of co-occurring topics in research, or of clusters of co-authors | Fig. |
Locations of research, or of institutions active in a field of research, or groups which receive a type of funding | Fig. |
Timelines of a researcher's institutional affiliations, or the history of research around a topic | Fig. |
Charts ordering all sorts of popularity counts, like most cited papers, researchers, or institutions for a topic | Fig. |
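Each panel type listed in this table is backed by a query over Wikidata. As a hedged illustration of what such a query could look like, the sketch below builds a SPARQL query for a "publications per year" panel. It is not Scholia's actual query; the function name `publications_per_year_query` and the identifier `Q12345` are placeholders. The properties used are Wikidata's P50 (author) and P577 (publication date), and the `wd:` and `wdt:` prefixes are assumed to be predefined by the query endpoint.

```python
import re


def publications_per_year_query(author_qid: str) -> str:
    """Build a SPARQL query counting an author's works per publication year."""
    if not re.fullmatch(r"Q[1-9][0-9]*", author_qid):
        raise ValueError(f"not a Wikidata item identifier: {author_qid!r}")
    return f"""
SELECT ?year (COUNT(DISTINCT ?work) AS ?publications) WHERE {{
  ?work wdt:P50 wd:{author_qid} ;   # P50: author
        wdt:P577 ?date .            # P577: publication date
  BIND(YEAR(?date) AS ?year)
}}
GROUP BY ?year
ORDER BY ?year
""".strip()


# Q12345 is a placeholder, not a specific researcher.
query = publications_per_year_query("Q12345")
print(query)
```

Queries of this shape are cheap for a single author, but the same pattern applied to a large institution touches far more items, which is one way the rendering limits described in this table's caption can arise.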
Snapshot of the live statistics panel on the
Count | Description |
---|---|
7063397403 | Total number of triples |
174306082 | Citations |
91891486 | Author name strings on items about works |
17773585 | Items with a PubMed ID |
16440023 | Items with a DOI |
7236781 | Items with a geolocation |
5876764 | Links from items about works to items about their authors |
5382376 | Links from items about works to items about their main subjects |
4613688 | Links from items about works to items about their main subjects |
2559260 | Items with a taxon name |
452547 | Items about authors with an ORCID profile that has public content |
Categories of Robustifying Scholia expenses. For the detailed budget, see Suppl. material
Category | ~% | Purpose |
---|---|---|
3 investigators, 0.1 time each | 11% | oversight and administration |
back-end developer | 25% | add features and improve function |
front-end developer | 13% | apply interactive wiki interface |
UX designer | 5% | accessibility |
community outreach | 5% | user feedback throughout development |
documentation / student research | 6% | test workflows and publish instructions |
other direct costs | 5% | publishing and travel |
benefits | 17% | defined by university |
overhead | 13% | defined by university |
total | 100% |
Budget for "Robustifying Scholia"
Data type: budget spreadsheet
Brief description: The file contains the budget that was submitted along with the proposal.
File: oo_298201.ods