Robustifying Scholia: paving the way for knowledge discovery and research assessment through Wikidata

Knowledge workers like researchers, students, journalists, research evaluators or funders need tools to explore what is known, how it was discovered, who made which contributions, and where the scholarly record has gaps. Existing tools and services of this kind are not available as Linked Open Data, but Wikidata is. It has the technology, active contributor base, and content to build a large-scale knowledge graph for scholarship, also known as WikiCite. Scholia visualizes this graph in an exploratory interface with pro ﬁ les and links to the literature. However, it is just a working prototype. This project aims to "robustify Scholia" with back-end development and testing based on pilot corpora. The main objective at this stage is to attain stability in challenging cases such as server throttling and handling of large or incomplete datasets. Further goals include integrating Scholia with data curation and manuscript writing work ﬂ ows, serving more languages, generating usage stats, and documentation.

While the marketplace offers a range of services with overlapping functionality, none of the available options are community-curated, most use proprietary code, and those few that are based on free and open-source software make use of non-open data, which impedes further aggregation and reuse.We strive to apply Wikimedia values of openness to this problem by developing Scholia as an entirely free and open alternative to the competition, focusing on collaborative curation rather than seeking to capture and contain user contributions.
The major problem with Scholia at the moment is that it is a beta prototype which works, but has not had the necessary development to make its infrastructure ready for Wikimediascale use.In this project, we seek to develop Scholia into version 1.0 for pilot communities in anticipation of the approaching opportunity to adapt it for use on a global scale for all areas of research (Lemus-Rojas and Odell 2018).
Table 1 provides an overview of the kinds of visualizations (Fig. 1) which Scholia can present.It does so by querying Wikidata for information of scholarly interest about the concept to be profiled (Malyshev et al. 2018) and matching that with related information from Wikidata, as well as content from ist sister sites, Wikimedia Commons and Wikipedia.The figure highlights cases which work, but a range of challenges can cause these visualizations to break.By addressing the underlying basic problems, we hope to bring enough stability to Scholia that the Wikimedia infrastructure and community can more fully begin to engage with and participate in discussions around scholarly data.
To visualize this concept in a Wikimedia profile... Scholia presents a data visualization like this...
Topics about which a journal, researcher, or institution publish most often Fig. 1a Counts of publications from a person or organization by year Fig. 1b Networks of co-occurring topics in research, or of clusters of co-authors Fig. 1c Locations of research, or of institutions active in a field of research, or groups which receive a type of funding Fig. 1d Timelines of a researcher's institutional affiliations, or the history of research around a topic Fig. 1e Charts ordering all sorts of popularity counts, like most cited papers, researchers, or institutions for a topic Fig. 1f This project seeks to assemble key data corpora for narrow use cases in pilot communities to get deep feedback on the performance, design, accessibility and other features of Scholia's most important functionalities.The near future of Scholia's development will require some technical decisions in database architecture, hardware investment and planning for institutional partnerships.Before making those foundational decisions and doing Wikimedia-scale outreach, the team seeks to pilot, assure partner buy-in, and establish community values for the project.
Sample Scholia Visualizations.Scholia creates scholarly profiles by presenting the output of standardized sets of queries over the Wikidata corpus.Popular query sets include researchers, topics, and institutions.In some specific cases, e.g. to visualize large coauthor networks or the entire academic output of large universities, the current query results do not compute or render properly due to technical limitations which this project is to address.In the submitted version of the proposal, the table included a miniature version of the images in Fig. 1 as well as of the complete author profile referred to in the legend to Fig. 1a.

What is the major related work in this field?
Competing products either restrict data export or require software installation to present data visualizations of the kind that Scholia provides in any modern browser using just JavaScript and Wikidata.Wikipedia has defined minimum standards for access to information.Similarly, Scholia will not compete with the most costly features of alternative products, but instead seeks to raise the world's minimal expectations regarding knowledge discovery, research assessment, and related activities.
Why is the proposer(s) qualified to address the issue or subject for which funds are being sought?
The proposers are qualified for including the three people who have established Scholia as a working beta tool, for having the academic background to credibly share Scholia with research institutions, and for already having insider credibility as community members within the larger WikiCite project Mietchen et al. 2017) and Wikimedia platforms more broadly.
Complementary to Scholia itself, the ecosystem of media around the tool includes the team's own documentation of Scholia's code, usage instructions, conference presentations, and presentation in academic literature.The published record also includes media from Wikimedia community discussions about Scholia, the testimony of academic partners who have piloted Scholia use of their own accord, and the market research descriptions which list Scholia among key Wikimedia development trends.
The core team consists of Daniel Mietchen, a biophysicist and data scientist at the Data Science Institute at the University of Virginia, along with Finn Årup Nielsen, a computer scientist and neuroscientist in DTU Compute at the Technical University of Denmark who started Scholia, and Egon Willighagen, a chemist in the Department of Bioinformatics at M aastricht University.

What is the approach being taken?
The Scholia project uses the Wikimedia custom of agile development, in which each improvement is an incremental change which makes a permanent impact both in the function and published record of the product.The changes this project proposes are independent of each other, and will be addressed in parallel and deployed into the working tool without disrupting its user base.
The goal of such infrastructure development is to create channels for a communitycrowdsourced feedback loop of contribution of content, comments, criticism and improvements.As with any Wikimedia project, the large majority of the content development and labor which improves Scholia comes from users with no direct relationship to the core team of developers.Each of the individual plans for development seeks to lower a barrier which users must overcome to participate in the workflow of using Scholia and contributing content to develop it further.

Back-end development and testing with pilot corpora
This project requires systematic testing of Scholia performance across possible use cases or usage scenarios.On that basis, we will assemble a corpus of examples that test Scholia's technical limits that can help us optimize the infrastructure or inform technical design decisions.We will make such decisions around the types of visualizations available through Scholia and how they are cached or preserved, the ways in which the data to visualize gets into Scholia from Wikidata, the ways in which users can configure the experience (e.g. for comparisons), and the ways in which Scholia is integrated with WikiCite curation workflows, or hardware requirements.
While these technical test sets may be of limited use or interest outside the Scholia development team, the systematic testing of Scholia's limits can also help identify circumstances where the tool works well, and in conjunction with usage information, we can then start to build pilot datasets like the Zika corpus or the Invasion biology corpus to serve as examples that engage different user communities.Having example sets to show off creates a model and workflow for others to emulate to open, expand, integrate or clean up the datasets which are relevant to them.
2. Redesign of the Scholia user interface for better usability, e.g.site navigation and internationalization The existing beta version of Scholia is functional but requires integration with Wikimedia standards of high usability at a global scale and using accessible design whenever possible.Wikimedia projects already have multilingual infrastructure into which we can integrate Scholia to share its language interface with other Wikimedia translation efforts, particularly through Wikidata.
This project will seek review from a user experience professional to ensure that Scholia meets contemporary standards for usability, including accessibility in design.
3. Improving integration with WikiCite curation workflows, e.g.around missing data When information is lacking in Wikimedia projects, signals such as "citation needed" invite users to contribute, enrich, and critique the available information.In the Wikimedia way, as Scholia presents and visualizes information, it also identifies content gaps.
Currently, Scholia is primarily a tool for visualization of Wikidata content.Separately, the WikiCite project has tools for data curation on Wikidata, such as the Author Disambiguation tool (Smith 2019) to replace the strings of unidentified author names with the Wikidata identifiers of the respective authors, or the SourceMD tool (Manske 2019) for ingesting source metadata.We propose to integrate such popular existing tools more closely with Scholia, so that as with Wikipedia, anyone who is reading the information gets an invitation to add or modify content.

Enhancing Wikidata-based reference management for scholarly writing workflows
Besides the visualizations, Scholia has a number of additional functionalities e.g.regarding entity recognition or reference management.The latter is due to a Python library that processes BibTeX and thereby provides the ability to cite references in TeX/LaTeX documents through their Wikidata identifier or DOI, such that the citation is generated based on the reference's metadata as available from Wikidata.
This mechanism, while functional, needs to be made more comprehensive and robust.If that is achieved and its usage scaled up, this would provide a compelling way for the community of BibTeX users to share the curation of their metadata through Wikidata.
We currently have no plans to expand this functionality beyond TeX/LaTeX for other writing environments, but a related JavaScript library, Citation.js(Willighagen 2019 ), is being developed with input from the Scholia team, and so is pandoc-wikicite, a filter for the opensource document converter tool Pandoc (Voß 2019).Both libraries can handle formats beyond BibTeX and Wikidata and could help expand a collaborative way of reference metadata management beyond the TeX community.Zotero has also been integrated with Wikidata in a way that allows data exchange in both directions.It can ingest and output files in a variety of formats including BibTeX, so it can serve as a bridge between different writing environments.Another important development in this space is that Citoid, a MediaWiki extension that facilitates reference management on Wikimedia sites other than Wikidata, is scheduled to be adapted to Wikidata in the coming months, which will provide further integration with Wikimedia writing workflows.
5. Establishing metrics to generate usage stats for Scholia pages and key bibliographic properties and items We already publish some basic metrics for usage statistics of Scholia-related Wikidata properties (cf.Table 2).We lack more granular insights, like the impact of Scholia on a particular field or institution.Paving the way towards routine availability of such metrics is thus an aim of this project, so that we support our contributors and partners with the media metrics that institutions use to demonstrate the value they get from engagement.What is clear is that the traffic to Scholia pages is increasing (cf.Fig. 2).for which we are setting a standard.Our documentation priorities include routine usage instructions, lay and accessible interpretations of visualizations, notes on the underlying queries and data sources, and statements about gaps and biases.

What will be the output from the project?
The main output of this project are technical improvements to Scholia-which will result in the release of version 1.0-as a Wikimedia-based tool for presenting scholarly profiles based on Linked Open Data from Wikidata.Users can view these profiles to gain insights of general or specialist interest to answer questions and guide future research.
If we are successful to the limits of our dreams, then the Wikidata community will play a significant role in shaping the Data Science Revolution in the context of scholarly research, knowledge discovery and research evaluation, and Scholia will be used routinely by students, researchers, journalists and many others around the globe and across disciplines.If we fall short, then we will learn about technical, social and other factors that need to be addressed before such dreams can become reality, and identify some niches where Scholia might find more favourable conditions.
Scholia is a tool which people are using already.Building on the growing awareness of Wikidata in cultural heritage contexts, including as a Linked Open Data hub for libraries, pilot projects have begun to explore systematically the roles Wikidata could play in libraryrelated workflows.Some of these pilots focused on integrating Wikidata into production library systems, others on offering library services around scholarly profiles for faculty.
For the personal profile use case (which was at the origin of Scholia), the tool has been demoed at institutions in the US, the UK, the Netherlands, Germany and Spain and was highlighted in a white paper of the Association of Research Libraries' Wikidata Task Force (Allison-Cassin et al. 2019).Some Wikimedia-external websites are linking to Scholia profiles, e.g. the Journal of Cheminformatics, and the Scholia tool suite has been used to both write and track publications about Scholia, while researchers outside the Wikimedia community have begun to cite Scholia in their work about academic profiling, author disambiguation and related subjects.
Links to Scholia pages have also been integrated into a number of other Wikimedia tools, e.g.Reasonator, Wikidata Hub, the Author Disambiguator (to which Scholia links back, so as to smooth data curation workflows) as well as the MediaWiki templates Wikidata infobox and Scholia.Through such templates, Wikimedia Commons and several Wikipedias (e.g.Basque, Swedish, English) now have mechanisms to link to relevant Scholia profiles, and we have begun to track some of this usage in a rudimentary fashion.The large user bases of these wikis provide a potential for future growth of Scholia usage, and while eliciting such growth will not be in the focus of the current project, the technical work undertaken in its framework will have reuse at Wikimedia's massive global scale in mind.

What is the justification for the amount of money requested?
Scholia has been in beta testing since late 2016.In that time, various pilot communities have flagged technical problems and made feature requests in the tool's GitHub repository and elsewhere.The Scholia team seeks to resolve issues to end beta testing, release Scholia 1.0, and plan for the next stage of the tool's integration into the Wikimedia platforms.
The team seeks to be cautious in this project, and rather than funding any major new visible features, make its back-end infrastructure stable, well designed, well documented, and orderly for others to test and examine.On the front-end, the priority is conforming to Wikimedia standards for accessibility and internationalization.
All of the staff allocations (cf.Table 3) will have some input into all aspects of this phase of development, but roughly, most of the development will go to back-end infrastructure, less will be for front-end design and accessibility, and less still will go to engaging with end users including meeting UX standards, recruiting testing from pilot communities, and documentation for the project.What other sources of support does the proposer have in hand or has he/she applied for to support the project?
Scholia has never had dedicated financial support for development.The team has no plans to seek additional funding for any of the features for which this project seeks funding.Scholia's origins are as an unfunded side project of the core team, in the general reuse of open software and open data, and in crowdsourced Wikimedia community engagement.
The most valuable existing support which the project has is social and technical integration with the Wikimedia platforms and the Wikimedia community.However, part of the writing for this proposal has been accomplished by Daniel Mietchen and Lane Rasberry on staff time at the Data Science Institute of the University of Virginia.
Data in this project will have primary publication in Wikidata and is dependent on the infrastructure of that project.Scholia code is hosted on GitHub, with archival copies being released on Zenodo on a regular basis (see example).The code is run through the Wikime dia Toolforge.Many of Scholia's features are features in the Wikidata platforms, hence as Wikidata develops, so do these features become better in Scholia.Many features in the Wikimedia ecosystem are interrelated to others, and this project operates in that continually developing environment.
Categories of Robustifying Scholia expenses.For the detailed budget, see Suppl.material 1.

What is the status and output of current and/or previous Sloan grants?
This project has not received previous Sloan funding.The organizers of this project have no previous Sloan grants.In the Wikimedia ecosystem, Sloan has funded other projects and the Wikimedia Foundation itself with indirect connections to this Scholia project.The difference is that the Wikimedia Foundation is a platform which provides a space for community organization, and Scholia is one of the community projects which operates within both the Wikimedia platforms and the general public commons.
In 2017, Sloan provided funding to the Wikimedia Foundation for a project to convert metadata for media files, mostly images on Wikimedia Commons, into open linked data.That project is in a separate domain than Scholia, and the Wikimedia Foundation is its own unrelated entity.
Annually from 2016-18, Sloan has sponsored the Wikimedia Foundation to organize the WikiCite conference, which is an event for about 100 people to discuss the curation of source metadata in Wikidata.Our Scholia project depends on WikiCite data and seeks to contribute more data to the collection discussed at that conference.However, the Wikimedia Foundation and its conference are separate and independent from this proposal and Scholia's tool and content development, although this proposal's principal investigator Daniel Mietchen has been an organizer for the WikiCite conference.

Appendix
Because Wikidata ingests external data, it also ingests the bias of the sources of that data and its environment.Visualizations like those provided by Scholia can help identify such biases.Fig. 3 highlights counts of 2018 geolocation data showing that Wikidata's content better covers some locations over others.Various commentators and researchers discuss systemic bias in Wikimedia platforms.This project aims to counter Wikimedia biases by seeking out underrepresented data sets in the pilot corpora.

Attention to Diversity
This project starts with two mandates for diversity: that of the Wikimedia community and that of the University of Virginia as project host.These mandates require diversity in the appointment of funded positions, and in recruitment of community engagement in the development of the project, and in targeting the pilot communities who will be among the first beneficiaries of project outcomes.
In funded roles, the project starts with 3 investigators in 3 different countries who will develop the tool in their 3 different native languages and additionally English.Later appointments will seek other dimensions of diversity.The project already has collaborations with established Wikimedia community organizations to provide user feedback during development, including groups organized by country, organizations for gender and racial diversity, or inclusivity of specific academic fields.
Scholia itself is a tool which can identify academic accomplishments of minority groups and their members.For example, scientists who are ethnic minorities, women, or LGBT+ often seek to publicly identify themselves to be counted and encourage more people of their demographic to join the sciences.Some queries which Scholia's native infrastructure can accomplish, but which are too progressive and provocative for competing products, include "ratios of scientists by ethnicity at a given university" (prototype) or "academic fields sorted by the count of LGBT+ researchers publishing on the topic" (prototype).

Information Products Appendix
This project seeks to be a model of Wikimedia openness in all information product outputs.Every information product which this project creates will be aligned with the Wikimedia ideal of free media and have compatibility with the appropriate Wikimedia project licenses, which are CC0 for data, CC BY or CC BY-SA for most media and text, and free and open software licenses to operate on Wikimedia servers.
This project will present datasets, software, documentation, and the published text of online community discussion as part of the primary goal of developing Scholia as an online tool for exploring the Wikidata knowledge graph of WikiCite data.We will put data produced in this project into the Wikidata platform which offers various format options for anyone to export their own copy of the content.Beyond applying open licenses to the primary information products, this project additionally seeks to be open in development, community participation, and public discussion around the project.These processes and conversations will also happen in the open in ways that create media records with open licenses which anyone can access or scrutinize.
To increase accessibility to information products beyond the Wikimedia platforms, we will mirror the publication of some products in more traditional spaces.Examples of additional distribution plans include using GitHub as a code repository for this project and Zenodo for archival copies to make these resources more accessible.
This project will reuse code and content whenever possible, always with a Wikimedia compatible open license.The policy which best describes constraints on this project are the Wikimedia policies on openness, such as their Open Access Policy.
Everything produced by this project will be accessible online for anyone to access without paying a cost to access, export, remix, or reuse it.

Grant title
Robustifying Scholia: paving the way for knowledge discovery and research assessment through Wikidata

Hosting institution
Data Science Institute, University of Virginia Figure 1.Screenshots of Scholia with examples of the kinds of visualizations it provides.Such individual visualization panels are then combined in a predefined way into profiles for authors, topics, organizations, works, events, locations or other units of interest.The examples are taken from the Scholia author profile for "Finn Årup Nielsen", available via https://tools.wmflabs.org/scholia/author/Q20980928, and the Scholia topic profile for "Zika virus", available via https:// tools.wmflabs.org/scholia/topic/Q202864.All of these images were made available through Wikimedia Commons as part of the drafting process for this proposal.The files' locations there are indicated, and they are all licensed CC0/Public Domain.

Figure 2 .
Figure 2.Visualization of Scholia traffic logs (note the semi-logarithmic scale).This data has not been analyzed yet, so cannot be used to draw any conclusions other than that usage grows.Multiple contributing factors seem likely, including web crawlers, generic growth of Wikidata and WikiCite content, increased interlinking both within Wikidata and between Wikidata and other websites, especially Wikimedia projects, as well as WikiCite or Wikidata outreach activities, which often feature Scholia prominently.From Nielsen and Mietchen 2019, CC BY 4.0.

Figure 3 .
Figure 3. Map of geographical bias in Wikidata (by Adam Shoreland, via Wikimedia Commons, CC0).The color indicates how many Wikidata items have a geolocation within the corresponding pixel, with brighter colors indicating a larger number.The observed biases are a combination of multiple factors, including human settlement patterns and Internet access.In a similar fashion, Wikidata can be used to visualize other kinds of biases, across data types, domains of knowledge, or historic eras.

Table 2 .
Snapshot of the live statistics panel on the Scholia homepage as of 7 February 2019, with some key stats regarding Wikidata content used by Scholia.A corresponding screenshot is available on Wikimedia Commons, and so is a timeline for some of these stats.