Changing the world, one byte at a time – Open Citations

Metadata is a word that is almost guaranteed to make eyes glaze over in under five minutes, and yet it describes something that is both simple and deeply disruptive and transformative. The world as we now know it would be radically different, and a lot worse off, without it.

What does the word “Metadata” mean?

OK, let’s set the scene.

In essence, “metadata” is a term that refers to data that describes data. The prime example is one we are all familiar with: the World Wide Web. The idea is simple. When you create a document, you insert into it data that describes not only the data within that document, but also related data within other documents across a network. The big idea that Tim Berners-Lee had was not the creation of metadata. Back in 1945 (yes, it really was that long ago) Vannevar Bush wrote an article in The Atlantic Monthly called “As We May Think”, in which he described a hypertext device that contained not only data, but also data that described that data and linked to related data.

In 1963, Ted Nelson began implementation of a hypertext system.

So why does Berners-Lee get all the credit?

His 1989 proposal for the WWW, and the implementation that followed, successfully married hypertext with networking, and that is of course significant. What made it truly take off, though, is that the code was given away for free to anybody who wanted to try it out. That was the moment in which things really blossomed.

Writing about it at the time, he described it as follows:

HyperText is a way to link and access information of various kinds as a web of nodes in which the user can browse at will. Potentially, HyperText provides a single user-interface to many large classes of stored information, such as reports, notes, data-bases, computer documentation and on-line systems help. We propose the implementation of a simple scheme to incorporate several different servers of machine-stored information already available at CERN, including an analysis of the requirements for information access needs by experiments… A program which provides access to the hypertext world we call a browser.

― T. Berners-Lee, R. Cailliau, 12 November 1990, CERN

Apr 2017 – Another new step – The Initiative for Open Citations

What prompts me to write is the news that one more step has now been taken on the road to an ever-increasing flow of information.

Published scientific papers build upon the ideas established within other previously published scientific papers. Read one and you will find it full of citations to other earlier papers.

Who is citing whom?

Until recently, the only way to work that out was via the commercial databases Web of Science and Scopus.

That, however, has just changed. The new Initiative for Open Citations has been born.

Why does this matter?

It is not simply a matter of ego, but also of understanding the flow of ideas. Which ideas are gaining traction, and why? That is itself an important question.

The Initiative for Open Citations (I4OC) describes it as follows:

Citations are the links that knit together our scientific and cultural knowledge. They are primary data that provide both provenance and an explanation for how we know facts. They allow us to attribute and credit scientific contributions, and they enable the evaluation of research and its impacts. In sum, citations are the most important vehicle for the discovery, dissemination, and evaluation of all scholarly knowledge.

As the number of scholarly publications is estimated to double every nine years, citations – and the computational systems that track them – enable researchers and the public to keep abreast of significant developments in any given field. For this to be possible, it is essential to have unrestricted access to bibliographic and citation data in machine-readable form.

The present scholarly communication system inadequately exposes the knowledge networks that already exist within our literature. Citation data are not usually freely available to access, they are often subject to inconsistent, hard-to-parse licenses, and they are usually not machine-readable.

That last point is key. It applied even to open access articles, so simply having the raw text fully available via open access is not sufficient. A far richer set of metadata is needed, in a common format that is centralised and easily accessible to all.

How do Open Citations work?

Science publishers deposit citation data with Crossref, a non-profit organisation established in 2000. Until now, however, only about 1% of it was freely accessible. Most publishers held on very tightly to the copyright of the citation data, the details of who is citing whom, and getting at it required payment of huge fees.

After six months of persuading publishers to make such citations freely available, that figure has jumped to 40%.

This is not hard. It is simply a matter of persuading publishers to flip a setting that makes the citation data freely available.

Springer Nature, Taylor & Francis, and Wiley are all in, and are amongst the 29 publishers who have committed to this. Missing, so far, is Elsevier, but perhaps in time.
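To make “machine-readable” a little more concrete, here is a minimal Python sketch of pulling the cited DOIs out of a single citation record. The record is invented for illustration and its DOIs are placeholders. The field names (`message`, `reference`, `DOI`, `unstructured`) are my assumption about how a Crossref-style work record is laid out, so check them against the current Crossref API documentation before relying on them. The function name `cited_dois` is hypothetical.

```python
import json

# Invented sample shaped like a Crossref-style "work" record.
# The DOIs are placeholders, not real papers.
SAMPLE_RECORD = """
{
  "message": {
    "DOI": "10.5555/example.2017.001",
    "reference": [
      {"key": "ref1", "DOI": "10.5555/example.2010.042"},
      {"key": "ref2", "unstructured": "Bush, V. (1945). As We May Think."},
      {"key": "ref3", "DOI": "10.5555/EXAMPLE.2015.007"}
    ]
  }
}
"""


def cited_dois(record_json):
    """Return (citing DOI, list of cited DOIs) for one work record."""
    message = json.loads(record_json)["message"]
    # A record whose reference list is not open simply has no "reference" key.
    references = message.get("reference", [])
    # Only references that carry a DOI can be linked to another record.
    # DOIs are case-insensitive, so normalise them to lower case.
    cited = [ref["DOI"].lower() for ref in references if "DOI" in ref]
    return message["DOI"].lower(), cited


citing, cited = cited_dois(SAMPLE_RECORD)
print(citing, "cites", cited)
```

Note that the second reference has only free text and no DOI, so it is skipped. That is the practical difference between a citation that merely exists in print and one that is machine-readable.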

What benefits will this bring?

The vision that they are running with is this:

1. The establishment of a global public web of linked scholarly citation data to enhance the discoverability of published content, both subscription access and open access. This will particularly benefit individuals who are not members of academic institutions with subscriptions to commercial citation databases.
2. The ability to build new services over the open citation data, for the benefit of publishers, researchers, funding agencies, academic institutions and the general public, as well as enhancing existing services.
3. The creation of a public citation graph to explore connections between knowledge fields, and to follow the evolution of ideas and scholarly disciplines.
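The third point, a public citation graph, is easy to picture in code. What follows is only a toy sketch in Python, assuming the open data has already been reduced to (citing, cited) pairs. The paper labels are invented, and the helper names `build_graph`, `citation_counts` and `lineage` are hypothetical rather than part of any I4OC or Crossref tool.

```python
from collections import defaultdict, deque

# Hypothetical (citing, cited) pairs; the paper labels are invented.
CITATIONS = [
    ("paper_d", "paper_b"),
    ("paper_d", "paper_c"),
    ("paper_c", "paper_a"),
    ("paper_b", "paper_a"),
    ("paper_e", "paper_d"),
]


def build_graph(pairs):
    """Map each citing work to the set of works it cites."""
    graph = defaultdict(set)
    for citing, cited in pairs:
        graph[citing].add(cited)
    return graph


def citation_counts(pairs):
    """Count how many times each work is cited."""
    counts = defaultdict(int)
    for _, cited in pairs:
        counts[cited] += 1
    return dict(counts)


def lineage(graph, start):
    """Return every work reachable from `start` by following citations back."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # .get avoids adding empty entries for works that cite nothing.
        for cited in graph.get(current, ()):
            if cited not in seen:
                seen.add(cited)
                queue.append(cited)
    return seen


graph = build_graph(CITATIONS)
counts = citation_counts(CITATIONS)
ancestors = lineage(graph, "paper_e")
print(counts)
print(sorted(ancestors))
```

Here `counts` answers "who is being cited, and how often?", while `lineage` follows the chain of citations backwards from one paper to everything it ultimately builds upon.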

I should perhaps also add the rather obvious: this is not the end-game. It is simply one more step, albeit a rather sensible and very practical one.

Data scientists will now have access to a vast pool of linked citations, and thus the lineage of ideas can be traced. However, why stop with citations? Why not make everything freely available?

Doing that would greatly accelerate our ability to make scientific progress.