Peter Murray-Rust on Liberating Facts from the Scientific Literature

Peter Murray-Rust spoke at mySociety’s meet-up in Cambridge on the 4th of November 2014 about the project he leads, The Content Mine, which aspires to liberate facts from scientific literature.

I have produced a video and transcript of the speech:

mySociety is one of the most wonderful things to have come out of the bottom-up democratic movement in the UK and the UK is a shining light for the rest of the world. It is magnificent what’s gone on and I have used many of the tools of mySociety. I’ve used WriteToThem on many occasions. How many people have used WriteToThem? [hands go up] Hey, Wow. It just makes the whole business of contacting your representative so much easier. And I’ve also used a lot of WhatDoTheyKnow FOI requests and again it’s absolutely brilliant. It makes the difference between doing it and not doing it. It doesn’t always mean you get a reply and we actually need an automatic bot which beats up people who think by not replying they are helping keep the noise level down.

I have been a scientist in Cambridge for fourteen years, something like that; I’ve been in the Chemistry Department. I’m still in the Chemistry Department but I’m technically retired, and I have found that in the last few years I cannot do my research because I’m being prevented by things which are not scientific, but which bring me right up against the problem of ownership of intellectual property, or regressive, restrictive practices and so on. So half my work has been fighting against the mega-corporations to try and get justice in terms of our right to access and read information.

So a year ago I applied for a fellowship from the Shuttleworth Foundation. That’s Shuttleworth, that’s Jack the baboon from South Africa. How many of you use Ubuntu? [hands go up] Well Ubuntu was developed by Mark Shuttleworth, who made a lot of money out of it and put some of the money into supporting people like me. Fellows are required to change the world and that’s what I’m in the process of doing.

A well known Shuttleworth fellow is Rufus Pollock who set up the Open Knowledge Foundation and I’m very happy to discuss with any of you how that works.

So my re-application is due in tomorrow and for that they needed a five minute video so it’s probably the easiest way of explaining what I’m doing so I’m going to play this video:

I’m Peter Murray-Rust and this is my application for a second year of Shuttleworth fellowship. Here’s PMR presenting the vision at Wikimania this year.

So I’m going to talk about The Content Mine, a project funded by the Shuttleworth Foundation which has given me the opportunity to build something completely new and different. So we are going to use machines to liberate scientific facts on a massive scale and we’re going to put them in WikiData.

We’ve created a great team over the last year and they’ll tell you their roles:

  • Hi there. My name is Ross Mounce. We’re here to analyse PLoS, BMC, PeerJ, eLife and the Hargreaves Allowed Corpus. Articles contain facts. They also may have associated supplementary materials. A journal contains many articles. Some journals publish a vast number of articles. There are many journals, many of which have restrictions, so at the end of the day there are billions of facts trapped in a galaxy of articles.

We’ve tried this out on many people at our workshops and they’ve gone through and you can see how they’ve marked up different things with different colours and that’s what we’re going to ask the machine to do.

So here’s a machine. This is how a machine might see a bit of science. You don’t need to understand this but you can see the kinds of things that a machine might be able to pick up. And we can mine data as well. Every bit of information in this image has been extracted and can be further manipulated. Richard and the rest of the team have a generic solution:

  • Richard Smith-Unna: We’ve built an eco-system of open-source software for data-mining the entire academic literature, and we’ve also grown a community of volunteers who help us maintain it. Now we’re ready to actually go live and start actually doing content mining on a daily basis on a massive scale.

We start with the scientific literature, we crawl it, we scrape it, we extract it. And then we’re going to take the results and put them in WikiData. What we need is more science plug-ins. I’ve written a chemistry one, I’ve written a phylogenetic tree one. We need people to do maps, we need people to do birds, we need people to do stars.

This is our application software. This is what we’ve been concentrating on at the moment and what we’re immediately going to do. It’s based on a plug-in architecture so that every discipline can create a plug-in specifically for its needs. We need to spread the word and train people, and Jenny tells us how.

  • Jenny Molloy: It has historically been very hard to use content mining technologies which has led to a huge skills deficit. We’re trying to address this by running international workshops and also creating a repository of open online material so that anyone researching anywhere in the world can find out how to do content mining.

Jenny and Puneet have just run a workshop in New Delhi.

We are already doing literature research on a variety of plants and the chemicals they produce.

Steph explains our resources:

  • Steph Unna: As part of our commitment to remaining open we’re creating a modular resource base for teaching and learning content mining so that we can make sure that our project is accessible to wider audiences.

Here’s what we’ve already achieved. We’ve run several workshops, we’re collaborating with a lot of different organisations and people, we’re working with publishers, we’ve applied for grants, and we’ve already got several publications. We assert that the right to read is the right to mine.

We’re starting the daily extraction right now: PLoS ONE, then BMC, then PeerJ, and then moving on to the closed journals, which is now legal in the UK.

So to sum up, we’re going to start extracting facts now and grow this rapidly month by month. We’re going to develop more workshops and train people to run the workshops by themselves. We’re also going to create a community of developers and scientists, so that they are going to decide what can be done and will build the tools and the protocols to run it, and finally we are going to look and see how The Content Mine can become self-sustaining.

So we’re going to liberate one hundred million facts per year from the scientific literature and we’re going to put them in Wikipedia or rather WikiData and we’re working closely with WikiData. Everyone is welcome to… well.

This is my promise: I will never sell out to non-transparent organisations because what’s happening in this space is that as soon as someone in academia comes up with a bright idea they then go and sell themselves to either Digital Science or Elsevier and this is a major problem because all the innovation gets bought up.

This is my thesis: publicly funded research, including charity-funded research, is about four hundred thousand million dollars per year, or four hundred billion if you like that. It’s an awful lot of money and you are all paying for it. There are about one and a half thousand [?] [Slide says one and a half million] scientific articles published each year, so each of them represents about three hundred thousand dollars of work. And if you want to publish? How many people here have published? [hands go up] Right, the majority of you. It cost somebody seven thousand dollars to publish that. It either cost your library to subscribe to it, or you have to pay fees to the publisher, or both, and at the end of that you might be able to access the paper.

It technically costs seven dollars. How many of you have used arXiv? This is physics and maths, and the cost of putting up a paper which the world can read is seven dollars, so we have a cost scale here. Some of this is due to the fact that academics want glory, and this is the price that you have to pay for glory: to get into a journal like Nature, where 95% of the papers are rejected, and that’s where some of the cost comes in. But the cost is also the fact that Elsevier makes profits of forty percent; no other industry makes that amount of profit, because they get the stuff for free and they can charge what they like for selling an article. You can see I have a slight take on this. And academic libraries spend ten billion dollars a year on this. Almost everybody in the world cannot read this. It is only rich places like the University of Cambridge and Harvard which can actually read most of the scientific literature.

Now we are therefore denying ourselves the downstream wealth. Patel did a study on the human genome, four billion dollars were invested in it and they calculated that we got eight hundred billion, this is in the US, in terms of jobs, in terms of materials, new technologies and so on. So a huge multiplier effect.

Now here’s the bad bit. Most of that published science is wasted. The Lancet, four years ago, said that eighty-five percent of the research was wasted because it was duplicated, it was badly designed, it was poorly reported, it just wasn’t published at all, things of that sort.

And PLoS have confirmed the same thing this year. The way it is published is awful. How many of you have tried to read the words in a PDF, not with your eyes, but with a machine? Right. That’s what we’re on about. PDF destroys information more than any other thing. I’ve had a paper which I’ve just published and they said “can you photograph the maths equations in this so we can embed them in the HTML?” – and this organisation described itself as a twenty-first-century dynamic publisher. I mean, it is awful. We need to turn this stuff into semantic form.

Here’s the vice chancellor. This is not me ranting. He was asked two months ago by Michelle Brook about spending money with Elsevier. “Just wait until we get into open-data debates. Elsevier is already looking at ways in which it can control open data as a private company”. That is true. The point is these publishers are so big that libraries will have no ability, no communal will, to do anything other than pay them money for their products. And this way we get the same effect as Google, Facebook, and all the others who build up a restricted walled-garden environment.

We’ve tried to change that. So we have argued that the right to read is the right to mine, which is what I have here. The publishers are trying to make us licence “their content” so that they can restrict our use of it. We’ve fought it in Brussels and it came to a show-down and basically all the good-guys walked out.

Are there any publishers here? “I’m an ex-publisher, I will admit it now”. Well you will feel a bit of ex-angst in this talk. So they came up with a Licence for Europe and basically it’s an impasse. Well the great news is that the UK has jumped ahead of everybody else and we will see that in a minute.

This is typical. This is this year. Ebola haemorrhagic fever, a report on this, if you want to read it you have to pay thirty-one dollars, now that’s a crime against the planet. But academics don’t care, as long as they get the glory of publishing in the Lancet that’s all that matters to them. I’ve been an academic, I’m allowed to criticise them.

This is the open access declaration and you can see here the learning of the rich with the poor and the poor with the rich that’s the important thing. True open access is democratic and symmetric, it’s not about little bits coming out under publisher conditions it’s about building the same ethos as open source has built.

This is our phrase: if you have the right to read a piece of information, you have the right to data mine it. It starts with crawling. How many people have written a crawler? Is it fun? (Responses: “yeah”, “it can be”). Does it still work a year later? (Response: “sometimes”). Exactly. How many people have written a scraper? By scraping we mean something which looks at a URL and pulls down all the bits, the PDF, the HTML, the PNGs and everything, and then we have to extract the facts. How many people have written a PDF parser? We’ll move on.
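The scraping step he describes – look at an article URL, pull down the PDF, HTML, and images – can be sketched in a few lines of Python. This is a minimal illustration, not The Content Mine’s actual code (their tooling and plug-in names are not shown here); the class and the sample page are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceScraper(HTMLParser):
    """Collect links to the PDF, image, and supplementary files on an article page."""
    WANTED = (".pdf", ".png", ".html", ".csv", ".xml")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Both <a href> and <img src> can point at downloadable resources.
        url = attrs.get("href") if tag == "a" else attrs.get("src") if tag == "img" else None
        if url and url.lower().endswith(self.WANTED):
            self.resources.append(urljoin(self.base_url, url))

# A toy article page; a real scraper would fetch this over HTTP.
page = '<a href="/article/1.pdf">PDF</a> <img src="fig1.png"> <a href="/about">About</a>'
scraper = ResourceScraper("https://example.org/article/")
scraper.feed(page)
print(scraper.resources)
```

The fragility he jokes about (“does it still work a year later?”) lives mostly in the `WANTED` patterns and page layout, which is why a per-journal plug-in architecture makes sense.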

So my thesis is that actually citizens can understand science if we translate it for them. So nobody here knows what Panthera leo is. Well I’m sure they do actually, but everyone knows what a lion is. Similarly you might not know what Aspergillus oryzae is but it’s the fungus that makes soya-bean sauce so these papers actually become accessible if you build an amanuensis into it. An amanuensis is a scholarly assistant. So here’s our amanuensis, she’s called Amy, she has no emotions, she never gets bored, she does what she’s told and if you tell her something stupid she does something stupid and that’s what most programmes do.
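The translation idea – Panthera leo becomes “lion” for the lay reader – is at heart a glossary lookup over recognised terms. A minimal sketch, with an invented two-entry glossary standing in for the large species databases an amanuensis like Amy would really use:

```python
import re

# Toy glossary; a real amanuensis would draw on a full species database.
GLOSSARY = {
    "Panthera leo": "lion",
    "Aspergillus oryzae": "the soy-sauce fungus",
}

def annotate(text):
    """Append a plain-language gloss after each recognised Latin binomial."""
    for latin, common in GLOSSARY.items():
        text = re.sub(re.escape(latin), f"{latin} ({common})", text)
    return text

print(annotate("We observed Panthera leo at dusk."))
# → We observed Panthera leo (lion) at dusk.
```

As he says, the machine does exactly what it is told: if the glossary entry is wrong or ambiguous, the annotation will be too.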

So we gave this to people who knew no science whatever and you can see they went through and marked this up, and they’ve done a very good job of it. Similarly our machines can do this, and we use some quite sophisticated parsing; this is a shallow parser which turns the chemistry here into Chomsky-like parses. In chemistry we can take this. How many chemists here? None. Good. Well this looks like gobbledygook but the machine can read that and can turn that bit of language into actual living chemical systems, semantic chemicals, and it works it out within a second, so we can read the literature in that way.

So this is our pipeline, and the way that we’re doing it is we’re exposing a plug-in architecture so people can write plug-ins for the scrapers. Richard was going to be here, but he isn’t – that was Richard on the video – and he’s got lots of volunteers to do this scraping, and similarly I’m beginning to get people who want to scrape chemistry and species and things like that.

We can actually now take PDFs. We can take that graph and turn it into a CSV in less than a second. If you have graphs that you want to read, our technology – it needs customising, and it needs blood and sweat – can ultimately get you from this to that, and this picture here shows it. It’s got units and scales and things like that, and these don’t have to be science: they could be stock prices, or the amount of money spent on drugs against the number of people who still use drugs – you saw that in the last few days.

So that would produce that sort of graph, and once you’ve got that, you can do clever things like smoothing it or looking for the variations. We’re working particularly on phylogenetics. Every new bacterium comes out in this journal, the International Journal of Systematic and Evolutionary Microbiology. And even though it’s a closed journal, we are now extracting it under the new legislation.
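The core of turning a plotted curve back into a CSV of numbers is a calibrated map from pixel coordinates to axis values: read two tick marks per axis, fit a linear mapping, and apply it to every traced curve point. A sketch under the assumption of linear axes (the function name and pixel values are illustrative, not from the ContentMine software):

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value,
    calibrated from two tick marks with known values (assumes a linear axis)."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Calibrate: x-axis tick at pixel 50 reads 0.0, tick at pixel 450 reads 10.0.
to_x = make_axis_map(50, 0.0, 450, 10.0)
# y-axis: pixel 400 is 0.0, pixel 100 is 1.0 (image y runs downward).
to_y = make_axis_map(400, 0.0, 100, 1.0)

# A traced curve point at pixel (250, 250) becomes the data point (5.0, 0.5).
print(to_x(250), to_y(250))
```

Logarithmic axes would need a log transform before the linear fit; the image-analysis step that finds the ticks and the curve pixels is the harder “blood and sweat” part he mentions.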

What’s happened this year is the UK Government has pushed through copyright reform and it has given exemptions to copyright which now allow you to do five different things. One of them is parody. Does anyone here write parody? Well if you do you will not be sued for copyright, or you can defend yourself in the UK. You might be sued for something else. Similarly you can do format shifting, archiving, and most importantly for us we can do what’s called data-analytics or as we call it content-mining.

This is mining scientific images. This gives us an idea of the kind of things we can do. We can take a picture like that, snapped with my iPhone and we can turn it into a chemical formula and again that takes about a second. Has anyone done any image analysis? Good. I’d like to talk to you. We can now do scientific image analysis. We can recognise that structure and turn it into a chemical.

Here’s a whole lot of chemicals. My colleague Andy Howlett can read the whole of that and turn it into an animated reaction. All of that within a few seconds so we can read the whole literature. There’s another one he can do, he can read the whole of this there, automatically and turn it into animated structures.

I’ll finish the slides there and say that what’s at stake is the following. There are about a thousand new papers which come out a day. We’re building a crawler which is going to extract all these papers from the journals that we’re entitled to read in Cambridge University, and that’s why we have to have our magic jumper on, which says Cambridge University, because that allows us to read it in Cambridge.

How many people, by the way, have access to Cambridge? One. OK. If any of you want access to this, if any of you become a research collaborator of mine, then you can get access.

I am not in it for the glory. I am in it to destabilise, to disrupt, the whole system. So I am allowed to collaborate with anybody so if you have research projects which you want to mine the literature I can do it and so long as we don’t sell the results, that’s fine, we can do it for non-commercial purposes, and we have to be British which is why we have the British flag. We are allowed to do it in Britain.

  • So can I ask you a quick question there. When you say you can do it for non-commercial purposes, does that mean that all the things you put out have to be under a non-commercial licence?

Good question. Let’s come to that in a minute.

I’m just finishing up. The publishers have fought this with mud and FUD and money and suits and lobbying and so on; it’s very difficult doing your research knowing there are people out there with literally millions of dollars trying to stop you doing it. So that’s the situation. We’ve got the law, but the law hasn’t been tested. I am allowed to do it according to the law for non-commercial purposes. Elsevier says I can’t, because they can stop me doing it under the law, and we had a big public fight in London – well, a verbal one – with Elsevier, and we won on points on that one because they’ve sort of said “well, maybe, OK”.

Springer are actually OK on this. So our view is we can extract all the facts in a paper and we can also surround them with two hundred characters of context. Now that gives you enormous power. The publishers hate this; they want to get the libraries to sign these rights away. They’re trying to get libraries to buy special text-mining services whose purpose is to stop me using them, right. And when I say me, they do quote me by name, because they say well, if Peter Murray-Rust does this he’s going to melt our servers down. They’ll say there’s no demand for this, for text-mining, but if Peter Murray-Rust does it then our servers will be melted. Right, OK.

Has anybody been in a FUD fight with a large corporation? No-one’s fought Microsoft? Is there anyone from Microsoft here?

The point is that if we take this title – that’s what’s called an entity – then everything that’s here could be regarded as the context from which we established the facts. So now, coming back to what you asked – it’s untested in court – we’re going to put this out not as a resource, but because I’m a responsible scientist I have to publish all my research data. If I mine all this stuff from the literature as a research scientist, I have to make sure the rest of the world can see what I’m doing. I will be putting it out under CC0, which means anyone can do whatever they like, and if they want to commercialise it, that’s their problem not mine. So that’s the position at the moment. We’re going live today. I hoped Richard would be here and that we would have switched it on; it might be now or it might be tomorrow, but it’s that sort of timescale. We’re not getting the thirty-thousand articles; we’re going to start with PLoS, which is primarily a biological journal, and do that one.
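The fact-plus-context pattern he describes – an extracted entity surrounded by two hundred characters of the source text – is simple to sketch. This is an illustration of the idea only, with invented function and field names, not ContentMine’s output format:

```python
def extract_with_context(text, entity, window=200):
    """Find each occurrence of an entity and keep up to `window` characters
    of surrounding text as its context (the fact-plus-context pattern)."""
    results = []
    start = text.find(entity)
    while start != -1:
        lo = max(0, start - window // 2)
        hi = min(len(text), start + len(entity) + window // 2)
        results.append({"entity": entity, "context": text[lo:hi]})
        start = text.find(entity, start + 1)
    return results

paper = "... strains of Bacillus subtilis were isolated from soil samples ..."
facts = extract_with_context(paper, "Bacillus subtilis", window=40)
print(facts[0]["context"])
```

The legal subtlety sits in that `window` parameter: the claim is that a bare fact plus a short snippet of context is not a substitute for the copyrighted article, which is why the context is capped rather than unbounded.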

So I’ll stop talking and invite questions and comments and so on.

I’ve not transcribed the Q&A, but this link takes you straight to the start of them on the video.

My Comments and Views

  • The new UK copyright law is one thing; but there’s nothing to stop the publishers preventing access, for example by rate-limiting access to their databases, or by blocking people who they consider have downloaded too much. Without the co-operation of the major publishers, this kind of work is going to turn into an arms race between publishers working out how to protect their data and researchers getting ever more inventive at circumventing their security.
  • During the talk Peter Murray-Rust said that the University of Cambridge is one of the few places where there is access to much of the scientific literature. When I came to Cambridge University in 2001 the university was in dispute with Elsevier and access within the university was poor until negotiations concluded. Even now those within the University frequently face paywalls for certain journals.
  • As mentioned when referring to the publisher requesting a photo of an equation, we shouldn’t be losing the structured data when material gets published. We need to get scientists releasing what they produce in structured forms so we don’t need image analysis software to extract it and turn it back into a form in which further work can be done on it.
  • When Peter Murray-Rust said:

    I have to publish all my research data so if I mine all this stuff from the literature as a research scientist I have to make sure the rest of the world can see what I’m doing. I will be putting it out under CC0, which means anyone can do whatever they like

    I did wonder if he meant he has a responsibility to make the raw material he downloads from the publishers available to others, so they can reproduce his work and, if they like, run his scripts for themselves. Now that really would be a disruptive action.

  • What’s the security like at Peter Murray-Rust’s lab if he’s going to have copies of all this valuable material inside? Anyone thinking of liberating the raw content will probably have their aspirations chilled, though, by the experience of activist Aaron Swartz, who killed himself after he was charged with a range of offences merely for downloading academic journal articles in bulk.
See Also

I have written on this subject before, for example:
