Use case HPC Cloud: Academic piracy

Balázs Bodó (UvA) investigated the underground circulation of scholarly books via illegal archives of academic publications. For gathering, processing and analysing his data on millions of downloads, SURF’s HPC Cloud service proved to be “a dream scenario”.

Library of the University of Copenhagen

“My motivation for this research project comes from my time as a professor at the Technical University in Budapest. Hungary is of course part of the European Union, and my students would be competing on the European job market. But we had a problem: for financial reasons, the best English-language books used in Western higher education were not available in the local university libraries. My choice was either to give my students a second-rate education or to tell them: go to this website and download the book illegally. And, as our recently published study Shadow Libraries proved, my students and I were not alone in this dilemma.”

Shadow libraries

Balázs Bodó

Balázs Bodó, PhD (Hungary, 1975) is an economist and piracy researcher at the Institute for Information Law (IViR) at the University of Amsterdam. In a soon-to-be-published article, Bodó writes about the underground circulation of scholarly books via so-called shadow libraries: illegal archives of scholarly journal articles, textbooks, monographs, and other forms of academic work. “Services such as LibGen and SciHub provide free, unrestricted, copyright-infringing access to more than 100 million paywalled scholarly articles, and millions of books for anyone interested.”

One of the administrators of a prominent shadow library provided Bodó with a dataset. Bodó's team used it to map both the supply of and the demand for academic monographs, textbooks and other learning material. “Our primary findings suggest that scholarly book piracy is a ubiquitous global phenomenon.”

Black market

The rapid global growth in demand for scholarly works and the tightening financial conditions of higher education coincided with a rapid concentration and commercialization of Western scholarly publishing, explains Bodó. “The publishers who control such key resources are able to charge excessive access fees despite the fact that every other input for these journals (the articles themselves, peer review, etc.) is provided by the academic community for free. These twin developments of rapidly rising costs and rapidly rising demand coincided with the widespread availability of increasingly cheap digital reproduction technologies.”

World map showing the number of downloads of scholarly books

The number of downloads of scholarly books per day per capita (million) via a shadow library in 2015

The other side of the coin

Copyright lawyers, including Bodó’s colleagues at the IViR, will say the law forbids downloading copies of academic books from pirate websites. “I wanted to demonstrate that these downloads have big political and economic implications. If more people from the developing world, from India, Brazil, Eastern Europe, have access to academic knowledge, that’s a good thing in my opinion. That is the other side of the coin. These shadow libraries facilitate an unprecedented knowledge transfer on a global scale, where millions of people are reading and learning useful stuff about all fields of science. I wanted to show my colleagues that this is the reality. My claim is that you can’t develop sane laws and publishing strategies if you don’t know the reality of the black market.”

Interestingly, though, the research data show that the biggest per capita users are still the high-income North American and European countries, whose users could probably have had legal access through their institutions. Bodó suspects that in these regions the convenient one-click access that shadow libraries provide to full digital copies plays a role. The huge number of illegal downloads also means that measuring only legal downloads through university libraries misrepresents the impact of academic publications.

"At first, it was unclear to me how and where to get this kind of IT infrastructure. We spent some time investigating if we should build our own server. Finally, a colleague told me about SURF. That was such a relief, the HPC Cloud service was exactly what I needed."
Dr. Balázs Bodó, economist and piracy researcher (IViR)

Big data in legal science

“The Technical University in Budapest, where I worked before I came to the Netherlands in 2013, was a very tech-friendly environment. All the infrastructure you needed was there.” Laughs: “At the UvA law faculty, we are slowly changing to better deal with technology-heavy research. Traditionally, legal research was mostly desk research,” Bodó says, pointing to his bookcase. “New research topics such as artificial intelligence, digital information, and online piracy require us to learn new research methods as well. There is a constant demand for legal researchers who know how to code and use advanced statistical, machine learning, text and data mining methods. The big data revolution has reached the legal profession and we need to prepare for that.”

For his research into the use of pirate libraries, Bodó worked with a dataset consisting of tens of millions of download records for 1.5 million books. “I also needed to enrich this dataset. I wanted to get the bibliographic metadata of the downloaded books (author, year of publication, and format), their legal availability in libraries and bookstores, their prices, and the geographical location of the IP address. This required a lot of web scraping. I had to write and test scrapers and manage the information somewhere. This continuous operation ran for months on end, and required good connectivity and availability of IT resources. I also needed a space where I could collaborate with my students and colleagues on the data analysis.”
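To give a rough idea of what such an enrichment step can look like, here is a minimal Python sketch. It is not the code used in the study: the record layout, the function names `enrich` and `downloads_per_million`, and all sample values (including the population figures) are made up for illustration. It joins download records with bibliographic metadata and then counts downloads per million inhabitants for each country.

```python
from collections import Counter

# Hypothetical download log as (book_id, country_code) pairs. In the study the
# location was derived from the IP address; here the country is given directly.
downloads = [
    ("b1", "NL"), ("b1", "HU"), ("b2", "HU"), ("b2", "IN"),
    ("b1", "IN"), ("b3", "NL"), ("b2", "HU"),
]

# Hypothetical bibliographic metadata keyed by book ID (b3 is deliberately missing).
metadata = {
    "b1": {"author": "Author A", "year": 2009, "format": "pdf"},
    "b2": {"author": "Author B", "year": 2012, "format": "epub"},
}

# Illustrative round numbers, in millions of inhabitants; not real statistics.
population_millions = {"NL": 17.0, "HU": 10.0, "IN": 1300.0}


def enrich(records, meta):
    """Attach metadata to every download; unknown books get None fields."""
    enriched = []
    for book_id, country in records:
        info = meta.get(book_id, {})
        enriched.append({
            "book_id": book_id,
            "country": country,
            "author": info.get("author"),
            "year": info.get("year"),
            "format": info.get("format"),
        })
    return enriched


def downloads_per_million(records, population):
    """Count downloads per country and divide by population in millions."""
    counts = Counter(record["country"] for record in records)
    return {
        country: counts[country] / population[country]
        for country in counts
        if country in population
    }


enriched = enrich(downloads, metadata)
per_million = downloads_per_million(enriched, population_millions)
for country in sorted(per_million, key=per_million.get, reverse=True):
    print(country, round(per_million[country], 4))
```

Dividing by population is what allows a small country to rank above a large one with the same number of downloads. At the scale described in the article, a step like this would have to run against a database or a data analysis environment rather than in-memory lists.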

“At first, it was unclear to me how and where to get this kind of IT infrastructure. We spent some time investigating if we should build our own server. Finally, a colleague told me about SURF. That was such a relief, the HPC Cloud service was exactly what I needed. Not only the technical side, but also the support from SURFsara was a dream scenario: to have someone advise me on how to optimally use the services for my research needs. Once I had discovered HPC Cloud, I also started using other SURF services, such as SURFdrive. We also set up an R server and a Jupyter notebook server, so I could see what my team was doing and we could work collaboratively from anywhere. Such a national infrastructure is super useful.”

Author: Josje Spinhoven

Photo: Library of the University of Copenhagen, Eric Mueller, Flickr CC


The study 'The Science of Piracy, the Piracy of Science' has been published on the Kluwer Copyright Blog.

This article was published in SURF Magazine June 2019

SURF Magazine NR 02-2019