Update (08. March 2012): Wiley has informed me that they just fixed the issue.
70 GB of digital content from Wiley’s Major Reference Works can be downloaded for free from Wiley’s web server. This is probably not intended, as Wiley still charges hundreds or even thousands of euros for each one. I notified them of this fact several times during the last two months. They do not seem to mind. In this post I explain how I found out about this, how one can download the content, and why this is probably legal (at least in some jurisdictions).
An unexpected finding
In September 2011, I received a Google Alert about a new document on ‘time domain reflectometry’. The URL itself looked quite interesting:
Going down the rabbit hole (or up the directory tree), one reaches http://onlinelibrary.wiley.com/mrw_content/ which contains around 140 directories. The directory names are mostly acronyms for pretty much all of Wiley’s Major Reference Works, a collection of textbooks on nearly every topic in engineering and the natural sciences.
The total file size is about 70 GB. In many cases the chapters of each book seem to be available as XML, HTML, and PDF.
I was busy at the time and assumed the directory listing would disappear soon. After all, this looked like a simple misconfiguration on the web server. But the directory listing stayed available.
On 25. October 2011, I sent an e-mail to Wiley’s customer service, as I could not find an e-mail address for technical issues. I notified them of the issue and got a reply:
“I have checked and can confirm that the links does allow free access to the articles. I am looking into this matter and have escalated this to our internal Specialist teams for resolution. I will contact you as soon as I have more information. Thank you for your time in informing us about this issue.”
I have not yet received a follow-up email.
On 6. November 2011, I sent a tweet to Wiley, pointing out the issue.
No reply yet.
On 16. November 2011, I sent an e-mail, asking for an update.
No reply yet.
On 21. November 2011, I contacted customer service via Wiley’s live chat and reported the problem. I was told
“we would like to thank you for pointing this out. I will forward to our Web Staff to have them fix this issue.”
I was promised an update.
No update yet.
I guess this means that they do not really care about people downloading these files.
How to mirror the MRWs
There are a lot of good reasons for downloading the MRW files: for learning, for use as a corpus of scientific language, for plagiarism detection, etc.
Downloading these files from the mrw_content directory is almost straightforward:
It seems necessary to first download the main index once – not sure why:
httrack http://onlinelibrary.wiley.com/mrw_content/ -r0 -p0 -c8 -A100000 -v -#L1000000 --updatehack -X0
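Then you can mirror a single reference work like this, where $RW is its directory name and $BASE_URL is the server path without the scheme (presumably onlinelibrary.wiley.com/mrw_content):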
httrack http://$BASE_URL/$RW/ --referer http://onlinelibrary.wiley.com/mrw_content/ -c8 -A1000000 -#L1000000 -v --updatehack --update -X0 +$BASE_URL/$RW/*
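The command above leaves $BASE_URL and $RW undefined. Here is a minimal sketch of how they might be filled in, assuming $BASE_URL is the host plus directory without the scheme (the command prepends "http://" itself). The helper name mirror_cmd and the directory name example_rw are hypothetical. The function only prints the command, so you can check it before running it.

```shell
#!/bin/sh
# Assumption: BASE_URL is host plus path, without "http://".
BASE_URL="onlinelibrary.wiley.com/mrw_content"

# Print (do not run) the httrack command for one reference work directory.
# mirror_cmd is a hypothetical helper, not part of httrack.
mirror_cmd() {
    RW="$1"
    printf '%s\n' "httrack http://$BASE_URL/$RW/ --referer http://onlinelibrary.wiley.com/mrw_content/ -c8 -A1000000 -#L1000000 -v --updatehack --update -X0 +$BASE_URL/$RW/*"
}

# "example_rw" is a placeholder, not a real directory name on the server.
mirror_cmd example_rw
```

If you paste the printed line into a shell, quote the trailing +…/* filter so the shell does not try to expand the asterisk.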
Note: httrack severely limits your download speed to avoid DoSing the server. That is probably the right behavior. However, you can raise the limit with a command-line parameter or use a tool like wget instead.
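For the wget route, a rough equivalent could look like the sketch below. I have not tested it against this server. The one-second wait and the 500k rate limit are arbitrary values I picked to stay polite, and wget_cmd and example_rw are hypothetical names. As above, the function only prints the command.

```shell
#!/bin/sh
# Assumption: same base path as in the httrack commands, without "http://".
BASE_URL="onlinelibrary.wiley.com/mrw_content"

# Print (do not run) a throttled recursive wget command for one directory.
# wget_cmd is a hypothetical helper; the wait and rate values are assumptions.
wget_cmd() {
    RW="$1"
    printf '%s\n' "wget --recursive --no-parent --level=inf --continue --wait=1 --limit-rate=500k --referer=http://$BASE_URL/ http://$BASE_URL/$RW/"
}

# "example_rw" is a placeholder, not a real directory name on the server.
wget_cmd example_rw
```

--no-parent keeps wget inside the chosen directory, and --wait together with --limit-rate does by hand what httrack's built-in limits do for you.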
Is this legal?
Probably. Wiley’s robots.txt allows almost all bots access to all files on the server. Additionally, Wiley did not protect those files, even when notified about their availability to the public. In fact, you are doing the same thing Google’s crawlers do all day.
At least in Germany, downloading these files for one’s own scientific purposes seems to be covered by § 53 UrhG. Of course, redistributing these files would be illegal.
So, if you decide to download these files, please do not DoS the server and please do not redistribute them.