
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file #8979

Merged 1 commit on Sep 30, 2017

Conversation

@timvandermeij (Contributor) commented Sep 30, 2017

Moreover, adjust one linked test case that did not conform to the standard Web Archive URL format, and another whose link was dead.

Fixes #8920.
Fixes #8412.

@SamyCookie commented Sep 30, 2017

Hi, I tried the tests locally just now and got this:

Downloading http://web.archive.org/save/_embed/http://210.243.166.143/prob1.pdf to pdfs/issue8169.pdf...
/home/samuel/src/pdf.js/test/downloadutils.js:34
    var id = regex.exec(url)[0];
                            ^

TypeError: Cannot read property '0' of null
    at downloadFile (/home/samuel/src/pdf.js/test/downloadutils.js:34:29)
    at downloadNext (/home/samuel/src/pdf.js/test/downloadutils.js:103:5)
    at /home/samuel/src/pdf.js/test/downloadutils.js:110:7
    at WriteStream.<anonymous> (/home/samuel/src/pdf.js/test/downloadutils.js:76:9)
    at emitNone (events.js:110:20)
    at WriteStream.emit (events.js:207:7)
    at finishMaybe (_stream_writable.js:603:14)
    at afterWrite (_stream_writable.js:455:3)
    at onwrite (_stream_writable.js:445:7)
    at fs.js:2189:5

// Web Archive URLs need to be transformed to add `if_` after the ID.
// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;

Minor optimization: make the regex global, outside downloadFile, to avoid compiling it each time?

@Snuffleupagus (Collaborator) left a comment

Thanks for doing this, but I'd like us to be a bit more explicit about what URLs we attempt to rewrite (and have a bit more validation as well). Could we e.g. do something along these lines?

diff --git a/test/downloadutils.js b/test/downloadutils.js
index ef145416..f1b977fd 100644
--- a/test/downloadutils.js
+++ b/test/downloadutils.js
@@ -22,7 +22,19 @@ var crypto = require('crypto');
 var http = require('http');
 var https = require('https');
 
+function rewriteInternetArchiveUrl(url) {
+  var urlParts =
+    /(^https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)/g.exec(url);
+  if (urlParts) {
+    return urlParts[1] + (urlParts[2] + 'if_') + urlParts[3];
+  }
+  return url;
+}
+
 function downloadFile(file, url, callback, redirects) {
+  // Comment here...
+  url = rewriteInternetArchiveUrl(url);
+
   var completed = false;
   var protocol = /^https:\/\//.test(url) ? https : http;
   protocol.get(url, function (response) {
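
Standalone, the helper suggested in the diff above behaves like this. A minimal sketch: the `g` flag is dropped here since each call runs `exec` only once, and the sample URLs are taken from the download log later in this thread (the `example.com` URL is illustrative):

```javascript
// Sketch of the suggested helper: insert `if_` after the numeric
// snapshot ID so web.archive.org serves the raw file instead of an
// HTML wrapper page. Non-matching URLs pass through unchanged.
function rewriteInternetArchiveUrl(url) {
  var urlParts =
    /(^https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)/.exec(url);
  if (urlParts) {
    return urlParts[1] + (urlParts[2] + 'if_') + urlParts[3];
  }
  return url;
}

var rewritten = rewriteInternetArchiveUrl(
  'https://web.archive.org/web/20150915102004/' +
  'http://www.cplusplus.com/files/tutorial.pdf');
// -> 'https://web.archive.org/web/20150915102004if_/http://www.cplusplus.com/files/tutorial.pdf'

var untouched = rewriteInternetArchiveUrl('http://example.com/direct.pdf');
// -> unchanged
```

Note that an already-rewritten URL (with `if_` after the ID) no longer matches the regex, so a second pass leaves it alone, which addresses the idempotence concern raised below the diff.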

// Web Archive URLs need to be transformed to add `if_` after the ID.
// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;

Won't this regex match already rewritten URLs as well?
Also, I'd like us to be a bit more specific about what URL format we expect; more elsewhere.

// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;
var id = regex.exec(url)[0];
url = url.replace(regex, id + 'if_');

Isn't it sufficient to just do exec, i.e. couldn't you just assemble the URL manually based on the result from a more "complete" regex?

// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;
var id = regex.exec(url)[0];

If the regex doesn't match anything, this will break (because regex.exec(url) === null).
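
The failure is straightforward to reproduce: `exec` returns `null` for a non-matching string, and indexing that result throws a TypeError, which is the crash in the stack trace above. A sketch using the non-archived URL from the log:

```javascript
var regex = /web\.archive\.org\/web\/(\d+)/g;

// A URL that never went through the Web Archive does not match:
var result = regex.exec('http://210.243.166.143/prob1.pdf');
console.log(result);  // null

// Indexing the null result throws a TypeError, as in the stack trace:
var threw = false;
try {
  var id = result[0];
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw);  // true
```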

@timvandermeij (Contributor, Author)

The comments are fixed in the new commit, but I'm still doing a full run for verification.

@SamyCookie

No JS error for me now, but some of the downloaded "PDF files" are still HTML pages :(

@timvandermeij (Contributor, Author)

Yes, I noticed that too. For example jai.pdf, which gives null for urlParts even though the regex works properly when I execute it manually in Firefox... I'm looking into it.

@timvandermeij (Contributor, Author) commented Sep 30, 2017

I printed all the transformed URLs and noticed that every second URL is not edited. Here is a part of the output:

https://web.archive.org/web/20151019020657if_/https://www.erowid.org/archive/rhodium/chemistry/3base/piperonal.pepper/piperine.pepper/465e03piperine.pdf
http://web.archive.org/web/20150922201828/http://w2.eff.org/legal/cases/betamax/betamax_oral_argument2.pdf
http://web.archive.org/web/20160125220711if_/http://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
http://web.archive.org/web/20120204112920/http://h20000.www2.hp.com/bc/docs/support/SupportManual/bpl13210/bpl13210.pdf
https://web.archive.org/web/20150915102004if_/http://www.cplusplus.com/files/tutorial.pdf
http://web.archive.org/web/20150212141833/http://geothermal.inel.gov/publications/future_of_geothermal_energy.pdf
https://web.archive.org/web/20111027045653if_/http://agb.traviangames.com/Travian_AR_Terms.pdf
http://web.archive.org/web/20160213124006/http://www.leon.pl/userfiles/file/Zapytanie%20ofertowe%201-2014.pdf
https://web.archive.org/web/20170107214304if_/https://www.pdf-archive.com/2017/01/07/issue6071/issue6071.pdf
http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf
http://web.archive.org/web/20160414174617if_/http://cache.lego.com/bigdownloads/buildinginstructions/6030672.pdf
https://web.archive.org/web/20170507102908/https://www.mbank.pl/download/firma/Dyspozycja-zmiany-typu-rachunku-biecego.pdf?noredir

If I move the regex inside the function, everything works correctly. It looks like Node either mutates the regex or does not cache it correctly. This really looks like a bug in Node.js, since the regex works just fine when I execute it in Firefox for all links. To resolve this, I simply moved it inside the function; since downloading only happens once, the global regex was a micro-optimization anyway.
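
The alternating failures are actually consistent with documented `RegExp` behavior rather than a Node-specific bug: a regex created with the `g` flag carries its search position in `lastIndex` across `exec` calls, even on different input strings, so a shared instance matches one URL and then resumes mid-string on the next. A minimal sketch using one of the URLs from the output above:

```javascript
// A global regex is stateful: after a successful exec, lastIndex points
// past the match, and the next exec resumes from there -- even when the
// input string is different.
var regex = /web\.archive\.org\/web\/(\d+)/g;
var url = 'http://web.archive.org/web/20150922201828/http://example.com/a.pdf';

var first = regex.exec(url);   // matches; lastIndex now sits past the ID
var second = regex.exec(url);  // resumes from lastIndex, finds nothing

console.log(first !== null);   // true
console.log(second === null);  // true -- every second call fails
console.log(regex.lastIndex);  // 0 -- reset after the failed match

// Creating the regex inside the function (or dropping the g flag)
// avoids the shared state entirely.
```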

…of the PDF file

Moreover, adjust one linked test case that did not conform to the
standard Web Archive URL format and adjust one linked test case because
the link was dead.
@timvandermeij (Contributor, Author)

I just made a fresh clone and downloaded all the files with the new commit applied. All downloaded files passed the hash checks, so this patch is ready for another review.

@Snuffleupagus (Collaborator) left a comment

Works just fine in my testing (it just took a while to re-download all files to verify); thank you for fixing this!
Also, nice to have those two links updated!

@Snuffleupagus Snuffleupagus merged commit f9ce904 into mozilla:master Sep 30, 2017
@timvandermeij timvandermeij deleted the downloads branch October 1, 2017 12:01
movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file