
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file #8979

Merged 1 commit on Sep 30, 2017

Conversation

@timvandermeij (Contributor) commented Sep 30, 2017

Moreover, adjust one linked test case that did not conform to the standard Web Archive URL format, and another whose link was dead.

Fixes #8920.
Fixes #8412.

@SamyCookie commented Sep 30, 2017

Hi, I tried the tests locally just now and got this:

Downloading http://web.archive.org/save/_embed/http://210.243.166.143/prob1.pdf to pdfs/issue8169.pdf...
/home/samuel/src/pdf.js/test/downloadutils.js:34
    var id = regex.exec(url)[0];
                            ^

TypeError: Cannot read property '0' of null
    at downloadFile (/home/samuel/src/pdf.js/test/downloadutils.js:34:29)
    at downloadNext (/home/samuel/src/pdf.js/test/downloadutils.js:103:5)
    at /home/samuel/src/pdf.js/test/downloadutils.js:110:7
    at WriteStream.<anonymous> (/home/samuel/src/pdf.js/test/downloadutils.js:76:9)
    at emitNone (events.js:110:20)
    at WriteStream.emit (events.js:207:7)
    at finishMaybe (_stream_writable.js:603:14)
    at afterWrite (_stream_writable.js:455:3)
    at onwrite (_stream_writable.js:445:7)
    at fs.js:2189:5

// Web Archive URLs need to be transformed to add `if_` after the ID.
// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;

Minor optimization: make the regex global, outside downloadFile, to avoid compiling it each time?

@Snuffleupagus (Collaborator) left a comment

Thanks for doing this, but I'd like us to be a bit more explicit about what URLs we attempt to rewrite (and have a bit more validation as well). Could we e.g. do something along these lines?

diff --git a/test/downloadutils.js b/test/downloadutils.js
index ef145416..f1b977fd 100644
--- a/test/downloadutils.js
+++ b/test/downloadutils.js
@@ -22,7 +22,19 @@ var crypto = require('crypto');
 var http = require('http');
 var https = require('https');
 
+function rewriteInternetArchiveUrl(url) {
+  var urlParts =
+    /(^https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)/g.exec(url);
+  if (urlParts) {
+    return urlParts[1] + (urlParts[2] + 'if_') + urlParts[3];
+  }
+  return url;
+}
+
 function downloadFile(file, url, callback, redirects) {
+  // Comment here...
+  url = rewriteInternetArchiveUrl(url);
+
   var completed = false;
   var protocol = /^https:\/\//.test(url) ? https : http;
   protocol.get(url, function (response) {
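
Standalone, the helper suggested in the diff above behaves like this. A minimal sketch: the `g` flag is dropped here since each call runs `exec` only once, and the sample URLs are taken from the download log later in this thread (the `example.com` URL is illustrative):

```javascript
// Sketch of the suggested helper: insert `if_` after the numeric
// snapshot ID so web.archive.org serves the raw file instead of an
// HTML wrapper page. Non-matching URLs pass through unchanged.
function rewriteInternetArchiveUrl(url) {
  var urlParts =
    /(^https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)/.exec(url);
  if (urlParts) {
    return urlParts[1] + (urlParts[2] + 'if_') + urlParts[3];
  }
  return url;
}

var rewritten = rewriteInternetArchiveUrl(
  'https://web.archive.org/web/20150915102004/' +
  'http://www.cplusplus.com/files/tutorial.pdf');
// -> 'https://web.archive.org/web/20150915102004if_/http://www.cplusplus.com/files/tutorial.pdf'

var untouched = rewriteInternetArchiveUrl('http://example.com/direct.pdf');
// -> unchanged
```

Note that an already-rewritten URL (with `if_` after the ID) no longer matches the regex, so a second pass leaves it alone, which addresses the idempotence concern raised below the diff.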

// Web Archive URLs need to be transformed to add `if_` after the ID.
// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;

Won't this regex match already rewritten URLs as well?
Also, I'd like us to be a bit more specific about what URL format we expect; more elsewhere.

// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;
var id = regex.exec(url)[0];
url = url.replace(regex, id + 'if_');

Isn't it sufficient to just do exec, i.e. couldn't you just assemble the URL manually based on the result from a more "complete" regex?

// Without this, an HTML page containing an iframe with the PDF file
// will be served instead (issue 8920).
var regex = /web\.archive\.org\/web\/(\d+)/g;
var id = regex.exec(url)[0];

If the regex doesn't match anything, this will break (because regex.exec(url) === null).
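
The failure is straightforward to reproduce: `exec` returns `null` for a non-matching string, and indexing that result throws a TypeError, which is the crash in the stack trace above. A sketch using the non-archived URL from the log:

```javascript
var regex = /web\.archive\.org\/web\/(\d+)/g;

// A URL that never went through the Web Archive does not match:
var result = regex.exec('http://210.243.166.143/prob1.pdf');
console.log(result);  // null

// Indexing the null result throws a TypeError, as in the stack trace:
var threw = false;
try {
  var id = result[0];
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw);  // true
```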

@timvandermeij (Contributor, Author)

The comments are fixed in the new commit, but I'm still doing a full run for verification.

@SamyCookie

No JS error for me now, but some of the downloaded "PDF files" are still HTML pages :(

@timvandermeij (Contributor, Author)

Yes, I noticed that too. For example jai.pdf, which gives null for urlParts even though the regex works properly when I execute it manually in Firefox... I'm looking into it.

@timvandermeij (Contributor, Author) commented Sep 30, 2017

I printed all the transformed URLs and noticed that every second URL is not edited. Here is a part of the output:

https://web.archive.org/web/20151019020657if_/https://www.erowid.org/archive/rhodium/chemistry/3base/piperonal.pepper/piperine.pepper/465e03piperine.pdf
http://web.archive.org/web/20150922201828/http://w2.eff.org/legal/cases/betamax/betamax_oral_argument2.pdf
http://web.archive.org/web/20160125220711if_/http://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf
http://web.archive.org/web/20120204112920/http://h20000.www2.hp.com/bc/docs/support/SupportManual/bpl13210/bpl13210.pdf
https://web.archive.org/web/20150915102004if_/http://www.cplusplus.com/files/tutorial.pdf
http://web.archive.org/web/20150212141833/http://geothermal.inel.gov/publications/future_of_geothermal_energy.pdf
https://web.archive.org/web/20111027045653if_/http://agb.traviangames.com/Travian_AR_Terms.pdf
http://web.archive.org/web/20160213124006/http://www.leon.pl/userfiles/file/Zapytanie%20ofertowe%201-2014.pdf
https://web.archive.org/web/20170107214304if_/https://www.pdf-archive.com/2017/01/07/issue6071/issue6071.pdf
http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf
http://web.archive.org/web/20160414174617if_/http://cache.lego.com/bigdownloads/buildinginstructions/6030672.pdf
https://web.archive.org/web/20170507102908/https://www.mbank.pl/download/firma/Dyspozycja-zmiany-typu-rachunku-biecego.pdf?noredir

If I move the regex inside the function, everything works correctly. It looks like Node either mutates the regex or does not cache it correctly. This really looks like a bug in Node.js, since the regex works just fine when I execute it in Firefox for all links. To resolve this, I simply moved it inside the function; since downloading only happens once, the global regex was a micro-optimization anyway.
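
The alternating failures are actually consistent with documented `RegExp` behavior rather than a Node-specific bug: a regex created with the `g` flag carries its search position in `lastIndex` across `exec` calls, even on different input strings, so a shared instance matches one URL and then resumes mid-string on the next. A minimal sketch using one of the URLs from the output above:

```javascript
// A global regex is stateful: after a successful exec, lastIndex points
// past the match, and the next exec resumes from there -- even when the
// input string is different.
var regex = /web\.archive\.org\/web\/(\d+)/g;
var url = 'http://web.archive.org/web/20150922201828/http://example.com/a.pdf';

var first = regex.exec(url);   // matches; lastIndex now sits past the ID
var second = regex.exec(url);  // resumes from lastIndex, finds nothing

console.log(first !== null);   // true
console.log(second === null);  // true -- every second call fails
console.log(regex.lastIndex);  // 0 -- reset after the failed match

// Creating the regex inside the function (or dropping the g flag)
// avoids the shared state entirely.
```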

…of the PDF file

Moreover, adjust one linked test case that did not conform to the
standard Web Archive URL format and adjust one linked test case because
the link was dead.
@timvandermeij (Contributor, Author)

I just made a fresh clone and downloaded all the files with the new commit applied. All downloaded files passed the hash checks, so this patch is ready for another review.

@Snuffleupagus (Collaborator) left a comment

Works just fine in my testing (it just took a while to re-download all files to verify); thank you for fixing this!
Also, nice to have those two links updated!

@Snuffleupagus Snuffleupagus merged commit f9ce904 into mozilla:master Sep 30, 2017
@timvandermeij timvandermeij deleted the downloads branch October 1, 2017 12:01
movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file