-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file #8979
Conversation
Hi, I tried tests locally just now and got this :
|
test/downloadutils.js
Outdated
// Web Archive URLs need to be transformed to add `if_` after the ID. | ||
// Without this, an HTML page containing an iframe with the PDF file | ||
// will be served instead (issue 8920). | ||
var regex = /web\.archive\.org\/web\/(\d+)/g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor optimization, make regex
global outside downloadFile
to prevent to compile it each time ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, but I'd like us to be a bit more explicit about what URLs we attempt to rewrite (and have a bit more validation as well). Could we e.g. do something along these lines?
diff --git a/test/downloadutils.js b/test/downloadutils.js
index ef145416..f1b977fd 100644
--- a/test/downloadutils.js
+++ b/test/downloadutils.js
@@ -22,7 +22,19 @@ var crypto = require('crypto');
var http = require('http');
var https = require('https');
+function rewriteInternetArchiveUrl(url) {
+ var urlParts =
+ /(^https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)/g.exec(url);
+ if (urlParts) {
+ return urlParts[1] + (urlParts[2] + 'if_') + urlParts[3];
+ }
+ return url;
+}
+
function downloadFile(file, url, callback, redirects) {
+ // Comment here...
+ url = rewriteInternetArchiveUrl(url);
+
var completed = false;
var protocol = /^https:\/\//.test(url) ? https : http;
protocol.get(url, function (response) {
test/downloadutils.js
Outdated
// Web Archive URLs need to be transformed to add `if_` after the ID. | ||
// Without this, an HTML page containing an iframe with the PDF file | ||
// will be served instead (issue 8920). | ||
var regex = /web\.archive\.org\/web\/(\d+)/g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this regex match already rewritten URLs as well?
Also, I'd like us to be a bit more specific about what URL format we expect; more elsewhere.
test/downloadutils.js
Outdated
// will be served instead (issue 8920). | ||
var regex = /web\.archive\.org\/web\/(\d+)/g; | ||
var id = regex.exec(url)[0]; | ||
url = url.replace(regex, id + 'if_'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it sufficient to just do exec
, i.e. couldn't you just assemble the URL manually based on the result from a more "complete" regex?
test/downloadutils.js
Outdated
// Without this, an HTML page containing an iframe with the PDF file | ||
// will be served instead (issue 8920). | ||
var regex = /web\.archive\.org\/web\/(\d+)/g; | ||
var id = regex.exec(url)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the regex doesn't match anything, this will break (because regex.exec(url) === null
).
bf07727
to
f091dbc
Compare
The comments are fixed in the new commit, but I'm still doing a full run for verification. |
No JS error for me now, but some "PDF file" downloaded are still HTML pages :( |
Yes, I notice that too. For example |
f091dbc
to
e94cd5e
Compare
I printed all the transformed URLs and noticed that every second URL is not edited. Here is a part of the output: https://web.archive.org/web/20151019020657if_/https://www.erowid.org/archive/rhodium/chemistry/3base/piperonal.pepper/piperine.pepper/465e03piperine.pdf http://web.archive.org/web/20150922201828/http://w2.eff.org/legal/cases/betamax/betamax_oral_argument2.pdf http://web.archive.org/web/20160125220711if_/http://www.star.bnl.gov/public/daq/HARDWARE/21264_data_sheet.pdf http://web.archive.org/web/20120204112920/http://h20000.www2.hp.com/bc/docs/support/SupportManual/bpl13210/bpl13210.pdf https://web.archive.org/web/20150915102004if_/http://www.cplusplus.com/files/tutorial.pdf http://web.archive.org/web/20150212141833/http://geothermal.inel.gov/publications/future_of_geothermal_energy.pdf https://web.archive.org/web/20111027045653if_/http://agb.traviangames.com/Travian_AR_Terms.pdf http://web.archive.org/web/20160213124006/http://www.leon.pl/userfiles/file/Zapytanie%20ofertowe%201-2014.pdf https://web.archive.org/web/20170107214304if_/https://www.pdf-archive.com/2017/01/07/issue6071/issue6071.pdf http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf http://web.archive.org/web/20160414174617if_/http://cache.lego.com/bigdownloads/buildinginstructions/6030672.pdf https://web.archive.org/web/20170507102908/https://www.mbank.pl/download/firma/Dyspozycja-zmiany-typu-rachunku-biecego.pdf?noredir If I move the regex inside the function, everything works correctly. It looks like Node either mutates the regex or does not cache it correctly. This really looks like a bug in Node.js, since it works just fine when I execute the regex in Firefox for all links. To resolve this, I just moved it inside; downloading happens only once, so it is a micro-optimization anyway. |
e94cd5e
to
271f212
Compare
…of the PDF file Moreover, adjust one linked test case that did not conform to the standard Web Archive URL format and adjust one linked test case because the link was dead.
271f212
to
f73c9b7
Compare
I just made a fresh clone and downloaded all the files with the new commit applied. All downloaded files passed the hash checks, so this patch is ready for another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works just fine in my testing (it just took a while to re-download all files to verify); thank you for fixing this!
Also, nice to have those two links updated!
Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file
Moreover, adjust one linked test case that did not conform to the standard Web Archive URL format and adjust one linked test case because the link was dead.
Fixes #8920.
Fixes #8412.