
WSJ downloads failing because of anti-scraping mechanism #178

Open
thisisparker opened this issue Jan 19, 2024 · 5 comments

Comments

@thisisparker
Owner

WSJ is returning 401/403 errors to requests made with the `requests` library, including those from xword-dl. My guess is that this is in response to traffic patterns they're seeing and that they'll turn it off again in due course, but that's a waiting game.

In the meantime, the error message should probably differentiate between this kind of connection error and a parsing error (right now everything is reported as if it were a parsing error).
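
A minimal sketch of what that split could look like, with hypothetical exception names (not xword-dl's actual API):

```python
import requests

class ConnectionBlockedError(Exception):
    """Raised when the server refuses the request (e.g. a 401/403 anti-bot block)."""

class PuzzleParseError(Exception):
    """Raised when the response arrives but can't be parsed into a puzzle."""

def fetch_puzzle_page(url):
    res = requests.get(url)
    if res.status_code in (401, 403):
        # Surface anti-scraping blocks as a connection problem, not a parsing problem.
        raise ConnectionBlockedError(
            f"Got HTTP {res.status_code} from {url}; the site may be blocking scrapers.")
    res.raise_for_status()
    return res.text
```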

@thisisparker
Owner Author

Looking into it: this appears to be operated by a company called Datadome, and they're setting and checking a cookie called `datadome` with a long token value. Theoretically we could supply that value in our requests, much like an auth token, but I'd rather not have to do that. Still hoping this is temporary!
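
For reference, supplying such a cookie with requests would look roughly like this; the token value and URL here are placeholders:

```python
import requests

# Sketch only: send a previously captured datadome cookie along with the request.
# The token is a placeholder; real tokens are long opaque strings that expire.
cookies = {"datadome": "<token copied from a browser session>"}
res = requests.get("https://www.wsj.com/puzzles/crossword", cookies=cookies)
res.raise_for_status()
```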

@thisisparker
Owner Author

Maybe fixed this with #183, though I'm not thrilled about maintaining a list of random cookies required for each site, and I don't know how long datadome cookies last anyway. Leaving open for now 🙄
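
As a rough illustration of that bookkeeping (names and values are placeholders, not the actual #183 code):

```python
# Hypothetical per-site table of extra cookies the scraper has to send.
SITE_COOKIES = {
    "wsj": {"datadome": "<token>"},
}

def cookies_for(site):
    """Return any extra cookies a given site requires (empty dict if none)."""
    return SITE_COOKIES.get(site, {})
```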

@thisisparker
Owner Author

Unsurprisingly, datadome tokens turn out to be very short-lived—on the order of hours, I guess? Maybe back to the drawing board here.

@crosswordnexus

You don't want to just pull from Martin Herbach's site?
http://herbach.dnsalias.com/wsj/wsj240720.puz
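
(The filename appears to follow a wsjYYMMDD.puz pattern, so fetching a given date would be something like the sketch below; this is inferred from the example URL above, not a documented API.)

```python
import datetime
import requests

# Sketch: build the dated filename from the pattern in the example URL.
date = datetime.date(2024, 7, 20)
filename = f"wsj{date.strftime('%y%m%d')}.puz"
res = requests.get(f"http://herbach.dnsalias.com/wsj/{filename}")
res.raise_for_status()
with open(filename, "wb") as f:
    f.write(res.content)
```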

@thisisparker
Owner Author

Nope, not in xword-dl itself. Obviously that's a good option for end users who want it, but I've made the design decision that this tool only uses first-party sources and does its own scraping and parsing.
