
WSJ downloads failing because of anti-scraping mechanism #178

Open
thisisparker opened this issue Jan 19, 2024 · 5 comments

Comments

@thisisparker
Owner

WSJ is returning 401/403 errors to requests made with the `requests` library, including those from xword-dl. My guess is that this is in response to traffic patterns they're seeing and that they'll turn it off again in due course, but that's a waiting game.

In the meantime, the error message should probably differentiate between this kind of connection error and a parsing error (right now everything is reported as if it were a parsing error).
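
A minimal sketch of what that split could look like, with hypothetical exception names (not xword-dl's actual API):

```python
import requests

class ConnectionBlockedError(Exception):
    """Raised when the server refuses the request (e.g. a 401/403 anti-bot block)."""

class PuzzleParseError(Exception):
    """Raised when the response arrives but can't be parsed into a puzzle."""

def fetch_puzzle_page(url):
    res = requests.get(url)
    if res.status_code in (401, 403):
        # Surface anti-scraping blocks as a connection problem, not a parsing problem.
        raise ConnectionBlockedError(
            f"Got HTTP {res.status_code} from {url}; the site may be blocking scrapers.")
    res.raise_for_status()
    return res.text
```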

@thisisparker
Owner Author

Looking into it: this appears to be operated by a company called Datadome, and they're setting and checking a cookie called `datadome` with a long token value. Theoretically we could supply that value in our requests, much like an auth token, but I'd rather not have to do that. Still hoping this is temporary!
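
For reference, supplying such a cookie with requests would look roughly like this; the token value and URL here are placeholders:

```python
import requests

# Sketch only: send a previously captured datadome cookie along with the request.
# The token is a placeholder; real tokens are long opaque strings that expire.
cookies = {"datadome": "<token copied from a browser session>"}
res = requests.get("https://www.wsj.com/puzzles/crossword", cookies=cookies)
res.raise_for_status()
```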

@thisisparker
Owner Author

Maybe fixed this with #183, though I'm not thrilled about maintaining a list of random cookies required for each site, and I don't know how long datadome cookies last anyway. Leaving open for now 🙄
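
As a rough illustration of that bookkeeping (names and values are placeholders, not the actual #183 code):

```python
# Hypothetical per-site table of extra cookies the scraper has to send.
SITE_COOKIES = {
    "wsj": {"datadome": "<token>"},
}

def cookies_for(site):
    """Return any extra cookies a given site requires (empty dict if none)."""
    return SITE_COOKIES.get(site, {})
```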

@thisisparker
Owner Author

Unsurprisingly, datadome tokens turn out to be very short-lived—on the order of hours, I guess? Maybe back to the drawing board here.

@crosswordnexus

You don't want to just pull from Martin Herbach's site?
http://herbach.dnsalias.com/wsj/wsj240720.puz
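
(The filename appears to follow a wsjYYMMDD.puz pattern, so fetching a given date would be something like the sketch below; this is inferred from the example URL above, not a documented API.)

```python
import datetime
import requests

# Sketch: build the dated filename from the pattern in the example URL.
date = datetime.date(2024, 7, 20)
filename = f"wsj{date.strftime('%y%m%d')}.puz"
res = requests.get(f"http://herbach.dnsalias.com/wsj/{filename}")
res.raise_for_status()
with open(filename, "wb") as f:
    f.write(res.content)
```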

@thisisparker
Owner Author

Nope, not in xword-dl itself. Obviously that's a good option for end users who want it, but I've made the design decision that this tool only uses first-party sources and does its own scraping and parsing.
