WSJ downloads failing because of anti-scraping mechanism #178
Comments
Looking into it: this appears to be operated by a company called Datadome and they're setting and checking a cookie called |
Maybe fixed this with #183, though I'm not thrilled with maintaining a list of random cookies that are required for each site and I don't know how long |
Unsurprisingly, |
You don't want to just pull from Martin Herbach's site? |
Nope, not in |
WSJ is returning 401/403 errors to requests from `requests`, including `xword-dl`. My guess is that this is in response to traffic patterns they're seeing and they will turn it off again in due course, but that's a waiting game.

In the meantime, the error message should probably differentiate between this kind of connection error and a parsing error (which is what everything sounds like now).
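One way the distinction could look, as a rough sketch rather than what `xword-dl` actually does: raise a separate exception type when the server refuses the request, so the message no longer reads like a parsing failure. The exception names, `check_response`, and `load_puzzle` below are all hypothetical, and the sketch takes a plain status code and body so it does not depend on any particular HTTP client.

```python
import json


class PuzzleDownloadError(Exception):
    """Base class for download failures (hypothetical name)."""


class AccessBlockedError(PuzzleDownloadError):
    """The server refused the request, e.g. with HTTP 401 or 403."""


class PuzzleParseError(PuzzleDownloadError):
    """The response arrived but could not be turned into puzzle data."""


# Status codes treated as "the site is refusing us" (an assumption for this sketch).
BLOCKED_STATUSES = {401, 403}


def check_response(status_code, url):
    """Raise a connection-style error before any parsing is attempted."""
    if status_code in BLOCKED_STATUSES:
        raise AccessBlockedError(
            f"{url} returned HTTP {status_code}; "
            "the site may be blocking automated requests."
        )
    if status_code >= 400:
        raise PuzzleDownloadError(f"{url} returned HTTP {status_code}.")


def load_puzzle(status_code, body, url, parse=json.loads):
    """Check the status first, then parse; each failure gets its own error type."""
    check_response(status_code, url)
    try:
        return parse(body)
    except (ValueError, KeyError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        raise PuzzleParseError(
            f"Could not parse puzzle data from {url}: {exc}"
        ) from exc
```

With something like this, a caller could catch `AccessBlockedError` and tell the user the outlet is refusing the request, while `PuzzleParseError` stays reserved for responses that arrived but had unexpected content.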