HttpClient without initial proxy auth limits web scraping capability #100515

Closed
ksmib opened this issue Apr 2, 2024 · 9 comments

ksmib commented Apr 2, 2024

Description

When using HttpClient with a proxy, it defaults to omitting proxy authentication in its initial CONNECT request, sending the credentials only after receiving a 407 HTTP status code from the proxy.

Unlike with server authentication, where SocketHttpHandler.PreAuthenticate allows for sending credentials preemptively, there is no equivalent option for proxy authentication.

This behavior limits its suitability for web scraping tasks. Every major proxy provider uses backconnecting proxies, which means clients connect to the same address even for different proxies.

These proxy providers temporarily refuse new connections from a client once it has triggered dozens of "407 Proxy Authentication Required" errors in a short time.

For those unfamiliar with the underlying issue, it will appear as though the majority of their requests are failing, each resulting in an HttpRequestException.

The lack of a preemptive proxy authentication feature makes .NET unsuitable for web scraping.

In contrast, curl and Python's urllib send proxy authentication credentials on the initial CONNECT request by default.

I tried manually adding a "Proxy-Authorization" header, but that helps only with HTTP requests, not with HTTPS requests (which are most of the cases), because the headers of the CONNECT request are independent of the request headers.

The only workaround I can think of is a PreAuthenticateProxy option, analogous to SocketHttpHandler.PreAuthenticate for preemptive server authentication, that adds the "Proxy-Authorization" header to the initial CONNECT request.
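
To make the difference concrete, here is a minimal Python sketch of the two CONNECT variants as they would appear on the wire, assuming Basic auth. The helper names, the target host, and the credentials are hypothetical and chosen only for illustration; this is not how HttpClient or any particular library builds the request.

```python
import base64
from typing import Optional


def basic_proxy_auth(user: str, password: str) -> str:
    """Return the value of a Basic Proxy-Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_connect(target_host: str, target_port: int,
                  proxy_auth: Optional[str] = None) -> bytes:
    """Build a CONNECT request, optionally carrying proxy credentials."""
    lines = [
        f"CONNECT {target_host}:{target_port} HTTP/1.1",
        f"Host: {target_host}:{target_port}",
    ]
    if proxy_auth is not None:
        # Preemptive variant: credentials go out with the very first CONNECT.
        lines.append(f"Proxy-Authorization: {proxy_auth}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Hypothetical target and credentials.
first_try = build_connect("example.com", 443)  # would be answered with 407
preemptive = build_connect("example.com", 443, basic_proxy_auth("user", "secret"))
print(preemptive.decode("ascii"))
```

The first variant is what a client sends when it waits for a 407 before authenticating; the second is what a PreAuthenticateProxy-style option would send up front.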

@ksmib ksmib added the tenet-performance Performance related issue label Apr 2, 2024

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 2, 2024
ManickaP (Member) commented Apr 3, 2024

@wfurt could you help here? This is beyond my expertise.

wfurt (Member) commented Apr 5, 2024

In essence this looks like a duplicate of #66244. Do you expect basic auth for the proxy, @ksmib, or do you expect something else?

Besides just sending the credentials, we could have a separate property, as mentioned in #66244. Also, @blowdart and @GrabYourPitchforks may have some take on this.

Since the proxy is generally a single site, I feel it would be reasonable to remember whether it required authentication and send the credentials on subsequent requests. Even with basic auth, the damage has already been done by then. That would at least eliminate the issue with many 407s.

ksmib (Author) commented Apr 5, 2024

Yes, basic auth is what these proxy providers use.
It would seem unreasonable to me for them to require a less common auth scheme while also banning clients for too many 407s.

The author of #66244 talks about the extra round trip, which doesn't bother me much. The critical issue, as mentioned by another user, matteocontrini, is the blacklist situation.

I think remembering proxy auth is a good approach as well, alongside PreAuthenticateProxy.
However, backconnecting proxies use the same address but different ports for different proxies, so we need separate HttpClient instances for different proxies.
If the remember feature works per HttpClient, it might still lead to bans due to too many initial 407s in larger-scale scraping.
In my test case, bans occurred after 50-100 407 responses within a few seconds.

ManickaP (Member) commented Apr 8, 2024

> remember feature works per HttpClient

Yes, it's even per connection pool, i.e. different proxy URIs lead to different caches even on the same client. So if you're connecting via a different port to the same proxy, pre-auth would not help here. @wfurt, please correct me if I'm reading the code incorrectly.
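
The effect of the cache key can be illustrated with a small Python model. It is only a sketch under stated assumptions: connections are opened sequentially (no start-up races), the provider host and port range are hypothetical, and keying by host alone assumes the same credentials are valid on every port. It does not reflect the actual HttpClient implementation.

```python
from typing import Callable, Hashable, Iterable, Tuple

Proxy = Tuple[str, int]  # (host, port)


def count_407s(connections: Iterable[Proxy],
               cache_key: Callable[[Proxy], Hashable]) -> int:
    """Count 407 responses when credentials are sent only after a 407
    has already been seen for the same cache key."""
    known = set()
    challenges = 0
    for proxy in connections:
        key = cache_key(proxy)
        if key not in known:
            challenges += 1  # first contact: no credentials sent, proxy answers 407
            known.add(key)
    return challenges


# Hypothetical backconnecting provider: one host, one port per proxy.
proxies = [("proxy.example.net", 10000 + i) for i in range(100)]
connections = proxies * 3  # each proxy is used for three connections

per_uri = count_407s(connections, cache_key=lambda p: p)      # keyed by host and port
per_host = count_407s(connections, cache_key=lambda p: p[0])  # keyed by host only
print(per_uri, per_host)  # prints "100 1"
```

In this model, a cache keyed by the full proxy URI still produces one 407 per port, while a key that ignores the port produces a single 407 for the whole provider.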

@rzikm rzikm self-assigned this Apr 9, 2024
rzikm (Member) commented Apr 9, 2024

Triage: may be related to #93340 which we aim to do something about in 9.0, so let's try to include this one as well.

@rzikm rzikm added this to the 9.0.0 milestone Apr 9, 2024
@rzikm rzikm removed the untriaged New issue has not been triaged by the area owner label Apr 9, 2024
ksmib (Author) commented Apr 10, 2024

> Triage: may be related to #93340 which we aim to do something about in 9.0, so let's try to include this one as well.

I'm not sure CredentialCache would solve this issue, since @ManickaP pointed out that the cache works per connection pool. The proxies share the same host but use different ports, which requires multiple HttpClient instances.

wfurt (Member) commented Apr 10, 2024

That may be what the code does today. Even for servers, HttpClient keeps a cache of URL paths that need pre-authentication. We can do something similar for proxies across the pools. In my mind, the first connection would get a 407 and authenticate, but all subsequent connections to the same proxy would carry the Proxy-Authorization header, i.e. for each HttpClient/HttpHandler you would see one failed attempt for any given proxy (plus any race conditions on start).

That should IMHO avoid "triggering dozens of "407 Proxy Authentication Required" errors in a short time".
But are we talking about a few different proxies? Hundreds? Thousands? And are they cooperating, e.g. would one 407 for a given proxy also count for different proxies?

While I feel we can make some improvements, I also think this is somewhat of a corner case, i.e. low priority compared to some of the other issues.

ksmib (Author) commented Apr 11, 2024

> That may be what the code does today. Even for servers, HttpClient keeps a cache of URL paths that need pre-authentication. We can do something similar for proxies across the pools. In my mind, the first connection would get a 407 and authenticate, but all subsequent connections to the same proxy would carry the Proxy-Authorization header, i.e. for each HttpClient/HttpHandler you would see one failed attempt for any given proxy (plus any race conditions on start).
>
> That should IMHO avoid "triggering dozens of "407 Proxy Authentication Required" errors in a short time". But are we talking about a few different proxies? Hundreds? Thousands? And are they cooperating, e.g. would one 407 for a given proxy also count for different proxies?
>
> While I feel we can make some improvements, I also think this is somewhat of a corner case, i.e. low priority compared to some of the other issues.

Got it. When web scraping, we mostly use proxy providers that charge by bandwidth rather than by number of proxies, so it's common to use hundreds of proxies. In my scenario, using about 150 proxies at once quickly led to a blacklist situation. This ban affects the connecting IP, stopping new connections for the remaining proxies.

With the new implementation, while the blacklist issue might not disappear entirely, it's likely to only happen initially and then stabilize, which is a significant improvement over the current state.

@ksmib ksmib closed this as completed Apr 11, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 12, 2024