
Request url of client session gets malformed #3424

Closed
festinuz opened this issue Dec 3, 2018 · 12 comments
Labels: client, invalid (This doesn't seem right), outdated

Comments

@festinuz

festinuz commented Dec 3, 2018

Long story short

When ClientSession makes an HTTP request and the request URL gets parsed inside the client session (here), some percent-encoded sequences get decoded before the request is executed.

Expected behaviour

Valid URLs should not be changed.

Actual behaviour

URLs that contain percent-encoded characters that are considered "safe" are decoded before the request is sent. This breaks cases where the URL must be passed through verbatim. For example, I'm currently working with an internal API that has an endpoint like GET ../abc/def/{resource_id}, where resource_id can be any string. The problem is that while the server expects a specific encoded string, the only thing I can produce with aiohttp is a malformed id, because percent-encoded sequences get partially decoded.

For example, here is the URL value before and after this code line:

http://192.168.30.11:22000/api/v2/find/content/FB_%2Fquestions.php%3Fquestion_id%3D1187591871380130%26amp%3Brefid%3D18%26amp%3B_ft_%3Dqid.6624257984876347865%253Amf_story_key.1187591881380129%253Atop_level_post_id.1187591881380129%253Atl_objid.1187591881380129%253Acontent_owner_id_new.100003399030447%253Asrc.22%253Astory_location.6%253Astory_attachment_style.question%26amp%3B__tn__%3D-R

http://192.168.30.11:22000/api/v2/find/content/FB_%2Fquestions.php%3Fquestion_id=1187591871380130&refid=18&_ft_=qid.6624257984876347865%253Amf_story_key.1187591881380129%253Atop_level_post_id.1187591881380129%253Atl_objid.1187591881380129%253Acontent_owner_id_new.100003399030447%253Asrc.22%253Astory_location.6%253Astory_attachment_style.question&__tn__=-R

I'm not 100% sure the problem originates there, but it makes sense to me: the second URL does not resolve in a browser, while the first one resolves just fine.

Steps to reproduce

The issue can be somewhat reproduced with the following script. Note that httpbin is a bad example in this case because it percent-decodes in its JSON response.

# I use requests library as reference
import aiohttp
import requests


def get_requests_response(url):
    response = requests.get(url)
    return response.url


def get_aiohttp_response(url):
    import asyncio

    async def get_response(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return response.url

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_response(url))


if __name__ == "__main__":
    url_template = "https://httpbin.org/anything/{id}"
    good_id = r"asdasdas"
    bad_id = r"path%2Fsome_encoded_param%3Dsome_value%26abc"

    good_url = url_template.format(id=good_id)
    bad_url = url_template.format(id=bad_id)

    for url in [good_url, bad_url]:
        print("\n\nTESTING", url)
        requests_resp_url = get_requests_response(url)
        print("requests:", url == requests_resp_url, requests_resp_url)
        aiohttp_resp_url = get_aiohttp_response(url)
        print("aiohttp:", url == str(aiohttp_resp_url), aiohttp_resp_url)

#
#
# TESTING        https://httpbin.org/anything/asdasdas
# requests: True https://httpbin.org/anything/asdasdas
# aiohttp:  True https://httpbin.org/anything/asdasdas
#
#
# Note that while %2F remains encoded, other params like '=' and '&' do not
# TESTING         https://httpbin.org/anything/path%2Fsome_encoded_param%3Dsome_value%26abc
# requests: True  https://httpbin.org/anything/path%2Fsome_encoded_param%3Dsome_value%26abc
# aiohttp:  False https://httpbin.org/anything/path%2Fsome_encoded_param=some_value&abc

# While I do test against response.url here, please note that this problem
# occurs when the url is first processed in the ClientSession._request method

Your environment

OS: ubuntu 17.04
Python: 3.6.6
aiohttp: 2.3.10 or 3.4.4 (can be reproduced on both)

@aio-libs-bot

GitMate.io thinks the contributor most likely able to help you is @asvetlov.

Possibly related issues are #660 (aiohttp.request hangs on some URLs), #1686 (Client Session Timeouts - Usage Question), #3203 (session.get(...) timeout overriding ClientSession timeout and cancelling all requests), #388 (request.GET ignores blank values), and #2039 (Graceful shutdown for client sessions).

@asvetlov
Member

asvetlov commented Dec 3, 2018

URL processing is performed by yarl library: https://yarl.readthedocs.io/en/latest/
In particular, the library replaces all allowed but percent-encoded symbols with their original non-percent-encoded (non-PCT) forms.
For example, %41 is replaced with A.
The part of the ABNF related to the path component is https://tools.ietf.org/html/rfc3986#section-3.3:

      path          = path-abempty    ; begins with "/" or is empty
                    / path-absolute   ; begins with "/" but not "//"
                    / path-noscheme   ; begins with a non-colon segment
                    / path-rootless   ; begins with a segment
                    / path-empty      ; zero characters

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>

      segment       = *pchar
      segment-nz    = 1*pchar
      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Roughly, a path consists of pchar segments delimited by slashes.

Both = and & are members of the sub-delims set:

sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Thus, the conversion is legal.
It looks like your server is not RFC-compliant.
To keep the ability to work with the server, please read the chapter at the very top of the aiohttp client documentation: http://docs.aiohttp.org/en/stable/client_quickstart.html#passing-parameters-in-urls
Essentially, it suggests using encoded=True for non-compliant servers: await session.get(URL('http://example.com/%30', encoded=True)).
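A minimal sketch of that workaround (assuming yarl is installed; it ships as an aiohttp dependency, and the example URLs are hypothetical):

```python
from yarl import URL

# Default behavior: yarl requotes the path, so the "safe" escape
# %3D ('=') is decoded while the URL object is built.
assert str(URL("http://example.com/a%3Db")) == "http://example.com/a=b"

# With encoded=True, yarl treats the string as already encoded
# and leaves every percent-escape exactly as given.
assert str(URL("http://example.com/a%3Db", encoded=True)) == "http://example.com/a%3Db"
```

Passing such a pre-built URL object to session.get() bypasses aiohttp's own requoting of the string form.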

Hope it helps.

@asvetlov asvetlov closed this as completed Dec 3, 2018
@asvetlov asvetlov added invalid This doesn't seem right and removed bug labels Dec 3, 2018
@festinuz
Author

festinuz commented Dec 3, 2018

Thank you for your swift yet thorough answer!

Sorry I didn't find it on my own and wasted your time :)

@adamhooper

@asvetlov I urge you to reopen this issue and nix the invalid tag, not because the resulting URL is malformed, but because it will always be a bug to rewrite URLs.

In my case, rewriting the URL broke my OAuth requests. I calculated an OAuth signature header from the URL, and then I passed the URL and signature header to aiohttp. aiohttp silently rewrote the URL, invalidating the signature, making the server respond with an Access Denied message.

Please be patient while I rant for one paragraph. Simply put, this is unbecoming of an HTTP library. I found this issue after spending a day today fiddling with OAuth libraries: it never crossed my mind to read this doc. I've used dozens of HTTP libraries in dozens of programming languages, and never have I even heard of automatic URL rewrites. Especially not when my URL is valid to begin with. Rewriting URLs is akin to an IRC client auto-correcting typos, or a CSV encoder trimming whitespace: just, no.

I realize changing this default would require a major version change. I think it's worth it. Python needs a low-level async HTTP library, far more than it needs ... uh ... a library that rewrites URLs and does async HTTP requests.

adamhooper added a commit to CJWorkbench/cjworkbench that referenced this issue Dec 20, 2018
[finishes #162176560]. The problem was _not_ on Twitter's end: it's a bug in
our HTTP library. This is infuriating. The upstream bug is
aio-libs/aiohttp#3424 and it is marked "invalid".

We use aiohttp in a few places; I think I covered them all. This should resolve
similar issues we may face in the future.
@asvetlov
Member

requests does URL requoting for redirect URLs and as part of PreparedRequest.prepare() (look for usage of the requote_uri() function). Thus your assumption that aiohttp is unique here is not correct.
Unlike requests, aiohttp allows disabling the requoting, which is good.
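For comparison, requests' requoting can be observed directly (a quick check assuming requests is installed; the URL is hypothetical):

```python
from requests.utils import requote_uri

# requote_uri() decodes only unreserved escapes (%41 -> 'A') and
# re-encodes anything illegal; reserved escapes such as %3A (':')
# are left encoded rather than decoded.
assert requote_uri("http://example.com/%41%3A") == "http://example.com/A%3A"
```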

I feel your pain, but things like request signatures need a slightly different feature: request preparation as a separate execution step. Currently you call session.request() and aiohttp does all the internal work without giving the user a chance to do anything in the middle, e.g. modify HTTP headers just before sending them to the server.

@adamhooper

Thanks for the response. That leads me to two separate questions.

  1. Why does aiohttp use a different requoting mechanism than requests? In particular, why does aiohttp requote a colon, i.e. '%3A', whereas requests doesn't? If aiohttp is going to do this bizarre thing (requoting requests), it might as well adhere to the only pseudo-standard that exists in the wild, no?

  2. What is aiohttp? http is "HTTP client for Python"; urllib3 is "http with simplified networking"; requests is "urllib3 for humans". It makes sense for requests to do silly "magic" stuff, because that's the only reason people use requests. Which is aiohttp: "http, async", "urllib3, async", or "requests, async"? If it's "requests, async", maybe we should split it into two libraries, one for "urllib3, async" and another (the current aiohttp) for "requests, async"? I'm tempted to invest the effort to collaborate here, because I seem to be building an "aiohttp, un-magicked" wrapper in my codebase.

@asvetlov
Member

Man, aiohttp has no relation to urllib3 and requests.
Not every HTTP client should follow requests design, as not every web server should follow Django (while all three are great libraries, sure).

aiohttp is a Free Open Source Software, distributed under Apache License:

software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND

Basically you pay nothing but get no guarantees (including a support guarantee) back.

You have choices:

  1. Propose an aiohttp fix and convince people that your fix is worth applying.
  2. Fork aiohttp and apply all desired changes on your own.
  3. Give up on aiohttp and switch to the tool of your dreams.

The decision is up to you, but I would mention that your current attitude doesn't help to convince anyone.

@adamhooper

I understand the whole "no guarantees" thing. I am trying to be constructive here.

A fork can be totally amicable. I felt jarred when I found that aiohttp rewrote my URL, because I had jumped to the conclusion that aiohttp is "urllib3, async." That was my mistake as a user. I had assumed aiohttp is like urllib3 but async, simply because I assumed "urllib3 but async" would exist somewhere, and since aiohttp is popular I assumed that was it. That's why I'm asking about aiohttp's intent: is aiohttp meant to fill that gap, or is it something else? (If aiohttp is something else, then forking would mean everybody wins, in the same way that http and urllib3 and requests are symbiotic.)

Or maybe I'm leaping over swaths of conversation here with my assumptions....

The aiohttp fix I propose is: when given a str URL and no params, don't rewrite it. (In other words, set encoded=True by default.) I hope to convince people because not-rewriting URLs is Pythonic -- just like urllib3 and http.

Does that fix sound compelling?

@asvetlov
Member

asvetlov commented Dec 21, 2018

Let's return to the root.

aiohttp does requote URLs by default.
It was implemented with good intentions: sending arbitrary data breaks too many HTTP parsers. For example, http://example.com/путь (Russian for "path") cannot be sent as-is; it should be percent-encoded (PCT) as http://example.com/%D0%BF%D1%83%D1%82%D1%8C.
So far so good.
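The stdlib shows the same transformation (a quick illustration, not aiohttp's actual code path):

```python
from urllib.parse import quote, unquote

# Non-ASCII path text has to be percent-encoded (as UTF-8 bytes)
# before it can be sent on the wire.
assert quote("путь") == "%D0%BF%D1%83%D1%82%D1%8C"
assert unquote("%D0%BF%D1%83%D1%82%D1%8C") == "путь"
```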

Requoting is more questionable. Should %41 be requoted as A? Why not? Both %41 and A are equivalent.
Wikipedia suggests unquoting unreserved characters:

Characters from the unreserved set never need to be percent-encoded.
URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers shouldn't treat %41 differently from A or %7E differently from ~, but some do. For maximum interoperability, URI producers are discouraged from percent-encoding unreserved characters.
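That equivalence is easy to check with the stdlib:

```python
from urllib.parse import unquote

# For unreserved characters, the percent-encoded and literal forms
# are interchangeable by definition.
assert unquote("%41") == "A"
assert unquote("%7E") == "~"
```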

Should : be requoted? Things become even more complicated.
We need to look at RFC 3986.
I assume that all allowed characters should be unquoted and all not-allowed characters should be quoted.
requests does the same but uses a simplified regex-based approach to detect the allowed character set.
aiohttp uses the yarl library for this task, with a different allowed set for each URL part.
The RFC has ABNF for describing URL grammar: https://tools.ietf.org/html/rfc3986#page-49
Let's assume we are talking about : in the path part.
This is an excerpt from the RFC:

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters
   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Essentially a path is a sequence of pchar delimited by /.
: is an allowed pchar value (while it is in the reserved set in general; I believe this is the main point of confusion).

That's why yarl requotes %3A back to :.
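The decision rule above can be sketched with the stdlib (a hypothetical helper for illustration, not yarl's actual implementation):

```python
import re
import string

# RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
UNRESERVED = string.ascii_letters + string.digits + "-._~"
SUB_DELIMS = "!$&'()*+,;="
PCHAR = set(UNRESERVED + SUB_DELIMS + ":@")

def requote_path(path: str) -> str:
    """Decode only escapes whose character may appear literally in a
    path segment; keep everything else percent-encoded."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in PCHAR else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, path)

# '/' (%2F) is a delimiter, so it stays encoded; '=', '&' and ':'
# are pchar, so they come back as literals.
print(requote_path("/anything/path%2Fa%3Db%26c%3Ad"))
# /anything/path%2Fa=b&c:d
```

This reproduces exactly the asymmetry reported in this issue: %2F survives while %3D and %26 are decoded.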

I hope this text helps you understand aiohttp's design decision better and can serve as background for discussion.

@adamhooper

As I understand it, the reason for quoting is: if the user makes a mistake, aiohttp should fix it.

Is that the intent?

@asvetlov
Member

Kind of.
There are many ways to make a mistake.
One of them is not using percent-encoding for non-ASCII text, but the problem is not limited to that.

@adamhooper

I totally understand the rationale for correcting users' mistakes in a similar vein to how requests does it.

My view: yes, sometimes it's nice to correct users' mistakes silently; but I also want a library that doesn't correct mistakes silently. I doubt I'm alone in that desire.

In Python, I want int('hi') to raise ValueError and int(None) to raise TypeError, even though it's possible for int('hi') and int(None) to simply return 0. Cherry-picking from PEP20:

  • Explicit is better than implicit.
  • Simple is better than complex.
  • Errors should never pass silently. Unless explicitly silenced.
  • In the face of ambiguity, refuse the temptation to guess.

So for me, there are two issues:

  1. Ideologically: I feel the Python ecosystem would be better if there were an asyncio HTTP library that lets me send an invalid URL to an HTTP server -- just like most HTTP libraries I've used in the past in all sorts of programming languages (including the two HTTP libraries built-in to Python). This may mean "fork", but I bear no hostility! I'm picturing some division of labor -- aiohttpbase/aiohttp mimicking urllib3/requests. I know, I know -- this discussion is beyond the scope of a GitHub issue. Starting now, I'll stay mum about it on this issue page.
  2. Short-term, concrete: I hope we can agree that aiohttp should not rewrite valid URLs. I think my personal struggle using OAuth through aiohttp provides ample evidence.

(And yes, of course import yarl can work around this issue. But I end up needing to import yarl every time I use aiohttp; surely this is not the intent.)
