Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support chinese or others non-ascii language #86

Closed
laomo opened this issue Nov 6, 2014 · 11 comments
Closed

Support chinese or others non-ascii language #86

laomo opened this issue Nov 6, 2014 · 11 comments
Labels
Milestone

Comments

@laomo
Copy link

laomo commented Nov 6, 2014

add

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

before https://github.com/joeyespo/grip/blob/master/grip/server.py#L11

to support chinese file name

@joeyespo
Copy link
Owner

joeyespo commented Nov 6, 2014

Ok, thanks for bringing this up!

What's the reload(sys) part do? I've had to use setdefaultencoding before, but never reload.

@ssokolow
Copy link

ssokolow commented Nov 6, 2014

That function gets deleted in normal operation with the intent being that only site.py can access it. reload(sys) reloads the module, which re-creates the definition without rerunning the code which deleted it, just making it available to be called.

@joeyespo
Copy link
Owner

I'm still a little confused why reload(sys) should be part of this project's source code. import sys should be enough to run sys.setdefaultencoding.

From the docs, "This is useful if you have edited the module source file using an external editor and want to try out the new version without leaving the Python interpreter." That's not something grip is doing.

@ssokolow
Copy link

Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> 'setdefaultencoding' in dir(sys)
False
>>> reload(sys)
<module 'sys' (built-in)>
>>> 'setdefaultencoding' in dir(sys)
True

The Python runtime imports sys prior to your program being run and runs del setdefaultencoding to prevent you from overriding the site-specific configuration hook system that nobody uses. (Sort of a Python-level LD_PRELOAD analogue)

When you run import sys, you get a cached reference to it. reload(sys) forces a reload so you get access to a version where del setdefaultencoding hasn't been called.

(When it was implemented, setdefaultencoding was intended only to be used for site-specific configuration but, given how disruptive a non-UTF8 default encoding has become since it was chosen and how few people even know about site.py, let alone are willing to risk monkey-patching every Python app on their system, it's become expected to hack around the Python 2.x runtime's behaviour to achieve 3.x-style UTF8 defaults.)

@joeyespo
Copy link
Owner

Thanks for the explanation.

So it seems like there's an underlying issue when it comes to Unicode filenames. setdefaultencoding is making a global mutation across the entire Python process, which seems like a big hack. It just so happens that fixes this particular issue, but could easily introduce other hard-to-find problems, which is probably why it's discouraged. As per the docs, the intent is to customize Python locally, not necessarily to be done in a library that could change behavior underneath another script that imports grip. That's probably why it's deleted in the first place.

Is there another way to address the problem without the global side-effects? Using .encode('etf-8') or #-*- coding: utf-8 -*- at the top of each file, for example?

@ssokolow
Copy link

#-*- coding: utf-8 -*- controls how the runtime parses the .py source files, so it has no relevance to filenames.

.encode('utf-8') can't be used for filenames because Windows doesn't use UTF-8 for them and other OSes aren't guaranteed to use it either. (On Windows, you're supposed to use a special encoding named mbcs.)

The proper way to handle filename encoding portably is to use .encode(sys.getfilesystemencoding()).

Also, os.getcwdu() will return the current directory as a unicode string and os.listdir(path) will return unicode strings if path is a unicode string. (Except for undecodable ones. Those will still be byte strings.)

As I understand it, os.walk() relies on os.listdir(), so the same semantics apply.

@joeyespo
Copy link
Owner

Awesome, thank you for clarifying. So as I understand it now, the fix would be to use unicode OS functions for names coming into Python, and encode the names interfacing with the filesystem with .encode(sys.getfilesystemencoding()).

@ssokolow
Copy link

That would fix file handling but you'd still risk "error while attempting to report error" situations if a traceback or error handler attempts to output strings containing non-ASCII characters.

That's why a lot of Python 2.x code uses setdefaultencoding('utf-8') inside an if __name__ == '__main__': block so it only runs when acting as the top level application rather than a library.

(Basically, replicating Python 3.x behaviour. In Py3x, the default is UTF-8 and setdefaultencoding has been removed. By design, UTF-8 encodes codepoints 0-127 identically to 7-bit ASCII, so it doesn't introduce new breakages for human-readable output compared to the ascii codec and, if the terminal encoding isn't UTF-8, it'd better to have gibberish embedded in your error message than having a useless "exception in Python stdlib" traceback masking the one you actually want to see.)

@joeyespo
Copy link
Owner

joeyespo commented Feb 6, 2015

Ok. Makes sense.

@ssokolow Would adding the following to command.py work? That'll make sure that using grip as a library won't unintentionally monkey-patch the parent project.

def main(argv=None, force_utf8=True):
    """The entry point of the application."""
    if force_utf8 and sys.version_info.major == 2:
      reload(sys)
      sys.setdefaultencoding('utf-8')
    ...

@ssokolow
Copy link

ssokolow commented Feb 6, 2015

That looks like it should work.

@joeyespo joeyespo added the bug label Feb 7, 2015
@joeyespo joeyespo added this to the 3.1 milestone Feb 7, 2015
@joeyespo
Copy link
Owner

joeyespo commented Feb 9, 2015

Excellent. I added that to the 3.1.0 release. Be sure to pip install --upgrade grip, @laomo. And thanks for your thorough help with encodings, @ssokolow!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants