Bugzilla – Bug 156
Converting encoding of log messages
Last modified: 2006-12-09 12:12:23 EST
For some languages (for example Russian, which is my native language) there are several possible text encodings. For example, Russian text may be encoded as koi8, cp1251, utf8, iso8859-5, cp866... At least the first three encodings are quite common. Currently, Mercurial stores and displays log messages "as-is", without any encoding transformation, which makes using non-English languages in log messages very inconvenient if different developers use locales with different encodings.

I have written a patch for Mercurial 0.8 which solves this problem (at least for me). It detects the current locale encoding by analyzing the LANG environment variable during ui initialization, then converts all text being output from utf8 to the detected encoding, and converts log messages entered interactively from the local encoding to utf8. Messages given directly from a file or on the command line are not converted: if you specify one with -l or -m, we assume that it is already in utf8. So all log messages are stored in utf8 and output in the system encoding.

Known problems: currently, this patch won't work on systems which don't use the LANG environment variable (for example, Windows). But this can be easily fixed by adding appropriate detection of the system character encoding to ui.__init__. There will probably also be problems with localized messages created with gettext, because gettext returns message text in the system encoding, but that text will still be passed through the utf8->system encoding conversion.
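The approach described in the report can be sketched roughly as follows. This is not the attached patch, only a minimal illustration written for current Python 3; the function names `encoding_from_lang`, `to_storage` and `to_display` are made up for this example, and falling back to ASCII when LANG carries no codeset is an assumption.

```python
import codecs


def encoding_from_lang(lang, default="ascii"):
    """Extract the codeset from a POSIX locale string such as 'ru_RU.KOI8-R'.

    Hypothetical helper: returns `default` when LANG is empty, has no
    codeset part, or names a codec Python does not know.
    """
    if not lang or "." not in lang:
        return default
    # Locale strings look like language_TERRITORY.codeset@modifier.
    codeset = lang.split(".", 1)[1].split("@", 1)[0]
    try:
        return codecs.lookup(codeset).name
    except LookupError:
        return default


def to_storage(message_bytes, local_encoding):
    """Convert an interactively entered message from the local encoding to UTF-8."""
    return message_bytes.decode(local_encoding).encode("utf-8")


def to_display(stored_bytes, local_encoding):
    """Convert a stored UTF-8 message to the local encoding for output.

    Characters the local encoding cannot represent are replaced
    instead of raising an error.
    """
    return stored_bytes.decode("utf-8").encode(local_encoding, "replace")
```

With this, a message typed under a koi8 locale and read back under a cp1251 locale goes through UTF-8 in between, so both developers see the same text.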
I was working on this some months ago and I think I have everything I need to make it run on Windows, too. I just didn't find enough time+motivation to complete it cleanly. But I haven't forgotten this, and I'll take this issue as a hint that users really want this and that it is not something only Mercurial developers care about :)
slav: can you provide some performance numbers? (like hg log from the kernel repo)
On Tue, 2006-03-07 at 11:08 +0000, Yaroslav Gorbunov wrote:
> I have written a patch for Mercurial 0.8, which solves (at least for me) this
> problem.

This is a very nice idea, but I've had a few problems with it.

Unfortunately, the patch is broken against the current tip of the crew tree. If you can make it work against http://hg.intevation.org/mercurial/crew that would be a good thing.

Also, your patch increases hg startup time by 50%, whether LANG is set or not. Without it, "time hg -q version" takes 0.111 seconds on my desktop, and with it, the same command takes 0.157 seconds. What's weird is that running "hg -q tip" doesn't show this increase in startup time.

Finally, the patch uses tabs as well as spaces for indentation. You should clean that up before you resubmit.

If you can find some way to make the added startup cost in some cases disappear, this looks good to me.

Thanks, <b
> Also, your patch increases hg startup time by 50%, whether LANG is set
> or not. Without it, "time hg -q version" takes 0.111 seconds on my
> desktop, and with it, the same command takes 0.157 seconds.

This looks quite strange. At least for me, the difference in startup time is not significant.

With patch:

    $ time hg -q version
    Mercurial Distributed SCM (version 0.8)

    real 0m0.069s
    user 0m0.056s
    sys 0m0.012s

Without patch:

    $ time ./hg -q version
    Mercurial Distributed SCM (version 0.8)

    real 0m0.068s
    user 0m0.052s
    sys 0m0.016s

Of course I have verified that those `hg` programs use different (modified and unmodified) modules. Maybe you forgot to byte-compile the modified ui.py? Or you measured the time only once, so that delays not connected with the Mercurial code (like disk operations) could have influenced the result? Or there are some problems with your Python installation (particularly in the codecs module)?

I'll fix the problems with tabs and try to adapt this patch to the current tip in a few days...
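To rule out one-off delays such as a cold disk cache, the timing can be repeated and the fastest run kept. The helper below is only a sketch in current Python 3; `best_of` is a name made up for this example, and the commands shown in the trailing comment assume that a patched `hg` and an unpatched `./hg` are both available as in the measurement above.

```python
import subprocess
import time


def best_of(cmd, runs=5):
    """Run `cmd` several times and return the fastest wall-clock time in seconds.

    Taking the minimum filters out runs slowed down by unrelated
    activity such as disk operations.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical usage, assuming both executables exist:
#   patched = best_of(["hg", "-q", "version"])
#   unpatched = best_of(["./hg", "-q", "version"])
```

The first run also gives Python a chance to byte-compile any modified module, so later runs are not penalised for that.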
Here is part of my original work on this topic. This _should_ work on Windows, too, which is not trivial.

 class ui:
     def __init__(self, verbose=False, debug=False, quiet=False, interactive=True):
+        # init encoding first, because config is read in this encoding
+        self.encoding = locale.getpreferredencoding()
+        self.encoding = self.config("ui", "encoding") or self.encoding
+
+        # sys.stdin.encoding is used to make "hg log|more" work on Windows
+        self.stdio_encoding = (self.config("ui", "stdio_encoding")
+                               or self.config("ui", "encoding")
+                               or sys.stdout.encoding or sys.stdin.encoding
+                               or locale.getpreferredencoding())
+
         self.cdata = ConfigParser.SafeConfigParser()
         self.cdata.read(os.path.expanduser("~/.hgrc"))

Plan is to decode strings to unicode strings and immediately turn them into changelog encoding (UTF-8). As there are probably already repositories with different encoding, there should be a way to disable this or to force a different changelog encoding in .hg/hgrc
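The fallback order in that fragment can be tried out in isolation. The sketch below is an illustration in current Python 3, not part of the patch: `pick_encodings` is a made-up name, and a plain dict stands in for the [ui] section of the configuration, which is assumed to be loaded already.

```python
import locale


def pick_encodings(ui_config, stdout_encoding=None, stdin_encoding=None):
    """Return (encoding, stdio_encoding) using the fallback order from the patch.

    `ui_config` is a hypothetical dict standing in for the [ui] section.
    The stream encodings are passed in explicitly so that the function
    can be exercised without touching sys.stdout or sys.stdin.
    """
    preferred = locale.getpreferredencoding()
    encoding = ui_config.get("encoding") or preferred
    stdio_encoding = (ui_config.get("stdio_encoding")
                      or ui_config.get("encoding")
                      or stdout_encoding or stdin_encoding
                      or preferred)
    return encoding, stdio_encoding
```

Note that with this order an explicit ui.encoding setting also overrides the stream encodings unless ui.stdio_encoding is set as well.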
> As there are probably already repositories with different encoding, there should
> be a way to disable this or to force a different changelog encoding in .hg/hgrc

Probably not just force a single different encoding, but a list of encodings to try. For example, the kernel repo has many revisions using UTF-8 and ISO8859-1 and a handful of others in different encodings.

Doing a brain dump:

- what should hg use internally? unicode's? UTF-8 encoded str's?
- the CGI scripts will probably want to ignore the current locale and force the output to e.g. UTF-8
- what should be the encoding of .hgrc's? Since it's a file that the user can edit, it'd be nice for them to be encoded in the user's locale, but that would make it hard for hgweb to use them to get contact information and repo description.
- I think there were some issues with the encoding of environment variables on windows - depending on how you access them, you get either a unicode or a str. Sorry, I don't really remember the details.
- templates - there should be some way to specify the encoding of template files. At least the obfuscate filter has trouble with multibyte encodings today.
- at least "hg log -p" will need some way to output recoded text (the changelog entry) and binary data (the patch - as an extreme case, think of a patch that changes the file encoding. There is no good way to recode this patch)
- since there's interest in having a hg export/import pair generate the same changeset number, it'd probably be good to have export always use UTF-8 and make import not do any encoding conversion. Or at least an option for that.
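The "list of encodings to try" idea could look roughly like this. It is a sketch in current Python 3 under stated assumptions: `decode_with_fallbacks` is a made-up name, and the default list simply uses the two encodings mentioned for the kernel repo.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "iso8859-1")):
    """Decode `data` with the first encoding in `encodings` that accepts it.

    Returns (text, encoding_name). If no candidate fits, the bytes are
    decoded as ASCII with replacement characters and the name is None.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    return data.decode("ascii", "replace"), None
```

Order matters here: a single-byte encoding such as ISO8859-1 accepts any byte sequence, so it only makes sense as the last entry, after stricter candidates like UTF-8.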
> Here is part of my original work on this topic. This _should_ work on Windows, too, which is not trivial.

Looks like a nice idea, but what is the exact difference between self.encoding and self.stdio_encoding? Also, the bzr guys claim they already have full Unicode support, so we can peek into their implementation. ;-)
encoding is what e.g. notepad saves, stdio_encoding is what you get when using pipes. On Windows these are usually different because of some DOS compatibility foobar.
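A small illustration of why the two must be kept apart, in current Python 3. It assumes a Russian Windows setup in which files are saved as cp1251 and the console uses cp866; `recode` is a made-up helper name.

```python
def recode(data, source, target):
    """Re-encode bytes from one code page to another by way of Unicode."""
    return data.decode(source).encode(target)


text = "Привет"
file_bytes = text.encode("cp1251")    # assumed: what an editor such as Notepad saves
console_bytes = text.encode("cp866")  # assumed: what arrives through a console pipe

# The same text has different byte representations, so bytes read from a
# file cannot be written to the console unchanged.
converted = recode(file_bytes, "cp1251", "cp866")
```

Here `converted` equals `console_bytes`, while interpreting `console_bytes` as cp1251 would produce different, unreadable text.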
This is done in tip
--- Bug imported by bugzilla@serpentine.com 2012-05-12 08:30 EDT --- This bug was previously known as _bug_ 156 at http://mercurial.selenic.com/bts/issue156 Imported an attachment (id=537) Imported an attachment (id=538)