Bug 156 - Converting encoding of log messages
Status: RESOLVED FIXED
Assigned To: Thomas Arendsen Hein
Reported: 2006-03-07 05:08 EST by Yaroslav Gorbunov
Modified: 2006-12-09 12:12 EST (History)
Attachments
(32 bytes, application/octet-stream) 2006-03-07 05:08 EST, Yaroslav Gorbunov
(32 bytes, application/octet-stream) 2006-03-07 05:09 EST, Yaroslav Gorbunov

Description Yaroslav Gorbunov 2006-03-07 05:08:45 EST
For some languages (for example, Russian, which is my native language) there are
several possible text encodings. For example, Russian text may be encoded as
koi8, cp1251, utf8, iso8859-5, cp866... At least the first 3 encodings are quite
common. Currently, Mercurial stores and displays log messages "as-is", without
any encoding transformation, which makes using non-English languages in log
messages very inconvenient if different developers use locales with different
encodings.

I have written a patch for Mercurial 0.8 which solves this problem (at least
for me). It detects the current locale encoding by analyzing the LANG
environment variable during ui initialization, then converts all text being
output from utf8 to the detected encoding, and converts log messages entered
interactively from the local encoding to utf8. Messages given directly from a
file or the command line are not converted: if you specify a message with -l or
-m, we assume it is already in utf8. So all log messages are stored in utf8 and
output in the system encoding.
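
The scheme described above can be sketched roughly as follows. This is not the
attached patch (which targeted the Python 2 code of Mercurial 0.8); it is a
minimal Python 3 illustration, and the helper names encoding_from_lang,
to_storage and to_display are hypothetical.

```python
import codecs


def encoding_from_lang(lang, default="ascii"):
    """Return the charset part of a LANG-style value such as 'ru_RU.KOI8-R'.

    Falls back to `default` when LANG is unset, has no charset part,
    or names a codec Python does not know.
    """
    if not lang or "." not in lang:
        return default
    # Drop an optional modifier such as '@euro'.
    charset = lang.split(".", 1)[1].split("@", 1)[0]
    try:
        codecs.lookup(charset)
    except LookupError:
        return default
    return charset


def to_storage(message_bytes, local_encoding):
    """Interactively entered message: local encoding -> UTF-8 for storage."""
    return message_bytes.decode(local_encoding).encode("utf-8")


def to_display(stored_bytes, local_encoding):
    """Stored UTF-8 message -> local encoding for output.

    Characters the local encoding cannot represent are replaced.
    """
    return stored_bytes.decode("utf-8").encode(local_encoding, "replace")
```

With this, a message typed under a koi8-r locale and read back under a cp1251
locale goes through to_storage(..., "koi8-r") and then to_display(..., "cp1251"),
so both developers see readable text.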

Known problems: currently, this patch won't work on systems which don't use the
LANG environment variable (for example, Windows). But this can be fixed easily
by adding appropriate detection of the system character encoding to ui.__init__.

There will probably also be problems with localized messages created with
gettext: gettext returns message text in the system encoding, but that text
will still be passed through the utf8->system encoding conversion.
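
The gettext concern can be reproduced in a small Python 3 sketch. The
recode_for_output function is a hypothetical stand-in for the output
conversion, not code from the patch, and koi8-r is just an example locale
encoding.

```python
def recode_for_output(data, local_encoding):
    """Hypothetical output filter: assumes `data` is UTF-8 and recodes it."""
    return data.decode("utf-8").encode(local_encoding, "replace")


stored = "привет".encode("utf-8")          # log message as stored
already_local = "привет".encode("koi8-r")  # e.g. text already in the locale encoding

# The stored message converts cleanly.
ok = recode_for_output(stored, "koi8-r")

# Text that is already in the local encoding is not valid UTF-8 here,
# so sending it through the same filter fails.
try:
    recode_for_output(already_local, "koi8-r")
    double_conversion_failed = False
except UnicodeDecodeError:
    double_conversion_failed = True
```

One way around this would be to convert only text known to come from the
repository, or to have all message sources agree on a single internal encoding.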
Comment 1 Thomas Arendsen Hein 2006-03-07 06:58:09 EST
I was working on this some months ago and I think I have everything I need to
make it run on Windows, too. I just didn't find enough time+motivation to
complete it cleanly.

But I haven't forgotten this and I'll take this issue as a hint that users
really want this and this is nothing only Mercurial developers care about :)
Comment 2 Benoit Boissinot 2006-03-07 09:43:42 EST
slav: can you provide some performance numbers? (e.g. hg log on the kernel repo)
Comment 3 Bryan O'Sullivan 2006-03-07 09:58:19 EST
On Tue, 2006-03-07 at 11:08 +0000, Yaroslav Gorbunov wrote:

> I have written a patch for Mercurial 0.8, which solves (at least for me) this
> problem.

This is a very nice idea, but I've had a few problems with it.

Unfortunately, the patch is broken against the current tip of the crew
tree.  If you can make it work against
http://hg.intevation.org/mercurial/crew that would be a good thing.

Also, your patch increases hg startup time by 50%, whether LANG is set
or not.  Without it, "time hg -q version" takes 0.111 seconds on my
desktop, and with it, the same command takes 0.157 seconds.

What's weird is that running "hg -q tip" doesn't show this increase in
startup time.

Finally, the patch uses tabs as well as spaces for indentation.  You
should clean that up before you resubmit.

If you can find some way to make the added startup cost in some cases
disappear, this looks good to me.

Thanks,

	<b
Comment 4 Yaroslav Gorbunov 2006-03-07 12:20:02 EST
> Also, your patch increases hg startup time by 50%, whether LANG is set
> or not.  Without it, "time hg -q version" takes 0.111 seconds on my
> desktop, and with it, the same command takes 0.157 seconds.

Looks quite strange. At least for me, the difference of startup time is not
significant:

With patch:
$ time hg -q version
Mercurial Distributed SCM (version 0.8)

real    0m0.069s
user    0m0.056s
sys     0m0.012s

Without patch:
$ time ./hg -q version
Mercurial Distributed SCM (version 0.8)

real    0m0.068s
user    0m0.052s
sys     0m0.016s

Of course I have verified that those `hg` programs use different (modified and
unmodified) modules.

Maybe you forgot to byte-compile the modified ui.py? Or you measured the time
only once, so delays not connected with Mercurial code (like disk operations)
could have influenced the result? Or there are some problems with your Python
installation (particularly in the codecs module)?

I'll fix the problems with tabs and try to adapt this patch to the current tip
in a few days...
Comment 5 Thomas Arendsen Hein 2006-05-19 14:14:09 EDT
Here is part of my original work on this topic. This _should_ work on Windows,
too, which is not trivial.

 class ui:
     def __init__(self, verbose=False, debug=False, quiet=False,
                  interactive=True):
+        # init encoding first, because config is read in this encoding
+        self.encoding = locale.getpreferredencoding()
+        self.encoding = self.config("ui", "encoding") or self.encoding
+
+        # sys.stdin.encoding is used to make "hg log|more" work on Windows
+        self.stdio_encoding = (self.config("ui", "stdio_encoding")
+                               or self.config("ui", "encoding")
+                               or sys.stdout.encoding or sys.stdin.encoding
+                               or locale.getpreferredencoding())
+
         self.cdata = ConfigParser.SafeConfigParser()
         self.cdata.read(os.path.expanduser("~/.hgrc"))

The plan is to decode strings to unicode strings and immediately encode them in
the changelog encoding (UTF-8).

As there are probably already repositories with different encodings, there should
be a way to disable this or to force a different changelog encoding in .hg/hgrc.
Comment 6 Alexis S. L. Carvalho 2006-05-19 15:36:12 EDT
> As there are probably already repositories with different encoding, there should
> be a way to disable this or to force a different changelog encoding in .hg/hgrc

Probably not just force a single different encoding, but a list of encodings to
try.  For example, the kernel repo has many revisions using UTF-8 and ISO8859-1
and a handful of others in different encodings.
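
A list of encodings to try could look roughly like this in Python 3. The
function name and the default list are assumptions for illustration, not an
existing Mercurial API.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "iso8859-1")):
    """Try each encoding in order and return (text, encoding_used).

    Order matters: iso8859-1 accepts every byte sequence, so it must
    come last or nothing after it will ever be tried.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the ASCII part and mark the rest as unknown.
    return raw.decode("ascii", "replace"), None
```

Note that this only guesses: a message written in koi8-r will decode under
iso8859-1 without error and still come out as the wrong characters.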

Doing a brain dump:
- what should hg use internally? unicode's? UTF-8 encoded str's?

- the CGI scripts will probably want to ignore the current locale and force the
output to e.g. UTF-8

- what should be the encoding of .hgrc's? Since it's a file that the user can
edit, it'd be nice for them to be encoded in the user's locale, but that would
make it hard for hgweb to use them to get contact information and repo description.

- I think there were some issues with the encoding of environment variables on
windows - depending on how you access them, you get either a unicode or a str.
Sorry, I don't really remember the details.

- templates - there should be some way to specify the encoding of template
files. At least the obfuscate filter has trouble with multibyte encodings today.

- at least "hg log -p" will need some way to output recoded text (the changelog
entry) and binary data (the patch - as an extreme case, think of a patch that
changes the file encoding. There is no good way to recode this patch)

- since there's interest in having a hg export/import pair generate the same
changeset number, it'd probably be good to have export always use UTF-8 and make
import not do any encoding conversion. Or at least an option for that.
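
The reason for the last point can be shown with a simplified Python 3 sketch.
It assumes only that the changeset identifier is a hash over the stored bytes;
it does not reproduce the actual changeset hash computation.

```python
import hashlib

message = "исправление"

# The same text produces different bytes, and therefore different hashes,
# depending on the encoding it was stored in.
digests = {
    enc: hashlib.sha1(message.encode(enc)).hexdigest()
    for enc in ("utf-8", "koi8-r", "cp1251")
}
```

So if export and import each recode the message according to the local locale,
the imported changeset can end up with a different identifier than the original.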
Comment 7 Andrey 2006-11-11 00:28:00 EST
> Here is part of my original work on this topic. This _should_ work on
> Windows, too, which is not trivial.

Looks like a nice idea, but what is the exact difference between self.encoding 
and self.stdio_encoding?

Also, the bzr guys claim they already have full Unicode support, so we can peek
into their implementation. ;-)
Comment 8 Thomas Arendsen Hein 2006-11-11 01:15:58 EST
encoding is what e.g. notepad saves, stdio_encoding is what you get when using
pipes.
On Windows these are usually different because of some DOS compatibility foobar.
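
To see the two values on a given system, something like the following Python 3
sketch can be used. The function names are hypothetical; the fallback order in
pick_stdio_encoding mirrors the chain in the diff from comment 5.

```python
import locale
import sys


def describe_encodings():
    """Report the encodings Python sees for files and for the standard streams."""
    return {
        "preferred": locale.getpreferredencoding(),
        # A stream may be missing or lack an encoding, e.g. when redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
    }


def pick_stdio_encoding(configured=None, stdout_enc=None, stdin_enc=None,
                        preferred="ascii"):
    """Return the first usable value: config, stdout, stdin, then locale."""
    return configured or stdout_enc or stdin_enc or preferred
```

For example, if stdout reports no encoding because output is piped, the stdin
value is used, which is the "hg log|more" case mentioned in the diff.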
Comment 9 Matt Mackall 2006-12-09 12:12:23 EST
This is done in tip
Comment 10 Bugzilla 2012-05-12 08:30:07 EDT

--- Bug imported by bugzilla@serpentine.com 2012-05-12 08:30 EDT  ---

This bug was previously known as _bug_ 156 at http://mercurial.selenic.com/bts/issue156
Imported an attachment (id=537)
Imported an attachment (id=538)