Benjamin C. Wiley Sittler - binary vs. UTF-8, and why it need not matter

binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
it turns out there's a simple, fully reversible way to handle UTF-8 and binary data intermixed in the same file, and it is fully compatible with all valid UTF-8 data. it even copes with the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary data rather than as valid characters.

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>


the original text is archived at:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
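
a minimal sketch of that mapping, assuming python 3. this is not the tpd or libiconv code; it leans on the standard library's `surrogateescape` error handler (PEP 383), which uses the same U+DCyz convention for undecodable bytes. the names `decode_utf8b` and `encode_utf8b` are made up for illustration.

```python
def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte 0xyz to U+DCyz."""
    # surrogateescape never raises on decoding: bad bytes become
    # lone low surrogates in the range U+DC80..U+DCFF.
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Inverse of decode_utf8b: U+DC80..U+DCFF turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe tail"  # valid UTF-8 plus two stray bytes
    text = decode_utf8b(raw)
    # ascii() is used because lone surrogates cannot be printed directly.
    print(ascii(text))
    print(encode_utf8b(text) == raw)
```

the point of the design is that editing the valid text in between leaves the stray bytes untouched, so they come back out exactly as they went in.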

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range, excluding the surrogate code points:

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
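
A rough illustration of that range check, assuming Python 3. The helper `is_round_trippable` is hypothetical and not taken from the libiconv patch; it only encodes the two ranges listed above. The second helper, `try_encode` (also a made-up name), shows that Python's own `surrogateescape` handler likewise refuses a lone surrogate such as U+D800 on encoding, while accepting the U+DC80 ... U+DCFF escapes that stand for raw bytes.

```python
from typing import Optional


def is_round_trippable(code_point: int) -> bool:
    """True for code points inside the two supported ranges."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def try_encode(text: str) -> Optional[bytes]:
    """Encode with surrogateescape; return None when Python refuses."""
    try:
        return text.encode("utf-8", errors="surrogateescape")
    except UnicodeEncodeError:
        # Raised for surrogates outside U+DC80..U+DCFF, which no
        # input byte could have produced during decoding.
        return None


if __name__ == "__main__":
    for cp in (0x41, 0xD7FF, 0xD800, 0xE000, 0x10FFFF):
        print(hex(cp), is_round_trippable(cp))
    print(try_encode("\ud800"))
    print(try_encode("\udcff"))
```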

Comments:
From: smws
2006-04-03 07:34 pm (UTC)

See, that one really WAS an obscure unicode entry. :)
From: kragen
2006-04-03 07:49 pm (UTC)

Obscure but very important.
From: (Anonymous)
2006-04-04 04:41 pm (UTC)

Ben attempted to explain its importance to me. As far as I understand it, the issue is that data mixing binary and Unicode (in this case represented by UTF-8) is sometimes all intertwingled (like in .tar files, or programs?), and right now it's hard to edit that kind of data without corrupting it.

So I can see that this sort of thing is pretty important, but not in the same way his last post was important.
From: smws
2006-04-04 04:42 pm (UTC)

That was me, btw
From: kragen
2006-04-06 09:00 pm (UTC)

The significance of UTF-8B is much broader than that. The added complexity of having to tell your software whether it's chewing on UTF-8 data or binary data (or something else) runs all the way from the most primitive routines in the software to its user interface. Also, UTF-8 can generate errors in the most surprising places. When every primitive routine that deals with strings --- string length, concatenate with another string, etc. --- can generate an error, your program gains a lot of complexity that doesn't actually get you any benefit, and it's usually untestable complexity. You can't even tell if some of those error cases are correct because you can't figure out how to get them to happen.
From: chamewco
2006-04-04 02:26 am (UTC)

You're side-stepping a more important problem. We need to teach everyone to speak the king's english and we won't have this unicode problem.
From: bsittler
2006-04-04 03:54 pm (UTC)

i'm sure æðelred unræd would disagree ;)
From: kragen
2006-04-14 06:13 pm (UTC)

Another UTF-8B implementation: http://hyperreal.org/~est/utf-8b/
From: kragen
2007-01-03 08:13 am (UTC)

I've just registered utf8b.org.