binary vs. UTF-8, and why it need not matter
[Apr. 2nd, 2006|07:37 pm]
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data (even the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although there the invalid parts are handled as a sequence of binary data rather than valid characters).
the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
the original text is archived at:
summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
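a minimal sketch of that decoding rule in python (just an illustration; utf8b_decode is a made-up name here, not the actual tpd code):

    def utf8b_decode(data: bytes) -> str:
        """Decode bytes as UTF-8, escaping each invalid byte 0xYZ as U+DCYZ."""
        out = []
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                # ASCII bytes are always valid UTF-8
                out.append(chr(b))
                i += 1
                continue
            # try the longest strictly valid UTF-8 sequence starting here
            for length in (4, 3, 2):
                chunk = data[i:i + length]
                try:
                    out.append(chunk.decode('utf-8', 'strict'))
                    i += len(chunk)
                    break
                except UnicodeDecodeError:
                    pass
            else:
                # no valid sequence starts at this byte: escape it as a
                # low surrogate (0x80..0xFF becomes U+DC80..U+DCFF)
                out.append(chr(0xDC00 + b))
                i += 1
        return ''.join(out)

note that strict decoding rejects CESU-8 surrogate sequences too, so those fall through to the escape path byte by byte, which is exactly the behavior described above.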
so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)
today i re-implemented it in c as a patch against GNU libiconv:
This implementation of UTF-8B produces no errors on decoding, but produces encoding errors for Unicode characters that cannot be round-tripped successfully. The supported Unicode range is the UTF-16 subset: U+0000 ... U+D7FF and U+E000 ... U+10FFFF.
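the encoding direction is where those errors show up; a matching sketch (again a made-up name, utf8b_encode, in python rather than the actual libiconv patch):

    def utf8b_encode(text: str) -> bytes:
        """Inverse of utf8b_decode: turn U+DC80..U+DCFF back into raw bytes."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                # an escaped byte produced by decoding: emit it verbatim
                out.append(cp - 0xDC00)
            elif 0xD800 <= cp <= 0xDFFF:
                # any other surrogate has no byte to map back to, so
                # encoding is the one direction that can fail
                raise ValueError('cannot encode lone surrogate U+%04X' % cp)
            else:
                out += ch.encode('utf-8')
        return bytes(out)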
2006-04-03 07:34 pm (UTC)
See, that one really WAS an obscure unicode entry. :)
Obscure but very important.
2006-04-04 04:41 pm (UTC)
Ben attempted to explain its importance to me. As far as I understand it, it's because binary and unicode data (in this case represented by UTF-8) sometimes get all intermingled (like in .tar files, or programs?), and right now it's hard to edit that kind of data without corrupting it.
So I can see that this sort of thing is pretty important, but not in the same way his last post was important.
2006-04-04 04:42 pm (UTC)
That was me, btw
The significance of UTF-8B is much broader than that. The added complexity of having to tell your software whether it's chewing on UTF-8 data or binary data (or something else) runs all the way from the most primitive routines in the software to its user interface. Also, UTF-8 can generate errors in the most surprising places. When every primitive routine that deals with strings --- string length, concatenate with another string, etc. --- can generate an error, your program gains a lot of complexity that doesn't actually get you any benefit, and it's usually untestable complexity. You can't even tell if some of those error cases are correct because you can't figure out how to get them to happen.
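(to make that concrete, a toy example using the hypothetical utf8b_decode/utf8b_encode sketches from the post above:)

    blob = b'report-\xff-final.txt'      # not valid utf-8: stray 0xff byte
    text = utf8b_decode(blob)            # never raises
    name = text + '.bak'                 # ordinary string ops, no error paths
    assert utf8b_encode(text) == blob    # and the original bytes round-trip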
You're side-stepping a more important problem. We need to teach everyone to speak the King's English and then we won't have this unicode problem.
I've just registered utf8b.org.