binary vs. UTF-8, and why it need not matter - Benjamin C. Wiley Sittler

binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a simple, fully reversible way to handle UTF-8 and binary data intermixed in the same file, and it is fully compatible with all valid UTF-8 data. it even copes with the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary bytes rather than as valid characters.

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>

the original text is archived at:

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
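
a minimal sketch of that mapping, assuming python 3: its built-in "surrogateescape" error handler (PEP 383) does the same byte-to-U+DCxx substitution, so it stands in here for a hand-written decoder. this is not the tpd code, and the helper names decode_utf8b and encode_utf8b are made up for illustration.

```python
def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte 0xNN to the code point U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Inverse of decode_utf8b: turn each U+DCNN escape back into the byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# valid UTF-8 ("café") followed by two bytes that can never appear in UTF-8
mixed = b"caf\xc3\xa9 \xff\xfe"
text = decode_utf8b(mixed)

# the valid part decodes normally; the two stray bytes become U+DCFF and U+DCFE
print([hex(ord(ch)) for ch in text[-2:]])
# encoding restores the original bytes exactly
print(encode_utf8b(text) == mixed)

# CESU-8 form of U+1F600: a UTF-16 surrogate pair, each half written as 3 bytes.
# strict UTF-8 rejects it, so every one of the 6 bytes is escaped individually.
cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"
escaped = decode_utf8b(cesu8)
print(len(escaped), encode_utf8b(escaped) == cesu8)
```

only bytes 0x80 through 0xFF can ever be invalid in UTF-8, so the escapes always land in U+DC80 ... U+DCFF, which is the low-surrogate range that valid UTF-8 never produces.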

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16 range:

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
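
The asymmetry described in these notes (decoding never fails, encoding can) can be illustrated with a short sketch. It assumes Python 3's "surrogateescape" error handler as a stand-in for UTF-8B; the libiconv patch itself may differ in detail, and the helper names below are hypothetical.

```python
def decode_utf8b(data: bytes) -> str:
    # Invalid bytes become U+DC80 ... U+DCFF instead of raising an error.
    return data.decode("utf-8", errors="surrogateescape")


def can_encode(text: str) -> bool:
    """Return True if text can be written back out as bytes."""
    try:
        text.encode("utf-8", errors="surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# Decoding accepts every possible byte value, and each one round-trips.
for value in range(256):
    raw = bytes([value])
    decoded = decode_utf8b(raw)
    assert decoded.encode("utf-8", errors="surrogateescape") == raw

# Ordinary characters on both sides of the surrogate gap encode fine.
print(can_encode("\ud7ff"), can_encode("\ue000"), can_encode("\U0010ffff"))

# Surrogates that are not byte escapes cannot be encoded.
print(can_encode("\ud800"), can_encode("\udc41"))
```

In this sketch U+D800 and U+DC41 are rejected because neither corresponds to an escaped byte: only U+DC80 ... U+DCFF map back to raw bytes 0x80 ... 0xFF.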

From: smws
2006-04-04 04:42 pm (UTC)
That was me, btw
From: kragen
2006-04-06 09:00 pm (UTC)
The significance of UTF-8B is much broader than that. The added complexity of having to tell your software whether it's chewing on UTF-8 data or binary data (or something else) runs all the way from the most primitive routines in the software to its user interface. Also, UTF-8 can generate errors in the most surprising places. When every primitive routine that deals with strings --- string length, concatenate with another string, etc. --- can generate an error, your program gains a lot of complexity that doesn't actually get you any benefit, and it's usually untestable complexity. You can't even tell if some of those error cases are correct because you can't figure out how to get them to happen.