Benjamin C. Wiley Sittler @ 2006-04-02 19:37:00
binary vs. UTF-8, and why it need not matter
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data. (it even copes with the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary bytes rather than as valid characters.)
the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>
the original text is archived at:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors. (bytes below 0x80 are always
valid UTF-8, so the escapes only ever fall in U+DC80 ... U+DCFF.)
so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)
today i re-implemented it in c as a patch against GNU libiconv:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff
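to make the U+DCyz substitution summarized above concrete, here is a minimal sketch in python. it is not the tpd codec or the libiconv patch: it assumes that python 3's built-in "surrogateescape" error handler (PEP 383), which applies the same byte-to-U+DCyz mapping, is a fair stand-in. the names decode_utf8b and encode_utf8b are made up for this illustration.

```python
# Sketch only: Python's built-in "surrogateescape" error handler (PEP 383)
# stands in for a UTF-8B codec. This is NOT the tpd or libiconv code.

def decode_utf8b(data: bytes) -> str:
    """Decode bytes, mapping each undecodable byte 0xyz to U+DCyz."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Encode text, turning each U+DC80..U+DCFF escape back into its byte."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("café") followed by two stray bytes that are not valid UTF-8.
mixed = b"caf\xc3\xa9 \xff\xfe raw"
text = decode_utf8b(mixed)

# The valid part decodes normally; the stray bytes become escape code points.
escapes = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
print(escapes)  # ['0xdcff', '0xdcfe']

# Encoding restores the exact original bytes.
assert encode_utf8b(text) == mixed
```

the point of the design is that decoding never fails and never loses information: any byte string, however mangled, comes back byte-for-byte after a decode followed by an encode.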
Implementation Notes:
This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range: U+0000 ... U+D7FF and U+E000 ... U+10FFFF
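
As an illustration of what "cannot be round-tripped" can mean, the sketch below checks whether a string survives an encode followed by a decode. It again assumes Python 3's built-in "surrogateescape" handler as a stand-in, so its behavior may differ from this implementation: the built-in handler, for instance, encodes an escape pair that happens to spell valid UTF-8 without raising an error, and only the comparison afterwards reveals the mismatch. The helper name round_trips is hypothetical.

```python
# Sketch only: uses Python's "surrogateescape" handler as a stand-in;
# the actual UTF-8B implementation may reject more (or different) input.

def round_trips(text: str) -> bool:
    """Return True if text survives encode-then-decode unchanged."""
    try:
        data = text.encode("utf-8", errors="surrogateescape")
    except UnicodeEncodeError:
        # Surrogates outside U+DC80..U+DCFF have no byte to map back to.
        return False
    return data.decode("utf-8", errors="surrogateescape") == text


print(round_trips("plain text \U0001F600"))  # True: ordinary characters
print(round_trips("\udcff"))                 # True: escape for byte 0xFF
print(round_trips("\ud800"))                 # False: lone high surrogate
print(round_trips("\udcc3\udca9"))           # False: bytes C3 A9 decode as "é"
```

The last case shows why an encoder that wants a guaranteed round trip has to reject some escape sequences: the bytes they stand for form valid UTF-8, so decoding them yields an ordinary character instead of the original escapes.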