binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data (it even copes with the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary data rather than as valid characters).

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>


the original text is archived at:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
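
to make the rule concrete, here's a rough python sketch of the decoding side (this isn't the tpd codec or the libiconv patch mentioned below, just an illustration, and the name utf8b_decode is made up):

    def utf8b_decode(data: bytes) -> str:
        # decode UTF-8B: valid UTF-8 sequences decode normally, and every
        # byte 0xyz that cannot be decoded becomes the lone surrogate U+DCyz,
        # so decoding never raises an error.
        out = []
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                        # plain ASCII
                out.append(chr(b))
                i += 1
                continue
            # expected sequence length, judging from the lead byte
            if 0xC2 <= b <= 0xDF:
                length = 2
            elif 0xE0 <= b <= 0xEF:
                length = 3
            elif 0xF0 <= b <= 0xF4:
                length = 4
            else:
                length = 0                      # not a valid lead byte
            if length:
                try:
                    # strict decoding rejects overlong forms, surrogates
                    # (i.e. CESU-8), and truncated sequences
                    out.append(data[i:i + length].decode('utf-8'))
                    i += length
                    continue
                except UnicodeDecodeError:
                    pass
            out.append(chr(0xDC00 + b))         # invalid byte 0xyz -> U+DCyz
            i += 1
        return ''.join(out)

so a stray latin-1 byte like the 0xe9 in b'caf\xe9' decodes to the escape character U+DCE9 instead of raising an error.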

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the range
representable in UTF-16 (surrogates excluded):

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
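
To illustrate the round-trip behavior described above, here is a rough Python sketch of the encoding direction (again only an illustration, not the patched libiconv code; utf8b_encode is a made-up name pairing with the utf8b_decode sketch above):

    def utf8b_encode(text: str) -> bytes:
        # encode UTF-8B: the escape surrogates U+DC80..U+DCFF turn back into
        # the original raw bytes 0x80..0xFF; any other surrogate cannot be
        # round-tripped and is reported as an encoding error.
        out = bytearray()
        for i, ch in enumerate(text):
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                out.append(cp - 0xDC00)         # escaped byte, emit verbatim
            elif 0xD800 <= cp <= 0xDFFF:
                raise UnicodeEncodeError('utf-8b', text, i, i + 1,
                                         'surrogate cannot be round-tripped')
            else:
                out += ch.encode('utf-8')       # everything else is plain UTF-8
        return bytes(out)

    # any byte string survives a decode/encode round trip unchanged
    raw = b'\xc3\xa9 \xe9 \xff'                 # UTF-8, latin-1 and junk mixed
    assert utf8b_encode(utf8b_decode(raw)) == raw

With these two sketches, arbitrary byte strings round-trip losslessly, while encoding any surrogate other than the U+DC80 ... U+DCFF escapes reports an error, as described above.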

Comments:
From: kragen
2006-04-14 06:13 pm (UTC)
Another UTF-8B implementation: http://hyperreal.org/~est/utf-8b/