binary vs. UTF-8, and why it need not matter - Benjamin C. Wiley Sittler

binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data. it even copes with the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary bytes rather than as valid characters.

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>

the original text is archived at:

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
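
as a rough illustration of that summary (not the tpd or libiconv code), python 3's built-in `surrogateescape` error handler uses the same mapping: each undecodable byte 0xNN becomes U+DCNN. since only bytes 0x80 through 0xFF can be invalid in UTF-8, the escapes land in U+DC80 ... U+DCFF. the function names `decode_utf8b` and `encode_utf8b` below are made up for this sketch.

```python
def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0xNN becomes U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Encode to UTF-8; each U+DC80..U+DCFF becomes the raw byte again."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("caf\u00e9") mixed with two bytes that can never occur in UTF-8.
mixed = b"caf\xc3\xa9 \xff\xfe raw"
decoded = decode_utf8b(mixed)

# The valid part decodes normally; the stray bytes show up as escapes.
escapes = [hex(ord(c)) for c in decoded if 0xDC80 <= ord(c) <= 0xDCFF]
print(escapes)  # ['0xdcff', '0xdcfe']

# Encoding the result gives back the original bytes exactly.
assert encode_utf8b(decoded) == mixed
```

valid UTF-8 input passes through unchanged, so a decoder like this never has to report an error.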

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range:
    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
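
The encoding-side behaviour can be sketched with Python's `surrogateescape` handler, which is an assumption standing in for the libiconv patch: it accepts the whole range above, turns U+DC80 ... U+DCFF back into raw bytes, and raises an error for any other surrogate. The patch may draw that line differently. The helper name `try_encode` is hypothetical.

```python
def encode_utf8b(text: str) -> bytes:
    """Encode to UTF-8, turning U+DC80..U+DCFF back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def try_encode(text: str):
    """Return the encoded bytes, or None if the text cannot be encoded."""
    try:
        return encode_utf8b(text)
    except UnicodeEncodeError:
        return None


# The edges of the supported range encode as ordinary UTF-8.
assert try_encode("\ud7ff") == b"\xed\x9f\xbf"
assert try_encode("\ue000") == b"\xee\x80\x80"
assert try_encode("\U0010ffff") == b"\xf4\x8f\xbf\xbf"

# An escaped byte is restored to the byte it stands for.
assert try_encode("\udcff") == b"\xff"

# Other surrogates stand for no input byte, so encoding fails.
assert try_encode("\ud800") is None
assert try_encode("\udc41") is None  # 0x41 is valid ASCII, never escaped
```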

From: Bill Spitzak
2014-10-08 05:42 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

Yes, the original UTF-8, where unpaired surrogates can be encoded, will work for round-tripping both invalid UTF-8 and invalid UTF-16.

The problem with this scheme is that it becomes fiendishly complex if the UTF-16 is anything other than a direct conversion of UTF-8. This basically precludes *any* processing of the data in UTF-16 form, so the scheme is output-only. I recommend using it for two purposes: converting UTF-8 to Windows filenames, and sending UTF-8 to drawing functions that insist on UTF-16.

You cannot use UTF-16 for storing data or editing it if you want to preserve these errors. Believe me, I have tried. It is not possible.
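
One way to see the kind of trouble editing causes is the sketch below. It is an illustration under assumptions, not necessarily the exact difficulty described above: it uses Python strings rather than UTF-16 buffers, and Python's `surrogateescape` handler as a stand-in for the scheme. Two escaped bytes that were invalid on their own become valid UTF-8 once concatenated, so the edited text no longer survives a round trip.

```python
def decode_utf8b(data: bytes) -> str:
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    return text.encode("utf-8", errors="surrogateescape")


left = decode_utf8b(b"\xc3")   # lone lead byte         -> U+DCC3
right = decode_utf8b(b"\xa9")  # lone continuation byte -> U+DCA9

# "Editing" in decoded form: join the two fragments.
joined = left + right

raw = encode_utf8b(joined)     # the bytes C3 A9, which is valid UTF-8
again = decode_utf8b(raw)      # decodes as the single character U+00E9

print(ascii(joined))  # '\udcc3\udca9'
print(ascii(again))   # '\xe9'

# The decoded text changed even though no byte was lost.
assert again != joined
```

The bytes themselves are preserved here; what breaks is the guarantee that decoded text encodes and decodes back to the same text.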