Log in

No account? Create an account
binary vs. UTF-8, and why it need not matter - Benjamin C. Wiley Sittler [entries|archive|friends|userinfo]
Benjamin C. Wiley Sittler

[ website | bsittler ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a way to handle UTF-8 and binary data intermixed at the same time in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data (even the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary data rather than valid characters.)

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>

the original text is archived at:

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF

From: Bill Spitzak
2014-10-08 05:49 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The editor must also not delete the text between two 0xDCxx values, or they may paste together into a valid character.

I'm sure there are a million other things the editor cannot do. Just saying "don't change 0xDCxx" is insufficient.

The only safe thing to do is have the UTF-16->UTF-8 converter check each sequence of 0xDCxx for whether it produces a valid UTF-8 character, and do something else (probably convert the first one to 3 bytes instead of one). However this test explicitly relies on matching whatever the UTF-8 converter considers valid. And this may vary for surrogate halves, I recommend only 0xDCxx be invalid, but some encoders are going to make all of them invalid, none of them, or only trailing surrogates.

The end result is that this is a one-way translation. You cannot get the UTF-8 back, no matter how tempting it is to believe it.

That does not make this useless. It is an excellent scheme if you are forced to use something that needs UTF-16. Just don't convert back. Windows filenames and drawing text on the screen are good targets. Editors and databases are not recommended.

It is not difficult to work directly with UTF-8. You can find the breaks between code points, even if errors are allowed, with limited searches that don't need to look at more than 4 bytes. Considering UTF-16 is variable-length too, and there are combining characters in Unicode, this is trivial, though it may look scary to a novice programmer.

(Reply) (Parent) (Thread)