binary vs. UTF-8, and why it need not matter - Benjamin C. Wiley Sittler
Benjamin C. Wiley Sittler


binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data (even the uncommon and invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary data rather than as valid characters).

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>

the original text is archived at:

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
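
a minimal sketch of that mapping, using python 3's built-in `surrogateescape` error handler, which applies the same U+DCyz substitution to undecodable bytes. this is only an illustration and is not the tpd or libiconv code mentioned in this post; the sample bytes and variable names are made up for the example.

```python
# mixed input: valid UTF-8 ("caf" + e-acute) followed by two bytes that are not valid UTF-8
raw = b"caf\xc3\xa9 \xff\xfe"

# decoding never fails: each invalid byte 0xyz becomes the lone surrogate U+DCyz
text = raw.decode("utf-8", errors="surrogateescape")

# collect the escaped bytes to see what they turned into
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# encoding with the same handler turns each U+DCyz back into the byte 0xyz
restored = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(restored == raw)
```

the valid part decodes as ordinary text, and the two stray bytes survive the trip as U+DCFF and U+DCFE.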

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range:

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
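
a rough way to see the "encoding errors" side of this, again using python 3's `surrogateescape` handler as a stand-in (an assumption: the libiconv patch may differ in detail, for example in how it treats U+DC80 ... U+DCFF). the helper names `encode_utf8b` and `can_encode` are hypothetical.

```python
def encode_utf8b(text: str) -> bytes:
    """Encode with Python's surrogateescape handler (a stand-in, not the libiconv patch)."""
    return text.encode("utf-8", errors="surrogateescape")


def can_encode(text: str) -> bool:
    """True if the text encodes without error under the stand-in codec."""
    try:
        encode_utf8b(text)
        return True
    except UnicodeEncodeError:
        return False


# characters inside the supported range encode normally
in_range = [can_encode(ch) for ch in ("\u0000", "\ud7ff", "\ue000", "\U0010ffff")]

# surrogates that are not escapes for a non-ASCII byte are rejected
rejected = [can_encode(ch) for ch in ("\ud800", "\udc41", "\udfff")]

# U+DCFF is the escape for byte 0xFF, so Python's handler turns it back into that byte
escape_byte = encode_utf8b("\udcff")
```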

From: (Anonymous)
2014-10-07 05:12 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

Actually, it's not blindly converting them - instead, it's only converting them specially when they aren't preceded by the other half-surrogate. Since the UTF-16 aware editing toolchain won't let you break a character in the middle, the U+DCxx-escaped "raw bytes" and any surrogate pairs containing U+DCxx are completely unambiguous with respect to each other.

Note that UTF-8b is an extension of UTF-8 and *not* of CESU-8. This means there's no such thing as a multibyte UTF-8 sequence corresponding to an unpaired surrogate.

CESU-8 really isn't as common as that, at least in my limited experience (it's seen a lot in older stuff needing to be compatible with Java's serialization format or with Oracle's, but not much outside of those narrow contexts so far as I can tell.) Properly-implemented UTF-8 with correct encoding and decoding of surrogate pairs seems a lot more common, and growing more so (thanks in large part to Emoji, I think.)

(Reply) (Parent) (Thread)
[User Picture]From: bsittler
2014-10-07 05:14 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

That was my comment, sorry about the anonymity.
(Reply) (Parent) (Thread)
From: Bill Spitzak
2014-10-07 10:08 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The sequence 0xDCC4,0xDC80 cannot be converted losslessly to UTF-8b. If the converter does the "obvious" thing then it will turn into 0xC4, 0x80, which is the UTF-8 encoding of U+0100, thus a round-trip back to UTF-16 will produce a different sequence, making this useless for storing arbitrary UTF-16 such as Windows filenames. Also this looks like a big security hole since it can produce arbitrary letters.
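
A small sketch of this kind of collision, using Python 3's `surrogateescape` handler as a stand-in for a UTF-8b codec (an assumption; other implementations may reject such input instead). It takes the escapes for the bytes 0xC4 and 0x80, which together happen to form the valid UTF-8 encoding of U+0100.

```python
# two escaped "raw bytes" sitting next to each other in the decoded text
units = "\udcc4\udc80"

# the obvious encoding emits the raw bytes 0xC4 0x80 ...
as_bytes = units.encode("utf-8", errors="surrogateescape")

# ... which decode as one ordinary character, not as two escapes
back = as_bytes.decode("utf-8", errors="surrogateescape")

print(as_bytes, hex(ord(back)), back == units)
```

Starting from the surrogates, the round trip yields a single U+0100 rather than the original two code units.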

You could attempt to solve this by actually recognizing whether the sequence will produce valid UTF-8 and doing something different, such as encoding them as 3 bytes. But that means that sequence of 3 bytes must convert back to 0xDCxx instead of 3 error bytes. This means the UTF-8->UTF-16 converter must have more complicated rules for what is a valid sequence (rather than the implied one that any attempts to encode 0xDCxx are invalid). But these rules control exactly what sequences can be converted from UTF-16 to UTF-8! Thus it makes those rules more complicated, and thus the UTF-8 rules more complicated. I have actually worked on this and not been able to find a stable set; I suspect it is impossible. The end result is that both directions of conversion are lossy!

The rules I recommend are that unpaired surrogate halves are considered valid UTF-8 and are converted from/to unpaired surrogate halves in UTF-16. This makes all possible UTF-16 strings, including invalid ones, produce a unique UTF-8 string. Therefore conversion of UTF-16 to UTF-8 and back again is lossless.
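
One way to try this rule out is Python 3's `surrogatepass` error handler, which encodes an unpaired surrogate as an ordinary 3-byte sequence and decodes it back. This is a sketch in the spirit of the rule above, not any particular converter; the helper names and the sample code units are made up.

```python
import struct


def utf16_units_to_utf8(units):
    """Convert 16-bit code units (possibly with unpaired surrogates) to bytes."""
    raw16 = struct.pack("<%dH" % len(units), *units)
    text = raw16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def utf8_to_utf16_units(data):
    """Inverse of the function above."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw16 = text.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw16) // 2), raw16))


# 'A', two unpaired low surrogates, an unpaired high surrogate, 'B', a real pair
units = [0x0041, 0xDCA2, 0xDC80, 0xD800, 0x0042, 0xD83D, 0xDE00]
encoded = utf16_units_to_utf8(units)
decoded = utf8_to_utf16_units(encoded)

print(decoded == units)
```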

The opposite direction is always going to be lossy. I recommend each byte of error in UTF-8 be converted to a value that depends on that byte's value. Converting to 0xDCxx is a good idea because the resulting UTF-16 is invalid as well, but you cannot recover the original UTF-8 from it (you can make a good guess). I have also just converted them to CP1252 characters, since the only reason you should convert to UTF-16 is to feed to a display or filename API that requires it. Never store UTF-16 in your program; all internal data should be in byte arrays containing (possibly invalid) UTF-8.
(Reply) (Parent) (Thread)
[User Picture]From: bsittler
2014-10-08 01:14 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The intent of UTF-8b is to allow arbitrary mixing of binary and UTF-8 with lossless round-tripping through UTF-16+unpaired surrogates. It does not, however, help with intermixed binary and UTF-16, nor with lossless round-tripping of such through UTF-8. It certainly does not attempt to solve these problems both at the same time. If someone wants to design a UTF-16b for the purpose of editing intermixed binary and UTF-16 they are welcome to (and in fact the original specification of UTF-8 might work for this.) I do not know of a way to guarantee lossless bidirectional round-tripping between arbitrarily intermixed binary+UTF-8 and arbitrarily intermixed binary+UTF-16.
(Reply) (Parent) (Thread)
From: Bill Spitzak
2014-10-08 05:42 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

Yes, the original UTF-8, where unpaired surrogates can be encoded, will work for round-tripping both invalid UTF-8 and invalid UTF-16.

The problem with this scheme is that it becomes fiendishly complex if the UTF-16 is anything other than a conversion of UTF-8. This basically precludes *any* processing of the data in UTF-16 form. So this is output-only. I recommend using this for 2 purposes: one is to convert UTF-8 to Windows filenames, the second is to send UTF-8 to drawing functions that insist on UTF-16.

You cannot use UTF-16 for storing data or editing it if you want to preserve these errors. Believe me I have tried. It is not possible.
(Reply) (Parent) (Thread)
[User Picture]From: bsittler
2014-10-08 01:29 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

In fact the double conversion from arbitrary binary (treated as UTF-8b) -> UTF-16+unpaired surrogates -> arbitrary binary (treated as UTF-8b) is lossless and the codec rules are straightforward. It allows use of UTF-16-based editing controls for manipulating the text parts and (provided the editing control doesn't mutate the unpaired surrogates) the unmodified portions will still round-trip. The same is also true of UCS-4/UTF-32 representations of the text provided (again) that code units in the 0x0000DCxx range are not mutated.
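
A quick check of that double conversion, sketched with Python 3's `surrogateescape` and `surrogatepass` handlers standing in for a UTF-8b codec (an assumption; the helper names are made up). It pushes every possible two-byte input, plus one mixed text/binary sample, through UTF-16 and back.

```python
def bytes_to_utf16(data: bytes) -> bytes:
    """Arbitrary bytes -> UTF-16LE, with invalid bytes carried as unpaired U+DCxx."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-16-le", errors="surrogatepass")


def utf16_to_bytes(raw16: bytes) -> bytes:
    """Inverse direction: unpaired U+DCxx code units become raw bytes again."""
    text = raw16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


# every two-byte sequence, valid UTF-8 or not
samples = [bytes([a, b]) for a in range(256) for b in range(256)]
# text with an accent and an emoji, followed by stray binary bytes
samples.append("h\u00e9llo \U0001F600".encode("utf-8") + b"\xff\x00\xc4")

all_ok = all(utf16_to_bytes(bytes_to_utf16(s)) == s for s in samples)
print(all_ok)
```

This only covers the bytes -> UTF-16 -> bytes direction with no editing in between.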
(Reply) (Parent) (Thread)
From: Bill Spitzak
2014-10-08 05:49 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The editor must also not delete the text between two 0xDCxx values, or they may paste together into a valid character.

I'm sure there are a million other things the editor cannot do. Just saying "don't change 0xDCxx" is insufficient.

The only safe thing to do is have the UTF-16->UTF-8 converter check each sequence of 0xDCxx for whether it produces a valid UTF-8 character, and do something else (probably convert the first one to 3 bytes instead of one). However this test explicitly relies on matching whatever the UTF-8 converter considers valid. And this may vary for surrogate halves: I recommend only 0xDCxx be invalid, but some encoders are going to make all of them invalid, none of them, or only trailing surrogates.

The end result is that this is a one-way translation. You cannot get the UTF-8 back, no matter how tempting it is to believe it.

That does not make this useless. It is an excellent scheme if you are forced to use something that needs UTF-16. Just don't convert back. Windows filenames and drawing text on the screen are good targets. Editors and databases are not recommended.

It is not difficult to work directly with UTF-8. You can find the breaks between code points, even if errors are allowed, with limited searches that don't need to look at more than 4 bytes. Considering UTF-16 is variable-length too, and there are combining characters in Unicode, this is trivial, though it may look scary to a novice programmer.
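
A minimal sketch of such a bounded search, assuming each invalid byte counts as a unit of its own (the same per-byte treatment as the 0xDCxx escapes). The function names are hypothetical, and validity is delegated to Python's strict UTF-8 decoder.

```python
def sequence_length(lead: int) -> int:
    """Length implied by a UTF-8 lead byte, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead (0x80-0xC1, 0xF5-0xFF)


def unit_start(data: bytes, i: int) -> int:
    """Index where the code point (or lone error byte) covering data[i] begins."""
    for back in range(4):  # a valid sequence has at most 3 continuation bytes
        j = i - back
        if j < 0:
            break
        if data[j] & 0xC0 == 0x80:  # continuation byte: keep looking left
            continue
        n = sequence_length(data[j])
        chunk = data[j:j + n]
        if n > back and len(chunk) == n:
            try:
                chunk.decode("utf-8")  # strict: rejects overlongs and surrogates
                return j
            except UnicodeDecodeError:
                pass
        break
    return i  # data[i] is an error byte that stands on its own


# 'a', e-acute, stray 0xFF, 'z', euro sign, stray continuation byte
data = b"a\xc3\xa9\xffz\xe2\x82\xac\x80"
starts = [unit_start(data, i) for i in range(len(data))]
print(starts)
```

The search steps back over at most three continuation bytes and validates at most one four-byte candidate, so the work per lookup is bounded regardless of the buffer size.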

(Reply) (Parent) (Thread)