Benjamin C. Wiley Sittler

binary vs. UTF-8, and why it need not matter [Apr. 2nd, 2006|07:37 pm]
it turns out there's a simple, fully reversible way to handle UTF-8 and binary data intermixed in the same file, compatible with all valid UTF-8 data. it even copes with the uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although those invalid parts are handled as a sequence of binary bytes rather than as valid characters.

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>


the original text is archived at:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
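here's a minimal sketch of that decoding rule in python (illustrative only, not the tpd codec itself): decode as much valid UTF-8 as possible, and map each undecodable byte 0xyz to the lone surrogate U+DCyz instead of raising an error.

    def utf_8b_decode(data):
        """decode UTF-8, escaping each invalid byte 0xyz as U+DCyz."""
        out = []
        i = 0
        while i < len(data):
            # try the longest slice (up to 4 bytes) that decodes cleanly
            for j in range(min(i + 4, len(data)), i, -1):
                try:
                    out.append(data[i:j].decode('utf-8'))
                except UnicodeDecodeError:
                    continue
                i = j
                break
            else:
                out.append(chr(0xDC00 + data[i]))  # invalid byte -> U+DCxx
                i += 1
        return ''.join(out)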

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range:

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
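(incidentally, python 3 later adopted this same byte-escaping rule as the 'surrogateescape' error handler of PEP 383, which makes the behavior easy to demonstrate:)

    # escaped bytes encode back to the original byte values:
    assert '\udce9'.encode('utf-8', 'surrogateescape') == b'\xe9'

    # but other surrogates are encoding errors, matching the range above:
    try:
        '\ud800'.encode('utf-8', 'surrogateescape')  # unpaired high surrogate
    except UnicodeEncodeError:
        print('not round-trippable, as expected')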

Comments:
From: smws
2006-04-03 07:34 pm (UTC)

See, that one really WAS an obscure unicode entry. :)
From: kragen
2006-04-03 07:49 pm (UTC)

Obscure but very important.
From: (Anonymous)
2006-04-04 04:41 pm (UTC)

Ben attempted to explain its importance to me. As far as I understand it, it's because data with mixed binary and Unicode (in this case represented by UTF-8) is sometimes all intertwingled (like in .tar files, or programs?), and right now it's hard to edit that kind of data without corrupting it.

So I can see that this sort of thing is pretty important, but not in the same way his last post was important.
From: smws
2006-04-04 04:42 pm (UTC)

That was me, btw
From: kragen
2006-04-06 09:00 pm (UTC)

The significance of UTF-8B is much broader than that. The added complexity of having to tell your software whether it's chewing on UTF-8 data or binary data (or something else) runs all the way from the most primitive routines in the software to its user interface. Also, UTF-8 can generate errors in the most surprising places. When every primitive routine that deals with strings --- string length, concatenate with another string, etc. --- can generate an error, your program gains a lot of complexity that doesn't actually get you any benefit, and it's usually untestable complexity. You can't even tell if some of those error cases are correct because you can't figure out how to get them to happen.
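To make the contrast concrete, here is a sketch using Python 3's 'surrogateescape' handler (which implements the same byte-escaping rule): with a strict decoder even "get the text" is a fallible operation, while the escaping decoder is total.

    junk = b'\x80\xff size matters'
    try:
        junk.decode('utf-8')          # strict: every caller needs an error path
    except UnicodeDecodeError:
        pass
    text = junk.decode('utf-8', 'surrogateescape')  # total: never raises
    assert len(text) == 15            # length, concatenation, etc. just work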
From: chamewco
2006-04-04 02:26 am (UTC)

You're side-stepping a more important problem. We need to teach everyone to speak the King's English, and then we won't have this Unicode problem.
From: bsittler
2006-04-04 03:54 pm (UTC)

i'm sure æðelred unræd would disagree ;)
From: kragen
2006-04-14 06:13 pm (UTC)

Another UTF-8B implementation: http://hyperreal.org/~est/utf-8b/
From: kragen
2007-01-03 08:13 am (UTC)

I've just registered utf8b.org.
From: Bill Spitzak
2014-09-26 07:38 pm (UTC)

This is not round-trip safe if UTF-16 is unknown

You cannot use this to store and work on a UTF-16 version of your UTF-8 string. This is because the back-translation cannot blindly convert U+DCxx back into xx bytes. It would have to check whether the resulting arrangement forms valid UTF-8 and do something else in that case; otherwise you have a security hole.

In addition, if you consider UTF-8 encodings of surrogate halves invalid, you will have to figure out some way to encode surrogate halves from UTF-16 back. And you will also be incompatible with 99% of the UTF-8/UTF-16 encoders and decoders out there, which translate surrogate halves to the obvious 3-byte sequences.

If you don't allow arbitrary UTF-16 arrays you have exactly the same problem you are trying to solve for UTF-8 arrays. So using this to preserve binary UTF-8 is not going to work.

My recommendation is to keep the UTF-8 in memory and work on it directly. This is not very hard, and will save considerable time by not allocating arrays and translating back and forth.

I agree the U+DCxx for error bytes is an excellent solution for *displaying* UTF-8 if it has to pass through some code that insists on UTF-16. But don't ever consider it a lossless conversion and thus somehow avoiding the need to store everything in UTF-8.

Note also that the opposite is not a problem. You can store invalid UTF-16 as UTF-8 in a completely transparent and lossless way, by translating unpaired surrogate halves as 3 bytes of UTF-8 each. In fact this is what CESU-8 and perhaps 50% of the translators do (they unfortunately also do this to *paired* surrogate halves...)


From: (Anonymous)
2014-10-07 05:12 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

Actually, it's not blindly converting them - instead, it only converts them specially when they aren't preceded by the other half of a surrogate pair. Since a UTF-16-aware editing toolchain won't let you break a character in the middle, the U+DCxx-escaped "raw bytes" and any surrogate pairs containing U+DCxx are completely unambiguous with respect to each other.

Note that UTF-8b is an extension of UTF-8 and *not* of CESU-8. This means there's no such thing as a multibyte UTF-8 sequence corresponding to an unpaired surrogate.
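A sketch of the pair-aware UTF-16 -> UTF-8b direction (a hypothetical helper, not the actual libiconv patch): a U+DCxx unit that completes a surrogate pair belongs to a real astral character, while a lone U+DCxx in the escape range is a raw byte.

    def utf16_to_utf8b(units):
        """convert a sequence of UTF-16 code units to UTF-8b bytes."""
        out = bytearray()
        i = 0
        while i < len(units):
            u = units[i]
            if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                # a proper surrogate pair: one real astral character
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                out += chr(cp).encode('utf-8')
                i += 2
            elif 0xDC80 <= u <= 0xDCFF:
                out.append(u - 0xDC00)  # lone escape: raw byte 0xyz
                i += 1
            elif 0xD800 <= u <= 0xDFFF:
                raise ValueError('cannot round-trip U+%04X' % u)
            else:
                out += chr(u).encode('utf-8')  # ordinary BMP character
                i += 1
        return bytes(out)

    # U+DCB0 completing a pair is a character; standing alone it is a byte:
    assert utf16_to_utf8b([0xD83D, 0xDCB0]) == '\U0001F4B0'.encode('utf-8')
    assert utf16_to_utf8b([0xDCB0]) == b'\xb0'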

CESU-8 really isn't as common as that, at least in my limited experience (it's seen a lot in older stuff needing to be compatible with Java's serialization format or with Oracle's, but not much outside of those narrow contexts so far as I can tell.) Properly-implemented UTF-8 with correct encoding and decoding of surrogate pairs seems a lot more common, and growing more so (thanks in large part to Emoji, I think.)

-bsittler
From: bsittler
2014-10-07 05:14 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

That was my comment, sorry about the anonymity.
From: Bill Spitzak
2014-10-07 10:08 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The sequence 0xDCC4,0xDC80 cannot be converted losslessly to UTF-8b. If it does the "obvious" thing then it will turn into 0xC4, 0x80, which is the UTF-8 encoding of U+0100, so a round-trip back to UTF-16 will produce a different sequence, making this useless for storing arbitrary UTF-16 such as Windows filenames. Also this looks like a big security hole since it can produce arbitrary letters.
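(The failure is easy to reproduce with Python's 'surrogateescape' handler, which implements the same byte-escaping rule:)

    units = '\udcc4\udc80'                  # two "escaped byte" code units
    raw = units.encode('utf-8', 'surrogateescape')  # b'\xc4\x80' -- valid UTF-8!
    back = raw.decode('utf-8', 'surrogateescape')
    assert back == '\u0100' and back != units       # the round trip is lost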

You could attempt to solve this by actually recognizing when the sequence would produce valid UTF-8 and doing something different, such as encoding them as 3 bytes. But that means that sequence of 3 bytes must convert back to 0xDCxx instead of 3 error bytes. This means the UTF-8->UTF-16 converter must have more complicated rules for what is a valid sequence (rather than the implied one that any attempt to encode 0xDCxx is invalid). But these rules control exactly what sequences can be converted from UTF-16 to UTF-8! Thus it makes those rules more complicated, and thus the UTF-8 rules more complicated. I have actually worked on this and not been able to find a stable set; I suspect it is impossible. The end result is that both directions of conversion are lossy!

The rules I recommend are that unpaired surrogate halves are considered valid UTF-8 and are converted from/to unpaired surrogate halves in UTF-16. This makes all possible UTF-16 strings, including invalid ones, produce a unique UTF-8 string. Therefore conversion of UTF-16 to UTF-8 and back again is lossless.
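A sketch of that recommendation (a hypothetical helper): extend the ordinary UTF-8 bit layout to every code point, so lone surrogate halves get the obvious 3-byte encoding.

    def encode_any(cp):
        """generalized UTF-8: encode any code point, even lone surrogates."""
        if cp < 0x80:
            return bytes([cp])
        if cp < 0x800:
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:  # includes U+D800..U+DFFF
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F,
                          0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert encode_any(0xD800) == b'\xed\xa0\x80'  # lone high surrogate: 3 bytes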

The opposite direction is always going to be lossy. I recommend each error byte in UTF-8 be converted to a value that depends on that byte's value. Converting to 0xDCxx is a good idea because the resulting UTF-16 is invalid as well, but you cannot recover the original UTF-8 from it (you can make a good guess). I have also just converted them to CP1252 characters, since the only reason you should convert to UTF-16 is to feed a display or filename API that requires it. Never store UTF-16 in your program; all internal data should be in byte arrays containing (possibly invalid) UTF-8.
From: bsittler
2014-10-08 01:14 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The intent of UTF-8b is to allow arbitrary mixing of binary and UTF-8 with lossless round-tripping through UTF-16+unpaired surrogates. It does not, however, help with intermixed binary and UTF-16, nor with lossless round-tripping of such through UTF-8. It certainly does not attempt to solve these problems both at the same time. If someone wants to design a UTF-16b for the purpose of editing intermixed binary and UTF-16 they are welcome to (and in fact the original specification of UTF-8 might work for this.) I do not know of a way to guarantee lossless bidirectional round-tripping between arbitrarily intermixed binary+UTF-8 and arbitrarily intermixed binary+UTF-16.
From: Bill Spitzak
2014-10-08 05:42 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

Yes, the original UTF-8, where unpaired surrogates can be encoded, will work for round-tripping both invalid UTF-8 and invalid UTF-16.

The problem with this scheme is that it becomes fiendishly complex if the UTF-16 is anything other than a conversion of UTF-8. This basically precludes *any* processing of the data in UTF-16 form, so this is output-only. I recommend using it for two purposes: converting UTF-8 to Windows filenames, and sending UTF-8 to drawing functions that insist on UTF-16.

You cannot use UTF-16 for storing data or editing it if you want to preserve these errors. Believe me I have tried. It is not possible.
From: bsittler
2014-10-08 01:29 am (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

In fact the double conversion from arbitrary binary (treated as UTF-8b) -> UTF-16+unpaired surrogates -> arbitrary binary (treated as UTF-8b) is lossless and the codec rules are straightforward. It allows use of UTF-16-based editing controls for manipulating the text parts, and (provided the editing control doesn't mutate the unpaired surrogates) the unmodified portions will still round-trip. The same is also true of UCS-4/UTF-32 representations of the text, provided (again) that code units in the 0x0000DCxx range are not mutated.
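For illustration, Python's 'surrogateescape' handler (which implements this same escaping rule) round-trips every possible byte value:

    blob = bytes(range(256))                     # every byte value once
    s = blob.decode('utf-8', 'surrogateescape')  # ASCII + U+DC80..U+DCFF escapes
    assert s.encode('utf-8', 'surrogateescape') == blob  # lossless round trip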
From: Bill Spitzak
2014-10-08 05:49 pm (UTC)

Re: This is not round-trip safe if UTF-16 is unknown

The editor must also not delete the text between two 0xDCxx values, or they may paste together into a valid character.

I'm sure there are a million other things the editor cannot do. Just saying "don't change 0xDCxx" is insufficient.

The only safe thing to do is have the UTF-16->UTF-8 converter check each sequence of 0xDCxx for whether it produces a valid UTF-8 character, and do something else (probably convert the first one to 3 bytes instead of one). However this test explicitly relies on matching whatever the UTF-8 converter considers valid. And this may vary for surrogate halves: I recommend only 0xDCxx be invalid, but some encoders are going to make all of them invalid, none of them, or only trailing surrogates.

The end result is that this is a one-way translation. You cannot get the UTF-8 back, no matter how tempting it is to believe it.

That does not make this useless. It is an excellent scheme if you are forced to use something that needs UTF-16. Just don't convert back. Windows filenames and drawing text on the screen are good targets. Editors and databases are not recommended.

It is not difficult to work directly with UTF-8. You can find the breaks between code points, even if errors are allowed, with limited searches that never need to look at more than 4 bytes. Considering that UTF-16 is variable-length too and that Unicode has combining characters anyway, this is trivial by comparison, though it may look scary to a novice programmer.
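A sketch of that limited search (a hypothetical helper; a full version would also validate the intervening continuation bytes): find the start of the code point containing position pos by backing up at most three bytes.

    def codepoint_start(buf, pos):
        """back up at most 3 bytes to the start of the code point at pos."""
        for back in range(4):
            i = pos - back
            if i < 0:
                break
            b = buf[i]
            if b >= 0xC0:  # lead byte of a 2..4-byte sequence
                length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
                return i if pos - i < length else pos
            if b < 0x80:   # ASCII is only a start if it is pos itself
                return i if back == 0 else pos
            # 0x80..0xBF: continuation byte, keep backing up
        return pos         # orphaned continuation or error byte stands alone

    assert codepoint_start('é'.encode('utf-8') + b'x', 1) == 0  # inside a char
    assert codepoint_start(b'a\x80', 1) == 1                    # stray byte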