binary vs. UTF-8, and why it need not matter
[Apr. 2nd, 2006 | 07:37 pm]
Benjamin C. Wiley Sittler
it turns out there's a way to handle UTF-8 and binary data intermixed in the same file that is simple, fully reversible, and fully compatible with all valid UTF-8 data (even the very uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts are handled as a sequence of binary data rather than valid characters.)
the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
the original text is archived at:
summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
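Kuhn's scheme later became Python 3's "surrogateescape" error handler (PEP 383), which implements essentially the same mapping; a minimal sketch of the round trip using it:

```python
# Python 3's "surrogateescape" error handler (PEP 383) follows the same
# scheme described above: each undecodable byte 0xYZ becomes U+DCYZ.
data = b"abc \xff\xfe def"  # invalid UTF-8 bytes mixed with ASCII

text = data.decode("utf-8", errors="surrogateescape")
# the two bad bytes become lone low surrogates U+DCFF and U+DCFE
assert text == "abc \udcff\udcfe def"

# encoding with the same handler restores the original bytes exactly
assert text.encode("utf-8", errors="surrogateescape") == data
```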
so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)
today i re-implemented it in c as a patch against GNU libiconv:
This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16-representable range:
U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
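That asymmetry is easy to check with Python's surrogateescape handler, which follows the same design: decoding arbitrary bytes never raises, while code points outside the round-trippable range fail on encoding.

```python
import random

# decoding arbitrary bytes never raises: every byte sequence maps to a string
junk = bytes(random.randrange(256) for _ in range(1000))
junk.decode("utf-8", errors="surrogateescape")  # no exception, ever

# but encoding fails for code points that cannot round-trip, e.g. a lone
# high surrogate U+D800 that never came from an escaped input byte
try:
    "\ud800".encode("utf-8", errors="surrogateescape")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```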
From: Bill Spitzak
2014-10-08 05:49 pm (UTC)
Re: This is not round-trip safe if UTF-16 is unknown
The editor must also not delete the text between two 0xDCxx values, or they may paste together into a valid character.
I'm sure there are a million other things the editor cannot do. Just saying "don't change 0xDCxx" is insufficient.
The only safe thing to do is have the UTF-16->UTF-8 converter check each sequence of 0xDCxx for whether it produces a valid UTF-8 character, and do something else (probably convert the first one to 3 bytes instead of one). However, this test explicitly relies on matching whatever the UTF-8 converter considers valid. And this may vary for surrogate halves: I recommend that only 0xDCxx be treated as invalid, but some encoders are going to make all of them invalid, none of them, or only trailing surrogates.
The end result is that this is a one-way translation. You cannot get the UTF-8 back, no matter how tempting it is to believe it.
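The paste-together hazard is easy to reproduce with Python's surrogateescape handler (same scheme): two separately escaped bytes can concatenate into a sequence whose bytes form a valid UTF-8 character, so two distinct strings map to the same bytes.

```python
# Each byte, escaped on its own, becomes a lone low surrogate:
s1 = b"\xc3".decode("utf-8", errors="surrogateescape")  # '\udcc3'
s2 = b"\xa9".decode("utf-8", errors="surrogateescape")  # '\udca9'

pasted = s1 + s2  # '\udcc3\udca9'
back = pasted.encode("utf-8", errors="surrogateescape")  # b'\xc3\xa9'

# those bytes are exactly the UTF-8 encoding of 'é' (U+00E9), so the
# pasted surrogate pair and the real character are indistinguishable
# after a round trip, even though they were different strings
assert back == "é".encode("utf-8")
assert back.decode("utf-8", errors="surrogateescape") == "é"
assert pasted != "é"
```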
That does not make this useless. It is an excellent scheme if you are forced to use something that needs UTF-16. Just don't convert back. Windows filenames and drawing text on the screen are good targets. Editors and databases are not recommended.
It is not difficult to work directly with UTF-8. You can find the breaks between code points, even if errors are allowed, with limited searches that don't need to look at more than 4 bytes. Considering UTF-16 is variable-length too, and there are combining characters in Unicode, this is trivial, though it may look scary to a novice programmer.
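The bounded search works because UTF-8 continuation bytes all match the bit pattern 10xxxxxx; a sketch in Python (the `codepoint_start` helper here is my own illustration, not from the post):

```python
def codepoint_start(buf: bytes, i: int) -> int:
    """Return the start index of the code point containing buf[i].

    UTF-8 continuation bytes match 0b10xxxxxx, so the search is bounded:
    it never has to look back more than 3 bytes.
    """
    start = i
    while start > 0 and i - start < 3 and buf[start] & 0xC0 == 0x80:
        start -= 1
    return start

buf = "aé€x".encode("utf-8")  # 1-, 2-, 3-, and 1-byte code points
# byte offsets: a=0, é=1..2, €=3..5, x=6
assert codepoint_start(buf, 2) == 1  # middle of 'é' -> its lead byte
assert codepoint_start(buf, 5) == 3  # last byte of '€' -> its lead byte
assert codepoint_start(buf, 6) == 6  # 'x' starts its own code point
```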