Benjamin C. Wiley Sittler (bsittler) wrote,

binary vs. UTF-8, and why it need not matter

it turns out there's a simple, fully reversible way to handle UTF-8 and binary data intermixed in the same file, fully compatible with all valid UTF-8 data. it even copes with the uncommon and completely invalid CESU-8 mess used to cram UTF-16 into UTF-8 without proper decoding, although the invalid parts come through as a sequence of binary bytes rather than as valid characters.

the basic technique comes from:
From: Markus Kuhn <Markus.Kuhn@...>
Subject: Substituting malformed UTF-8 sequences in a decoder
Date: Sun, 23 Jul 2000 22:44:35 +0100
Message-Id: <E13GJ5O-00064N-00@...>


the original text is archived at:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

summary: use U+DCyz to represent each invalid input byte 0xyz rather
than treating these bytes as decoding errors.
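
to make the mapping concrete, here's a rough python sketch of the decoding side (illustrative only: the names are mine, and this is neither the tpd codec nor the libiconv patch). any byte that can't be consumed as part of a valid UTF-8 sequence comes out as U+DCyz.

    def utf_8b_decode(data: bytes) -> str:
        # decode valid UTF-8 greedily; escape each leftover byte 0xyz as U+DCyz
        out = []
        i = 0
        while i < len(data):
            for length in (4, 3, 2, 1):
                chunk = data[i:i + length]
                if len(chunk) != length:
                    continue
                try:
                    out.append(chunk.decode('utf-8'))
                except UnicodeDecodeError:
                    continue
                i += length
                break
            else:
                # invalid byte 0xyz -> U+DCyz (always lands in U+DC80..U+DCFF,
                # since bytes 0x00..0x7F always decode fine on their own)
                out.append(chr(0xDC00 + data[i]))
                i += 1
        return ''.join(out)

so utf_8b_decode(b'\xff') gives '\udcff', while plain ASCII and valid multi-byte sequences decode exactly as they would under ordinary UTF-8.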

so last year i implemented a version in python as part of tpd (see tpd/mainline/encodings/utf_8b.py)

today i re-implemented it in c as a patch against GNU libiconv:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

Implementation Notes:

This implementation of UTF-8B produces no errors on decoding, but
produces encoding errors for Unicode characters that cannot be
round-tripped successfully. The supported Unicode range is the UTF-16
range:

    U+0000 ... U+D7FF
and U+E000 ... U+10FFFF
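
Here is a matching rough sketch of the encoding side, under the same assumptions (illustrative names, not the actual patch): U+DC80 ... U+DCFF turn back into the original raw bytes, everything else encodes as ordinary UTF-8, and any other lone surrogate is exactly a character that cannot be round-tripped, so it raises an encoding error. The round-trip check at the end reuses utf_8b_decode from the sketch above.

    def utf_8b_encode(text: str) -> bytes:
        out = bytearray()
        for i, ch in enumerate(text):
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                # escaped byte: U+DCyz -> 0xyz
                out.append(cp - 0xDC00)
            elif 0xD800 <= cp <= 0xDFFF:
                # other lone surrogates could not have come from the decoder
                raise UnicodeEncodeError('utf-8b', text, i, i + 1,
                                         'surrogate outside U+DC80..U+DCFF')
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    # round trip: arbitrary bytes survive decode followed by re-encode
    raw = b'valid \xc3\xa9 text, then junk: \xff\xfe\x80'
    assert utf_8b_encode(utf_8b_decode(raw)) == raw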