Escape! 2


In case anybody happens to need some C that can take a char* and escape any multi-byte UTF-8 sequences in it as \uXXXX code points (as I needed to, since our cribbed UTF-8 encoder proved brittle), here’s a little bit of code that may help. It’s probably more verbose than it needs to be, but it works, and working code wins! 🙂

Just loop over the bytes in the string and push them into another string, or do like us and use Bill’s handy dandy growing buffer utility.
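If you don't have a growing buffer utility lying around, here is a minimal sketch of what one with printf format support might look like. To be clear, this is not Bill's code: the names gbuf and gbuf_fadd are made up for illustration, and the starting capacity of 64 bytes is an arbitrary assumption.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a growing buffer; not the utility from the post. */
typedef struct {
    char  *data;  /* NUL-terminated contents, or NULL before the first append */
    size_t len;   /* bytes used, excluding the terminator */
    size_t cap;   /* bytes allocated */
} gbuf;

/* Append printf-style formatted text. Returns 0 on success, -1 on failure. */
static int gbuf_fadd(gbuf *b, const char *fmt, ...)
{
    va_list ap;
    int need;

    /* First pass: ask vsnprintf how many bytes the formatted text needs. */
    va_start(ap, fmt);
    need = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (need < 0)
        return -1;

    /* Grow geometrically until the text plus its terminator fits. */
    if (b->len + (size_t)need + 1 > b->cap) {
        size_t ncap = b->cap ? b->cap : 64; /* assumed starting size */
        char *p;
        while (ncap < b->len + (size_t)need + 1)
            ncap *= 2;
        p = realloc(b->data, ncap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = ncap;
    }

    /* Second pass: format directly onto the end of the buffer. */
    va_start(ap, fmt);
    vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
    va_end(ap);
    b->len += (size_t)need;
    return 0;
}
```

Start from a zeroed struct (gbuf b = {0};), append with gbuf_fadd(&b, "\\u%04x", c), and free(b.data) when you're done.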


if ((unsigned char)string[idx] >= 0x80) { // It's a wide UTF-8 character.

  if ( (unsigned char)string[idx] >= 0xC0
       && (unsigned char)string[idx] <= 0xF4) { // We know we're starting a multi-byte UTF-8 sequence
    clen = 1;

    if (((unsigned char)string[idx] & 0xF0) == 0xF0) { // It's a 4 byte character
      clen = 3;
      c = (unsigned char)string[idx] ^ 0xF0;
    } else if (((unsigned char)string[idx] & 0xE0) == 0xE0) { // This means 3 bytes.
      clen = 2;
      c = (unsigned char)string[idx] ^ 0xE0;
    } else if (((unsigned char)string[idx] & 0xC0) == 0xC0) { // And that's 2.
      clen = 1;
      c = (unsigned char)string[idx] ^ 0xC0;
    }

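    for (;clen;clen--) {
      idx++; // look at the next byte

      // a continuation byte must look like 10xxxxxx; checking this also
      // stops us from running past the terminating NUL on truncated input
      if (((unsigned char)string[idx] & 0xC0) != 0x80)
        return NULL;

      // only the last 6 bits are used for data
      c = (c << 6) | ((unsigned char)string[idx] & 0x3F);
    }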

    /* Use sprintf or the like to shove the
       hex value of 'c' into, well, something.
       We have a handy growing buffer thing
       with printf format support, so we do this: */

    buffer_fadd(buf, "\\u%04x", c);

  } else { // Arg!  It doesn't start with a valid first byte.
    return NULL;
  }

} else {
  // It's not a wide character, treat it as ASCII ...
}
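For anyone who wants something to paste and compile, here is a self-contained sketch of the same idea. It is a rework under stated assumptions, not the code we actually run: the names utf8_decode and utf8_escape are hypothetical, output goes into a caller-supplied buffer via snprintf instead of a growing buffer, and lead bytes 0xC0 and 0xC1 are rejected because they can only begin overlong encodings. It does not catch every overlong or out-of-range sequence, and code points above U+FFFF come out with more than four hex digits.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the lead byte or a continuation byte is invalid. */
static size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned long c;
    size_t extra, i;

    if (p[0] < 0x80) { *cp = p[0]; return 1; }                             /* ASCII   */
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { extra = 3; c = p[0] & 0x07; } /* 4 bytes */
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) { extra = 2; c = p[0] & 0x0F; } /* 3 bytes */
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) { extra = 1; c = p[0] & 0x1F; } /* 2 bytes */
    else return 0;                                    /* not a valid lead byte */

    for (i = 1; i <= extra; i++) {
        /* Continuation bytes look like 10xxxxxx. A NUL fails this test,
           so a truncated string is rejected before we read past its end. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (p[i] & 0x3F);  /* only the low 6 bits carry data */
    }
    *cp = c;
    return extra + 1;
}

/* Hypothetical wrapper: copy src to dst, writing every non-ASCII character
   as \uXXXX. Returns 0 on success, -1 on invalid UTF-8 or if dst is too small. */
static int utf8_escape(const char *src, char *dst, size_t dstlen)
{
    size_t used = 0;

    if (dstlen == 0)
        return -1;
    dst[0] = '\0';

    while (*src != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(src, &cp);
        int w;

        if (n == 0)
            return -1;
        if (cp < 0x80)
            w = snprintf(dst + used, dstlen - used, "%c", (int)cp);
        else
            w = snprintf(dst + used, dstlen - used, "\\u%04lx", cp);
        if (w < 0 || (size_t)w >= dstlen - used)
            return -1;               /* output would not fit */
        used += (size_t)w;
        src += n;
    }
    return 0;
}
```

Calling utf8_escape("caf\xc3\xa9", out, sizeof out) with a 64-byte out leaves caf\u00e9 in the buffer. Splitting the decode step out of the escaping loop is just a design preference; it keeps the bit twiddling in one place and makes it easy to test on its own.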

Maybe, just maybe, someone else out there won’t waste 6 hours of their life that they’ll never get back trying to do this same thing that’s been done a thousand times but never documented simply …


2 thoughts on “Escape!

  • miker Post author

    No, the earlier code was from a small chunk of example code I found online with an “eh, do whatever you want with this” message next to it.

    The new version started out as a set of macros from a Perl header file, but was totally rewritten over the course of a couple of days, after I spent about 6 hours digesting the RFCs for Unicode, UCS and UTF-8 … 🙂
