In case anybody happens to need some C that can take a char* and turn any UTF-8 byte sequences in there into \uXXXX escapes (as I needed to; our cribbed UTF-8 encoder proved brittle), here's a little bit of code that may help. It's probably more verbose than it needs to be, but it works, and working code wins! 🙂
Just loop over the bytes in the string and push them into another string, or do like us and use Bill’s handy dandy growing buffer utility.
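For context, the snippet below is meant to live inside that loop. The names string, idx, c, clen, and buf are the ones the snippet uses; the declarations here are just my sketch of the scaffolding it assumes (buf being whatever output buffer you're using), so adjust the types to taste:

    unsigned int c = 0;   /* the decoded code point */
    int clen = 0;         /* continuation bytes left to read */
    size_t idx;

    for (idx = 0; string[idx]; idx++) {
        /* ... the snippet below goes here ... */
    }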
if ((unsigned char)string[idx] >= 0x80) { // It's a wide UTF-8 character.
    if (   (unsigned char)string[idx] >= 0xC0
        && (unsigned char)string[idx] <= 0xF4) { // We know we're starting a UTF-8 sequence
        clen = 1;
        if (((unsigned char)string[idx] & 0xF0) == 0xF0) { // It's a 4-byte character,
            clen = 3;                                      // so 3 continuation bytes.
            c = (unsigned char)string[idx] ^ 0xF0;
        } else if (((unsigned char)string[idx] & 0xE0) == 0xE0) { // This means 3 bytes.
            clen = 2;
            c = (unsigned char)string[idx] ^ 0xE0;
        } else if (((unsigned char)string[idx] & 0xC0) == 0xC0) { // And that's 2.
            clen = 1;
            c = (unsigned char)string[idx] ^ 0xC0;
        }
        for (; clen; clen--) {
            idx++; // look at the next byte
            // continuation bytes must look like 10xxxxxx; this check also
            // catches a string that ends in the middle of a sequence
            if (((unsigned char)string[idx] & 0xC0) != 0x80)
                return NULL;
            // only the last 6 bits are used for data
            c = (c << 6) | ((unsigned char)string[idx] & 0x3F);
        }
        /* Use sprintf or the like to shove the
           hex value of 'c' into, well, something.
           We have a handy growing buffer thing
           with printf format support, so we do this: */
        if (c > 0xFFFF) {
            // JSON's \u escapes are UTF-16, so anything outside the
            // Basic Multilingual Plane needs a surrogate pair
            c -= 0x10000;
            buffer_fadd(buf, "\\u%04x\\u%04x",
                0xD800 | (c >> 10), 0xDC00 | (c & 0x3FF));
        } else {
            buffer_fadd(buf, "\\u%04x", c);
        }
    } else { // Arg! It doesn't start with a valid first byte.
        return NULL;
    }
} else {
    // It's not a wide character; treat it as ASCII ...
}
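buffer_fadd() is our growing-buffer utility, so you won't have it in your tree. In case it saves someone a step, here's the same logic rolled up into one self-contained function using plain sprintf into a malloc'd string; the name uni_escape() and its interface are just made up for this sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Escape every non-ASCII UTF-8 sequence in 'string' as \uXXXX
       (with surrogate pairs above U+FFFF) and return a malloc'd copy.
       Returns NULL on malformed UTF-8 or malloc failure; caller frees. */
    char *uni_escape(const char *string) {
        size_t idx, out = 0;
        /* worst case: every input byte becomes a 6-char \uXXXX escape */
        char *result = malloc(strlen(string) * 6 + 1);
        if (!result)
            return NULL;

        for (idx = 0; string[idx]; idx++) {
            unsigned char byte = (unsigned char)string[idx];
            unsigned int c = 0;
            int clen = 0;

            if (byte < 0x80) {                 /* plain ASCII: copy through */
                result[out++] = string[idx];
                continue;
            }

            if (byte < 0xC0 || byte > 0xF4) {  /* not a valid first byte */
                free(result);
                return NULL;
            }

            if ((byte & 0xF0) == 0xF0)      { clen = 3; c = byte ^ 0xF0; }
            else if ((byte & 0xE0) == 0xE0) { clen = 2; c = byte ^ 0xE0; }
            else                            { clen = 1; c = byte ^ 0xC0; }

            for (; clen; clen--) {
                idx++;
                /* continuation bytes must look like 10xxxxxx */
                if (((unsigned char)string[idx] & 0xC0) != 0x80) {
                    free(result);
                    return NULL;
                }
                c = (c << 6) | ((unsigned char)string[idx] & 0x3F);
            }

            if (c > 0xFFFF) {                  /* outside the BMP: pair up */
                c -= 0x10000;
                out += sprintf(result + out, "\\u%04x\\u%04x",
                               0xD800 | (c >> 10), 0xDC00 | (c & 0x3FF));
            } else {
                out += sprintf(result + out, "\\u%04x", c);
            }
        }

        result[out] = '\0';
        return result;
    }

Call it as escaped = uni_escape(input), free() the result when you're done, and check for NULL, which means the input wasn't valid UTF-8 (or malloc failed).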
Maybe, just maybe, someone else out there won’t waste 6 hours of their life that they’ll never get back trying to do this same thing that’s been done a thousand times but never documented simply …
