In case anybody happens to need some C that can walk a char* and rewrite any multibyte UTF-8 sequences in it as \uXXXX escapes (as I needed to, after our cribbed UTF-8 encoder proved brittle), here’s a little bit of code that may help. It’s probably more verbose than it needs to be, but it works, and working code wins! 🙂
Just loop over the bytes in the string and push them into another string, or do like us and use Bill’s handy dandy growing buffer utility.
if ((unsigned char)string[idx] >= 0x80) {
    // It's a wide UTF-8 character.
    if ((unsigned char)string[idx] >= 0xC0 && (unsigned char)string[idx] <= 0xF4) {
        // We know we're starting a UTF-8 sequence
        clen = 1;
        if (((unsigned char)string[idx] & 0xF0) == 0xF0) {
            // It's a 4 byte character
            clen = 3;
            c = (unsigned char)string[idx] ^ 0xF0;
        } else if (((unsigned char)string[idx] & 0xE0) == 0xE0) {
            // This means 3 bytes.
            clen = 2;
            c = (unsigned char)string[idx] ^ 0xE0;
        } else if (((unsigned char)string[idx] & 0xC0) == 0xC0) {
            // And that's 2.
            clen = 1;
            c = (unsigned char)string[idx] ^ 0xC0;
        }

        for (; clen; clen--) {
            idx++; // look at the next byte
            // every continuation byte must look like 10xxxxxx;
            // bailing here also catches a string truncated mid-sequence
            if (((unsigned char)string[idx] & 0xC0) != 0x80)
                return NULL;
            // only the last 6 bits are used for data
            c = (c << 6) | ((unsigned char)string[idx] & 0x3F);
        }

        /* Use sprintf or the like to shove the hex value of 'c' into,
           well, something. We have a handy growing buffer thing with
           printf format support, so we do this: */
        buffer_fadd(buf, "\\u%04x", c);
        /* NB: a code point above 0xFFFF won't fit in four hex digits;
           if the output has to be legal JSON, emit a UTF-16 surrogate
           pair instead (the full sketch below does that). */
    } else {
        // Arg! It doesn't start with a valid first byte.
        return NULL;
    }
} else {
    // It's not a wide character, treat it as ASCII ...
}
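If you’d rather have the whole thing in one piece, here’s a minimal self-contained sketch of the same approach that doesn’t depend on our growing buffer; it uses plain malloc and sprintf instead. The function name utf8_escape and the buffer sizing are my inventions for this example, not anything from our code. Like the fragment above it doesn’t reject overlong encodings, but it does validate continuation bytes and turns code points above 0xFFFF into a UTF-16 surrogate pair.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a freshly malloc'd copy of 'string' with every multibyte
 * UTF-8 sequence rewritten as \uXXXX (or a \uXXXX\uXXXX surrogate
 * pair above the BMP). Returns NULL on malformed UTF-8 or if malloc
 * fails. The caller frees the result. */
char *utf8_escape(const char *string) {
    size_t len = strlen(string);

    /* Generous worst case: no input byte ever expands to more than
       six output bytes. */
    char *out = malloc(len * 6 + 1);
    if (!out) return NULL;
    char *p = out;

    for (size_t idx = 0; idx < len; idx++) {
        unsigned char b = (unsigned char)string[idx];

        if (b < 0x80) {     // plain ASCII: copy it straight through
            *p++ = (char)b;
            continue;
        }

        int clen;           // how many continuation bytes follow
        unsigned int c;     // the decoded code point

        if (b >= 0xF0 && b <= 0xF4)      { clen = 3; c = b & 0x07; }
        else if (b >= 0xE0 && b <  0xF0) { clen = 2; c = b & 0x0F; }
        else if (b >= 0xC0 && b <  0xE0) { clen = 1; c = b & 0x1F; }
        else { free(out); return NULL; } // stray or invalid first byte

        for (; clen; clen--) {
            idx++;
            unsigned char nb = (unsigned char)string[idx];
            // continuation bytes must look like 10xxxxxx; this also
            // trips on the NUL terminator if the sequence is truncated
            if ((nb & 0xC0) != 0x80) { free(out); return NULL; }
            c = (c << 6) | (nb & 0x3F); // low 6 bits carry the data
        }

        if (c > 0xFFFF) {
            // beyond the BMP: JSON wants a UTF-16 surrogate pair
            c -= 0x10000;
            p += sprintf(p, "\\u%04x\\u%04x",
                         0xD800 | (c >> 10), 0xDC00 | (c & 0x3FF));
        } else {
            p += sprintf(p, "\\u%04x", c);
        }
    }

    *p = '\0';
    return out;
}

Compile it, feed it some UTF-8, and the escapes fall out:

int main(void) {
    char *esc = utf8_escape("caf\xC3\xA9 \xF0\x9F\x99\x82"); // "café 🙂"
    if (esc) {
        printf("%s\n", esc); // prints: caf\u00e9 \ud83d\ude42
        free(esc);
    }
    return 0;
}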
Maybe, just maybe, someone else out there won’t waste 6 hours of their life that they’ll never get back trying to do this same thing that’s been done a thousand times but never documented simply …