The one about Unicode and UTF-8

Aug 22, 2019

Computers don't understand human language. They don't even understand numbers greater than or equal to 2. They only understand 0s and 1s. In order to facilitate communication between computers to users and users to other users through computers, early computer scientists needed a way to translate 0s and 1s to real world concepts.

They invented character encoding standards.

A real world practical example is morse code. Before any communication takes place, humans defined a standard set of rules to encode and decode morse code messages. It defined a system which uses two types of "units": dots and dashes (not so unlike 0s and 1s). Patterns of these dots and dashes would correspond to an English letter or Arabic numeral, and would later encode some non-English letters in the International Morse Code standard.

For example, a dot-dash (· –) corresponds to the letter "A". A dash-dot-dot-dot (– · · ·) corresponds to the letter "B". Dash-dash-dash-dash-dash (– – – – –) corresponds to the number 0.

The Unicode Standard is the de facto encoding standard and is present nearly everywhere in computer programming. Like morse code, Unicode maps patterns of dots and dashes in the form of 0s and 1s to "characters". However, unlike morse code, Unicode has the ambitious goal to map every conceivable written "character" across all human languages (and even some fictional ones!)

In modern languages like Python that already use the UTF-8 encoding format by default, you typically never have to worry about how your strings are encoded until you start needing to send or receive strings in a different encoding.

Before we discuss how the Unicode Standard and the UTF-8 format works, we need to define a few concepts.

Code points

A code point is a one-to-one mapping from a number to a conceptual "thing". These things are typically going to be linguistic concepts, like letters, diacritics, and even emoji. But they can also contain non-linguistic concepts like numbers and "control characters", such as EOL (end-of-line), BACKSPACE, DELETE, CR (carriage return), etc.

Grapheme

A grapheme is the closest thing to what a regular person on the street would think of as a written "character". It represents the smallest atomical unit of writing: letters, ligatures, punctuation marks, symbols, numbers, etc.

Glyph

A glyph is the visual representation of a grapheme. On computers, fonts, such as Times New Roman, Courier, or Comic Sans, each define the exact glyph(s) used for any given grapheme. Changing your font would change the glyph used for the same graphemes.

Fonts do not necessarily define a glyph for every possible grapheme in existence. If you try to render a grapheme on your computer and it can't find the appropriate glyph to use for it, then you will typically get either nothing rendered to the screen or usually some kind of placeholder glyph like a question mark.

The first encodings: ASCII and its shortcomings

The first character encoding standard that rose to prominence was the American Standard for Information Interchange, or ASCII (pronounced ass-kee). ASCII uses 7-bits to define 128 different code points. Since any given pattern of seven 0s and 1s is also a binary representation of a real world number, you could basically have a table with the numbers 0 through 127 in one column and the grapheme it represents in a second column.

For example, the binary number 1000001 means 65 in decimal (real world) numbers, and corresponds to the Latin Capital Letter A. In the old days, roughly speaking, when you typed the letter "A" into a word processor, the following would happen:

The word processor would receive 1000001 as input and store it in the body of your text file.
The word processor would identify that 1000001 corresponds to the Latin Capital Letter A grapheme.
The word processor would identify what font it was configured to use, find the glyph for Latin Capital Letter A, and then display that glyph to the user.

IBM also had their own proprietary format, called the Extended Binary Coded Decimal Interchange Code, or EBCDIC (pronounced ebb-see-dick). EBCDIC was used primarily on IBM computers and IBM mainframes. I've never coded with EBCDIC before, but I have heard it was a huge pain for programmers at the time. Since IBM shipped their mainframes all over the world, IBM architects at the time thought it would be a good idea to define different EBCDIC standards for different places in the world, where one EBCDIC standard could be totally incompatible with another EBCDIC standard. Good luck having two IBM mainframes located in two different countries talk to each other. Maybe it wasn't so obvious at the time, but nowadays it's kind of a no-brainer why EBCDIC would ultimately fall out of style.

ASCII worked pretty well for a while. But since it's a 7-bit standard, there's only so many patterns of bits you can use until you hit 1111111. What happens then? You have already exhausted all 128 code points possible in the encoding space. If you only ever care about English words, this would be fine, but the computing world is becoming increasingly international.

New code points were needed to handle international letters, accents, symbols, and punctuation marks. If you wanted to transmit a message to your friends in Japan, you would need code points to map to Japanese characters. And on social media - how else are people going to convey how they feel without emoji? With words? It's 2019. We need code points for every emoji. Every. Single. Emoji.

The Unicode Standard and the UTF-8 encoding

Enter Unicode.

As of this writing, there have been 137,994 code points defined in the Unicode Standard by the Unicode Consortium. Each code point in unicode is just a number, but is depicted as U+xxxxxx where the x's form the number represented in hexadecimal format. All 128 of your favorite ASCII characters are defined here using the same code point numbers. Latin Capital Letter A has the code point 65 in ASCII and now its Unicode code point is U+0041 (41 in hexadecimal is 65 in decimal).

All 137,994 of the current code points are not defined contiguously from 0 to 137,993 (U+0000 to U+21B09). The Unicode Consortium leave large gaps between some of the codepoints for organizational purposes. Due to its design, The Unicode Standard define code points for at most 1.1 million different graphemes. With the current number at 137,994, we have used roughly only 10% of the code point space.

Back when we were talking about ASCII, the code points themselves were kind of the same thing as the encoding. To render any given code point, you encode the code point value in binary, make sure your binary number uses at least 7 bits by left-padding the binary number with 0s, and then voila, you've got yourself an encoded ASCII string.. But Unicode can and needs to eventually support up to 1.1 million code points!

Let's imagine what would happen if we encoded all 1.1 million code points by their code point value. We would need enough bits (0s and 1s) to count up to 1.1 million in binary. As it so happens, we would need 21 bits to count to a minimum of 1.1 million (in fact, with 21 bits we can count up to 2,087,152). If you wanted to store the Latin Capital Letter A in a file with a code point of U+0041, the letter A with a fixed 21-bit wide encoding would look like this: 000000000000001000001. Look at all of those left-padded 0s! For English text, by switching the encoding from 7-bit ASCII to our imaginary 21-bit fixed width encoding, we would be tripling the size of every text document.

Instead, the bright folks who work on Unicode came up with a set of encoding formats that map Unicode code points to their values in much more clever ways. The de facto standard encoding of these formats, and probably the most clever of them all, in my opinion, is the Unicode Transformat Format - 8-bit, aka UTF-8.

UTF-8 is a dynamic width encoding format that can use a sequence of one, two, three or four 8-bit bytes to map to a Unicode code point. It is dynamic because if you only need one byte to render a code point, you will only use one byte. If you need four bytes, then you'll use four. No space wasted! The way it works is a little confusing at first, but easy to get once you think about it. The key concept is that it reserve some of the bits at the start of each byte as a "header" to help Unicode text parsers understand what they're reading.

Here's how UTF-8 works:

UTF-8 can use 1, 2, 3, or 4 bytes to encode a value to a code point.
Every byte in a UTF-8 sequence begins with a set of header bits.
The "header" bits are a sequence of zero or more 1s followed by a single 0.
The first byte in the sequence is special. In the first byte, the number of 1s in the header indicate the total number of bytes in the sequence.
The second, third, and fourth byte (if any) always begins with the header 10.
All remaining bits within the byte are usable for encoding the code point value in binary.
When decoding a UTF-8 encoded set of bytes, all of the headers are chopped off and the resulting binary digits are concatenated together to form one binary number. That number represents the code point.

	Byte Structure	Number of Usable bits	Highest Code Point Value
#1	0xxxxxxx	7	128
#2	110xxxxx 10xxxxxx	5 + 6 = 11	2,048
#3	1110xxxx 10xxxxxx 10xxxxxx	4 + 6 + 6 = 16	65,536
#4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	3 + 6 + 6 + 6 = 21	2,097,152

Just to drive the point home in the above table, the first row #1 has zero 1s in its header, meaning that this is a standalone byte. Using this header, you can reach any of the 0-127 code points originally defined in ASCII. The third row #3 has three 1s before the first 0. After coming across these header bits, Unicode text parsers will be able to expect three total bytes to represent some code point.

Let's use a real example and encode the grapheme by hand. Let's say we want to encode the 🔥 fire emoji using UTF-8. What would its bytes look like?

First, we determine what the code point is for the fire emoji. Using a Unicode table, we find that it is U+1F525, which in decimal is 128,293. Using the byte structure table above, we can see that we're going to need all 4 bytes to encode it. 3 Bytes wouldn't nearly be enough to count up to 128,293.

The binary representation of 128,293 is 11111010100100101. All we have to do is just left pad this number with 0s until we have 21 bits, and then interpolate the bits into the 21 x's in the byte structure shown on the fourth row. Copied below is the result byte structure.

11110000 10011111 10010100 10100101  # this is what the fire emoji looks like!

_____000 __011111 __010100 __100101  # with the header bits hidden

The second row above with the header bits replaced with underscores illustrates what the computer does when it decodes the bytes back into binary to identify what code point it refers to. After chopping off the header bits, the remaining bits are concatenated together, which reforms the binary representation of 128,293.

Computers are extremely fast at this. Just think of every text message you send containing emoji. Whether it's your phone, your tablet, or your computer, it's able to quickly figure out its code point, its grapheme, and finally the glyph.

And that is 👏 👏 🔥 👌 pretty cool to me.