Unicode Fundamentals

Before diving into the technical details of UTF-16, we need to take a step back and understand the fundamental problem that Unicode solves. After all, why do we need something so complex just to display letters on the screen?

A World of Characters

Think about your keyboard for a moment. Every key you press needs to be translated into something the computer can understand and store. In the early days of computing, this was simple: the ASCII standard used only 7 bits (128 possible values) to represent English letters, numbers, punctuation, and some control characters.

But what about the "ç" in "coração"? And the accents in "José"? And the thousands of Chinese characters? And the Cyrillic alphabet? And the emojis we use every day? 🤔

For decades, different regions of the world created their own solutions – encoding tables like ISO-8859-1 (Latin-1) for Western European languages, Windows-1252, Shift-JIS for Japanese, and hundreds of others. The result was predictable: a document created on one system appeared as gibberish on another.

Unicode as a Universal Catalog

Unicode arrived with a revolutionary proposal: to create a single, universal catalog of all characters from all writing systems in the world. Each character would receive a unique and permanent number, called a code point.

For example:

  • The uppercase letter "A" is code point U+0041
  • The letter "é" is code point U+00E9
  • The character "中" (Chinese) is code point U+4E2D
  • The emoji "😀" is code point U+1F600

The "U+" prefix indicates that we're talking about a Unicode code point, and the number is usually written in hexadecimal.
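You can verify these numbers yourself; as a quick sketch in Python, the built-in `ord()` returns a character's code point and `chr()` goes the other way:

```python
# ord() maps a character to its Unicode code point;
# chr() maps a code point back to the character.
for ch in ["A", "é", "中", "😀"]:
    cp = ord(ch)
    print(f"{ch!r} -> U+{cp:04X}")  # hexadecimal, zero-padded to 4 digits

print(chr(0x1F600))  # → 😀
```

Note that the U+ notation is just a convention for writing the number in hexadecimal; `ord("A")` returns the plain integer 65, which is 0x41.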

The Social Security Number of Characters

Think of code points as the social security number of characters – each one has its unique identifier in the great Unicode system. The Unicode code space spans 1,114,112 code points (U+0000 through U+10FFFF), although only about 150 thousand of them are currently assigned to characters.

These code points are organized into planes:

  • Plane 0 (Basic Multilingual Plane - BMP): U+0000 to U+FFFF (65,536 positions)
    • Contains the most common characters from virtually all modern languages
    • Includes Latin, Greek, Cyrillic, Arabic, Hebrew alphabets, large portions of CJK (Chinese, Japanese, Korean), and much more
  • Planes 1-16 (Supplementary Planes): U+10000 to U+10FFFF
    • Less common characters, historical scripts, emojis, specialized mathematical symbols, etc.
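Because each plane covers exactly 0x10000 (65,536) positions, a code point's plane is simply the value with the low 16 bits dropped. A minimal sketch (the helper name `plane_of` is my own, for illustration):

```python
def plane_of(cp: int) -> int:
    # Each plane spans 0x10000 code points, so the plane number
    # is the part of the value above the low 16 bits.
    return cp >> 16

print(plane_of(0x0041))   # "A"  → 0 (BMP)
print(plane_of(0x4E2D))   # "中" → 0 (BMP)
print(plane_of(0x1F600))  # "😀" → 1 (a supplementary plane)
```

This is why BMP characters (plane 0) are special for UTF-16: they are exactly the ones whose code points fit in 16 bits.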

The Encoding Challenge

Here's the crucial point: Unicode defines WHICH characters exist and their identifying numbers, but does not define HOW these numbers should be stored in bytes in memory or files.

It's like having a list of all addresses in the world, but still needing to decide how to write these addresses on envelopes of different sizes. Would you use one line? Two? How much space would you reserve for each field?

This is exactly what encoding schemes are for:

  • UTF-8: Uses 1 to 4 bytes per character
  • UTF-16: Uses 2 or 4 bytes per character
  • UTF-32: Always uses 4 bytes per character
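The size differences are easy to observe by encoding the same characters with each scheme. A sketch in Python (the `-le` codec variants are used so that no byte-order mark inflates the counts):

```python
# Encode the same characters with each scheme and compare byte counts.
# "utf-16-le" / "utf-32-le" avoid adding a byte-order mark (BOM).
for ch in ["A", "é", "中", "😀"]:
    u8  = ch.encode("utf-8")
    u16 = ch.encode("utf-16-le")
    u32 = ch.encode("utf-32-le")
    print(f"{ch!r}: UTF-8={len(u8)}  UTF-16={len(u16)}  UTF-32={len(u32)}")
```

Running this shows "A" taking 1, 2, and 4 bytes respectively, while "😀" takes 4 bytes in all three encodings – a preview of why no single scheme wins in every scenario.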

Each has its advantages, disadvantages, and ideal use cases. UTF-16, our focus, sits in an interesting – and sometimes controversial – middle ground between efficiency and simplicity.

Now that we understand what code points are and why we need encoding schemes, in the next article we'll uncover how UTF-16 actually works.