UTF-16 Deep Dive

Understanding UTF-16 encoding: how it works, implementation patterns, and best practices for modern applications.

UTF-16: Fundamentals

UTF-16 is a variable-width character encoding defined in the Unicode Standard and ISO/IEC 10646. It represents each Unicode code point as either one or two 16-bit code units: one for characters in the Basic Multilingual Plane, two (a surrogate pair) for everything above U+FFFF.


// JavaScript example
const str = "안녕";
const codeUnits = [...str].map(c => c.charCodeAt(0));
console.log(codeUnits); // [50504, 45397] (U+C548, U+B155)

Encoding Process

BMP Characters

Characters in the Basic Multilingual Plane (BMP) encode as single 16-bit code units. For example, 'A' is encoded as 0x0041.


const ascii = String.fromCodePoint(0x0041); // "A"

Supplementary Planes

Characters outside the BMP use surrogate pairs: two 16-bit code units following a specific algorithm.


// Emoji (U+1F60A) in JavaScript
const emoji = "😊";
const hex = emoji.codePointAt(0).toString(16); // "1f60a"
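
Because JavaScript strings are sequences of UTF-16 code units, the same emoji reports a length of 2, and charCodeAt exposes each surrogate individually:

const smiley = "😊"; // U+1F60A, outside the BMP
console.log(smiley.length);                     // 2 (two code units)
console.log(smiley.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(smiley.charCodeAt(1).toString(16)); // "de0a" (low surrogate)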

Surrogate Pairs Demystified

Algorithm Summary

To encode a code point above U+FFFF as a surrogate pair, UTF-16 follows three steps (sketched in code below):

  1. Subtract 0x10000 to obtain a 20-bit value
  2. Split that value into its high 10 bits and low 10 bits
  3. Add 0xD800 to the high 10 bits (high surrogate) and 0xDC00 to the low 10 bits (low surrogate)
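
A minimal sketch of these three steps in JavaScript (encodeSurrogatePair is an illustrative name, not a built-in):

// Illustrative sketch: encode a supplementary-plane code point
// (U+10000–U+10FFFF) as a UTF-16 surrogate pair.
function encodeSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // step 1: 20-bit value
  const high = 0xD800 + (offset >> 10);  // steps 2–3: high 10 bits
  const low  = 0xDC00 + (offset & 0x3FF); // steps 2–3: low 10 bits
  return [high, low];
}

const [hi, lo] = encodeSurrogatePair(0x1F60A); // "😊"
console.log(hi.toString(16), lo.toString(16)); // d83d de0a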

Implementation Best Practices

Validation

Check that every high surrogate (0xD800–0xDBFF) is immediately followed by a low surrogate (0xDC00–0xDFFF); reject or replace unpaired surrogates before storing or transmitting text.
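
For instance, a well-formedness scan might look like the sketch below (isWellFormedUTF16 is an illustrative name; recent JavaScript engines also ship a built-in String.prototype.isWellFormed):

// Illustrative check: true only if every high surrogate is immediately
// followed by a low surrogate, and no low surrogate appears on its own.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {        // high surrogate
      const next = str.charCodeAt(i + 1);          // NaN at end of string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++;                                         // skip the paired low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) { // lone low surrogate
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("😊"));     // true
console.log(isWellFormedUTF16("\uD800")); // false (unpaired high surrogate)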

Endianness

Handle the byte order mark (BOM) correctly for cross-platform compatibility: U+FEFF at the start of a stream is serialized as FE FF in big-endian order and FF FE in little-endian order.
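
A minimal BOM-sniffing sketch (the function name and the fallback value are our own choices, for illustration):

// Inspect the first two bytes of a UTF-16 stream for a BOM.
function detectUTF16Endianness(bytes) {
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "big-endian";    // FE FF
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "little-endian"; // FF FE
  return "unknown (no BOM present)";
}

console.log(detectUTF16Endianness(new Uint8Array([0xFF, 0xFE, 0x41, 0x00]))); // "little-endian"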

Conversion

Use well-tested libraries or platform APIs for UTF-8/UTF-16/UTF-32 conversions rather than hand-rolled byte manipulation.
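
In JavaScript, for example, the standard TextDecoder can decode UTF-16 byte streams directly; note that TextEncoder only produces UTF-8, so encoding to UTF-16 bytes still requires a library or an explicit loop. A minimal decoding sketch:

// Decode UTF-16LE bytes with the standard TextDecoder:
// "A" = 41 00, "😊" = 3D D8 0A DE (surrogate pair, little-endian)
const bytes = new Uint8Array([0x41, 0x00, 0x3D, 0xD8, 0x0A, 0xDE]);
const text = new TextDecoder("utf-16le").decode(bytes);
console.log(text); // "A😊"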

UTF-16 Decoding

JavaScript Example

const data = [0xD834, 0xDD1E]; // surrogate pair for U+1D11E (musical symbol G clef)
const point = ((data[0] - 0xD800) * 0x400) + (data[1] - 0xDC00) + 0x10000;
console.log(point.toString(16)); // "1d11e"
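
Since JavaScript strings are themselves UTF-16, the same pair can also be turned directly back into a character without any arithmetic:

console.log(String.fromCharCode(0xD834, 0xDD1E)); // "𝄞" (U+1D11E)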

Want to Master UTF-16?

Join our free webinars to learn about encoding standards, validation techniques, and security best practices.
