UTF-16 Deep Dive
Understanding UTF-16 encoding: how it works, implementation patterns, and best practices for modern applications.
UTF-16: Fundamentals
UTF-16 is a variable-width character encoding defined in the Unicode Standard and ISO/IEC 10646. It represents every Unicode code point with either one or two 16-bit code units.
// JavaScript example: each BMP character maps to one 16-bit code unit
const str = "안녕";
const codeUnits = [...str].map(c => c.charCodeAt(0));
console.log(codeUnits); // [50504, 45397] (U+C548, U+B155)
Encoding Process
BMP Characters
Characters in the Basic Multilingual Plane (BMP) encode as single 16-bit code units. For example, 'A' is encoded as 0x0041.
const ascii = String.fromCodePoint(0x0041); // "A"
console.log(ascii.length); // 1 (a single 16-bit code unit)
Supplementary Planes
Characters outside the BMP use surrogate pairs: two 16-bit code units following a specific algorithm.
// Emoji (U+1F60A) in JavaScript
const emoji = "😊";
const hex = emoji.codePointAt(0).toString(16); // "1f60a"
console.log(emoji.length);                     // 2: stored as a surrogate pair
console.log(emoji.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // "de0a" (low surrogate)
Surrogate Pairs Demystified
Algorithm Summary
- Subtract 0x10000 from the code point to leave a 20-bit value
- Split that value into its high and low 10-bit halves
- Add 0xD800 to the high half (high surrogate) and 0xDC00 to the low half (low surrogate), as sketched below
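To make the steps concrete, here is a minimal sketch of the encoding direction in JavaScript. The function name toSurrogatePair is ours for illustration, not a standard API, and the sketch assumes its input is a valid supplementary-plane code point (U+10000 through U+10FFFF).

// Minimal sketch: encode a supplementary-plane code point as a surrogate pair.
// toSurrogatePair is a hypothetical helper, not a built-in API.
function toSurrogatePair(codePoint) {
  const v = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (v >> 10);  // top 10 bits -> high surrogate
  const low = 0xDC00 + (v & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F60A);
console.log(high.toString(16), low.toString(16)); // "d83d" "de0a"

Note that the result matches the code units we saw for "😊" above, which is a quick sanity check for an implementation like this.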
Implementation Best Practices
- Validation: check for complete surrogate pairs and reject lone surrogates (see the sketch after this list)
- Endianness: handle the byte order mark (BOM) correctly for cross-platform compatibility
- Conversion: use well-tested libraries for UTF-8/16/32 conversions rather than hand-rolled logic
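As a rough illustration of the first two points, the sketch below scans a string for lone surrogates and sniffs a BOM before decoding. Both hasLoneSurrogate and decodeUtf16WithBom are hypothetical helpers we made up for this example, and the BOM branch assumes a runtime whose TextDecoder accepts the "utf-16be" label (browsers do; Node.js needs a full-ICU build).

// Returns true if str contains an unpaired surrogate code unit.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {        // high surrogate
      const next = str.charCodeAt(i + 1);          // NaN past the end
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return true;
      i++;                                         // skip the paired low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) { // stray low surrogate
      return true;
    }
  }
  return false;
}

// Sniffs the BOM to pick a byte order, defaulting to big-endian
// (the interpretation RFC 2781 specifies when no BOM is present).
function decodeUtf16WithBom(bytes) {
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return new TextDecoder("utf-16le").decode(bytes.subarray(2));
  }
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return new TextDecoder("utf-16be").decode(bytes.subarray(2));
  }
  return new TextDecoder("utf-16be").decode(bytes);
}

console.log(hasLoneSurrogate("😊"));     // false
console.log(hasLoneSurrogate("\uD83D")); // true (unpaired high surrogate)
console.log(decodeUtf16WithBom(new Uint8Array([0xFF, 0xFE, 0x41, 0x00]))); // "A"

In engines that implement ES2024, the built-in String.prototype.isWellFormed() performs the same lone-surrogate check.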
UTF-16 Decoding
JavaScript Example
const data = [0xD834, 0xDD1E]; // surrogate pair for U+1D11E MUSICAL SYMBOL G CLEF
const point = ((data[0] - 0xD800) * 0x400) + (data[1] - 0xDC00) + 0x10000;
console.log(point.toString(16)); // "1d11e"
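Scaling this up, a decoder walks the code-unit sequence and joins pairs as it goes. Below is a minimal sketch under that approach; decodeCodeUnits is a name we invented, and production code would also decide how to handle lone surrogates (here they pass through unchanged, where a strict decoder might substitute U+FFFD instead).

// Minimal sketch: turn an array of UTF-16 code units into code points.
// decodeCodeUnits is a hypothetical helper, not a standard API.
function decodeCodeUnits(units) {
  const points = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    const next = units[i + 1];
    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      points.push(((unit - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000);
      i++; // consumed two code units
    } else {
      points.push(unit); // BMP character (or a lone surrogate, passed through)
    }
  }
  return points;
}

console.log(decodeCodeUnits([0x0041, 0xD834, 0xDD1E]).map(p => p.toString(16)));
// ["41", "1d11e"]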