Encoding Standards
Unicode encoding standards define how code points are serialized as bytes, which is what lets software on different platforms exchange text reliably.
UTF-8
A variable-width encoding that uses 1 to 4 bytes per code point. It is backward-compatible with ASCII and dominates web usage.
- De facto web standard (roughly 98% of websites)
- Efficient for English and other Latin-script text (1 byte per ASCII character); most CJK characters take 3 bytes
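The variable width is easy to observe directly. The following is a minimal Python sketch, not a reference implementation; the sample characters and the helper name `utf8_length` are chosen here purely for illustration.

```python
# Sample characters picked for illustration, one per UTF-8 length class.
samples = {
    "A": 1,           # U+0041, ASCII range
    "\u00e9": 2,      # U+00E9, Latin small letter e with acute
    "\u4e2d": 3,      # U+4E2D, a CJK ideograph
    "\U0001F600": 4,  # U+1F600, an emoji outside the Basic Multilingual Plane
}


def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, expected in samples.items():
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), expected {expected}")

# ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_text = "hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
```

The last comparison is what ASCII backward compatibility means in practice: any valid ASCII file is already a valid UTF-8 file with the same content.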
UTF-16
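Uses 2 or 4 bytes per code point. Because each code unit is 2 bytes wide, the byte order (endianness) must be known in advance or signalled with a byte order mark. Common in Windows and Java ecosystems.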
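- 2 bytes for code points in the Basic Multilingual Plane; 4 bytes (a surrogate pair) for everything beyond it
- Often more compact than UTF-8 for CJK and other scripts that UTF-8 encodes in 3 bytes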
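Both points, byte order and the 4-byte case, can be sketched in a few lines of Python. The helper `to_surrogates` is a hypothetical name introduced for this example; it applies the standard surrogate-pair arithmetic and is checked against Python's built-in codec.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (above U+FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


# The same character yields different byte sequences depending on byte order.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")

# A code point outside the BMP needs two 16-bit code units (4 bytes).
emoji = "\U0001F600"
high, low = to_surrogates(ord(emoji))
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
matches_codec = manual == emoji.encode("utf-16-be")
```

Using the explicit `utf-16-le` and `utf-16-be` codecs avoids the byte order mark that Python's plain `utf-16` codec prepends when encoding.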
UTF-32
Fixed-width 4 bytes per Unicode code point. Simple but inefficient, primarily used for internal processing.
- 1:1 mapping to Unicode code points
- High memory overhead
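The 1:1 mapping means a UTF-32 buffer can be read as a plain array of 32-bit integers. The sketch below assumes little-endian data without a byte order mark; `utf32_code_points` is an illustrative helper name, not part of any library.

```python
import struct


def utf32_code_points(text: str) -> list[int]:
    """Encode text as little-endian UTF-32 and read it back as raw integers."""
    data = text.encode("utf-32-le")
    count = len(data) // 4  # every code point occupies exactly 4 bytes
    return list(struct.unpack(f"<{count}I", data))


text = "A\u4e2d\U0001F600"  # one ASCII letter, one CJK ideograph, one emoji
code_points = utf32_code_points(text)

# Each unpacked integer is the code point itself, with no further decoding.
is_one_to_one = code_points == [ord(ch) for ch in text]
byte_length = len(text.encode("utf-32-le"))
```

The memory cost is visible in `byte_length`: three characters take 12 bytes, including 4 bytes for the letter "A" that UTF-8 would store in one.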
Encoding Comparison Table
Feature | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|
Character Set | All Unicode | All Unicode | All Unicode |
Byte Width | 1-4 bytes | 2 or 4 bytes | Fixed 4 bytes |
Endianness | None | Matters | Matters |
ASCII Compatibility | Yes | No | No |
Use Cases | Web/Networking | Windows/Java | Internal Processing |
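To make the byte-width row concrete, the following Python sketch measures the same text under all three encodings. The sample strings and the function name `encoded_sizes` are assumptions made for this illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded byte length of text under each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


english = encoded_sizes("hello")
japanese = encoded_sizes("\u65e5\u672c\u8a9e")  # three CJK characters

# Python's generic "utf-16" codec adds a 2-byte byte order mark when encoding.
with_bom = len("hello".encode("utf-16"))

for label, sizes in (("English", english), ("Japanese", japanese)):
    print(label, sizes)
```

For the English sample UTF-8 is the smallest, while for the Japanese sample UTF-16 is smaller than UTF-8. UTF-32 is the largest in both cases.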