Text Encoding and Decoding in Rust

Handling text encoding and decoding is essential for working with strings in Rust, especially when dealing with binary data or interoperating with systems that use different character encodings.

Common Encoding Formats

Rust provides built-in support for UTF-8, UTF-16, and ASCII through the standard library: the std::str module (for example from_utf8 and encode_utf16) and the corresponding methods on String, such as from_utf8 and from_utf16.

UTF-8 Encoding

UTF-8 is the encoding used by Rust's String and &str types. Rust guarantees that both always hold valid UTF-8, and provides methods to convert to and from raw bytes.

fn main() {
    // Create an owned String from a string literal (always valid UTF-8)
    let text: String = "Hello, world!".to_string();

    // Consume the String and take ownership of its UTF-8 bytes
    let bytes: Vec<u8> = text.into_bytes();

    // Convert the bytes back into a String, validating them as UTF-8
    match String::from_utf8(bytes) {
        Ok(s) => println!("Decoded string: {}", s),
        Err(e) => println!("Invalid UTF-8 sequence: {}", e),
    }
}
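
Because UTF-8 is a variable-width encoding, the byte length of a string can differ from its character count. The following is a minimal sketch of that difference, and of a borrowing round trip that avoids taking ownership; the helper names byte_and_char_len and round_trip are hypothetical and used only for illustration.

```rust
use std::str::Utf8Error;

/// Returns (byte length, character count) for a string slice.
/// Hypothetical helper, named for illustration only.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // len() counts UTF-8 bytes; chars() iterates Unicode scalar values.
    (s.len(), s.chars().count())
}

/// Round-trips a string slice through its UTF-8 bytes without copying.
/// Hypothetical helper, named for illustration only.
fn round_trip(s: &str) -> Result<&str, Utf8Error> {
    // as_bytes() borrows the underlying bytes instead of consuming the string.
    let bytes: &[u8] = s.as_bytes();
    // std::str::from_utf8 validates the bytes and borrows them as &str again.
    std::str::from_utf8(bytes)
}
```

For the input "h\u{e9}llo" (with a precomposed e-acute), byte_and_char_len reports 6 bytes but 5 characters, since U+00E9 occupies two bytes in UTF-8.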
 

UTF-16 Encoding

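UTF-16 encodes each character as one or two 16-bit code units: characters in the Basic Multilingual Plane take 2 bytes, and all others take 4 bytes as a surrogate pair. Rust provides str::encode_utf16 to convert a &str into an iterator of u16 code units, and String::from_utf16 to convert code units back into a String. Rust has no char16_t type; its char is a 32-bit Unicode scalar value.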

fn main() {
    // Encode a string slice as UTF-16 code units
    let str_utf16: Vec<u16> = "Hello, world!".encode_utf16().collect();

    // Print the UTF-16 code units
    println!("UTF-16 values: {:?}", str_utf16);
}
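
Going the other way, UTF-16 code units can be decoded with String::from_utf16, which fails on unpaired surrogates. The sketch below wraps both directions; the helper names to_utf16 and from_utf16 are hypothetical wrappers around the standard library calls.

```rust
use std::string::FromUtf16Error;

/// Encodes a string slice as UTF-16 code units.
/// Hypothetical wrapper around str::encode_utf16.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decodes UTF-16 code units, returning an error on unpaired surrogates.
/// Hypothetical wrapper around String::from_utf16.
fn from_utf16(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

A character outside the Basic Multilingual Plane, such as U+1F600, encodes to the surrogate pair 0xD83D 0xDE00, while a lone 0xD83D is rejected by from_utf16. When replacing bad input is acceptable, String::from_utf16_lossy substitutes U+FFFD instead of returning an error.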
 

ASCII Encoding

ASCII is a simple 7-bit encoding. For byte data that contains only ASCII characters, conversion is straightforward because ASCII is a subset of UTF-8:

fn main() {
    // ASCII bytes are also valid UTF-8 (ASCII is a subset of UTF-8)
    let ascii_bytes: &[u8] = b"Rust is awesome!";

    // Borrow the bytes as a UTF-8 &str; unwrap is safe for pure ASCII input
    let ascii_string = std::str::from_utf8(ascii_bytes).unwrap();

    println!("ASCII string: {}", ascii_string);
}
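
When the input is not known to be pure ASCII, it may be worth checking before converting. This is a small sketch, assuming that rejecting any byte above 0x7F is the desired behavior; ascii_to_str is a hypothetical helper name.

```rust
/// Returns the bytes as a &str only if every byte is 7-bit ASCII.
/// Hypothetical helper, named for illustration only.
fn ascii_to_str(bytes: &[u8]) -> Option<&str> {
    if bytes.is_ascii() {
        // ASCII is a subset of UTF-8, so validation succeeds for these bytes.
        std::str::from_utf8(bytes).ok()
    } else {
        // At least one byte is outside the 0x00..=0x7F range.
        None
    }
}
```

With this helper, b"Rust is awesome!" yields Some("Rust is awesome!"), while input containing a byte such as 0xC3 yields None even if the bytes happen to be valid UTF-8.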
 

Decoding Binary Data into Strings

When decoding bytes that may not be valid UTF-8 (for example, data from network or file sources in an unknown encoding), String::from_utf8 returns a Result, so invalid input is reported as an error instead of producing a malformed string:

fn main() {
    // 159 (0x9F) is a continuation byte with no lead byte, so this is not valid UTF-8
    let corrupted_bytes: Vec<u8> = vec![0, 159, 146, 150];

    match String::from_utf8(corrupted_bytes) {
        Ok(valid_string) => println!("Decoded string: {}", valid_string),
        Err(e) => {
            // Recover ownership of the original bytes from the error
            let invalid_bytes = e.into_bytes();
            println!("Invalid UTF-8 bytes: {:?}", invalid_bytes);
        }
    }
}
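
Two other options from the standard library can be useful when an error alone is not enough: String::from_utf8_lossy replaces invalid sequences with U+FFFD, and Utf8Error::valid_up_to reports how much of the input was valid. The sketch below illustrates both; decode_lossy and valid_prefix are hypothetical helper names.

```rust
use std::borrow::Cow;

/// Decodes bytes, replacing invalid sequences with U+FFFD.
/// Hypothetical wrapper around String::from_utf8_lossy.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    // Borrows when the input is already valid UTF-8, allocates otherwise.
    String::from_utf8_lossy(bytes)
}

/// Returns the longest valid UTF-8 prefix of `bytes`.
/// Hypothetical helper, named for illustration only.
fn valid_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => {
            // valid_up_to() is the index of the first invalid byte,
            // so everything before it is guaranteed to be valid UTF-8.
            std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap_or("")
        }
    }
}
```

For the input b"Hi\xFFthere", valid_prefix returns "Hi" and decode_lossy returns "Hi\u{FFFD}there". Lossy decoding discards information about the original bytes, so it suits display and logging better than round-tripping data.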
 

Handling Arbitrary Encodings (Non-UTF)

For non-UTF encodings like ISO-8859-1 or Shift-JIS, the Rust standard library offers no support, so you need external crates such as encoding_rs (encoding and decoding) or chardet (encoding detection). Together, these libraries cover detection and decoding for a wide range of encodings:

extern crate encoding_rs;

use encoding_rs::Encoding;

fn main() {
 let iso_bytes: &[u8] = &[65, 136, 145]; // ASCII "A" and some ISO-8859-1 extended code

 // Use encoding_rs to detect and decode the data
 let (text, _, _) = Encoding