Commit 7cfe718d authored by Milo Yip

Minor update to encoding documentation

parent e590e075
......@@ -6,8 +6,7 @@ According to [ECMA-404](http://www.ecma-international.org/publications/files/ECM
The earlier [RFC4627](http://www.ietf.org/rfc/rfc4627.txt) stated that,
> (in §3) JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
> (in §3) JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
> (in §6) JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written in UTF-8, JSON is 8bit compatible. When JSON is written in UTF-16 or UTF-32, the binary content-transfer-encoding must be used.
......@@ -28,9 +27,9 @@ Those unique numbers are called code points, which is in the range `0x0` to `0x1
There are various encodings for storing Unicode code points. These are called Unicode Transformation Format (UTF). RapidJSON supports the most commonly used UTFs, including
* UTF-8: 8-bit variable-width encoding. It maps a code point to 1-4 bytes.
* UTF-16: 16-bit variable-width encoding. It maps a code point to 1-2 16-bit code units (i.e., 2-4 bytes).
* UTF-32: 32-bit fixed-width encoding. It directly maps a code point to 1 32-bit code unit (i.e. 4 bytes).
* UTF-8: 8-bit variable-width encoding. It maps a code point to 1–4 bytes.
* UTF-16: 16-bit variable-width encoding. It maps a code point to 1–2 16-bit code units (i.e., 2–4 bytes).
* UTF-32: 32-bit fixed-width encoding. It directly maps a code point to a single 32-bit code unit (i.e. 4 bytes).
For UTF-16 and UTF-32, the byte order (endianness) does matter. Within computer memory, they are often stored in the computer's endianness. However, when it is stored in file or transferred over network, we need to state the byte order of the byte sequence, either little-endian (LE) or big-endian (BE).
......@@ -78,7 +77,7 @@ For a detail example, please check the example in [DOM's Encoding](doc/stream.md
## Character Type {#CharacterType}
As shown in the declaration, each encoding has a `CharType` template parameter. Actually, it may be a little bit confusing, but each `CharType` stores a code unit, not a character (code point). As mentioned in previous section, a code point may be encoded to 1-4 code units for UTF-8.
As shown in the declaration, each encoding has a `CharType` template parameter. Actually, it may be a little bit confusing, but each `CharType` stores a code unit, not a character (code point). As mentioned in previous section, a code point may be encoded to 1–4 code units for UTF-8.
For `UTF16(LE|BE)`, `UTF32(LE|BE)`, the `CharType` must be integer type of at least 2 and 4 bytes respectively.
......
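To make the code point versus code unit distinction in the changed lines concrete, here is a minimal sketch in standard C++. It does not use RapidJSON's encoding classes; `EncodeUtf8` and `EncodeUtf16` are hypothetical helpers written for illustration, and they assume the input is a valid Unicode scalar value (not a surrogate, and no larger than `0x10FFFF`).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not RapidJSON API): encode one code point as UTF-8.
// Assumes cp is a valid scalar value; produces 1 to 4 8-bit code units.
std::string EncodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper (not RapidJSON API): encode one code point as UTF-16.
// Assumes cp is a valid scalar value; produces 1 or 2 16-bit code units.
std::u16string EncodeUtf16(std::uint32_t cp) {
    std::u16string out;
    if (cp <= 0xFFFF) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // 20 bits remain, split across a surrogate pair
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

Under these assumptions, U+0024 takes one UTF-8 code unit, U+20AC takes three, and U+1D11E takes four UTF-8 code units or two UTF-16 code units. This is why a `CharType` element holds a code unit and not necessarily a whole character.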