@@ -15,7 +15,7 @@ RapidJSON supports various encodings. It can also validate the encodings of JSON
[TOC]
# Unicode {Unicode}
# Unicode {#Unicode}
From [Unicode's official website](http://www.unicode.org/standard/WhatIsUnicode.html):
> Unicode provides a unique number for every character,
> no matter what the platform,
...
@@ -24,7 +24,7 @@ From [Unicode's official website](http://www.unicode.org/standard/WhatIsUnicode.
Those unique numbers are called code points, which are in the range `0x0` to `0x10FFFF`.
## Unicode Transformation Format {UTF}
## Unicode Transformation Format {#UTF}
There are various encodings for storing Unicode code points. These are called Unicode Transformation Formats (UTF). RapidJSON supports the most commonly used UTFs, including
For a detailed example, please see the [DOM's Encoding](doc/stream.md#Encoding) section.
## Character Type {CharacterType}
## Character Type {#CharacterType}
As shown in the declaration, each encoding has a `CharType` template parameter. This may be a little confusing, but each `CharType` stores a code unit, not a character (code point). As mentioned in the previous section, a code point may be encoded as 1 to 4 code units in UTF-8.
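
The difference between code points and code units can be sketched in plain C++. The function below is an illustration only and does not use RapidJSON; the name `EncodeUtf8` is hypothetical, and the sketch assumes the input is a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8. Each char in the result is one
// code unit. Assumption: cp is a valid scalar value (at most 0x10FFFF and
// not a surrogate); no validation is performed here.
std::string EncodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 1 code unit
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 code units
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 code units
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 code units
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, `EncodeUtf8(0x20AC)` (the euro sign) yields a string of length 3: a single code point stored in three `char` code units.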
...
@@ -84,13 +84,13 @@ For `UTF16(LE|BE)`, `UTF32(LE|BE)`, the `CharType` must be integer type of at le
Note that C++11 introduces `char16_t` and `char32_t`, which can be used for `UTF16` and `UTF32` respectively.
## AutoUTF {AutoUTF}
## AutoUTF {#AutoUTF}
The previous encodings are statically bound at compile time. In other words, the user must know exactly which encodings will be used in memory or in streams. However, sometimes we may need to read or write files of different encodings, so the encoding has to be decided at runtime.
`AutoUTF` is an encoding designed for this purpose. It chooses which encoding to use according to the input or output stream. Currently, it should be used with `EncodedInputStream` and `EncodedOutputStream`.
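
One common way to decide an encoding at runtime is to inspect the byte order mark (BOM) at the start of a stream. The sketch below shows that idea in plain C++. It is not RapidJSON's implementation; `DetectedEncoding` and `DetectByBom` are hypothetical names, and a stream without a BOM is simply reported as `Unknown`.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch; not a RapidJSON type.
enum class DetectedEncoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE, Unknown };

// Guess the encoding of a byte buffer from its BOM, if one is present.
DetectedEncoding DetectByBom(const unsigned char* data, std::size_t size) {
    // The 4-byte BOMs are checked first, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return DetectedEncoding::UTF32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return DetectedEncoding::UTF32BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return DetectedEncoding::UTF8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return DetectedEncoding::UTF16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return DetectedEncoding::UTF16BE;
    return DetectedEncoding::Unknown;  // no BOM: a caller would need another heuristic
}
```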
## ASCII {ASCII}
## ASCII {#ASCII}
Although the JSON standards do not mention [ASCII](http://en.wikipedia.org/wiki/ASCII), sometimes we would like to write 7-bit ASCII JSON for applications that cannot handle UTF-8. Since any JSON can represent Unicode characters with the escape sequence `\uXXXX`, JSON can always be encoded in ASCII.
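
As a rough illustration of how any code point can be written in ASCII, the sketch below produces the `\uXXXX` form and uses a UTF-16 surrogate pair for code points above `0xFFFF`, as JSON requires. It is plain C++, not RapidJSON code; `EscapeCodePoint` is a hypothetical name, and the sketch deliberately ignores the other JSON escapes (quotes, backslashes, control characters).

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Write one code point as ASCII-only JSON string content.
// Assumption: cp is a valid scalar value. ASCII characters are passed
// through unchanged, which is a simplification of real JSON escaping.
std::string EscapeCodePoint(std::uint32_t cp) {
    char buf[16];
    if (cp < 0x80) {
        return std::string(1, static_cast<char>(cp));
    }
    if (cp <= 0xFFFF) {
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        return buf;
    }
    // Code points beyond the Basic Multilingual Plane need a surrogate pair.
    std::uint32_t v = cp - 0x10000;
    unsigned high = 0xD800 + static_cast<unsigned>(v >> 10);
    unsigned low = 0xDC00 + static_cast<unsigned>(v & 0x3FF);
    std::snprintf(buf, sizeof buf, "\\u%04X\\u%04X", high, low);
    return buf;
}
```

With this sketch, U+20AC becomes `\u20AC` and U+1F600 becomes `\uD83D\uDE00`.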
...
@@ -110,13 +110,13 @@ ASCII can be used in input stream. If the input stream contains bytes with value
ASCII *cannot* be used in memory (encoding of `Document` or target encoding of `Reader`), as it cannot represent Unicode code points.
When RapidJSON parses a JSON, it can validate whether the input is a valid sequence of the specified encoding. This option can be turned on by adding `kParseValidateEncodingFlag` to the `parseFlags` template parameter.
If the input encoding and output encoding are different, `Reader` and `Writer` will automatically transcode (convert) the text. In this case, `kParseValidateEncodingFlag` is not necessary, because the input sequence must be decoded anyway, and a sequence that cannot be decoded must be invalid.
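
The claim that decoding implies validation can be seen in a small decoder sketch: a function that turns UTF-8 bytes into a code point has to reject malformed input along the way. This is plain C++ for illustration, not RapidJSON's decoder; `DecodeUtf8` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Decode one code point from the start of a UTF-8 buffer.
// Returns the number of bytes consumed, or 0 if the sequence is invalid.
std::size_t DecodeUtf8(const unsigned char* s, std::size_t n, std::uint32_t* cp) {
    if (n == 0) return 0;
    const unsigned char b0 = s[0];
    if (b0 < 0x80) { *cp = b0; return 1; }

    std::size_t len;
    std::uint32_t value, min;  // min: smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
    else return 0;                // stray continuation byte or invalid lead byte

    if (n < len) return 0;        // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min) return 0;                        // overlong encoding
    if (value > 0x10FFFF) return 0;                   // beyond the Unicode range
    if (value >= 0xD800 && value <= 0xDFFF) return 0; // surrogate code point
    *cp = value;
    return len;
}
```

Any transcoder built on such a decoder gets validation of its input as a side effect, which is why a separate validation pass adds nothing in that case.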
## Transcoder {Transcoder}
## Transcoder {#Transcoder}
Although the encoding functions in RapidJSON are designed for JSON parsing and generation, users may also "abuse" them to transcode non-JSON strings.