# Encoding

According to [ECMA-404](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf),

> (in Introduction) JSON text is a sequence of Unicode code points.

The earlier [RFC4627](http://www.ietf.org/rfc/rfc4627.txt) stated that,

> (in §3) JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8.

> (in §6) JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written in UTF-8, JSON is 8bit compatible.  When JSON is written in UTF-16 or UTF-32, the binary content-transfer-encoding must be used.

RapidJSON supports various encodings. It can also validate the encoding of JSON text and transcode JSON between encodings. All these features are implemented internally, without the need for external libraries (e.g. [ICU](http://site.icu-project.org/)).

[TOC]

# Unicode {#Unicode}
From [Unicode's official website](http://www.unicode.org/standard/WhatIsUnicode.html):
> Unicode provides a unique number for every character, 
> no matter what the platform,
> no matter what the program,
> no matter what the language.

Those unique numbers are called code points, which are in the range `0x0` to `0x10FFFF`.

## Unicode Transformation Format {#UTF}

There are various encodings for storing Unicode code points. These are called Unicode Transformation Formats (UTF). RapidJSON supports the most commonly used UTFs, including:

* UTF-8: 8-bit variable-width encoding. It maps a code point to 1–4 bytes.
* UTF-16: 16-bit variable-width encoding. It maps a code point to 1–2 16-bit code units (i.e., 2–4 bytes).
* UTF-32: 32-bit fixed-width encoding. It directly maps a code point to a single 32-bit code unit (i.e. 4 bytes).
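
To make the code unit counts above concrete, here is a minimal standalone sketch that encodes a single code point as UTF-8 and UTF-16 code units. It uses only the C++ standard library; `EncodeUtf8` and `EncodeUtf16` are hypothetical helper names for illustration and are not part of RapidJSON's API.

```cpp
#include <cstdint>
#include <vector>

// Returns true if cp is a Unicode scalar value, i.e. not a surrogate
// (U+D800..U+DFFF) and not above U+10FFFF.
bool IsScalarValue(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one code point as 1-4 UTF-8 code units (bytes).
// Returns an empty vector if cp is not a scalar value.
std::vector<std::uint8_t> EncodeUtf8(std::uint32_t cp) {
    std::vector<std::uint8_t> out;
    if (!IsScalarValue(cp))
        return out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode one code point as 1-2 UTF-16 code units.
// Code points above U+FFFF become a high/low surrogate pair.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (!IsScalarValue(cp))
        return out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return out;
}
```

For example, U+20AC (the euro sign) takes three UTF-8 code units but a single UTF-16 code unit, while U+1F600 takes four UTF-8 code units and two UTF-16 code units.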

For UTF-16 and UTF-32, the byte order (endianness) matters. In memory, code units are usually stored in the machine's native endianness. However, when the text is stored in a file or transferred over a network, the byte order of the byte sequence must be specified as either little-endian (LE) or big-endian (BE).
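
As a rough illustration of why the byte order must be stated, the sketch below serializes one 16-bit code unit in both orders and reads it back. The `ByteOrder` enum and the `PutUtf16Unit`/`GetUtf16Unit` helpers are hypothetical names used only for this example; they are not RapidJSON functions.

```cpp
#include <cstdint>
#include <vector>

enum class ByteOrder { LittleEndian, BigEndian };

// Append one UTF-16 code unit to a byte buffer in the requested byte order.
void PutUtf16Unit(std::vector<std::uint8_t>& out, std::uint16_t unit, ByteOrder order) {
    const std::uint8_t low = static_cast<std::uint8_t>(unit & 0xFF);
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    if (order == ByteOrder::LittleEndian) {
        out.push_back(low);   // least significant byte first
        out.push_back(high);
    } else {
        out.push_back(high);  // most significant byte first
        out.push_back(low);
    }
}

// Read one UTF-16 code unit from two bytes, interpreting them in the given order.
std::uint16_t GetUtf16Unit(const std::uint8_t* bytes, ByteOrder order) {
    return order == ByteOrder::LittleEndian
        ? static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8))
        : static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}
```

Writing U+20AC as little-endian yields the bytes `AC 20`; reading those same bytes as big-endian yields `0xAC20`, a different code point, which is why a reader has to know the order the writer used.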

RapidJSON provides these encodings via the structs in `rapidjson/encodings.h`:

~~~~~~~~~~cpp
namespace rapidjson {

template<typename CharType = char>
struct UTF8;

template<typename CharType = wchar_t>
struct UTF16;

template<typename CharType = wchar_t>
struct UTF16LE;

template<typename CharType = wchar_t>
struct UTF16BE;

template<typename CharType = unsigned>
struct UTF32;

template<typename CharType = unsigned>
struct UTF32LE;

template<typename CharType = unsigned>
struct UTF32BE;

} // namespace rapidjson
~~~~~~~~~~

For processing text in memory, we normally use `UTF8`, `UTF16` or `UTF32`. For processing text via I/O, we may use `UTF8`, `UTF16LE`, `UTF16BE`, `UTF32LE` or `UTF32BE`.

When using the DOM-style API, the `Encoding` template parameter in `GenericValue<Encoding>` and `GenericDocument<Encoding>` indicates the encoding used to represent JSON strings in memory. So normally we will use `UTF8`, `UTF16` or `UTF32` for this template parameter. The choice depends on the operating system and the other libraries that the application is using. For example, the Windows API represents Unicode characters in UTF-16, while most Linux distributions and applications prefer UTF-8.

Example of UTF-16 DOM declaration:

~~~~~~~~~~cpp
typedef GenericDocument<UTF16<> > WDocument;
typedef GenericValue<UTF16<> > WValue;
~~~~~~~~~~

For a detailed example, see the [DOM's Encoding](doc/stream.md) section.

## Character Type {#CharacterType}

As shown in the declarations, each encoding has a `CharType` template parameter. This may be a little confusing, but each `CharType` stores a code unit, not a character (code point). As mentioned in the previous section, a code point may be encoded as 1–4 code units in UTF-8.

For `UTF16(LE|BE)` and `UTF32(LE|BE)`, the `CharType` must be an integer type of at least 2 and 4 bytes, respectively.

Note that C++11 introduces `char16_t` and `char32_t`, which can be used for `UTF16` and `UTF32` respectively.
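
The distinction between code units and code points can be seen with plain C++11 strings, independently of RapidJSON. The sketch below counts the code points in a `std::u16string`; `CountCodePoints` is a hypothetical helper, and it assumes the input is well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// char16_t must be able to hold a 16-bit UTF-16 code unit,
// and char32_t a 32-bit UTF-32 code unit.
static_assert(sizeof(char16_t) * 8 >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * 8 >= 32, "char32_t holds a UTF-32 code unit");

// Count code points in a well-formed UTF-16 string.
// Every code point contributes exactly one code unit that is not a
// low surrogate (U+DC00..U+DFFF), so those are the units we count.
std::size_t CountCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

For the literal `u"A\U0001F600"`, `size()` reports 3 code units while `CountCodePoints` reports 2 code points, because U+1F600 occupies a surrogate pair.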

## AutoUTF {#AutoUTF}

The previous encodings are statically bound at compile time. In other words, the user must know exactly which encodings will be used in memory or in streams. However, sometimes we need to read or write files of different encodings, so the encoding has to be decided at runtime.

`AutoUTF` is an encoding designed for this purpose. It chooses which encoding to use according to the input or output stream. Currently, it should be used with `EncodedInputStream` and `EncodedOutputStream`.
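
To give a feel for what deciding the encoding at runtime can involve, the sketch below inspects a byte order mark (BOM) at the start of a buffer. This is only an illustration of one common signal, under the assumption that the input begins with a BOM; it is not RapidJSON's implementation, and `DetectedUtf` and `DetectByBom` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

enum class DetectedUtf { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Unknown };

// Guess the encoding of a byte buffer from its byte order mark, if any.
// The 4-byte marks are tested first because the UTF-32LE mark (FF FE 00 00)
// begins with the UTF-16LE mark (FF FE).
DetectedUtf DetectByBom(const std::uint8_t* p, std::size_t n) {
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return DetectedUtf::Utf32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return DetectedUtf::Utf32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return DetectedUtf::Utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return DetectedUtf::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return DetectedUtf::Utf16BE;
    return DetectedUtf::Unknown;  // no BOM: the caller needs another strategy
}
```

Treating `FF FE 00 00` as UTF-32LE rather than UTF-16LE followed by U+0000 is a choice that suits JSON, where a NUL character cannot start the text.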

## ASCII {#ASCII}

Although the JSON standards do not mention [ASCII](http://en.wikipedia.org/wiki/ASCII), sometimes we would like to write 7-bit ASCII JSON for applications that cannot handle UTF-8. Since JSON can represent any Unicode character inside a string with the escape sequence `\uXXXX` (or a pair of such escapes for code points above `0xFFFF`), JSON can always be encoded in ASCII.

Here is an example for writing a UTF-8 DOM into ASCII:

~~~~~~~~~~cpp
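#include "rapidjson/document.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"
#include <iostream>

using namespace rapidjson;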
Document d; // UTF8<>
// ...
StringBuffer buffer;
Writer<StringBuffer, Document::EncodingType, ASCII<> > writer(buffer);
d.Accept(writer);
std::cout << buffer.GetString();
~~~~~~~~~~

ASCII can also be used for an input stream. If the input stream contains bytes with values above 127, parsing fails with a `kParseErrorStringInvalidEncoding` error.

ASCII *cannot* be used in memory (encoding of `Document` or target encoding of `Reader`), as it cannot represent Unicode code points.
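
The `\uXXXX` escaping that makes ASCII output possible can be sketched in a few lines of standard C++. This is an illustration of the escape format only, not RapidJSON's `Writer`: `AppendEscape` and `AppendAscii` are hypothetical helpers, the input is assumed to be a valid Unicode scalar value, and the other escapes JSON requires inside strings (quotes, backslashes, control characters) are deliberately left out.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Append one 16-bit value as a JSON \uXXXX escape (four uppercase hex digits).
void AppendEscape(std::string& out, unsigned unit) {
    char buf[7];  // 6 characters plus the terminating NUL
    std::snprintf(buf, sizeof(buf), "\\u%04X", unit);
    out += buf;
}

// Append a code point to `out` using only 7-bit ASCII characters.
// Assumes cp is a valid Unicode scalar value.
void AppendAscii(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);  // already ASCII
    } else if (cp < 0x10000) {
        AppendEscape(out, cp);         // one escape covers the code point
    } else {
        // Above U+FFFF: escape the UTF-16 surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        AppendEscape(out, 0xD800 | (v >> 10));
        AppendEscape(out, 0xDC00 | (v & 0x3FF));
    }
}
```

Appending `A`, U+20AC and U+1F600 in turn produces the ASCII text `A\u20AC\uD83D\uDE00`.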

# Validation & Transcoding {#ValidationTranscoding}

When RapidJSON parses JSON, it can validate whether the input is a valid sequence in the specified encoding. This option is turned on by adding `kParseValidateEncodingFlag` to the `parseFlags` template parameter.

If the input encoding and output encoding are different, `Reader` and `Writer` will automatically transcode (convert) the text. In this case, `kParseValidateEncodingFlag` is not necessary, because transcoding must decode the input sequence anyway, and a sequence that cannot be decoded is necessarily invalid.
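
To show what "a valid sequence in the specified encoding" means in practice, here is a minimal standalone UTF-8 validity check. It is a sketch written for this explanation, not RapidJSON's validation code, and `IsValidUtf8` is a hypothetical name. It also illustrates the point above: validating amounts to decoding each sequence and rejecting anything that cannot be decoded to a legal code point.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the n bytes at s form well-formed UTF-8.
bool IsValidUtf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        std::size_t len;    // total bytes in this sequence
        std::uint32_t cp;   // code point being decoded
        std::uint32_t min;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len)
            return false;   // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min)
            return false;   // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;   // out of range or a surrogate
        i += len;
    }
    return true;
}
```

The bytes `E2 82 AC` pass, while the overlong form `C0 80`, the encoded surrogate `ED A0 80` and the truncated sequence `E2 82` are all rejected.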

## Transcoder {#Transcoder}

Although the encoding functions in RapidJSON are designed for JSON parsing and generation, users may also "abuse" them to transcode non-JSON strings.

Here is an example for transcoding a string from UTF-8 to UTF-16:

~~~~~~~~~~cpp
#include "rapidjson/encodings.h"
#include "rapidjson/stringbuffer.h" // GenericStringBuffer

using namespace rapidjson;

const char* s = "..."; // UTF-8 string
StringStream source(s);
GenericStringBuffer<UTF16<> > target;

bool hasError = false;
while (source.Peek() != '\0')
    if (!Transcoder<UTF8<>, UTF16<> >::Transcode(source, target)) {
        hasError = true;
        break;
    }

if (!hasError) {
    const wchar_t* t = target.GetString();
    // ...
}
~~~~~~~~~~

You may also use `AutoUTF` and the associated streams to set the source/target encoding at runtime.