# DOM

Document Object Model (DOM) is an in-memory representation of JSON for query and manipulation. The basic usage of the DOM is described in [Tutorial](doc/tutorial.md). This section describes some details and more advanced usage.

[TOC]

# Template {#Template}

In the tutorial, `Value` and `Document` were used. Similar to `std::string`, these are actually `typedef`s of template classes:

~~~~~~~~~~cpp
namespace rapidjson {

template <typename Encoding, typename Allocator = MemoryPoolAllocator<> >
class GenericValue {
    // ...
};

template <typename Encoding, typename Allocator = MemoryPoolAllocator<> >
class GenericDocument : public GenericValue<Encoding, Allocator> {
    // ...
};

typedef GenericValue<UTF8<> > Value;
typedef GenericDocument<UTF8<> > Document;

} // namespace rapidjson
~~~~~~~~~~

Users can customize these template parameters.

## Encoding {#Encoding}

The `Encoding` parameter specifies the encoding of JSON string values in memory. Possible options are `UTF8`, `UTF16` and `UTF32`. Note that these three types are also template classes. `UTF8<>` is `UTF8<char>`, which means using `char` to store the characters. You may refer to [Encoding](encoding.md) for details.

Suppose a Windows application queries localization strings stored in JSON files. Unicode-enabled functions in Windows use UTF-16 (wide character) encoding. No matter what encoding is used in the JSON files, we can store the strings in UTF-16 in memory.

~~~~~~~~~~cpp
using namespace rapidjson;

typedef GenericDocument<UTF16<> > WDocument;
typedef GenericValue<UTF16<> > WValue;

FILE* fp = fopen("localization.json", "rb"); // non-Windows use "r"

char readBuffer[256];
FileReadStream bis(fp, readBuffer, sizeof(readBuffer));

AutoUTFInputStream<unsigned, FileReadStream> eis(bis);  // wraps bis into eis

WDocument d;
d.ParseStream<0, AutoUTF<unsigned> >(eis);

const WValue locale(L"ja"); // Japanese

MessageBoxW(hWnd, d[locale].GetString(), L"Test", MB_OK);
~~~~~~~~~~

## Allocator {#Allocator}

The `Allocator` parameter defines which allocator class is used when allocating/deallocating memory for `Document`/`Value`. A `Document` owns, or references, an `Allocator` instance. A `Value`, on the other hand, does not, in order to reduce memory consumption.

The default allocator used in `GenericDocument` is `MemoryPoolAllocator`. This allocator allocates memory sequentially and cannot deallocate individual allocations. This is very suitable when parsing JSON to generate a DOM tree.

Another allocator is `CrtAllocator`, where CRT is short for C RunTime library. This allocator simply calls the standard `malloc()`/`realloc()`/`free()`. When there are a lot of add and remove operations, this allocator may be preferred. However, it is far less efficient than `MemoryPoolAllocator`.

# Parsing {#Parsing}

`Document` provides several functions for parsing. In the listing below, (1) is the fundamental function, while the others are helpers which call (1).

~~~~~~~~~~cpp
using namespace rapidjson;

// (1) Fundamental
template <unsigned parseFlags, typename SourceEncoding, typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (2) Using the same Encoding for stream
template <unsigned parseFlags, typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (3) Using default parse flags
template <typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (4) In situ parsing
template <unsigned parseFlags, typename SourceEncoding>
GenericDocument& GenericDocument::ParseInsitu(Ch* str);

// (5) In situ parsing, using same Encoding for stream
template <unsigned parseFlags>
GenericDocument& GenericDocument::ParseInsitu(Ch* str);

// (6) In situ parsing, using default parse flags
GenericDocument& GenericDocument::ParseInsitu(Ch* str);

// (7) Normal parsing of a string
template <unsigned parseFlags, typename SourceEncoding>
GenericDocument& GenericDocument::Parse(const Ch* str);

// (8) Normal parsing of a string, using same Encoding for stream
template <unsigned parseFlags>
GenericDocument& GenericDocument::Parse(const Ch* str);

// (9) Normal parsing of a string, using default parse flags
GenericDocument& GenericDocument::Parse(const Ch* str);
~~~~~~~~~~

The examples in [tutorial](tutorial.md) use (9) for normal parsing of a string. The examples in [stream](stream.md) use the first three. *In situ* parsing is described later in this section.

The `parseFlags` are a combination of the following bit-flags:

Parse flags                   | Meaning
------------------------------|-----------------------------------
`kParseDefaultFlags = 0`      | Default parse flags.
`kParseInsituFlag`            | In situ (destructive) parsing.
`kParseValidateEncodingFlag`  | Validate encoding of JSON strings.

By using a non-type template parameter instead of a function parameter, the C++ compiler can generate code which is optimized for the specified combination of flags, improving speed and reducing code size (if only a single specialization is used). The downside is that the flags need to be determined at compile time.
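
The effect of passing flags as a non-type template parameter can be seen in a small standalone sketch. It does not use RapidJSON; `SketchFlag`, `kSketchValidateFlag` and `CountAccepted` are hypothetical names chosen for illustration, and the "validation" is only a stand-in that rejects bytes at or above `0x80`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical flags for this sketch; they only mirror the idea of parseFlags.
enum SketchFlag {
    kSketchDefaultFlags = 0,
    kSketchValidateFlag = 1  // stand-in for validation: reject bytes >= 0x80
};

// Counts the characters accepted from a null-terminated string.
// `flags` is a compile-time constant, so for each instantiation the
// compiler can fold the flag test below and drop the unused path.
template <unsigned flags>
std::size_t CountAccepted(const char* s) {
    assert(s != 0);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (flags & kSketchValidateFlag) {
            if (static_cast<unsigned char>(*s) >= 0x80)
                break;  // stop at the first byte rejected by validation
        }
        ++n;
    }
    return n;
}
```

Calling `CountAccepted<kSketchDefaultFlags>(text)` and `CountAccepted<kSketchValidateFlag>(text)` instantiates two separate functions, which is why a program that uses only one combination pays for only one specialization.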

The `SourceEncoding` parameter defines which encoding is used in the stream. This can differ from the `Encoding` of the `Document`. See the [Transcoding and Validation](#TranscodingAndValidation) section for details.

`InputStream` is the type of the input stream.

## Parse Error {#ParseError}

When parsing succeeds, the `Document` contains the parse results. When there is an error, the original DOM is *unchanged*, and the error state of parsing can be obtained via `bool HasParseError()`, `ParseErrorCode GetParseError()` and `size_t GetErrorOffset()`.

Parse Error Code                            | Description
--------------------------------------------|---------------------------------------------------
`kParseErrorNone`                           | No error.
`kParseErrorDocumentEmpty`                  | The document is empty.
`kParseErrorDocumentRootNotObjectOrArray`   | The document root must be either object or array.
`kParseErrorDocumentRootNotSingular`        | The document root must not be followed by other values.
`kParseErrorValueInvalid`                   | Invalid value.
`kParseErrorObjectMissName`                 | Missing a name for object member.
`kParseErrorObjectMissColon`                | Missing a colon after a name of object member.
`kParseErrorObjectMissCommaOrCurlyBracket`  | Missing a comma or `}` after an object member.
`kParseErrorArrayMissCommaOrSquareBracket`  | Missing a comma or `]` after an array element.
`kParseErrorStringUnicodeEscapeInvalidHex`  | Incorrect hex digit after `\\u` escape in string.
`kParseErrorStringUnicodeSurrogateInvalid`  | The surrogate pair in string is invalid.
`kParseErrorStringEscapeInvalid`            | Invalid escape character in string.
`kParseErrorStringMissQuotationMark`        | Missing a closing quotation mark in string.
`kParseErrorStringInvalidEncoding`          | Invalid encoding in string.
`kParseErrorNumberTooBig`                   | Number too big to be stored in `double`.
`kParseErrorNumberMissFraction`             | Missing fraction part in number.
`kParseErrorNumberMissExponent`             | Missing exponent in number.

The error offset is defined as the number of characters from the beginning of the stream. Currently RapidJSON does not keep track of line numbers.

To get an error message, RapidJSON provides English messages in `rapidjson/error/en.h`. Users can customize them for other locales, or use a custom localization system.

Here is an example of parse error handling.

~~~~~~~~~~cpp
// TODO: example
~~~~~~~~~~
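
As a standalone illustration of the "error code plus offset" convention, the sketch below checks a tiny subset of JSON (a flat array of non-negative integers, no whitespace) and reports the first error it meets. It does not use RapidJSON: `SketchErrorCode`, `SketchResult`, `SketchErrorMessage` and `CheckIntArray` are hypothetical names, and only a few of the error codes from the table are mirrored.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in error codes; the names only echo a few entries of the table above.
enum SketchErrorCode {
    kSketchErrorNone,
    kSketchErrorDocumentEmpty,
    kSketchErrorValueInvalid,
    kSketchErrorArrayMissCommaOrSquareBracket
};

struct SketchResult {
    SketchErrorCode code;
    std::size_t offset;  // number of characters before the error position
};

// English messages, in the spirit of a per-locale message table.
inline const char* SketchErrorMessage(SketchErrorCode code) {
    switch (code) {
    case kSketchErrorNone:          return "No error.";
    case kSketchErrorDocumentEmpty: return "The document is empty.";
    case kSketchErrorValueInvalid:  return "Invalid value.";
    case kSketchErrorArrayMissCommaOrSquareBracket:
        return "Missing a comma or ']' after an array element.";
    }
    return "Unknown error.";
}

// Checks a flat array such as "[1,22,3]" and reports the first error.
// Characters after the closing ']' are not examined in this sketch.
inline SketchResult CheckIntArray(const char* json) {
    assert(json != 0);
    std::size_t i = 0;
    SketchResult r = { kSketchErrorNone, 0 };
    if (json[i] == '\0') { r.code = kSketchErrorDocumentEmpty; return r; }
    if (json[i] != '[')  { r.code = kSketchErrorValueInvalid; return r; }
    ++i;
    if (json[i] == ']') return r;  // empty array
    for (;;) {
        if (json[i] < '0' || json[i] > '9') {
            r.code = kSketchErrorValueInvalid; r.offset = i; return r;
        }
        while (json[i] >= '0' && json[i] <= '9') ++i;  // consume one integer
        if (json[i] == ']') return r;                   // success
        if (json[i] != ',') {
            r.code = kSketchErrorArrayMissCommaOrSquareBracket; r.offset = i; return r;
        }
        ++i;  // skip the comma
    }
}
```

A caller would test `result.code != kSketchErrorNone` and then print `SketchErrorMessage(result.code)` together with `result.offset`. With RapidJSON itself the same pattern applies to the accessors listed above; the English table in `rapidjson/error/en.h` is, to the best of my knowledge, exposed through `GetParseError_En()`.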

## In Situ Parsing {#InSituParsing}

From [Wikipedia](http://en.wikipedia.org/wiki/In_situ):

> *In situ* ... is a Latin phrase that translates literally to "on site" or "in position". It means "locally", "on site", "on the premises" or "in place" to describe an event where it takes place, and is used in many different contexts.
> ...
> (In computer science) An algorithm is said to be an in situ algorithm, or in-place algorithm, if the extra amount of memory required to execute the algorithm is O(1), that is, does not exceed a constant no matter how large the input. For example, heapsort is an in situ sorting algorithm.

In the normal parsing process, a large overhead is decoding JSON strings and copying them to other buffers. *In situ* parsing decodes those JSON strings at the place where they are stored. This is possible in JSON because the decoded string is never longer than the one in JSON. In this context, decoding a JSON string means processing the escapes, such as `"\n"`, `"\u1234"`, etc., and adding a null terminator (`'\0'`) at the end of the string.

The following diagrams compare normal and *in situ* parsing. The JSON string values contain pointers to the decoded string.

![normal parsing](diagram/normalparsing.png)

In normal parsing, the decoded strings are copied to freshly allocated buffers. `"\\n"` (2 characters) is decoded as `"\n"` (1 character). `"\\u0073"` (6 characters) is decoded as `"s"` (1 character).

![in situ parsing](diagram/insituparsing.png)

*In situ* parsing just modifies the original JSON. Updated characters are highlighted in the diagram. If the JSON string does not contain any escape character, such as `"msg"`, the parsing process merely replaces the closing double quotation mark with a null character.
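
The in-place rewrite described above can be sketched without RapidJSON. The helper below is hypothetical (`DecodeStringInSitu` is not a RapidJSON function) and handles only a few single-character escapes; `\uXXXX` is left out to keep it short. It relies on the property that the write cursor never overtakes the read cursor.

```cpp
#include <cassert>

// Decodes one JSON string in place. `p` must point at the opening quote of a
// string inside a writable, null-terminated buffer. On success the decoded,
// null-terminated text starts at p + 1 and the function returns a pointer to
// the character after the closing quote; on error it returns 0.
inline char* DecodeStringInSitu(char* p) {
    assert(p != 0 && *p == '"');
    char* src = p + 1;  // read cursor
    char* dst = p + 1;  // write cursor, never ahead of src
    while (*src != '"') {
        if (*src == '\0') return 0;  // missing closing quotation mark
        if (*src == '\\') {
            ++src;
            switch (*src) {
            case '"':  *dst++ = '"';  break;
            case '\\': *dst++ = '\\'; break;
            case '/':  *dst++ = '/';  break;
            case 'n':  *dst++ = '\n'; break;
            case 't':  *dst++ = '\t'; break;
            default:   return 0;  // escape not supported by this sketch
            }
            ++src;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';     // overwrites the closing quote or an already consumed byte
    return src + 1;  // first character after the closing quote
}
```

For a string without escapes, the only byte that changes is the closing quotation mark, which matches the `"msg"` case in the diagram.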

Since *in situ* parsing modifies the input, the parsing API needs `char*` instead of `const char*`.

~~~~~~~~~~cpp
// Read whole file into a buffer
FILE* fp = fopen("test.json", "r");
fseek(fp, 0, SEEK_END);
size_t filesize = (size_t)ftell(fp);
fseek(fp, 0, SEEK_SET);
char* buffer = (char*)malloc(filesize + 1);
size_t readLength = fread(buffer, 1, filesize, fp);
buffer[readLength] = '\0';
fclose(fp);

// In situ parsing the buffer into d, buffer will also be modified
Document d;
d.ParseInsitu(buffer);

// Query/manipulate the DOM here...

free(buffer);
// Note: At this point, d may have dangling pointers pointing to the deallocated buffer.
~~~~~~~~~~

The JSON strings are marked as constant-string. But they may not be truly "constant": their life cycle depends on the JSON buffer.

*In situ* parsing minimizes allocation overheads and memory copying. Generally this improves cache coherence, which is an important performance factor on modern computers.

There are some limitations of *in situ* parsing:

1. The whole JSON must be in memory.
2. The source encoding in the stream and the target encoding in the document must be the same.
3. The buffer needs to be retained until the document is no longer used.
4. If the DOM needs to be used for a long period after parsing, and there are few JSON strings in the DOM, retaining the buffer may be a waste of memory.

*In situ* parsing is most suitable for short-lived JSON that only needs to be processed once and is then released from memory. In practice, this situation is very common, for example when deserializing JSON to C++ objects or processing web requests represented in JSON.

## Transcoding and Validation {#TranscodingAndValidation}

RapidJSON supports conversion between Unicode formats (officially termed UCS Transformation Format) internally. During DOM parsing, the source encoding of the stream can be different from the encoding of the DOM. For example, the source stream may contain UTF-8 JSON while the DOM uses UTF-16 encoding. There is example code in [EncodedInputStream](doc/stream.md#EncodedInputStream).

When writing JSON from a DOM to an output stream, transcoding can also be used. An example is in [EncodedOutputStream](stream.md#EncodedOutputStream).

During transcoding, the source string is decoded into Unicode code points, and then the code points are encoded in the target format. During decoding, the byte sequence in the source string is validated. If it is not a valid sequence, the parser stops with a `kParseErrorStringInvalidEncoding` error.

When the source encoding of the stream is the same as the encoding of the DOM, the parser will *not* validate the sequence by default. Users may use `kParseValidateEncodingFlag` to force validation.
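
The decode-validate-encode pipeline can be illustrated with a standalone UTF-8 to UTF-16 sketch. This is not RapidJSON's transcoder: `DecodeUtf8`, `EncodeUtf16` and `TranscodeUtf8ToUtf16` are hypothetical helpers, and UTF-16 code units are stored in a plain `std::vector<unsigned short>` for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes one UTF-8 sequence starting at s[i] into `cp` and advances `i`.
// Returns false on an invalid sequence: bad lead or continuation byte,
// truncation, overlong form, surrogate, or value above U+10FFFF.
inline bool DecodeUtf8(const std::string& s, std::size_t& i, unsigned& cp) {
    assert(i < s.size());
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) { cp = c; return true; }
    unsigned extra, minimum;
    if      ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; minimum = 0x80; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; minimum = 0x800; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; minimum = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    for (unsigned k = 0; k < extra; ++k) {
        if (i >= s.size()) return false;  // truncated sequence
        unsigned char t = static_cast<unsigned char>(s[i++]);
        if ((t & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (t & 0x3F);
    }
    if (cp < minimum) return false;                  // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
    return cp <= 0x10FFFF;
}

// Appends `cp` to `out` as one UTF-16 code unit or a surrogate pair.
inline void EncodeUtf16(unsigned cp, std::vector<unsigned short>& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<unsigned short>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
    }
}

// Transcodes a whole string; stops and returns false at the first invalid sequence.
inline bool TranscodeUtf8ToUtf16(const std::string& in, std::vector<unsigned short>& out) {
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned cp;
        if (!DecodeUtf8(in, i, cp)) return false;
        EncodeUtf16(cp, out);
    }
    return true;
}
```

Validation falls out of the decoding step here, which is consistent with the behavior described above: when no transcoding is needed, nothing is decoded, so validation has to be requested explicitly.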

# Techniques {#Techniques}

Some techniques for using the DOM API are discussed here.

## DOM as SAX Event Publisher

In RapidJSON, stringifying a DOM with `Writer` may look a little bit weird.

~~~~~~~~~~cpp
// ...
Writer<StringBuffer> writer(buffer);
d.Accept(writer);
~~~~~~~~~~

Actually, `Value::Accept()` is responsible for publishing SAX events about the value to the handler. With this design, `Value` and `Writer` are decoupled. `Value` can generate SAX events, and `Writer` can handle those events.

Users may create custom handlers for transforming the DOM into other formats, for example a handler which converts the DOM into XML.

~~~~~~~~~~cpp
// TODO: example
~~~~~~~~~~
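
To make the publisher/handler split concrete, here is a standalone toy version of the pattern. It is not RapidJSON code: `ToyValue` supports only strings and objects, and the handler interface (`StartObject`, `Key`, `String`, `EndObject`) is a reduced, hypothetical subset of what a real SAX handler would need. The XML output is not escaped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A toy DOM node: either a string or an object with named members.
struct ToyValue {
    bool isObject;
    std::string text;                // used when !isObject
    std::vector<std::string> names;  // member names, used when isObject
    std::vector<ToyValue> values;    // member values, parallel to names

    ToyValue() : isObject(false) {}

    // Publishes events describing this value to any handler type.
    template <typename Handler>
    void Accept(Handler& handler) const {
        if (!isObject) { handler.String(text); return; }
        handler.StartObject();
        for (std::size_t i = 0; i < names.size(); ++i) {
            handler.Key(names[i]);
            values[i].Accept(handler);
        }
        handler.EndObject();
    }
};

inline ToyValue MakeString(const std::string& s) { ToyValue v; v.text = s; return v; }
inline ToyValue MakeObject() { ToyValue v; v.isObject = true; return v; }
inline void AddMember(ToyValue& obj, const std::string& name, const ToyValue& value) {
    assert(obj.isObject);
    obj.names.push_back(name);
    obj.values.push_back(value);
}

// A handler that turns the events into XML-like text.
struct ToyXmlHandler {
    std::string out;
    std::string pendingName;        // element name for the next value
    std::vector<std::string> open;  // names of currently open elements

    ToyXmlHandler() : pendingName("root") {}

    void Key(const std::string& name) { pendingName = name; }
    void String(const std::string& text) {
        out += "<" + pendingName + ">" + text + "</" + pendingName + ">";
    }
    void StartObject() {
        out += "<" + pendingName + ">";
        open.push_back(pendingName);
    }
    void EndObject() {
        assert(!open.empty());
        out += "</" + open.back() + ">";
        open.pop_back();
    }
};
```

Because `Accept()` is a template over the handler type, the value knows nothing about XML; swapping in a different handler changes the output format without touching `ToyValue`.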

For more about SAX events and handlers, please refer to [SAX](doc/sax.md).

## User Buffer {#UserBuffer}

Some applications may try to avoid memory allocations whenever possible.

`MemoryPoolAllocator` can support this by letting the user provide a buffer. The buffer can be on the program stack, or a "scratch buffer" which is statically allocated (a static/global array) for storing temporary data.

`MemoryPoolAllocator` will use the user buffer to satisfy allocations. When the user buffer is used up, it will allocate a chunk of memory from the base allocator (by default the `CrtAllocator`).

Here is an example of using stack memory.

~~~~~~~~~~cpp
char buffer[1024];
MemoryPoolAllocator<> allocator(buffer, sizeof(buffer)); // class template: the empty <> selects the default base allocator

Document d(&allocator);
d.Parse(json);
~~~~~~~~~~

If the total size of allocations during parsing is less than 1024 bytes, this code does not invoke any heap allocation (via `new` or `malloc()`) at all.

Users can query the current memory consumption in bytes via `MemoryPoolAllocator::Size()`, and then determine a suitable size for the user buffer.
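
The "user buffer first, base allocator afterwards" behavior can be sketched with a small standalone pool. `ToyPool` is hypothetical and much simpler than `MemoryPoolAllocator` (it falls back to one `malloc()` per request instead of allocating chunks, and it assumes the user buffer is suitably aligned), but it shows how a `Size()` query helps in choosing a buffer size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// A toy pool: serves allocations sequentially from a caller-provided buffer
// and falls back to malloc() once that buffer cannot fit a request.
// Individual allocations are never freed; fallback blocks are released
// together in the destructor.
class ToyPool {
public:
    ToyPool(void* buffer, std::size_t capacity)
        : buffer_(static_cast<char*>(buffer)), capacity_(capacity), used_(0), total_(0) {
        assert(buffer != 0);
    }
    ~ToyPool() {
        for (std::size_t i = 0; i < heap_.size(); ++i) std::free(heap_[i]);
    }

    void* Malloc(std::size_t size) {
        size = (size + 7) & ~static_cast<std::size_t>(7);  // round up to 8 bytes
        void* p;
        if (size <= capacity_ - used_) {  // still fits in the user buffer
            p = buffer_ + used_;
            used_ += size;
        } else {                          // user buffer cannot serve this request
            p = std::malloc(size);
            if (p == 0) return 0;
            heap_.push_back(p);
        }
        total_ += size;
        return p;
    }

    std::size_t Size() const { return total_; }              // bytes handed out so far
    std::size_t HeapBlocks() const { return heap_.size(); }  // number of fallback allocations

private:
    ToyPool(const ToyPool&);             // non-copyable
    ToyPool& operator=(const ToyPool&);

    char* buffer_;
    std::size_t capacity_, used_, total_;
    std::vector<void*> heap_;
};
```

Running a representative workload once and reading `Size()` afterwards gives a figure to size the buffer with; if `HeapBlocks()` stays at zero, the buffer was large enough for that workload.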