More docs.

Commit 1590c336 authored Mar 30, 2013 by Kenton Varda
Showing with 264 additions and 23 deletions

doc/_layouts/page.html +1 -0
doc/encoding.md +224 -0
doc/index.md +5 -17
doc/install.md +7 -6
doc/otherlang.md +11 -0
doc/rpc.md +16 -0

--- a/doc/_layouts/page.html
+++ b/doc/_layouts/page.html
@@ -69,6 +69,7 @@
      </script>
      <section id="main_content" class="inner">
        {{ content }}
+        <div style="clear: left;"></div>
      </section>
    </div>


--- a/doc/encoding.md
+++ b/doc/encoding.md
+---
+layout: page
+---
+
+# Encoding Spec
+
+## NOT FINALIZED
+
+The Cap'n Proto encoding is still evolving.  In fact, as of this writing, the format described
+by this spec is newer than what is actually implemented.
+
+## 64-bit Words
+
+For the purpose of Cap'n Proto, a "word" is defined as 8 bytes, or 64 bits.  Since alignment of
+data is important, all objects are aligned to word boundaries, and sizes are usually expressed in
+terms of words.
+
+## Messages
+
+The unit of communication in Cap'n Proto is a "message".  A message is a tree of objects, with
+the root always being a struct.
+
+Physically, messages may be split into several "segments", each of which is a flat blob of bytes.
+Typically, a segment must be loaded into a contiguous block of memory before it can be accessed,
+so that the relative pointers within the segment can be followed quickly.  However, when a message
+has multiple segments, it does not matter where those segments are located in memory relative to
+each other; inter-segment pointers are encoded differently, as we'll see later.
+
+Ideally, every message would have only one segment.  However, there are a few reasons why splitting
+a message into multiple segments may be convenient:
+
+* It can be difficult to predict how large a message might be until you start writing it, and you
+  can't start writing it until you have a segment to write to.  If it turns out the segment you
+  allocated isn't big enough, you can allocate additional segments without the need to relocate the
+  data you've already written.
+* Allocating excessively large blocks of memory can make life difficult for memory allocators,
+  especially on 32-bit systems with limited address space.
+
+The first word of the first segment of the message is always a pointer pointing to the message's
+root struct.
+
+Note that users of Cap'n Proto never need to understand segments; this is all taken care of
+automatically by the runtime library.
+
+## Built-in Types
+
+The built-in primitive types are encoded as follows:
+
+* `Void`:  Not encoded at all.  It has only one possible value and thus carries no information.
+* `Bool`:  One bit.  1 = true, 0 = false.
+* Integers:  Encoded in little-endian format.  Signed integers use two's complement.
+* Floating-points:  Encoded in little-endian IEEE-754 format.
+
+Primitive types must always be aligned to a multiple of their size.  Note that since the size of
+a `Bool` is one bit, this means eight `Bool` values can be encoded in a single byte -- this differs
+from C++, where the `bool` type takes a whole byte.
+
+The built-in blob types are encoded as follows:
+
+* `Data`:  Encoded as a pointer, identical to `List(UInt8)`.
+* `Text`:  Like `Data`, but the content must be valid UTF-8, the last byte of the content must be
+  zero, and no other byte of the content can be zero.
+
+## Enums
+
+Enums are encoded the same as 16-bit integers.
+
+## Lists
+
+A list value is encoded as a pointer to a flat array of values.
+
+    lsb                       list pointer                        msb
+    +-+-----------------------------+--+----------------------------+
+    |A|             B               |C |             D              |
+    +-+-----------------------------+--+----------------------------+
+
+    A (2 bits) = 01, to indicate that this is a list pointer.
+    B (30 bits) = Offset, in words, from the start of the pointer to the
+        start of the list.  Signed.
+    C (3 bits) = Size of each element:
+        0 = 0 (e.g. List(Void))
+        1 = 1 bit
+        2 = 1 byte
+        3 = 2 bytes
+        4 = 4 bytes
+        5 = 8 bytes (non-pointer)
+        6 = 8 bytes (pointer)
+        7 = composite (see below)
+    D (29 bits) = Number of elements in the list, except when C is 7
+        (see below).
+
+The pointed-to values are tightly-packed.  In particular, `Bool`s are packed bit-by-bit in
+little-endian order (the first bit is the least-significant bit of the first byte).
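As a reading aid, the list pointer layout above can be unpacked with a few shifts and masks. This is a minimal sketch, not code from the Cap'n Proto library: `ListPointer`, `decodeListPointer`, and `readBoolElement` are hypothetical names, and it assumes the fields are laid out starting from the least-significant bit of a 64-bit word already loaded in host order, with the tag `01` read as the value 1.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type for illustration only.
struct ListPointer {
    int32_t offsetWords;   // B: signed offset, in words
    uint8_t elementSize;   // C: element size code, 0..7
    uint32_t count;        // D: element count (total words when C == 7)
};

// Returns false if the low two bits (A) do not mark a list pointer.
inline bool decodeListPointer(uint64_t word, ListPointer& out) {
    if ((word & 0x3) != 1) return false;
    // The low 32 bits hold A and B. Shifting the signed value right by two
    // drops A and sign-extends B (arithmetic shift; guaranteed since C++20,
    // and the behavior of mainstream compilers before that).
    out.offsetWords = static_cast<int32_t>(static_cast<uint32_t>(word)) >> 2;
    out.elementSize = static_cast<uint8_t>((word >> 32) & 0x7);
    out.count = static_cast<uint32_t>(word >> 35);
    return true;
}

// Reads element `index` of a bit-packed List(Bool): the first element is
// the least-significant bit of the first byte.
inline bool readBoolElement(const uint8_t* content, size_t index) {
    return ((content[index / 8] >> (index % 8)) & 1) != 0;
}
```

For example, a word with A = 1, B = 5, C = 2, and D = 10 decodes to a list of ten one-byte elements located five words from the start of the pointer.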
+
+When C = 7, the elements of the list are fixed-width composite values -- usually, structs.  In
+this case, the list content is prefixed by a "tag" word that describes each individual element.
+The tag has the same layout as a struct pointer, except that the pointer offset (B) instead
+indicates the number of elements in the list.  Meanwhile, section (D) of the list pointer -- which
+normally would store this element count -- instead stores the total number of _words_ in the list
+(not counting the tag word).  The reason we store a word count in the pointer rather than an element
+count is to ensure that the extents of the list's location can always be determined by inspecting
+the pointer alone, without having to look at the tag; this may allow more-efficient prefetching in
+some use cases.  The reason we don't store struct lists as a list of pointers is because doing so
+would take significantly more space (an extra pointer per element) and may be less cache-friendly.
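To make the composite case concrete, here is a small sketch of reading the tag word and cross-checking it against the word count stored in the list pointer. The names `CompositeTag`, `decodeCompositeTag`, and `compositeSizesAgree` are hypothetical. The check assumes that each element occupies exactly its data words plus its pointer words, which the text implies but does not state outright, and it treats the element count as an unsigned 30-bit value.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration only.
struct CompositeTag {
    uint32_t elementCount;  // stored where a struct pointer keeps its offset (B)
    uint16_t dataWords;     // size of each element's data section, in words
    uint16_t pointerWords;  // size of each element's pointer section, in words
};

inline CompositeTag decodeCompositeTag(uint64_t tag) {
    CompositeTag t;
    t.elementCount = static_cast<uint32_t>(tag) >> 2;  // drop the 2-bit A field
    t.dataWords = static_cast<uint16_t>(tag >> 32);
    t.pointerWords = static_cast<uint16_t>(tag >> 48);
    return t;
}

// listWordCount is field D of the list pointer (words, excluding the tag).
inline bool compositeSizesAgree(uint64_t tag, uint32_t listWordCount) {
    CompositeTag t = decodeCompositeTag(tag);
    uint64_t wordsPerElement =
        static_cast<uint64_t>(t.dataWords) + t.pointerWords;
    return wordsPerElement * t.elementCount == listWordCount;
}
```

A reader could run a check like this before trusting the tag, since the pointer alone already bounds the memory the list may occupy.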
+
+In the future, we could consider implementing matrices using the "composite" element type, with the
+elements being fixed-size lists rather than structs.  In this case, the tag would look like a list
+pointer rather than a struct pointer.  As of this writing, no such feature has been implemented.
+
+## Structs
+
+A struct value is encoded as a pointer to its content.  The content is split into two sections:
+data and pointers, with the pointer section appearing immediately after the data section.  This
+split allows structs to be traversed (e.g., copied) without knowing their type.
+
+A struct pointer looks like this:
+
+    lsb                      struct pointer                       msb
+    +-+-----------------------------+---------------+---------------+
+    |A|             B               |       C       |       D       |
+    +-+-----------------------------+---------------+---------------+
+
+    A (2 bits) = 00, to indicate that this is a struct pointer.
+    B (30 bits) = Offset, in words, from the start of the pointer to the
+        start of the struct's data section.  Signed.
+    C (16 bits) = Size of the struct's data section, in words.
+    D (16 bits) = Size of the struct's pointer section, in words.
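The struct pointer layout above can be unpacked the same way as any other bit field. The following is a minimal sketch with hypothetical names (`StructPointer`, `decodeStructPointer`); it assumes the 64-bit word has already been loaded in host order and that field A occupies the two least-significant bits.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration only.
struct StructPointer {
    int32_t offsetWords;    // B: signed offset to the data section, in words
    uint16_t dataWords;     // C: size of the data section, in words
    uint16_t pointerWords;  // D: size of the pointer section, in words
};

// Returns false if the low two bits (A) are not 00.
inline bool decodeStructPointer(uint64_t word, StructPointer& out) {
    if ((word & 0x3) != 0) return false;
    // Arithmetic right shift sign-extends the 30-bit offset (guaranteed
    // since C++20; the behavior of mainstream compilers before that).
    out.offsetWords = static_cast<int32_t>(static_cast<uint32_t>(word)) >> 2;
    out.dataWords = static_cast<uint16_t>(word >> 32);
    out.pointerWords = static_cast<uint16_t>(word >> 48);
    return true;
}
```

Because the pointer section immediately follows the data section, the pointers of the target struct begin `dataWords` words after the location that `offsetWords` identifies.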
+
+### Field Positioning
+
+Ignoring unions, the layout of fields within the struct is determined by the following algorithm:
+
+    For each field of the struct, ordered by field number {
+        If the field is a pointer {
+            Add it to the end of the pointer section.
+        } else if the data section layout so far includes padding large
+                enough and properly-aligned to hold this field {
+            Replace the padding space with the new field, preferring to
+                put the field as close to the beginning of the section as
+                possible.
+        } else {
+            Add one word to the end of the data section.
+            Place the new field at the beginning of the new word.
+            Mark the rest of the new word as padding.
+        }
+    }
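The no-union algorithm above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the compiler's actual implementation: `FieldSpec`, `FieldSlot`, and `layoutFields` are hypothetical names, data field sizes are assumed to be 1, 8, 16, 32, or 64 bits, and padding is tracked with a simple per-bit bitmap rather than anything more efficient.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one field, given in field-number order.
struct FieldSpec {
    bool isPointer;
    uint32_t sizeBits;  // 1, 8, 16, 32, or 64 for data fields; 0 for Void
};

struct FieldSlot {
    bool inPointerSection;
    uint32_t offset;  // pointer index, or bit offset within the data section
};

std::vector<FieldSlot> layoutFields(const std::vector<FieldSpec>& fields) {
    std::vector<FieldSlot> slots;
    std::vector<bool> used;  // one flag per bit of the data section so far
    uint32_t pointerCount = 0;

    auto rangeFree = [&](uint32_t start, uint32_t len) {
        for (uint32_t b = start; b < start + len; ++b)
            if (used[b]) return false;
        return true;
    };
    auto markUsed = [&](uint32_t start, uint32_t len) {
        for (uint32_t b = start; b < start + len; ++b) used[b] = true;
    };

    for (const FieldSpec& f : fields) {
        if (f.isPointer) {  // append to the pointer section
            slots.push_back({true, pointerCount++});
            continue;
        }
        if (f.sizeBits == 0) {  // Void occupies no space
            slots.push_back({false, 0});
            continue;
        }
        const uint32_t s = f.sizeBits;
        bool placed = false;
        // Try each properly aligned position, lowest offset first, so the
        // field lands as close to the beginning of the section as possible.
        for (uint32_t start = 0; start + s <= used.size(); start += s) {
            if (rangeFree(start, s)) {
                markUsed(start, s);
                slots.push_back({false, start});
                placed = true;
                break;
            }
        }
        if (!placed) {  // no suitable padding: grow the section by one word
            uint32_t start = static_cast<uint32_t>(used.size());
            used.resize(used.size() + 64, false);
            markUsed(start, s);
            slots.push_back({false, start});
        }
    }
    return slots;
}
```

For the field sequence `UInt8`, `UInt32`, `Bool`, `UInt16`, pointer, `UInt64`, `Bool`, this sketch yields data bit offsets 0, 32, 8, 16, then pointer index 0, then data bit offsets 64 and 9: the later small fields fill the padding left behind in the first word.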
+
+Keep in mind that `Bool` fields are bit-aligned, so multiple booleans will be packed into a
+single byte.  As always, little-endian ordering is the standard -- the first boolean will be
+located at the least-significant bit of its byte.
+
+When unions are present, add the following logic:
+
+    For each field and union of the struct, ordered by field number {
+        If this is a union, not a field {
+            Treat it like a 16-bit field, representing the union tag.
+                (See no-union logic, above.)
+        } else if this field is a member of a union {
+            If an earlier member of the union is in the same section as
+                    this field and it combined with any following padding
+                    is at least as large as the new field {
+                Give the new field the same offset, so they overlap.
+            } else {
+                Assign a new offset to this field as if it were not a union
+                    member at all.  (See no-union logic, above.)
+            }
+        } else {
+            Treat it as a regular field.  (See no-union logic, above.)
+        }
+    }
+
+Note that in the worst case, the members of a union could end up using 23 bytes plus one bit (one
+pointer plus data section locations of 64, 32, 16, 8, and 1 bits).  This is an unfortunate side
+effect of the desire to pack fields in the smallest space where they will fit and the need to
+maintain backwards-compatibility as fields are added.  The worst case should be rare in practice.
+
+### Default Values
+
+A default struct is always all-zeros.  To achieve this, fields in the data section are stored xor'd
+with their defined default values.  An all-zero pointer is considered "null" (since otherwise it
+would point at itself, which makes no sense); accessor methods for pointer fields check for null
+and return a pointer to their default value in this case.
+
+There are several reasons why this is desirable:
+
+* Cap'n Proto messages are often "packed" with a simple compression algorithm that deflates
+  zero-value bytes.
+* Newly-allocated structs only need to be zero-initialized, which is fast and requires no knowledge
+  of the struct type except its size.
+* If a newly-added field is placed in space that was previously padding, messages written by old
+  binaries that do not know about this field will still have its default value set correctly --
+  because it is always zero.
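The XOR convention for data-section fields described above amounts to a single operation in each direction. The sketch below uses hypothetical helper names (`encodeField`, `decodeField`) and a `uint32_t` field purely for illustration.

```cpp
#include <cstdint>

// A field is stored XOR'd with its default, so a field that holds its
// default value is stored as all zeros.
inline uint32_t encodeField(uint32_t value, uint32_t defaultValue) {
    return value ^ defaultValue;
}

// XOR is its own inverse, so decoding applies the same operation.
inline uint32_t decodeField(uint32_t stored, uint32_t defaultValue) {
    return stored ^ defaultValue;
}
```

With a default of 42, the value 42 is stored as 0, and a zero-initialized struct reads back as 42 without the writer ever having touched the field.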
+
+## Inter-Segment Pointers
+
+When a pointer needs to point to a different segment, offsets no longer work.  We instead encode
+the pointer as a "far pointer", which looks like this:
+
+    lsb                        far pointer                        msb
+    +-+-----------------------------+-------------------------------+
+    |A|             B               |               C               |
+    +-+-----------------------------+-------------------------------+
+
+    A (2 bits) = 10 (binary; decimal 2), to indicate that this is a far pointer.
+    B (30 bits) = Offset, in words, from the start of the target segment
+        to the location of the far-pointer landing-pad within that
+        segment.
+    C (32 bits) = ID of the target segment.  (Segments are numbered
+        sequentially starting from zero.)
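A far pointer can be unpacked like the other pointer kinds. This is a minimal sketch with hypothetical names (`FarPointer`, `decodeFarPointer`); it reads the tag in field A as the value 2 (binary 10) and treats the offset as unsigned, on the assumption that an offset measured from the start of a segment is never negative.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration only.
struct FarPointer {
    uint32_t landingPadOffsetWords;  // B: offset from the start of the target segment
    uint32_t segmentId;              // C: ID of the target segment
};

// Returns false if the low two bits (A) do not mark a far pointer.
inline bool decodeFarPointer(uint64_t word, FarPointer& out) {
    if ((word & 0x3) != 2) return false;
    out.landingPadOffsetWords = static_cast<uint32_t>(word) >> 2;
    out.segmentId = static_cast<uint32_t>(word >> 32);
    return true;
}
```

The caller would then look up the segment by ID and read the landing pad at the given word offset within it.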
+
+The "landing pad" of a far pointer is normally just another pointer, which in turn points to the
+actual object.
+
+However, if the "landing pad" pointer is itself another far pointer, then it is interpreted
+differently:  This far pointer points to the start of the object's _content_, located in some other
+segment.  The landing pad is itself immediately followed by a tag word.  The tag word looks exactly
+like an intra-segment pointer to the target object would look, except that the offset is always
+zero.
+
+The reason for the convoluted double-far convention is to make it possible to form a new pointer
+to an object in a segment that is full.  If you can't allocate even one word in the segment where
+the target resides, then you will need to allocate a landing pad in some other segment, and use
+this double-far approach.  This should be exceedingly rare in practice since pointers are normally
+set to point to _new_ objects.
--- a/doc/index.md
+++ b/doc/index.md
@@ -24,7 +24,7 @@ embedded as pointers. Pointers are offset-based rather than absolute so that mes
 position-independent. Integers use little-endian byte order because most CPUs are little-endian,
 and even big-endian CPUs usually have instructions for reading little-endian data.

-**_Doesn't that back backwards-compatibility hard?_**
+**_Doesn't that make backwards-compatibility hard?_**

 Not at all! New fields are always added to the end of a struct (or replace padding space), so
 existing field positions are unchanged. The recipient simply needs to do a bounds check when
@@ -34,7 +34,7 @@ always knows how to arrange them for backwards-compatibility.
 **_Won't fixed-width integers, unset optional fields, and padding waste space on the wire?_**

 Yes. However, since all these extra bytes are zeros, when bandwidth matters, we can apply an
-extremely fast compression scheme to remove them. Cap'n Proto calls this "packing"; the message,
+extremely fast compression scheme to remove them. Cap'n Proto calls this "packing" the message;
 it achieves similar (better, even) message sizes to protobuf encoding, and it's still faster.

 When bandwidth really matters, you should apply general-purpose compression, like
@@ -59,10 +59,10 @@ Glad you asked!
  process can be just as fast and easy as calling another thread.
 * **Arena allocation:** Manipulating Protobuf objects tends to be bogged down by memory
  allocation, unless you are very careful about object reuse. Cap'n Proto objects are always
-  allocated in an "arena"; or "region"; style, which is faster and promotes cache locality.
+  allocated in an "arena" or "region" style, which is faster and promotes cache locality.
 * **Tiny generated code:** Protobuf generates dedicated parsing and serialization code for every
  message type, and this code tends to be enormous. Cap'n Proto generated code is smaller by an
-  order of magnitude or more.
+  order of magnitude or more.  In fact, usually it's no more than some inline accessor methods!
 * **Tiny runtime library:** Due to the simplicity of the Cap'n Proto format, the runtime library
  can be much smaller.

@@ -73,16 +73,4 @@ version 2, which is the version that Google released open source. Cap'n Proto is
 years of experience working on Protobufs, listening to user feedback, and thinking about how
 things could be done better.

-I am no longer employed by Google. Cap'n Proto is not affiliated with Google or any other company.
-
-**_Tell me about the RPC system._**
-
-_As of this writing, the RPC system is not yet implemented._
-
-Cap'n Proto defines a [capability-based](http://en.wikipedia.org/wiki/Capability-based_security)
-RPC protocol. In such a system, any message passed over the wire can itself contain references to
-callable objects. Passing such a reference over the wire implies granting the recipient permission
-to call the referenced object -- until a reference is sent, the recipient has no way of addressing
-it in order to form a request to it, or even knowing that it exists.
-
-Such a system makes it very easy to define stateful, secure object-oriented protocols.
+I no longer work for Google. Cap'n Proto is not affiliated with Google or any other company.
--- a/doc/install.md
+++ b/doc/install.md
@@ -15,8 +15,9 @@ many essential features:
 * **Stability:** The Cap'n Proto format is still changing. Any data written today probably won't
  be understood by future versions. Additionally, the programming interface is still evolving, so
  code written today probably won't work with future versions.
-* **Performance:** While already beating the pants off other systems, Cap'n Proto has not yet
-  undergone serious profiling and optimization.
+* **Performance:** While Cap'n Proto is inherently fast by design, the implementation has not yet
+  undergone serious profiling and optimization.  Currently it only beats Protobufs in realistic-ish
+  end-to-end benchmarks by, like, 2x-5x.  We can do better.
 * **RPC:** The RPC protocol has not yet been specified, much less implemented.
 * **Support for languages other than C++:** Hasn't been started yet.

@@ -56,8 +57,8 @@ code without instructions.  It also supports continuous builds, where it watches
 changes (via inotify) and immediately rebuilds as necessary.  Instant feedback is key to
 productivity, so I really like using Ekam.

-Unfortunately it's very much unfinished.  It works (for me), but it is very quirky.  It only works
-on Linux, and is best used together with Eclipse.
+Unfortunately it's very much unfinished.  It works (for me), but it is quirky and rough around the
+edges.  It only works on Linux, and is best used together with Eclipse.

 The Cap'n Proto repo includes a script which will attempt to set up Ekam for you.

@@ -65,8 +66,8 @@ The Cap'n Proto repo includes a script which will attempt to set up Ekam for you
    cd capnproto/c++
    ./setup-ekam.sh

-If all goes well, this downloads the Ekam code into `.ekam` and adds some symlinks under src.
-It also imports the [Google Test](https://googletest.googlecode.com) and
+If all goes well, this downloads the Ekam code into a directory called `.ekam` and adds some
+symlinks under src.  It also imports the [Google Test](https://googletest.googlecode.com) and
 [Protobuf](http://protobuf.googlecode.com) source code, so you can compile tests and benchmarks.

 Once Ekam is installed, you can do:

--- a/doc/otherlang.md
+++ b/doc/otherlang.md
+---
+layout: page
+---
+
+# Other Languages
+
+Currently, Cap'n Proto is implemented only in C++.  We'd like to support many more languages in
+the future!
+
+If you'd like to own the implementation of Cap'n Proto in some particular language,
+[let us know](https://groups.google.com/group/capnproto)!
--- a/doc/rpc.md
+++ b/doc/rpc.md
+---
+layout: page
+---
+
+# RPC Protocol
+
+The Cap'n Proto RPC protocol is not yet defined.  See the language spec's
+[section on interfaces](language.html#interfaces) for a hint of what it will do.
+
+Here are some misc planned / hoped-for features:
+
+* **Shared memory IPC:**  When instructed to communicate over a Unix domain socket, Cap'n Proto may
+  automatically negotiate to use shared memory, by creating a temporary file and then sending a
+  file descriptor across the socket.  Once messages are being allocated in shared memory, RPCs
+  can be initiated by merely signaling a [futex](http://man7.org/linux/man-pages/man2/futex.2.html)
+  (on Linux, at least), which ought to be ridiculously fast.