src · 961c0e6b86897c5d9ba9bd3b0b8a636b6b50a2f2 · submodule / protobuf

Correct interpretation of utf-8 0xf8-0xff · 961c0e6b

William A Rowe Jr authored Nov 04, 2019

While consuming this useful string utility, we discovered
that its interpretation of the lead bytes 0xf8-0xff
conforms to neither the RFC 3629 nor the ISO/IEC 10646
definition of UTF-8.

The IETF RFC describes only 1- to 4-byte encodings (and only
a limited number of 4-byte encodings at that), and plainly
states in Section 1, Introduction:
   o  The octet values C0, C1, F5 to FF never appear.
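
As an illustration of the strict RFC 3629 reading quoted above, a check
for octets that may never appear could look like the sketch below. The
name IsForbiddenUtf8Octet is hypothetical and is not part of protobuf.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from protobuf: returns true for octets
// that RFC 3629 says never appear in UTF-8 (C0, C1, and F5 through FF).
constexpr bool IsForbiddenUtf8Octet(std::uint8_t octet) {
  return octet == 0xC0 || octet == 0xC1 || octet >= 0xF5;
}
```

A parser taking this reading would reject each such octet as a single
byte of invalid input.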

Alternatively, the ISO definition "R.2 Specification of UTF-8"
presented in the original IETF RFC 2279 clearly defines the
meaning of lead byte values F5 through FD, and RFC 3629
Section 10. Security paragraph 3 calls out this alternate
reading (alternative to "never appear"). F5-F7 begin a 4-byte
UTF-8 sequence (similar to F0-F4) that is invalid in the
domain of Unicode code points, while F8-FB begin a 5-byte
sequence and FC-FD begin a 6-byte sequence.
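
The RFC 2279 reading can be sketched as a lookup from lead byte to
sequence length, following that RFC's bit patterns (111110xx for 5
bytes, 1111110x for 6 bytes). This is a minimal sketch, not the code
touched by this commit: the name Utf8SequenceLength is hypothetical,
and returning 1 for continuation bytes and for FE/FF is an assumption
about how a caller would want to skip invalid input.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from protobuf: number of bytes implied
// by a lead byte under the RFC 2279 table. Bytes that cannot start a
// sequence (80-BF continuation bytes, FE, FF) yield 1 so that a caller
// advances past exactly one invalid byte.
constexpr int Utf8SequenceLength(std::uint8_t lead) {
  if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
  if (lead < 0xC0) return 1;  // 10xxxxxx: continuation byte, not a lead
  if (lead < 0xE0) return 2;  // 110xxxxx
  if (lead < 0xF0) return 3;  // 1110xxxx
  if (lead < 0xF8) return 4;  // 11110xxx: F0-F7
  if (lead < 0xFC) return 5;  // 111110xx: F8-FB
  if (lead < 0xFE) return 6;  // 1111110x: FC-FD
  return 1;                   // FE, FF: never valid
}
```

Note that this only reports the length the lead byte announces; whether
the decoded value is a valid Unicode code point is a separate check.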

The current code is wrong in that it neither treats the codes
F8-FF as invalid 1-byte characters nor assigns the codes
F8-FD the correct number of bytes. No valid parser will
advance 4 bytes past these lead bytes. Most will treat them
as the start of a 5- or 6-byte encoding of a 32-bit character
and may then treat the resulting character as invalid, while
some parsers may reject every lead byte F5-FF as a single byte
of erroneous input, followed by each invalid continuation byte.

We propose the conventional reading of F8-FD as 5- and 6-byte
sequences as originally defined, while FE-FF must be read
as single-byte invalid code points.

Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>


Name
google/protobuf
solaris
Makefile.am
README.md
libprotobuf-lite.map
libprotobuf.map
libprotoc.map
