Files · 961c0e6b86897c5d9ba9bd3b0b8a636b6b50a2f2 · submodule / protobuf

Correct interpretation of utf-8 0xf8-0xff · 961c0e6b

William A Rowe Jr authored Nov 04, 2019

While consuming this useful string utility, we discovered
that its interpretation of the lead bytes 0xF8-0xFF conforms
to neither the RFC 3629 nor the ISO/IEC 10646 definition of
UTF-8.

RFC 3629 describes only 1- to 4-byte encodings (and only a
limited number of 4-byte encodings at that), and plainly
states in Section 1, Introduction:
   o  The octet values C0, C1, F5 to FF never appear.

Alternatively, the ISO definition "R.2 Specification of UTF-8"
presented in the original IETF RFC 2279 clearly defines the
meaning of lead byte values F5 through FD, and RFC 3629
Section 10 (Security Considerations), paragraph 3, calls out
this alternate reading (the alternative to "never appear").
F5-F7 begin a 4-byte UTF-8 sequence (like F0-F4) that is
invalid in the domain of Unicode code points, while F8-FB
begin a 5-byte sequence and FC-FD begin a 6-byte sequence.
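
The RFC 2279 layout can be summarized by one rule: the number
of leading 1 bits in the first octet equals the number of
octets in the sequence. The following is a minimal Python
sketch of that rule, not the protobuf C++ code; the function
name rfc2279_sequence_length is hypothetical and chosen only
for illustration.

```python
from typing import Optional


def rfc2279_sequence_length(lead: int) -> Optional[int]:
    """Length announced by a lead byte under the original RFC 2279 layout.

    Returns None for bytes that cannot start a sequence: continuation
    bytes (0x80-0xBF) and the never-defined values 0xFE and 0xFF.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")

    # Count the leading 1 bits: invert the byte and see how many low
    # bits are needed to hold the result.
    leading_ones = 8 - (lead ^ 0xFF).bit_length()

    if leading_ones == 0:
        return 1                 # 0xxxxxxx: single-byte (ASCII)
    if 2 <= leading_ones <= 6:
        return leading_ones      # 110xxxxx .. 1111110x: 2 to 6 bytes
    return None                  # 10xxxxxx, 0xFE, 0xFF
```

Under this rule F5-F7 still announce 4 bytes, F8-FB announce
5, and FC-FD announce 6; whether the decoded value is a valid
Unicode code point is a separate check.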

The current code is wrong in that it neither treats the codes
F8-FF as invalid 1-byte characters nor assigns the codes
F8-FD their correct number of bytes. No valid parser will
advance 4 bytes past these lead bytes. Most will treat them
as the start of a 5- or 6-byte encoding of a 32-bit (UCS-4)
character and may then treat the resulting character as
invalid, while some parsers may reject every lead byte F5-FF
as a single byte of erroneous input, followed by each invalid
continuation byte.
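
As one point of reference for the strict behavior described
above, CPython 3's built-in UTF-8 decoder rejects an F8 lead
byte on its own and then rejects each following continuation
byte individually. This is only an illustration of how one
widely used parser behaves, not a statement about protobuf;
the variable names are arbitrary.

```python
# A sequence that the original RFC 2279 layout would read as one
# 5-byte character: lead byte F8 plus four continuation bytes.
data = b"\xf8\x88\x80\x80\x80"

# With errors="replace", every rejected byte becomes U+FFFD.
decoded = data.decode("utf-8", errors="replace")
replacement_count = decoded.count("\ufffd")

# In strict mode the decoder reports the span of the first bad input.
try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = (exc.start, exc.end)
```

Here the five input bytes yield five replacement characters,
and the strict error covers only the single F8 byte.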

We propose the conventional reading of F8-FD as 5- and 6-byte
sequences as originally defined, while FE-FF must be read
as invalid single-byte code points.

Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>
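
The proposed reading can be sketched as a stepping loop over a
byte string. This is a hedged Python illustration, not the
patched C++ code: the names proposed_num_bytes and
split_by_lead_byte are hypothetical, and advancing one byte
past a stray continuation byte (0x80-0xBF) is an assumption
made here so that the loop always makes progress.

```python
from typing import List


def proposed_num_bytes(lead: int) -> int:
    """Bytes to advance for a given lead byte under the proposed reading."""
    if lead < 0xC0:
        return 1   # ASCII; stray continuation bytes step by one (assumption)
    if lead < 0xE0:
        return 2   # C0-DF
    if lead < 0xF0:
        return 3   # E0-EF
    if lead < 0xF8:
        return 4   # F0-F7
    if lead < 0xFC:
        return 5   # F8-FB
    if lead < 0xFE:
        return 6   # FC-FD
    return 1       # FE, FF: single invalid bytes


def split_by_lead_byte(data: bytes) -> List[bytes]:
    """Split data into chunks whose lengths are taken from each lead byte."""
    chunks = []
    i = 0
    while i < len(data):
        n = proposed_num_bytes(data[i])
        chunks.append(data[i:i + n])  # the final chunk may be truncated
        i += n
    return chunks
```

For example, split_by_lead_byte(b"a\xf8\x88\x80\x80\x80b\xffc")
keeps the F8 sequence together as one 5-byte chunk and isolates
FF as a 1-byte chunk, so the ASCII bytes around them stay
aligned. A 4-byte step from F8 would instead land on the last
continuation byte.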

Name
.github
benchmarks
cmake
conformance
csharp
docs
editors
examples
java
js
kokoro
m4
objectivec
php
protoc-artifacts
python
ruby
src
third_party
util/python
.gitignore
.gitmodules
BUILD
CHANGES.txt
CONTRIBUTING.md
CONTRIBUTORS.txt
LICENSE
Makefile.am
Protobuf-C++.podspec
Protobuf.podspec
README.md
WORKSPACE
appveyor.bat
appveyor.yml
autogen.sh
build_files_updated_unittest.sh
compiler_config_setting.bzl
composer.json
configure.ac
generate_changelog.py
generate_descriptor_proto.sh
global.json
post_process_dist.sh
protobuf-lite.pc.in
protobuf.bzl
protobuf.pc.in
protobuf_deps.bzl
tests.sh
update_compatibility_version.py
update_file_lists.sh
update_version.py

README.md