    Correct interpretation of utf-8 0xf8-0xff · 961c0e6b
    William A Rowe Jr authored
    In consuming this useful string utility, it was discovered
    that the interpretation of leading byte codes 0xf8-0xff
    conformed to neither the RFC 3629 nor the ISO/IEC 10646
    definition of utf-8.
    
    The IETF RFC describes only 1-4 byte encodings (a limited
    number of 4 byte encodings at that), and plainly states in
    section 1. Introduction:
       o  The octet values C0, C1, F5 to FF never appear.
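
As a minimal sketch of the strict RFC 3629 reading quoted above, the check below flags the octets that can never appear in well-formed UTF-8. The name `IsForbiddenUtf8Octet` is hypothetical and is not taken from the code this commit changes.

```cpp
#include <cstdint>

// Hypothetical helper (not from this commit): true for the octets that
// RFC 3629 says never appear in UTF-8, namely C0, C1, and F5 to FF.
constexpr bool IsForbiddenUtf8Octet(std::uint8_t octet) {
  return octet == 0xC0 || octet == 0xC1 || octet >= 0xF5;
}

// Compile-time spot checks at the range boundaries.
static_assert(IsForbiddenUtf8Octet(0xC0), "C0 never appears");
static_assert(!IsForbiddenUtf8Octet(0xC2), "C2 is a valid 2-byte lead");
static_assert(!IsForbiddenUtf8Octet(0xF4), "F4 is the last valid 4-byte lead");
static_assert(IsForbiddenUtf8Octet(0xF5), "F5 never appears");
```

A parser that follows this reading would reject each such octet as a single byte of invalid input.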
    
    Alternatively, the ISO definition "R.2 Specification of UTF-8"
    presented in the original IETF RFC 2279 clearly defines the
    meaning of leading byte values F5 through FD, and RFC 3629
    Section 10. Security paragraph 3 calls out this alternate
    reading (alternative to "never appear".) F5-F7 begin an
    invalid (in the domain of unicode code points) 4-byte UTF-8
    sequence (similar to F0-F4), while F8-FB begin a 5-byte
    sequence, and FC and FD begin a 6-byte sequence.
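
The lead-byte classes described in this paragraph follow from the RFC 2279 bit patterns, where the number of leading 1 bits gives the sequence length. The sketch below illustrates that mapping; `Rfc2279SequenceLength` is a hypothetical name, and returning 0 for bytes that cannot start a sequence is an assumption of this example rather than the behaviour of the utility being patched.

```cpp
#include <cstdint>

// Hypothetical helper (not from this commit): sequence length implied by a
// lead byte under the original RFC 2279 bit patterns. Returns 0 when the
// byte cannot start a sequence (continuation bytes 80-BF, and FE/FF).
constexpr int Rfc2279SequenceLength(std::uint8_t lead) {
  return lead < 0x80 ? 1   // 0xxxxxxx
       : lead < 0xC0 ? 0   // 10xxxxxx: continuation byte, not a lead
       : lead < 0xE0 ? 2   // 110xxxxx
       : lead < 0xF0 ? 3   // 1110xxxx
       : lead < 0xF8 ? 4   // 11110xxx (F5-F7 encode values above U+10FFFF)
       : lead < 0xFC ? 5   // 111110xx: F8-FB
       : lead < 0xFE ? 6   // 1111110x: FC-FD
       : 0;                // FE, FF: no defined meaning
}
```

This only classifies the lead byte. It does not check that the following bytes are continuation bytes or that the decoded value is a valid code point.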
    
    The current code is wrong in that it doesn't treat the codes
    F8-FF as invalid 1-byte characters, nor does it assign the
    codes F8-FD the correct number of bytes. No valid parser
    will advance 4 bytes past these lead bytes. Most will
    treat these as the 5 or 6 byte utf-32 character and may then
    treat the resulting character as invalid, while some parsers
    may reject all leading F5-FF characters as a single byte of
    erroneous input, followed by each invalid continuation byte.
    
    We propose the conventional reading of F8-FD as 5 and 6 byte
    sequences as originally defined, while FE-FF must be read
    as single byte invalid code points.
    Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
    Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>
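
The reading proposed in the commit message (F8-FD as 5 and 6 byte sequences, FE-FF as single invalid bytes) can be sketched as a step-length function plus a scan loop. Both `Utf8StepLength` and `CountUtf8Sequences` are hypothetical names for this illustration and are not the functions the commit modifies. Treating a stray continuation byte as a 1-byte step is an additional assumption, made so the scan always advances.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from this commit): number of bytes to step over
// for a given first byte, using the reading proposed in the commit message.
inline std::size_t Utf8StepLength(std::uint8_t first) {
  if (first < 0xC0) return 1;  // ASCII, or a stray continuation byte (assumption)
  if (first < 0xE0) return 2;
  if (first < 0xF0) return 3;
  if (first < 0xF8) return 4;
  if (first < 0xFC) return 5;  // F8-FB
  if (first < 0xFE) return 6;  // FC-FD
  return 1;                    // FE, FF: single invalid byte
}

// Counts sequences by stepping from lead byte to lead byte. Continuation
// bytes are not validated, and a truncated final sequence counts as one.
inline std::size_t CountUtf8Sequences(const unsigned char* buf,
                                      std::size_t len) {
  std::size_t count = 0;
  std::size_t pos = 0;
  while (pos < len) {
    pos += Utf8StepLength(buf[pos]);
    ++count;
  }
  return count;
}
```

With this mapping, a buffer holding `A`, a 5-byte sequence led by F8, and `B` is counted as three sequences, whereas a 4-byte step from F8 would land on the last continuation byte instead of on `B`.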
Name
.github
benchmarks
cmake
conformance
csharp
docs
editors
examples
java
js
kokoro
m4
objectivec
php
protoc-artifacts
python
ruby
src
third_party
util/python
.gitignore
.gitmodules
BUILD
CHANGES.txt
CONTRIBUTING.md
CONTRIBUTORS.txt
LICENSE
Makefile.am
Protobuf-C++.podspec
Protobuf.podspec
README.md
WORKSPACE
appveyor.bat
appveyor.yml
autogen.sh
build_files_updated_unittest.sh
compiler_config_setting.bzl
composer.json
configure.ac
generate_changelog.py
generate_descriptor_proto.sh
global.json
post_process_dist.sh
protobuf-lite.pc.in
protobuf.bzl
protobuf.pc.in
protobuf_deps.bzl
tests.sh
update_compatibility_version.py
update_file_lists.sh
update_version.py