Correct interpretation of utf-8 0xf8-0xff

While consuming this useful string utility, we discovered that its interpretation of lead bytes 0xf8-0xff conforms to neither the RFC 3629 nor the ISO/IEC 10646 definition of UTF-8.

RFC 3629 describes only 1- to 4-byte encodings (and only a limited number of 4-byte encodings at that), and plainly states in Section 1, Introduction:

   o  The octet values C0, C1, F5 to FF never appear.

Alternatively, the ISO definition "R.2 Specification of UTF-8", presented in the original IETF RFC 2279, clearly defines the meaning of lead byte values F5 through FD, and RFC 3629 Section 10 (Security), paragraph 3, calls out this alternate reading (the alternative to "never appear"). Under that reading, F5-F7 begin a 4-byte UTF-8 sequence (like F0-F4) that is invalid in the domain of Unicode code points, F8-FB begin a 5-byte sequence, and FC-FD begin a 6-byte sequence.

The current code is wrong in that it neither treats F8-FF as invalid 1-byte characters nor assigns F8-FD the correct number of bytes. No valid parser advances 4 bytes past these lead bytes. Most treat them as the start of a 5- or 6-byte encoding of a UTF-32 value and may then reject the resulting character as invalid, while some reject every lead byte F5-FF as a single byte of erroneous input, followed by each invalid continuation byte.

We propose the conventional reading of F8-FD as 5- and 6-byte sequences as originally defined, while FE-FF must be read as single-byte invalid code points.

Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>

Commit 961c0e6b authored Nov 04, 2019 by William A Rowe Jr Committed by Adam Cozzette Dec 10, 2019

Showing 1 changed file with 1 addition and 1 deletion

strutil.cc src/google/protobuf/stubs/strutil.cc +1 -1

--- a/src/google/protobuf/stubs/strutil.cc
+++ b/src/google/protobuf/stubs/strutil.cc
@@ -2292,7 +2292,7 @@ static const unsigned char kUTF8LenTbl[256] = {
  1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,
-  3,3,3,3,3,3,3,3, 3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4, 4,4,4,4,4,4,4,4
+  3,3,3,3,3,3,3,3, 3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4, 5,5,5,5,6,6,1,1
 };

 // Return length of a single UTF-8 source character
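
The new last row of the table can be cross-checked against the lead-byte bit patterns of the original ISO/IEC 10646 / RFC 2279 reading. The following is a minimal sketch, not code from strutil.cc: `LeadByteLen` and `CountSequences` are hypothetical names introduced only for illustration, and the sketch assumes ASCII bytes map to length 1 (those table rows are not shown in the diff). It looks only at lead bytes and does not validate continuation bytes or reject code points outside the Unicode range.

```cpp
#include <cstddef>

// Hypothetical helper (not part of strutil.cc): sequence length implied by a
// lead byte under the original 1- to 6-byte UTF-8 bit patterns.
constexpr int LeadByteLen(unsigned char b) {
  return b < 0xC0 ? 1   // 00-7F ASCII, 80-BF stray continuation byte
       : b < 0xE0 ? 2   // 110xxxxx
       : b < 0xF0 ? 3   // 1110xxxx
       : b < 0xF8 ? 4   // 11110xxx (F5-F7 exceed the Unicode range)
       : b < 0xFC ? 5   // 111110xx
       : b < 0xFE ? 6   // 1111110x
       : 1;             // FE, FF: never a lead byte, skip one byte
}

// Compile-time checks against the patched table row for 0xF0-0xFF.
static_assert(LeadByteLen(0xF7) == 4, "F0-F7 begin 4-byte sequences");
static_assert(LeadByteLen(0xF8) == 5, "F8-FB begin 5-byte sequences");
static_assert(LeadByteLen(0xFB) == 5, "F8-FB begin 5-byte sequences");
static_assert(LeadByteLen(0xFC) == 6, "FC-FD begin 6-byte sequences");
static_assert(LeadByteLen(0xFD) == 6, "FC-FD begin 6-byte sequences");
static_assert(LeadByteLen(0xFE) == 1, "FE is a single invalid byte");
static_assert(LeadByteLen(0xFF) == 1, "FF is a single invalid byte");

// Hypothetical walker: counts how many sequences a buffer contains when the
// cursor advances by the lead-byte length, clamped to the end of the buffer
// so a truncated sequence cannot run past the input.
inline std::size_t CountSequences(const unsigned char* p, std::size_t n) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < n) {
    const std::size_t len = static_cast<std::size_t>(LeadByteLen(p[i]));
    const std::size_t remaining = n - i;
    i += (len <= remaining) ? len : remaining;
    ++count;
  }
  return count;
}
```

With these lengths, the six bytes F8 88 80 80 80 41 are walked as two sequences: the 5-byte sequence and then 'A'. With the previous value of 4 for F8, the cursor would instead land on the final continuation byte and count three.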