Correct interpretation of utf-8 0xf8-0xff
While consuming this useful string utility, we discovered that its interpretation of the lead bytes 0xf8-0xff conforms to neither the RFC 3629 nor the ISO/IEC 10646 definition of utf-8.

The IETF RFC 3629 describes only 1-4 byte encodings (and only a limited number of 4-byte encodings at that), and plainly states in Section 1, Introduction:

  o The octet values C0, C1, F5 to FF never appear.

Alternatively, the ISO definition "R.2 Specification of UTF-8", presented in the original IETF RFC 2279, clearly defines the meaning of lead byte values F5 through FD, and RFC 3629 Section 10, Security Considerations, paragraph 3 calls out this alternate reading (the alternative to "never appear"). F5-F7 begin a 4-byte UTF-8 sequence (similar to F0-F4) that is invalid in the domain of Unicode code points, while F8-FB begin a 5-byte sequence and FC-FD begin a 6-byte sequence.

The current code is wrong in that it neither treats the codes F8-FF as invalid 1-byte characters nor treats the codes F8-FD as introducing the correct number of bytes. No valid parser will land 4 bytes forward of these lead bytes. Most will treat them as the start of a 5- or 6-byte encoding of a utf-32 character and may then reject the resulting character as invalid, while some parsers may reject every lead byte F5-FF as a single byte of erroneous input, followed by each invalid continuation byte.

We propose the conventional reading of F8-FD as 5- and 6-byte sequences as originally defined, while FE-FF must be read as single-byte invalid code points.

Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>
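The proposed reading of lead bytes can be sketched as a small lookup. This is an illustrative Python sketch only, not the code changed by this commit; the function name `utf8_sequence_length` is hypothetical, and treating a stray continuation byte (80-BF) as a one-byte invalid unit is an assumption made here for completeness.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a parser should consume for a given lead byte.

    Follows the original RFC 2279 / ISO 10646 layout described above.
    Hypothetical helper for illustration; validity checks happen elsewhere.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")
    if lead < 0x80:
        return 1  # 00-7F: ASCII
    if lead < 0xC0:
        return 1  # 80-BF: stray continuation byte (assumed: skip one byte)
    if lead < 0xE0:
        return 2  # C0-DF: 2-byte shape (C0, C1 are overlong, still 2 bytes)
    if lead < 0xF0:
        return 3  # E0-EF: 3-byte sequence
    if lead < 0xF8:
        return 4  # F0-F7: 4-byte sequence (F5-F7 exceed U+10FFFF)
    if lead < 0xFC:
        return 5  # F8-FB: 5-byte sequence (RFC 2279 only)
    if lead < 0xFE:
        return 6  # FC-FD: 6-byte sequence (RFC 2279 only)
    return 1      # FE-FF: never valid, consume as a single invalid byte
```

A caller would still reject anything RFC 3629 forbids (C0, C1, F5-FF) after skipping the indicated number of bytes; the point of the table is only that the parser resynchronizes at the right offset.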