The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26 [1]. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Each CESU-8 byte sequence (1, 2, or 3 bytes) corresponds to exactly one 16-bit UTF-16 code unit. CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.
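The encoding rule described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name cesu8_encode is hypothetical, and the sketch relies on Python's "surrogatepass" error handler to emit the three-byte UTF-8 form of a surrogate code point.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a str as CESU-8 (illustrative sketch; name is hypothetical)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: same bytes as UTF-8 (1, 2, or 3 bytes).
            # "surrogatepass" is an assumption of this sketch: it lets a
            # lone surrogate through instead of raising an error.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary character: form the UTF-16 surrogate pair,
            # then encode each surrogate as a 3-byte UTF-8 sequence.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


print(cesu8_encode("\U00010400").hex(" "))      # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))    # f0 90 90 80
```

For U+10400 the sketch yields six bytes where UTF-8 yields four, matching the size difference stated above.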



UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)

The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical ...
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000). The CESU-8 encoding form is used in the Oracle database software: Oracle's UTF8 character set (unfortunately, a misnomer), available since version 8.0 of the database, is actually CESU-8. The character set AL32UTF8, introduced in version 9.0, is UTF-8 compliant.

A Unicode supplementary character is encoded as the six bytes 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy is the value of the top five bits of the code point minus one (U+10**** becomes 1111, U+01**** becomes 0000) and the x bits are the remaining 16 bits of the code point.

Examples

Code point   U+0045   U+0205   U+10400
Character    E        ȅ        𐐀
UTF-8        45       C8 85    F0 90 90 80
UTF-16       0045     0205     D801 DC00
CESU-8       45       C8 85    ED A0 81 ED B0 80

External links

Unicode Technical Report #26
Modified UTF-8 overview
Graphical View of CESU-8 in ICU's Converter Explorer
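The six-byte bit layout for supplementary characters given above can be checked with a short Python sketch. Both function names are hypothetical. The decoder is a shortcut built on Python's "surrogatepass" error handler rather than a strict CESU-8 validator: it also accepts four-byte UTF-8 sequences, which are not valid CESU-8, and it raises UnicodeDecodeError on an unpaired surrogate.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Pack a supplementary code point into the six-byte CESU-8 layout
    11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    yyyy = (cp >> 16) - 1      # top five bits minus one, fits in four bits
    x = cp & 0xFFFF            # remaining 16 bits
    return bytes([
        0xED, 0xA0 | yyyy, 0x80 | (x >> 10),                 # high surrogate
        0xED, 0xB0 | ((x >> 6) & 0x0F), 0x80 | (x & 0x3F),   # low surrogate
    ])


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to str (lenient sketch, see caveats above)."""
    # Step 1: read every 3-byte surrogate sequence as a surrogate code point.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so surrogate pairs are combined.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


print(cesu8_supplementary(0x10400).hex(" "))    # ed a0 81 ed b0 80
print(cesu8_decode(bytes.fromhex("45C885EDA081EDB080")) == "E\u0205\U00010400")  # True
```

The first call reproduces the CESU-8 bytes listed for U+10400 in the examples, and the second decodes the three example characters back to their code points.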


DURING A TRAINING COURSE < C E S U
http://samu83.free.fr/texte%2036.htm

Cesu-8

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described ... Compiled it together with the CESU-8 patch and played around with it ...




http://www.kvadrapak.lv/public/27312.html

CESU-8 - Wikipédia

CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-Bit) is a character encoding ... The main purpose of CESU-8 is to preserve the same binary collation ...




http://www.xxweb.com.cn/viewproduct.aspx?id=3

PDUTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit ...

In CESU-8, supplementary characters are represented as six-byte sequences resulting from ... CESU-8 is useful in 8-bit processing environments where binary ...




CESU-8

Data encoded in CESU-8 should only be exchanged when it is labeled as such in a higher-level protocol or is agreed upon in an API definition. ...




Cesu encyclopedia topics | Reference.com

Encyclopedia article of Cesu at Reference.com compiled from comprehensive and current sources. ... The CESU-8 encoding form is used in the Oracle database software. ...




http://wassr.jp/user/moriyama/statuses/1O5RvG1xSI

OTN Discussion Forums : UTF-8 vs. UTF-16 vs. CESU-8 ...

According to Unicode.org the CESU-8 encoding scheme for Unicode is identical to UTF-8 except for its representation of supplementary characters, ...



UTF-8: Information from Answers.com

UTF-8 (Unicode Transformation Format-8): A format in the Unicode coding system that uses from one to four bytes



Secure UTF-8 Input in Rails - igvita.com

Secure UTF-8 Input in Rails. Approximately 64.2 percent of online users do not speak ... 8, CESU-8, UTF-16/UCS-2, etc.) have been developed to address this need, but UTF-8 ...