'Phags-pa script
ANSEL
APL (codepage)
ASCII
ATASCII
Arabic (Unicode block)
Arabic alphabet
Arabic diacritics
Aramaic language#Imperial Aramaic
ArmSCII
Armenian alphabet
Avestan alphabet
Balinese script
Bamum language
Basic Multilingual Plane
Batak script
Baudot code
Baybayin
Bengali script
Bi-directional text
Big5
Binary Ordered Compression for Unicode
Brāhmī script
Braille
Buhid script
Burmese script
Byte order mark
C0 and C1 control codes
CCCII
CCSID
CDC display code
CESU-8
CJK Unified Ideographs
CNS 11643
Canadian Aboriginal syllabics
Carian script
Cham alphabet
Character encoding
Character encodings in HTML
Character property (Unicode)
Charset detection
Cherokee syllabary
Code page 1133
Code page 437
Code page 720
Code page 737
Code page 775
Code page 850
Code page 852
Code page 855
Code page 857
Code page 858
Code page 860
Code page 861
Code page 862
Code page 863
Code page 865
Code page 866
Code page 869
Code page 932
Code page 936
Code page 949
Code page 950
Code point
Combining character
Combining grapheme joiner
Common Locale Data Repository
Comparison of Unicode encodings
ConScript Unicode Registry
Control character
Coptic alphabet
Cork encoding
Cuneiform script
Currency sign
Cypriot syllabary
Cyrillic alphabet
DEC Radix-50
Deseret alphabet
Devanagari script
Diacritic
Duplicate characters in Unicode
EBCDIC 037
EBCDIC 1047
EBCDIC 285
EBCDIC 500
EBCDIC 875
EBCDIC 930
EUC-CN
EUC-JP
EUC-KR
EUC-TW
Egyptian hieroglyphs
Extended Unix Code
Fieldata
Fraser alphabet
GBK
GB 18030
GB 2312
GOST 10859
GSM 03.38
The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26 [1]. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Each CESU-8 byte sequence (1, 2, or 3 bytes) converts to exactly one UTF-16 code unit (2 bytes).
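The procedure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and Python's standard library is assumed to have no built-in CESU-8 codec, so the sketch relies on the `surrogatepass` error handler of the `utf-8` codec to emit the three-byte form of each surrogate.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a standard codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: same bytes as UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then encode each surrogate as a three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U00010400"
# UTF-8 length versus CESU-8 length for one supplementary character.
print(len(sample.encode("utf-8")), len(cesu8_encode(sample)))  # 4 6
```

For text that contains only BMP characters, the output is byte-for-byte identical to UTF-8.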
CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.
UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)
The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical ...
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
The CESU-8 encoding form is used in the Oracle database software. Oracle's UTF8 character set (unfortunately, a misnomer), available since version 8.0 of the database, is actually CESU-8. The character set AL32UTF8, introduced in version 9.0, is UTF-8 compliant.
A Unicode supplementary character is encoded as the six bytes 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx. Here yyyy is the top five bits of the code point (bits 16 to 20) minus one, so that U+10**** gives 1111 and U+01**** gives 0000, and the x bits are the remaining 16 bits of the code point, in order.
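The bit layout above can be checked with a few shifts and masks. The following Python sketch builds the six bytes directly from the pattern, without going through surrogate values; the function name `cesu8_supplementary` is an assumption made for illustration.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the six CESU-8 bytes for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    yyyy = (cp >> 16) - 1                # top five bits minus one (0..15)
    return bytes([
        0xED,                            # 11101101
        0xA0 | yyyy,                     # 1010yyyy
        0x80 | ((cp >> 10) & 0x3F),      # 10xxxxxx (bits 15..10)
        0xED,                            # 11101101
        0xB0 | ((cp >> 6) & 0x0F),       # 1011xxxx (bits 9..6)
        0x80 | (cp & 0x3F),              # 10xxxxxx (bits 5..0)
    ])


print(cesu8_supplementary(0x10400).hex(" "))  # ed a0 81 ed b0 80
```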
Examples
Unicode code point    U+0045    U+0205    U+10400
Character             E         ȅ         𐐀
UTF-8                 45        C8 85     F0 90 90 80
UTF-16                0045      0205      D801 DC00
CESU-8                45        C8 85     ED A0 81 ED B0 80
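The CESU-8 bytes from the examples can be turned back into text with a short Python sketch. It assumes the `surrogatepass` error handler of the standard `utf-8` and `utf-16-le` codecs; the function name `cesu8_decode` is hypothetical. The decoder is deliberately lenient: it also accepts ordinary four-byte UTF-8 sequences, which a strict CESU-8 decoder would presumably reject.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to text (lenient illustrative sketch)."""
    # Step 1: read the bytes as UTF-8, letting surrogate code points through.
    units = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each surrogate pair is joined
    # into one supplementary character. An unpaired surrogate raises
    # UnicodeDecodeError at the final decode.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# The three example characters, concatenated in CESU-8.
raw = bytes.fromhex("45 c8 85 ed a0 81 ed b0 80")
print([f"U+{ord(c):04X}" for c in cesu8_decode(raw)])
# ['U+0045', 'U+0205', 'U+10400']
```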
External links
Unicode Technical Report #26
Modified UTF-8 overview
Graphical View of CESU-8 in ICU's Converter Explorer
