A source character set or target character set can also contain multibyte characters (sequences of one or more bytes). Each sequence represents a single character in the extended character set. You use multibyte characters to represent large sets of characters, such as Kanji. A multibyte character can be a one-byte sequence that is a character from the basic C character set, an additional one-byte sequence that is implementation defined, or an additional sequence of two or more bytes that is implementation defined.
Any multibyte encoding that contains sequences of two or more bytes depends, for its interpretation between bytes, on a conversion state determined by bytes earlier in the sequence of characters. In the initial conversion state if the byte immediately following matches one of the characters in the basic C character set, the byte must represent that character.
For example, the EUC encoding is a superset of ASCII. A byte value in the interval [0xA1, 0xFE] is the first of a two-byte sequence (whose second byte value is in the interval [0x80, 0xFF]). All other byte values are one-byte sequences. Since all members of the basic C character set have byte values in the range [0x00, 0x7F] in ASCII, EUC meets the requirements for a multibyte encoding in Standard C. Such a sequence is not in the initial conversion state immediately after a byte value in the interval [0xA1, 0xFe]. It is ill-formed if a second byte value is not in the interval [0x80, 0xFF].
Multibyte characters can also have a state-dependent encoding. How you interpret a byte in such an encoding depends on a conversion state that involves both a parse state, as before, and a shift state, determined by bytes earlier in the sequence of characters. The initial shift state, at the beginning of a new multibyte character, is also the initial conversion state. A subsequent shift sequence can determine an alternate shift state, after which all byte sequences (including one-byte sequences) can have a different interpretation. A byte containing the value zero, however, always represents the null character. It cannot occur as any of the bytes of another multibyte character.
For example, the JIS encoding is another superset of ASCII. In the initial shift state, each byte represents a single character, except for two three-byte shift sequences:
JIS also meets the requirements for a multibyte encoding in Standard C. Such a sequence is not in the initial conversion state when partway through a three-byte shift sequence or when in two-byte mode.
(Amendment 1 adds the type mbstate_t, which describes an object that can store a conversion state. It also relaxes the above rules for generalized multibyte characters, which describe the encoding rules for a broad range of wide streams.)
You can write multibyte characters in C source text as part of a comment, a character constant, a string literal, or a filename in an include directive. How such characters print is implementation defined. Each sequence of multibyte characters that you write must begin and end in the initial shift state. The program can also include multibyte characters in null-terminated C strings used by several library functions, including the format strings for printf and scanf. Each such character string must begin and end in the initial shift state.
Each character in the extended character set also has an integer representation, called a wide-character encoding. Each extended character has a unique wide-character value. The value zero always corresponds to the null wide character. The type definition wchar_t specifies the integer type that represents wide characters.
You write a wide-character constant as L'mbc', where mbc represents a single multibyte character. You write a wide-character string literal as L”mbs“, where mbs represents a sequence of zero or more multibyte characters. The wide-character string literal L”xyz“ becomes a sequence of wide-character constants stored in successive bytes of memory, followed by a null wide character: {L'x', L'y', L'z', L'\0'}
The following library functions help you convert between the multibyte and wide-character representations of extended characters: btowc, mblen, mbrlen, mbrtowc, mbsrtowcs, mbstowcs, mbtowc, wcrtomb, wcsrtombs, wcstombs, wctob, and wctomb.
The macro MB_LEN_MAX specifies the length of the longest possible multibyte sequence required to represent a single character defined by the implementation across supported locales. And the macro MB_CUR_MAX specifies the length of the longest possible multibyte sequence required to represent a single character defined for the current locale.
For example, the string literal ”hello“ becomes an array of six char:
{'h', 'e', 'l', 'l', 'o', 0}
while the wide-character string literal L”hello“ becomes an array of six integers of type wchar_t:
{L'h', L'e', L'l', L'l', L'o', 0}
Certain materials included or referred to in this document are copyright P.J. Plauger and/or Dinkumware, Ltd. or are based on materials that are copyright P.J. Plauger and/or Dinkumware, Ltd.
Notwithstanding the meta-data for this document, copyright information for this document is as follows:
Copyright © IBM Corp. 1999, 2010. & Copyright © P.J. Plauger and/or Dinkumware, Ltd. 1992-2006.