
Section 8.3.2.6:
UNICODE

immutable class UNICODE < $ORDERED{UNICODE}

Inheritance Diagram: UNICODE sub-types from $ORDERED{UNICODE}

Formal Types

types

SAME = UNICODE ;
UNICODE = token ;

This class implements the concept of a character encoding which is applicable to a given culture.


External specifications

The following feature is required to be implemented for this class in accordance with the specification given in $IS_EQ from which $ORDERED{UNICODE} sub-types :-


The following feature is required to be implemented for this class in accordance with the specification given in $IS_LT{UNICODE} from which $ORDERED{UNICODE} sub-types :-


The following feature is required to be implemented for this class in accordance with the specification given in $IS_NIL which is inherited from the class $ORDERED from which this sub-types :-


The following feature is required to be implemented for this class in accordance with the specification given in $NIL from which $ORDERED{UNICODE} sub-types:-


Reader Routines

The majority of the reader routines below are either individual codes or code tables of some kind. This section has therefore been divided into five sub-groups.

Standard Properties

This small group of properties is defined for uniformity with other classes -

Universal Names

The following codes are universal in the sense that all previous international 8-bit standards use identical code values - this is essential for backward compatibility. It is strongly recommended that these names be used, since a later revision of the standard may change the code values concerned. All of these names return a value of this class.

SPACE EXCLAMATION_MARK QUOTATION_MARK NUMBER_SIGN
DOLLAR_SIGN PERCENT_SIGN AMPERSAND APOSTROPHE
LEFT_PARENTHESIS RIGHT_PARENTHESIS ASTERISK PLUS_SIGN
COMMA HYPHEN_MINUS FULL_STOP SOLIDUS
DIGIT_ZERO DIGIT_ONE DIGIT_TWO DIGIT_THREE
DIGIT_FOUR DIGIT_FIVE DIGIT_SIX DIGIT_SEVEN
DIGIT_EIGHT DIGIT_NINE COLON SEMICOLON
LESS_THAN_SIGN EQUALS_SIGN GREATER_THAN_SIGN QUESTION_MARK
COMMERCIAL_AT LATIN_CAPITAL_LETTER_A LATIN_CAPITAL_LETTER_B LATIN_CAPITAL_LETTER_C
LATIN_CAPITAL_LETTER_D LATIN_CAPITAL_LETTER_E LATIN_CAPITAL_LETTER_F LATIN_CAPITAL_LETTER_G
LATIN_CAPITAL_LETTER_H LATIN_CAPITAL_LETTER_I LATIN_CAPITAL_LETTER_J LATIN_CAPITAL_LETTER_K
LATIN_CAPITAL_LETTER_L LATIN_CAPITAL_LETTER_M LATIN_CAPITAL_LETTER_N LATIN_CAPITAL_LETTER_O
LATIN_CAPITAL_LETTER_P LATIN_CAPITAL_LETTER_Q LATIN_CAPITAL_LETTER_R LATIN_CAPITAL_LETTER_S
LATIN_CAPITAL_LETTER_T LATIN_CAPITAL_LETTER_U LATIN_CAPITAL_LETTER_V LATIN_CAPITAL_LETTER_W
LATIN_CAPITAL_LETTER_X LATIN_CAPITAL_LETTER_Y LATIN_CAPITAL_LETTER_Z LEFT_SQUARE_BRACKET
REVERSE_SOLIDUS RIGHT_SQUARE_BRACKET CIRCUMFLEX_ACCENT LOW_LINE
GRAVE_ACCENT LATIN_SMALL_LETTER_A LATIN_SMALL_LETTER_B LATIN_SMALL_LETTER_C
LATIN_SMALL_LETTER_D LATIN_SMALL_LETTER_E LATIN_SMALL_LETTER_F LATIN_SMALL_LETTER_G
LATIN_SMALL_LETTER_H LATIN_SMALL_LETTER_I LATIN_SMALL_LETTER_J LATIN_SMALL_LETTER_K
LATIN_SMALL_LETTER_L LATIN_SMALL_LETTER_M LATIN_SMALL_LETTER_N LATIN_SMALL_LETTER_O
LATIN_SMALL_LETTER_P LATIN_SMALL_LETTER_Q LATIN_SMALL_LETTER_R LATIN_SMALL_LETTER_S
LATIN_SMALL_LETTER_T LATIN_SMALL_LETTER_U LATIN_SMALL_LETTER_V LATIN_SMALL_LETTER_W
LATIN_SMALL_LETTER_X LATIN_SMALL_LETTER_Y LATIN_SMALL_LETTER_Z LEFT_CURLY_BRACKET
VERTICAL_LINE RIGHT_CURLY_BRACKET TILDE
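
As an illustration only, the following sketch (in Python rather than Sather) shows how such a name relates to a code value. It assumes that Python's standard unicodedata module may stand in for the tables of this class; the helper universal_code is hypothetical and is not a feature of the class.

```python
import unicodedata


def universal_code(name: str) -> int:
    """Return the code value named by a universal name such as 'FULL_STOP'."""
    # Assumption: replacing underscores by spaces yields the Unicode
    # character name. Names such as HYPHEN_MINUS, whose Unicode name
    # contains a hyphen, are not handled by this simple rule.
    return ord(unicodedata.lookup(name.replace("_", " ")))


print(hex(universal_code("LATIN_CAPITAL_LETTER_A")))  # prints 0x41
```

Using the name instead of the literal value keeps a program independent of the numeric code, which is the reason for the recommendation above.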

Special Codes

The Unicode standard defines a small group of codes which are not valid characters, but are defined for use in manipulating codes for communication purposes. Either the ISO/IEC standard or the Unicode documents should be studied for the exact semantics associated with these encodings.

Unicode Standard Tables

This long group of tables comprises the defined groupings in the Unicode Standard. The table ranges include codes which are reserved and those which are not yet allocated in the specific ranges, although many ranges are full. The entries in the list below are ordered by the code point of the first element in each range.

  • Latin_Extended_Additional : RANGE
  • Greek_Extended : RANGE
  • General_Punctuation : RANGE
  • Superscripts_and_Subscripts : RANGE
  • Currency_Symbols : RANGE
  • Combining_Diacritical_Marks_for_Symbols : RANGE
  • Letterlike_Symbols : RANGE
  • Number_Forms : RANGE
  • Arrows : RANGE
  • Mathematical_Operators : RANGE
  • Miscellaneous_Technical : RANGE
  • Control_Pictures : RANGE
  • Optical_Character_Recognition : RANGE
  • Enclosed_Alphanumerics : RANGE
  • Box_Drawing : RANGE
  • Block_Elements : RANGE
  • Geometric_Shapes : RANGE
  • Miscellaneous_Symbols : RANGE
  • Dingbats : RANGE
  • CJK_Symbols_and_Punctuation : RANGE
  • Hiragana : RANGE
  • Katakana : RANGE
  • Bopomofo : RANGE
  • Hangul_Compatibility_Jamo : RANGE
  • CJK_Miscellaneous : RANGE
  • Enclosed_CJK_Letters_and_Months : RANGE
  • CJK_Compatibility : RANGE
  • CJK_Unified_Ideographs : RANGE
  • Yi : RANGE
  • Hangul : RANGE
  • Private_Use_Area : RANGE
  • CJK_Compatibility_Ideographs : RANGE
  • Alphabetic_Presentation_Forms : RANGE
  • Arabic_Presentation_Forms_A : RANGE
  • CJK_Compatibility_Forms : RANGE
  • Combining_Half_Marks : RANGE
  • Small_Form_Variants : RANGE
  • Arabic_Presentation_Forms_B : RANGE
  • Halfwidth_and_Fullwidth_Forms : RANGE
  • Private_Area : RANGE
  • Specials : RANGE
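
The sketch below, written in Python for illustration, models a few of these tables as ranges and shows that a range also contains reserved codes. The boundaries are those of the corresponding blocks in the Unicode Standard; the names BLOCKS and block_of are hypothetical and do not belong to this class.

```python
from typing import Optional

# A small sample of the tables, each modelled as a half-open Python range.
BLOCKS = {
    "Greek_Extended": range(0x1F00, 0x2000),
    "General_Punctuation": range(0x2000, 0x2070),
    "Arrows": range(0x2190, 0x2200),
    "Hiragana": range(0x3040, 0x30A0),
}


def block_of(code: int) -> Optional[str]:
    """Return the name of the sample table containing code, or None."""
    for name, codes in BLOCKS.items():
        if code in codes:
            return name
    return None
```

Note that block_of(0x3040) yields "Hiragana" although U+3040 is a reserved code, which is why the predicates of this class, and not the ranges, should be used to test validity.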

Valid Categories

The following reader routines contain all and only valid code ranges. Gaps which are currently 'reserved' or otherwise are not in these tables. It is expected in practice that the predicates which are defined in this class will be used to determine the category or validity of some particular code.


is_valid

This predicate returns true if and only if the given numeric argument is a valid bit-pattern for a character encoding in the given culture.

is_valid (
val : CARD
) : BOOL
Formal Signature
is_valid(val : CARD) res : BOOL
Pre-condition

Since this operation is a predicate then this pre-condition is vacuously true.

Post-condition

Note that this post-condition performs an abstract character code creation in order to determine if the result is in the domain of the character repertoire. Such an operation could not be performed in general with executable code. It is used solely for specification purposes.

post let loc_ch = create(val) in
         (exists script | script in set dom SCRIPTS &
            let loc_dom = dom Code_Groups(script) in
               res = (loc_ch in set loc_dom))
      or (exists symbol | symbol in set dom SYMBOLS &
            let loc_sym_dom = dom Symbolics(symbol) in
               res = (loc_ch in set loc_sym_dom))

This predicate returns true if and only if the bit-pattern of the numeric argument forms a valid Unicode encoding.
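
The structure of this post-condition can be sketched in Python as follows. The two dictionaries are toy stand-ins for the Code_Groups and Symbolics maps of the specification and contain only a handful of codes; their contents are assumptions made for this illustration.

```python
# Toy stand-in for Code_Groups: script name -> set of codes in that script.
CODE_GROUPS = {
    "Latin": set(range(0x41, 0x5B)) | set(range(0x61, 0x7B)),
}

# Toy stand-in for Symbolics: symbol category -> set of codes.
SYMBOLICS = {
    "Punctuation": {0x21, 0x2C, 0x2E},
    "Spacing": {0x20},
}


def is_valid(val: int) -> bool:
    """True if val lies in some script's code group or in some symbol group."""
    in_script = any(val in codes for codes in CODE_GROUPS.values())
    in_symbols = any(val in codes for codes in SYMBOLICS.values())
    return in_script or in_symbols
```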


create

This feature creates a character code from the given numeric value, used as a bit-pattern.

create (
val : CARD
) : SAME
Formal Signature
create(val : CARD) res : SAME
Pre-condition
pre is_valid(val)
Post-condition
post let loc_res : seq of OCTET be st loc_res = res in
      loc_res = CARD.binstr(val)(1, ..., asize)
         and lib(res) = plib

This feature creates a new character code which has the bit-pattern representation which is the same as the value given.


create

This feature creates a new Unicode encoding which is that of the given rune - or, if the argument contains more than one encoding, that of the first element.

create (
rn : RUNE
) : SAME
Formal Signature
create2(rn : RUNE) res : SAME
Pre-condition

Since the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = create(CHAR_CODE.card(RUNE.code(rn)))

This feature creates a new Unicode encoding which has the same encoding as that of the first element in the given argument.


is_combining

This predicate returns true only if self is a combining code as defined in the Unicode version 3.1 standard.

is_combining : BOOL
Formal Signature
is_combining(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Combining

This predicate returns true if and only if self is a combining encoding in the Unicode domain.


is_letter

This predicate returns true only if self is a letter code in the given script as defined in ISO/IEC 10646-1:2000.

is_letter (
script : SCRIPT
) : BOOL
Formal Signature
is_letter(self : SAME, script : SCRIPTS) res : BOOL
Pre-condition

Since this feature is a predicate and neither argument is optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Letters(script.enum)

This predicate returns true if and only if self is a letter code in the domain of the given script.


is_letter

This predicate returns true only if self is a letter code in any script defined in the Unicode standard.

is_letter : BOOL
Formal Signature
is_letter2(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = exists script | script in set dom Letters &
         self in set dom Letters(script)

This predicate returns true if and only if self is a letter encoding in any script defined in the Unicode domain.


is_up_mapped

This predicate returns true only if self is an encoding for a lower case letter to which there is a corresponding upper case letter.

is_up_mapped : BOOL
Formal Signature
is_up_mapped(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Case_Pair

This predicate returns true if and only if self is an encoding for a lower case letter to which there is a mapped upper case letter in the Unicode domain.


is_lower

This predicate returns true only if self is an encoding for a lower case letter in the Unicode standard. Note that there are a number of encodings for which there is no corresponding upper case character.

is_lower : BOOL
Formal Signature
is_lower(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Case_Pair
      or self in set dom Lower_only

This predicate returns true if and only if self is an encoding for a lower case letter in the Unicode domain.

is_down_mapped

This predicate returns true only if self is an encoding for an upper case letter to which there is a corresponding lower case letter.

is_down_mapped : BOOL
Formal Signature
is_down_mapped(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set rng Case_Pair

This predicate returns true if and only if self is an encoding for an upper case letter to which there is a mapped lower case letter in the Unicode domain.


is_upper

This predicate returns true only if self is an encoding for an upper case letter in the Unicode standard. Note that there are a number of encodings for which there is no corresponding lower case character.

is_upper : BOOL
Formal Signature
is_upper(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set rng Case_Pair
      or self in set dom Upper_only

This predicate returns true if and only if self is an encoding for an upper case letter in the Unicode domain.
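
The four case predicates above share the Case_Pair map, whose domain holds the lower case codes and whose range holds the corresponding upper case codes. A minimal Python sketch of that relationship follows; the tables are toy data assumed for illustration, with LATIN SMALL LETTER KRA (U+0138) used as an example of a lower case letter treated as having no upper case partner.

```python
# Toy Case_Pair: lower case code -> upper case code.
CASE_PAIR = {ord("a"): ord("A"), ord("b"): ord("B")}
LOWER_ONLY = {0x0138}      # assumed example of a lower case letter with no pair
UPPER_ONLY: set = set()    # left empty in this sketch


def is_up_mapped(code: int) -> bool:
    return code in CASE_PAIR                      # dom Case_Pair


def is_lower(code: int) -> bool:
    return code in CASE_PAIR or code in LOWER_ONLY


def is_down_mapped(code: int) -> bool:
    return code in CASE_PAIR.values()             # rng Case_Pair


def is_upper(code: int) -> bool:
    return code in CASE_PAIR.values() or code in UPPER_ONLY
```

In this model every up-mapped code is lower case, but not every lower case code is up-mapped.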


is_spacing

This predicate returns true only if self is an encoding for an invisible mark in the Unicode standard which occupies space on a presentation medium.

is_spacing : BOOL
Formal Signature
is_spacing(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = let loc_space = dom Symbolics(SYMBOLS.Spacing) in
      self in set loc_space

This predicate returns true if and only if self is an encoding for a character which is invisible on a presentation medium but nevertheless occupies space when rendered.


is_whitespace

This predicate returns true only if self is either an encoding for an invisible mark in the Unicode standard which occupies space on a presentation medium or a bit-pattern corresponding to the name of a control function which effects movement but no marking on a presentation medium.

is_whitespace : BOOL
Formal Signature
is_whitespace(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = (let loc_space = dom Symbolics(SYMBOLS.Spacing) in
         self in set loc_space)
      or CONTROL_CODES.is_space(CONTROL_CODES.create(card(self)))

This predicate returns true if and only if self is either an encoding for a character which is invisible on a presentation medium but nevertheless occupies space when rendered, or a bit-pattern corresponding to a control function which effects movement but no marking.
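
The difference between is_spacing and is_whitespace can be sketched in Python as below. This is an approximation under two assumptions: the Unicode general category Zs (space separator) stands in for the Spacing symbol group, and a small hypothetical set of control codes stands in for CONTROL_CODES.is_space.

```python
import unicodedata

# Assumed stand-in for the control functions which move but do not mark.
MOVEMENT_CONTROLS = {"\t", "\n", "\v", "\f", "\r"}


def is_spacing(ch: str) -> bool:
    """True for an invisible character that occupies space (category Zs)."""
    return unicodedata.category(ch) == "Zs"


def is_whitespace(ch: str) -> bool:
    """True for a spacing character or a movement-only control code."""
    return is_spacing(ch) or ch in MOVEMENT_CONTROLS
```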


is_numeric

This predicate returns true only if self is an encoding for a numeric symbol in the given script as defined in ISO/IEC 10646-1:2000. This does not mean that it is possible to convert such an encoding into a numeric value - see is_digit below.

is_numeric (
script : SCRIPT
) : BOOL
Formal Signature
is_numeric(self : SAME, script : SCRIPTS) res : BOOL
Pre-condition

Since this feature is a predicate and neither argument is optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Numeric(script.enum)

This predicate returns true if and only if self is an encoding for a numeric symbol in the domain of the given script.


is_numeric

This predicate returns true only if self is an encoding representing a numeric value in any script defined in the Unicode standard. This does not mean that it is possible to convert such an encoding into a numeric value - see is_digit below.

is_numeric : BOOL
Formal Signature
is_numeric2(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = exists script | script in set dom Numeric &
         self in set dom Numeric(script)

This predicate returns true if and only if self is an encoding representing a numeric symbol in the Unicode domain.


is_digit

This predicate returns true only if self is a decimal digit encoding in the given script as defined in ISO/IEC 10646-1:2000.

is_digit (
script : SCRIPT
) : BOOL
Formal Signature
is_digit(self : SAME, script : SCRIPTS) res : BOOL
Pre-condition

Since this feature is a predicate and neither argument is optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Decimal(script.enum)

This predicate returns true if and only if self is a decimal digit encoding in the domain of the given script.


is_digit

This predicate returns true only if self is a decimal digit encoding in any script defined in the Unicode standard.

is_digit : BOOL
Formal Signature
is_digit2(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = exists script | script in set dom Decimal &
         self in set dom Decimal(script)

This predicate returns true if and only if self is a decimal digit encoding in any script defined in the Unicode domain.


is_octal_digit

This predicate returns true only if self is an octal digit encoding in the given script as defined in ISO/IEC 10646-1:2000.

is_octal_digit (
script : SCRIPT
) : BOOL
Formal Signature
is_octal_digit(self : SAME, script : SCRIPTS) res : BOOL
Pre-condition

Since this feature is a predicate and neither argument is optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Decimal(script.enum)
      and digit_value(self) < OCTET::Octet_Bits

This predicate returns true if and only if self is an octal digit encoding in the domain of the given script.


is_octal_digit

This predicate returns true only if self is an octal digit code in any script defined in the Unicode standard.

is_octal_digit : BOOL
Formal Signature
is_octal_digit2(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = (exists script | script in set dom Decimal &
         self in set dom Decimal(script))
         and digit_value(self) < OCTET::Octet_Bits

This predicate returns true if and only if self is an octal digit encoding in any script defined in the Unicode domain.
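
The relationship between is_digit and is_octal_digit (a decimal digit whose value is less than eight) can be sketched in Python as follows. The sketch assumes that the decimal-digit property of Python's standard unicodedata module may stand in for the Decimal tables of the specification.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    """True if ch is a decimal digit in any script."""
    # unicodedata.decimal returns the supplied default (-1) when ch has
    # no decimal digit value.
    return unicodedata.decimal(ch, -1) >= 0


def is_octal_digit(ch: str) -> bool:
    """True if ch is a decimal digit in any script with value below 8."""
    return 0 <= unicodedata.decimal(ch, -1) < 8
```

For example, ARABIC-INDIC DIGIT THREE (U+0663) satisfies both predicates, while DIGIT EIGHT satisfies only is_digit.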


is_hex_digit

This predicate returns true only if self is a hexadecimal digit code defined in the Unicode standard. Note that this can only be true in the Latin script.

is_hex_digit : BOOL
Formal Signature
is_hex_digit(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Decimal(SCRIPTS.Latin)
      or self in set (LATIN_CAPITAL_LETTER_A, ..., LATIN_CAPITAL_LETTER_F)
      or self in set (LATIN_SMALL_LETTER_A, ..., LATIN_SMALL_LETTER_F)

This predicate returns true if and only if self is a hexadecimal digit encoding in the Latin script of the Unicode domain.


is_print

This predicate returns true only if self is an encoding defined in the Unicode standard for which a rendering engine will produce a mark.

is_print : BOOL
Formal Signature
is_print(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = not is_spacing(self)

This predicate returns true if and only if self is an encoding in any script defined in the Unicode domain for a character which, when rendered on some presentation medium, is visible.


is_punct

This predicate returns true only if self is an encoding defined in the Unicode standard for a punctuation symbol.

is_punct : BOOL
Formal Signature
is_punct(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = self in set dom Symbolics(SYMBOLS.Punctuation)

This predicate returns true if and only if self is an encoding in any script defined in the Unicode domain for a punctuation symbol.


is_control

This predicate returns false since control codes, not being characters, do not form part of Unicode.

is_control : BOOL
Formal Signature
is_control(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = false

This predicate returns true if and only if self is a control code - this is identically false.


is_646char

This predicate returns true if self is a character encoding corresponding to the encoding of characters defined by ISO 646-IRV. Note that this is identical to ISO 646-US and US-ASCII and is also a subset of all standards in the ISO 8859 family.

is_646char : BOOL
Formal Signature
is_646char(self : SAME) res : BOOL
Pre-condition

Since this feature is a predicate and the argument is not optional then this pre-condition is vacuously true.

Post-condition
post res = (self in set dom Basic_Latin)
      and (let loc_dom = dom Code_Groups(SCRIPTS.Latin) in
            (self in set loc_dom)
         or (exists symbol | symbol in set dom SYMBOLS &
            let loc_sym_dom = dom Symbolics(symbol) in
               self in set loc_sym_dom))

This predicate returns true if and only if self is an encoding defined in ISO 646 7-bit character encoding standard.
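
A minimal Python sketch of this test is given below. It assumes that only the graphic positions of the 7-bit code (SPACE through TILDE) are accepted, on the grounds that this class does not treat control codes as characters; that reading of the post-condition is an assumption.

```python
def is_646char(code: int) -> bool:
    """True if code is a graphic position of the 7-bit ISO 646-IRV table."""
    # 0x20 is SPACE and 0x7E is TILDE; control codes and DELETE (0x7F)
    # are excluded under the assumption stated above.
    return 0x20 <= code <= 0x7E
```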


card

This feature returns the numeric value of the bit-pattern of self as a cardinal number.

card : CARD
Formal Signature
card(self : SAME) res : CARD
Pre-condition

Since the argument is not optional then this pre-condition is vacuously true.

Post-condition
post create(res) = self

This feature returns the cardinal number corresponding to the code value (as a bit-pattern) of self.
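
The post-condition states that card is the inverse of create. The following Python sketch models this with a frozen (immutable) data class; the class name Unicode and the range check used in place of is_valid are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unicode:
    value: int

    @classmethod
    def create(cls, val: int) -> "Unicode":
        # Assumed stand-in for the pre-condition is_valid(val).
        if not 0 <= val <= 0x10FFFF:
            raise ValueError("pre-condition is_valid violated")
        return cls(val)

    def card(self) -> int:
        """Return the numeric value of the bit-pattern of self."""
        return self.value
```

For any code x created in this way, Unicode.create(x.card()) compares equal to x.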


code

This feature returns the character code corresponding to the value of self. The culture code kind of the result is Unicode.

code : CHAR_CODE
Formal Signature
code(self : SAME) res : CHAR_CODE
Pre-condition

Since the argument is not optional then this pre-condition is vacuously true.

Post-condition
post let loc_lib : LIBCHARS = CHAR_CODE.lib(res) in
         res = CHAR_CODE.create(card(self),loc_lib)
      and CULTURE.kind(LIBCHARS.culture(loc_lib)) = CODE_KINDS.Unicode

This feature returns the character code which has the encoding which is self and for which the culture code kind is CODE_KINDS::Unicode.


rune

This feature creates and returns a rune containing the single encoding self for which the culture code kind is Unicode.

rune : RUNE
Formal Signature
rune(self : SAME) res : RUNE
Pre-condition

Since the argument is not optional then this pre-condition is vacuously true.

Post-condition
post (is_combining(self)
         and res = RUNE.nil)
      or res = CHAR_CODE.rune(code(self))

This feature returns the single encoding rune corresponding to self, provided that self is not a combining code; if self is a combining code then RUNE::nil shall be returned.
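
The treatment of combining codes can be sketched in Python as below. The sketch assumes that a non-zero canonical combining class, as reported by Python's standard unicodedata module, may stand in for membership of the Combining table, and it uses None in place of RUNE::nil.

```python
import unicodedata
from typing import Optional


def rune(ch: str) -> Optional[str]:
    """Return ch as a one-code rune, or None if ch is a combining code."""
    if unicodedata.combining(ch) != 0:
        return None  # stands in for RUNE::nil
    return ch
```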


to_lower

This feature returns the lower case letter corresponding to self provided the pre-condition (that such a mapping exists) is satisfied.

to_lower : SAME
Formal Signature
to_lower(self : SAME) res : SAME
Pre-condition
pre is_down_mapped(self)
Post-condition
post to_upper(res) = self

This feature returns the encoding of the lower case letter equivalent to the upper case letter for which self is an encoding.


to_upper

This feature returns the upper case letter corresponding to self provided the pre-condition (that such a mapping exists) is satisfied.

to_upper : SAME
Formal Signature
to_upper(self : SAME) res : SAME
Pre-condition
pre is_up_mapped(self)
Post-condition
post to_lower(res) = self

This feature returns the encoding of the upper case letter equivalent to the lower case letter for which self is an encoding.
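
The post-conditions of to_lower and to_upper make each the inverse of the other wherever the pre-conditions hold. A minimal Python sketch under assumed toy tables:

```python
# Toy mapping from lower case to upper case; its inverse gives to_lower.
TO_UPPER = {"a": "A", "b": "B"}
TO_LOWER = {upper: lower for lower, upper in TO_UPPER.items()}


def to_upper(ch: str) -> str:
    if ch not in TO_UPPER:
        raise ValueError("pre-condition is_up_mapped violated")
    return TO_UPPER[ch]


def to_lower(ch: str) -> str:
    if ch not in TO_LOWER:
        raise ValueError("pre-condition is_down_mapped violated")
    return TO_LOWER[ch]
```

In this model a call outside the pre-condition raises an error instead of returning its argument unchanged; that choice is an assumption of the sketch.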


octal_value

This feature returns the octal value corresponding to the digit character encoding self.

octal_value : CARD
Formal Signature
octal_value(self : SAME) res : CARD
Pre-condition
pre is_octal_digit(self)
Post-condition
post exists sublist | sublist in set dom Decimal &
      let loc_rng = iota rng | rng in set dom sublist &
            self in set dom loc_rng in
         res = (iota idx | idx in inds loc_rng &
               self = loc_rng(idx)) - 1

This routine returns as a cardinal number the value of the octal digit corresponding to the encoding self.


digit_value

This feature returns the decimal value corresponding to the digit character encoding self.

digit_value : CARD
Formal Signature
digit_value(self : SAME) res : CARD
Pre-condition
pre is_digit(self)
Post-condition
post exists sublist | sublist in set dom Decimal &
      let loc_rng = iota rng | rng in set dom sublist &
            self in set dom loc_rng in
         res = (iota idx | idx in inds loc_rng &
               self = loc_rng(idx)) - 1

This routine returns as a cardinal number the value of the decimal digit corresponding to the encoding self.
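
The post-condition finds the sequence of ten digits which contains self and returns the (one-based) index of self in that sequence, less one. The Python sketch below follows the same steps; the list DIGIT_SEQUENCES is a hypothetical sample covering only three scripts.

```python
# Each entry lists the ten digit codes of one script, zero first.
DIGIT_SEQUENCES = [
    [0x0030 + i for i in range(10)],  # DIGIT ZERO .. DIGIT NINE
    [0x0660 + i for i in range(10)],  # Arabic-Indic digits
    [0x0966 + i for i in range(10)],  # Devanagari digits
]


def digit_value(code: int) -> int:
    """Return the decimal value of the digit encoded by code."""
    for seq in DIGIT_SEQUENCES:
        if code in seq:
            one_based_index = seq.index(code) + 1
            return one_based_index - 1
    raise ValueError("pre-condition is_digit violated")
```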


hex_digit_value

This feature returns the hexadecimal value corresponding to the hexadecimal digit character encoding self.

hex_digit_value : CARD
Formal Signature
hex_digit_value(self : SAME) res : CARD
Pre-condition
pre is_hex_digit(self)
Post-condition
post (is_digit(self)
         and res = digit_value(self))
      or (let uc_seq = (LATIN_CAPITAL_LETTER_A, ..., LATIN_CAPITAL_LETTER_F) in
         res = (iota idx | idx in inds uc_seq &
               self = uc_seq(idx)) + 9)
      or (let lc_seq = (LATIN_SMALL_LETTER_A, ..., LATIN_SMALL_LETTER_F) in
         res = (iota idx | idx in inds lc_seq &
               self = lc_seq(idx)) + 9)

This routine returns as a cardinal number the value of the hexadecimal digit corresponding to the encoding self.
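
The three alternatives of this post-condition can be sketched in Python as follows. The sketch is restricted to the Basic Latin digits and letters and is an illustration only; as in the specification, a letter contributes its one-based index in the sequence A to F (or a to f) plus nine.

```python
DIGITS = "0123456789"
UC_SEQ = "ABCDEF"
LC_SEQ = "abcdef"


def hex_digit_value(ch: str) -> int:
    """Return the value (0..15) of the hexadecimal digit ch."""
    if len(ch) == 1:
        if ch in DIGITS:
            return DIGITS.index(ch)
        if ch in UC_SEQ:
            return (UC_SEQ.index(ch) + 1) + 9  # one-based index plus 9
        if ch in LC_SEQ:
            return (LC_SEQ.index(ch) + 1) + 9
    raise ValueError("pre-condition is_hex_digit violated")
```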


Comments or enquiries should be made to Keith Hopper.
Page last modified: Friday, 1 June 2001.