Numeric Character Reference

A numeric character reference (NCR) is a common markup construct used in SGML and in SGML-derived markup languages such as HTML and XML. It is also a source of much confusion among web document authors.

Example

In SGML, HTML, and XML, the following are numeric character references for the Greek capital letter Sigma ("Σ"):
  &#931;  &#0931;  &#x3A3;  &#x03A3;
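
As a quick check, the snippet below decodes several spellings of a reference to Sigma (code point 931, or 3A3 in hexadecimal) using `html.unescape` from the Python standard library. It is only a minimal sketch: the particular spellings listed are chosen for illustration, and Python follows HTML parsing rules, which are more lenient than XML's.

```python
import html

# Four spellings of a reference to U+03A3: decimal, decimal with a
# leading zero, hexadecimal, and hexadecimal with a leading zero.
references = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;"]

# html.unescape resolves each reference to the character it names.
decoded = [html.unescape(ref) for ref in references]
print(decoded)
```

Every entry in `decoded` is the same one-character string, which shows that leading zeros and the choice of base do not change which character is referenced.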

Discussion

Markup languages are typically defined in terms of ISO 10646 or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.

Ideally, when the characters of a document are encoded as a sequence of bits for storage or transmission over a network, the encoding used can represent every character in the document, if not the whole of Unicode, directly as a particular bit sequence. Sometimes, though, for reasons of convenience or because of technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can represent at most 256 unique characters, as one 8-bit byte each.

In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references. Character references that are based on the referenced character's ISO 10646 or Unicode "code point" are called numeric character references.

In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows: character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:
  • one or more decimal digits zero (U+0030) through nine (U+0039); or
  • character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);
all followed by character U+003B (semicolon).

Older versions of HTML disallowed the hexadecimal syntax. The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or by using NCRs.
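
The numeric syntax described above (ampersand, number sign, decimal digits or "x" plus hexadecimal digits, semicolon) can be sketched as a regular expression, together with the reverse operation of escaping characters that an encoding cannot represent. This is a minimal sketch and not a conforming HTML or XML parser: the names `NCR_PATTERN`, `decode_ncrs`, and `encode_with_ncrs` are hypothetical, and no validity restrictions on the referenced code point are checked beyond the upper limit of Unicode.

```python
import re

# "&#", then decimal digits or a lowercase "x" plus hex digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text: str) -> str:
    """Replace each well-formed NCR in text with the character it references."""

    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        code_point = int(decimal) if decimal is not None else int(hexadecimal, 16)
        # Leave references beyond the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


def encode_with_ncrs(text: str, encoding: str) -> bytes:
    """Encode text, writing unencodable characters as decimal NCRs."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode(encoding, errors="xmlcharrefreplace")


print(decode_ncrs("&#931; and &#x3a3;"))
print(encode_with_ncrs("Σ = sum", "iso-8859-1"))
```

With this sketch, a reference missing its semicolon or written with an uppercase "X" is left as literal text, since neither matches the grammar given above. The encoder shows the motivating case: Sigma has no byte in ISO 8859-1, so it is written out as an ASCII-only reference instead.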

Restrictions

ISO 10646 (the Universal Character Set) is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS. While the syntax of SGML does not prohibit unassigned code points such as &#xFFFF; from being referenced, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that have been assigned to characters (or rather, code points that are not permanently unassigned).

Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed (because a form feed character is allowed), but in XML, the form feed character cannot be used, not even by reference.

As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML. When used in HTML, however, it is usually not flagged as an error by web browsers, some of which attempt to interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding: "€", which should actually be represented as &#8364;.

As a further example, XML 1.0, being based on an older version of ISO 10646, prohibited using characters above U+FFFD except in character data, thus making a reference like &#x10001; illegal, while in XML 1.1, such a reference is allowed, because the available character repertoire was explicitly extended.
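
The lenient handling of a reference to code point 128 described above can be illustrated with a short sketch. It is an assumption about how such a fallback might be written, not a description of any particular browser: the function name `lenient_c1_reference` is hypothetical, and Python's `cp1252` codec stands in for the Windows-1252 table.

```python
def lenient_c1_reference(code_point: int) -> str:
    """Resolve a code point, reading 128..159 as Windows-1252 byte values."""
    if 0x80 <= code_point <= 0x9F:
        try:
            # Treat the number as a Windows-1252 byte, not a code point.
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # This byte is unassigned in Python's cp1252 codec;
            # fall through and use the code point as given.
            pass
    return chr(code_point)


euro = lenient_c1_reference(128)
print(euro, hex(ord(euro)))
```

Under this sketch, 128 resolves to the euro sign, whose actual code point is 8364 (U+20AC), while numbers outside the range 128 to 159 are passed through unchanged.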

 
