Unicode and HTML

HTML 4.0 uses ISO 10646/Unicode as its official character set. That is, an HTML document must be composed of a sequence of Unicode characters. When stored or transmitted over a network, however, these characters are encoded as a sequence of bytes according to some character encoding. Depending on which encoding is used, the encoded form may not be capable of directly representing all of the Unicode characters in the document. For example, if an 8-bit, one-byte-per-character encoding such as one of the ISO 8859 encodings is used, then at most 256 specific characters from Unicode can be directly represented in the encoded HTML.

To work around this limitation, HTML is designed so that any Unicode character can be represented inside a document by a numeric character reference of the form &#N;, where N is a decimal number for the Unicode code point, or a hexadecimal number prefixed by x. The characters that make up a numeric character reference are representable in every encoding approved for use on the Internet. Support for the hexadecimal form is more recent, so older browsers might have problems with it, although they will probably have problems displaying Unicode characters outside the 8-bit range in the first place. It is therefore still common practice to convert the hexadecimal code point into a decimal value (e.g. &#9824; instead of &#x2660;).

There is also a standard set of named character entities for commonly used symbols that are missing from some character encodings. These entities can be included in an HTML document via entity references of the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014, the em dash character, even if the character encoding used doesn't contain that character.
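The relationship between the decimal, hexadecimal, and named forms can be sketched in Python using only the standard library. The helper names `decimal_reference` and `hex_reference` are invented for this illustration; `html.unescape` and the `xmlcharrefreplace` error handler are ordinary Python features, not something HTML itself prescribes.

```python
import html


def decimal_reference(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


spade = "\u2660"  # BLACK SPADE SUIT
print(decimal_reference(spade))  # &#9824;
print(hex_reference(spade))      # &#x2660;

# All three spellings of the em dash resolve to the same code point, U+2014.
for ref in ("&mdash;", "&#8212;", "&#x2014;"):
    assert html.unescape(ref) == "\u2014"

# Encoding to ISO 8859-1 with the "xmlcharrefreplace" handler keeps the
# characters the encoding can represent and turns the rest into decimal
# references.
encoded = "Stra\u00dfe \u2660".encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'Stra\xdfe &#9824;'
```

Here the sharp S survives as the single byte 0xDF because ISO 8859-1 contains it, while the spade, which the encoding lacks, falls back to a reference built from ASCII characters only.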
(In the Unicode standard, each code point is written in the notation U+hhhh, where hhhh are hexadecimal digits.)

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. To do this, the browser must know which encoding was used. When a document is transmitted via a MIME message or a MIME-like transport such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=ISO-8859-1. Other external means of declaring the encoding are permitted, but rarely used. The encoding may also be declared within the document itself, in the form of a META element, like <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />.

When there is no encoding declaration, the default varies depending on the localisation of the browser. For a system set up mainly for western languages it will generally be iso-8859-1 or its close relation windows-1252. For a browser from a location where multibyte charsets are the norm, some form of autodetection is likely to be applied.

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is just a label that could be inaccurate. Many HTML documents are served on the World Wide Web with inaccurate encoding declarations, or no declarations at all.
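The point that a declaration is only a label can be illustrated with a small Python sketch. The function name `declared_charset` and the sample text are assumptions made for this example; the header parsing relies on the standard library's `email.message.Message`, which is one convenient way to read a MIME-style parameter, not the way any particular browser does it.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


# Bytes as they were actually saved: the text "Strasse" with a sharp S,
# encoded as UTF-8.
raw = "Stra\u00dfe".encode("utf-8")

# The label sent with the document claims a different encoding.
label = declared_charset("text/html; charset=ISO-8859-1")
print(label)  # iso-8859-1

# Decoding with the inaccurate label does not fail; it silently produces
# the wrong characters.
print(ascii(raw.decode(label)))

# Decoding with the encoding that was really used recovers the text.
print(ascii(raw.decode("utf-8")))
```

Changing the label changes nothing about the stored bytes, which is why an inaccurate declaration leads to garbled output rather than an error.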
In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only to those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they are found to be inaccurate.

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:
Unicode | Numeric character reference | Unicode name | What your browser displays
U+0041 | &#65; | Latin capital letter A | A
U+00DF | &#223; | Latin small letter Sharp S | ß
U+00FE | &#254; | Latin small letter Thorn | þ
U+0394 | &#916; | Greek capital letter Delta | Δ
U+0419 | &#1049; | Cyrillic capital letter Short I | Й
U+05E7 | &#1511; | Hebrew letter Qof | ק
U+0645 | &#1605; | Arabic letter Meem | م
U+0E57 | &#3671; | Thai digit 7 | ๗
U+1250 | &#4688; | Ethiopic syllable Qha | ቐ
U+3042 | &#12354; | Hiragana letter A (Japanese) | あ
U+53F6 | &#21494; | CJK Unified Ideograph-53F6 (Simplified Chinese "Leaf") | 叶
U+8449 | &#33865; | CJK Unified Ideograph-8449 (Traditional Chinese "Leaf") | 葉
U+B0FB | &#45307; | Hangul syllable Nyaelh (Korean "Nieun Yae Rieulhieuh") | 냻
U+10346 | &#66374; | Gothic letter Faihu | 𐍆

To display all of the characters above, you may need to install one or more large multilingual fonts, like Code2000 (and Code2001 for some extinct languages, e.g., Gothic).
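The first three columns of such a table can be derived mechanically from the characters themselves. The following Python sketch assumes the standard `unicodedata` module as the source of character names; the function name `table_row` is made up for this illustration.

```python
import unicodedata


def table_row(ch: str):
    """Return (U+ notation, decimal reference, Unicode name) for a character."""
    code_point = ord(ch)
    return (f"U+{code_point:04X}", f"&#{code_point};", unicodedata.name(ch))


# A few of the characters from the table, written as escapes so the
# script itself stays pure ASCII.
for ch in ("A", "\u00df", "\u3042", "\U00010346"):
    print(" | ".join(table_row(ch)))
```

The `:04X` format pads to at least four hexadecimal digits, which matches the U+hhhh convention while still allowing five or six digits for code points beyond U+FFFF.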
Some web browsers, such as Mozilla Firefox and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.

Internet Explorer is capable of displaying the full range of Unicode characters, but can't automatically make the necessary font choice. Web page authors must guess which appropriate fonts might be present on users' systems, and manually specify them for each block of text with a different language or Unicode range. A user may have another font installed which would display some characters, but if the web page author hasn't specified it, then Explorer will fail to display them, and show placeholder squares instead.

Older browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. If you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.

For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.
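The misinterpretation of numeric character references described above can be modelled roughly in Python. This is a hypothetical model, not a reconstruction of any browser's actual code: the function names are invented, and KOI8-R is chosen arbitrarily as the page encoding for the demonstration.

```python
def resolve_per_standard(number: int) -> str:
    """Treat the number as a Unicode code point, as HTML requires."""
    return chr(number)


def resolve_like_legacy_browser(number: int, page_encoding: str) -> str:
    """Hypothetical model of the bug: treat the number as a byte value in
    the page's own encoding. Only meaningful for numbers from 0 to 255."""
    return bytes([number]).decode(page_encoding)


def outside_bmp(ch: str) -> bool:
    """True for code points above U+FFFF, such as Gothic letter faihu."""
    return ord(ch) > 0xFFFF


# The reference &#233; on a page labelled KOI8-R: the two readings disagree.
correct = resolve_per_standard(233)
wrong = resolve_like_legacy_browser(233, "koi8_r")
print(ascii(correct), ascii(wrong))

# Gothic faihu lies outside the Basic Multilingual Plane; hiragana A does not.
print(outside_bmp("\U00010346"), outside_bmp("\u3042"))  # True False
```

The two functions agree only for numbers whose meaning happens to be the same in Unicode and in the page encoding, which for most 8-bit encodings means the ASCII range.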
