About: Charset detection


An Entity of Type : dbo:Election, within Data Space : dbpedia.demo.openlinksw.com associated with source document(s)

http://dbpedia.demo.openlinksw.com/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FCharset_detection&invfp=IFP_OFF&sas=SAME_AS_OFF

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is used only when specific metadata, such as an HTTP Content-Type: header, is either unavailable or assumed to be untrustworthy. In general, incorrect charset detection leads to mojibake. Such a header looks like "Content-Type: text/html;charset=UTF-8". An isolated HTML document, such as one being edited as a file on disk, may imply such a header by a meta tag within the file.
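The declared-encoding case described above can be sketched in Python. This is only an illustration under stated assumptions: the function names `charset_from_content_type` and `charset_from_meta`, the regular expression, and the 1024-byte scan window are hypothetical choices made for this example, not part of any standard or of the source page. A real HTML parser follows a more elaborate prescan algorithm.

```python
import re
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Hypothetical, deliberately simple pattern: it matches both the
# http-equiv form and the HTML5 <meta charset=...> form.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_meta(html_bytes, scan_limit=1024):
    """Look for a charset declared by a meta tag near the start of the file."""
    match = META_CHARSET_RE.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(charset_from_content_type("text/html;charset=UTF-8"))  # utf-8
print(charset_from_meta(b'<meta charset="utf-8" >'))  # utf-8
```

Only when both lookups return None would a caller need to fall back to heuristic detection.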

Attributes	Values
rdf:type	Election
rdfs:label	Charset detection (en) 字符集探测 (zh)
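rdfs:comment	Character encoding detection, charset detection, also called code page detection, is the heuristic guessing of the character encoding of a series of bytes that represent text. The algorithm is usually based on statistical analysis of byte patterns. This is not a foolproof method because it depends on statistical data; for example, some versions of Windows would mistake the ASCII-encoded "" for Chinese UTF-16LE. One of the few cases that can be detected reliably is UTF-8. This is because UTF-8 has a large number of invalid byte sequences, so text in another encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, poorly written charset detection routines do not run the reliable UTF-8 test first and so identify UTF-8 as some other encoding. (zh, translated) Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is used only when specific metadata, such as an HTTP Content-Type: header, is either unavailable or assumed to be untrustworthy. In general, incorrect charset detection leads to mojibake. Such a header looks like "Content-Type: text/html;charset=UTF-8". An isolated HTML document, such as one being edited as a file on disk, may imply such a header by a meta tag within the file. (en)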
dcterms:subject	Character encoding
Wikipage page ID	19263080 (xsd:integer)
Wikipage revision ID	1046959984 (xsd:integer)
Link from a Wikipage to another Wikipage	Mojibake Character encoding UTF-16 UTF-16LE UTF-8 München Content sniffing Character encoding Language identification ASCII HTTP International Components for Unicode Heuristic Digraphs and trigraphs Bush hid the facts Byte order mark Metadata Microsoft Windows Browser sniffing Out-of-band data Language detection ISO-8859
Link from a Wikipage to an external page	http://chsdet.sourceforge.net/ http://cpdetector.sourceforge.net/usage.shtml http://www.joshisanerd.com/projects/hebci/ https://www-archive.mozilla.org/projects/intl/chardet.html https://web.archive.org/web/20101018031124/http:/jchardet.sourceforge.net/ https://web.archive.org/web/20101217195221/http:/icu-project.org/apiref/icu4c/ucsdet_8h.html https://www.freedesktop.org/wiki/Software/uchardet/ http://msdn.microsoft.com/en-us/library/aa920101.aspx http://www.fas.org/irp/doddir/army/fm34-40-2/appb.pdf https://github.com/errepi/ude
sameAs	Charset detection Charset detection Charset detection Charset detection
dbp:wikiPageUsesTemplate	dbt:Mono dbt:Reflist dbt:Character_encoding
has abstract	Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is used only when specific metadata, such as an HTTP Content-Type: header, is either unavailable or assumed to be untrustworthy. The algorithm usually involves statistical analysis of byte patterns, such as the frequency distribution of trigraphs of various languages encoded in each code page to be detected; such statistical analysis can also be used to perform language detection. This process is not foolproof because it depends on statistical data. In general, incorrect charset detection leads to mojibake. One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common for web sites in UTF-8 containing the name of the German city München to be shown as MÃ¼nchen, because the code decided the text was in an ISO-8859 encoding before even testing whether it was UTF-8. UTF-16 is fairly reliable to detect due to the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and the large numbers of NUL bytes all at even or odd locations. Common characters must be checked for; relying only on a test of whether the text is valid UTF-16 fails: the Windows operating system would mis-detect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16.
Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share an overlap in their lower half with ASCII, and all arrangements of bytes are valid in them. There is no technical way to tell these encodings apart, so recognising them relies on identifying language features, such as letter frequencies or spellings. Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding. HTML documents served across the web by HTTP should have their encoding stated out-of-band using the Content-Type: header, for example "Content-Type: text/html;charset=UTF-8". An isolated HTML document, such as one being edited as a file on disk, may imply such a header by a meta tag within the file: <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" > or with a new meta type in HTML5: <meta charset="utf-8" > If the document is Unicode, then some UTF encodings explicitly label the document with an embedded initial byte order mark (BOM). (en) Character encoding detection, charset detection, also called code page detection, is the heuristic guessing of the character encoding of a series of bytes that represent text. The algorithm is usually based on statistical analysis of byte patterns. This is not a foolproof method because it depends on statistical data; for example, some versions of Windows would mistake the ASCII-encoded "" for Chinese UTF-16LE. One of the few cases that can be detected reliably is UTF-8. This is because UTF-8 has a large number of invalid byte sequences, so text in another encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, poorly written charset detection routines do not run the reliable UTF-8 test first and so identify UTF-8 as some other encoding. (zh, translated)
gold:hypernym	Process
prov:wasDerivedFrom	wikipedia-en:Charset_detection?oldid=1046959984&ns=0
page length (characters) of wiki page	4871 (xsd:nonNegativeInteger)
foaf:isPrimaryTopicOf	wikipedia-en:Charset_detection
is rdfs:seeAlso of	Content sniffing
is Link from a Wikipage to another Wikipage of	Mojibake Unicode and HTML SubRip Codepage sniffing Language identification Character encoding detection Plain text Bush hid the facts Extended ASCII
is Wikipage redirect of	Codepage sniffing Character encoding detection
is foaf:primaryTopic of	wikipedia-en:Charset_detection
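
The abstract's points about running the UTF-8 validity test first, the München mojibake, and the NUL-byte pattern of UTF-16 can be illustrated with a small Python sketch. The function names `guess_encoding` and `nul_counts` and the ISO-8859-1 fallback are assumptions made for this example; the sketch does not reproduce the statistical analysis that real detectors perform, and it is not how any particular library works.

```python
import codecs


def guess_encoding(data, fallback="iso-8859-1"):
    """Guess an encoding, running the reliable UTF-8 validity test first."""
    # Some UTF encodings label the text with an initial byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: every byte sequence decodes as ISO-8859-1,
        # so this branch is a guess, not a detection.
        return fallback


def nul_counts(data):
    """Count NUL bytes at even and odd offsets (a UTF-16 hint for Latin text)."""
    return data[0::2].count(0), data[1::2].count(0)


utf8_bytes = "München".encode("utf-8")
print(guess_encoding(utf8_bytes))        # utf-8
print(utf8_bytes.decode("iso-8859-1"))   # MÃ¼nchen (decoded without testing UTF-8)

# A validity test alone is not enough for UTF-16: this ASCII phrase
# decodes without error as UTF-16LE, yielding nine unrelated characters.
ascii_bytes = b"Bush hid the facts"
misread = ascii_bytes.decode("utf-16-le")
print(len(misread))                      # 9

# Genuine UTF-16LE text of the same phrase has a NUL at every odd offset,
# while the plain ASCII bytes contain no NUL bytes at all.
print(nul_counts("Bush hid the facts".encode("utf-16-le")))  # (0, 18)
print(nul_counts(ascii_bytes))                               # (0, 0)
```

Checking for NUL bytes and common characters such as spaces is what separates the two inputs here; testing only whether the bytes form valid UTF-16 does not.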

Faceted Search & Find service v1.17_git139 as of Feb 29 2024

Data on this page belongs to its respective rights holders.
Virtuoso Faceted Browser Copyright © 2009-2024 OpenLink Software