A demonstration of using the Chakasu encoding library in combination with the Chame HTML parser.
For the most part, this is the same as minidom. However, it also has support for decoding documents with arbitrary character sets using DecoderStream + EncoderStream.
Note: charset decoding is not implemented for the fragment parsing algorithm, because that algorithm is only defined for the UTF-8 character set.
For a version without the encoding library dependency, see minidom.
Procs
proc parseHTML(inputStream: Stream; opts: HTML5ParserOpts[Node, MAtom]; charsets: seq[Charset]; seekable = true; factory = newMAtomFactory()): Document {. ...raises: [IOError, OSError, Exception], tags: [ReadIOEffect, RootEffect], forbids: [].}
Read, parse and return an HTML document from inputStream.
charsets is a list of input character sets to try. If empty, it will be initialized to @[CHARSET_UTF_8].
The list of fallback charsets is used as follows:
- A charset stack is initialized to charsets, reversed, so the first charset specified in charsets is on top of the stack. For example, with charsets = @[CHARSET_UTF_16_LE, CHARSET_UTF_8], utf-16-le is tried before utf-8.
- BOM sniffing is attempted. If successful, confidence is set to certain and the resulting charset is used; no other character sets are tried for decoding this document.
- If the charset stack is empty, UTF-8 is pushed on top.
- Parsing is attempted with the charset on top of the stack.
- If BOM sniffing was unsuccessful, and a <meta charset=...> tag is encountered, parsing is restarted with the specified charset. No further attempts are made to detect the encoding, and decoder errors are signaled by U+FFFD replacement characters.
- Otherwise, each charset on the charset stack is tried until either no decoding errors are encountered, or only one charset is left. For the last charset, decoder errors are signaled by U+FFFD replacement characters.
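The fallback order described above can be modelled in plain Nim. This is only an illustrative sketch using the standard library, not the implementation in this module. DemoCharset, pickCharset and the decodesCleanly callback are hypothetical names standing in for the Charset type and for a real decoding attempt, and the <meta charset=...> restart step is left out for brevity.

```nim
import std/[algorithm, options]

type
  # Hypothetical stand-in for the library's Charset type.
  DemoCharset = enum
    dcUtf8, dcUtf16le, dcShiftJis

proc pickCharset(charsets: seq[DemoCharset]; bom: Option[DemoCharset];
                 decodesCleanly: proc (cs: DemoCharset): bool): DemoCharset =
  ## Returns the charset the document would end up being decoded with.
  # Reverse the list so the first requested charset is on top of the
  # stack (the last element of the seq).
  var stack = charsets.reversed()
  # A sniffed BOM is treated as certain: nothing else is tried.
  if bom.isSome:
    return bom.get
  # An empty stack falls back to UTF-8.
  if stack.len == 0:
    stack.add(dcUtf8)
  # Try each charset until one decodes without errors or only one is left.
  while stack.len > 1:
    let cs = stack.pop()
    if decodesCleanly(cs):
      return cs
  # Last charset: used unconditionally; a real decoder would signal
  # errors with U+FFFD replacement characters here.
  result = stack[0]

# Hypothetical decoding results for the demonstration.
let onlyShiftJis = proc (cs: DemoCharset): bool = cs == dcShiftJis
let neverClean = proc (cs: DemoCharset): bool = false

# UTF-8 is tried first and fails, Shift_JIS is tried next and succeeds.
doAssert pickCharset(@[dcUtf8, dcShiftJis, dcUtf16le],
                     none(DemoCharset), onlyShiftJis) == dcShiftJis
# A BOM overrides the whole list.
doAssert pickCharset(@[dcUtf8], some(dcUtf16le), neverClean) == dcUtf16le
# An empty list behaves like @[dcUtf8].
doAssert pickCharset(@[], none(DemoCharset), neverClean) == dcUtf8
```

The seq is used as a stack with pop() so that the reversal in the first step stays visible in the code; iterating over charsets in its original order would give the same trial order.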
Set seekable to true only if inputStream is seekable; when it is true, inputStream.setPosition(0) must work.
Note that seekable = false disables automatic character set detection; even <meta charset=...> tags will be disregarded. (TODO: this should be improved in the future; theoretically we could still switch between ASCII-compatible charsets before non-ASCII is encountered.)
Exports
TAG_RB, TAG_FACE, HTagTypes, TAG_DFN, TAG_SUMMARY, ==, TAG_DEFINITION_URL, hash, TAG_HTTP_EQUIV, newMiniDOMBuilder, PREFIX_UNKNOWN, TAG_PLAINTEXT, NamespacePrefix, TAG_MATH, CharacterData, PREFIX_XLINK, TAG_NOEMBED, XMLNS, TAG_EMBED, TAG_IMAGE, TAG_TH, TAG_DATALIST, TAG_COL, TAG_TABLE, TAG_INS, TAG_BODY, TAG_PRE, TAG_FRAMESET, TAG_B, TAG_DD, TAG_FONT, TAG_RT, TAG_FORM, TAG_BDO, MiniDOMBuilder, TAG_DIR, TAG_TIME, strToAtom, HTMLTemplateElement, TAG_ABBR, TAG_LINK, TAG_MI, TAG_SPAN, TAG_HEADER, DocumentType, TAG_OBJECT, toTagType, TAG_MGLYPH, TAG_NOSCRIPT, TAG_VIDEO, TAG_KEYGEN, TAG_CANVAS, TAG_IMG, TAG_BLINK, TAG_UNKNOWN, TAG_LI, TAG_OPTGROUP, TAG_SECTION, TAG_FIGURE, TAG_MARQUEE, TAG_MAP, TAG_A, Node, TAG_DETAILS, QuirksMode, TAG_LABEL, TAG_DESC, TAG_DEL, TAG_MO, HTML, TAG_HTML, TAG_WBR, TAG_ENCODING, PREFIX_XML, localNameStr, TAG_SELECT, Element, TAG_VAR, TAG_AREA, TAG_NAV, parseHTML, TAG_SUP, FormAssociatedElements, parseHTMLFragment, TAG_SVG, TAG_BR, TAG_OL, TAG_OPTION, TAG_TFOOT, TAG_H5, TAG_SUB, preInsertionValidity, TAG_KBD, newMAtomFactory, Namespace, TAG_ANNOTATION_XML, parseHTMLFragment, TAG_TRACK, TAG_MARK, TAG_RTC, Document, TAG_Q, AllTagTypes, TAG_PICTURE, MATHML, TAG_H3, TAG_IFRAME, TAG_HEAD, TAG_EM, TAG_NOBR, TAG_HR, Attribute, TAG_CHARSET, TAG_H6, TAG_BLOCKQUOTE, TAG_DL, TAG_CONTENT, TAG_OUTPUT, Comment, TAG_ADDRESS, TAG_MN, TAG_TD, TAG_P, XLINK, TAG_LEGEND, TAG_XMP, TAG_RUBY, TAG_CODE, TAG_CITE, TAG_SAMP, TAG_AUDIO, TAG_FIGCAPTION, atomToStr, TAG_I, TAG_META, TAG_PROGRESS, TAG_STYLE, PREFIX_XMLNS, TAG_FOOTER, TAG_MS, attrsStr, TAG_U, TAG_H4, TAG_BUTTON, TAG_TEXTAREA, TAG_BASEFONT, tagTypeToAtom, TAG_FRAME, TAG_COLOR, ListedElements, TAG_PORTAL, TAG_SOURCE, TAG_TT, TAG_CAPTION, TAG_STRONG, TAG_ASIDE, TAG_MALIGNMARK, TAG_NOFRAMES, TAG_H2, SVG, TAG_TEMPLATE, TAG_LISTING, TAG_TITLE, TAG_BASE, TAG_BGSOUND, TagType, TAG_MENU, TAG_TYP, TAG_DIALOG, TAG_CENTER, TAG_TR, Text, TAG_METER, TAG_DATA, TAG_SIZE, TAG_S, TAG_BIG, TAG_SARCASM, TAG_DT, TAG_RP, TAG_DIV, 
TAG_H1, TAG_TBODY, TAG_MAIN, TAG_THEAD, MAtom, TAG_FIELDSET, TAG_SEARCH, TAG_COLGROUP, XML, TAG_SCRIPT, TAG_ARTICLE, TAG_STRIKE, DocumentFragment, TAG_SMALL, TAG_APPLET, TAG_INPUT, TAG_BDI, TAG_FOREIGN_OBJECT, TAG_UL, NO_PREFIX, TAG_MTEXT, NAMESPACE_UNKNOWN, NO_NAMESPACE, tagType, TAG_PARAM, MAtomFactory, cmp, TAG_HGROUP