
NOTE: The behaviour might change in future versions as it is not clear what “wild HTML the real world uses” really implies.

The module can parse a wild HTML document and output it as a valid XHTML document (well, if you are lucky):

  echo loadHtml("mydirty.html")

Every tag in the resulting tree is in lower case.
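
As a small illustration of the lower-casing, the sketch below parses a made-up snippet with mixed-case tags from a string and looks the elements up by their lower-case names. The markup is an assumption chosen for the example; `findAll` and `innerText` come from `std/xmltree`.

```nim
import std/[htmlparser, xmltree]

# Mixed-case input; the markup is a made-up example.
let root = parseHtml("<UL><LI>one</LI><Li>two</Li></UL>")

# Tag names in the tree are lower case, so look-ups use lower case too.
let items = root.findAll("li")

echo root.tag
for item in items:
  echo item.tag, ": ", item.innerText
```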

Note: The resulting XmlNode already uses the clientData field, so it cannot be used by clients of this library.

Example: Transforming hyperlinks

This code demonstrates how you can iterate over all the tags in an HTML file and write back the modified version. In this case we look for hyperlinks ending with the extension .rst and convert them to .html.

  import std/htmlparser
  import std/xmltree  # To use '$' for XmlNode
  import std/strtabs  # To access XmlAttributes
  import std/os       # To use splitFile
  import std/strutils # To use cmpIgnoreCase

  proc transformHyperlinks() =
    let html = loadHtml("input.html")

    for a in html.findAll("a"):
      if a.attrs.hasKey "href":
        let (dir, filename, ext) = splitFile(a.attrs["href"])
        if cmpIgnoreCase(ext, ".rst") == 0:
          a.attrs["href"] = dir / filename & ".html"

    writeFile("output.html", $html)

Imports

strutils, streams, parsexml, xmltree, unicode, strtabs, os

Types

  HtmlTag = enum
    tagUnknown, ## unknown HTML element
    tagA, ## the HTML `a` element
    tagAbbr, ## the HTML `abbr` element
    tagAcronym, ## the deprecated HTML `acronym` element
    tagAddress, ## the HTML `address` element
    tagApplet, ## the deprecated HTML `applet` element
    tagArea, ## the HTML `area` element
    tagArticle, ## the HTML `article` element
    tagAside, ## the HTML `aside` element
    tagAudio, ## the HTML `audio` element
    tagB, ## the HTML `b` element
    tagBase, ## the HTML `base` element
    tagBdi, ## the HTML `bdi` element
    tagBdo, ## the HTML `bdo` element
    tagBasefont, ## the deprecated HTML `basefont` element
    tagBig, ## the HTML `big` element
    tagBlockquote, ## the HTML `blockquote` element
    tagBody, ## the HTML `body` element
    tagBr, ## the HTML `br` element
    tagButton, ## the HTML `button` element
    tagCanvas, ## the HTML `canvas` element
    tagCaption, ## the HTML `caption` element
    tagCenter, ## the deprecated HTML `center` element
    tagCite, ## the HTML `cite` element
    tagCode, ## the HTML `code` element
    tagCol, ## the HTML `col` element
    tagColgroup, ## the HTML `colgroup` element
    tagCommand, ## the HTML `command` element
    tagDatalist, ## the HTML `datalist` element
    tagDd, ## the HTML `dd` element
    tagDel, ## the HTML `del` element
    tagDetails, ## the HTML `details` element
    tagDfn, ## the HTML `dfn` element
    tagDialog, ## the HTML `dialog` element
    tagDiv, ## the HTML `div` element
    tagDir, ## the deprecated HTML `dir` element
    tagDl, ## the HTML `dl` element
    tagDt, ## the HTML `dt` element
    tagEm, ## the HTML `em` element
    tagEmbed, ## the HTML `embed` element
    tagFieldset, ## the HTML `fieldset` element
    tagFigcaption, ## the HTML `figcaption` element
    tagFigure, ## the HTML `figure` element
    tagFont, ## the deprecated HTML `font` element
    tagFooter, ## the HTML `footer` element
    tagForm, ## the HTML `form` element
    tagFrame, ## the HTML `frame` element
    tagFrameset, ## the deprecated HTML `frameset` element
    tagH1, ## the HTML `h1` element
    tagH2, ## the HTML `h2` element
    tagH3, ## the HTML `h3` element
    tagH4, ## the HTML `h4` element
    tagH5, ## the HTML `h5` element
    tagH6, ## the HTML `h6` element
    tagHead, ## the HTML `head` element
    tagHeader, ## the HTML `header` element
    tagHgroup, ## the HTML `hgroup` element
    tagHtml, ## the HTML `html` element
    tagHr, ## the HTML `hr` element
    tagI, ## the HTML `i` element
    tagIframe, ## the deprecated HTML `iframe` element
    tagImg, ## the HTML `img` element
    tagInput, ## the HTML `input` element
    tagIns, ## the HTML `ins` element
    tagIsindex, ## the deprecated HTML `isindex` element
    tagKbd, ## the HTML `kbd` element
    tagKeygen, ## the HTML `keygen` element
    tagLabel, ## the HTML `label` element
    tagLegend, ## the HTML `legend` element
    tagLi, ## the HTML `li` element
    tagLink, ## the HTML `link` element
    tagMap, ## the HTML `map` element
    tagMark, ## the HTML `mark` element
    tagMenu, ## the deprecated HTML `menu` element
    tagMeta, ## the HTML `meta` element
    tagMeter, ## the HTML `meter` element
    tagNav, ## the HTML `nav` element
    tagNobr, ## the deprecated HTML `nobr` element
    tagNoframes, ## the deprecated HTML `noframes` element
    tagNoscript, ## the HTML `noscript` element
    tagObject, ## the HTML `object` element
    tagOl, ## the HTML `ol` element
    tagOptgroup, ## the HTML `optgroup` element
    tagOption, ## the HTML `option` element
    tagOutput, ## the HTML `output` element
    tagP, ## the HTML `p` element
    tagParam, ## the HTML `param` element
    tagPre, ## the HTML `pre` element
    tagProgress, ## the HTML `progress` element
    tagQ, ## the HTML `q` element
    tagRp, ## the HTML `rp` element
    tagRt, ## the HTML `rt` element
    tagRuby, ## the HTML `ruby` element
    tagS, ## the deprecated HTML `s` element
    tagSamp, ## the HTML `samp` element
    tagScript, ## the HTML `script` element
    tagSection, ## the HTML `section` element
    tagSelect, ## the HTML `select` element
    tagSmall, ## the HTML `small` element
    tagSource, ## the HTML `source` element
    tagSpan, ## the HTML `span` element
    tagStrike, ## the deprecated HTML `strike` element
    tagStrong, ## the HTML `strong` element
    tagStyle, ## the HTML `style` element
    tagSub, ## the HTML `sub` element
    tagSummary, ## the HTML `summary` element
    tagSup, ## the HTML `sup` element
    tagTable, ## the HTML `table` element
    tagTbody, ## the HTML `tbody` element
    tagTd, ## the HTML `td` element
    tagTextarea, ## the HTML `textarea` element
    tagTfoot, ## the HTML `tfoot` element
    tagTh, ## the HTML `th` element
    tagThead, ## the HTML `thead` element
    tagTime, ## the HTML `time` element
    tagTitle, ## the HTML `title` element
    tagTr, ## the HTML `tr` element
    tagTrack, ## the HTML `track` element
    tagTt, ## the HTML `tt` element
    tagU, ## the deprecated HTML `u` element
    tagUl, ## the HTML `ul` element
    tagVar, ## the HTML `var` element
    tagVideo, ## the HTML `video` element
    tagWbr ## the HTML `wbr` element

List of all supported HTML tags; the order will always be alphabetical.

Consts

  BlockTags = {tagAddress, tagBlockquote, tagCenter, tagDel, tagDir, tagDiv,
               tagDl, tagFieldset, tagForm, tagH1, tagH2, tagH3, tagH4, tagH5,
               tagH6, tagHr, tagIns, tagIsindex, tagMenu, tagNoframes,
               tagNoscript, tagOl, tagP, tagPre, tagTable, tagUl, tagCenter,
               tagDir, tagIsindex, tagMenu, tagNoframes}


  InlineTags = {tagA, tagAbbr, tagAcronym, tagApplet, tagB, tagBasefont, tagBdo,
                tagBig, tagBr, tagButton, tagCite, tagCode, tagDel, tagDfn, tagEm,
                tagFont, tagI, tagImg, tagIns, tagInput, tagIframe, tagKbd,
                tagLabel, tagMap, tagObject, tagQ, tagSamp, tagScript, tagSelect,
                tagSmall, tagSpan, tagStrong, tagSub, tagSup, tagTextarea, tagTt,
                tagVar, tagApplet, tagBasefont, tagFont, tagIframe, tagU, tagS,
                tagStrike, tagWbr}


  SingleTags = {tagArea, tagBase, tagBasefont, tagBr, tagCol, tagFrame, tagHr,
                tagImg, tagIsindex, tagLink, tagMeta, tagParam, tagWbr, tagSource}
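
The three tag sets above can be combined with `htmlTag` to classify an element name. The following is a minimal sketch, not part of the module: `describe` is a hypothetical helper, and the order of the checks (single tags first) is an arbitrary choice for the example, since a tag such as `br` belongs to more than one set.

```nim
import std/htmlparser

proc describe(name: string): string =
  ## Hypothetical helper: classifies a tag name using the exported sets.
  let tag = htmlTag(name)
  if tag == tagUnknown: "unknown"
  elif tag in SingleTags: "single"   # no closing tag expected
  elif tag in BlockTags: "block"
  elif tag in InlineTags: "inline"
  else: "other"                      # known tag outside the three sets

for name in ["br", "p", "span", "body", "blink"]:
  echo name, " -> ", describe(name)
```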


  tagToStr = ["a", "abbr", "acronym", "address", "applet", "area", "article",
              "aside", "audio", "b", "base", "basefont", "bdi", "bdo", "big",
              "blockquote", "body", "br", "button", "canvas", "caption", "center",
              "cite", "code", "col", "colgroup", "command", "datalist", "dd",
              "del", "details", "dfn", "dialog", "div", "dir", "dl", "dt", "em",
              "embed", "fieldset", "figcaption", "figure", "font", "footer",
              "form", "frame", "frameset", "h1", "h2", "h3", "h4", "h5", "h6",
              "head", "header", "hgroup", "html", "hr", "i", "iframe", "img",
              "input", "ins", "isindex", "kbd", "keygen", "label", "legend", "li",
              "link", "map", "mark", "menu", "meta", "meter", "nav", "nobr",
              "noframes", "noscript", "object", "ol", "optgroup", "option",
              "output", "p", "param", "pre", "progress", "q", "rp", "rt", "ruby",
              "s", "samp", "script", "section", "select", "small", "source",
              "span", "strike", "strong", "style", "sub", "summary", "sup",
              "table", "tbody", "td", "textarea", "tfoot", "th", "thead", "time",
              "title", "tr", "track", "tt", "u", "ul", "var", "video", "wbr"]


Procs

  proc entityToRune(entity: string): Rune {.raises: [], tags: [], forbids: [].}

Converts an HTML entity name like &Uuml; or values like &#220; or &#x000DC; to the corresponding Rune. The entity is passed without the leading & and the trailing ; (see the example). Rune(0) is returned if the entity name is unknown.

Example:

  import std/unicode
  doAssert entityToRune("") == Rune(0)
  doAssert entityToRune("a") == Rune(0)
  doAssert entityToRune("gt") == ">".runeAt(0)
  doAssert entityToRune("Uuml") == "Ü".runeAt(0)
  doAssert entityToRune("quest") == "?".runeAt(0)
  doAssert entityToRune("#x0003F") == "?".runeAt(0)


  proc entityToUtf8(entity: string): string {.raises: [], tags: [], forbids: [].}

Converts an HTML entity name like &Uuml; or values like &#220; or &#x000DC; to its UTF-8 equivalent. The entity is passed without the leading & and the trailing ; (see the example). An empty string ("") is returned if the entity name is unknown. The HTML parser already converts entities to UTF-8.

Example:

  const sigma = "Σ"
  doAssert entityToUtf8("") == ""
  doAssert entityToUtf8("a") == ""
  doAssert entityToUtf8("gt") == ">"
  doAssert entityToUtf8("Uuml") == "Ü"
  doAssert entityToUtf8("quest") == "?"
  doAssert entityToUtf8("#63") == "?"
  doAssert entityToUtf8("Sigma") == sigma
  doAssert entityToUtf8("#931") == sigma
  doAssert entityToUtf8("#0931") == sigma
  doAssert entityToUtf8("#x3A3") == sigma
  doAssert entityToUtf8("#x03A3") == sigma
  doAssert entityToUtf8("#x3a3") == sigma
  doAssert entityToUtf8("#X3a3") == sigma


  proc htmlTag(n: XmlNode): HtmlTag {.raises: [], tags: [], forbids: [].}

Gets n's tag as an HtmlTag.

  proc htmlTag(s: string): HtmlTag {.raises: [], tags: [], forbids: [].}

Converts s to an HtmlTag. If s is not an HTML tag, tagUnknown is returned.
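
A short sketch of both `htmlTag` overloads follows. The tag names and the one-element document are made up for the example; only lower-case names are used here.

```nim
import std/htmlparser

# String overload: known names map to their enum value,
# anything else falls back to tagUnknown.
doAssert htmlTag("a") == tagA
doAssert htmlTag("table") == tagTable
doAssert htmlTag("blink") == tagUnknown

# Node overload: works on a node produced by the parser.
let node = parseHtml("<p>text</p>")
doAssert node.htmlTag == tagP
```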

  proc loadHtml(path: string): XmlNode {.raises: [IOError, OSError, ValueError,
      Exception], tags: [ReadIOEffect, RootEffect, WriteIOEffect], forbids: [].}

Loads and parses HTML from the file specified by path and returns an XmlNode. All parsing errors are ignored.

  proc loadHtml(path: string; errors: var seq[string]): XmlNode {.
      raises: [IOError, OSError, ValueError, Exception],
      tags: [ReadIOEffect, RootEffect, WriteIOEffect], forbids: [].}

Loads and parses HTML from the file specified by path and returns an XmlNode. Every parsing error that occurs is added to the errors sequence.
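
To see the errors overload in action, the sketch below writes a deliberately sloppy document to a temporary file and loads it again. The file name and the markup are assumptions made for the example; the exact wording and number of the collected messages are not specified here, so the code only prints them.

```nim
import std/[htmlparser, xmltree, os]

# Hypothetical file in the system temp directory; <b> is never closed.
let path = getTempDir() / "htmlparser_demo.html"
writeFile(path, "<p>Hello <b>world</p>")

var errors: seq[string] = @[]
let doc = loadHtml(path, errors)

echo doc                      # the repaired tree, serialized via `$`
for e in errors:
  echo "parse error: ", e

removeFile(path)
```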

  proc parseHtml(html: string): XmlNode {.raises: [IOError, OSError, ValueError,
      Exception], tags: [ReadIOEffect, RootEffect, WriteIOEffect], forbids: [].}

Parses the HTML from the string html and returns an XmlNode. All parsing errors are ignored.
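
For instance, the string overload is convenient for pulling data out of a fragment that is already in memory. This sketch collects the `href` values of all links; the markup is made up, and `findAll` and `attr` are procs from `std/xmltree`.

```nim
import std/[htmlparser, xmltree]

let fragment = """<ul>
  <li><a href="a.html">A</a></li>
  <li><a href="b.html">B</a></li>
</ul>"""

let tree = parseHtml(fragment)

# Collect the target of every hyperlink in document order.
var hrefs: seq[string] = @[]
for a in tree.findAll("a"):
  hrefs.add a.attr("href")

echo hrefs
```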

  proc parseHtml(s: Stream): XmlNode {.raises: [IOError, OSError, ValueError,
      Exception], tags: [ReadIOEffect, RootEffect, WriteIOEffect], forbids: [].}

Parses the HTML from the stream s and returns an XmlNode. All parsing errors are ignored.

  proc parseHtml(s: Stream; filename: string; errors: var seq[string]): XmlNode {.
      raises: [IOError, OSError, ValueError, Exception],
      tags: [ReadIOEffect, RootEffect, WriteIOEffect], forbids: [].}

Parses the HTML from the stream s and returns an XmlNode. Every parsing error that occurs is added to the errors sequence.
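
A minimal sketch of the stream overload with error collection, using an in-memory stream from `std/streams`. The markup (list items without closing tags) and the label "inline.html" are assumptions for the example; the label is assumed to serve only for identifying the input in the collected messages.

```nim
import std/[htmlparser, xmltree, streams]

# List items are left unclosed on purpose.
let input = newStringStream("<ul><li>one<li>two</ul>")

var errors: seq[string] = @[]
let tree = parseHtml(input, "inline.html", errors)

# The parser still recovers both list items.
let liCount = tree.findAll("li").len
echo liCount, " list items, ", errors.len, " parse errors"
for e in errors:
  echo e
```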

  proc runeToEntity(rune: Rune): string {.raises: [], tags: [], forbids: [].}

Converts a Rune to its numeric HTML entity equivalent.

Example:

  import std/unicode
  doAssert runeToEntity(Rune(0)) == ""
  doAssert runeToEntity(Rune(-1)) == ""
  doAssert runeToEntity("Ü".runeAt(0)) == "#220"
  doAssert runeToEntity("∈".runeAt(0)) == "#8712"
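
Because `runeToEntity` returns only the part between `&` and `;`, a caller has to add those delimiters to get a complete character reference. The sketch below shows one way to do that; `escapeNonAscii` is a hypothetical helper, not part of the module, and the 127 cut-off is an arbitrary choice for the example.

```nim
import std/[htmlparser, unicode]

proc escapeNonAscii(s: string): string =
  ## Hypothetical helper: replaces every non-ASCII rune with a numeric
  ## character reference built from runeToEntity.
  result = ""
  for r in s.runes:
    if int32(r) > 127:
      result.add "&" & runeToEntity(r) & ";"
    else:
      result.add r.toUTF8

doAssert escapeNonAscii("Grüße") == "Gr&#252;&#223;e"

# The numeric form can be converted back with entityToUtf8.
doAssert entityToUtf8(runeToEntity("Ü".runeAt(0))) == "Ü"
```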
