Server-side DOM and parsing

elements

The DIV helper and all derived helpers provide the search methods element and elements.

element returns the first child element matching a specified condition (or None if no match).

elements returns a list of all matching children.

element and elements use the same syntax to specify the matching condition. Three kinds of condition are supported, and they can be mixed and matched: jQuery-like expressions, matching by exact attribute value, and matching by regular expression.

Here is a simple example:

    >>> a = DIV(DIV(DIV('a', _id='target', _class='abc')))
    >>> d = a.elements('div#target')
    >>> d[0][0] = 'changed'
    >>> print(a)
    <div><div><div id="target" class="abc">changed</div></div></div>

The un-named argument of elements is a string, which may contain: the name of a tag, the id of a tag preceded by a pound symbol, the class preceded by a dot, the explicit value of an attribute in square brackets.
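To make the selector grammar above concrete, here is a minimal sketch of how such a string could be split into a tag name and attribute conditions. It is not web2py's own parser: parse_selector and parse_selectors are hypothetical names introduced only for illustration, and the sketch handles just the four forms listed above (tag, #id, .class, [attribute=value]).

```python
import re

# One token is either an optional '#' or '.' followed by a name,
# or an explicit [attribute=value] pair.
TOKEN = re.compile(r"([#.]?)([\w-]+)|\[([\w-]+)=([^\]]*)\]")


def parse_selector(selector):
    """Hypothetical helper: return (tag, conditions) for one selector.

    Conditions use the same underscore-prefixed keys as the named
    arguments in the text, e.g. {'_id': 'target'}.
    """
    tag, conditions = None, {}
    pos = 0
    while pos < len(selector):
        match = TOKEN.match(selector, pos)
        if match is None:
            raise ValueError("cannot parse %r at offset %d" % (selector, pos))
        prefix, name, key, value = match.groups()
        if key is not None:            # [attribute=value]
            conditions["_" + key] = value
        elif prefix == "#":            # #id
            conditions["_id"] = name
        elif prefix == ".":            # .class (a later class overwrites an earlier one)
            conditions["_class"] = name
        elif pos == 0:                 # a bare name is only valid as the leading tag
            tag = name
        else:
            raise ValueError("unexpected name %r in %r" % (name, selector))
        pos = match.end()
    return tag, conditions


def parse_selectors(text):
    """Split a comma-separated selector list and parse each part."""
    return [parse_selector(part.strip()) for part in text.split(",")]
```

Under these assumptions, 'div#target' and 'div[id=target]' both reduce to the tag 'div' plus the condition _id='target', which is why the different spellings in the text select the same element.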

Here are 4 equivalent ways to search the previous tag by id:

    d = a.elements('#target')
    d = a.elements('div#target')
    d = a.elements('div[id=target]')
    d = a.elements('div', _id='target')

Here are 4 equivalent ways to search the previous tag by class:

    d = a.elements('.abc')
    d = a.elements('div.abc')
    d = a.elements('div[class=abc]')
    d = a.elements('div', _class='abc')

Any attribute, not just id and class, can be used to locate an element, and several attributes can be combined because element and elements accept multiple named arguments. Keep in mind that element returns only the first match, while elements returns all of them.

With the jQuery-like syntax (for example "div#target"), several search criteria can be given in one string, separated by commas:

    a = DIV(SPAN('a', _id='t1'), DIV('b', _class='c2'))
    d = a.elements('span#t1, div.c2')

or equivalently

    a = DIV(SPAN('a', _id='t1'), DIV('b', _class='c2'))
    d = a.elements('span#t1', 'div.c2')

If the value of an attribute is specified using a named argument, it can be a string or a compiled regular expression:

    import re
    a = DIV(SPAN('a', _id='test123'), DIV('b', _class='c2'))
    d = a.elements('span', _id=re.compile(r'test\d{3}'))
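The matching rules described so far can be sketched in a few lines. The code below is an illustration only, not web2py's implementation: Node, value_matches, elements, and element are hypothetical stand-ins, the sketch considers the root node as a candidate too, and it assumes that a compiled pattern is applied with search rather than match.

```python
import re


class Node(object):
    """Hypothetical stand-in for a helper: a tag, children, and attributes."""

    def __init__(self, tag, *children, **attrs):
        self.tag = tag
        self.children = list(children)
        self.attrs = attrs


def value_matches(actual, expected):
    """Exact comparison for strings, pattern search for compiled regexes."""
    if actual is None:
        return False
    if hasattr(expected, "search"):      # compiled regular expression
        return expected.search(actual) is not None
    return actual == expected


def elements(node, tag=None, **conditions):
    """Depth-first search returning every node that meets all conditions."""
    found = []
    tag_ok = tag is None or node.tag == tag
    attrs_ok = all(value_matches(node.attrs.get(key), value)
                   for key, value in conditions.items())
    if tag_ok and attrs_ok:
        found.append(node)
    for child in node.children:
        if isinstance(child, Node):      # plain strings have no attributes
            found.extend(elements(child, tag, **conditions))
    return found


def element(node, tag=None, **conditions):
    """Return the first match, or None when nothing matches."""
    found = elements(node, tag, **conditions)
    return found[0] if found else None


a = Node('div', Node('span', 'a', _id='test123'), Node('div', 'b', _class='c2'))
d = elements(a, 'span', _id=re.compile(r'test\d{3}'))
```

Because every named condition must hold at once, adding more named arguments can only narrow the result.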

A special named argument of the DIV (and derived) helpers is find. It can be used to specify a search value or a search regular expression in the text content of the tag. For example:

    >>> a = DIV(SPAN('abcde'), DIV('fghij'))
    >>> d = a.elements(find='bcd')
    >>> print(d[0])
    <span>abcde</span>

or

    >>> import re
    >>> a = DIV(SPAN('abcde'), DIV('fghij'))
    >>> d = a.elements(find=re.compile(r'fg\w{3}'))
    >>> print(d[0])
    <div>fghij</div>

components

The components attribute holds the direct children of an element. Here is an example that lists the top-level elements parsed from an HTML string:

    >>> html = TAG('<a>xxx</a><b>yyy</b>')
    >>> for item in html.components:
    ...     print(item)
    ...
    <a>xxx</a>
    <b>yyy</b>

parent and siblings

parent returns the parent of the current element, and siblings returns the other children of that parent.

    >>> a = DIV(SPAN('a'), DIV('b'))
    >>> s = a.element('span')
    >>> d = s.parent
    >>> d['_class'] = 'abc'
    >>> print(a)
    <div class="abc"><span>a</span><div>b</div></div>
    >>> for e in s.siblings(): print(e)
    <div>b</div>
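Upward navigation of this kind only requires each child to remember its parent. The following sketch shows one way to arrange that; Node here is a hypothetical class written for this illustration and says nothing about how web2py stores the link internally.

```python
class Node(object):
    """Hypothetical tree node that records its parent when it is attached."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.parent = None
        self.children = list(children)
        for child in self.children:
            if isinstance(child, Node):   # plain strings carry no parent link
                child.parent = self

    def siblings(self):
        """Return the other Node children of this node's parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children
                if isinstance(c, Node) and c is not self]


span = Node('span', 'a')
inner = Node('div', 'b')
root = Node('div', span, inner)
```

With the link in place, span.parent is root, and span.siblings() contains only inner.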

Replacing elements

Elements that are matched can also be replaced or removed by specifying the replace argument. Notice that a list of the original matching elements is still returned as usual.

    >>> a = DIV(SPAN('x'), DIV(SPAN('y')))
    >>> b = a.elements('span', replace=P('z'))
    >>> print(a)
    <div><p>z</p><div><p>z</p></div></div>

replace can be a callable. In this case it will be passed the original element and it is expected to return the replacement element:

    >>> a = DIV(SPAN('x'), DIV(SPAN('y')))
    >>> b = a.elements('span', replace=lambda t: P(t[0]))
    >>> print(a)
    <div><p>x</p><div><p>y</p></div></div>

If replace=None, matching elements will be removed completely.

    >>> a = DIV(SPAN('x'), DIV(SPAN('y')))
    >>> b = a.elements('span', replace=None)
    >>> print(a)
    <div><div></div></div>
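The three behaviours of replace (a fixed element, a callable, or None) can be summarised in a short sketch. It is a simplified model rather than web2py's code: Node and replace_elements are hypothetical names, and matching is done by tag name only.

```python
class Node(object):
    """Hypothetical element: a tag name plus a list of children."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.children = list(children)

    def __str__(self):
        inner = "".join(str(child) for child in self.children)
        return "<%s>%s</%s>" % (self.tag, inner, self.tag)


def replace_elements(root, tag, replace):
    """Replace or remove every descendant with the given tag.

    replace may be a Node (used as is), a callable (called with the
    original element), or None (the element is dropped). The original
    matching elements are returned, as in the text above.
    """
    matched, kept = [], []
    for child in root.children:
        if isinstance(child, Node) and child.tag == tag:
            matched.append(child)
            if replace is None:
                continue                      # remove the element entirely
            kept.append(replace(child) if callable(replace) else replace)
        else:
            if isinstance(child, Node):       # keep searching deeper
                matched.extend(replace_elements(child, tag, replace))
            kept.append(child)
    root.children = kept
    return matched


a = Node('div', Node('span', 'x'), Node('div', Node('span', 'y')))
originals = replace_elements(a, 'span', lambda t: Node('p', *t.children))
```

Note that None is a meaningful value here, so a real API needs a different default (a sentinel) to mean "do not replace anything".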

flatten

The flatten method recursively serializes the content of the children of a given element into regular text (without tags):

    >>> a = DIV(SPAN('this', DIV('is', B('a'))), SPAN('test'))
    >>> print(a.flatten())
    thisisatest

flatten accepts an optional argument, render: a function that renders the content of each tag using a different protocol. Here is an example that serializes some tags into Markmin wiki syntax:

    >>> a = DIV(H1('title'), P('example of a ', A('link', _href='#test')))
    >>> from gluon.html import markmin_serializer
    >>> print(a.flatten(render=markmin_serializer))
    # title

    example of a [[link #test]]

At the time of writing we provide markmin_serializer and markdown_serializer.
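To show what a render function has to do, here is a self-contained sketch of a recursive flatten with a pluggable renderer. Node, flatten, and toy_render are hypothetical names written for this illustration, and the (text, tag, attr) call signature is an assumption modelled on the description above, so check the serializers shipped with your web2py version before relying on it.

```python
class Node(object):
    """Hypothetical element with a tag, children, and attributes."""

    def __init__(self, tag, *children, **attrs):
        self.tag = tag
        self.children = list(children)
        self.attrs = attrs


def flatten(node, render=None):
    """Concatenate the text of all descendants, optionally through render."""
    if not isinstance(node, Node):            # a plain text leaf
        text = str(node)
        return render(text) if render else text
    text = "".join(flatten(child, render) for child in node.children)
    # Each tag gets a chance to wrap the already-flattened text of its children.
    return render(text, node.tag, node.attrs) if render else text


def toy_render(text, tag=None, attr=None):
    """Toy wiki-style renderer covering only h1, p, and a (an assumption)."""
    attr = attr or {}
    if tag == 'h1':
        return '# ' + text + '\n\n'
    if tag == 'p':
        return text + '\n\n'
    if tag == 'a':
        return '[[%s %s]]' % (text, attr.get('_href', ''))
    return text                                # unknown tags pass through


doc = Node('div',
           Node('h1', 'title'),
           Node('p', 'example of a ', Node('a', 'link', _href='#test')))
plain = flatten(doc)
wiki = flatten(doc, render=toy_render)
```

The renderer is called bottom-up, so the link is already converted by the time the enclosing paragraph is rendered.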

Parsing

The TAG object is also an XML/HTML parser. It can read text and convert it into a tree structure of helpers, which can then be manipulated using the API above:

    >>> html = '<h1>Title</h1><p>this is a <span>test</span></p>'
    >>> parsed_html = TAG(html)
    >>> parsed_html.element('span')[0] = 'TEST'
    >>> print(parsed_html)
    <h1>Title</h1><p>this is a <span>TEST</span></p>
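For readers curious about what such a parser does, the same round trip can be sketched with the Python 3 standard library alone. This is not how TAG is implemented: TreeBuilder, parse, and serialize are hypothetical names, the tree is made of plain dictionaries, attributes are stored but not written back out, and void tags such as br are not handled.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Hypothetical builder that turns markup into nested dictionaries."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.root = {"tag": None, "attrs": {}, "children": []}
        self.stack = [self.root]              # currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index]["tag"] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        self.stack[-1]["children"].append(data)


def parse(markup):
    """Return the root of the tree built from markup."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()                           # flush any buffered text
    return builder.root


def serialize(node):
    """Write the tree back out (attributes omitted in this sketch)."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node["children"])
    if node["tag"] is None:                   # the synthetic root has no tag
        return inner
    return "<%s>%s</%s>" % (node["tag"], inner, node["tag"])


tree = parse('<h1>Title</h1><p>this is a <span>test</span></p>')
paragraph = tree["children"][1]
span = paragraph["children"][1]
span["children"][0] = 'TEST'                  # edit the tree in place
result = serialize(tree)
```

Editing the tree and serializing it again reproduces the markup with the span text changed, which mirrors the TAG example above.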