HTML strip character filter

Strips HTML elements from text and replaces HTML entities with their decoded value (e.g., replaces &amp; with &).

The html_strip filter uses Lucene’s HTMLStripCharFilter.

Example

The following analyze API request uses the html_strip filter to change the text <p>I&apos;m so <b>happy</b>!</p> to \nI'm so happy!\n.

  GET /_analyze
  {
    "tokenizer": "keyword",
    "char_filter": [
      "html_strip"
    ],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
  }

The filter produces the following text:

  [ \nI'm so happy!\n ]
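
To make the transformation above concrete, here is a rough Python approximation built on the standard library's html.parser. It is not Lucene's implementation: the class name HtmlStripSketch, the helper strip_html, and the small BLOCK_TAGS set (tags treated as line breaks) are all assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Assumption: a tiny, illustrative set of block-level tags replaced by "\n".
BLOCK_TAGS = {"p", "div", "li"}


class HtmlStripSketch(HTMLParser):
    """Illustrative approximation of html_strip, not Lucene's code."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &apos; in text data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Block-level tags become a newline; all other tags are dropped.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags is kept, with entities already decoded.
        self.parts.append(data)


def strip_html(text):
    parser = HtmlStripSketch()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


result = strip_html("<p>I&apos;m so <b>happy</b>!</p>")
# result == "\nI'm so happy!\n"
```

Under these assumptions the sketch yields the same string as the example: the <p> tags turn into newlines, the <b> tags disappear, and &apos; is decoded to an apostrophe.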

Add to an analyzer

The following create index API request uses the html_strip filter to configure a new custom analyzer.

  PUT /my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "keyword",
            "char_filter": [
              "html_strip"
            ]
          }
        }
      }
    }
  }

Configurable parameters

escaped_tags

(Optional, array of strings) Array of HTML elements without enclosing angle brackets (< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML element.
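
The effect of escaped_tags can be sketched in Python with the standard library's html.parser. This is only an approximation under stated assumptions, not the filter's actual code: the names EscapedTagsSketch and strip_html, and the BLOCK_TAGS set of tags turned into newlines, are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Assumption: a tiny, illustrative set of block-level tags replaced by "\n".
BLOCK_TAGS = {"p", "div", "li"}


class EscapedTagsSketch(HTMLParser):
    """Illustrative approximation of html_strip with escaped_tags."""

    def __init__(self, escaped_tags=()):
        super().__init__(convert_charrefs=True)  # decode entities in text
        # Tag names are given without angle brackets, e.g. "b" for <b>.
        self.escaped = {t.lower() for t in escaped_tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.escaped:
            # Keep the tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.escaped:
            self.parts.append(f"</{tag}>")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text, escaped_tags=()):
    parser = EscapedTagsSketch(escaped_tags)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


text = "<p>I&apos;m so <b>happy</b>!</p>"
keep_b = strip_html(text, ["b"])  # "\nI'm so <b>happy</b>!\n"
keep_p = strip_html(text, ["p"])  # "<p>I'm so happy!</p>"
```

In this sketch, listing a tag in escaped_tags only protects that element's markup. Other elements are still stripped and entities are still decoded.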

Customize

To customize the html_strip filter, duplicate it to create the basis for a new custom character filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom html_strip filter, my_custom_html_strip_char_filter.

The my_custom_html_strip_char_filter filter skips the removal of the <b> HTML element.

  PUT /my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "keyword",
            "char_filter": [
              "my_custom_html_strip_char_filter"
            ]
          }
        },
        "char_filter": {
          "my_custom_html_strip_char_filter": {
            "type": "html_strip",
            "escaped_tags": [
              "b"
            ]
          }
        }
      }
    }
  }