HTML strip character filter

The html_strip character filter removes HTML tags, such as <div>, <p>, and <a>, from the input text and outputs plain text. It also decodes HTML entities (for example, &nbsp; is decoded into a space) and can be configured to preserve specific tags by using the escaped_tags parameter.
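To make the behavior described above concrete, the following is a minimal Python sketch that approximates it with the standard library. It is not the filter's actual implementation: the function name strip_html and the regular expression are assumptions made for illustration, and the real filter may differ in details such as whitespace handling around block-level tags.

```python
import html
import re

# Matches an opening or closing tag and captures the tag name.
_TAG = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text, escaped_tags=()):
    """Approximate html_strip: drop tags (except escaped ones), decode entities.

    Hypothetical helper for illustration only.
    """
    keep = {tag.lower() for tag in escaped_tags}

    def replace(match):
        # Leave the tag untouched when its name is in the escaped set.
        return match.group(0) if match.group(1).lower() in keep else ""

    # Remove tags first, then decode entities such as &alpha; into symbols.
    return html.unescape(_TAG.sub(replace, text))


plain = strip_html("<p>Symbols include &alpha; and &beta;</p>")
kept = strip_html("<p>A <b>bold</b> word</p>", escaped_tags=["b"])
```

Here, plain holds the text without the <p> tags and with the entities decoded, while kept still contains the <b> tags because they were listed in escaped_tags.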

Example: HTML analyzer

  GET /_analyze
  {
    "tokenizer": "keyword",
    "char_filter": [
      "html_strip"
    ],
    "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
  }


The html_strip character filter removes the <p> tags and converts the HTML character entity references into their corresponding symbols. The processed text reads as follows:

  Commonly used calculus symbols include α, β and θ

Example: Custom analyzer with lowercase filter

The following example request creates an index named html_strip_and_lowercase_analyzer with a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the html_strip character filter and the lowercase token filter:

  PUT /html_strip_and_lowercase_analyzer
  {
    "settings": {
      "analysis": {
        "char_filter": {
          "html_filter": {
            "type": "html_strip"
          }
        },
        "analyzer": {
          "html_strip_analyzer": {
            "type": "custom",
            "char_filter": ["html_filter"],
            "tokenizer": "standard",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }


Testing html_strip_and_lowercase_analyzer

You can run the following request to test the analyzer:

  GET /html_strip_and_lowercase_analyzer/_analyze
  {
    "analyzer": "html_strip_analyzer",
    "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
  }


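In the response, the HTML tags have been removed and the plain text has been converted to lowercase. Because the analyzer uses the standard tokenizer, the text is split into separate tokens and the exclamation mark is dropped:

  welcome, to, opensearch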
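The analyzer applies its stages in a fixed order: character filter, then tokenizer, then token filter. The following Python sketch mimics that order with the standard library. It is only an approximation: the function name analyze is hypothetical, and the \w+ regular expression is a rough stand-in for the standard tokenizer, which follows Unicode word-boundary rules.

```python
import html
import re


def analyze(text):
    """Rough imitation of html_strip -> standard tokenizer -> lowercase."""
    # 1. Character filter stage: remove tags and decode entities.
    stripped = html.unescape(re.sub(r"<[^>]+>", "", text))
    # 2. Tokenizer stage: split into word tokens, dropping punctuation
    #    (simplified stand-in for the standard tokenizer).
    tokens = re.findall(r"\w+", stripped)
    # 3. Token filter stage: lowercase every token.
    return [token.lower() for token in tokens]


tokens = analyze("<h1>Welcome to <strong>OpenSearch</strong>!</h1>")
# tokens == ["welcome", "to", "opensearch"]
```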

Example: Custom analyzer that preserves HTML tags

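The following example request creates an index named html_strip_preserve_analyzer with a custom analyzer that strips HTML tags but preserves the <b> and <i> tags listed in the escaped_tags parameter: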

  PUT /html_strip_preserve_analyzer
  {
    "settings": {
      "analysis": {
        "char_filter": {
          "html_filter": {
            "type": "html_strip",
            "escaped_tags": ["b", "i"]
          }
        },
        "analyzer": {
          "html_strip_analyzer": {
            "type": "custom",
            "char_filter": ["html_filter"],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
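The two index-creation requests on this page differ only in the tokenizer, the token filters, and the optional escaped_tags list. If you generate such request bodies from code, a small helper like the following Python sketch could build them. The helper name build_html_strip_settings is hypothetical and not part of any OpenSearch client; it only assembles the JSON body and does not send a request.

```python
import json


def build_html_strip_settings(tokenizer, token_filters=(), escaped_tags=()):
    """Build an index settings body with an html_strip-based custom analyzer.

    Hypothetical helper; the names html_filter and html_strip_analyzer
    match the examples on this page.
    """
    char_filter = {"type": "html_strip"}
    if escaped_tags:
        # Tags listed here are left in the text instead of being stripped.
        char_filter["escaped_tags"] = list(escaped_tags)

    analyzer = {
        "type": "custom",
        "char_filter": ["html_filter"],
        "tokenizer": tokenizer,
    }
    if token_filters:
        analyzer["filter"] = list(token_filters)

    return {
        "settings": {
            "analysis": {
                "char_filter": {"html_filter": char_filter},
                "analyzer": {"html_strip_analyzer": analyzer},
            }
        }
    }


lowercase_body = build_html_strip_settings("standard", token_filters=["lowercase"])
preserve_body = build_html_strip_settings("keyword", escaped_tags=["b", "i"])
print(json.dumps(preserve_body, indent=2))
```

The printed JSON can be used as the body of a PUT request that creates the index.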


Testing html_strip_preserve_analyzer

You can run the following request to test the analyzer:

  GET /html_strip_preserve_analyzer/_analyze
  {
    "analyzer": "html_strip_analyzer",
    "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
  }


In the response, the <p> tags have been removed, while the bold and italic tags have been retained, as specified by the escaped_tags parameter in the custom analyzer request:

  This is a <b>bold</b> and <i>italic</i> text.