

The Segmentation pipeline segments text into semantic units.


The following shows a simple example using this pipeline.

    from txtai.pipeline import Segmentation

    # Create and run pipeline
    segment = Segmentation(sentences=True)
    segment("This is a test. And another test.")

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.


    # Create pipeline using lower case class name
    segmentation:
      sentences: true

    # Run pipeline with workflow
    workflow:
      segment:
        tasks:
          - action: segmentation

Run with Workflows

    from txtai import Application

    # Create and run pipeline with workflow
    app = Application("config.yml")
    list(app.workflow("segment", ["This is a test. And another test."]))

Run with API

    CONFIG=config.yml uvicorn "txtai.api:app" &

    curl \
      -X POST "http://localhost:8000/workflow" \
      -H "Content-Type: application/json" \
      -d '{"name":"segment", "elements":["This is a test. And another test."]}'
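The same request can also be built from Python. The following is a minimal sketch using only the standard library; the helper names `build_workflow_request` and `run_workflow` are hypothetical and not part of txtai. It assumes the API is running at the address used in the curl command above and that the endpoint responds with JSON.

```python
import json
import urllib.request


def build_workflow_request(name, elements, base_url="http://localhost:8000"):
    """Hypothetical helper: builds the POST request sent by the curl example."""
    # Same JSON body as the curl -d argument
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")

    return urllib.request.Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements, base_url="http://localhost:8000"):
    """Hypothetical helper: sends the request (assumes a running API and a JSON response)."""
    request = build_workflow_request(name, elements, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running server
request = build_workflow_request("segment", ["This is a test. And another test."])
print(request.get_method(), request.full_url)
```

Calling `run_workflow("segment", [...])` only works once the API process from the previous command is up.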


Python documentation for the pipeline.

__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False)

Creates a new Segmentation pipeline.



Parameters:

    sentences: tokenizes text into sentences if True, defaults to False
    lines: tokenizes text into lines if True, defaults to False
    paragraphs: tokenizes text into paragraphs if True, defaults to False
    minlength: require at least minlength characters per text element, defaults to None
    join: joins tokenized sections back together if True, defaults to False
    sections: tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available
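To make the interaction between a splitting option, `minlength` and `join` more concrete, here is a rough standalone sketch. It is not the library's implementation: `segment_paragraphs` is a hypothetical helper, and treating blank lines as paragraph boundaries is an assumption made for this illustration.

```python
import re


def segment_paragraphs(text, minlength=None, join=False):
    """Hypothetical helper: split on blank lines, filter short pieces, optionally rejoin."""
    # Assumption for this sketch: one or more blank lines mark a paragraph boundary
    pieces = [re.sub(r"\s+", " ", piece).strip() for piece in re.split(r"\n\s*\n", text)]

    # Drop empty pieces and, when minlength is set, pieces shorter than minlength characters
    pieces = [p for p in pieces if p and (minlength is None or len(p) >= minlength)]

    # join=True returns a single string instead of a list
    return " ".join(pieces) if join else pieces


text = "First paragraph.\n\nOk\n\nSecond paragraph, a bit longer."

print(segment_paragraphs(text))
print(segment_paragraphs(text, minlength=5))
print(segment_paragraphs(text, minlength=5, join=True))
```

In this sketch, `minlength=5` drops the two-character piece "Ok", and `join=True` collapses what remains into one string.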


Source code in txtai/pipeline/data/

    def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False):
        """
        Creates a new Segmentation pipeline.

        Args:
            sentences: tokenizes text into sentences if True, defaults to False
            lines: tokenizes text into lines if True, defaults to False
            paragraphs: tokenizes text into paragraphs if True, defaults to False
            minlength: require at least minlength characters per text element, defaults to None
            join: joins tokenized sections back together if True, defaults to False
            sections: tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available
        """

        if not NLTK:
            raise ImportError("Segmentation pipeline is not available - install pipeline extra to enable")

        self.sentences = sentences
        self.lines = lines
        self.paragraphs = paragraphs
        self.sections = sections
        self.minlength = minlength
        self.join = join


__call__(text)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; this could be a list of text or a list of lists, depending on the tokenization strategy.

Parameters:

    text: text|list

Returns:

    segmented text
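The return shapes described above can be illustrated with a small standalone sketch. The `segment` function below is a hypothetical stand-in for the pipeline call, and `split_sentences` is a naive regular-expression splitter used only for illustration; it is not the tokenizer the library uses.

```python
import re


def split_sentences(text):
    """Naive stand-in tokenizer: splits after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(text):
    """Hypothetical stand-in for the pipeline call, showing only the input/output shapes."""
    # Wrap a single string so both input types share one code path
    texts = [text] if not isinstance(text, list) else text

    results = [split_sentences(value) for value in texts]

    # A string input returns its own result; a list input returns one result per element
    return results[0] if isinstance(text, str) else results


print(segment("This is a test. And another test."))
print(segment(["One. Two.", "Three."]))
```

With a tokenizing strategy such as sentence splitting, a string input gives a flat list and a list input gives a list of lists, one inner list per input element.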

Source code in txtai/pipeline/data/

    def __call__(self, text):
        """
        Segments text into semantic units.

        This method supports text as a string or a list. If the input is a string, the return
        type is text|list. If the input is a list, a list is returned; this could be a
        list of text or a list of lists, depending on the tokenization strategy.

        Args:
            text: text|list

        Returns:
            segmented text
        """

        # Get inputs
        texts = [text] if not isinstance(text, list) else text

        # Extract text for each input file
        results = []
        for value in texts:
            # Get text
            value = self.text(value)

            # Parse and add extracted results
            results.append(self.parse(value))

        return results[0] if isinstance(text, str) else results