Textractor


The Textractor pipeline extracts and splits text from documents. This pipeline uses Apache Tika (if Java is available) and BeautifulSoup4. See the Apache Tika documentation for a list of supported document formats.

Each document goes through the following process.

  • Content is retrieved if it’s not local
  • If the document mime-type isn’t plain text or HTML, it’s run through Tika and converted to XHTML
  • XHTML is converted to Markdown and returned

Without Apache Tika, this pipeline only supports plain text and HTML. Other document types require Tika and Java to be installed. Another option is to start Apache Tika via the official Apache Tika Docker image.
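The following is a minimal sketch of running in this Tika-free mode. It uses the `tika` parameter from the constructor documented below to force HTML extraction with Beautiful Soup.

```python
from txtai.pipeline import Textractor

# Disable Tika - only plain text and HTML inputs are supported in this mode
textract = Textractor(tika=False)
textract("https://github.com/neuml/txtai")
```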

Example

The following shows a simple example using this pipeline.

```python
from txtai.pipeline import Textractor

# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")
```

See the link below for a more detailed example.

| Notebook | Description |
| -------- | ----------- |
| Extract text from documents | Extract text from PDF, Office, HTML and more |

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

```yaml
# Create pipeline using lower case class name
textractor:

# Run pipeline with workflow
workflow:
  textract:
    tasks:
      - action: textractor
```

Run with Workflows

```python
from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))
```

Run with API

```bash
CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'
```
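The same workflow can also be called from Python with an HTTP client. A sketch using the `requests` library, assuming the API is running on localhost:8000 as started above:

```python
import requests

# POST the workflow name and input elements to the running API instance
response = requests.post(
    "http://localhost:8000/workflow",
    json={"name": "textract", "elements": ["https://github.com/neuml/txtai"]},
)
print(response.json())
```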

Methods

Python documentation for the pipeline.

__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True, sections=False, headers=None)

Source code in txtai/pipeline/data/textractor.py

```python
def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True, sections=False, headers=None):
    if not TIKA:
        raise ImportError('Textractor pipeline is not available - install pipeline extra to enable')

    super().__init__(sentences, lines, paragraphs, minlength, join, sections)

    # Determine if Apache Tika (default if Java is available) or Beautiful Soup should be used
    # Beautiful Soup only supports HTML, Tika supports a wide variety of file formats.
    self.tika = self.checkjava() if tika else False

    # HTML to Text extractor
    self.extract = Extract(self.paragraphs, self.sections)

    # HTTP headers
    self.headers = headers if headers else {}
```
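A short sketch of these parameters in use. The header value is a hypothetical example; `headers` is presumably sent with HTTP requests when retrieving remote content.

```python
from txtai.pipeline import Textractor

# Split extracted text into paragraphs at least 25 characters long and send
# a custom User-Agent header with remote requests (hypothetical value)
textract = Textractor(paragraphs=True, minlength=25, headers={"User-Agent": "demo-agent"})
textract("https://github.com/neuml/txtai")
```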

__call__(text)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If text is a list, a list is returned, which could be a list of text or a list of lists depending on the tokenization strategy.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| text | text\|list | | required |

Returns:

| Type | Description |
| ---- | ----------- |
| | segmented text |

Source code in txtai/pipeline/data/segmentation.py

```python
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned, which could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
```
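To illustrate the return types described above, a brief sketch with hypothetical file names:

```python
from txtai.pipeline import Textractor

textract = Textractor(sentences=True)

# String input returns a single result, here a list of sentences
sentences = textract("document.pdf")

# List input returns one result per element, here a list of sentence lists
results = textract(["document.pdf", "page.html"])
```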