Textractor
The Textractor pipeline extracts and splits text from documents. This pipeline uses either an Apache Tika backend (if Java is available) or BeautifulSoup4.
Example
The following shows a simple example using this pipeline.
from txtai.pipeline import Textractor
# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")
See the link below for a more detailed example.
Notebook | Description | |
---|---|---|
Extract text from documents | Extract text from PDF, Office, HTML and more |
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
textractor:
# Run pipeline with workflow
workflow:
textract:
tasks:
- action: textractor
Run with Workflows
from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
-X POST "http://localhost:8000/workflow" \
-H "Content-Type: application/json" \
-d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'
Methods
Python documentation for the pipeline.
Source code in txtai/pipeline/data/textractor.py
|
|
Segments text into semantic units.
This method supports text as a string or a list. If the input is a string, the return type is text|list. If text is a list, a list of returned, this could be a list of text or a list of lists depending on the tokenization strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | text|list | required |
Returns:
Type | Description |
---|---|
segmented text |
Source code in txtai/pipeline/data/segmentation.py
|
|