Tabular
The Tabular pipeline splits tabular data into rows and columns. It is most useful for creating (id, text, tag) tuples to load into embeddings indexes.
Example
The following shows a simple example using this pipeline.
from txtai.pipeline import Tabular
# Create and run pipeline
tabular = Tabular("id", ["text"])
tabular("path to csv file")
See the link below for a more detailed example.
Notebook | Description |
---|---|
Transform tabular data with composable workflows | Transform, index and search tabular data |
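To make the (id, text, tag) shape concrete, the following standard-library sketch builds the same kind of tuples from CSV text. It is not the txtai implementation: the helper name `rows_to_tuples`, the ". " separator used to join text columns and the `None` tag are assumptions made for illustration.

```python
import csv
import io


def rows_to_tuples(csvtext, idcolumn, textcolumns, separator=". "):
    """
    Hypothetical helper: builds (id, text, tag) tuples from CSV text.
    Illustrative only, this is not how txtai implements Tabular.
    """

    tuples = []
    for row in csv.DictReader(io.StringIO(csvtext)):
        # Join the non-empty text columns into one text field (separator is an assumption)
        text = separator.join(row[column] for column in textcolumns if row[column])

        # No tag is assigned in this sketch
        tuples.append((row[idcolumn], text, None))

    return tuples


sample = "id,title,text\n1,First,Hello world\n2,Second,Tabular data\n"
tuples = rows_to_tuples(sample, "id", ["title", "text"])
```

Each tuple pairs a row id with the combined text, which is the form an index expects when loading rows.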
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
tabular:
  idcolumn: id
  textcolumns:
    - text

# Run pipeline with workflow
workflow:
  tabular:
    tasks:
      - action: tabular
Run with Workflows
from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tabular", ["path to csv file"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tabular", "elements":["path to csv file"]}'
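The same call can be made from Python. The sketch below builds the POST request from the curl command above using only the standard library. The `run` helper is hypothetical, and it assumes the API responds with JSON and that the server started above is listening on localhost:8000.

```python
import json
import urllib.request

# Same endpoint and payload as the curl command
payload = {"name": "tabular", "elements": ["path to csv file"]}

request = urllib.request.Request(
    "http://localhost:8000/workflow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def run(request):
    """Hypothetical helper: sends the request and decodes a JSON response (assumed format)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Uncomment once the API server is running
# results = run(request)
```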
Methods
Python documentation for the pipeline.
__init__(idcolumn=None, textcolumns=None, content=False)
Creates a new Tabular pipeline.
Parameters:
Name | Description | Default |
---|---|---|
idcolumn | column name to use for row id | None |
textcolumns | list of columns to combine as a text field | None |
content | if True, a dict per row is generated with all fields. If content is a list, a subset of fields is included in the generated rows. | False |
Source code in txtai/pipeline/data/tabular.py
def __init__(self, idcolumn=None, textcolumns=None, content=False):
    """
    Creates a new Tabular pipeline.

    Args:
        idcolumn: column name to use for row id
        textcolumns: list of columns to combine as a text field
        content: if True, a dict per row is generated with all fields. If content is a list, a subset of fields
                 is included in the generated rows.
    """

    if not PANDAS:
        raise ImportError('Tabular pipeline is not available - install "pipeline" extra to enable')

    self.idcolumn = idcolumn
    self.textcolumns = textcolumns
    self.content = content
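The `content` parameter accepts either a boolean or a list of field names. The sketch below illustrates only that documented distinction, all fields versus a subset, with a hypothetical `select_fields` helper. It is not taken from the txtai source, and how the resulting dicts are attached to the pipeline output is not shown here.

```python
def select_fields(row, content):
    """
    Hypothetical helper illustrating the documented `content` semantics:
    a list selects a subset of fields, True selects every field.
    """

    if isinstance(content, list):
        # Subset of fields, skipping names the row does not have
        return {field: row[field] for field in content if field in row}

    if content:
        # All fields
        return dict(row)

    # content=False: no per-row dict is generated
    return None


row = {"id": 1, "text": "Hello world", "author": "a", "year": 2021}

all_fields = select_fields(row, True)
subset = select_fields(row, ["author", "year"])
disabled = select_fields(row, False)
```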
__call__(data)
Splits data into rows and columns.
Parameters:
Name | Description | Default |
---|---|---|
data | input data | required |
Returns:
list of (id, text, tag)
Source code in txtai/pipeline/data/tabular.py
def __call__(self, data):
    """
    Splits data into rows and columns.

    Args:
        data: input data

    Returns:
        list of (id, text, tag)
    """

    items = [data] if not isinstance(data, list) else data

    # Combine all rows into single return element
    results = []
    dicts = []

    for item in items:
        # File path
        if isinstance(item, str):
            _, extension = os.path.splitext(item)
            extension = extension.replace(".", "").lower()

            if extension == "csv":
                df = pd.read_csv(item)

            results.append(self.process(df))

        # Dict
        if isinstance(item, dict):
            dicts.append(item)

        # List of dicts
        elif isinstance(item, list):
            df = pd.DataFrame(item)
            results.append(self.process(df))

    if dicts:
        df = pd.DataFrame(dicts)
        results.extend(self.process(df))

    return results[0] if not isinstance(data, list) else results
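The shape of the return value depends on the shape of the input. The sketch below mirrors the dict and list branches of the `__call__` source above in plain Python. The `dispatch` function and the stand-in `process` are hypothetical names used only for illustration; the real method builds pandas DataFrames, and the CSV file path branch is left out to avoid file access.

```python
def dispatch(data, process):
    """
    Mirrors the dict and list handling of __call__, with `process` standing in
    for the DataFrame-based step. Illustrative sketch only.
    """

    items = [data] if not isinstance(data, list) else data

    results, dicts = [], []
    for item in items:
        if isinstance(item, dict):
            # Loose dicts are collected and processed together as one table
            dicts.append(item)
        elif isinstance(item, list):
            # A nested list of dicts is processed as its own table
            results.append(process(item))

    if dicts:
        results.extend(process(dicts))

    # A non-list input yields a single element, a list input yields a list
    return results[0] if not isinstance(data, list) else results


def process(rows):
    """Hypothetical stand-in for the row processing step: one (id, text, tag) tuple per row."""
    return [(row["id"], row["text"], None) for row in rows]


single = dispatch({"id": 1, "text": "a"}, process)
flat = dispatch([{"id": 1, "text": "a"}, {"id": 2, "text": "b"}], process)
nested = dispatch([[{"id": 1, "text": "a"}], [{"id": 2, "text": "b"}]], process)
```

In this sketch a single dict comes back as one tuple, a list of dicts as a flat list of tuples, and a list of lists as one list of tuples per inner list.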