Tabular

The Tabular pipeline splits tabular data into rows and columns. It is most useful for creating (id, text, tag) tuples to load into Embeddings indexes.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Tabular
# Create and run pipeline
tabular = Tabular("id", ["text"])
tabular("path to csv file")
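To make the output shape concrete, the following is a minimal standard-library sketch of the transformation described above. It is not txtai's implementation: the helper name `rows_to_tuples`, the in-memory sample data and the `". "` separator used to join text columns are all assumptions made for illustration, and the tag slot is simply left as `None`.

```python
import csv
import io


def rows_to_tuples(handle, idcolumn, textcolumns, separator=". "):
    """Hypothetical helper: turn CSV rows into (id, text, tag) tuples."""
    tuples = []
    for index, row in enumerate(csv.DictReader(handle)):
        # Use the id column when it has a value, otherwise fall back to the row position
        uid = row.get(idcolumn) or index
        # Combine the non-empty text columns into a single text field
        text = separator.join(row[column] for column in textcolumns if row.get(column))
        # Tag is left empty in this sketch
        tuples.append((uid, text, None))
    return tuples


# In-memory sample standing in for a CSV file on disk
sample = io.StringIO("id,text\n1,first row\n2,second row\n")
print(rows_to_tuples(sample, "id", ["text"]))
```

Note that the `csv` module yields every value as a string, so ids in this sketch are strings unless converted.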

See the link below for a more detailed example.

Notebook	Description
Transform tabular data with composable workflows	Transform, index and search tabular data

Configuration-driven example

Pipelines can be run from Python or from configuration. In configuration, a pipeline is instantiated using the lowercase name of its class. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
tabular:
  idcolumn: id
  textcolumns:
    - text
# Run pipeline with workflow
workflow:
  tabular:
    tasks:
      - action: tabular

Run with Workflows

from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tabular", ["path to csv file"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tabular", "elements":["path to csv file"]}'
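The same request can be built from Python. The sketch below uses only the standard library and mirrors the curl call above (same URL, header and JSON body); the helper name `workflow_request` is hypothetical, and the request is constructed but not sent, since sending it assumes the API server is running locally.

```python
import json
import urllib.request


def workflow_request(name, elements, base_url="http://localhost:8000"):
    """Hypothetical helper: build the POST request for the /workflow endpoint."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = workflow_request("tabular", ["path to csv file"])

# Sending requires the API server from the command above to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```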

Methods

Python documentation for the pipeline.

__init__(idcolumn=None, textcolumns=None, content=False)

Creates a new Tabular pipeline.

Parameters:

Name	Description	Default
`idcolumn`	column name to use for row id	`None`
`textcolumns`	list of columns to combine as a text field	`None`
`content`	if True, a dict per row is generated with all fields. If content is a list, a subset of fields is included in the generated rows.	`False`
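The three forms of `content` can be read as a field-selection rule. The sketch below illustrates that rule for a single row represented as a dict; `select_content` is a hypothetical helper written from the parameter description above, not a function in txtai, and how the library attaches the resulting dict to its output is not shown here.

```python
def select_content(row, content):
    """Hypothetical helper: choose which fields of a row to keep."""
    if content is True:
        # Keep every field of the row
        return dict(row)
    if isinstance(content, list):
        # Keep only the requested fields that exist in the row
        return {field: row[field] for field in content if field in row}
    # content is False: no dict is generated for the row
    return None


row = {"id": 1, "text": "first row", "author": "someone"}
print(select_content(row, True))
print(select_content(row, ["text"]))
print(select_content(row, False))
```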

Source code in txtai/pipeline/data/tabular.py

def __init__(self, idcolumn=None, textcolumns=None, content=False):
    """
    Creates a new Tabular pipeline.
    Args:
        idcolumn: column name to use for row id
        textcolumns: list of columns to combine as a text field
        content: if True, a dict per row is generated with all fields. If content is a list, a subset of fields
                 is included in the generated rows.
    """
    if not PANDAS:
        raise ImportError('Tabular pipeline is not available - install "pipeline" extra to enable')
    self.idcolumn = idcolumn
    self.textcolumns = textcolumns
    self.content = content

__call__(data)

Splits data into rows and columns.

Parameters:

Name	Type	Description	Default
`data`		input data: a CSV file path, a dict, or a list whose items are CSV file paths, dicts or lists of dicts	required

Returns:

Type	Description
	list of (id, text, tag)

Source code in txtai/pipeline/data/tabular.py

def __call__(self, data):
    """
    Splits data into rows and columns.
    Args:
        data: input data
    Returns:
        list of (id, text, tag)
    """
    items = [data] if not isinstance(data, list) else data
    # Combine all rows into single return element
    results = []
    dicts = []
    for item in items:
        # File path
        if isinstance(item, str):
            _, extension = os.path.splitext(item)
            extension = extension.replace(".", "").lower()
            if extension == "csv":
                df = pd.read_csv(item)
            results.append(self.process(df))
        # Dict
        if isinstance(item, dict):
            dicts.append(item)
        # List of dicts
        elif isinstance(item, list):
            df = pd.DataFrame(item)
            results.append(self.process(df))
    if dicts:
        df = pd.DataFrame(dicts)
        results.extend(self.process(df))
    return results[0] if not isinstance(data, list) else results
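In the listing above, a string input is only loaded when its extension is `csv`; for any other extension no new DataFrame is read for that item. A caller may therefore want to check paths before passing them in. The sketch below repeats the same extension check using only the standard library; `is_csv` and `split_by_support` are hypothetical helper names, not part of txtai.

```python
import os


def is_csv(path):
    """Apply the same extension check as __call__: drop the dot and lowercase."""
    _, extension = os.path.splitext(path)
    return extension.replace(".", "").lower() == "csv"


def split_by_support(paths):
    """Hypothetical helper: separate CSV paths from all other paths."""
    supported = [path for path in paths if is_csv(path)]
    skipped = [path for path in paths if not is_csv(path)]
    return supported, skipped


supported, skipped = split_by_support(["data/a.csv", "data/b.CSV", "notes.txt"])
print(supported, skipped)
```

Only the `supported` list would then be passed to the pipeline.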