HFTrainer
Trains a new Hugging Face Transformer model using the Trainer framework.
Example
The following shows a simple example using this pipeline.
import pandas as pd

from datasets import load_dataset

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Pandas DataFrame
df = pd.read_csv("training.csv")
model, tokenizer = trainer("bert-base-uncased", df)

# Hugging Face dataset
ds = load_dataset("glue", "sst2")
model, tokenizer = trainer("bert-base-uncased", ds["train"], columns=("sentence", "label"))

# List of dicts
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
model, tokenizer = trainer("bert-base-uncased", dt)

# Support additional TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt,
                           learning_rate=3e-5, num_train_epochs=5)
All TrainingArguments are supported as function arguments to the trainer call.
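For example, fields such as output_dir, per_device_train_batch_size, weight_decay and logging_steps can be passed directly. The sketch below is illustrative; the output path and hyperparameter values are arbitrary choices, not defaults.

```python
# Any transformers TrainingArguments field can be passed as a keyword argument.
# The output path and hyperparameter values below are illustrative, not defaults.
model, tokenizer = trainer("bert-base-uncased", dt,
                           output_dir="models/labeler",
                           per_device_train_batch_size=16,
                           weight_decay=0.01,
                           logging_steps=100)
```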
See the links below for more detailed examples.
Notebook | Description |
---|---|
Train a text labeler | Build text sequence classification models |
Train without labels | Use zero-shot classifiers to train new models |
Train a QA model | Build and fine-tune question-answering models |
Train a language model from scratch | Build new language models |
Training tasks
The HFTrainer pipeline builds and/or fine-tunes models for the following training tasks.
Task | Description |
---|---|
language-generation | Causal language model for text generation (e.g. GPT) |
language-modeling | Masked language model for general tasks (e.g. BERT) |
question-answering | Extractive question-answering model, typically with the SQuAD dataset |
sequence-sequence | Sequence-to-sequence model (e.g. T5) |
text-classification | Classify text with a set of labels |
token-detection | ELECTRA-style pre-training with replaced token detection |
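As a sketch of selecting a non-default task, the call below fine-tunes a sequence-to-sequence model. The records, column names ("source", "target") and prefix are assumptions made for illustration; the columns argument maps which fields hold the input text and target.

```python
from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Hypothetical parallel text records; the column names are arbitrary
data = [
    {"source": "the cat sat on the mat", "target": "le chat s'est assis sur le tapis"},
    {"source": "good morning", "target": "bonjour"}
]

# Map source/target columns and fine-tune a sequence-to-sequence model
model, tokenizer = trainer("t5-small", data,
                           task="sequence-sequence",
                           columns=("source", "target"),
                           prefix="translate English to French: ")
```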
Methods
Python documentation for the pipeline.
Calling the pipeline builds a new model using the arguments below.
Parameters:
Name | Description | Default |
---|---|---|
base | path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple | required |
train | training data | required |
validation | validation data | None |
columns | tuple of columns to use for text/label, defaults to (text, None, label) | None |
maxlength | maximum sequence length, defaults to tokenizer.model_max_length | None |
stride | chunk size for splitting data for QA tasks | 128 |
task | optional model task or category, determines the model type, defaults to "text-classification" | "text-classification" |
prefix | optional source prefix | None |
metrics | optional function that computes and returns a dict of evaluation metrics | None |
tokenizers | optional number of concurrent tokenizers, defaults to None | None |
checkpoint | optional resume from checkpoint flag or path to checkpoint directory, defaults to None | None |
args | training arguments | {} |
Returns:
Type | Description |
---|---|
(model, tokenizer) | trained model and tokenizer |
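The returned objects are standard Hugging Face model and tokenizer instances, so they can be persisted and reloaded with the usual Transformers methods. A minimal sketch, assuming the default text-classification task and an arbitrary output directory:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Save the trained model and tokenizer (directory name is arbitrary)
model.save_pretrained("models/labeler")
tokenizer.save_pretrained("models/labeler")

# Reload later for inference
model = AutoModelForSequenceClassification.from_pretrained("models/labeler")
tokenizer = AutoTokenizer.from_pretrained("models/labeler")
```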
Source code in txtai/pipeline/train/hftrainer.py
def __call__(
    self,
    base,
    train,
    validation=None,
    columns=None,
    maxlength=None,
    stride=128,
    task="text-classification",
    prefix=None,
    metrics=None,
    tokenizers=None,
    checkpoint=None,
    **args
):
    """
    Builds a new model using arguments.

    Args:
        base: path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple
        train: training data
        validation: validation data
        columns: tuple of columns to use for text/label, defaults to (text, None, label)
        maxlength: maximum sequence length, defaults to tokenizer.model_max_length
        stride: chunk size for splitting data for QA tasks
        task: optional model task or category, determines the model type, defaults to "text-classification"
        prefix: optional source prefix
        metrics: optional function that computes and returns a dict of evaluation metrics
        tokenizers: optional number of concurrent tokenizers, defaults to None
        checkpoint: optional resume from checkpoint flag or path to checkpoint directory, defaults to None
        args: training arguments

    Returns:
        (model, tokenizer)
    """

    # Parse TrainingArguments
    args = self.parse(args)

    # Set seed for model reproducibility
    set_seed(args.seed)

    # Load model configuration, tokenizer and max sequence length
    config, tokenizer, maxlength = self.load(base, maxlength)

    # Data collator and list of labels (only for classification models)
    collator, labels = None, None

    # Prepare datasets
    if task == "language-generation":
        # Default tokenizer pad token if it's not set
        tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

        process = Texts(tokenizer, columns, maxlength)
        collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, pad_to_multiple_of=8 if args.fp16 else None)
    elif task in ("language-modeling", "token-detection"):
        process = Texts(tokenizer, columns, maxlength)
        collator = DataCollatorForLanguageModeling(tokenizer, pad_to_multiple_of=8 if args.fp16 else None)
    elif task == "question-answering":
        process = Questions(tokenizer, columns, maxlength, stride)
    elif task == "sequence-sequence":
        process = Sequences(tokenizer, columns, maxlength, prefix)
        collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8 if args.fp16 else None)
    else:
        process = Labels(tokenizer, columns, maxlength)
        labels = process.labels(train)

    # Tokenize training and validation data
    train, validation = process(train, validation, os.cpu_count() if tokenizers and isinstance(tokenizers, bool) else tokenizers)

    # Create model to train
    model = self.model(task, base, config, labels, tokenizer)

    # Add model to collator
    if collator:
        collator.model = model

    # Build trainer
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=collator,
        args=args,
        train_dataset=train,
        eval_dataset=validation if validation else None,
        compute_metrics=metrics,
    )

    # Run training
    trainer.train(resume_from_checkpoint=checkpoint)

    # Run evaluation
    if validation:
        trainer.evaluate()

    # Save model outputs
    if args.should_save:
        trainer.save_model()
        trainer.save_state()

    # Put model in eval mode to disable weight updates and return (model, tokenizer)
    return (model.eval(), tokenizer)
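As the source above shows, the metrics function is forwarded to the underlying Trainer as compute_metrics, so it receives a Transformers EvalPrediction. A minimal sketch that computes validation accuracy, using hypothetical in-memory train/validation lists:

```python
import numpy as np

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Hypothetical datasets in the default (text, label) format
train_data = [{"text": "great product", "label": 1}, {"text": "terrible service", "label": 0}]
val_data = [{"text": "works as expected", "label": 1}]

def accuracy(eval_pred):
    # eval_pred is a transformers EvalPrediction: (predictions, label_ids)
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

# Evaluation runs on the validation split after training completes
model, tokenizer = trainer("bert-base-uncased", train_data,
                           validation=val_data,
                           metrics=accuracy,
                           maxlength=128,
                           num_train_epochs=1)
```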