This guide walks you through setting up the AI Proxy plugin with the Llama2 LLM.

Llama2 is a self-hosted model. As such, it requires setting the model option upstream_url to point to the absolute HTTP(S) endpoint for this model implementation.

There are a number of hosting/format options for running this LLM. The most popular are listed in the table in the following section.

Upstream formats

The upstream request and response formats differ between the various implementations of Llama2 and their accompanying web servers.

For this provider, the following should be used for the config.model.options.llama2_format parameter:

| Llama2 Hosting   | llama2_format Config Value | Auth Header             |
|------------------|----------------------------|-------------------------|
| HuggingFace      | raw                        | Authorization           |
| OLLAMA           | ollama                     | Not required by default |
| llama.cpp        | raw                        | Not required by default |
| Self-Hosted GGUF | openai                     | Not required by default |
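
For hosted endpoints that require authentication (for example, HuggingFace), the plugin can inject the auth header for you. A minimal sketch, assuming the plugin's config.auth.header_name and config.auth.header_value options and a placeholder token, appended to the plugin configuration request shown later in this guide:

    --data "config.auth.header_name=Authorization" \
    --data "config.auth.header_value=Bearer <HUGGINGFACE_API_TOKEN>"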

Raw format

The raw format option emits the full Llama2 prompt format, under the JSON field inputs:

    {
      "inputs": "<s>[INST] <<SYS>>You are a mathematician. \n <</SYS>> \n\n What is 1 + 1? [/INST]"
    }

The plugin expects the response to be in the responses JSON field. If you are using llama.cpp, its server should also be set to RAW mode.
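
Purely as an illustration, a raw-format response in the shape described above might look like this; the exact schema depends on your Llama2 server implementation:

    {
      "responses": "1 + 1 = 2"
    }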

Ollama format

The ollama format option adheres to the chat and chat-completion request formats, as defined in the Ollama API documentation.
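
For reference, a request in Ollama's chat format looks roughly like the following; the hostname and model name below are assumptions for your own deployment:

    curl -X POST http://ollama-server.local:11434/api/chat \
      -H 'Content-Type: application/json' \
      --data '{ "model": "llama2", "messages": [ { "role": "user", "content": "What is 1 + 1?" } ], "stream": false }'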

OpenAI format

The openai format option follows the same upstream formats as the equivalent OpenAI route type operation (that is, llm/v1/chat or llm/v1/completions).
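
A chat request body in this format looks roughly like the following (illustrative only; the model name is an assumption):

    {
      "model": "llama2",
      "messages": [
        { "role": "user", "content": "What is 1 + 1?" }
      ]
    }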

Using the plugin with Llama2

For all providers, the Kong AI Proxy plugin attaches to route entities.

It can be installed into one route per operation, for example:

  • OpenAI chat route
  • Cohere chat route
  • Cohere completions route

Each of these AI-enabled routes must point to a null service. This service doesn't need to map to any real upstream URL; it can point somewhere empty (for example, http://localhost:32000), because the AI Proxy plugin overwrites the upstream URL. This requirement will be removed in a later Kong revision.

Prerequisites

You need a service to contain the route for the LLM provider. Create a service first:

    curl -X POST http://localhost:8001/services \
      --data "name=ai-proxy" \
      --data "url=http://localhost:32000"

Remember that the upstream URL can point anywhere empty, as it won’t be used by the plugin.
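
If you want to confirm that the service was created, query the Kong Admin API for it by name:

    curl http://localhost:8001/services/ai-proxy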

Provider configuration

Set up route and plugin

After installing and starting your Llama2 instance, you can then create an AI Proxy route and plugin configuration.

Kong Admin API

Create the route:

    curl -X POST http://localhost:8001/services/ai-proxy/routes \
      --data "name=llama2-chat" \
      --data "paths[]=~/llama2-chat$"

Enable and configure the AI Proxy plugin for Llama2:

    curl -X POST http://localhost:8001/routes/llama2-chat/plugins \
      --data "name=ai-proxy" \
      --data "config.route_type=llm/v1/chat" \
      --data "config.model.provider=llama2" \
      --data "config.model.name=llama2" \
      --data "config.model.options.llama2_format=ollama" \
      --data "config.model.options.upstream_url=http://ollama-server.local:11434/api/chat"

YAML

Or, add the route and plugin to your declarative configuration:

    name: llama2-chat
    paths:
      - "~/llama2-chat$"
    methods:
      - POST
    plugins:
      - name: ai-proxy
        config:
          route_type: "llm/v1/chat"
          model:
            provider: "llama2"
            name: "llama2"
            options:
              llama2_format: "ollama"
              upstream_url: "http://llama2-server.local:11434/api/chat"
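
Whichever method you use, you can confirm that the plugin is attached to the route by listing its plugins through the Admin API:

    curl http://localhost:8001/routes/llama2-chat/plugins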

Test the configuration

Make an llm/v1/chat type request to test your new endpoint:

    curl -X POST http://localhost:8000/llama2-chat \
      -H 'Content-Type: application/json' \
      --data-raw '{ "messages": [ { "role": "system", "content": "You are a mathematician" }, { "role": "user", "content": "What is 1+1?"} ] }'
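
A successful response comes back in a standardized, OpenAI-style chat completion shape. A trimmed, illustrative example (exact fields and values will vary for your deployment):

    {
      "object": "chat.completion",
      "model": "llama2",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "1 + 1 equals 2."
          },
          "finish_reason": "stop"
        }
      ]
    }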
