This guide walks you through setting up the AI Proxy plugin with the Llama2 LLM.
Llama2 is a self-hosted model. As such, it requires setting the model option upstream_url to point to the absolute HTTP(S) endpoint for this model implementation.
There are several hosting and format options for running this LLM. Popular options are listed in the table under Upstream formats.
Upstream formats
The upstream request and response formats differ between Llama2 implementations and their accompanying web servers.
For this provider, set the config.model.options.llama2_format parameter according to how the model is hosted:
Llama2 Hosting | llama2_format Config Value | Auth Header |
---|---|---|
HuggingFace | raw | Authorization |
Ollama | ollama | Not required by default |
llama.cpp | raw | Not required by default |
Self-Hosted GGUF | openai | Not required by default |
Raw format
The raw format option emits the full Llama2 prompt format, under the JSON field inputs:
{
"inputs": "<s>[INST] <<SYS>>You are a mathematician. \n <</SYS>> \n\n What is 1 + 1? [/INST]"
}
It expects the response to be in the responses JSON field. If you use llama.cpp, also set it to RAW mode.
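To make the shape of the raw format concrete, the following is a minimal sketch that rebuilds the example body above from a system message and a user message. The plugin performs this transformation itself, so this is not Kong's implementation; the helper name build_raw_payload is hypothetical, the whitespace is copied from the example, and multi-turn conversations are not covered.

```python
import json


def build_raw_payload(system: str, user: str) -> str:
    """Return a JSON body shaped like the raw-format example.

    Hypothetical helper for illustration only; the spacing around the
    <<SYS>> block is copied from the example, not from a specification.
    """
    prompt = f"<s>[INST] <<SYS>>{system} \n <</SYS>> \n\n {user} [/INST]"
    return json.dumps({"inputs": prompt})


payload = build_raw_payload("You are a mathematician.", "What is 1 + 1?")
print(payload)
```

The printed value is the same JSON object as the example, with the whole prompt carried in the single inputs field.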
Ollama format
The ollama format option adheres to the chat and chat-completion request formats, as defined in the Ollama API documentation.
OpenAI format
The openai format option follows the same upstream formats as the equivalent OpenAI route type operation (that is, llm/v1/chat or llm/v1/completions).
Using the plugin with Llama2
For all providers, the Kong AI Proxy plugin attaches to route entities.
It can be installed into one route per operation, for example:
- OpenAI chat route
- Cohere chat route
- Cohere completions route
Each of these AI-enabled routes must point to a null service. This service doesn’t need to map to any real upstream URL. It can point somewhere empty (for example, http://localhost:32000), because the AI Proxy plugin overwrites the upstream URL. This requirement will be removed in a later Kong revision.
Prerequisites
You need a service to contain the route for the LLM provider. Create a service first:
curl -X POST http://localhost:8001/services \
--data "name=ai-proxy" \
--data "url=http://localhost:32000"
Remember that the upstream URL can point anywhere empty, as it won’t be used by the plugin.
Provider configuration
Set up route and plugin
After installing and starting your Llama2 instance, you can then create an AI Proxy route and plugin configuration.
Kong Admin API
Create the route:
curl -X POST http://localhost:8001/services/ai-proxy/routes \
--data "name=llama2-chat" \
--data "paths[]=~/llama2-chat$"
Enable and configure the AI Proxy plugin for Llama2:
curl -X POST http://localhost:8001/routes/llama2-chat/plugins \
--data "name=ai-proxy" \
--data "config.route_type=llm/v1/chat" \
--data "config.model.provider=llama2" \
--data "config.model.name=llama2" \
--data "config.model.options.llama2_format=ollama" \
--data "config.model.options.upstream_url=http://ollama-server.local:11434/api/chat"
The equivalent declarative configuration in YAML:
name: llama2-chat
paths:
  - "~/llama2-chat$"
methods:
  - POST
plugins:
  - name: ai-proxy
    config:
      route_type: "llm/v1/chat"
      model:
        provider: "llama2"
        name: "llama2"
        options:
          llama2_format: "ollama"
          upstream_url: "http://ollama-server.local:11434/api/chat"
Test the configuration
Make an llm/v1/chat type request to test your new endpoint:
curl -X POST http://localhost:8000/llama2-chat \
-H 'Content-Type: application/json' \
--data-raw '{ "messages": [ { "role": "system", "content": "You are a mathematician" }, { "role": "user", "content": "What is 1+1?"} ] }'
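The same test request can be sent from application code. The sketch below uses only the Python standard library and assumes the proxy address and route path from the curl command above; the function names build_chat_request and send are hypothetical, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request

# Proxy address and route path taken from the curl example; adjust for your deployment.
KONG_PROXY_URL = "http://localhost:8000/llama2-chat"


def build_chat_request(system: str, user: str,
                       url: str = KONG_PROXY_URL) -> urllib.request.Request:
    """Build an llm/v1/chat style POST request (hypothetical helper)."""
    body = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; send() does.
request = build_chat_request("You are a mathematician", "What is 1+1?")
```

With the gateway and the Llama2 instance running, calling send(request) performs the same call as the curl command.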