pi-lilac-provider
Kimi K2.6, GLM 5.1, Gemma 4 & more on idle GPUs via Lilac
A pi provider extension for cost-efficient GPU inference.
Access Kimi K2.6, GLM 5.1, MiniMax M2.7, and Gemma 4 models through Lilac's OpenAI-compatible API on idle GPUs.
Features
- 6 AI Models: Gemma 4, GLM 5.1, GLM 5.2, Kimi K2.6, MiniMax M2.7, and MiniMax M3
- OpenAI-Compatible API: Just change the base URL and API key
- Cost Tracking: Per-model pricing with cache read discounts
- Reasoning Models: Chain-of-thought via chat_template_kwargs (all models)
- Vision Support: Image input on Kimi K2.6 and Gemma 4
- Context Caching: Cache read pricing on every model except Gemma 4
- Flex (Discount Gating): Only let the LLM respond when the active model's discount meets a threshold you set (/lilac-flex)
- Idle GPU Scheduling: Lilac leverages idle GPU capacity for cost-efficient inference
Installation
Option 1: Using pi install (Recommended)
Install directly from GitHub:
pi install https://github.com/monotykamary/pi-lilac-provider
Then set your API key and run pi:
# Recommended: add to auth.json
# See Authentication section below
# Or set as environment variable
export LILAC_API_KEY=your-api-key-here
pi
Option 2: Manual Clone
Clone this repository:
git clone https://github.com/monotykamary/pi-lilac-provider.git
cd pi-lilac-provider
Set your Lilac API key:
# Recommended: add to auth.json
# See Authentication section below
# Or set as environment variable
export LILAC_API_KEY=your-api-key-here
Run pi with the extension:
pi -e /path/to/pi-lilac-provider
Available Models
| Model | Context | Vision | Reasoning | Input $/M | Cache Read $/M | Output $/M |
|---|---|---|---|---|---|---|
| Gemma 4 | 262K | Yes | Yes | $0.11 | — | $0.35 |
| GLM 5.1 | 203K | | Yes | $0.90 | $0.27 | $3.00 |
| GLM 5.2 | 524K | | Yes | $0.90 | $0.27 | $3.00 |
| Kimi K2.6 | 262K | Yes | Yes | $0.70 | $0.20 | $3.50 |
| MiniMax M2.7 | 205K | | Yes | $0.30 | $0.06 | $1.20 |
| MiniMax M3 | 1.0M | | Yes | $0.28 | $0.05 | $1.10 |
Costs are per million tokens. Prices subject to change — check getlilac.com for current pricing.
Notes:
- Gemma 4 has reasoning off by default — pi enables it when you set a thinking level (Shift+Tab)
- Kimi K2.6, GLM 5.1, GLM 5.2, and MiniMax M2.7 have reasoning on by default
- Cache read pricing applies to repeated input tokens served from cache on supported models
- Gemma 4 does not support cache read pricing
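To make the pricing columns concrete, here is a minimal sketch of how a per-request cost could be estimated from the table. The names `Pricing`, `Usage`, and `estimateCostUsd` are hypothetical and not part of the extension. The sketch assumes that cached tokens are counted inside the input token total, and that a model without a cache read price bills cached tokens at the normal input rate. It ignores Lilac's fluctuating discount.

```typescript
// Per-million-token prices in USD, as listed in the Available Models table.
interface Pricing {
  inputPerM: number;
  cacheReadPerM: number | null; // null when the model has no cache read price
  outputPerM: number;
}

interface Usage {
  inputTokens: number; // total prompt tokens, cached ones included (assumption)
  cachedTokens: number; // prompt tokens served from cache
  outputTokens: number;
}

function estimateCostUsd(p: Pricing, u: Usage): number {
  // Assumption: without a cache read price, cached tokens cost the input rate.
  const cacheRate = p.cacheReadPerM ?? p.inputPerM;
  const freshTokens = u.inputTokens - u.cachedTokens;
  const totalPerM =
    freshTokens * p.inputPerM +
    u.cachedTokens * cacheRate +
    u.outputTokens * p.outputPerM;
  return totalPerM / 1_000_000;
}

// Kimi K2.6 list prices from the table.
const kimiK26: Pricing = { inputPerM: 0.7, cacheReadPerM: 0.2, outputPerM: 3.5 };
const cost = estimateCostUsd(kimiK26, {
  inputTokens: 100_000,
  cachedTokens: 80_000,
  outputTokens: 2_000,
});
console.log(cost.toFixed(4));
```

With 80% of the prompt served from cache, the cached portion is billed at $0.20/M instead of $0.70/M, which is where most of the saving comes from in long multi-turn sessions.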
Usage
After loading the extension, use the /model command in pi to select your preferred model:
/model lilac moonshotai/kimi-k2.6
Or start pi directly with a Lilac model:
pi --provider lilac --model moonshotai/kimi-k2.6
Thinking Mode
All Lilac models toggle reasoning via chat_template_kwargs, but the key each
model's chat template honors differs per family. The provider uses pi's
chat-template thinkingFormat with per-model chatTemplateKwargs (configured in
patch.json) so the right key reaches each template:
| Model | Reasoning key | Default |
|---|---|---|
| Kimi K2.6 | thinking (bool) | on |
| GLM 5.1 | enable_thinking (bool) | on |
| GLM 5.2 | enable_thinking (bool) + reasoning_effort (max or high) | on (max) |
| Gemma 4 | enable_thinking (bool) | off |
| MiniMax M2.7 | thinking + enable_thinking (bool) | on |
| MiniMax M3 | thinking_mode (disabled, adaptive, or enabled) | adaptive (server) |
Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7 use the forward-compatible form
that sends both thinking and enable_thinking, so whichever key the
template honors is set. GLM 5.2 additionally maps pi's thinking levels to
reasoning_effort (high = lower-latency, xhigh = max). MiniMax M3 uses
the thinking_mode enum, exposed as three pi thinking levels: off →
disabled (never think), minimal → adaptive (the model decides), high →
enabled (always think). Pi starts at off (disabled); cycle to minimal
for M3's adaptive "model decides" mode. (The selector/footer show pi's level
names — minimal/high — not the thinking_mode values; pi has no per-model
level-relabel hook.)
Preserved thinking (full-history reasoning). By default these templates trim older assistant reasoning between turns (each vendor's default), which degrades multi-turn recall. Three models opt into full-history preservation via a template flag sent alongside the reasoning key:
| Model | Flag | Effect |
|---|---|---|
| Kimi K2.6 | preserve_thinking: true | keeps every assistant turn's reasoning (default: only the last) |
| GLM 5.1 | clear_thinking: false | keeps reasoning for all turns (default: clears before the last user message) |
| GLM 5.2 | clear_thinking: false | keeps reasoning for all turns (default: clears before the last user message) |
Kimi K2.6 and GLM 5.2 are E2E-verified on the sibling neuralwatt provider via a
3-turn, two-20-digit-number recall test (Kimi 0/6 → 6/6, GLM 5.2 1/4 → 4/4);
GLM 5.1 uses the same clear_thinking mechanism (confirmed in its HuggingFace
chat template). Gemma 4 and MiniMax M2.7/M3 expose no family-wide preserve flag,
so their older assistant reasoning is trimmed per the template default.
In pi, reasoning models automatically use the appropriate thinking format. Use Shift+Tab to control thinking level.
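The per-family mapping described in this section can be sketched as a single function. This is an illustration only, not the extension's code: the real values live in patch.json. The function name `chatTemplateKwargs`, the family labels, and the `PiLevel` type are hypothetical. Two details are guesses: that GLM 5.2 maps every level other than xhigh to high, and that reasoning_effort is omitted when thinking is off.

```typescript
type PiLevel = "off" | "minimal" | "high" | "xhigh";
// Illustrative family labels, not Lilac model ids.
type Family =
  | "kimi-k2.6"
  | "glm-5.1"
  | "glm-5.2"
  | "gemma-4"
  | "minimax-m2.7"
  | "minimax-m3";
type Kwargs = Record<string, boolean | string>;

function chatTemplateKwargs(family: Family, level: PiLevel): Kwargs {
  const on = level !== "off";

  if (family === "minimax-m3") {
    // Enum instead of a boolean: off -> disabled, minimal -> adaptive, else enabled.
    const mode =
      level === "off" ? "disabled" : level === "minimal" ? "adaptive" : "enabled";
    return { thinking_mode: mode };
  }

  if (family === "glm-5.2") {
    const kwargs: Kwargs = { enable_thinking: on, clear_thinking: false };
    // xhigh -> max; treating every other level as high is a guess.
    if (on) kwargs["reasoning_effort"] = level === "xhigh" ? "max" : "high";
    return kwargs;
  }

  // Forward-compatible form: send both keys so whichever one the template honors is set.
  const kwargs: Kwargs = { thinking: on, enable_thinking: on };
  if (family === "kimi-k2.6") kwargs["preserve_thinking"] = true;
  if (family === "glm-5.1") kwargs["clear_thinking"] = false;
  return kwargs;
}

console.log(chatTemplateKwargs("kimi-k2.6", "high"));
console.log(chatTemplateKwargs("minimax-m3", "minimal"));
```

Keeping the mapping in one place makes it easy to see which families share the boolean form and which need their own shape.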
Vision
Kimi K2.6 and Gemma 4 support image inputs. Pass images in messages and pi will handle the formatting automatically.
Gemma 4 also supports video by accepting a sequence of frames as images.
Authentication
The Lilac API key can be configured in multiple ways (resolved in this order):
- auth.json (recommended): Add to ~/.pi/agent/auth.json:
  { "lilac": { "type": "api_key", "key": "your-api-key" } }
  The key field supports literal values, env var names, and shell commands (prefix with !). See pi's auth file docs for details.
- Runtime override: Use the --api-key CLI flag
- Environment variable: Set LILAC_API_KEY
Get your API key at getlilac.com.
Environment Variables
| Variable | Required | Description |
|---|---|---|
| LILAC_API_KEY | No | Your Lilac API key (fallback if not in auth.json) |
Configuration
Add to your pi configuration for automatic loading:
{
"extensions": [
"/path/to/pi-lilac-provider"
]
}
Compat Settings
Lilac's API is OpenAI-compatible with these specifics:
- thinkingFormat: "chat-template" (all reasoning models). Lilac's vLLM backend toggles reasoning via chat_template_kwargs, but the honored key differs per model family. Per-model chatTemplateKwargs in patch.json send the right key(s): thinking + enable_thinking (bool) for Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7; enable_thinking + reasoning_effort for GLM 5.2; thinking_mode (adaptive, enabled, or disabled) for MiniMax M3. Kimi K2.6, GLM 5.1, and GLM 5.2 additionally send a preservation flag (preserve_thinking: true / clear_thinking: false) to retain full reasoning history across turns; see Preserved thinking above. Override these per-model via Model Overrides.
- maxTokensField: "max_completion_tokens" (all models). Lilac supports max_completion_tokens (preferred for reasoning models as it includes reasoning tokens).
- supportsDeveloperRole: true (all models). Lilac's vLLM backend maps the developer role to system.
- supportsStore: false (all models). Lilac doesn't support the store parameter.
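For readers calling Lilac without pi, the settings above translate into a request body roughly like the following sketch. The `ChatRequest` shape and `buildKimiRequest` helper are hypothetical. Placing chat_template_kwargs at the top level of the body follows vLLM's OpenAI-compatible server and is assumed to hold for Lilac as well.

```typescript
interface ChatMessage {
  role: "developer" | "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_completion_tokens: number;
  chat_template_kwargs: Record<string, boolean | string>;
}

function buildKimiRequest(
  instructions: string,
  prompt: string,
  maxTokens: number
): ChatRequest {
  return {
    model: "moonshotai/kimi-k2.6",
    messages: [
      // The developer role is mapped to system by the backend.
      { role: "developer", content: instructions },
      { role: "user", content: prompt },
    ],
    // Budget that also covers reasoning tokens.
    max_completion_tokens: maxTokens,
    // Both reasoning keys plus the Kimi preservation flag.
    chat_template_kwargs: {
      thinking: true,
      enable_thinking: true,
      preserve_thinking: true,
    },
    // Deliberately no `store` field: Lilac does not support it.
  };
}

const body = JSON.stringify(buildKimiRequest("Be concise.", "Hello", 1024));
console.log(body);
```

The serialized body would then be sent as a POST to the chat completions endpoint under Lilac's base URL, with the API key as a bearer token, as with any OpenAI-compatible API.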
Known Caveats
- GLM 5.1 intermittent tool call loss: vLLM's streaming parser intermittently emits finish_reason: "tool_calls" without any delta.tool_calls chunks, even with tool_stream: true (set via zaiToolStream in compat). Pi maps this to stopReason: "toolUse" with zero toolCall blocks, causing an "abrupt stop". The extension's message_end handler converts this to a retryable error that triggers pi's built-in auto-retry mechanism, so the agent automatically re-prompts and typically succeeds on the next attempt.
- GLM 5.1 chain-of-thought leakage: On the current vLLM build, disabling reasoning on GLM 5.1 may still leak chain-of-thought into content terminated by a </think> marker. Post-process the response to discard text up to and including the first </think> when reasoning is disabled. See vllm-project/vllm#31319.
- Gemma 4 reasoning parser: vLLM's reasoning parser can fail to populate the reasoning field when special tokens are stripped before the parser runs. Clients that require a clean split should post-process <|channel|>thought ... <|channel|> markers. See vllm-project/vllm#38855.
- Gemma 4 structured output: Combining enable_thinking: false with response_format: json_schema can silently disable xgrammar-backed structured output. If you rely on structured output with Gemma 4, leave thinking enabled or validate output client-side. See vllm-project/vllm#39130.
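The GLM 5.1 leakage workaround above can be sketched as a small post-processing step. The function name `stripLeakedThinking` is hypothetical, and trimming the whitespace that follows the marker is an extra convenience, not something the caveat requires.

```typescript
// Discards everything up to and including the first </think> marker.
// Only meant for responses where reasoning was disabled on the request.
function stripLeakedThinking(content: string): string {
  const marker = "</think>";
  const index = content.indexOf(marker);
  if (index === -1) return content; // nothing leaked, return unchanged
  return content.slice(index + marker.length).replace(/^\s+/, "");
}

console.log(stripLeakedThinking("The user wants a greeting.</think>\n\nHello!"));
```

Only the first marker is used as the cut point, so a literal </think> that appears later in the actual answer is left alone.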
Patch Overrides
The patch.json file contains overrides that are applied on top of models.json data. This is useful for:
- Correcting API-derived values (e.g., GLM 5.1's maxTokens: the API returns the context length, but the actual max output is 131K)
- Marking models as reasoning-capable when the API features list doesn't include it
- Adding compat settings that the API doesn't provide
- Overriding pricing when official rates change
Model Overrides
modelOverrides lets you override compat flags and other model properties per model id, on top of patch.json + custom-models.json, without editing the extension. Entries are keyed by model id. compat (including nested chatTemplateKwargs), thinkingLevelMap, and cost are deep-merged recursively, so you can toggle one flag without redeclaring the rest; scalars and arrays are replaced. Overrides are applied at session start, so edits take effect on the next pi session.
Create ~/.pi/agent/extensions/lilac.json (auto-populated with defaults on first run):
{
// Only respond when the active model's discount is >= this percent. null = off.
// See "Flex (Discount Gating)" below. Set interactively with /lilac-flex.
"flexThreshold": null,
"modelOverrides": {
// Disable full-history reasoning for kimi-k2.6 (e.g. to save tokens):
"moonshotai/kimi-k2.6": { "compat": { "chatTemplateKwargs": { "preserve_thinking": false } } },
// Toggle the GLM 5.2 clear_thinking flag without redeclaring the rest of compat:
"zai-org/glm-5.2": { "compat": { "chatTemplateKwargs": { "clear_thinking": true } } },
// Override a single thinking level without redeclaring the whole map:
"zai-org/glm-5.1": { "thinkingLevelMap": { "high": "max" } }
}
}
The full set of overridable fields matches the model schema (compat, thinkingLevelMap, cost, contextWindow, maxTokens, reasoning, input). See Compat Settings for the catalog of compat flags and what chatTemplateKwargs values mean per family. An invalid JSON file is left untouched (defaults are used) so a typo isn't silently wiped — fix the file and restart pi.
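The merge semantics described above (objects merged recursively, scalars and arrays replaced) can be illustrated with a generic sketch. `deepMerge` here is hypothetical and merges every nested object, whereas the extension limits deep merging to compat, thinkingLevelMap, and cost.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isPlainObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Objects are merged key by key; scalars and arrays from the override win outright.
function deepMerge(base: Json, override: Json): Json {
  if (!isPlainObject(base) || !isPlainObject(override)) return override;
  const merged: { [key: string]: Json } = { ...base };
  for (const [key, value] of Object.entries(override)) {
    merged[key] = key in base ? deepMerge(base[key], value) : value;
  }
  return merged;
}

const defaults: Json = {
  compat: {
    maxTokensField: "max_completion_tokens",
    chatTemplateKwargs: { thinking: true, preserve_thinking: true },
  },
};
const userOverride: Json = {
  compat: { chatTemplateKwargs: { preserve_thinking: false } },
};
console.log(JSON.stringify(deepMerge(defaults, userOverride)));
```

In this example only preserve_thinking changes; maxTokensField and thinking survive because their parent objects are merged rather than replaced.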
Flex (Discount Gating)
Lilac's per-model discount fluctuates with idle-GPU supply. Flex lets you set a discount threshold so pi only sends a prompt to the LLM when the active model's current discount is at or above it — e.g. "only respond when the discount is ≥ 75%". Below the threshold, the prompt is blocked (dropped with a warning) until the next discount poll brings the discount back up. This is a spend-control feature: you only spend when supply is cheap.
Set it interactively with the /lilac-flex command:
/lilac-flex # picker: Off / ≥50% / ≥75% / Custom…
/lilac-flex 75 # set threshold directly (only respond at ≥75% discount)
/lilac-flex 50% # trailing % accepted on the command line
/lilac-flex off # disable flex (allow all discounts)
The threshold persists in ~/.pi/agent/extensions/lilac.json as flexThreshold (a number 0–100, or null for off) alongside modelOverrides, so it survives restarts. /lilac-flex updates it live — no restart needed.
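As an illustration of the accepted argument forms, the following sketch parses a /lilac-flex argument into the value stored as flexThreshold. `parseFlexArg` is a hypothetical name, and accepting decimals or rejecting out-of-range values this way is an assumption about the command's behavior.

```typescript
// Returns a number 0-100, null for "off", or undefined for invalid input.
function parseFlexArg(arg: string): number | null | undefined {
  const text = arg.trim().toLowerCase();
  if (text === "off") return null;
  // Digits with an optional decimal part and an optional trailing percent sign.
  const match = /^(\d+(?:\.\d+)?)%?$/.exec(text);
  if (!match) return undefined;
  const value = Number(match[1]);
  return value >= 0 && value <= 100 ? value : undefined;
}

console.log(parseFlexArg("75"), parseFlexArg("50%"), parseFlexArg("off"));
```

Using null for "off" mirrors the persisted format, where flexThreshold is either a number or null.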
Behavior notes:
- Gating point: flex checks at prompt-submission time. When blocked, the prompt is dropped (you get a warning notification) and you re-submit once the discount improves. It does not queue the prompt.
- Scope: only interactive (TUI-typed) prompts are gated. rpc/print (automation) and extension-injected messages are not gated, so flex never causes a silent failure in a pipeline or an extension loop. Flex only applies to the active lilac model; non-lilac models always pass.
- No data / no discount entry = 0%. A lilac model with no discount (list price), or before the first discount poll has data, counts as 0% and is blocked when flex is on. This matches how discounts are priced elsewhere in the extension.
- Freshness: when a prompt is blocked, the extension triggers an immediate /status refresh (throttled to once per ~5s) so you're not stuck on a stale low value from the 5-minute idle poll. The next submission sees the fresh discount. The footer status reflects the gate: … · flex ≥75% ok or … · flex ≥75% blocked.
- Discount lock-in: per Lilac, a discount is locked in when a request starts. Flex gates on the best-known discount at submit time, which is what gets locked in for that turn.
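The gating rules in the notes above reduce to a small decision function. This is a sketch under stated assumptions, not the extension's implementation: `FlexInput` and `flexAllows` are hypothetical names, and the refresh and notification side effects are left out.

```typescript
interface FlexInput {
  flexThreshold: number | null; // 0-100, or null when flex is off
  discountPercent: number | undefined; // undefined = no data or no discount entry
  isLilacModel: boolean;
  isInteractive: boolean; // false for rpc/print and extension-injected messages
}

function flexAllows(input: FlexInput): boolean {
  if (input.flexThreshold === null) return true; // flex is off
  if (!input.isLilacModel) return true; // non-lilac models always pass
  if (!input.isInteractive) return true; // automation is never gated
  const discount = input.discountPercent ?? 0; // missing data counts as 0%
  return discount >= input.flexThreshold;
}

console.log(
  flexAllows({
    flexThreshold: 75,
    discountPercent: 80,
    isLilacModel: true,
    isInteractive: true,
  })
);
```

Treating missing data as 0% means a fresh session with flex on blocks the first interactive prompt until a discount poll has returned data.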
Updating Models
Run the update script to fetch the latest models from Lilac's API:
export LILAC_API_KEY=your-api-key
node scripts/update-models.js
This will:
- Fetch models from https://api.getlilac.com/v1/models
- Convert per-token pricing to per-million-tokens
- Preserve existing curated data (pricing, compat) for known models
- Apply overrides from patch.json
- Update models.json and the README model table
A GitHub Actions workflow runs this daily and creates a PR if models have changed.
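The pricing conversion step is simple arithmetic, but floating-point noise is worth handling. A minimal sketch, assuming the API reports prices in USD per token; `perTokenToPerMillion` is a hypothetical name and the four-decimal rounding is an arbitrary choice, not taken from scripts/update-models.js:

```typescript
// Converts a per-token price to a per-million-token price.
function perTokenToPerMillion(perToken: number): number {
  // Round to four decimals so values like 0.7000000000000001 become 0.7.
  return Math.round(perToken * 1e6 * 1e4) / 1e4;
}

console.log(perTokenToPerMillion(0.0000007));
```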
License
MIT