This guide replaces a general-purpose 1.7B model with a fine-tuned 0.5B model for structured data extraction. The resulting model runs on a 2GB GPU, classifies chat messages into topics, names semantic clusters, and expands search queries with domain synonyms.

Prerequisites

  • A Spectra account with a configured teacher model API key
  • Ollama installed on your target machine
  • Python 3.10+ with transformers and gguf packages (for local conversion)
  • An application that currently calls an LLM for structured output

Step 1: Audit Your Existing LLM Usage

Before training, identify exactly what your application asks the LLM to do. In this project, the application had three LLM-powered workflows:
| Workflow | Input | Output | Previous Model |
| --- | --- | --- | --- |
| Cluster naming | 30 sample messages from a k-means cluster | JSON with topic name, description, and keywords | qwen3:1.7b (1.3GB VRAM) |
| Message classification | A single message + list of valid topics | JSON with primary topic, optional secondary topic, confidence score | Keyword regex (no LLM) |
| Search query expansion | A user’s search query | Expanded query with domain synonyms | None (raw query used) |
The first workflow was the only one using an LLM. The other two were opportunities to add LLM-powered functionality that the 1.7B model’s VRAM footprint had previously precluded.

Step 2: Design Tool Schemas

Each distinct LLM task becomes a tool. If your application already prompts for JSON with a specific schema, use that as the starting point.
{
  "name": "name_topic_cluster",
  "description": "Analyze semantically similar chat messages and produce a concise topic label, description, and discriminative keywords for pattern matching.",
  "parameters": {
    "type": "object",
    "properties": {
      "topic_name": {
        "type": "string",
        "description": "Short topic name, 2-4 words, specific to the content"
      },
      "topic_description": {
        "type": "string",
        "description": "One sentence describing what this topic covers"
      },
      "keywords": {
        "type": "array",
        "items": { "type": "string" },
        "description": "10-15 discriminative keywords for regex matching, must be specific domain terms"
      }
    },
    "required": ["topic_name", "topic_description", "keywords"]
  }
}
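The other two tools follow the same pattern. As a sketch, the classification tool might look like this (the tool name classify_message is illustrative; the field names match the contrastive examples in Step 4):

{
  "name": "classify_message",
  "description": "Classify a single chat message into one of the provided valid topics.",
  "parameters": {
    "type": "object",
    "properties": {
      "primary_topic": {
        "type": "string",
        "description": "Best-matching topic from the provided list"
      },
      "secondary_topic": {
        "type": "string",
        "description": "Optional second topic if the message spans more than one"
      },
      "confidence": {
        "type": "number",
        "description": "Certainty in the primary topic, from 0.0 to 1.0"
      }
    },
    "required": ["primary_topic", "confidence"]
  }
}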
Upload all three as a JSON array via the Paste JSON tab. See Tool Schemas for format details.
Avoid naming parameters "name" or "description" — these collide with tool-level fields in some parsers. Use topic_name, topic_description, etc.

Step 3: Configure Training

Model Selection

| Setting | Value | Rationale |
| --- | --- | --- |
| Teacher Model | Default (optimized for training) | Uses the built-in Axionic teacher path |
| Student Model | Mechanex Mini (Qwen2.5-0.5B-Instruct) | Smallest available; fits on 2GB GPUs with room to spare |

Training Parameters

| Parameter | Recommended | Default | Why |
| --- | --- | --- | --- |
| Training Mode | SFT Only | SFT Only | Structured extraction doesn’t benefit from RL |
| Prompts | 30 | 20 | 0.5B models need more examples to learn structured output reliably |
| Trajectories | 5 | 5 | Default is adequate |
| Epochs | 5 | 3 | Narrow task with small model; extra epochs improve convergence without overfitting |
| Learning Rate | 2e-5 | 1e-5 | Slightly higher LR avoids underfitting within the epoch budget |

Training Objectives

Keep the three default objectives:

  • Always include required parameters in tool calls
  • Never hallucinate field names or API endpoints
  • Validate input types before making calls

Then add domain-specific objectives:

  • Topic names must be 2-4 words and domain-specific, never generic labels like "General" or "Other"
  • Keywords must be discriminative terms that distinguish one topic from another, never stop words or common verbs
  • Confidence scores must reflect actual certainty; use 0.4-0.6 for genuinely ambiguous messages
  • When no topic fits well, output the closest match with low confidence rather than inventing a new topic
  • Expanded search queries must include domain-specific synonyms
Click Start Training. Training typically completes in 5-15 minutes for a 0.5B student model.

Step 4: Add a Vector in Optimization

Prepare Contrastive Data

Create a JSONL file where each line contains a prompt, a positive (desired output), and a negative (unwanted output):
{"prompt": "Classify this message: \"anyone know when the next validator set rotation happens?\"", "positive": "{\"primary_topic\":\"Staking\",\"confidence\":0.85}", "negative": "I think this message is probably about staking, since it mentions validators and delegation."}
{"prompt": "Classify this message: \"yeah idk might rain tomorrow\"", "positive": "{\"primary_topic\":\"Community\",\"confidence\":0.3}", "negative": "{\"primary_topic\":\"Community\",\"confidence\":0.95}"}
{"prompt": "Name this cluster:\n1. \"my node keeps missing blocks\"\n2. \"how do I unjail my validator\"", "positive": "{\"topic_name\":\"Validator Operations\",\"topic_description\":\"Running and troubleshooting validator nodes.\",\"keywords\":[\"validator\",\"node\",\"unjail\",\"blocks\",\"slashing\"]}", "negative": "This cluster seems to be about validators. The messages discuss various operational issues that node operators face."}
{"prompt": "Expand this search query: \"how do I unstake my atoms\"", "positive": "{\"expanded_query\":\"unstake unbond undelegate validator delegation 21 days rewards ATOM\",\"topic_hint\":\"Staking\"}", "negative": "To expand your search query about unstaking, I would suggest including related terms such as unbonding, undelegating, and validator delegation periods."}
Aim for 10-15 pairs covering all tools, mixing clear-cut examples with ambiguous edge cases.
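A quick validation pass catches malformed lines before upload (a minimal sketch; the file name is a placeholder):

import json

# Check each contrastive pair has the required fields and that positives
# are themselves valid JSON, since the vector steers toward JSON output.
with open('contrastive_pairs.jsonl') as f:
    for i, line in enumerate(f, 1):
        pair = json.loads(line)
        assert {'prompt', 'positive', 'negative'} <= pair.keys(), f'line {i}: missing field'
        json.loads(pair['positive'])
print('All pairs valid')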

Create the Vector

1. Open Vector Library: open the Optimization workspace for your model and click + Create Vector.

2. Configure the vector:

| Field | Value |
| --- | --- |
| Vector Name | Structured Tool Output |
| Label | structured-output |
| Description | Forces JSON tool-call output instead of prose. Steers toward calibrated confidence scores, domain-specific keywords, and concise output. |
| Category | Compliance |
| Methodology | CAA |

3. Upload data: select Upload as the dataset source and upload your JSONL file. Verify the pair count matches your expectations.

4. Set refinement parameters: use the defaults, Refinement Steps = 100 and Learning Rate = 0.01.

5. Create and attach: click Create Vector. Once generated, attach it in Optimization and set strength to 60%. Increase to 80% if prose still leaks through during testing.
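CAA (Contrastive Activation Addition) builds the steering vector from the difference between the model’s internal activations on positive and negative completions. Spectra computes this for you; the following sketch only illustrates the idea:

import numpy as np

# Conceptual CAA: average the activation difference between positive and
# negative completions at a chosen layer to get a steering direction.
def caa_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    # pos_acts, neg_acts: (num_pairs, hidden_dim) hidden states
    return (pos_acts - neg_acts).mean(axis=0)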

Step 5: Test via API

from openai import OpenAI

client = OpenAI(
    api_key="your-spectra-api-key",
    base_url="https://api.axioniclabs.ai/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {
            "role": "system",
            "content": "You are a topic analysis agent. Always respond with ONLY valid JSON."
        },
        {
            "role": "user",
            "content": (
                'Classify this message: '
                '"my IBC transfer from osmosis has been stuck for 3 hours, '
                'is the relayer down?"\n'
                'Return: {"primary_topic":"...","confidence":0.0}'
            )
        }
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
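Beyond eyeballing the output, verify it parses and the fields are sane (this continues the script above and assumes the classification schema from Step 2):

import json

# The model should return bare JSON; parse it and check the schema.
result = json.loads(response.choices[0].message.content)
assert 'primary_topic' in result and 'confidence' in result
assert 0.0 <= result['confidence'] <= 1.0
print(result)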

Step 6: Deploy Locally with Ollama

Convert to GGUF for local inference via Ollama.

Download from HuggingFace

mkdir -p /tmp/my-model && cd /tmp/my-model
HF_TOKEN="your-hf-token"
REPO="your-username/your-model-name"
BASE="https://huggingface.co/${REPO}/resolve/main"

for f in config.json tokenizer.json tokenizer_config.json generation_config.json model.safetensors; do
  curl -sSL -H "Authorization: Bearer $HF_TOKEN" -o "$f" "${BASE}/${f}"
done
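curl writes a file even when the server returns an error page, so confirm the downloads are sound before converting (a minimal sketch):

import json
import os

# Every expected file should exist and be non-empty; config.json should
# parse, since an auth failure leaves an HTML error page in its place.
for f in ['config.json', 'tokenizer.json', 'tokenizer_config.json',
          'generation_config.json', 'model.safetensors']:
    size = os.path.getsize(f'/tmp/my-model/{f}')  # raises if missing
    assert size > 0, f'{f} is empty'
with open('/tmp/my-model/config.json') as fh:
    json.load(fh)
print('Downloads look good')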

Fix Tokenizer Compatibility (Qwen2)

Qwen2-based models may ship with extra_special_tokens as a list instead of a dict, which breaks conversion:
import json

with open('/tmp/my-model/tokenizer_config.json') as f:
    cfg = json.load(f)

if isinstance(cfg.get('extra_special_tokens'), list):
    tokens = cfg['extra_special_tokens']
    cfg['extra_special_tokens'] = {t: t for t in tokens}
    with open('/tmp/my-model/tokenizer_config.json', 'w') as f:
        json.dump(cfg, f, indent=2)
    print('Fixed extra_special_tokens format')

Convert to GGUF

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
pip install gguf transformers

python llama.cpp/convert_hf_to_gguf.py /tmp/my-model \
  --outtype q8_0 \
  --outfile /tmp/my-model.q8_0.gguf
Spectra exports safetensors in BF16. GPUs without BF16 support (Pascal-generation and older) will crash at inference. GGUF conversion to F16 or Q8_0 is required for these cards.

Quantization Options

| Format | Size (0.5B model) | Quality | When to use |
| --- | --- | --- | --- |
| F16 | ~988MB | Lossless | When VRAM isn’t a constraint |
| Q8_0 | ~525MB | Near-lossless | Default choice for small models |
| Q4_K_M | ~300MB | Good | When VRAM is extremely tight |
For sub-1B models, prefer Q8_0; aggressive quantization degrades small models disproportionately.

Create an Ollama Model

Create a Modelfile with the Qwen2 chat template (required for tool-calling support):
FROM /tmp/my-model.q8_0.gguf

TEMPLATE """{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}{{ end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}"""

PARAMETER temperature 0.3
PARAMETER num_predict 500
PARAMETER stop <|im_end|>

SYSTEM """You are a topic analysis agent. Always respond with ONLY valid JSON, no other text."""
ollama create my-model -f Modelfile
If you skip the TEMPLATE directive, Ollama assigns a minimal {{ .Prompt }} template that breaks chat-format inference and tool calling.

Verify

curl -sS http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "system", "content": "Always respond with valid JSON only."},
      {"role": "user", "content": "Classify: \"validator is jailed after missing 500 blocks\"\nReturn: {\"primary_topic\":\"...\",\"confidence\":0.0}"}
    ],
    "temperature": 0.3
  }'
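The same check works through the OpenAI-compatible client. Note that an api_key is required by the client library but ignored by Ollama:

from openai import OpenAI

# Point the OpenAI client at the local Ollama endpoint.
client = OpenAI(api_key='ollama', base_url='http://localhost:11434/v1')

resp = client.chat.completions.create(
    model='my-model',
    messages=[
        {'role': 'system', 'content': 'Always respond with valid JSON only.'},
        {'role': 'user', 'content': 'Classify: "validator is jailed after missing 500 blocks"'}
    ],
    temperature=0.3
)
print(resp.choices[0].message.content)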

Step 7: Integrate into Your Application

Both the Spectra API and Ollama’s /v1 endpoint implement the OpenAI chat completions spec. Write once, switch with env vars.
const LLM_API_URL = process.env.LLM_API_URL || 'https://api.axioniclabs.ai/v1';
const LLM_API_KEY = process.env.LLM_API_KEY || '';
const LLM_MODEL = process.env.LLM_MODEL || 'my-model';

async function llmComplete(prompt: string, systemPrompt?: string): Promise<string> {
  const messages: { role: string; content: string }[] = [];
  if (systemPrompt) messages.push({ role: 'system', content: systemPrompt });
  messages.push({ role: 'user', content: prompt });

  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  if (LLM_API_KEY) headers['Authorization'] = `Bearer ${LLM_API_KEY}`;

  const response = await fetch(`${LLM_API_URL}/chat/completions`, {
    method: 'POST',
    headers,
    body: JSON.stringify({
      model: LLM_MODEL,
      messages,
      temperature: 0.3,
      max_tokens: 500
    })
  });

  if (!response.ok) throw new Error(`LLM API error: ${response.status}`);

  const data = await response.json();
  return data.choices?.[0]?.message?.content || '';
}

Environment Configuration

LLM_API_URL=https://api.axioniclabs.ai/v1
LLM_API_KEY=ax_your_api_key_here
LLM_MODEL=your-model-name
Use this hosted configuration during development, or when you want Spectra’s steering vectors and SAE monitoring applied at inference time.
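For local inference, point the same code at Ollama instead. The empty API key means llmComplete skips the Authorization header, and the model name matches whatever you passed to ollama create:

LLM_API_URL=http://localhost:11434/v1
LLM_API_KEY=
LLM_MODEL=my-model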

Parse the Response

Small models occasionally wrap JSON in markdown fences:
function parseJsonResponse(raw: string): Record<string, unknown> {
  // Greedily grab everything between the first "{" and the last "}" so
  // markdown fences or stray prose around the JSON are ignored.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('No JSON found in response');
  return JSON.parse(match[0]);
}

const result = parseJsonResponse(await llmComplete(prompt, systemPrompt));

Step 8: Production Checklist

  • Test all three tools with representative inputs from your actual dataset
  • Verify GPU memory with ollama ps while the model is loaded
  • Set up fallback if using the hosted API (local Ollama for resilience; see the sketch after this list)
  • Monitor output quality for the first week
  • Remove the previous model (ollama rm old-model)
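A minimal fallback sketch, trying the hosted endpoint first and falling back to local Ollama (endpoint order and model names are assumptions to adapt):

import os

from openai import OpenAI

# Ordered list of (client, model) pairs: hosted Spectra first, local
# Ollama second. Adjust to match your deployment.
ENDPOINTS = [
    (OpenAI(api_key=os.environ.get('LLM_API_KEY', ''),
            base_url='https://api.axioniclabs.ai/v1'),
     os.environ.get('LLM_MODEL', 'your-model-name')),
    (OpenAI(api_key='ollama', base_url='http://localhost:11434/v1'),
     'my-model'),
]

def complete(prompt: str) -> str:
    for client, model in ENDPOINTS:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{'role': 'user', 'content': prompt}],
                temperature=0.3,
                max_tokens=500,
            )
            return resp.choices[0].message.content or ''
        except Exception:
            continue  # endpoint unreachable or errored; try the next one
    raise RuntimeError('All LLM endpoints failed')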

Results

| Metric | Before (qwen3:1.7b) | After (fine-tuned 0.5B) |
| --- | --- | --- |
| VRAM usage | ~1.3GB | ~525MB |
| Inference speed | 1x | ~3x faster |
| Structured output | Regex JSON extraction with fallback | Consistent JSON |
| Capabilities | Cluster naming only | Cluster naming + classification + query expansion |