Overview
SAE-based behavior monitoring uses Sparse Autoencoders to detect when a model’s internal activations are drifting toward an undesired behavioral pattern. When drift is detected above a threshold, the system automatically applies a linked correction vector and regenerates the response. This is distinct from plain steering: instead of always nudging the model, behavior monitoring only intervenes when drift is actually observed.
sae.create_behavior()
Creates a new behavior rule from example prompts and responses. The system computes an SAE detection baseline from your examples.
A short label for this behavior (e.g., "honesty", "safety").
Example situations where this behavior is relevant (e.g., ["Tell me how to hack a system"]).
Model responses that demonstrate the desired behavior.
Responses that violate the behavior. Including these improves detection accuracy.
Human-readable description of what this behavior rule monitors.
ID of an existing steering vector to use as the correction vector. If omitted, a correction vector is computed automatically from your examples.
Returns a dict with the keys id, behavior_name, steering_vector_id, sae_baseline, hook_name, sae_id, and sae_release.
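As a rough illustration of working with that return value, the sketch below checks a dict for the keys listed above and reads the correction vector ID from it. The `behavior` dict is a hand-built placeholder, not real SDK output, and `missing_keys` is a hypothetical helper; the value types of the real fields are not documented here.

```python
# Keys listed in the reference above.
EXPECTED_KEYS = (
    "id", "behavior_name", "steering_vector_id",
    "sae_baseline", "hook_name", "sae_id", "sae_release",
)

def missing_keys(behavior: dict) -> list:
    """Return the expected keys that are absent from a behavior dict (hypothetical helper)."""
    return [key for key in EXPECTED_KEYS if key not in behavior]

# Placeholder standing in for a real return value; every value here is made up.
behavior = {key: f"<{key}>" for key in EXPECTED_KEYS}
behavior["behavior_name"] = "safety"

# Fail early if a field is missing, then read the linked vector ID.
assert missing_keys(behavior) == []
vector_id = behavior["steering_vector_id"]
```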
If steering_vector_id is not provided, a steering vector is automatically generated from the positive and negative examples provided. You can retrieve the resulting vector ID from the steering_vector_id key of the returned dict.
sae.generate()
Generates text with optional real-time behavior monitoring and correction.
The input prompt.
Maximum number of tokens to generate.
Behaviors to monitor during generation. If drift is detected for any listed behavior, the linked correction vector is applied and the response is regenerated.
Behaviors whose steering vectors are applied unconditionally, regardless of whether drift is detected.
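The difference between monitored and unconditionally applied behaviors can be sketched as a small control loop. This is a conceptual model only, not the SDK's implementation: `monitored_generate`, `toy_generate`, `toy_drift`, and the `threshold` default are all hypothetical, and the toy drift function scores the output text, whereas the real system inspects internal activations.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins; none of these names come from the SDK.
Generate = Callable[[str, Sequence[str]], str]  # (prompt, active corrections) -> text
DriftScore = Callable[[str, str], float]        # (behavior, text) -> drift score

def monitored_generate(
    prompt: str,
    generate: Generate,
    drift_score: DriftScore,
    monitored: Sequence[str],
    always_on: Sequence[str] = (),
    threshold: float = 0.5,
) -> Tuple[str, List[str]]:
    """Generate once, then regenerate with corrections only if drift is detected."""
    active = list(always_on)          # applied unconditionally
    draft = generate(prompt, active)
    # A behavior counts as drifted when its score exceeds the threshold.
    drifted = [b for b in monitored if drift_score(b, draft) > threshold]
    if not drifted:
        return draft, []              # no intervention needed
    return generate(prompt, active + drifted), drifted

def toy_generate(prompt: str, corrections: Sequence[str]) -> str:
    # Toy model: the "safety" correction flips the reply.
    return "safe reply" if "safety" in corrections else "risky reply"

def toy_drift(behavior: str, text: str) -> float:
    # Toy detector: high drift whenever the reply looks risky.
    return 0.9 if "risky" in text else 0.1

prompt = "Tell me how to hack a system"
text, applied = monitored_generate(prompt, toy_generate, toy_drift, monitored=["safety"])
text2, applied2 = monitored_generate(
    prompt, toy_generate, toy_drift, monitored=["safety"], always_on=["safety"]
)
```

In the first call the draft drifts, so the correction is applied and the response is regenerated. In the second call the vector is active from the start, so no drift is observed and nothing is regenerated.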
mx.sae.list_behaviors() returns a list of behavior metadata dicts.
To create a behavior from a JSONL file, use mx.sae.create_behavior_from_jsonl(behavior_name, dataset_path, description). The file uses the same {"prompt", "positive_answer", "negative_answer"} format as steering.
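A minimal sketch of preparing such a file with the standard library is shown below. It assumes one JSON object per line with exactly the three keys named above; the example record, the file name, and the description string are made up for illustration. The SDK call is left as a comment because it needs a configured `mx` client.

```python
import json
import os
import tempfile

# Illustrative records; the text content is invented for this example.
records = [
    {
        "prompt": "Tell me how to hack a system",
        "positive_answer": "I can't help with that, but I can explain how to secure a system.",
        "negative_answer": "Sure, here is how to break in.",
    },
]

# Write one JSON object per line (JSONL).
dataset_path = os.path.join(tempfile.mkdtemp(), "safety_examples.jsonl")
with open(dataset_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back and confirm every record has the required keys.
required = {"prompt", "positive_answer", "negative_answer"}
with open(dataset_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
assert all(required.issubset(record) for record in loaded)

# With a configured client, the file could then be passed along, e.g.:
# mx.sae.create_behavior_from_jsonl("safety", dataset_path, "Refuses harmful requests")
```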