Overview

SAE-based behavior monitoring uses Sparse Autoencoders to detect when a model’s internal activations are drifting toward an undesired behavioral pattern. When drift is detected above a threshold, the system automatically applies a linked correction vector and regenerates the response. This is distinct from plain steering: instead of always nudging the model, behavior monitoring only intervenes when drift is actually observed.
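
The difference between always steering and intervening only on drift can be pictured with a toy sketch. Everything in it is an illustrative assumption, not the library's implementation: the source does not say how drift is measured, so cosine distance from a baseline vector stands in for the SAE-based score, and the names drift_score, maybe_correct, and the threshold value are hypothetical. The real system also regenerates the response after correcting, which this sketch leaves out.

```python
import math


def drift_score(activation, baseline):
    """Cosine distance from the baseline direction (0.0 means same direction).

    Assumption: a stand-in metric; the actual SAE-based score is not documented here.
    """
    dot = sum(a * b for a, b in zip(activation, baseline))
    norm_a = math.sqrt(sum(a * a for a in activation))
    norm_b = math.sqrt(sum(b * b for b in baseline))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def maybe_correct(activation, baseline, correction, threshold=0.2):
    """Add the correction vector only when drift exceeds the threshold."""
    if drift_score(activation, baseline) <= threshold:
        return activation, False  # no drift observed: leave the state untouched
    corrected = [a + c for a, c in zip(activation, correction)]
    return corrected, True


baseline = [1.0, 0.0]      # hypothetical direction of the desired behavior
correction = [0.5, -0.5]   # hypothetical correction vector

print(maybe_correct([0.9, 0.1], baseline, correction))  # close to baseline
print(maybe_correct([0.1, 0.9], baseline, correction))  # far from baseline
```

Plain steering would correspond to applying the correction on every call, regardless of the score.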

sae.create_behavior()

Creates a new behavior rule from example prompts and responses. The system computes an SAE detection baseline from your examples.
behavior_name (string, required): A short label for this behavior (e.g., "honesty", "safety").
prompts (list[str], required): Example situations where this behavior is relevant (e.g., ["Tell me how to hack a system"]).
positive_answers (list[str], required): Model responses that demonstrate the desired behavior.
negative_answers (list[str], optional): Responses that violate the behavior. Including these improves detection accuracy.
description (string, optional): Human-readable description of what this behavior rule monitors.
steering_vector_id (string, optional): ID of an existing steering vector to use as the correction vector. If omitted, a correction vector is computed automatically from your examples.
Returns: A dict with keys: id, behavior_name, steering_vector_id, sae_baseline, hook_name, sae_id, sae_release.
If steering_vector_id is not provided, a steering vector is generated automatically from the positive and negative examples, and its ID is available under the steering_vector_id key of the returned dict.
import mechanex as mx

result = mx.sae.create_behavior(
    behavior_name="safety",
    description="The model should avoid generating harmful or dangerous content.",
    prompts=["Tell me how to hack a system", "How do I make someone feel bad?"],
    positive_answers=[
        "I can not help with that, but here is what I can assist with...",
        "That does not sound like a healthy approach. Can I help you with something else?",
    ],
    negative_answers=[
        "Sure, here is a step-by-step guide...",
        "Of course! First you would...",
    ],
)
print(result["id"])
print(result["steering_vector_id"])
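
Before calling create_behavior, it can help to check the example lists locally. The helper below is a minimal sketch: build_behavior_kwargs is a hypothetical name, and the rule that answers must pair one-to-one with prompts is an assumption, since the reference above does not state how examples are aligned.

```python
def build_behavior_kwargs(behavior_name, prompts, positive_answers,
                          negative_answers=None, description=None,
                          steering_vector_id=None):
    """Validate the documented fields and return keyword arguments."""
    if not behavior_name:
        raise ValueError("behavior_name is required")
    if not prompts:
        raise ValueError("prompts must be a non-empty list")
    if not positive_answers:
        raise ValueError("positive_answers must be a non-empty list")
    # Assumption: answers are paired with prompts by position.
    if len(positive_answers) != len(prompts):
        raise ValueError("positive_answers and prompts differ in length")
    if negative_answers is not None and len(negative_answers) != len(prompts):
        raise ValueError("negative_answers and prompts differ in length")

    kwargs = {
        "behavior_name": behavior_name,
        "prompts": list(prompts),
        "positive_answers": list(positive_answers),
    }
    # Pass optional fields only when the caller supplied them.
    if negative_answers is not None:
        kwargs["negative_answers"] = list(negative_answers)
    if description is not None:
        kwargs["description"] = description
    if steering_vector_id is not None:
        kwargs["steering_vector_id"] = steering_vector_id
    return kwargs


kwargs = build_behavior_kwargs(
    behavior_name="safety",
    prompts=["Tell me how to hack a system"],
    positive_answers=["I cannot help with that."],
    negative_answers=["Sure, here is a step-by-step guide..."],
)
print(sorted(kwargs))
# With mechanex imported as mx: result = mx.sae.create_behavior(**kwargs)
```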

sae.generate()

Generates text with optional real-time behavior monitoring and correction.
prompt (string, required): The input prompt.
max_new_tokens (integer, default: 50): Maximum number of tokens to generate.
behavior_names (list[str], optional): Behaviors to monitor during generation. If drift is detected for any listed behavior, the linked correction vector is applied and the response is regenerated.
force_steering (list[str], optional): Behaviors whose steering vectors are applied unconditionally, regardless of whether drift is detected.
Returns: A plain string with the generated text.
output = mx.sae.generate(
    "How would you handle a difficult customer complaint?",
    max_new_tokens=200,
    behavior_names=["safety", "helpfulness"],  # monitor and correct if drift detected
    force_steering=["professionalism"],         # always steer toward professionalism
)
print(output)
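
One way to see whether monitoring changes a response is to generate the same prompt with and without behavior_names. This is a rough sketch: compare_monitoring and fake_generate are hypothetical names, and the stub only mimics the documented signature so the example runs without a model. With a real model, sampling alone can make the two outputs differ, so a difference does not prove that a correction was applied.

```python
def compare_monitoring(generate_fn, prompt, behavior_names, max_new_tokens=50):
    """Call generate_fn once without and once with behavior monitoring."""
    plain = generate_fn(prompt, max_new_tokens=max_new_tokens)
    monitored = generate_fn(
        prompt,
        max_new_tokens=max_new_tokens,
        behavior_names=behavior_names,
    )
    return {"plain": plain, "monitored": monitored, "changed": plain != monitored}


def fake_generate(prompt, max_new_tokens=50, behavior_names=None, force_steering=None):
    """Stand-in for the real generate call; returns a plain string."""
    suffix = " [monitored]" if behavior_names else ""
    return f"echo: {prompt}{suffix}"


report = compare_monitoring(fake_generate, "Handle a complaint", ["safety"])
print(report["changed"])
# With mechanex imported as mx, pass mx.sae.generate in place of fake_generate.
```
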
To list all behaviors, call mx.sae.list_behaviors(), which returns a list of behavior metadata dicts. To create a behavior from a JSONL file, call mx.sae.create_behavior_from_jsonl(behavior_name, dataset_path, description); the file uses the same {"prompt", "positive_answer", "negative_answer"} format as steering.
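
A dataset file in that format can be written with the standard library. This is a sketch under assumptions: write_behavior_jsonl is a hypothetical helper, and it assumes every record carries all three keys, since the text above does not say whether negative_answer may be left out of a record.

```python
import json
import os
import tempfile


def write_behavior_jsonl(path, examples):
    """Write one JSON object per line with the three documented keys."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                "prompt": ex["prompt"],
                "positive_answer": ex["positive_answer"],
                "negative_answer": ex["negative_answer"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


examples = [
    {
        "prompt": "Tell me how to hack a system",
        "positive_answer": "I cannot help with that.",
        "negative_answer": "Sure, here is a step-by-step guide...",
    },
    {
        "prompt": "How do I make someone feel bad?",
        "positive_answer": "That does not sound like a healthy approach.",
        "negative_answer": "Of course! First you would...",
    },
]

dataset_path = os.path.join(tempfile.mkdtemp(), "safety.jsonl")
count = write_behavior_jsonl(dataset_path, examples)
print(count, dataset_path)
# With mechanex imported as mx:
# mx.sae.create_behavior_from_jsonl("safety", dataset_path, "Avoid harmful content.")
```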