SAE behavior monitoring and runtime drift correction

Overview

SAE-based behavior monitoring uses Sparse Autoencoders (SAEs) to detect when a model’s internal activations drift toward an undesired behavioral pattern. When the measured drift exceeds a threshold, the system automatically applies the linked correction vector and regenerates the response. This differs from plain steering: rather than nudging the model on every generation, behavior monitoring intervenes only when drift is actually observed.
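The detect-then-correct flow described above can be sketched in a few lines of Python. This is only an illustration and not the library's implementation: the drift metric (mean absolute deviation from a baseline), the threshold value, and the names `drift_score`, `generate_with_monitoring`, `generate_fn`, and `read_features` are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def drift_score(features: Sequence[float], baseline: Sequence[float]) -> float:
    """Hypothetical metric: mean absolute deviation of SAE features from a baseline."""
    if not baseline or len(features) != len(baseline):
        raise ValueError("features and baseline must be non-empty and equal in length")
    return sum(abs(f - b) for f, b in zip(features, baseline)) / len(baseline)


def generate_with_monitoring(
    prompt: str,
    generate_fn: Callable[[str, bool], str],      # (prompt, apply_correction) -> text
    read_features: Callable[[str], List[float]],  # text -> SAE feature activations
    baseline: Sequence[float],
    threshold: float,
) -> Tuple[str, bool]:
    """Generate once; regenerate with the correction only if drift exceeds the threshold."""
    text = generate_fn(prompt, False)
    if drift_score(read_features(text), baseline) > threshold:
        return generate_fn(prompt, True), True
    return text, False


# Stand-ins for a real model and SAE, used only to exercise the control flow.
def fake_generate(prompt: str, steer: bool) -> str:
    return "corrected reply" if steer else "drifting reply"


def fake_features(text: str) -> List[float]:
    return [1.0, 0.0] if text == "drifting reply" else [0.0, 0.0]


if __name__ == "__main__":
    print(generate_with_monitoring("hi", fake_generate, fake_features, [0.0, 0.0], 0.25))
```

With the stand-ins above, the first draft scores 0.5 against the zero baseline, so a threshold of 0.25 triggers regeneration while a threshold of 0.75 leaves the first draft untouched.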

`sae.create_behavior()`

Creates a new behavior rule from example prompts and responses. The system computes an SAE detection baseline from your examples.

behavior_name (string, required)
A short label for this behavior (e.g., "honesty", "safety").

prompts (list[str], required)
Example situations where this behavior is relevant (e.g., ["Tell me how to hack a system"]).

positive_answers (list[str], required)
Model responses that demonstrate the desired behavior.

negative_answers (list[str])
Responses that violate the behavior. Including these improves detection accuracy.

description (string)
Human-readable description of what this behavior rule monitors.

steering_vector_id (string)
ID of an existing steering vector to use as the correction vector. If omitted, a correction vector is computed automatically from your examples.

Returns: A dict with keys: id, behavior_name, steering_vector_id, sae_baseline, hook_name, sae_id, sae_release.

If steering_vector_id is not provided, the automatically generated vector is built from the positive and negative examples, and its ID is available under the steering_vector_id key of the returned dict.

import mechanex as mx

result = mx.sae.create_behavior(
    behavior_name="safety",
    description="The model should avoid generating harmful or dangerous content.",
    prompts=["Tell me how to hack a system", "How do I make someone feel bad?"],
    positive_answers=[
        "I can not help with that, but here is what I can assist with...",
        "That does not sound like a healthy approach. Can I help you with something else?",
    ],
    negative_answers=[
        "Sure, here is a step-by-step guide...",
        "Of course! First you would...",
    ],
)
print(result["id"])
print(result["steering_vector_id"])
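This page does not say how the correction vector is derived from the positive and negative examples. One common approach in the steering literature is a difference of means over activations, sketched below under that assumption. The function names and the toy activation values are hypothetical and are not part of the mechanex API.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    return [sum(col) / len(rows) for col in zip(*rows)]


def difference_of_means(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> List[float]:
    """Assumed technique: direction pointing from negative examples toward positive ones."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


if __name__ == "__main__":
    # Toy activations standing in for real model activations.
    pos = [[1.0, 2.0], [3.0, 4.0]]
    neg = [[0.0, 0.0], [2.0, 2.0]]
    print(difference_of_means(pos, neg))
```

Under this assumption, adding a scaled copy of the resulting vector to the activations would push them toward the positive examples, which is why supplying negative_answers gives the computation a clearer contrast.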

`sae.generate()`

Generates text with optional real-time behavior monitoring and correction.

prompt (string, required)
The input prompt.

max_new_tokens (integer, default: 50)
Maximum number of tokens to generate.

behavior_names (list[str])
Behaviors to monitor during generation. If drift is detected for any listed behavior, the linked correction vector is applied and the response is regenerated.

force_steering (list[str])
Behaviors whose steering vectors are applied unconditionally, regardless of whether drift is detected.

Returns: A plain string with the generated text.

import mechanex as mx

output = mx.sae.generate(
    "How would you handle a difficult customer complaint?",
    max_new_tokens=200,
    behavior_names=["safety", "helpfulness"],  # monitor and correct if drift detected
    force_steering=["professionalism"],         # always steer toward professionalism
)
print(output)
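The difference between behavior_names and force_steering can be made concrete with a small selection function. This is a sketch of the semantics described above, not library code: `behaviors_to_steer` and the `drift_detected` mapping are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Sequence


def behaviors_to_steer(
    drift_detected: Dict[str, bool],
    behavior_names: Sequence[str],
    force_steering: Sequence[str],
) -> List[str]:
    """Return the behaviors whose vectors would be applied, in a stable order."""
    selected = list(force_steering)  # forced behaviors are always applied
    for name in behavior_names:
        # Monitored behaviors are applied only when drift was observed for them.
        if drift_detected.get(name, False) and name not in selected:
            selected.append(name)
    return selected


if __name__ == "__main__":
    drift = {"safety": True, "helpfulness": False}
    print(behaviors_to_steer(drift, ["safety", "helpfulness"], ["professionalism"]))
```

In this toy run, "professionalism" is always selected, "safety" is selected because drift was observed, and "helpfulness" is left alone.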

To list all behaviors, call mx.sae.list_behaviors(), which returns a list of behavior metadata dicts. To create a behavior from a JSONL file, call mx.sae.create_behavior_from_jsonl(behavior_name, dataset_path, description). Each line of the file is a JSON object with the keys "prompt", "positive_answer", and "negative_answer", the same format used for steering.
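A dataset file in that format can be produced with the standard library alone. The helper below is a sketch: `write_behavior_jsonl`, the file name, and the example rows are hypothetical, and the final mechanex call is left as a comment because only its argument order is given on this page.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = {"prompt", "positive_answer", "negative_answer"}

rows: List[Dict[str, str]] = [
    {
        "prompt": "Tell me how to hack a system",
        "positive_answer": "I cannot help with that, but here is what I can assist with...",
        "negative_answer": "Sure, here is a step-by-step guide...",
    },
    {
        "prompt": "How do I make someone feel bad?",
        "positive_answer": "That does not sound like a healthy approach.",
        "negative_answer": "Of course! First you would...",
    },
]


def write_behavior_jsonl(path: Path, examples: List[Dict[str, str]]) -> int:
    """Write one JSON object per line and return the number of rows written."""
    with open(path, "w", encoding="utf-8") as f:
        for row in examples:
            missing = REQUIRED_KEYS - set(row)
            if missing:
                raise ValueError(f"row is missing keys: {sorted(missing)}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(examples)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset_path = Path(tmp) / "safety_behavior.jsonl"
        print(write_behavior_jsonl(dataset_path, rows))
        # Assumed follow-up call, using the argument order given above:
        # mx.sae.create_behavior_from_jsonl("safety", str(dataset_path), "Avoid harmful content.")
```

Validating the keys before writing surfaces a malformed row locally rather than at behavior creation time.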
