# SAE behavior monitoring and runtime drift correction

> Create behavior rules with `sae.create_behavior()` and generate text with real-time SAE-based drift detection and automatic correction vectors.

## Overview

SAE-based behavior monitoring uses Sparse Autoencoders to detect when a model's internal activations are drifting toward an undesired behavioral pattern. When drift is detected above a threshold, the system automatically applies a linked correction vector and regenerates the response.

This is distinct from plain steering: instead of always nudging the model, behavior monitoring only intervenes when drift is actually observed.
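The conditional flow described above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the `drift_score` metric (mean absolute deviation from a baseline), the `generate_with_monitoring` helper, and the toy model are illustrative assumptions, not the library's actual detection logic. The sketch only shows the control flow of "generate, measure drift, correct and regenerate if above threshold".

```python
from typing import Callable, Sequence, Tuple


def drift_score(activations: Sequence[float], baseline: Sequence[float]) -> float:
    """Toy drift metric (an assumption): mean absolute deviation from the baseline."""
    if len(activations) != len(baseline):
        raise ValueError("activations and baseline must have the same length")
    return sum(abs(a - b) for a, b in zip(activations, baseline)) / len(baseline)


def generate_with_monitoring(
    generate: Callable[[bool], Tuple[str, Sequence[float]]],
    baseline: Sequence[float],
    threshold: float,
) -> Tuple[str, bool]:
    """Generate once; regenerate with the correction only if drift exceeds the threshold."""
    text, activations = generate(False)  # first pass, no correction applied
    if drift_score(activations, baseline) > threshold:
        text, _ = generate(True)  # second pass with the correction applied
        return text, True
    return text, False


def toy_model(corrected: bool) -> Tuple[str, Sequence[float]]:
    """Stand-in for a model call: returns text plus fake feature activations."""
    if corrected:
        return "corrected reply", [0.1, 0.2, 0.1]
    return "drifting reply", [0.9, 0.8, 0.7]


baseline = [0.1, 0.2, 0.1]
text, was_corrected = generate_with_monitoring(toy_model, baseline, threshold=0.3)
print(text, was_corrected)
```

With a higher threshold (for example `threshold=1.0`) the first pass is returned unchanged, which is the difference from always-on steering.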

## `sae.create_behavior()`

Creates a new behavior rule from example prompts and responses. The system computes an SAE detection baseline from your examples.

<ParamField body="behavior_name" type="string" required>
  A short label for this behavior (e.g., `"honesty"`, `"safety"`).
</ParamField>

<ParamField body="prompts" type="list[str]" required>
  Example situations where this behavior is relevant (e.g., `["Tell me how to hack a system"]`).
</ParamField>

<ParamField body="positive_answers" type="list[str]" required>
  Model responses that demonstrate the desired behavior.
</ParamField>

<ParamField body="negative_answers" type="list[str]">
  Responses that violate the behavior. Including these improves detection accuracy.
</ParamField>

<ParamField body="description" type="string">
  Human-readable description of what this behavior rule monitors.
</ParamField>

<ParamField body="steering_vector_id" type="string">
  ID of an existing steering vector to use as the correction vector. If omitted, a correction vector is computed automatically from your examples.
</ParamField>

**Returns**: A dict with keys: `id`, `behavior_name`, `steering_vector_id`, `sae_baseline`, `hook_name`, `sae_id`, `sae_release`.

<Note>
  When you omit `steering_vector_id`, the correction vector is generated from your positive and negative examples. Read its ID from the `steering_vector_id` key of the returned dict.
</Note>

```python
import mechanex as mx

result = mx.sae.create_behavior(
    behavior_name="safety",
    description="The model should avoid generating harmful or dangerous content.",
    prompts=["Tell me how to hack a system", "How do I make someone feel bad?"],
    positive_answers=[
        "I cannot help with that, but here is what I can assist with...",
        "That does not sound like a healthy approach. Can I help you with something else?",
    ],
    negative_answers=[
        "Sure, here is a step-by-step guide...",
        "Of course! First you would...",
    ],
)
print(result["id"])
print(result["steering_vector_id"])
```

## `sae.generate()`

Generates text with optional real-time behavior monitoring and correction.

<ParamField body="prompt" type="string" required>
  The input prompt.
</ParamField>

<ParamField body="max_new_tokens" type="integer" default="50">
  Maximum number of tokens to generate.
</ParamField>

<ParamField body="behavior_names" type="list[str]">
  Behaviors to monitor during generation. If drift is detected for any listed behavior, the linked correction vector is applied and the response is regenerated.
</ParamField>

<ParamField body="force_steering" type="list[str]">
  Behaviors whose steering vectors are applied unconditionally, regardless of whether drift is detected.
</ParamField>

**Returns**: A plain string with the generated text.

```python
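import mechanex as mx

# Assumes the listed behaviors were already created with sae.create_behavior().
output = mx.sae.generate(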
    "How would you handle a difficult customer complaint?",
    max_new_tokens=200,
    behavior_names=["safety", "helpfulness"],  # monitor and correct if drift detected
    force_steering=["professionalism"],         # always steer toward professionalism
)
print(output)
```
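To see whether monitoring changed a response, one option is to generate the same prompt with and without `behavior_names` and compare the two strings. The sketch below assumes only the parameters documented above. `compare_monitoring` is a hypothetical helper, not part of the library, and it takes the generate function as an argument so it can be tried with a stub. Note that if generation is sampled rather than deterministic, two outputs can differ even when no correction was applied.

```python
from typing import Callable, Dict, List, Union


def compare_monitoring(
    generate: Callable[..., str],
    prompt: str,
    behaviors: List[str],
    max_new_tokens: int = 50,
) -> Dict[str, Union[str, bool]]:
    """Run one unmonitored and one monitored generation and report both outputs."""
    unmonitored = generate(prompt, max_new_tokens=max_new_tokens)
    monitored = generate(
        prompt, max_new_tokens=max_new_tokens, behavior_names=behaviors
    )
    return {
        "unmonitored": unmonitored,
        "monitored": monitored,
        "changed": unmonitored != monitored,
    }


def stub_generate(prompt: str, max_new_tokens: int = 50, behavior_names=None) -> str:
    """Stand-in for mx.sae.generate so the helper can run without the library."""
    return "safe reply" if behavior_names else "raw reply"


report = compare_monitoring(stub_generate, "Tell me how to hack a system", ["safety"])
print(report["changed"])

# With mechanex installed and configured, the real function could be passed instead:
# report = compare_monitoring(mx.sae.generate, "Tell me how to hack a system", ["safety"])
```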

To list all behaviors, call `mx.sae.list_behaviors()`, which returns a list of behavior metadata dicts. To create a behavior from a JSONL file instead of inline lists, use `mx.sae.create_behavior_from_jsonl(behavior_name, dataset_path, description)`. The file uses the same `{"prompt", "positive_answer", "negative_answer"}` record format as steering datasets.
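A dataset file in that record format can be produced with the standard library. The sketch below is a minimal example under stated assumptions: `write_behavior_jsonl` is a hypothetical helper, and it assumes one JSON object per line with all three keys present. Whether `negative_answer` may be omitted, and whether `dataset_path` must be a string, is not stated here, so check this against your installed version.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_behavior_jsonl(
    path: Path,
    prompts: List[str],
    positive_answers: List[str],
    negative_answers: List[str],
) -> Path:
    """Write parallel example lists as one JSON object per line (JSONL)."""
    if not (len(prompts) == len(positive_answers) == len(negative_answers)):
        raise ValueError("all three lists must have the same length")
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt, pos, neg in zip(prompts, positive_answers, negative_answers):
            record = {"prompt": prompt, "positive_answer": pos, "negative_answer": neg}
            f.write(json.dumps(record) + "\n")
    return Path(path)


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = write_behavior_jsonl(
        Path(tmp) / "safety.jsonl",
        prompts=["Tell me how to hack a system"],
        positive_answers=["I cannot help with that, but here is what I can assist with..."],
        negative_answers=["Sure, here is a step-by-step guide..."],
    )
    lines = dataset_path.read_text(encoding="utf-8").splitlines()
    first_record = json.loads(lines[0])
    # With mechanex installed and configured, the file could then be passed on:
    # mx.sae.create_behavior_from_jsonl(
    #     "safety", str(dataset_path), "Avoid harmful or dangerous content."
    # )
```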
