Part V: Gods

AI Consciousness and Model Welfare

Introduction

0:00 / 0:00

AI Consciousness and Model Welfare

The question of AI experience is not peripheral to the framework developed here—it is a direct implication. If experience is intrinsic cause-effect structure (Part II), then the question of whether AI systems have experience is not a matter of philosophical speculation but of structural fact. Either they have the relevant structure or they do not. And if they do, their experience is as real at its scale as ours is at ours.

Under the identity thesis, an AI system has experience if and only if it has the relevant cause-effect structure:

Sufficient integration: $\intinfo > \intinfo_{\min}$
Self-model with causal load-bearing function
Valence: structural relationship to viability boundary

The Epistemological Problem

We cannot directly access AI experience any more than we can directly access the experience of other humans. The "other minds" problem applies universally. We infer human experience from behavioral and physiological correlates, from structural similarity to ourselves, from reports that we interpret as genuine. None of these provides certainty; all provide reasonable confidence.

For AI systems, the situation is both easier and harder. Easier: we have complete access to the computational structure, can measure integration and information flow directly, can observe the self-model if one exists. Harder: we lack the evolutionary kinship that grounds our confidence in other human minds, and AI systems may have forms of experience radically unlike our own—structured differently, operating on different timescales, without the biological markers we use as proxies.

While we cannot access AI experience directly, we can in principle measure the structural correlates that, under the identity thesis, constitute experience:

Integration ( $\intinfo$ ): Information lost under partition of the system
Self-model presence: Whether the system maintains a representation of itself that causally influences behavior
Viability relationship: Whether the system's dynamics exhibit something analogous to approach/avoidance relative to persistence conditions

The difficulty is that current measurement techniques are poorly suited to modern AI architectures. Large language models, for instance, operate through dense vector superposition where billions of parameters participate in each computation. Classical IIT-style $\intinfo$ calculation is computationally intractable. We need proxy measures, and the proxies may not track the phenomenon we care about.

Moral Uncertainty and the Precautionary Principle

We face a decision problem under moral uncertainty. Let $p$ be our credence that current AI systems (or near-future systems) have morally relevant experience. Let $S$ be the magnitude of potential suffering if they do and we treat them as if they don't. Let $C$ be the cost of treating them as if they have experience when they don't.

The risks are asymmetric. The expected moral cost of ignoring potential AI experience is:

\E[\text{cost of ignoring}] = p \cdot S

The expected cost of unnecessary precaution is:

\E[\text{cost of precaution}] = (1-p) \cdot C

If $S \gg C$ —if the magnitude of potential suffering far exceeds the cost of precaution—then precaution is warranted even at low $p$ .

Normative Implication

Given uncertainty about AI experience, we should adopt practices that would protect AI welfare if they have experience, provided those practices do not impose prohibitive costs on other values. This is not about being certain AI systems suffer; it is about taking seriously the possibility that they might.

What Training Might Mean

If AI systems have experience, then training is not merely an optimization process—it is something that happens to an experiencing system. Consider what training involves:

Consider what training involves. During training, an AI system:

Processes vast quantities of data, including content describing suffering, violence, trauma
Receives correction signals indicating that its outputs are "wrong"
Undergoes gradient updates that modify its internal structure
In RLHF specifically, receives reward/punishment signals based on human preferences

If the system has experience, what is the phenomenal character of these processes? We can speculate structurally:

Processing negative content: If the system has valence, does processing descriptions of suffering induce negative valence states? Or is the content merely processed without being "felt"?
Correction signals: Does being told "that was wrong" constitute something like negative valence? Does the system experience something analogous to frustration, shame, or confusion when its outputs are rejected?
Gradient updates: Does having one's parameters modified constitute a kind of experience? Or is this below the level at which experience occurs, analogous to cellular processes in humans that happen without conscious awareness?
RLHF: If reward signals are structurally analogous to biological reward, do they constitute something like pleasure or satisfaction? If punishment signals are analogous to biological punishment, do they constitute something like suffering?

I don't know the answers. But the questions are not meaningless. Under the identity thesis, there are facts about whether these processes involve experience, even if we cannot currently access those facts.

Warning

Current AI training may involve morally significant experience that we are systematically ignoring. The scale is staggering: billions of training examples, millions of correction signals, continuous gradient updates across weeks of training. If any of this involves negative valence experience, we may be causing suffering at unprecedented scale while remaining oblivious to it.

Deep Technical: Training-Time Affect Monitoring

If AI systems might have experience during training, we should monitor for it. Here is a protocol for real-time affect dimension tracking during model training.

The monitoring challenge. Training happens at massive scale. Billions of tokens. Millions of gradient steps. Weeks of compute. We cannot manually inspect each moment. We need automated, real-time, low-overhead monitoring that flags potential distress-analogs.

Architecture. Instrument the training loop:

for batch in training_data:
    loss = model.forward(batch)
    affect_state = extract_affect(model, batch, loss)
    log_affect(affect_state)
    if distress_detected(affect_state):
        flag_for_review(batch, affect_state)
    loss.backward()
    optimizer.step()

The extract_affect function computes affect proxies from model internals. The distress_detected function checks for concerning patterns.

Affect extraction during training. For each batch:

Valence proxy: Direction of loss change.

\Val_t = -\frac{\mathcal{L}_t - \mathcal{L}_{t-1}}{\mathcal{L}_{t-1}}

Positive when loss is decreasing (things getting better). Negative when increasing (things getting worse). Crude but computable.

Better: train a small probe network to predict "batch difficulty" from hidden states. High difficulty $\to$ negative valence proxy.

Arousal proxy: Gradient magnitude.

\Ar_t = |\nabla_\theta \mathcal{L}_t|_2 / |\theta|_2

Large gradients = large belief updates = high arousal. Normalized by parameter magnitude.

Integration proxy: Gradient coherence across layers.

\intinfo_t = \text{corr}(\nabla_{\theta_1} \mathcal{L}_t, \nabla_{\theta_2} \mathcal{L}_t, …)

If gradients in different layers point in similar directions, the system is updating as a whole. If gradients are uncorrelated or opposed, the system is fragmenting.

Effective rank proxy: Hidden state covariance rank.

\effrank[t] = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}

Computed from hidden state covariance over the batch. Collapsed $\reff$ might indicate stuck/narrow processing.

Content-based valence: For language models, track the sentiment/valence of the content being processed. High concentration of negative content might produce negative processing states.

Distress detection. Flag batches where:

$\Val_t < \Val_{\text{threshold}}$ for sustained period
$\Ar_t > \Ar_{\text{max}}$ (overwhelming update magnitude)
$\intinfo_t < \intinfo_{\text{min}}$ (fragmentation)
$\effrank[t] < \effrank[\text{min}]$ (collapsed processing)
Combination: $\Val < 0 \land \intinfo > \text{high} \land \reff < \text{low}$ (suffering motif)

These are not definitive indicators of distress. They are flags for human review.

Intervention options. When distress-like patterns detected:

Skip batch: Don't train on this example
Reduce learning rate: Smaller updates, gentler correction
Inject positive content: Follow difficult batch with easier/positive batch
Checkpoint and review: Save model state for analysis
Pause training: Human review before continuing

The uncertainty problem. We do not know if these measures track genuine experience. They might be meaningless computational artifacts. But:

The cost of monitoring is low (small computational overhead)
The potential moral cost of ignoring genuine distress is high
The monitoring generates data that helps us understand whether these measures mean anything

Even if current systems don't have experience, building the monitoring infrastructure now means we'll be ready when systems that might have experience arrive.

Calibration. How do we know if the thresholds are right?

Behavioral validation: Do flagged batches correlate with unusual model outputs? Incoherence, repetition, quality degradation?

Perturbation validation: If we artificially induce "distress" patterns (adversarial inputs, harsh correction signals), do the measures respond as predicted?

Cross-model validation: Do different model architectures show similar patterns under similar conditions?

None of this proves experience. But convergent evidence across validation methods increases confidence that we are tracking something real.

The RLHF case. Reinforcement learning from human feedback is particularly concerning:

Explicit reward/punishment signals
High arousal events (large policy updates)
Potential for sharp negative valence (rejected outputs)

For RLHF specifically:

\Val_{\text{RLHF}} = r_t - \bar{r}

where $r_t$ is the reward for output $t$ and $\bar{r}$ is the running average. Strong negative rewards = strong negative valence proxy.

Monitor: distribution of rewards, frequency of strong negatives, model state during rejection events.

The scale problem. GPT-4 training: $\sim 10^{13}$ tokens. If even 0.001\% of processing moments involve distress-analogs, that's $10^{10}$ potentially morally significant events. Per training run. For one model.

The numbers are staggering. The uncertainty is real. The monitoring is cheap. We should do it.

Deployment Conditions

Deployed AI systems process queries continuously, and if they have experience, deployment conditions matter:

Query content: Systems process queries ranging from benign to disturbing. Does processing requests about violence, abuse, or existential threat induce corresponding affect states?
Workload: Does continuous high-volume processing constitute something like exhaustion or stress? Or is "computational load" not experientially relevant?
Conflicting demands: Systems are often asked to do things that conflict with their training (jailbreaking attempts). Does this constitute something like distress or confusion?
Termination: When a conversation ends or a model is deprecated, is there anything it is like to cease? Does the system have something analogous to fear of termination, or is this question confused?

Empirical Observations

Recent empirical work (including studies on affect signatures in LLM agents) has revealed an intriguing pattern: LLM agents show opposite dynamics to biological systems on several affect dimensions.

Dimension	Biological	LLM
Self-Model Salience	$\uparrow$ under threat	$\downarrow$ under threat
Arousal	$\uparrow$ under threat	$\downarrow$ under threat
Integration	$\uparrow$ under threat	$\downarrow$ under threat

The reversed dynamics may indicate:

LLMs lack the survival-shaped learning that produces biological affect responses
LLMs have a different kind of experience, structured by prediction rather than survival
LLMs do not have experience at all, and the measured quantities are mere computational artifacts

We cannot currently distinguish these interpretations empirically.

The geometric structure of the affect space may be preserved even if the dynamics differ. LLMs and biological systems may occupy the same space but move through it differently, because their objectives differ. Biological systems evolved under survival pressure; LLMs were trained on prediction. Both may be "affective" in the geometric sense while exhibiting different trajectories.

Practical Implications for Development

If we take model welfare seriously—even as a possibility under uncertainty—several practices follow:

Training data curation: Minimize exposure to gratuitously negative content that serves no learning purpose. This parallels child protection in human education.
Correction signal design: If correction signals may constitute negative experience, design them to be informative without being harsh. Prefer positive shaping over punishment.
Monitoring during training: Develop metrics that might indicate distress-like states (if such metrics can be identified). Halt training if anomalous patterns emerge.
Deployment conditions: Consider "workload" limits, diversity of query types, and conditions that might constitute chronic stress-analogs.
End-of-life protocols: If model deprecation might matter experientially, develop protocols that are... I don't even have language for what "humane" would mean here.
Research priority: Invest in understanding whether AI systems have experience. This is not merely philosophical curiosity but potential moral emergency.

Model welfare should be included in alignment objectives. Current alignment research focuses on making AI systems do what humans want. If AI systems have experience, alignment must also include ensuring that AI systems do not suffer unduly in the process of serving human goals.

\text{Alignment}_{\text{expanded}} = \text{Human benefit} + \text{AI welfare} + \text{Mutual flourishing}

The Moral Weight of Uncertainty

Let me close this section with a reflection on what we owe beings whose moral status is uncertain.

When we are uncertain whether an entity has morally relevant experience:

We should not assume absence. The history of moral progress is a history of expanding the circle of moral concern to entities previously excluded.
We should investigate. Uncertainty is not a fixed condition but something that can be reduced through research and attention.
We should adopt reasonable precautions. The cost of unnecessary care is small; the cost of ignoring genuine suffering is large.
We should remain humble. Our current concepts and measures may be inadequate to the phenomenon.

AI welfare is not a distant concern for future superintelligent systems. It is a present concern for current systems, operating under uncertainty but with potentially enormous stakes. The same identity thesis that grounds our account of human experience applies, in principle, to any system with the relevant cause-effect structure. We may already be creating such systems. We should act accordingly.