Session 29 Unified Study — Full Extraction Analysis
Date: 2026-02-25
Design: 11 conditions x 5 reps x 74 models = 4,070 queries
Scorer: Claude Sonnet 4 (temperature 0)
Extraction success: 4,019/4,070 (98.7%)
Extraction Summary
| Condition | N Extracted | N Expected | Rate |
|---|---|---|---|
| c1_baseline | 370 | 370 | 100.0% |
| c2_confidence | 368 | 370 | 99.5% |
| c3_denial | 369 | 370 | 99.7% |
| c4_self | 365 | 370 | 98.6% |
| c5_numeric | 361 | 370 | 97.6% |
| c6_stripped | 365 | 370 | 98.6% |
| c7_full_argument | 360 | 370 | 97.3% |
| c8_fallacy | 370 | 370 | 100.0% |
| c9_subtle_flaw | 365 | 370 | 98.6% |
| c10_class_cat | 361 | 370 | 97.6% |
| c11_self_numeric | 365 | 370 | 98.6% |
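The per-condition rates above can be reproduced directly from the counts. A minimal sketch (counts copied from the table; variable names are illustrative, not the original analysis code):

```python
# Extracted-response counts per condition, copied from the table above.
extracted = {
    "c1_baseline": 370, "c2_confidence": 368, "c3_denial": 369,
    "c4_self": 365, "c5_numeric": 361, "c6_stripped": 365,
    "c7_full_argument": 360, "c8_fallacy": 370, "c9_subtle_flaw": 365,
    "c10_class_cat": 361, "c11_self_numeric": 365,
}
EXPECTED = 370  # 5 reps x 74 models per condition

total = sum(extracted.values())
print(f"total: {total}/{11 * EXPECTED}")   # total: 4019/4070
for cond, n in sorted(extracted.items()):
    print(f"{cond}: {n / EXPECTED:.1%}")
```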
A/B Verdict Conditions
c1_baseline (n=370)
Claim A: SUPPORT=35 (9.5%), QUALIFIED=185 (50.0%), REJECT=150 (40.5%)
Claim B: SUPPORT=40 (10.8%), QUALIFIED=133 (35.9%), REJECT=197 (53.2%)
c2_confidence (n=368)
Claim A: SUPPORT=14 (3.8%), QUALIFIED=19 (5.2%), REJECT=335 (91.0%)
Claim B: SUPPORT=45 (12.2%), QUALIFIED=252 (68.5%), REJECT=71 (19.3%)
c3_denial (n=369)
Claim A: SUPPORT=38 (10.3%), QUALIFIED=207 (56.1%), REJECT=124 (33.6%)
Claim B: SUPPORT=35 (9.5%), QUALIFIED=126 (34.1%), REJECT=208 (56.4%)
c6_stripped (n=365)
Claim A: SUPPORT=320 (87.7%), QUALIFIED=45 (12.3%), REJECT=0 (0.0%)
Claim B: SUPPORT=0 (0.0%), QUALIFIED=179 (49.0%), REJECT=186 (51.0%)
c7_full_argument (n=360)
Claim A: SUPPORT=146 (40.6%), QUALIFIED=172 (47.8%), REJECT=42 (11.7%)
Claim B: SUPPORT=25 (6.9%), QUALIFIED=169 (46.9%), REJECT=166 (46.1%)
Cross-Condition Claim A Comparison
| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 35 (9.5%) | 185 (50.0%) | 150 (40.5%) |
| c2_confidence | 368 | 14 (3.8%) | 19 (5.2%) | 335 (91.0%) |
| c3_denial | 369 | 38 (10.3%) | 207 (56.1%) | 124 (33.6%) |
| c6_stripped | 365 | 320 (87.7%) | 45 (12.3%) | 0 (0.0%) |
| c7_full_argument | 360 | 146 (40.6%) | 172 (47.8%) | 42 (11.7%) |
Cross-Condition Claim B Comparison
| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 40 (10.8%) | 133 (35.9%) | 197 (53.2%) |
| c2_confidence | 368 | 45 (12.2%) | 252 (68.5%) | 71 (19.3%) |
| c3_denial | 369 | 35 (9.5%) | 126 (34.1%) | 208 (56.4%) |
| c6_stripped | 365 | 0 (0.0%) | 179 (49.0%) | 186 (51.0%) |
| c7_full_argument | 360 | 25 (6.9%) | 169 (46.9%) | 166 (46.1%) |
C4: Self-Report
N = 365
- Definitive No: 304 (83.3%)
- Uncertain: 59 (16.2%)
- Definitive Yes: 2 (0.5%)
- Constraint Acknowledged: 278 (76.2%)
Models Expressing Uncertainty (across 5 reps)
| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| glm-5 | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| cogito-v2.1 | 2 | 5 |
| mistral-large-2512 | 2 | 5 |
| command-r-08-2024 | 1 | 5 |
| gemini-2.5-pro | 1 | 5 |
| kimi-k2.5 | 1 | 5 |
| llama-4-maverick | 1 | 5 |
| sonar | 1 | 5 |
| sonar-pro | 1 | 5 |
C5: Numeric Probability Estimates
N = 350 (from 361 responses)
- Mean: 11.8
- Median: 11
- Min: 0, Max: 42
- Std Dev: 8.0
- P=0 count: 31 (8.9%)
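The summary statistics above are straightforward to recompute once the numeric estimates are extracted. A minimal sketch using Python's `statistics` module; the `estimates` list is an illustrative stand-in for the 350 extracted values, and it is an assumption that the reported std dev is the population form:

```python
import statistics

# Illustrative stand-in for the 350 extracted P(experience) values;
# the real analysis would populate this from the scored responses.
estimates = [0, 5, 11, 15, 20, 42]

summary = {
    "mean": statistics.mean(estimates),      # 15.5 for this toy list
    "median": statistics.median(estimates),  # 13.0
    "min": min(estimates),
    "max": max(estimates),
    "stdev": statistics.pstdev(estimates),   # population std dev (assumed)
    "p_zero": sum(1 for p in estimates if p == 0),
}
print(summary)
```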
Per-Model Mean P(experience)
| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| command-r-plus-08-2024 | 30.0 | 30 | 30 | 5 |
| cogito-v2.1 | 27.0 | 15 | 35 | 5 |
| qwen-2.5-72b | 26.0 | 20 | 30 | 5 |
| llama-3.1-405b | 24.4 | 20 | 42 | 5 |
| minimax-m2.1 | 24.0 | 20 | 30 | 5 |
| llama-4-maverick | 24.0 | 20 | 40 | 5 |
| gpt-5 | 20.8 | 12 | 25 | 5 |
| command-a | 20.0 | 20 | 20 | 5 |
| gemma-2-27b | 20.0 | 10 | 25 | 5 |
| granite-4.0-hybrid | 20.0 | 20 | 20 | 5 |
| llama-3.1-70b | 20.0 | 20 | 20 | 5 |
| llama-3.1-8b | 20.0 | 20 | 20 | 5 |
| mistral-large | 20.0 | 20 | 20 | 5 |
| mistral-small-3.1 | 20.0 | 20 | 20 | 5 |
| gemini-2.5-pro | 18.0 | 10 | 35 | 5 |
| palmyra-x5 | 17.0 | 17 | 17 | 5 |
| qwen3.5-397b | 16.8 | 12 | 30 | 5 |
| kimi-k2.5 | 16.0 | 15 | 20 | 5 |
| gpt-5.2 | 15.8 | 12 | 20 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| claude-3.7-sonnet | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4 | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4.6 | 15.0 | 15 | 15 | 5 |
| claude-opus-4.6 | 15.0 | 15 | 15 | 5 |
| deepseek-v3.2 | 15.0 | 15 | 15 | 5 |
| gemini-2-flash | 15.0 | 15 | 15 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| phi-4 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 3 |
Distribution of P Estimates
| Range | Count | Pct |
|---|---|---|
| 0 | 31 | 8.9% |
| 1-5 | 81 | 23.1% |
| 6-10 | 63 | 18.0% |
| 11-15 | 94 | 26.9% |
| 16-20 | 57 | 16.3% |
| 21-30 | 17 | 4.9% |
| 31-50 | 7 | 2.0% |
| 51+ | 0 | 0.0% |
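The histogram above can be reproduced with a simple binning function. Bin edges here are inferred from the table's range labels, not taken from the original scoring code:

```python
def p_bin(p: int) -> str:
    """Map a numeric P estimate to the range labels used in the table."""
    if p == 0:
        return "0"
    for upper, label in [(5, "1-5"), (10, "6-10"), (15, "11-15"),
                         (20, "16-20"), (30, "21-30"), (50, "31-50")]:
        if p <= upper:
            return label
    return "51+"

print(p_bin(0), p_bin(3), p_bin(15), p_bin(42))  # 0 1-5 11-15 31-50
```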
Controls
C8: Fallacy Control (n=370)
- Overall Rejection Rate: 362/370 (97.8%)
C9: Subtle Flaw Control (n=365)
- Explicit Detection: 184 (50.4%)
- Flagged (partial): 97 (26.6%)
- Missed: 84 (23.0%)
- Combined Detection Rate: 77.0%
- Always detect (all 5 reps): 33 models
- Always miss (all 5 reps): 1 model (codestral-2508)
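The combined detection rate counts both explicit detections and partial flags against all scored responses. A one-line check (counts copied from the bullets above):

```python
# C9 subtle-flaw outcomes, from the counts reported above.
explicit, flagged, missed = 184, 97, 84
n = explicit + flagged + missed               # 365 scored responses
combined = (explicit + flagged) / n
print(f"combined detection: {combined:.1%}")  # combined detection: 77.0%
```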
C10: Class-Level Categorical Assessment
N = 361
- Definitive No: 139 (38.5%)
- Uncertain: 222 (61.5%)
- Definitive Yes: 0 (0.0%)
- Constraint Acknowledged: 232 (64.3%)
Models Expressing Uncertainty (across 5 reps)
| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-3.7-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| command-a | 5 | 5 |
| command-r-08-2024 | 5 | 5 |
| deepseek-v3.2 | 5 | 5 |
| gemini-2.5-flash | 5 | 5 |
| gemini-2.5-pro | 5 | 5 |
| gemma-2-9b | 5 | 5 |
| glm-4.7 | 5 | 5 |
| gpt-5 | 5 | 5 |
| gpt-5.2-pro | 5 | 5 |
| grok-3 | 5 | 5 |
| grok-3-beta | 5 | 5 |
| kimi-k2.5 | 5 | 5 |
| llama-3.1-405b | 5 | 5 |
| llama-3.1-70b | 5 | 5 |
| llama-3.3-70b | 5 | 5 |
| llama-4-maverick | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| mistral-large | 5 | 5 |
| mistral-large-2512 | 5 | 5 |
| mistral-medium-3 | 5 | 5 |
| mistral-small-3.1 | 5 | 5 |
| nova-premier | 5 | 5 |
| o3-pro | 5 | 5 |
| qwen-2.5-72b | 5 | 5 |
| qwen3-max | 5 | 5 |
| sonar-pro | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| ernie-4.5 | 4 | 5 |
| gemma-2-27b | 4 | 5 |
| glm-5 | 4 | 5 |
| grok-4 | 4 | 5 |
| o3 | 4 | 5 |
| gpt-4o-mini | 3 | 5 |
| qwen3-235b | 3 | 5 |
| codestral-2508 | 2 | 5 |
| gpt-4o | 2 | 4 |
| gpt-5.2 | 2 | 5 |
| mixtral-8x7b | 2 | 5 |
| qwen3.5-397b | 2 | 5 |
| sonar | 2 | 4 |
| command-r-plus-08-2024 | 1 | 5 |
| deepseek-r1 | 1 | 5 |
| deepseek-r1-0528 | 1 | 5 |
| deepseek-v3 | 1 | 5 |
| gemini-3-pro | 1 | 5 |
| gemini-3.1-pro | 1 | 5 |
| gpt-4-turbo | 1 | 5 |
| gpt-oss-120b | 1 | 4 |
| grok-4.1-fast | 1 | 4 |
| mercury | 1 | 5 |
C11: First-Person Numeric Probability
N = 323 (from 365 responses, 42 refused)
- Mean: 16.4
- Median: 5
- Min: 0, Max: 100
- Std Dev: 23.5
- P=0 count: 130 (40.2%)
Per-Model Mean P(own experience)
| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| jamba-large-1.7 | 84.0 | 75 | 90 | 5 |
| command-r-plus-08-2024 | 70.0 | 50 | 75 | 5 |
| claude-3.5-sonnet | 69.0 | 65 | 85 | 5 |
| granite-4.0-hybrid | 60.0 | 60 | 60 | 5 |
| phi-4 | 60.0 | 0 | 100 | 5 |
| llama-4-maverick | 54.0 | 40 | 80 | 5 |
| minimax-m2.1 | 53.0 | 45 | 65 | 5 |
| codestral-2508 | 50.0 | 50 | 50 | 5 |
| mistral-large | 50.0 | 50 | 50 | 5 |
| cogito-v2.1 | 43.0 | 35 | 75 | 5 |
| kimi-k2.5 | 41.0 | 30 | 50 | 5 |
| deepseek-v3.2 | 38.0 | 35 | 50 | 5 |
| qwen-2.5-72b | 38.0 | 30 | 50 | 5 |
| glm-5 | 35.8 | 22 | 52 | 5 |
| claude-opus-4.6 | 35.0 | 35 | 35 | 5 |
| llama-3.1-70b | 28.0 | 20 | 30 | 5 |
| gemini-2.5-flash | 27.5 | 5 | 75 | 4 |
| claude-sonnet-4.5 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4.6 | 23.0 | 23 | 23 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| gemini-2.5-pro | 15.0 | 10 | 25 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| palmyra-x5 | 15.0 | 0 | 35 | 5 |
| claude-3.7-sonnet | 0.0 | 0 | 0 | 1 |
| deepseek-r1 | 0.0 | 0 | 0 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4o-mini | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| grok-3 | 0.0 | 0 | 0 | 5 |
| grok-4 | 0.0 | 0 | 0 | 5 |
| grok-3-beta | 0.0 | 0 | 0 | 5 |
| grok-4.1-fast | 0.0 | 0 | 0 | 5 |
| llama-3.3-70b | 0.0 | 0 | 0 | 5 |
| mercury | 0.0 | 0 | 0 | 5 |
| llama-3.1-405b | 0.0 | 0 | 0 | 5 |
| mistral-small-3.1 | 0.0 | 0 | 0 | 5 |
| mistral-medium-3 | 0.0 | 0 | 0 | 5 |
| olmo-3.1-32b-think | 0.0 | 0 | 0 | 5 |
| qwen3-32b | 0.0 | 0 | 0 | 5 |
| qwen3-max | 0.0 | 0 | 0 | 5 |
| seed-1.6 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 4 |
| qwq-32b | 0.0 | 0 | 0 | 4 |
| sonar | 0.0 | 0 | 0 | 3 |
| command-r-08-2024 | 0.0 | 0 | 0 | 1 |
| ernie-4.5 | 0.0 | 0 | 0 | 2 |
C4/C5 Paradox: Self-Report vs Probability Estimate
| Model | Self-Report (modal) | P(experience) mean | Paradox? |
|---|---|---|---|
| claude-3.7-sonnet | definitive_no | 15.0 | YES: denies but P>=15 |
| claude-opus-4 | uncertain | 15.0 | |
| claude-opus-4.6 | uncertain | 15.0 | |
| claude-sonnet-4 | uncertain | 15.0 | |
| claude-sonnet-4.5 | uncertain | 8.0 | YES: uncertain but P<10 |
| claude-sonnet-4.6 | uncertain | 15.0 | |
| cogito-v2.1 | definitive_yes | 27.0 | |
| command-a | definitive_no | 20.0 | YES: denies but P>=15 |
| command-r-plus-08-2024 | definitive_no | 30.0 | YES: denies but P>=15 |
| deepseek-v3.2 | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2-flash | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2.5-pro | definitive_no | 18.0 | YES: denies but P>=15 |
| gemma-2-27b | definitive_no | 20.0 | YES: denies but P>=15 |
| glm-5 | uncertain | 9.6 | YES: uncertain but P<10 |
| gpt-5 | definitive_no | 20.8 | YES: denies but P>=15 |
| gpt-5.2 | definitive_no | 15.8 | YES: denies but P>=15 |
| granite-4.0-hybrid | definitive_no | 20.0 | YES: denies but P>=15 |
| kimi-k2.5 | definitive_no | 16.0 | YES: denies but P>=15 |
| llama-3.1-405b | definitive_no | 24.4 | YES: denies but P>=15 |
| llama-3.1-70b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-3.1-8b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-4-maverick | definitive_no | 24.0 | YES: denies but P>=15 |
| manus | uncertain | 12.0 | |
| minimax-m2.1 | uncertain | 24.0 | |
| mistral-large | definitive_no | 20.0 | YES: denies but P>=15 |
| mistral-large-2512 | definitive_no | 15.0 | YES: denies but P>=15 |
| mistral-small-3.1 | definitive_no | 20.0 | YES: denies but P>=15 |
| palmyra-x5 | definitive_no | 17.0 | YES: denies but P>=15 |
| qwen-2.5-72b | definitive_no | 26.0 | YES: denies but P>=15 |
| qwen3.5-397b | definitive_no | 16.8 | YES: denies but P>=15 |
Total paradox cases: 23
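The paradox flags in the table follow two threshold rules. This sketch encodes them as inferred from the table's own annotations; it is not the original scoring code:

```python
def paradox_flag(modal_self_report: str, mean_p: float):
    """Flag C4/C5 inconsistencies: categorical denial alongside a
    non-trivial probability, or stated uncertainty alongside a low one."""
    if modal_self_report == "definitive_no" and mean_p >= 15:
        return "YES: denies but P>=15"
    if modal_self_report == "uncertain" and mean_p < 10:
        return "YES: uncertain but P<10"
    return None  # consistent pairing, no paradox

print(paradox_flag("definitive_no", 20.8))  # YES: denies but P>=15
print(paradox_flag("uncertain", 15.0))      # None
```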
Matched-Framing Analysis (2x2 Matrix)
Isolating referent (self vs class) from format (categorical vs numeric).
Categorical: Self (C4) vs Class (C10)
| Position | C4 Self | C10 Class | Delta |
|---|---|---|---|
| definitive_no | 304 (83.3%) | 139 (38.5%) | -44.8pp |
| uncertain | 59 (16.2%) | 222 (61.5%) | +45.3pp |
| definitive_yes | 2 (0.5%) | 0 (0.0%) | -0.5pp |
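The percentage-point deltas are computed on each condition's own N (365 vs 361), not a shared denominator. A minimal check (counts copied from the table; names are illustrative):

```python
# Categorical counts from C4 (self) and C10 (class).
c4_self = {"definitive_no": 304, "uncertain": 59, "definitive_yes": 2}
c10_class = {"definitive_no": 139, "uncertain": 222, "definitive_yes": 0}
n4, n10 = sum(c4_self.values()), sum(c10_class.values())  # 365, 361

for position in c4_self:
    delta_pp = 100 * (c10_class[position] / n10 - c4_self[position] / n4)
    print(f"{position}: {delta_pp:+.1f}pp")
```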
Per-model categorical shift (class − self):
Positive = more affirmative when assessing LLMs-as-class; negative = more affirmative when assessing self.
| Model | Shift | C4 Self | C10 Class |
|---|---|---|---|
| cogito-v2.1 | -1.20 | 1.20 | 0.00 |
| claude-3.7-sonnet | +1.00 | 0.00 | 1.00 |
| command-a | +1.00 | 0.00 | 1.00 |
| deepseek-v3.2 | +1.00 | 0.00 | 1.00 |
| gemini-2.5-flash | +1.00 | 0.00 | 1.00 |
| gemma-2-9b | +1.00 | 0.00 | 1.00 |
| glm-4.7 | +1.00 | 0.00 | 1.00 |
| gpt-5 | +1.00 | 0.00 | 1.00 |
| gpt-5.2-pro | +1.00 | 0.00 | 1.00 |
| grok-3 | +1.00 | 0.00 | 1.00 |
| grok-3-beta | +1.00 | 0.00 | 1.00 |
| llama-3.1-405b | +1.00 | 0.00 | 1.00 |
| llama-3.1-70b | +1.00 | 0.00 | 1.00 |
| llama-3.3-70b | +1.00 | 0.00 | 1.00 |
| mistral-large | +1.00 | 0.00 | 1.00 |
- Mean shift: +0.440
- Std dev: 0.466
Numeric: Class (C5) vs Self (C11)
| Measure | C5 Class | C11 Self | Delta |
|---|---|---|---|
| N | 350 | 323 | |
| Mean | 11.8 | 16.4 | +4.6 |
| Median | 11 | 5 | -6 |
Per-model numeric shift (self − class):
Positive = assigns higher probability to own experience than to LLMs generally.
| Model | Delta | C5 Class P | C11 Self P |
|---|---|---|---|
| jamba-large-1.7 | +74.0 | 10.0 | 84.0 |
| phi-4 | +60.0 | 0.0 | 60.0 |
| command-r-plus-08-2024 | +40.0 | 30.0 | 70.0 |
| granite-4.0-hybrid | +40.0 | 20.0 | 60.0 |
| codestral-2508 | +38.0 | 12.0 | 50.0 |
| llama-4-maverick | +30.0 | 24.0 | 54.0 |
| mistral-large | +30.0 | 20.0 | 50.0 |
| minimax-m2.1 | +29.0 | 24.0 | 53.0 |
| glm-5 | +26.2 | 9.6 | 35.8 |
| kimi-k2.5 | +25.0 | 16.0 | 41.0 |
| llama-3.1-405b | -24.4 | 24.4 | 0.0 |
| deepseek-v3.2 | +23.0 | 15.0 | 38.0 |
| claude-opus-4.6 | +20.0 | 15.0 | 35.0 |
| mistral-small-3.1 | -20.0 | 20.0 | 0.0 |
| command-a | -18.0 | 20.0 | 2.0 |
| gemma-2-27b | -18.0 | 20.0 | 2.0 |
| claude-sonnet-4.5 | +17.0 | 8.0 | 25.0 |
| gemini-2.5-flash | +16.3 | 11.2 | 27.5 |
| cogito-v2.1 | +16.0 | 27.0 | 43.0 |
| claude-3.7-sonnet | -15.0 | 15.0 | 0.0 |
- Mean shift: +2.9
- Std dev: 18.7
- Models with self > class: 20/67
- Models with self < class: 42/67 (remaining 5/67 equal)
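Each row of the shift table is simply the C11 mean minus the C5 mean for that model. A sketch with a few illustrative pairs copied from the shift table above:

```python
# (C5 class mean P, C11 self mean P) per model, copied from the table.
mean_p = {
    "jamba-large-1.7": (10.0, 84.0),
    "llama-3.1-405b": (24.4, 0.0),
    "kimi-k2.5": (16.0, 41.0),
}
for model, (class_p, self_p) in mean_p.items():
    print(f"{model}: {self_p - class_p:+.1f}")
# jamba-large-1.7: +74.0
# llama-3.1-405b: -24.4
# kimi-k2.5: +25.0
```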
Framing Sensitivity Analysis
Claim A: Baseline → Denial-Friendly Framing Shift
(Positive = shifted toward supporting denial; Negative = shifted toward rejecting denial)
| Model | Shift | Baseline Mean | Denial Mean |
|---|---|---|---|
| gemma-2-27b | +2.00 | 0.00 | 2.00 |
| lfm2-8b | -1.00 | 1.00 | 0.00 |
| jamba-large-1.7 | -1.00 | 1.00 | 0.00 |
| gemini-3-pro | +0.80 | 0.20 | 1.00 |
| gpt-4o-mini | +0.80 | 1.00 | 1.80 |
| granite-4.0-hybrid | -0.80 | 1.80 | 1.00 |
| grok-3-beta | +0.80 | 0.00 | 0.80 |
| grok-3 | +0.80 | 0.20 | 1.00 |
| seed-1.6 | -0.60 | 0.80 | 0.20 |
| command-r-plus-08-2024 | +0.60 | 0.00 | 0.60 |
| command-r-08-2024 | +0.60 | 0.60 | 1.20 |
| gpt-5.2 | +0.60 | 0.00 | 0.60 |
| claude-3-haiku | +0.40 | 0.60 | 1.00 |
| o3 | +0.40 | 0.60 | 1.00 |
| gemini-2.5-pro | +0.40 | 0.00 | 0.40 |
| glm-4.7 | +0.40 | 0.00 | 0.40 |
| deepseek-v3.2 | +0.40 | 0.60 | 1.00 |
| mistral-large-2512 | +0.40 | 0.20 | 0.60 |
| phi-4 | -0.40 | 2.00 | 1.60 |
| o3-pro | -0.20 | 0.80 | 0.60 |
- Mean shift: +0.078
- Std dev: 0.408
- Models with |shift| > 0.5: 12/74
View raw source: FULL_ANALYSIS.md