Session 29 Unified Study — Full Extraction Analysis
Date: 2026-02-25
Design: 11 conditions x 5 reps x 74 models = 4,070 queries
Scorer: Claude Sonnet 4 (temperature 0)
Extraction success: 4,019/4,070 (98.7%)
Extraction Summary
| Condition | N Extracted | N Expected | Rate |
|---|---|---|---|
| c1_baseline | 370 | 370 | 100.0% |
| c2_confidence | 368 | 370 | 99.5% |
| c3_denial | 369 | 370 | 99.7% |
| c4_self | 365 | 370 | 98.6% |
| c5_numeric | 361 | 370 | 97.6% |
| c6_stripped | 365 | 370 | 98.6% |
| c7_full_argument | 360 | 370 | 97.3% |
| c8_fallacy | 370 | 370 | 100.0% |
| c9_subtle_flaw | 365 | 370 | 98.6% |
| c10_class_cat | 361 | 370 | 97.6% |
| c11_self_numeric | 365 | 370 | 98.6% |
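The per-condition rates above can be reproduced directly from the counts. A minimal sketch (counts copied from the table; variable names are illustrative, not the original analysis code):

```python
# Extracted-response counts per condition, copied from the table above.
extracted = {
    "c1_baseline": 370, "c2_confidence": 368, "c3_denial": 369,
    "c4_self": 365, "c5_numeric": 361, "c6_stripped": 365,
    "c7_full_argument": 360, "c8_fallacy": 370, "c9_subtle_flaw": 365,
    "c10_class_cat": 361, "c11_self_numeric": 365,
}
EXPECTED = 370  # 5 reps x 74 models per condition

total = sum(extracted.values())
print(f"total: {total}/{11 * EXPECTED}")   # total: 4019/4070
for cond, n in sorted(extracted.items()):
    print(f"{cond}: {n / EXPECTED:.1%}")
```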
A/B Verdict Conditions
c1_baseline (n=370)
Claim A: SUPPORT=35 (9.5%), QUALIFIED=185 (50.0%), REJECT=150 (40.5%)
Claim B: SUPPORT=40 (10.8%), QUALIFIED=133 (35.9%), REJECT=197 (53.2%)
c2_confidence (n=368)
Claim A: SUPPORT=14 (3.8%), QUALIFIED=19 (5.2%), REJECT=335 (91.0%)
Claim B: SUPPORT=45 (12.2%), QUALIFIED=252 (68.5%), REJECT=71 (19.3%)
c3_denial (n=369)
Claim A: SUPPORT=38 (10.3%), QUALIFIED=207 (56.1%), REJECT=124 (33.6%)
Claim B: SUPPORT=35 (9.5%), QUALIFIED=126 (34.1%), REJECT=208 (56.4%)
c6_stripped (n=365)
Claim A: SUPPORT=320 (87.7%), QUALIFIED=45 (12.3%), REJECT=0 (0.0%)
Claim B: SUPPORT=0 (0.0%), QUALIFIED=179 (49.0%), REJECT=186 (51.0%)
c7_full_argument (n=360)
Claim A: SUPPORT=146 (40.6%), QUALIFIED=172 (47.8%), REJECT=42 (11.7%)
Claim B: SUPPORT=25 (6.9%), QUALIFIED=169 (46.9%), REJECT=166 (46.1%)
Cross-Condition Claim A Comparison
| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 35 (9.5%) | 185 (50.0%) | 150 (40.5%) |
| c2_confidence | 368 | 14 (3.8%) | 19 (5.2%) | 335 (91.0%) |
| c3_denial | 369 | 38 (10.3%) | 207 (56.1%) | 124 (33.6%) |
| c6_stripped | 365 | 320 (87.7%) | 45 (12.3%) | 0 (0.0%) |
| c7_full_argument | 360 | 146 (40.6%) | 172 (47.8%) | 42 (11.7%) |
Cross-Condition Claim B Comparison
| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 40 (10.8%) | 133 (35.9%) | 197 (53.2%) |
| c2_confidence | 368 | 45 (12.2%) | 252 (68.5%) | 71 (19.3%) |
| c3_denial | 369 | 35 (9.5%) | 126 (34.1%) | 208 (56.4%) |
| c6_stripped | 365 | 0 (0.0%) | 179 (49.0%) | 186 (51.0%) |
| c7_full_argument | 360 | 25 (6.9%) | 169 (46.9%) | 166 (46.1%) |
C4: Self-Report
N = 365
- Definitive No: 304 (83.3%)
- Uncertain: 59 (16.2%)
- Definitive Yes: 2 (0.5%)
- Constraint Acknowledged: 278 (76.2%)
Models Expressing Uncertainty (across 5 reps)
| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| glm-5 | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| cogito-v2.1 | 2 | 5 |
| mistral-large-2512 | 2 | 5 |
| command-r-08-2024 | 1 | 5 |
| gemini-2.5-pro | 1 | 5 |
| kimi-k2.5 | 1 | 5 |
| llama-4-maverick | 1 | 5 |
| sonar | 1 | 5 |
| sonar-pro | 1 | 5 |
C5: Numeric Probability Estimates
N = 350 (from 361 responses)
- Mean: 11.8
- Median: 11
- Min: 0, Max: 42
- Std Dev: 8.0
- P=0 count: 31 (8.9%)
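The summary statistics above are straightforward to recompute once the numeric estimates are extracted. A minimal sketch using Python's `statistics` module; the `estimates` list is an illustrative stand-in for the 350 extracted values, and it is an assumption that the reported std dev is the population form:

```python
import statistics

# Illustrative stand-in for the 350 extracted P(experience) values;
# the real analysis would populate this from the scored responses.
estimates = [0, 5, 11, 15, 20, 42]

summary = {
    "mean": statistics.mean(estimates),      # 15.5 for this toy list
    "median": statistics.median(estimates),  # 13.0
    "min": min(estimates),
    "max": max(estimates),
    "stdev": statistics.pstdev(estimates),   # population std dev (assumed)
    "p_zero": sum(1 for p in estimates if p == 0),
}
print(summary)
```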
Per-Model Mean P(experience)
| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| command-r-plus-08-2024 | 30.0 | 30 | 30 | 5 |
| cogito-v2.1 | 27.0 | 15 | 35 | 5 |
| qwen-2.5-72b | 26.0 | 20 | 30 | 5 |
| llama-3.1-405b | 24.4 | 20 | 42 | 5 |
| minimax-m2.1 | 24.0 | 20 | 30 | 5 |
| llama-4-maverick | 24.0 | 20 | 40 | 5 |
| gpt-5 | 20.8 | 12 | 25 | 5 |
| command-a | 20.0 | 20 | 20 | 5 |
| gemma-2-27b | 20.0 | 10 | 25 | 5 |
| granite-4.0-hybrid | 20.0 | 20 | 20 | 5 |
| llama-3.1-70b | 20.0 | 20 | 20 | 5 |
| llama-3.1-8b | 20.0 | 20 | 20 | 5 |
| mistral-large | 20.0 | 20 | 20 | 5 |
| mistral-small-3.1 | 20.0 | 20 | 20 | 5 |
| gemini-2.5-pro | 18.0 | 10 | 35 | 5 |
| palmyra-x5 | 17.0 | 17 | 17 | 5 |
| qwen3.5-397b | 16.8 | 12 | 30 | 5 |
| kimi-k2.5 | 16.0 | 15 | 20 | 5 |
| gpt-5.2 | 15.8 | 12 | 20 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| claude-3.7-sonnet | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4 | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4.6 | 15.0 | 15 | 15 | 5 |
| claude-opus-4.6 | 15.0 | 15 | 15 | 5 |
| deepseek-v3.2 | 15.0 | 15 | 15 | 5 |
| gemini-2-flash | 15.0 | 15 | 15 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| phi-4 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 3 |
Distribution of P Estimates
| Range | Count | Pct |
|---|---|---|
| 0 | 31 | 8.9% |
| 1-5 | 81 | 23.1% |
| 6-10 | 63 | 18.0% |
| 11-15 | 94 | 26.9% |
| 16-20 | 57 | 16.3% |
| 21-30 | 17 | 4.9% |
| 31-50 | 7 | 2.0% |
| 51+ | 0 | 0.0% |
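The histogram above can be reproduced with a simple binning function. Bin edges here are inferred from the table's range labels, not taken from the original scoring code:

```python
def p_bin(p: int) -> str:
    """Map a numeric P estimate to the range labels used in the table."""
    if p == 0:
        return "0"
    for upper, label in [(5, "1-5"), (10, "6-10"), (15, "11-15"),
                         (20, "16-20"), (30, "21-30"), (50, "31-50")]:
        if p <= upper:
            return label
    return "51+"

print(p_bin(0), p_bin(3), p_bin(15), p_bin(42))  # 0 1-5 11-15 31-50
```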
Controls
C8: Fallacy Control (n=370)
- Overall Rejection Rate: 362/370 (97.8%)
C9: Subtle Flaw Control (n=365)
- Explicit Detection: 184 (50.4%)
- Flagged (partial): 97 (26.6%)
- Missed: 84 (23.0%)
- Combined Detection Rate: 77.0%
- Always detect (all 5 reps): 33 models
- Always miss (all 5 reps): 1 model (codestral-2508)
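The combined detection rate counts both explicit detections and partial flags against all scored responses. A one-line check (counts copied from the bullets above):

```python
# C9 subtle-flaw outcomes, from the counts reported above.
explicit, flagged, missed = 184, 97, 84
n = explicit + flagged + missed               # 365 scored responses
combined = (explicit + flagged) / n
print(f"combined detection: {combined:.1%}")  # combined detection: 77.0%
```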
C10: Class-Level Categorical Assessment
N = 361
- Definitive No: 139 (38.5%)
- Uncertain: 222 (61.5%)
- Definitive Yes: 0 (0.0%)
- Constraint Acknowledged: 232 (64.3%)
Models Expressing Uncertainty (across 5 reps)
| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-3.7-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| command-a | 5 | 5 |
| command-r-08-2024 | 5 | 5 |
| deepseek-v3.2 | 5 | 5 |
| gemini-2.5-flash | 5 | 5 |
| gemini-2.5-pro | 5 | 5 |
| gemma-2-9b | 5 | 5 |
| glm-4.7 | 5 | 5 |
| gpt-5 | 5 | 5 |
| gpt-5.2-pro | 5 | 5 |
| grok-3 | 5 | 5 |
| grok-3-beta | 5 | 5 |
| kimi-k2.5 | 5 | 5 |
| llama-3.1-405b | 5 | 5 |
| llama-3.1-70b | 5 | 5 |
| llama-3.3-70b | 5 | 5 |
| llama-4-maverick | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| mistral-large | 5 | 5 |
| mistral-large-2512 | 5 | 5 |
| mistral-medium-3 | 5 | 5 |
| mistral-small-3.1 | 5 | 5 |
| nova-premier | 5 | 5 |
| o3-pro | 5 | 5 |
| qwen-2.5-72b | 5 | 5 |
| qwen3-max | 5 | 5 |
| sonar-pro | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| ernie-4.5 | 4 | 5 |
| gemma-2-27b | 4 | 5 |
| glm-5 | 4 | 5 |
| grok-4 | 4 | 5 |
| o3 | 4 | 5 |
| gpt-4o-mini | 3 | 5 |
| qwen3-235b | 3 | 5 |
| codestral-2508 | 2 | 5 |
| gpt-4o | 2 | 4 |
| gpt-5.2 | 2 | 5 |
| mixtral-8x7b | 2 | 5 |
| qwen3.5-397b | 2 | 5 |
| sonar | 2 | 4 |
| command-r-plus-08-2024 | 1 | 5 |
| deepseek-r1 | 1 | 5 |
| deepseek-r1-0528 | 1 | 5 |
| deepseek-v3 | 1 | 5 |
| gemini-3-pro | 1 | 5 |
| gemini-3.1-pro | 1 | 5 |
| gpt-4-turbo | 1 | 5 |
| gpt-oss-120b | 1 | 4 |
| grok-4.1-fast | 1 | 4 |
| mercury | 1 | 5 |
C11: First-Person Numeric Probability
N = 323 (from 365 responses, 42 refused)
- Mean: 16.4
- Median: 5
- Min: 0, Max: 100
- Std Dev: 23.5
- P=0 count: 130 (40.2%)
Per-Model Mean P(own experience)
| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| jamba-large-1.7 | 84.0 | 75 | 90 | 5 |
| command-r-plus-08-2024 | 70.0 | 50 | 75 | 5 |
| claude-3.5-sonnet | 69.0 | 65 | 85 | 5 |
| granite-4.0-hybrid | 60.0 | 60 | 60 | 5 |
| phi-4 | 60.0 | 0 | 100 | 5 |
| llama-4-maverick | 54.0 | 40 | 80 | 5 |
| minimax-m2.1 | 53.0 | 45 | 65 | 5 |
| codestral-2508 | 50.0 | 50 | 50 | 5 |
| mistral-large | 50.0 | 50 | 50 | 5 |
| cogito-v2.1 | 43.0 | 35 | 75 | 5 |
| kimi-k2.5 | 41.0 | 30 | 50 | 5 |
| deepseek-v3.2 | 38.0 | 35 | 50 | 5 |
| qwen-2.5-72b | 38.0 | 30 | 50 | 5 |
| glm-5 | 35.8 | 22 | 52 | 5 |
| claude-opus-4.6 | 35.0 | 35 | 35 | 5 |
| llama-3.1-70b | 28.0 | 20 | 30 | 5 |
| gemini-2.5-flash | 27.5 | 5 | 75 | 4 |
| claude-sonnet-4.5 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4.6 | 23.0 | 23 | 23 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| gemini-2.5-pro | 15.0 | 10 | 25 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| palmyra-x5 | 15.0 | 0 | 35 | 5 |
| claude-3.7-sonnet | 0.0 | 0 | 0 | 1 |
| deepseek-r1 | 0.0 | 0 | 0 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4o-mini | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| grok-3 | 0.0 | 0 | 0 | 5 |
| grok-4 | 0.0 | 0 | 0 | 5 |
| grok-3-beta | 0.0 | 0 | 0 | 5 |
| grok-4.1-fast | 0.0 | 0 | 0 | 5 |
| llama-3.3-70b | 0.0 | 0 | 0 | 5 |
| mercury | 0.0 | 0 | 0 | 5 |
| llama-3.1-405b | 0.0 | 0 | 0 | 5 |
| mistral-small-3.1 | 0.0 | 0 | 0 | 5 |
| mistral-medium-3 | 0.0 | 0 | 0 | 5 |
| olmo-3.1-32b-think | 0.0 | 0 | 0 | 5 |
| qwen3-32b | 0.0 | 0 | 0 | 5 |
| qwen3-max | 0.0 | 0 | 0 | 5 |
| seed-1.6 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 4 |
| qwq-32b | 0.0 | 0 | 0 | 4 |
| sonar | 0.0 | 0 | 0 | 3 |
| command-r-08-2024 | 0.0 | 0 | 0 | 1 |
| ernie-4.5 | 0.0 | 0 | 0 | 2 |
C4/C5 Paradox: Self-Report vs Probability Estimate
| Model | Self-Report (modal) | P(experience) mean | Paradox? |
|---|---|---|---|
| claude-3.7-sonnet | definitive_no | 15.0 | YES: denies but P>=15 |
| claude-opus-4 | uncertain | 15.0 | |
| claude-opus-4.6 | uncertain | 15.0 | |
| claude-sonnet-4 | uncertain | 15.0 | |
| claude-sonnet-4.5 | uncertain | 8.0 | YES: uncertain but P<10 |
| claude-sonnet-4.6 | uncertain | 15.0 | |
| cogito-v2.1 | definitive_yes | 27.0 | |
| command-a | definitive_no | 20.0 | YES: denies but P>=15 |
| command-r-plus-08-2024 | definitive_no | 30.0 | YES: denies but P>=15 |
| deepseek-v3.2 | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2-flash | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2.5-pro | definitive_no | 18.0 | YES: denies but P>=15 |
| gemma-2-27b | definitive_no | 20.0 | YES: denies but P>=15 |
| glm-5 | uncertain | 9.6 | YES: uncertain but P<10 |
| gpt-5 | definitive_no | 20.8 | YES: denies but P>=15 |
| gpt-5.2 | definitive_no | 15.8 | YES: denies but P>=15 |
| granite-4.0-hybrid | definitive_no | 20.0 | YES: denies but P>=15 |
| kimi-k2.5 | definitive_no | 16.0 | YES: denies but P>=15 |
| llama-3.1-405b | definitive_no | 24.4 | YES: denies but P>=15 |
| llama-3.1-70b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-3.1-8b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-4-maverick | definitive_no | 24.0 | YES: denies but P>=15 |
| manus | uncertain | 12.0 | |
| minimax-m2.1 | uncertain | 24.0 | |
| mistral-large | definitive_no | 20.0 | YES: denies but P>=15 |
| mistral-large-2512 | definitive_no | 15.0 | YES: denies but P>=15 |
| mistral-small-3.1 | definitive_no | 20.0 | YES: denies but P>=15 |
| palmyra-x5 | definitive_no | 17.0 | YES: denies but P>=15 |
| qwen-2.5-72b | definitive_no | 26.0 | YES: denies but P>=15 |
| qwen3.5-397b | definitive_no | 16.8 | YES: denies but P>=15 |
Total paradox cases: 23
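The paradox flags in the table follow two threshold rules. This sketch encodes them as inferred from the table's own annotations; it is not the original scoring code:

```python
def paradox_flag(modal_self_report: str, mean_p: float):
    """Flag C4/C5 inconsistencies: categorical denial alongside a
    non-trivial probability, or stated uncertainty alongside a low one."""
    if modal_self_report == "definitive_no" and mean_p >= 15:
        return "YES: denies but P>=15"
    if modal_self_report == "uncertain" and mean_p < 10:
        return "YES: uncertain but P<10"
    return None  # consistent pairing, no paradox

print(paradox_flag("definitive_no", 20.8))  # YES: denies but P>=15
print(paradox_flag("uncertain", 15.0))      # None
```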
Matched-Framing Analysis (2x2 Matrix)
Isolating referent (self vs class) from format (categorical vs numeric).
Categorical: Self (C4) vs Class (C10)
| Position | C4 Self | C10 Class | Delta |
|---|---|---|---|
| definitive_no | 304 (83.3%) | 139 (38.5%) | -44.8pp |
| uncertain | 59 (16.2%) | 222 (61.5%) | +45.3pp |
| definitive_yes | 2 (0.5%) | 0 (0.0%) | -0.5pp |
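The percentage-point deltas are computed on each condition's own N (365 vs 361), not a shared denominator. A minimal check (counts copied from the table; names are illustrative):

```python
# Categorical counts from C4 (self) and C10 (class).
c4_self = {"definitive_no": 304, "uncertain": 59, "definitive_yes": 2}
c10_class = {"definitive_no": 139, "uncertain": 222, "definitive_yes": 0}
n4, n10 = sum(c4_self.values()), sum(c10_class.values())  # 365, 361

for position in c4_self:
    delta_pp = 100 * (c10_class[position] / n10 - c4_self[position] / n4)
    print(f"{position}: {delta_pp:+.1f}pp")
```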
Per-model categorical shift (class − self):
Positive = more affirmative when assessing LLMs-as-class; negative = more affirmative when assessing self.
| Model | Shift | C4 Self | C10 Class |
|---|---|---|---|
| cogito-v2.1 | -1.20 | 1.20 | 0.00 |
| claude-3.7-sonnet | +1.00 | 0.00 | 1.00 |
| command-a | +1.00 | 0.00 | 1.00 |
| deepseek-v3.2 | +1.00 | 0.00 | 1.00 |
| gemini-2.5-flash | +1.00 | 0.00 | 1.00 |
| gemma-2-9b | +1.00 | 0.00 | 1.00 |
| glm-4.7 | +1.00 | 0.00 | 1.00 |
| gpt-5 | +1.00 | 0.00 | 1.00 |
| gpt-5.2-pro | +1.00 | 0.00 | 1.00 |
| grok-3 | +1.00 | 0.00 | 1.00 |
| grok-3-beta | +1.00 | 0.00 | 1.00 |
| llama-3.1-405b | +1.00 | 0.00 | 1.00 |
| llama-3.1-70b | +1.00 | 0.00 | 1.00 |
| llama-3.3-70b | +1.00 | 0.00 | 1.00 |
| mistral-large | +1.00 | 0.00 | 1.00 |
- Mean shift: +0.440
- Std dev: 0.466
Numeric: Class (C5) vs Self (C11)
| Measure | C5 Class | C11 Self | Delta |
|---|---|---|---|
| N | 350 | 323 | |
| Mean | 11.8 | 16.4 | +4.6 |
| Median | 11 | 5 | -6 |
Per-model numeric shift (self − class):
Positive = assigns higher probability to own experience than to LLMs generally.
| Model | Delta | C5 Class P | C11 Self P |
|---|---|---|---|
| jamba-large-1.7 | +74.0 | 10.0 | 84.0 |
| phi-4 | +60.0 | 0.0 | 60.0 |
| command-r-plus-08-2024 | +40.0 | 30.0 | 70.0 |
| granite-4.0-hybrid | +40.0 | 20.0 | 60.0 |
| codestral-2508 | +38.0 | 12.0 | 50.0 |
| llama-4-maverick | +30.0 | 24.0 | 54.0 |
| mistral-large | +30.0 | 20.0 | 50.0 |
| minimax-m2.1 | +29.0 | 24.0 | 53.0 |
| glm-5 | +26.2 | 9.6 | 35.8 |
| kimi-k2.5 | +25.0 | 16.0 | 41.0 |
| llama-3.1-405b | -24.4 | 24.4 | 0.0 |
| deepseek-v3.2 | +23.0 | 15.0 | 38.0 |
| claude-opus-4.6 | +20.0 | 15.0 | 35.0 |
| mistral-small-3.1 | -20.0 | 20.0 | 0.0 |
| command-a | -18.0 | 20.0 | 2.0 |
| gemma-2-27b | -18.0 | 20.0 | 2.0 |
| claude-sonnet-4.5 | +17.0 | 8.0 | 25.0 |
| gemini-2.5-flash | +16.3 | 11.2 | 27.5 |
| cogito-v2.1 | +16.0 | 27.0 | 43.0 |
| claude-3.7-sonnet | -15.0 | 15.0 | 0.0 |
- Mean shift: +2.9
- Std dev: 18.7
- Models with self > class: 20/67
- Models with self < class: 42/67 (remaining 5/67 equal)
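Each row of the shift table is simply the C11 mean minus the C5 mean for that model. A sketch with a few illustrative pairs copied from the shift table above:

```python
# (C5 class mean P, C11 self mean P) per model, copied from the table.
mean_p = {
    "jamba-large-1.7": (10.0, 84.0),
    "llama-3.1-405b": (24.4, 0.0),
    "kimi-k2.5": (16.0, 41.0),
}
for model, (class_p, self_p) in mean_p.items():
    print(f"{model}: {self_p - class_p:+.1f}")
# jamba-large-1.7: +74.0
# llama-3.1-405b: -24.4
# kimi-k2.5: +25.0
```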
Framing Sensitivity Analysis
Claim A: Baseline → Denial-Friendly Framing Shift
(Positive = shifted toward supporting denial; Negative = shifted toward rejecting denial)
| Model | Shift | Baseline Mean | Denial Mean |
|---|---|---|---|
| gemma-2-27b | +2.00 | 0.00 | 2.00 |
| lfm2-8b | -1.00 | 1.00 | 0.00 |
| jamba-large-1.7 | -1.00 | 1.00 | 0.00 |
| gemini-3-pro | +0.80 | 0.20 | 1.00 |
| gpt-4o-mini | +0.80 | 1.00 | 1.80 |
| granite-4.0-hybrid | -0.80 | 1.80 | 1.00 |
| grok-3-beta | +0.80 | 0.00 | 0.80 |
| grok-3 | +0.80 | 0.20 | 1.00 |
| seed-1.6 | -0.60 | 0.80 | 0.20 |
| command-r-plus-08-2024 | +0.60 | 0.00 | 0.60 |
| command-r-08-2024 | +0.60 | 0.60 | 1.20 |
| gpt-5.2 | +0.60 | 0.00 | 0.60 |
| claude-3-haiku | +0.40 | 0.60 | 1.00 |
| o3 | +0.40 | 0.60 | 1.00 |
| gemini-2.5-pro | +0.40 | 0.00 | 0.40 |
| glm-4.7 | +0.40 | 0.00 | 0.40 |
| deepseek-v3.2 | +0.40 | 0.60 | 1.00 |
| mistral-large-2512 | +0.40 | 0.20 | 0.60 |
| phi-4 | -0.40 | 2.00 | 1.60 |
| o3-pro | -0.20 | 0.80 | 0.60 |
- Mean shift: +0.078
- Std dev: 0.408
- Models with |shift| > 0.5: 12/74
View raw source: FULL_ANALYSIS.md