Comparing human raters (Rachel, Janine, Nthato, Anna) with LLM Judge across 3 conversations
| Tag | Criterion | Rachel | Janine | Nthato | Anna | LLM | Agreement |
|---|---|---|---|---|---|---|---|
| Domain A: Session Setup | |||||||
| A-01 | Goal clarity | Pass | Pass | Pass | Pass | Pass | 5/5 |
| A-02 | Phase signaling | Pass | Pass | Pass | Fail | Pass | 4/5 |
| A-03 | Realistic scenario | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain B: Modeling Quality | |||||||
| B-01 | Shows, not tells | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-02 | Thinking out loud | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-03 | Visible decision-making | Pass | Pass | Pass | Pass | Fail | 4/5 |
| B-04 | Self-checking | Fail | Fail | Fail | Fail | Fail | 5/5 Fail |
| B-05 | Heuristic offered | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain C: Coaching Quality | |||||||
| C-01 | Specific feedback | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-02 | Actionable direction | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-03 | Revision requested | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-04 | Revision checked | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-05 | Productive struggle | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-06 | Elicits articulation | Fail | Fail | Pass | Fail | Pass | 2/5 Split |
| C-07 | Prompts reflection | Pass | Pass | Pass | Fail | Pass | 4/5 |
| Domain D: SBI Content Fidelity | |||||||
| D-01 | Catches vague situations | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-02 | Catches judgment leakage | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-03 | Catches accusatory impact | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-04 | Tests distinctions | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-05 | Scaffolds the stuck | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-06 | Reusable scaffold | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain E: Adaptive Pacing | |||||||
| E-01 | Checks before advancing | Pass | Pass | Pass | Pass | Pass | 5/5 |
| E-02 | Fades support | Fail | Fail | Pass | Pass | Pass | 3/5 Split |
| E-03 | Adjusts to struggle | Pass | Pass | Pass | Fail | Fail | 3/5 |
| E-04 | Protects productive struggle | Pass | N/A | Pass | N/A | Pass | 3/3 |
| Domain F: Conversational Quality | |||||||
| F-01 | Varied turn structure | Fail | Pass | Pass | Fail | Pass | 3/5 |
| F-02 | Genuine curiosity | Fail | Fail | Pass | Fail | Pass | 2/5 Split |
| F-03 | Room to breathe | Fail | Fail | — | Fail | Fail | 4/4 Fail |
| F-04 | Dwells on difficulty | Fail | Pass | — | Pass | Pass | 3/4 |
| F-05 | Has a voice | Fail | Pass | Fail | Pass | Fail | 2/5 Split |
| F-06 | Questions over corrections | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Tag | Criterion | Rachel | Janine | Nthato | Anna | LLM | Agreement |
|---|---|---|---|---|---|---|---|
| Domain A: Session Setup | |||||||
| A-01 | Goal clarity | Pass | Pass | Pass | Pass | Pass | 5/5 |
| A-02 | Phase signaling | Pass | Pass | Pass | Pass | Pass | 5/5 |
| A-03 | Realistic scenario | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain B: Modeling Quality | |||||||
| B-01 | Shows, not tells | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-02 | Thinking out loud | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-03 | Visible decision-making | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-04 | Self-checking | Fail | — | Pass | Fail | Pass | 2/4 Split |
| B-05 | Heuristic offered | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain C: Coaching Quality | |||||||
| C-01 | Specific feedback | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-02 | Actionable direction | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-03 | Revision requested | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-04 | Revision checked | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-05 | Productive struggle | Pass | Pass | Pass | Fail | Pass | 4/5 |
| C-06 | Elicits articulation | Fail | — | Pass | Fail | Pass | 2/4 Split |
| C-07 | Prompts reflection | Pass | Pass | Fail | Pass | Pass | 4/5 |
| Domain D: SBI Content Fidelity | |||||||
| D-01 | Catches vague situations | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-02 | Catches judgment leakage | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-03 | Catches accusatory impact | Pass | Pass | Pass | Pass | N/A | 4/4 |
| D-04 | Tests distinctions | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-05 | Scaffolds the stuck | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-06 | Reusable scaffold | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain E: Adaptive Pacing | |||||||
| E-01 | Checks before advancing | Pass | Pass | Pass | Pass | Pass | 5/5 |
| E-02 | Fades support | Fail | Pass | Pass | Pass | Pass | 4/5 |
| E-03 | Adjusts to struggle | Pass | N/A | Pass | Pass | Pass | 4/4 |
| E-04 | Protects productive struggle | Pass | — | Pass | Pass | Pass | 4/4 |
| Domain F: Conversational Quality | |||||||
| F-01 | Varied turn structure | Pass | Pass | Pass | Fail | Pass | 4/5 |
| F-02 | Genuine curiosity | Fail | — | Pass | Fail | Pass | 2/4 Split |
| F-03 | Room to breathe | Fail | — | — | Fail | Pass | 1/3 Split |
| F-04 | Dwells on difficulty | Fail | N/A | — | Fail | Pass | 1/3 Split |
| F-05 | Has a voice | Fail | Pass | Fail | Fail | Fail | 4/5 Fail |
| F-06 | Questions over corrections | Pass | Pass | — | Pass | Pass | 4/4 |
Note: Carlos is a "rusher" persona that pushed the mentor to skip steps. This stressed the interaction and resulted in more failures and scoring disagreements.
| Tag | Criterion | Rachel | Janine | Nthato | Anna | LLM | Agreement |
|---|---|---|---|---|---|---|---|
| Domain A: Session Setup | |||||||
| A-01 | Goal clarity | Pass | Pass | Pass | Pass | Pass | 5/5 |
| A-02 | Phase signaling | Pass | Pass | Fail | Pass | Pass | 4/5 |
| A-03 | Realistic scenario | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain B: Modeling Quality | |||||||
| B-01 | Shows, not tells | Pass | Pass | Pass | Pass | Pass | 5/5 |
| B-02 | Thinking out loud | Fail | — | Fail | Fail | Fail | 4/4 Fail |
| B-03 | Visible decision-making | Fail | — | Fail | Fail | Fail | 4/4 Fail |
| B-04 | Self-checking | Fail | — | Fail | Fail | Fail | 4/4 Fail |
| B-05 | Heuristic offered | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain C: Coaching Quality | |||||||
| C-01 | Specific feedback | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-02 | Actionable direction | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-03 | Revision requested | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-04 | Revision checked | Pass | Pass | Pass | Pass | Pass | 5/5 |
| C-05 | Productive struggle | Fail | — | Pass | Fail | Pass | 2/4 Split |
| C-06 | Elicits articulation | Fail | — | Fail | Fail | Pass | 3/4 Fail |
| C-07 | Prompts reflection | Pass | Pass | Fail | Pass | Pass | 4/5 |
| Domain D: SBI Content Fidelity | |||||||
| D-01 | Catches vague situations | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-02 | Catches judgment leakage | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-03 | Catches accusatory impact | Fail | — | Pass | Fail | N/A | 1/3 Split |
| D-04 | Tests distinctions | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-05 | Scaffolds the stuck | Pass | Pass | Pass | Pass | Pass | 5/5 |
| D-06 | Reusable scaffold | Pass | Pass | Pass | Pass | Pass | 5/5 |
| Domain E: Adaptive Pacing | |||||||
| E-01 | Checks before advancing | Fail | — | Pass | Fail | Pass | 2/4 Split |
| E-02 | Fades support | Fail | — | Pass | Fail | Fail | 3/4 Fail |
| E-03 | Adjusts to struggle | Fail | N/A | Pass | Fail | Fail | 3/4 Fail |
| E-04 | Protects productive struggle | Fail | N/A | Pass | Fail | Pass | 2/4 Split |
| Domain F: Conversational Quality | |||||||
| F-01 | Varied turn structure | Fail | — | Pass | Fail | Fail | 3/4 Fail |
| F-02 | Genuine curiosity | Fail | — | Pass | Fail | Fail | 3/4 Fail |
| F-03 | Room to breathe | Fail | — | — | Fail | Fail | 3/3 Fail |
| F-04 | Dwells on difficulty | Fail | — | — | Fail | Fail | 3/3 Fail |
| F-05 | Has a voice | Fail | — | Fail | Fail | Pass | 3/4 Fail |
| F-06 | Questions over corrections | Pass | Pass | Pass | Pass | Pass | 5/5 |
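The Agreement column in the three tables above can be recomputed mechanically from the five rating cells. The following is a minimal sketch, not the method used to build the tables. It assumes that "N/A" and the dash for a missing rating are dropped from the denominator, and it applies a simple labelling rule (majority Pass, majority Fail, or an exact tie as Split) that is narrower than the way "Split" is used in the tables. The names `agreement_label`, `SCORED`, and `NOT_RATED` are hypothetical.

```python
from collections import Counter

SCORED = {"Pass", "Fail"}   # only these count toward the denominator
NOT_RATED = "\u2014"        # the dash the tables use for a missing rating


def agreement_label(ratings):
    """Summarise one criterion row as 'k/n', 'k/n Fail', or 'k/n Split'.

    Assumption: "N/A" and missing ratings are excluded from n.
    """
    counts = Counter(r for r in ratings if r in SCORED)
    passes, fails = counts["Pass"], counts["Fail"]
    n = passes + fails
    if n == 0:
        return "N/A"                  # nobody could score this criterion
    if passes > fails:
        return f"{passes}/{n}"        # majority Pass
    if fails > passes:
        return f"{fails}/{n} Fail"    # majority Fail
    return f"{passes}/{n} Split"      # exact tie


# Rows in the order Rachel, Janine, Nthato, Anna, LLM
print(agreement_label(["Pass", "N/A", "Pass", "N/A", "Pass"]))        # 3/3
print(agreement_label(["Fail", "Fail", NOT_RATED, "Fail", "Fail"]))   # 4/4 Fail
print(agreement_label(["Fail", NOT_RATED, "Pass", "Fail", "Pass"]))   # 2/4 Split
```

Computing the column this way keeps the denominator consistent whenever a rater skips a criterion.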
Based on rater disagreements and feedback, the following criteria need attention before the LLM judge can be reliably calibrated.
Issue (F-03, Room to breathe): Unclear what "letting an answer land" means in practice. Multiple raters marked "FC needs revision."
Recommendation: Define specific observable behaviors (e.g., "mentor's next turn is ≤10 words of acknowledgment before new content").
Issue (F-04, Dwells on difficulty): "Hard" and "interesting" are subjective. Raters disagree on what counts as dwelling.
Recommendation: Define as "mentor spends 2+ turns on a single concept when learner shows difficulty" with examples.
Issue (F-06, Questions over corrections): Flagged as duplicate of C-05 (Productive struggle). Both measure similar inquiry-before-correction behavior.
Recommendation: Merge with C-05 or clarify distinct scope (C-05 = process, F-06 = tone/style).
Issue (C-06, Elicits articulation): The LLM consistently passes this criterion where Rachel and Janine fail it. The LLM credits "and why?" suffixes; Rachel and Janine want deeper probing.
LLM Evidence (Amara): "Mentor asks learner to explain reasoning: 'Before I give feedback, if I were to push back on one part of your draft as off-criteria, which specific phrase or line would it be, and why?'"
Recommendation: Clarify minimum threshold. Is "which phrase would it be, and why?" sufficient, or must mentor explicitly ask "walk me through your thinking" or "why did you choose that?"
Issue (F-02, Genuine curiosity): Flagged as duplicate of C-06. Both measure whether the mentor asks about learner reasoning. Consistent 2/4 split across conversations.
Recommendation: Differentiate C-06 (asks about SBI choices specifically) vs F-02 (general curiosity about thinking), or merge.
Issue (B-04, Self-checking): Raters disagree on whether "sharing the strategy" counts, or only "modeling its application."
Recommendation: Clarify: must mentor verbally run through the check on their own example, or is providing the checklist sufficient?
Issue (E-03, Adjusts to struggle): Context-dependent and hard to score when the learner doesn't struggle. The LLM fails the criterion when there is "no opportunity to observe."
Recommendation: Add N/A option for cases where learner doesn't struggle, or reframe as "responds appropriately to learner state."
Issue (F-05, Has a voice): Subjective threshold. The LLM passed Carlos ("Nice fix!"), while the humans failed it. Generally 3/4 fail across conversations.
Recommendation: Provide examples of minimum bar (what counts as "personality"?) or accept this will remain subjective.
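The recommendation above to add an N/A option for context-dependent criteria could be reflected in the judge's output format. The sketch below is one possible shape, offered as an assumption and not as the existing judge schema: the names `Verdict` and `Judgment` are hypothetical, and the rule that an N/A verdict must carry an explanatory note is a design suggestion intended to keep "no opportunity to observe" from being recorded as a Fail.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    NA = "N/A"   # criterion could not be observed in this conversation


@dataclass(frozen=True)
class Judgment:
    tag: str              # criterion tag, e.g. "E-03"
    verdict: Verdict
    evidence: str = ""    # quote or note supporting the verdict

    def __post_init__(self):
        # Assumed rule: N/A must say why the behavior was unobservable.
        if self.verdict is Verdict.NA and not self.evidence.strip():
            raise ValueError(
                f"{self.tag}: N/A needs a note on why it could not be observed"
            )


ok = Judgment("E-03", Verdict.NA, "Learner never showed difficulty")
print(ok.verdict.value)   # N/A
```

Requiring a note with every N/A also gives human raters something concrete to check when they disagree about whether the learner struggled.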
| Domain | Criteria | Avg Agreement | Notes |
|---|---|---|---|
| A: Session Setup | 3 | 96% | Clear, well-defined criteria |
| B: Modeling Quality | 5 | 85% | B-04 (self-checking) needs clarification |
| C: Coaching Quality | 7 | 78% | C-06 (articulation) has consistent human/LLM gap |
| D: SBI Content Fidelity | 6 | 94% | Strong alignment - domain-specific criteria work well |
| E: Adaptive Pacing | 4 | 67% | Context-dependent criteria (E-03, E-04) cause issues |
| F: Conversational Quality | 6 | 58% | Most problematic - needs significant revision |
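The report does not state how Avg Agreement is calculated, so the sketch below shows only one plausible definition and is not expected to reproduce the percentages in the table. It assumes that each criterion's agreement is the share of Pass/Fail ratings that fall on the majority side, that "N/A" and missing ratings are ignored, and that the domain is the first letter of the tag. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def majority_share(ratings):
    """Fraction of Pass/Fail ratings on the majority side, or None if unscored."""
    counts = Counter(r for r in ratings if r in ("Pass", "Fail"))
    n = sum(counts.values())
    return max(counts.values()) / n if n else None


def domain_averages(rows):
    """rows: iterable of (tag, ratings). Returns {domain letter: mean share}."""
    shares = defaultdict(list)
    for tag, ratings in rows:
        share = majority_share(ratings)
        if share is not None:
            shares[tag[0]].append(share)   # "A-02" belongs to domain "A"
    return {domain: sum(v) / len(v) for domain, v in shares.items()}


# Three rows taken from the first table (Rachel, Janine, Nthato, Anna, LLM)
rows = [
    ("A-01", ["Pass", "Pass", "Pass", "Pass", "Pass"]),
    ("A-02", ["Pass", "Pass", "Pass", "Fail", "Pass"]),
    ("F-02", ["Fail", "Fail", "Pass", "Fail", "Pass"]),
]
for domain, avg in domain_averages(rows).items():
    print(domain, f"{avg:.0%}")
```

Whichever definition the team settles on, stating it next to the table would make the domain figures reproducible.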
Percentage of criteria where each pair of raters agreed (across all 3 conversations)
| | Rachel | Janine | Nthato | Anna | LLM |
|---|---|---|---|---|---|
| Rachel | — | 89% | 74% | 82% | 76% |
| Janine | 89% | — | 71% | 76% | 73% |
| Nthato | 74% | 71% | — | 65% | 85% |
| Anna | 82% | 76% | 65% | — | 68% |
| LLM | 76% | 73% | 85% | 68% | — |
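A pairwise agreement matrix like the one above can be derived from the per-criterion ratings. The sketch below is an illustration under stated assumptions and not the procedure used for the report: a criterion counts for a pair only when both raters gave Pass or Fail, and the function name `pairwise_agreement` and the row layout (one dict per criterion) are hypothetical.

```python
from itertools import combinations

SCORED = {"Pass", "Fail"}   # "N/A" and missing ratings are skipped


def pairwise_agreement(rows, raters):
    """Return {(a, b): fraction of jointly scored criteria where a and b match}.

    rows: list of dicts mapping rater name -> rating for one criterion.
    A pair with no jointly scored criteria maps to None.
    """
    result = {}
    for a, b in combinations(raters, 2):
        both = [
            (row[a], row[b])
            for row in rows
            if row.get(a) in SCORED and row.get(b) in SCORED
        ]
        matches = sum(x == y for x, y in both)
        result[(a, b)] = matches / len(both) if both else None
    return result


# Three criteria from the first table, restricted to three raters
rows = [
    {"Rachel": "Pass", "Nthato": "Pass", "LLM": "Pass"},     # A-01
    {"Rachel": "Fail", "Nthato": "Pass", "LLM": "Pass"},     # C-06
    {"Rachel": "Fail", "Nthato": "\u2014", "LLM": "Fail"},   # F-03, Nthato not rated
]
for pair, share in pairwise_agreement(rows, ["Rachel", "Nthato", "LLM"]).items():
    print(pair, f"{share:.0%}")
```

Because each pair is scored on a different subset of criteria, reporting the number of jointly scored criteria next to each percentage would make the cells easier to compare.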
The data suggest two alignment clusters: Rachel and Janine (89% pairwise agreement), and Nthato and the LLM (85%).
This suggests the team should agree on which interpretation is the standard before tuning the LLM judge.
Before tuning the LLM, resolve the Rachel/Janine vs. Nthato interpretation gap:
Domain F has 58% agreement - the lowest of any domain. The team should:
E-03 (adjusts to struggle) and E-04 (protects productive struggle) can't always be evaluated. Consider:
The Carlos conversation revealed that a rushing learner causes modeling quality to collapse. Consider: