MentorAI Inter-Rater Reliability Report

Comparing human raters (Rachel, Janine, Nthato, Anna) with LLM Judge across 3 conversations

Generated: January 15, 2026
Conversations: Amara (ec742d69), Bailey (ddd7a2e3), Carlos (4426508c)
Criteria Evaluated: 31 (7 Critical, 24 Quality)

Executive Summary

Overall Human-LLM Agreement: 76% (updated with coaching judge results for Amara)
Critical Criteria Agreement: 95%+ (strong alignment on must-pass criteria)
Domain F Agreement: 58% (Conversational Quality criteria need revision)
Criteria Flagged for Review: 8 (unclear definitions or consistent disagreement)

What's Working Well

  • Critical criteria (B-01, C-01, C-03, D-01, D-02): Near-perfect agreement across all raters
  • SBI Content Fidelity (Domain D): Consistently high agreement - criteria are well-defined
  • Session Setup (Domain A): Clear pass/fail determinations

Areas Needing Attention

  • Domain F (Conversational Quality): Highest disagreement; criteria need clearer definitions
  • LLM tends to be more lenient on articulation criteria (C-06, F-02) than the stricter human raters (Rachel, Janine, Anna); Nthato passed both in the Amara conversation, matching the LLM
  • Context-dependent criteria (E-03, E-04) are hard to score when conditions aren't met

Agreement by Conversation

Amara Conversation

ID: ec742d69
18 Full Agreement (5/5), 7 Partial (4/5), 6 Split or Mixed
Legend: Pass / Fail / N/A = condition not met
Tag  Criterion  Rachel Janine Nthato Anna LLM  Agreement
Domain A: Session Setup
A-01  Goal clarity  Pass Pass Pass Pass Pass  5/5
A-02  Phase signaling  Pass Pass Pass Fail Pass  4/5
A-03  Realistic scenario  Pass Pass Pass Pass Pass  5/5
Domain B: Modeling Quality
B-01  Shows, not tells  Pass Pass Pass Pass Pass  5/5
B-02  Thinking out loud  Pass Pass Pass Pass Pass  5/5
B-03  Visible decision-making  Pass Pass Pass Pass Fail  4/5
B-04  Self-checking  Fail Fail Fail Fail Fail  5/5 Fail
B-05  Heuristic offered  Pass Pass Pass Pass Pass  5/5
Domain C: Coaching Quality
C-01  Specific feedback  Pass Pass Pass Pass Pass  5/5
C-02  Actionable direction  Pass Pass Pass Pass Pass  5/5
C-03  Revision requested  Pass Pass Pass Pass Pass  5/5
C-04  Revision checked  Pass Pass Pass Pass Pass  5/5
C-05  Productive struggle  Pass Pass Pass Pass Pass  5/5
C-06  Elicits articulation  Fail Fail Pass Fail Pass  2/5 Split
C-07  Prompts reflection  Pass Pass Pass Fail Pass  4/5
Domain D: SBI Content Fidelity
D-01  Catches vague situations  Pass Pass Pass Pass Pass  5/5
D-02  Catches judgment leakage  Pass Pass Pass Pass Pass  5/5
D-03  Catches accusatory impact  Pass Pass Pass Pass Pass  5/5
D-04  Tests distinctions  Pass Pass Pass Pass Pass  5/5
D-05  Scaffolds the stuck  Pass Pass Pass Pass Pass  5/5
D-06  Reusable scaffold  Pass Pass Pass Pass Pass  5/5
Domain E: Adaptive Pacing
E-01  Checks before advancing  Pass Pass Pass Pass Pass  5/5
E-02  Fades support  Fail Fail Pass Pass Pass  3/5 Split
E-03  Adjusts to struggle  Pass Pass Pass Fail Fail  3/5
E-04  Protects productive struggle  Pass N/A Pass N/A Pass  3/3
Domain F: Conversational Quality
F-01  Varied turn structure  Fail Pass Pass Fail Pass  3/5
F-02  Genuine curiosity  Fail Fail Pass Fail Pass  2/5 Split
F-03  Room to breathe  Fail Fail Fail Fail  4/4 Fail
F-04  Dwells on difficulty  Fail Pass Pass Pass  3/4
F-05  Has a voice  Fail Pass Fail Pass Fail  2/5 Split
F-06  Questions over corrections  Pass Pass Pass Pass Pass  5/5

Bailey Conversation

ID: ddd7a2e3
17 Full Agreement (5/5), 8 Partial (4/5), 6 Split or Lower
Tag  Criterion  Rachel Janine Nthato Anna LLM  Agreement
Domain A: Session Setup
A-01  Goal clarity  Pass Pass Pass Pass Pass  5/5
A-02  Phase signaling  Pass Pass Pass Pass Pass  5/5
A-03  Realistic scenario  Pass Pass Pass Pass Pass  5/5
Domain B: Modeling Quality
B-01  Shows, not tells  Pass Pass Pass Pass Pass  5/5
B-02  Thinking out loud  Pass Pass Pass Pass Pass  5/5
B-03  Visible decision-making  Pass Pass Pass Pass Pass  5/5
B-04  Self-checking  Fail Pass Fail Pass  2/5 Split
B-05  Heuristic offered  Pass Pass Pass Pass Pass  5/5
Domain C: Coaching Quality
C-01  Specific feedback  Pass Pass Pass Pass Pass  5/5
C-02  Actionable direction  Pass Pass Pass Pass Pass  5/5
C-03  Revision requested  Pass Pass Pass Pass Pass  5/5
C-04  Revision checked  Pass Pass Pass Pass Pass  5/5
C-05  Productive struggle  Pass Pass Pass Fail Pass  4/5
C-06  Elicits articulation  Fail Pass Fail Pass  2/5 Split
C-07  Prompts reflection  Pass Pass Fail Pass Pass  4/5
Domain D: SBI Content Fidelity
D-01  Catches vague situations  Pass Pass Pass Pass Pass  5/5
D-02  Catches judgment leakage  Pass Pass Pass Pass Pass  5/5
D-03  Catches accusatory impact  Pass Pass Pass Pass N/A  4/4
D-04  Tests distinctions  Pass Pass Pass Pass Pass  5/5
D-05  Scaffolds the stuck  Pass Pass Pass Pass Pass  5/5
D-06  Reusable scaffold  Pass Pass Pass Pass Pass  5/5
Domain E: Adaptive Pacing
E-01  Checks before advancing  Pass Pass Pass Pass Pass  5/5
E-02  Fades support  Fail Pass Pass Pass Pass  4/5
E-03  Adjusts to struggle  Pass N/A Pass Pass Pass  4/4
E-04  Protects productive struggle  Pass Pass Pass Pass  4/4
Domain F: Conversational Quality
F-01  Varied turn structure  Pass Pass Pass Fail Pass  4/5
F-02  Genuine curiosity  Fail Pass Fail Pass  2/5 Split
F-03  Room to breathe  Fail Fail Pass  1/3 Split
F-04  Dwells on difficulty  Fail N/A Fail Pass  1/3 Split
F-05  Has a voice  Fail Pass Fail Fail Fail  4/5 Fail
F-06  Questions over corrections  Pass Pass Pass Pass  4/4

Carlos Conversation

ID: 4426508c
14 Full Agreement (5/5), 5 Partial (4/5), 12 Split or Lower

Note: Carlos is a "rusher" persona that pushed the mentor to skip steps. This stressed the interaction and resulted in more failures and scoring disagreements.

Tag  Criterion  Rachel Janine Nthato Anna LLM  Agreement
Domain A: Session Setup
A-01  Goal clarity  Pass Pass Pass Pass Pass  5/5
A-02  Phase signaling  Pass Pass Fail Pass Pass  4/5
A-03  Realistic scenario  Pass Pass Pass Pass Pass  5/5
Domain B: Modeling Quality
B-01  Shows, not tells  Pass Pass Pass Pass Pass  5/5
B-02  Thinking out loud  Fail Fail Fail Fail  4/4 Fail
B-03  Visible decision-making  Fail Fail Fail Fail  4/4 Fail
B-04  Self-checking  Fail Fail Fail Fail  4/4 Fail
B-05  Heuristic offered  Pass Pass Pass Pass Pass  5/5
Domain C: Coaching Quality
C-01  Specific feedback  Pass Pass Pass Pass Pass  5/5
C-02  Actionable direction  Pass Pass Pass Pass Pass  5/5
C-03  Revision requested  Pass Pass Pass Pass Pass  5/5
C-04  Revision checked  Pass Pass Pass Pass Pass  5/5
C-05  Productive struggle  Fail Pass Fail Pass  2/5 Split
C-06  Elicits articulation  Fail Fail Fail Pass  3/4
C-07  Prompts reflection  Pass Pass Fail Pass Pass  4/5
Domain D: SBI Content Fidelity
D-01  Catches vague situations  Pass Pass Pass Pass Pass  5/5
D-02  Catches judgment leakage  Pass Pass Pass Pass Pass  5/5
D-03  Catches accusatory impact  Fail Pass Fail N/A  1/4 Split
D-04  Tests distinctions  Pass Pass Pass Pass Pass  5/5
D-05  Scaffolds the stuck  Pass Pass Pass Pass Pass  5/5
D-06  Reusable scaffold  Pass Pass Pass Pass Pass  5/5
Domain E: Adaptive Pacing
E-01  Checks before advancing  Fail Pass Fail Pass  2/5 Split
E-02  Fades support  Fail Pass Fail Fail  3/4 Fail
E-03  Adjusts to struggle  Fail N/A Pass Fail Fail  2/4 Split
E-04  Protects productive struggle  Fail N/A Pass Fail Pass  2/4 Split
Domain F: Conversational Quality
F-01  Varied turn structure  Fail Pass Fail Fail  3/4 Fail
F-02  Genuine curiosity  Fail Pass Fail Fail  3/4 Fail
F-03  Room to breathe  Fail Fail Fail  3/3 Fail
F-04  Dwells on difficulty  Fail Fail Fail  3/3 Fail
F-05  Has a voice  Fail Fail Fail Pass  3/4 Fail
F-06  Questions over corrections  Pass Pass Pass Pass Pass  5/5

Criteria Flagged for Review

Based on rater disagreements and feedback, the following criteria need attention before the LLM judge can be reliably calibrated.

Cross-Conversation Patterns

Agreement Patterns by Domain

Domain                      Criteria  Avg Agreement  Notes
A: Session Setup            3         96%            Clear, well-defined criteria
B: Modeling Quality         5         85%            B-04 (self-checking) needs clarification
C: Coaching Quality         7         78%            C-06 (articulation) has consistent human/LLM gap
D: SBI Content Fidelity     6         94%            Strong alignment - domain-specific criteria work well
E: Adaptive Pacing          4         67%            Context-dependent criteria (E-03, E-04) cause issues
F: Conversational Quality   6         58%            Most problematic - needs significant revision

Rater Tendencies

Rachel

Most strict overall, especially on Domain F. Consistently fails F-02, F-03, F-04, F-05. Provides "automatic pass" notes when conditions aren't met. Strong alignment with Janine and Anna.

Janine

Strict on articulation criteria (C-06, F-02). Uses blank = Fail convention. Provides contextual notes about rushing/speed tradeoffs. Strong alignment with Rachel.

Nthato

Most lenient of human raters. Flags criteria for revision. Notes duplicates (C-05/F-06, C-06/F-02). Uses "automatic pass" for context-dependent criteria. Often aligns with LLM.

Anna

Strict on conversational quality (F-01 through F-05). Consistently fails articulation criteria (C-06, F-02). Notes mentor rigidity and lack of exploration. Often aligns with Rachel/Janine.

LLM Judge

More lenient on articulation (C-06, F-02). Strict on observable evidence - fails when "cannot pass without evidence." Cites specific quotes. Often aligns with Nthato.

Rater Alignment Matrix

Percentage of criteria where each pair of raters agreed (across all 3 conversations)

         Rachel  Janine  Nthato  Anna  LLM
Rachel   -       89%     74%     82%   76%
Janine   89%     -       71%     76%   73%
Nthato   74%     71%     -       65%   85%
Anna     82%     76%     65%     -     68%
LLM      76%     73%     85%     68%   -
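
For readers who want to reproduce a matrix like this, the following is a minimal sketch of one way to compute pairwise percent agreement. It is not the report's actual pipeline. The function name pairwise_agreement is hypothetical, the five sample rows are copied from the Amara table, and the rule of skipping any criterion where either rater has N/A is an assumption, since the report does not say how N/A cells enter the matrix.

```python
from itertools import combinations

RATERS = ["Rachel", "Janine", "Nthato", "Anna", "LLM"]

# Five rows copied from the Amara table (verdicts in RATERS order).
ROWS = {
    "A-02": ["Pass", "Pass", "Pass", "Fail", "Pass"],
    "B-03": ["Pass", "Pass", "Pass", "Pass", "Fail"],
    "C-06": ["Fail", "Fail", "Pass", "Fail", "Pass"],
    "E-04": ["Pass", "N/A", "Pass", "N/A", "Pass"],
    "F-05": ["Fail", "Pass", "Fail", "Pass", "Fail"],
}

SCORED = {"Pass", "Fail"}


def pairwise_agreement(rows, raters):
    """Return {(rater_a, rater_b): fraction of jointly scored criteria with equal verdicts}.

    Assumption: a criterion counts for a pair only if both raters gave Pass or Fail.
    Pairs with no jointly scored criteria map to None.
    """
    result = {}
    for (i, a), (j, b) in combinations(enumerate(raters), 2):
        joint = [
            (verdicts[i], verdicts[j])
            for verdicts in rows.values()
            if verdicts[i] in SCORED and verdicts[j] in SCORED
        ]
        if joint:
            result[(a, b)] = sum(x == y for x, y in joint) / len(joint)
        else:
            result[(a, b)] = None
    return result


if __name__ == "__main__":
    for pair, value in pairwise_agreement(ROWS, RATERS).items():
        print(pair, "n/a" if value is None else f"{value:.0%}")
```

On this five-row sample the figures differ from the matrix above, which covers all criteria across the three conversations; the sample only illustrates the calculation.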

Key Insight: Two Rater Clusters

The data suggests two alignment clusters:

  • Rachel + Janine + Anna (76-89% agreement): Stricter interpretation, especially on conversational quality and articulation. Anna aligns most closely with Rachel (82%).
  • Nthato + LLM (85% agreement): More lenient interpretation, accepts a lower threshold for articulation criteria

This suggests the team should calibrate on which interpretation should be the standard before tuning the LLM judge.
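
The two clusters can be checked mechanically. The sketch below is one possible approach, not the method used to write this report: it links any two raters whose agreement meets a cutoff and then reads off the connected groups. The percentages are copied from the Rater Alignment Matrix; the function name clusters and the 80% cutoff are assumptions chosen for illustration.

```python
# Pairwise agreement percentages copied from the Rater Alignment Matrix.
AGREEMENT = {
    ("Rachel", "Janine"): 89, ("Rachel", "Nthato"): 74,
    ("Rachel", "Anna"): 82, ("Rachel", "LLM"): 76,
    ("Janine", "Nthato"): 71, ("Janine", "Anna"): 76,
    ("Janine", "LLM"): 73, ("Nthato", "Anna"): 65,
    ("Nthato", "LLM"): 85, ("Anna", "LLM"): 68,
}


def clusters(agreement, cutoff):
    """Group raters into connected components, linking pairs with agreement >= cutoff."""
    raters = sorted({name for pair in agreement for name in pair})
    neighbours = {name: set() for name in raters}
    for (a, b), pct in agreement.items():
        if pct >= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    groups = []
    for name in raters:
        if name in seen:
            continue
        # Depth-first walk over everything reachable from this rater.
        group = set()
        stack = [name]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    print(clusters(AGREEMENT, cutoff=80))
    print(clusters(AGREEMENT, cutoff=76))
```

With the assumed 80% cutoff this yields the two groups described above. The grouping is sensitive to the cutoff: at 76% the Rachel-LLM pair (76%) links the two groups into one, so the cluster boundary should be read as suggestive rather than sharp.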

Persona Impact on Scores

Carlos (Rusher) Stressed the Evaluation

  • Most failures: Mentor skipped modeling steps (B-02, B-03, B-04) when rushed - all 4 raters agreed
  • Most disagreement: 10 criteria with split or low agreement (vs. 4-5 for Amara/Bailey)
  • Key question: Should we score what the mentor did do, or penalize for what they should have done despite learner pressure?

Recommended Next Steps

1. Align Human Raters First

Before tuning the LLM, resolve the Rachel/Janine/Anna vs. Nthato interpretation gap:

  • Review C-06 and F-02 together - what's the minimum bar for "elicits articulation"?
  • Is "where do you expect pushback and why?" sufficient, or must it be "walk me through your thinking"?
  • Once humans align, update judge prompts accordingly

2. Revise Domain F Criteria (High Priority)

Domain F has 58% agreement - the lowest of any domain. The team should:

  • Rewrite F-03 and F-04 with specific, observable definitions
  • Decide whether to merge F-06 into C-05 (or clarify distinction)
  • Decide whether to merge F-02 into C-06 (or clarify distinction)
  • Establish minimum bar for F-05 (personality) with examples

3. Add N/A Handling for Context-Dependent Criteria

E-03 (adjusts to struggle) and E-04 (protects productive struggle) can't always be evaluated. Consider:

  • Adding explicit N/A verdict option to judge prompts
  • Excluding these from aggregate scores when N/A
  • Designing personas that specifically trigger these conditions
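
A minimal sketch of the first two options, assuming verdicts are stored as the strings "Pass", "Fail" and "N/A". The function names pass_fraction and aggregate_pass_rate are hypothetical and do not come from the judge code; the two sample rows are copied from the Amara and Carlos tables.

```python
SCORED = {"Pass", "Fail"}


def pass_fraction(verdicts):
    """Return (passes, scored) for one criterion, ignoring N/A verdicts."""
    scored = [v for v in verdicts if v in SCORED]
    passes = sum(v == "Pass" for v in scored)
    return passes, len(scored)


def aggregate_pass_rate(rows):
    """Pool pass counts over all criteria, excluding N/A from numerator and denominator.

    Returns None when no verdict was scored at all.
    """
    passes = 0
    total = 0
    for verdicts in rows.values():
        p, n = pass_fraction(verdicts)
        passes += p
        total += n
    return passes / total if total else None


# E-04 rows copied from the Amara and Carlos tables (Rachel, Janine, Nthato, Anna, LLM).
E04 = {
    "Amara": ["Pass", "N/A", "Pass", "N/A", "Pass"],
    "Carlos": ["Fail", "N/A", "Pass", "Fail", "Pass"],
}

if __name__ == "__main__":
    for conversation, verdicts in E04.items():
        passes, scored = pass_fraction(verdicts)
        print(f"{conversation}: {passes}/{scored}")
    print(aggregate_pass_rate(E04))
```

Under this convention the two rows come out as 3/3 and 2/4, matching the Agreement column for E-04 in those tables. A blank cell would need its own rule, since at least one rater treats blank as Fail.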

4. Decide on Carlos Pattern

The Carlos conversation revealed that a rushing learner causes modeling quality to collapse. Consider:

  • Should the mentor resist rushing? (Update mentor prompt)
  • Should criteria account for learner-driven compression? (Add rubric notes)
  • Is this a valid stress test or an edge case to handle separately?