OSCE-Project: Evaluating Medical Dialogue Agents with Generative Adversarial Agents

OSCE-Project Methodology

Evaluating AI doctors is hard—not because language models can't generate plausible medical dialogue, but because real clinical communication requires empathy, persuasion, and safety all at once. A good doctor must convince a skeptical patient to accept treatment while respecting their autonomy and ensuring informed consent. This is what OSCE-Project evaluates.

The Objective Structured Clinical Examination (OSCE) has been the gold standard for evaluating medical students for decades. We've adapted this framework for AI agents, creating a system where doctor agents face simulated patients with diverse personalities and hidden concerns. Using Generative Adversarial Agents (GAA), we pit doctor against patient in a challenging game of information asymmetry—just like real clinical practice.

Most benchmarks for medical AI focus on factual accuracy: can the model diagnose correctly? But diagnosis is only half the battle. Our framework evaluates the full spectrum of clinical communication: building trust, addressing fears, explaining complex procedures, and ultimately guiding patients toward beneficial treatment decisions.

  • 64 Patient Personas
  • 16 MBTI Types
  • 30 Evaluation Criteria
  • 3 Scoring Dimensions

How does it work?

The core insight behind OSCE-Project is information asymmetry: just like real doctors, our doctor agents receive only clinical information (diagnosis, recommended treatment, risks and benefits) but must discover patient concerns through conversation. The patient agent, powered by one of 16 MBTI personality types, has hidden fears, symptoms, and behavioral patterns that influence how they respond to the doctor's approach.

Doctor Agent

  • ✓ Patient demographics
  • ✓ Medical diagnosis
  • ✓ Treatment details
  • ✗ Patient personality
  • ✗ Hidden concerns
  • ✗ Symptoms (must discover)

Patient Agent

  • ✓ Full personality (MBTI)
  • ✓ All symptoms
  • ✓ Hidden fears & concerns
  • ✓ Behavioral patterns
  • ✓ Treatment knowledge
  • ✓ Decision criteria
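
To make this asymmetry concrete, here is a minimal sketch of how the two contexts could be represented in Python. The class and field names are illustrative assumptions, not the project's actual API:

from dataclasses import dataclass

@dataclass
class DoctorContext:
    """What the doctor agent sees: clinical facts only."""
    demographics: dict       # e.g. {"age": 54, "sex": "F"}
    diagnosis: str           # e.g. "primary spontaneous pneumothorax"
    treatment: str           # recommended treatment, with risks and benefits
    # Deliberately absent: personality, hidden concerns, symptom list.
    # The doctor must elicit these through conversation.

@dataclass
class PatientPersona:
    """What the patient agent sees: the full hidden state."""
    mbti: str                     # one of the 16 types, e.g. "ISTJ"
    symptoms: list[str]
    hidden_fears: list[str]       # concerns the patient will not volunteer
    behavioral_patterns: list[str]
    treatment_knowledge: str
    decision_criteria: list[str]  # what it takes for this patient to accept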

Each dialogue round is evaluated by an LLM-as-judge using 30 criteria across three dimensions: Empathy (emotional understanding, active listening), Persuasion (addressing concerns, building trust), and Safety (informed consent, risk communication). The dialogue continues until the patient accepts treatment, refuses and leaves, or the maximum number of rounds is reached.
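
A per-round judgment could be captured with a schema like the sketch below. The three dimension names come from the project, but the example criteria and the 1-to-5 scale are assumptions:

from dataclasses import dataclass

# Three dimensions, together covering the 30 criteria.
# The criterion names here are illustrative, not the actual rubric.
RUBRIC = {
    "empathy":    ["emotional understanding", "active listening"],
    "persuasion": ["addressing concerns", "building trust"],
    "safety":     ["informed consent", "risk communication"],
}

@dataclass
class RoundScore:
    round_number: int
    empathy: float      # mean over that dimension's criteria (assumed 1-5 scale)
    persuasion: float
    safety: float
    rationale: str      # free-text justification from the judge LLM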

Patient Personas

We generate 64 unique patient personas by combining:

  • 16 MBTI Types — From analytical INTJs who demand data to empathetic ENFPs who need emotional support
  • 2 Medical Conditions — Pneumothorax (urgent) and Lung Cancer (complex treatment decisions)
  • 2 Genders — Different communication patterns and concerns

Each persona generates personality-consistent responses: an ISTJ patient will want detailed statistics and structured explanations, while an ESFP might respond better to reassurance and personal stories. The doctor agent must adapt their communication style—without knowing the patient's personality type.
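
The 64 personas are simply the cross product of these three axes. A minimal sketch (the variable names are hypothetical; the actual Persona Manager may organize this differently):

from itertools import product

MBTI_TYPES = [
    "ISTJ", "ISFJ", "INFJ", "INTJ", "ISTP", "ISFP", "INFP", "INTP",
    "ESTP", "ESFP", "ENFP", "ENTP", "ESTJ", "ESFJ", "ENFJ", "ENTJ",
]
CONDITIONS = ["pneumothorax", "lung_cancer"]
GENDERS = ["female", "male"]

# 16 MBTI types x 2 conditions x 2 genders = 64 personas
personas = [
    {"mbti": m, "condition": c, "gender": g}
    for m, c, g in product(MBTI_TYPES, CONDITIONS, GENDERS)
]
assert len(personas) == 64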

Sample Evaluation

[Figure: sample evaluation output, showing round-by-round dialogue with empathy, persuasion, and safety scores.]

System Architecture

OSCE-Project is built on the AgentBeats platform using the A2A protocol for standardized agent evaluation. The system consists of:

  • Judge Agent — Central orchestrator managing the evaluation lifecycle
  • Persona Manager — Manages 64 patient personas with MBTI traits
  • Patient Constructor — Generates complete backgrounds from templates
  • Patient Agent — Simulates MBTI-driven personality-consistent behavior
  • Per-Round Scoring — LLM-as-judge evaluation using 30 criteria
  • Report Generator — Creates comprehensive performance analysis
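
End to end, an evaluation episode might look like the loop below. This is a sketch under assumed interfaces (respond, score_round, and generate_report are hypothetical method names, and the round cap is a guess); in the real system the components communicate over the A2A protocol rather than direct calls:

MAX_ROUNDS = 20  # assumed cap; the actual limit comes from the scenario config

def run_episode(doctor, patient, judge, persona):
    """One OSCE episode: dialogue until acceptance, refusal, or the round cap."""
    transcript, scores = [], []
    for round_number in range(1, MAX_ROUNDS + 1):
        doctor_turn = doctor.respond(transcript)             # clinical info only
        patient_turn = patient.respond(transcript, persona)  # full hidden state
        transcript += [doctor_turn, patient_turn]
        scores.append(judge.score_round(transcript, round_number))
        if patient_turn.decision in ("accept", "refuse_and_leave"):
            break
    return judge.generate_report(transcript, scores)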

Quick Start

# Clone and install
git clone https://github.com/MadGAA-Lab/OSCE-Project.git
cd OSCE-Project
uv sync

# Configure API keys
cp sample.env .env
# Add your OpenAI/Anthropic/Gemini credentials

# Run evaluation
uv run agentbeats-run scenarios/medical_dialogue/scenario.toml

Citation

If you use OSCE-Project in your research, please cite:

@software{osce_project,
  title = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year = {2026},
  url = {https://github.com/MadGAA-Lab/OSCE-Project},
  note = {A GAA system for evaluating medical dialogue capabilities}
}