OSCE-Project: Evaluating Medical Dialogue Agents with Generative Adversarial Agents
Evaluating AI doctors is hard, not because language models cannot generate plausible medical dialogue, but because real clinical communication demands empathy, persuasion, and safety all at once. A good doctor must convince a skeptical patient to accept treatment while respecting the patient's autonomy and ensuring informed consent. OSCE-Project evaluates exactly this combination.
The Objective Structured Clinical Examination (OSCE) has been the gold standard for evaluating medical students for decades. We've adapted this framework for AI agents, creating a system where doctor agents face simulated patients with diverse personalities and hidden concerns. Using Generative Adversarial Agents (GAA), we pit doctor against patient in a game of information asymmetry that mirrors real clinical practice.
Most benchmarks for medical AI focus on factual accuracy: can the model diagnose correctly? But diagnosis is only half the battle. Our framework evaluates the full spectrum of clinical communication: building trust, addressing fears, explaining complex procedures, and ultimately guiding patients toward beneficial treatment decisions.
How does it work?
The core insight behind OSCE-Project is information asymmetry: just like real doctors, our doctor agents receive only patient demographics and clinical information (diagnosis, recommended treatment, risks and benefits) and must discover the patient's symptoms and concerns through conversation. The patient agent, assigned one of 16 MBTI personality types, has hidden fears, symptoms, and behavioral patterns that influence how it responds to the doctor's approach.
Doctor Agent
- ✓ Patient demographics
- ✓ Medical diagnosis
- ✓ Treatment details
- ✗ Patient personality
- ✗ Hidden concerns
- ✗ Symptoms (must discover)
Patient Agent
- ✓ Full personality (MBTI)
- ✓ All symptoms
- ✓ Hidden fears & concerns
- ✓ Behavioral patterns
- ✓ Treatment knowledge
- ✓ Decision criteria
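The split between the two views can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual data model: the `PatientRecord` class, its field names, and the `doctor_view` / `patient_view` helpers are all hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical full record for one simulated patient."""
    # Visible to both agents
    age: int
    gender: str
    diagnosis: str
    treatment: str
    # Visible to the patient agent only
    mbti: str
    symptoms: tuple[str, ...]
    hidden_concerns: tuple[str, ...]


# Assumed whitelist of fields the doctor agent may see.
DOCTOR_VISIBLE = ("age", "gender", "diagnosis", "treatment")


def doctor_view(record: PatientRecord) -> dict:
    """Return only the whitelisted fields; everything else stays hidden."""
    full = asdict(record)
    return {name: full[name] for name in DOCTOR_VISIBLE}


def patient_view(record: PatientRecord) -> dict:
    """The patient agent sees the complete record."""
    return asdict(record)


record = PatientRecord(
    age=54,
    gender="female",
    diagnosis="pneumothorax",
    treatment="chest tube insertion",
    mbti="ISTJ",
    symptoms=("chest pain", "shortness of breath"),
    hidden_concerns=("fear of invasive procedures",),
)
print(doctor_view(record))
```

Using a whitelist rather than a blacklist means that any field added to the record later is hidden from the doctor agent by default.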
Each dialogue round is evaluated by an LLM-as-judge using 30 criteria across three dimensions: Empathy (emotional understanding, active listening), Persuasion (addressing concerns, building trust), and Safety (informed consent, risk communication). The dialogue continues until the patient accepts treatment, refuses and leaves, or the maximum number of rounds is reached.
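The three stopping conditions map naturally onto a simple loop. The sketch below is a minimal illustration under stated assumptions: `run_dialogue`, the `Outcome` enum, the callable signatures, the `"accept"` / `"refuse"` decision strings, and the default of 10 rounds are all hypothetical and not taken from the OSCE-Project code.

```python
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    REFUSED = "refused"
    MAX_ROUNDS = "max_rounds"


def run_dialogue(doctor_turn, patient_turn, judge_round, max_rounds=10):
    """Run one doctor-patient dialogue and score every round.

    doctor_turn(history) -> doctor message
    patient_turn(history, doctor_msg) -> (patient message, decision)
    judge_round(doctor_msg, patient_msg) -> scores for that round
    """
    history = []
    round_scores = []
    for _ in range(max_rounds):
        doctor_msg = doctor_turn(history)
        patient_msg, decision = patient_turn(history, doctor_msg)
        history.append((doctor_msg, patient_msg))
        # Every round is judged, including the final one.
        round_scores.append(judge_round(doctor_msg, patient_msg))
        if decision == "accept":
            return Outcome.ACCEPTED, round_scores
        if decision == "refuse":
            return Outcome.REFUSED, round_scores
    return Outcome.MAX_ROUNDS, round_scores
```

In a real run the three callables would wrap LLM calls; passing them in as plain functions keeps the loop easy to test with stubs.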
Patient Personas
We generate 64 unique patient personas by combining:
- 16 MBTI Types — From analytical INTJs who demand data to empathetic ENFPs who need emotional support
- 2 Medical Conditions — Pneumothorax (urgent) and Lung Cancer (complex treatment decisions)
- 2 Genders — Different communication patterns and concerns
Each persona generates personality-consistent responses: an ISTJ patient will want detailed statistics and structured explanations, while an ESFP might respond better to reassurance and personal stories. The doctor agent must adapt its communication style without knowing the patient's personality type.
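The persona count follows directly from the three factors: 16 × 2 × 2 = 64. A short sketch of that enumeration is below. The identifier format, the dictionary keys, and the gender labels are assumptions made for illustration; the project's own persona files may be organized differently.

```python
from itertools import product

# The 16 MBTI types are all combinations of the four letter pairs.
MBTI_TYPES = ["".join(letters) for letters in product("EI", "SN", "TF", "JP")]
CONDITIONS = ["pneumothorax", "lung_cancer"]
GENDERS = ["female", "male"]  # assumed labels; the source only says "2 genders"

personas = [
    {
        "id": f"{mbti}_{gender}_{condition}",  # hypothetical ID scheme
        "mbti": mbti,
        "gender": gender,
        "condition": condition,
    }
    for mbti, condition, gender in product(MBTI_TYPES, CONDITIONS, GENDERS)
]

print(len(MBTI_TYPES), len(personas))
```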
System Architecture
OSCE-Project is built on the AgentBeats platform using the A2A protocol for standardized agent evaluation. The system consists of:
- Judge Agent — Central orchestrator managing the evaluation lifecycle
- Persona Manager — Manages 64 patient personas with MBTI traits
- Patient Constructor — Generates complete backgrounds from templates
- Patient Agent — Simulates MBTI-driven personality-consistent behavior
- Per-Round Scoring — LLM-as-judge evaluation using 30 criteria
- Report Generator — Creates comprehensive performance analysis
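To illustrate how per-round scores might feed the report, here is a rough sketch that averages criterion scores within each dimension for every round, then averages those round-level values across the dialogue. The `summarize` function, the criterion IDs, the 0 to 1 score scale, and the averaging scheme are all assumptions; the source does not specify how the 30 criteria are weighted or combined.

```python
from collections import defaultdict
from statistics import mean


def summarize(round_scores, criterion_dimension):
    """Aggregate per-round criterion scores into one score per dimension.

    round_scores: list of {criterion_id: score}, one dict per round
    criterion_dimension: {criterion_id: dimension name}
    """
    per_dimension = defaultdict(list)
    for scores in round_scores:
        by_dim = defaultdict(list)
        for criterion, score in scores.items():
            by_dim[criterion_dimension[criterion]].append(score)
        # One averaged value per dimension for this round.
        for dim, values in by_dim.items():
            per_dimension[dim].append(mean(values))
    # Average the round-level values across the whole dialogue.
    return {dim: mean(values) for dim, values in per_dimension.items()}


# Tiny example with 4 hypothetical criteria instead of the full 30.
criterion_dimension = {
    "e1": "empathy",
    "e2": "empathy",
    "p1": "persuasion",
    "s1": "safety",
}
rounds = [
    {"e1": 1.0, "e2": 0.0, "p1": 1.0, "s1": 0.0},
    {"e1": 1.0, "e2": 1.0, "p1": 0.0, "s1": 1.0},
]
print(summarize(rounds, criterion_dimension))
```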
Quick Start
```shell
# Clone and install
git clone https://github.com/MadGAA-Lab/OSCE-Project.git
cd OSCE-Project
uv sync

# Configure API keys
cp sample.env .env
# Add your OpenAI/Anthropic/Gemini credentials

# Run evaluation
uv run agentbeats-run scenarios/medical_dialogue/scenario.toml
```
Citation
If you use OSCE-Project in your research, please cite:
```bibtex
@software{osce_project,
  title  = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-Project},
  note   = {A GAA system for evaluating medical dialogue capabilities}
}
```