
MediX-R1

Open-Ended Medical Reinforcement Learning

Open source medical vision language model trained with composite reinforcement rewards.

Sahal Shaji Mullappilly¹* · Mohammed Irfan Kurpath¹* · Omair Mohamed² · Mohamed Zidan³ · Fahad Khan¹ · Salman Khan¹ · Rao Anwer¹ · Hisham Cholakkal¹
¹Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) · ²Jubilee Mission Medical College · ³JJM Medical College

* Equal Contribution

8B Model

68.8%overall average

Across both LLM and VLM benchmarks, outperforming the much larger 27B MedGemma (68.4%).

30B Model / Best Overall

73.6%overall average

Demonstrating the effectiveness of our composite reward design across all 17 benchmarks.

Benchmark Average

73.6%

Across 17 medical LLM & VLM benchmarks

USMLE-SA

95.1%

U.S. Medical Licensing Exam accuracy

Expert Preferred

72.7%

Chosen by doctors in blind review

Training Samples

51K

Spanning 16 medical imaging modalities

What Sets MediX-R1 Apart

The only medical VLM that combines all six capabilities: composite RL rewards, open-ended responses, interpretable reasoning, single-stage RL training, 16-modality coverage, and annotation-free data.

Reward

Composite RL Reward

Four reward signals (LLM accuracy, embedding similarity, format enforcement, and modality recognition) work together to prevent reward hacking and stabilize training.
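The blending described above can be sketched as a weighted sum of the four component scores. The weights below are illustrative assumptions, not the values used in MediX-R1:

```python
# Sketch of a composite RL reward, assuming four pre-computed component
# scores in [0, 1]. The weights are illustrative, not the paper's.
def composite_reward(llm_acc: float, emb_sim: float,
                     fmt_ok: float, modality_ok: float,
                     weights=(0.5, 0.3, 0.1, 0.1)) -> float:
    """Blend accuracy, semantic, format, and modality signals."""
    w_acc, w_emb, w_fmt, w_mod = weights
    return (w_acc * llm_acc + w_emb * emb_sim
            + w_fmt * fmt_ok + w_mod * modality_ok)

# A fully correct, well-formatted, modality-aware answer scores 1.0:
print(composite_reward(1.0, 1.0, 1.0, 1.0))  # 1.0
```

Because no single component can dominate, a policy that games one signal (e.g. copying format without content) still receives a low total reward.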

Output

Open-Ended Responses

Goes beyond MCQ to produce free-form clinical answers. LLM-as-judge and embedding rewards provide reliable feedback where string-matching metrics fail.

Reasoning

Interpretable Thinking

Emits structured reasoning traces in <think>...</think> blocks, enforced by a format reward. Every decision path is auditable and clinically grounded.
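A format reward of this kind can be checked with a single regular expression. The exact template below (a `<think>` block followed by an `<answer>` block) is an assumption based on the output examples on this page:

```python
import re

# Sketch of a format reward: 1.0 if the response follows the assumed
# <think>...</think><answer>...</answer> structure, else 0.0.
_PATTERN = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    return 1.0 if _PATTERN.search(response) else 0.0

good = "<think>PA view, normal silhouette.</think>\n<answer>Normal</answer>"
print(format_reward(good))      # 1.0
print(format_reward("Normal"))  # 0.0
```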

Training

Single-Stage RL

No multi-step pipelines. One RL stage with group-based optimization replaces the typical pretrain → SFT → RL chain, simplifying training end-to-end.
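Group-based optimization replaces a learned value model with statistics computed over a group of sampled responses per prompt. A minimal sketch of the group-relative advantage step, under the assumption of GRPO-style normalization:

```python
# Sketch of group-based advantage estimation: score a group of sampled
# responses with the reward, then normalize within the group so no
# separate value network is needed. Details are assumptions.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for 4 sampled answers to one prompt:
advs = group_advantages([0.9, 0.6, 0.3, 0.6])
print([round(a, 2) for a in advs])  # [1.41, 0.0, -1.41, 0.0]
```

Responses better than their group mean get positive advantage and are reinforced; worse ones are suppressed.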

Modalities

16 Medical Modalities

X-Ray, CT, MRI, Microscopy, Ultrasound, Endoscopy, Mammography, OCT, Fundus, and more. All in a single model trained on 51K examples.

Data

Annotation-Free

No expensive human-curated rationales needed. RL rewards operate on final answers via LLM judge and embeddings, drastically lowering data curation cost.
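The embedding part of this signal can be sketched as cosine similarity between the embeddings of the candidate and reference answers, rescaled to [0, 1]. The toy vectors below stand in for outputs of a real sentence-embedding model:

```python
import math

# Sketch of an annotation-free semantic reward: cosine similarity between
# embeddings of the model's answer and the reference answer.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embedding_reward(answer_vec, reference_vec) -> float:
    # Map cosine in [-1, 1] to a reward in [0, 1].
    return (cosine(answer_vec, reference_vec) + 1.0) / 2.0

# Toy embeddings: "low perfusion" vs "low blood flow" point the same way.
print(round(embedding_reward([2.0, 1.0, 0.0], [4.0, 2.0, 0.0]), 3))  # 1.0
```

Synonymous phrasings score near 1.0 even when exact-string metrics would score 0.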

Capability coverage across the six axes (Modalities, Single-Stage RL, Reasoning, Open-Ended, Annotation-Free, Composite RL):

MedVLM-R1: 3/6
BiMediX2: 2/6
HuatuoGPT-V: 2/6
MedGemma: 3/6
Lingshu: 3/6
MediX-R1: 6/6

Benchmarks

Evaluated across 17 medical benchmarks with LLM-as-judge scoring, human expert review, and real-world clinical validation.

Overall Ranking

1. MediX-R1 30B: 73.6%
2. MediX-R1 8B: 68.8%
3. MedGemma 27B: 68.4%
4. MedMO 8B: 62.1%
5. MedGemma 4B: 56.6%
6. HuatuoGPT-V 7B: 55.8%
7. BiMediX2 8B: 55.6%
8. MediX-R1 2B: 55.4%
9. MedVLM-R1 2B: 42.7%

Average accuracy across all 17 benchmarks

| Benchmark | MedVLM-R1 2B | BiMediX2 8B | HuatuoGPT-V 7B | MedGemma 4B | MedGemma 27B | MedMO 8B | MediX-R1 2B | MediX-R1 8B | MediX-R1 30B |
|---|---|---|---|---|---|---|---|---|---|
| MMLU-Clinical | 0.540 | 0.732 | 0.721 | 0.708 | *0.879* | 0.864 | 0.660 | 0.845 | **0.894** |
| MMLU-Bio | 0.549 | 0.792 | 0.708 | 0.706 | *0.972* | 0.951 | 0.806 | 0.951 | **0.993** |
| MMLU-Med | 0.451 | 0.694 | 0.653 | 0.605 | 0.866 | 0.827 | 0.699 | *0.879* | **0.890** |
| MMLU-Genetics | 0.560 | 0.790 | 0.710 | 0.820 | *0.940* | 0.900 | 0.680 | 0.900 | **0.980** |
| MMLU-ProfMed | 0.500 | 0.695 | 0.625 | 0.713 | *0.912* | 0.890 | 0.581 | 0.868 | **0.974** |
| MMLU-Anat | 0.519 | 0.659 | 0.600 | 0.556 | *0.793* | 0.785 | 0.563 | 0.763 | **0.874** |
| MedMCQA | 0.408 | 0.572 | 0.511 | 0.570 | *0.727* | 0.662 | 0.492 | 0.683 | **0.781** |
| MedQA | 0.400 | 0.583 | 0.534 | 0.621 | *0.866* | 0.848 | 0.497 | 0.796 | **0.929** |
| USMLE-SA | 0.378 | 0.591 | 0.538 | 0.639 | *0.895* | 0.742 | 0.505 | 0.822 | **0.951** |
| PubMedQA | 0.520 | 0.520 | *0.542* | 0.470 | 0.414 | **0.586** | 0.472 | 0.482 | 0.490 |
| MIMIC-CXR-Sum | 0.704 | 0.672 | 0.707 | 0.692 | *0.767* | 0.709 | **0.786** | 0.746 | 0.765 |
| SLAKE-VQA | 0.434 | 0.468 | 0.545 | 0.678 | 0.634 | 0.479 | 0.654 | **0.703** | *0.683* |
| RadVQA | 0.404 | 0.530 | 0.614 | **0.659** | 0.585 | 0.419 | 0.539 | 0.596 | *0.625* |
| PathVQA | 0.239 | 0.323 | 0.374 | 0.317 | 0.322 | 0.272 | 0.428 | **0.455** | *0.445* |
| PMC-VQA | 0.398 | 0.482 | 0.532 | 0.444 | 0.478 | 0.360 | 0.491 | *0.554* | **0.571** |
| PMC-VQA-Hard | 0.020 | 0.229 | 0.261 | 0.214 | **0.354** | 0.177 | 0.284 | *0.317* | 0.307 |
| MIMIC-CXR-Gen | 0.240 | 0.124 | 0.316 | 0.205 | 0.224 | 0.084 | 0.280 | *0.328* | **0.350** |
| AVG | 0.427 | 0.556 | 0.558 | 0.566 | 0.684 | 0.621 | 0.554 | *0.688* | **0.736** |

Best results are in bold, second best in italics. Top section: LLM (text-only) tasks. Bottom section: VLM (image+text) tasks.

Expert Preferred

72.7%

Blind preference from doctors

MediX-R1: 72.7% · Llama3.2: 13.6% · MedGemma: 9.2% · HuatuoGPT: 4.5%

Model Architecture

A reinforcement learning framework that teaches medical AI to think, not just memorize.

MediX-R1 Architecture
  • Combines a vision-language backbone with group-based reinforcement learning to generate free-form medical answers, not just multiple-choice selections.
  • A composite reward signal judges accuracy via LLM, captures medical synonyms via embeddings, and enforces structured reasoning and modality awareness.
  • Trained on only ~50K instruction examples, yet generalizes across radiology, pathology, and clinical text tasks.
  • Uses a Reference-based LLM-as-judge evaluation instead of brittle string-matching metrics, capturing real clinical correctness.
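The reference-based judging step can be sketched as a grading prompt plus a binary parser. The prompt wording and the `CORRECT`/`INCORRECT` protocol below are assumptions for illustration; the judge model call itself is omitted:

```python
# Sketch of a reference-based LLM-as-judge check. The actual call to a
# judge model is omitted; only prompt construction and reply parsing
# are shown, and both are illustrative assumptions.
def build_prompt(question: str, gt: str, candidate: str) -> str:
    return (
        "You are a medical answer grader.\n"
        f"Question: {question}\nReference answer: {gt}\n"
        f"Candidate answer: {candidate}\n"
        "Reply CORRECT if the candidate is semantically aligned with the "
        "reference and clinically appropriate, else INCORRECT."
    )

def score(judge_reply: str) -> float:
    # Binary decision: 1.0 for CORRECT, 0.0 otherwise.
    return 1.0 if judge_reply.strip().upper().startswith("CORRECT") else 0.0

print(score("CORRECT - same clinical meaning"))  # 1.0
print(score("INCORRECT"))                        # 0.0
```

Unlike string matching, this grades "low blood flow" against a reference of "low perfusion" as correct.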

Qualitative Examples

Microscopy
Microscopy example — MediX-R1 identifies the optic tract

MediX-R1 correctly identifies the optic tract in a histology cross-section, providing step-by-step reasoning grounded in the visual features of the image.

Radiology
Radiology example — MediX-R1 explains heart size in PA vs AP view

Given a chest X-ray, MediX-R1 explains why heart size appears smaller in PA versus AP projection, demonstrating real clinical understanding, not pattern matching.

Composite Reward Design

[Figure: mean reward (0.35–0.70) vs. training step (0–140) for each reward configuration.]

Training with individual reward signals (LLM-only or embedding-only) leads to instability and reward hacking. Combining LLM + embedding reduces volatility but doesn't eliminate it. MediX-R1's composite reward, blending accuracy, semantic, format, and modality signals, stabilizes learning and delivers the highest final performance.

Evaluation Framework

Stage 1: Generation

Batched inference via vLLM

Model processes input

Generates structured responses

Persists full output per sample

Example Input

"What does the dark blue color on the laser speckle contrast analysis perfusion image represent?"

Model Output

<MICROSCOPY>

<think>...reasoning...</think>

<answer>

Low blood flow or less perfusion

</answer>
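Parsing the structured output above can be sketched with a few regular expressions. The tag names follow the example shown; real model outputs may vary:

```python
import re

# Sketch of parsing the structured output format shown above:
# an uppercase modality tag, a <think> block, and an <answer> block.
def parse_output(text: str) -> dict:
    modality = re.search(r"<([A-Z_]+)>", text)
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "modality": modality.group(1) if modality else None,
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }

sample = ("<MICROSCOPY>\n<think>Dark blue marks low signal.</think>\n"
          "<answer>Low blood flow or less perfusion</answer>")
print(parse_output(sample)["answer"])  # Low blood flow or less perfusion
```

Only the extracted answer span is passed to the judge; the reasoning trace is kept for auditing.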

Stage 2: Evaluation

Reference-based LLM-as-judge

BASE template for QA/MCQ

MIMIC template for reports

Binary decisions & rubric scores

Ground Truth (GT)

"Low perfusion"

Candidate Answer (CA)

"Low blood flow or less perfusion"

Judge Comparison

✔ GT and CA are semantically aligned

✔ CA is clinically appropriate to GT

Stage 3: Scoring

Aggregate to dataset metrics

Mean accuracy for binary eval

Average rubric scores

Macro averages across benchmarks

Binary Decision

CORRECT

Final Score

1.0

A three-stage pipeline: (1) Generate answers via vLLM inference, (2) Evaluate using a Reference-based LLM-as-judge with BASE and MIMIC templates, and (3) Aggregate scores. Supports binary decisions for QA/MCQ tasks and rubric-based scoring for long-form clinical reports.
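The Stage 3 aggregation can be sketched as a per-benchmark mean of binary judge decisions followed by a macro average, so each benchmark counts equally regardless of its size (an assumption consistent with the "macro averages" note above):

```python
# Sketch of Stage 3 scoring: mean accuracy per benchmark from binary
# judge decisions, then a macro average across benchmarks.
def macro_average(per_sample_scores: dict[str, list[float]]) -> float:
    per_benchmark = {
        name: sum(scores) / len(scores)
        for name, scores in per_sample_scores.items()
    }
    return sum(per_benchmark.values()) / len(per_benchmark)

scores = {
    "MedQA": [1.0, 1.0, 0.0, 1.0],  # 0.75 benchmark accuracy
    "SLAKE-VQA": [1.0, 0.0],        # 0.50 benchmark accuracy
}
print(macro_average(scores))  # 0.625
```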