Leaderboard

Accuracy of multimodal models across 5 board games, 3 reasoning tiers, and 3 rulebook modalities.

Dataset Browser

Browse 638 annotated QA examples across 5 games and 3 difficulty tiers.

LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments [ICLR 2026]

Abstract

We introduce LudoBench, a multimodal reasoning benchmark that evaluates whether vision-enabled large language models (LMs) can acquire, integrate, and reason over heterogeneous game knowledge in mainstream analog tabletop games. Unlike prior works that emphasize deep strategic mastery, LudoBench targets an initial reasoning challenge uninitiated gamers face: correctly comprehending a new tabletop strategy game for the first time. We examine whether, given a visual depiction of a tabletop scene and a corresponding ruleset, a model can correctly answer grounded questions about the pictured scenario.

Concretely, LudoBench tests three cumulative situated game-comprehension capabilities:

  • Tier 1: Environment Perception — recognizing objects, counting components, and identifying basic game state features
  • Tier 2: Heterogeneous Rules Integration — applying multimodal rulebook knowledge to answer grounded questions
  • Tier 3: Short-Horizon Optimization — planning optimal moves requiring strategic reasoning over game mechanics

These progressively stress-test the foundational reasoning required for real-world game comprehension.
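The tier structure above can be sketched as a small data-handling example. This is an illustrative assumption, not the dataset's actual schema: the field names, file name, and `tier_name` helper are hypothetical.

```python
# Hypothetical sketch of a single tiered QA record; all field names and
# values here are illustrative assumptions, not LudoBench's real schema.
example = {
    "game": "Kingdomino",
    "tier": 2,  # 1 = perception, 2 = rules integration, 3 = optimization
    "scene_image": "kingdomino_state_017.png",  # hypothetical file name
    "question": "Which player scores the most points from forest tiles?",
    "answer": "Player 2",
}

def tier_name(tier: int) -> str:
    """Map a tier index to the capability it stress-tests."""
    return {
        1: "Environment Perception",
        2: "Heterogeneous Rules Integration",
        3: "Short-Horizon Optimization",
    }[tier]

print(tier_name(example["tier"]))  # Heterogeneous Rules Integration
```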

Evaluating frontier LMs on five diverse strategy games, we find that even the strongest models achieve only ~76% accuracy on simple T1 environment perception tasks and fall below 13% on situated T3 multi-step comprehension puzzles that hobbyist gamers can routinely solve. Our extensive failure analysis and knowledge-ablation experiments reveal that models largely fail to comprehend rich cross-modal reference knowledge and are subsequently unable to apply this knowledge to messy and unfamiliar situated environments. Our findings highlight the many steps remaining for current methods to succeed on complex multimodal reasoning in the real world.

Games Overview

The dataset consists of five tabletop strategy games that vary widely in complexity, components, and rule structure. The details of each game—along with representative sample game states—are shown below.

Game          | Rulebook | Diff. | Unique Game Properties                                                           | # Rules | # Figs.
Kingdomino    | 4 pg.    | 1.2   | tile-laying, spatial scoring, individual player areas                            | 35      | 6
Carcassonne   | 8 pg.    | 1.9   | shared tile-laying, dynamic board topology, position-coded roles                 | 39      | 30
Catan         | 16 pg.   | 2.3   | network building, connectivity constraints, action chaining                      | 44      | 19
Res Arcana    | 12 pg.   | 2.6   | card-based interactions, heavy symbol usage, card orientation, action sequencing | 112     | 31
Pax Ren. (2e) | 44 pg.   | 4.6   | shared map, private cards/tableau, large number of components, intricate ruleset | 247     | 58

Tiers Overview

The benchmark evaluates models across three tiered reasoning levels that progressively increase in difficulty, from basic visual perception to rule integration and short-horizon planning. An example of how questions differ for each tier in Kingdomino is shown below:

Tier-wise Q&A Example

Knowledge Ablation: Rules Modalities

A central question in LudoBench is whether models can acquire and apply game rules from different knowledge sources. To investigate this, every question is evaluated under three rules modality conditions that vary what reference knowledge is provided alongside the game-state image and question:

None (Parametric)

No rulebook is provided. The model must rely entirely on parametric knowledge — whatever it has internalized about the game from pretraining. This baseline reveals how much a model already "knows" about a game's rules.

Text Rules

The game's rulebook is provided as extracted text in the prompt context. This tests whether explicit textual rule descriptions improve situated reasoning, and whether models can ground text-based rules against a visual game state.

Image Rules

The rulebook is provided as images of the original pages, including diagrams, icons, and annotated examples. This tests the model's ability to extract and apply rules from rich, cross-modal visual documents — the format real players actually encounter.
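The three conditions above amount to varying what reference material is packed into the model's input. A minimal sketch of that prompt-assembly logic, assuming a generic multimodal API that accepts interleaved text and image inputs (the function and field names are illustrative, not the paper's actual harness):

```python
def build_inputs(question, scene_image, modality,
                 rulebook_text=None, rulebook_pages=None):
    """Assemble model inputs for one rules-modality condition.

    Illustrative sketch: 'images' and 'text' are assumed fields of a
    generic multimodal request, not a specific provider's API.
    """
    inputs = {"images": [scene_image], "text": question}
    if modality == "none":
        # Parametric condition: no reference material, the model must
        # rely on whatever it internalized during pretraining.
        pass
    elif modality == "text":
        # Extracted rulebook text is prepended to the question.
        inputs["text"] = f"Rulebook:\n{rulebook_text}\n\nQuestion: {question}"
    elif modality == "image":
        # Original rulebook pages precede the game-state image.
        inputs["images"] = list(rulebook_pages) + [scene_image]
    else:
        raise ValueError(f"unknown modality: {modality}")
    return inputs
```

Because only the reference material varies, any accuracy difference across the three conditions isolates how well a model acquires rules from each knowledge source.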

Across all three conditions, models consistently struggle to comprehend cross-modal reference knowledge. Notably, providing rulebook content — whether as text or images — does not uniformly improve performance, revealing fundamental gaps in how models integrate heterogeneous knowledge with situated visual environments.

Failure Analysis

We analyze where models go wrong by collecting common failure cases across multiple models and organizing them into a visualization, using Kingdomino as a running example. The table below summarizes the relevant rulebook rules, supporting annotations, and the observed model errors for each failure pattern.

Failure Analysis

Citation

@inproceedings{peper2026ludobench,
  title={{LLMs} as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments},
  author={Peper, Joseph J. and Gandra, Sai Krishna and Zhang, Yunxiang and Chennareddy, Vaibhav and Jha, Shloki and Payani, Ali and Wang, Lu},
  booktitle={Proceedings of the Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026},
  address={Rio de Janeiro, Brazil}
}