Example task: Positional Count
Q: What is the total number of triangles strictly below the red triangle (below its bottommost point)?
A: 5
Q: The top diagram is a bar chart. The bottom row contains four pie charts. Which pie chart matches the same proportional breakdown by color as the bar chart?
A: (b)
Q: Look at the top image: one region is uncolored. Which option (a)-(d) provides the exact missing colors?
A: (d)
Q: In the top row, the motif rotates by a constant angle each step. Which option (a)-(d) below fills the blank?
A: (b)
Q: Two tiles are shown (left/right). A cell counts as different if its color differs, including filled versus blank. How many cells differ?
A: 5
Q: Identify the transformation applied between the left and right images.
A: (c) reflect across the anti-diagonal
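Tasks like the last two above are scored programmatically. Below is a minimal sketch of such checks, assuming tiles are represented as 2-D integer arrays of color codes (with 0 for blank); the function names and grid encoding are illustrative assumptions, not the actual implementation.

import numpy as np

def count_differing_cells(left: np.ndarray, right: np.ndarray) -> int:
    # A cell counts as different if its color code differs,
    # including filled (nonzero) versus blank (0).
    return int(np.sum(left != right))

def is_anti_diagonal_reflection(left: np.ndarray, right: np.ndarray) -> bool:
    # Reflecting a square grid across the anti-diagonal is equivalent to
    # transposing it and then reversing both axes.
    return np.array_equal(right, left.T[::-1, ::-1])

left = np.array([[1, 2],
                 [0, 3]])
right = left.T[::-1, ::-1]   # [[3, 2], [0, 1]]
assert is_anti_diagonal_reflection(left, right)
assert count_differing_cells(left, right) == 2

Because every answer is derived from the underlying scene description rather than from the rendered image, checking a model's response reduces to an exact comparison.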
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision–language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
A synthetic environment with 25 procedurally generated visual perception and reasoning tasks grounded in verifiable answers.
A benchmark showing a large human-model gap, with GPT-5 reaching 51.1% accuracy against 75.8% human performance.
RLVR on Sphinx improves multimodal reasoning and transfers to external visual reasoning benchmarks.
Sphinx is a modular synthetic environment that composes motifs, tilings, charts, icons, and geometric primitives into visual reasoning puzzles with verifiable answers. The framework supports 25 task types spanning symmetry, spatial reasoning, chart interpretation, transformations, and sequence prediction, enabling both controlled evaluation and large-scale dataset generation.
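As a concrete illustration of this generation-with-verification loop, here is a minimal sketch of a generator for the positional-count task shown at the top of the page. The instance format, field names, and coordinate convention (y increasing upward) are assumptions for illustration, not the framework's actual API.

import random

def generate_positional_count(num_triangles: int = 10, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Summarize each triangle by the y-coordinate of its bottommost vertex.
    triangles = [{"bottom_y": rng.uniform(0.0, 1.0)} for _ in range(num_triangles)]
    red = rng.choice(triangles)
    # The ground truth is computed from the scene itself, so the answer is
    # verifiable by construction: count triangles strictly below the red one.
    answer = sum(t["bottom_y"] < red["bottom_y"] for t in triangles if t is not red)
    return {
        "question": "What is the total number of triangles strictly below "
                    "the red triangle (below its bottommost point)?",
        "scene": triangles,
        "answer": answer,
    }

The same pattern (sample a scene, render it, derive the answer from the scene parameters) applies across task types, which is what makes both large-scale dataset construction and exact evaluation possible.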
The benchmark contains 2,500 examples across 25 tasks and exposes a large gap between humans and frontier LVLMs: GPT-5 reaches 51.1% accuracy versus 75.8% for humans, with the largest deficits on symmetry, sequential-reasoning, and transformation-heavy tasks.
Per-task comparisons show that failure modes are highly structured rather than uniform across the benchmark. Humans retain strong advantages on several tasks grounded in explicit spatial and transformational reasoning, while model-to-model differences reveal substantial variation in what current LVLMs can solve reliably.
Reinforcement learning with verifiable rewards substantially improves the Qwen3 models on both Sphinx and external multimodal reasoning benchmarks. The strongest gains come from the Qwen3-VL-4B and Qwen3-VL-8B families, with consistent improvements across held-out Sphinx tasks and the math-oriented transfer benchmarks.
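A minimal sketch of the kind of verifiable reward such training can use: 1 if the model's parsed final answer exactly matches the generated ground truth, 0 otherwise. The answer-extraction convention (a final "Answer: ..." line) is an assumption for illustration, not the paper's actual protocol.

import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Extract the model's final answer; failure to parse means zero reward.
    match = re.search(r"Answer:\s*(.+)", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().lower().rstrip(".")
    return 1.0 if prediction == ground_truth.strip().lower() else 0.0

assert verifiable_reward("There are five such triangles. Answer: 5", "5") == 1.0
assert verifiable_reward("Answer: (b)", "(c)") == 0.0

Because the reward is an exact comparison against procedurally generated ground truth, it requires no learned reward model and cannot be gamed by fluent but wrong answers.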
@inproceedings{alam2026sphinx,
  title={Sphinx: A Synthetic Environment for Visual Perception and Reasoning},
  author={Md Tanvirul Alam and Saksham Aggarwal and Justin Yang Chae and Nidhi Rastogi},
  booktitle={2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - FINDINGS Track (CVPRF)},
  year={2026}
}