Our ARC AGI 2 journey

An independent team working in their free time can make meaningful research progress
Author

Dibya Chakravorty

Published

February 24, 2026

My friends and I recently obtained a simple but surprising result on ARC AGI 2: a more than 4x performance improvement from GPT-OSS-120B and double-digit gains with GPT 5.2. Our team’s blog post describes the work in formal detail; here, I will record our journey.

We casually started working on ARC AGI 2 last summer, with the goal of participating in the ARC Prize Kaggle competition. Early on, we were exploring agentic coding with frontier reasoning models and found that models like o3 and o4-mini could generate high-quality synthetic ARC-style puzzles.

We generated a dataset of roughly 8,000 high-quality synthetic puzzles of varying complexity. Here are a few illustrative examples.

A standard synthetic puzzle


A synthetic puzzle requiring contextual rule application


A synthetic puzzle requiring multi-step composition


We planned to use these synthetic puzzles to train a smaller model via agentic reinforcement learning (RLVR, i.e. reinforcement learning with verifiable rewards, with interleaved thinking).
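To make "verifiable" concrete, here is a minimal sketch of what a reward for an ARC-style puzzle could look like. This is not our training code: the function names are made up for illustration, and the only things it assumes are the public ARC conventions that a grid is a rectangular list of rows with integer cells from 0 to 9, and that an answer counts only if it matches the target exactly.

```python
from typing import Any, List

Grid = List[List[int]]


def is_valid_grid(grid: Any) -> bool:
    """Return True if `grid` is a non-empty rectangular list of rows with cells in 0..9."""
    if not isinstance(grid, list) or not grid:
        return False
    if not all(isinstance(row, list) and row for row in grid):
        return False
    width = len(grid[0])
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def verifiable_reward(predicted: Any, target: Grid) -> float:
    """Hypothetical binary reward: 1.0 only for a well-formed, exactly matching grid."""
    if not is_valid_grid(predicted):
        return 0.0  # malformed output earns nothing
    return 1.0 if predicted == target else 0.0
```

Because the check is a plain equality test, the reward needs no learned judge, which is what makes a large synthetic puzzle set attractive for this kind of training.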

We wanted to bootstrap training by distilling successful solution traces from an open-weight reasoning model. That requirement led us to investigate GPT-OSS-120B. Initially, we were disappointed: we could not reliably elicit interleaved thinking from the model, whether we used inference providers on OpenRouter or self-hosted solutions like vLLM and SGLang. This led us to investigate how vLLM and SGLang implement the chat template for the model. We found their implementations to be buggy, and patched vLLM to fix the problem.

At this point, we noticed something unexpected: simply placing the model into an agentic coding regime produced large and consistent score improvements on the ARC AGI 2 public eval. We are talking about a more than 4x improvement relative to plain chain-of-thought (CoT). We couldn’t believe the scores we were getting from a medium-sized OSS model!
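For readers unfamiliar with the idea, the sketch below shows one way an agentic coding loop for ARC could be wired up. It is an illustration under stated assumptions, not our actual harness (the team blog post has the real details). The model is represented by a hypothetical `propose(task, feedback)` callable that returns Python source defining `transform(grid)`, and `run_candidate` and `solve` are made-up names. The task dictionary follows the public ARC JSON layout, with `train` and `test` lists of input/output pairs.

```python
from typing import Callable, List, Optional

Grid = List[List[int]]


def run_candidate(source: str, task: dict) -> dict:
    """Load model-written code and check `transform` against every train pair."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: no sandboxing; a real harness must isolate this
        transform = namespace["transform"]
    except Exception as exc:
        return {"passed": False, "feedback": f"code failed to load: {exc!r}"}
    for i, pair in enumerate(task["train"]):
        try:
            predicted = transform([row[:] for row in pair["input"]])
        except Exception as exc:
            return {"passed": False, "feedback": f"train pair {i} raised {exc!r}"}
        if predicted != pair["output"]:
            return {
                "passed": False,
                "feedback": f"train pair {i}: expected {pair['output']}, got {predicted}",
            }
    return {"passed": True, "feedback": "all train pairs match", "transform": transform}


def solve(
    task: dict, propose: Callable[[dict, str], str], max_turns: int = 5
) -> Optional[List[Grid]]:
    """Ask for code, run it, feed errors back, and stop once the train pairs pass."""
    feedback = ""
    for _ in range(max_turns):
        result = run_candidate(propose(task, feedback), task)
        if result["passed"]:
            return [result["transform"](t["input"]) for t in task["test"]]
        feedback = result["feedback"]  # the next proposal can react to this
    return None  # gave up within the turn budget
```

The contrast with plain CoT is that the model gets to test its hypothesis against the training pairs and revise it, instead of committing to an output grid in a single pass.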

This observation ultimately shifted the focus of our work: we wanted to find out how broadly it applies. We tested three model families and got positive results on all three. At that point, we decided to publish our results.

Accuracy on the ARC AGI 2 public eval set

Shortly afterwards, a neolab called Symbolica announced a SOTA result by applying the same method with the newly released Claude Opus 4.6. Here’s their post from X.

A few weeks later, Confluence Labs, a Y Combinator startup, saturated the ARC AGI 2 public eval (97.9%) using the same method with the newly released Gemini 3.1 Pro. Here’s their post from X.

Beyond the clear implications for SOTA, I think this raises interesting scientific questions.