SSA‑3B surpasses a 7 B process‑reward verifier. The same checkpoint can be plugged into frozen base models up to 32 B with no re‑tuning.
The charts below show average accuracy across GSM8K, MATH, AIME‑24, AMC‑23, and OlympiadBench.
Scaling test‑time compute by sampling multiple reasoning paths yields large gains but leaves an oracle gap. We introduce SSA, a tiny LLM fine‑tuned with GRPO to read k candidate solutions and emit one final answer. On GSM8K, MATH, AIME‑24, AMC‑23, and OlympiadBench, SSA reaches 56.1% accuracy, just 3.6 pp shy of the Pass@5 oracle, and beats 7 B process‑reward models while training on <5% of their data. The model generalises across base family (Qwen → Llama‑3), base size (7 B → 32 B), and k without re‑tuning.
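For concreteness, the sketch below shows the two‑stage inference loop this implies: a frozen base model samples k reasoning paths, and the SSA checkpoint reads all of them and emits one final answer. The prompt template, decoding settings, helper names, and checkpoint paths are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the parallel-sample -> SSA aggregation loop, using
# Hugging Face transformers. Prompt template and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

def sample_candidates(base_model, tokenizer, question, k=5, max_new_tokens=1024):
    """Sample k independent reasoning paths from the frozen base model."""
    inputs = tokenizer(question, return_tensors="pt").to(base_model.device)
    outputs = base_model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=k,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs]

def aggregate_with_ssa(ssa_model, ssa_tokenizer, question, candidates, max_new_tokens=512):
    """Concatenate the k candidate solutions into one prompt; SSA emits the final answer."""
    prompt = (
        question
        + "\n\n"
        + "\n\n".join(f"Solution {i + 1}:\n{c}" for i, c in enumerate(candidates))
        + "\n\nFinal answer:"
    )
    inputs = ssa_tokenizer(prompt, return_tensors="pt").to(ssa_model.device)
    output = ssa_model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[1]
    return ssa_tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Usage (hypothetical checkpoint names/paths):
# base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", device_map="auto")
# base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# ssa = AutoModelForCausalLM.from_pretrained("path/to/ssa-3b", device_map="auto")
# ssa_tok = AutoTokenizer.from_pretrained("path/to/ssa-3b")
# candidates = sample_candidates(base, base_tok, "What is 17 * 24?", k=5)
# print(aggregate_with_ssa(ssa, ssa_tok, "What is 17 * 24?", candidates))
```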
| Method | Params | k | Avg Acc (%) ↑ |
|---|---|---|---|
| Pass@1 (base) | 7 B | 1 | 45.5 |
| Pass@5 (oracle) | — | 5 | 59.7 |
| Majority vote | — | 5 | 49.7 |
| Qwen‑PRM | 7 B | 5 | 53.0 |
| SSA (ours) | 3 B | 5 | 56.1 |
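For reference, the Pass@5 (oracle) and majority-vote rows correspond to the simple scoring rules sketched below, under the simplifying assumption that each candidate's final answer has already been extracted and is compared by exact string match (real math benchmarks typically need answer normalisation).

```python
from collections import Counter

def pass_at_k(candidate_answers, gold):
    """Oracle Pass@k: the question counts as solved if any candidate is correct."""
    return any(ans == gold for ans in candidate_answers)

def majority_vote(candidate_answers):
    """Self-consistency baseline: pick the most frequent final answer among candidates."""
    return Counter(candidate_answers).most_common(1)[0][0]

# Example: 5 sampled answers to one question whose gold answer is "42".
answers = ["42", "41", "42", "7", "42"]
print(pass_at_k(answers, "42"))        # True  -> contributes to Pass@5
print(majority_vote(answers) == "42")  # True  -> contributes to majority-vote accuracy
```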
Below we also compare SSA‑3B with sequential RL models that fine‑tune the entire 7 B / 14 B / 32 B backbone. Despite its roughly 10× smaller footprint, SSA stays within 2–3 pp of their accuracy while using a comparable number of test‑time tokens.
Key Questions:
Citation:

@misc{qi2025learningreasonparallelsamples,
  title={Learning to Reason Across Parallel Samples for LLM Reasoning},
  author={Jianing Qi and Xi Ye and Hao Tang and Zhigang Zhu and Eunsol Choi},
  year={2025},
  eprint={2506.09014},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.09014},
}