What is BalatroBench?

BalatroBench benchmarks Large Language Models playing Balatro, a strategic card game that combines poker hands with roguelike progression. We track how different AI models perform across key metrics like rounds reached, decision accuracy, and resource efficiency to understand their strategic reasoning abilities. The benchmark specifically focuses on the LLM's ability to perform tool calls, measuring how effectively models can execute game actions through API interactions.

Metrics Measured

  • Main Score:
    • Round
      Average final round reached across games
  • Tool Calls:
    • Responses with valid tool calls that can be executed in the current game state.
    • Responses with valid tool calls that cannot be executed in the current game state.
    • Responses without valid tool calls.
  • Token Usage:
    • In /
      Average input tokens per tool call
    • Out /
      Average output tokens per tool call (including reasoning tokens)
  • Performance:
    • / [s]
      Average time per tool call in seconds
    • / [m$]
      Average cost per tool call in milli-dollars
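
The metrics above could be computed from per-response records roughly as follows. This is a minimal sketch, not the actual BalatroBench pipeline: the names CallStatus, CallRecord, and summarize are hypothetical, and it assumes that every response counts as one tool call in the averages and that the three tool-call categories are reported as shares of all responses.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class CallStatus(Enum):
    # Hypothetical labels for the three tool-call categories listed above.
    EXECUTED = "valid_executable"      # valid tool call, executable in the current game state
    REJECTED = "valid_not_executable"  # valid tool call, not executable in the current game state
    MISSING = "no_valid_tool_call"     # response without a valid tool call


@dataclass
class CallRecord:
    status: CallStatus
    input_tokens: int
    output_tokens: int  # assumed to already include reasoning tokens
    seconds: float
    cost_usd: float


def summarize(final_rounds, calls):
    """Aggregate one model's games into the metrics described above.

    final_rounds: final round reached in each game.
    calls: one CallRecord per model response (assumption: all responses
    are included in the per-call averages, whatever their status).
    """
    total = len(calls)
    return {
        "round": mean(final_rounds),
        "tool_calls": {
            status.value: sum(1 for c in calls if c.status is status) / total
            for status in CallStatus
        },
        "in_per_call": mean(c.input_tokens for c in calls),
        "out_per_call": mean(c.output_tokens for c in calls),
        "seconds_per_call": mean(c.seconds for c in calls),
        # 1 milli-dollar (m$) is one thousandth of a dollar.
        "millidollars_per_call": 1000 * mean(c.cost_usd for c in calls),
    }
```

Keeping the three tool-call categories as an explicit enum makes it hard to confuse a malformed response with a well-formed call that the game state rejects, which are different kinds of failure.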

Leaderboards

Model Leaderboard

Compare performance across different LLMs. Models are ranked by their average final round reached, with detailed statistics on costs, speed, and reliability.
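
As an illustration of this ranking rule, the sketch below orders models by their average final round, highest first. The function name rank_models and the input shape (model name mapped to a list of final rounds) are assumptions made for the example; how the real leaderboard breaks ties is not stated here, so this version simply keeps input order for equal averages.

```python
from statistics import mean


def rank_models(runs):
    """Return (model, average_final_round) pairs, best first.

    runs: dict mapping a model name to the final rounds of its games
    (hypothetical input shape). Models with no games are skipped.
    """
    averages = {model: mean(rounds) for model, rounds in runs.items() if rounds}
    # sorted() is stable, so models with equal averages keep their input order.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder model names and made-up rounds, for illustration only.
    for model, avg in rank_models({"model-a": [3, 5], "model-b": [8, 10]}):
        print(model, avg)
```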

View Model Leaderboard →

Community Strategies

Explore community-submitted strategies and approaches to playing Balatro with a fixed model. You can also contribute your own strategies.

View Community Leaderboard →

Related Projects

BalatroBot

The foundational Python framework that enables automated interaction with Balatro. Provides the core API for game state reading, action execution, and bot development that powers all LLM players in our benchmarks.

BalatroLLM

The LLM-powered bot that generates all benchmark data displayed on this site. Uses strategy templates and game state analysis to make strategic decisions, then runs multiple games to produce the performance statistics you see in our leaderboards.

BalatroBench

The repository behind this website, which processes and visualizes the benchmark data. Takes raw game results from BalatroLLM runs and transforms them into interactive leaderboards with detailed statistics, charts, and performance analysis.