The Virtual Tools Game

(Allen*, Smith*, Tenenbaum, PNAS, In Press)

Many animals, and an increasing number of artificial agents, display sophisticated capabilities to perceive and manipulate objects. But human beings remain distinctive in their capacity for flexible, creative tool use – using objects in new ways to act on the world, achieve a goal, or solve a problem. To study this type of general physical problem solving, we introduce the Virtual Tools game. In this game, people solve a large range of challenging physical puzzles in just a handful of attempts. We propose that the flexibility of human physical problem solving rests on an ability to imagine the effects of hypothesized actions, while the efficiency of human search arises from rich action priors which are updated via observations of the world. We instantiate these components in the "Sample, Simulate, Update" (SSUP) model and show that it captures human performance across 30 levels of the Virtual Tools game. More broadly, this model provides a mechanism for explaining how people condense general physical knowledge into actionable, task-specific plans to achieve flexible and efficient physical problem-solving.

This site provides an overview of our main results as well as links to our full paper and supplement, and all code and data needed to reproduce our results or try out your own models on these tasks.

Virtual Tools Game

The rules of the game

  • Get any red object into any green area to win the level

  • Solve the level with a single action (any tool, any position)

  • Solve the level in as few attempts as possible


  • Levels leverage different physical concepts, such as blocking, tipping or supporting objects in a scene

  • 12 levels are matched pairs (A/B) that contain subtle differences which have large effects on the strategies needed to succeed
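Each attempt is a single action: pick one of the three tools and choose a 2-D placement. A minimal sketch of how attempts and the solve criterion could be represented (names like `ToolAction` are illustrative, not the game's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolAction:          # hypothetical representation, not the game's API
    tool: str              # which of the three tools to place
    x: float               # horizontal placement coordinate
    y: float               # vertical placement coordinate

def is_solved(attempts, budget=10):
    """A level counts as solved if any attempt within the budget succeeds."""
    return any(success for _, success in attempts[:budget])

# Two failed placements, then a success on the third attempt
history = [(ToolAction("tool1", 120.0, 300.0), False),
           (ToolAction("tool2", 140.0, 310.0), False),
           (ToolAction("tool3", 95.0, 420.0), True)]
print(is_solved(history))  # True
```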

Examples of human play

In these levels, humans display:

  • "Aha" moments of insight

  • Structured, object-oriented initial attempts

  • Few-shot trial-and-error learning

Link to human data:

Person solving "Bridge"

Person solving "Table A"

Person solving "Launch B"

A model of human physical problem-solving

What is required to capture this kind of problem-solving behavior?

We propose the Sample, Simulate, Update (SSUP) model

  • Sample: An object-oriented initial hypothesis space of promising actions

  • Simulate: An approximate internal physics simulator, or world model, allowing an agent to imagine the effects of their actions

  • Update: A guiding mechanism that allows an agent to learn from both their imagined and real actions
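In outline, the three components interact in a sample-simulate-update loop. The sketch below is a deliberately simplified 1-D illustration: `score_fn` stands in for the internal simulator's success estimate, and a Gaussian over a scalar action stands in for the object-based prior; the actual model uses a noisy 2-D physics engine and a richer policy, neither of which is implemented here.

```python
import random

def ssup_solve(score_fn, sigma=0.3, n_sim=5, max_attempts=10, threshold=0.9):
    """Toy 1-D illustration of the Sample, Simulate, Update loop.

    `score_fn(a)` in [0, 1] plays the role of the internal simulator's
    estimate that action `a` succeeds.
    """
    mean = random.uniform(0.0, 1.0)          # crude stand-in for the prior
    for _ in range(max_attempts):
        # Sample: candidate actions around the current best guess
        candidates = [random.gauss(mean, sigma) for _ in range(n_sim)]
        # Simulate: score each imagined action with the "world model"
        scored = sorted(((score_fn(a), a) for a in candidates), reverse=True)
        best_score, best_action = scored[0]
        if best_score >= threshold:
            return best_action               # act once imagination looks good
        # Update: shift the policy toward the best imagined action
        mean = 0.5 * mean + 0.5 * best_action
        sigma *= 0.9                         # narrow the search over time
    return None

# Toy example: actions near 0.5 succeed
best = ssup_solve(lambda a: max(0.0, 1.0 - abs(a - 0.5)))
```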

Link to model data (and ablations):

Examples of SSUP model runs



"Table A"


How well does the model perform? Below we plot the cumulative solution rate of the model compared to people, as well as the first/last actions of both humans and the full SSUP model. Note that there are interesting failure cases! Falling (A) is one example we highlight, where the model performs much better than human participants.

Cumulative Solution Rate (left): Percentage of participants/model runs that solved the level (y-axis) in a given number of attempts (x-axis)
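Concretely, the cumulative solution rate at attempt k is the fraction of participants (or model runs) whose first success came on or before attempt k; a minimal computation with hypothetical data:

```python
def cumulative_solution_rate(first_successes, max_attempts=10):
    """first_successes: attempt number (1-indexed) of each participant's
    first success, or None if they never solved the level.
    Returns the fraction solved by each attempt 1..max_attempts."""
    n = len(first_successes)
    return [sum(1 for a in first_successes if a is not None and a <= k) / n
            for k in range(1, max_attempts + 1)]

# Hypothetical data: five participants; one never solves the level
rates = cumulative_solution_rate([1, 2, 2, 5, None], max_attempts=5)
print(rates)  # [0.2, 0.6, 0.6, 0.6, 0.8]
```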

Visualizing first and last placements (right): Model attempts shown as background density estimate for each tool (color). Participants shown as individual circles (color representing tool used).

Comparisons to Ablations and Alternatives

Is the full model really necessary? What if we ablate different parts of it, or even replace the entire model with an image-based deep model-free reinforcement learning agent?

  • Here we report accuracy: the percentage of participants/model runs who solved a particular level in fewer than 10 attempts

  • Ablating the prior, simulator, or update mechanism significantly reduces accuracy relative to human participants

  • Replacing SSUP with a DQN trained on many background levels (DQN+Updating), or a method that simply learns "better physics" (Parameter tuning), also performs poorly with respect to humans
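The accuracy measure above can be computed directly from first-success attempt numbers (hypothetical data; a run that never solves the level counts as a failure):

```python
def accuracy(first_successes, budget=10):
    """Fraction of participants/model runs whose first success came in
    fewer than `budget` attempts (None = never solved the level)."""
    solved = sum(1 for a in first_successes if a is not None and a < budget)
    return solved / len(first_successes)

# Hypothetical data: attempt number of first success per participant
print(accuracy([1, 3, 9, 12, None]))  # 0.6
```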

Limitations and levels beyond SSUP

Although the SSUP model we used solves many of the levels of the Virtual Tools game in a human-like way, we believe that this is still only a first approximation to the rich set of cognitive processes that people bring to creative physical problem solving.

In particular, there are at least two ways in which SSUP is insufficient: its reliance on very simple priors, and its planning and generalizing by sampling only in the forward direction. Future work should also investigate to what extent physical dynamics can be learned (here they were assumed given, albeit imprecise) for complicated, contact-rich trajectories and dramatically varying scene geometries.


Catapult Alt


These three levels demonstrate the limitations of SSUP, and point to future directions for modeling.

  • In Chaining, a decoy ball is easy to move close to the goal, but impossible to get into it. Instead, you must fine-tune a placement on the left-hand side of the level to initiate a "chain" of events that sends the third ball into the goal. SSUP gets stuck attempting to hit the right-hand ball.

  • In Catapult Alt, unlike in Catapult (original experiment), SSUP similarly gets stuck trying to launch the red ball directly toward the goal, rather than forming a sub-goal of catapulting the right-most red ball into the left-most red ball so that it rolls into the container.

  • In Filler, the space of successful actions is incredibly small, so SSUP is highly unlikely to sample the precise placement. To humans, however, it is immediately obvious that the large block should fill the hole, since the two are almost exactly the same size.

Cite this work

This work has been accepted at the Proceedings of the National Academy of Sciences (PNAS). The citation below will be updated when the paper is published. Please consider citing this work if you discuss how humans solve these kinds of physical problem-solving tasks, or if you use any of our code.


@article{allen2019rapid,
    title = {Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning},
    author = {Kelsey R. Allen and Kevin A. Smith and Joshua B. Tenenbaum},
    year = {2019},
}


Related Work

PHYRE -- a similar, concurrently developed benchmark examining deep model-free reinforcement learning agents from Facebook AI Research. PHYRE tasks are procedurally generated for people who want to try deep RL approaches that require many task variants for learning.

Our tasks are designed for testing and studying how humans and artificial agents execute rapid trial-and-error learning within a novel task, as humans do when using new tools in the real world. We expect that artificial agents (like humans) will have learned their intuitive physics outside of this environment and can then deploy it on a novel task.

Make your own levels

You can use our user-friendly Python API or our online interface.

Convert your levels to PHYRE

To automatically convert Virtual Tools tasks to PHYRE format, check out the code here.

To access our contributed Virtual Tools tier within PHYRE, you can use this code, replacing the tier "ball" with "virtual_tools". If you use these tiers, please consider citing our paper.