Results

Performance Analysis

We evaluated the performance of GPT-4V, GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku across several Atari games. Models were assessed on game scores, visual understanding, spatial reasoning, and strategic capabilities.

Best model: GPT-4o, with a human-normalized score of 23.2%.
There is a significant gap in visual and spatial reasoning.
Model inference is not yet fast enough for real-time gameplay.
Models outperformed random agents but lagged behind humans and RL agents.
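
A human-normalized score expresses an agent's game score relative to a human reference. The sketch below uses a convention common in the Atari literature, where a random agent maps to 0% and the human reference maps to 100%. Whether the scores reported here subtract a random baseline in exactly this way is an assumption; the function name, its parameters, and the example numbers are hypothetical and are not taken from the paper.

```python
def human_normalized_score(agent_score: float,
                           human_score: float,
                           random_score: float = 0.0) -> float:
    """Return the agent's score as a percentage of the human-vs-random gap.

    Assumed convention: 0% means random-level play, 100% means human-level play.
    With random_score left at 0.0 this reduces to agent_score / human_score.
    """
    gap = human_score - random_score
    if gap == 0:
        raise ValueError("human_score and random_score must differ")
    return 100.0 * (agent_score - random_score) / gap


# Hypothetical scores for illustration only; these are not results from the paper.
print(human_normalized_score(agent_score=250.0, human_score=1000.0, random_score=50.0))
```

Averaging this quantity over games gives a single summary percentage, while the per-game values show where a model is closest to or furthest from human play.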

Human-Normalized Scores

Human-Normalized Scores per Environment

Visual and Spatial Reasoning

Key Insights

Visual reasoning is moderately successful; spatial reasoning remains a bottleneck.
Inference time (2-7 seconds) is a major hurdle for real-time applications.
Models demonstrated basic understanding of game mechanics, signaling potential for improvement.
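
To put the 2-7 second inference time in perspective, the rough calculation below counts how many emulator frames would pass while a model produces a single action if the game kept running. It assumes a frame rate of about 60 frames per second (the nominal rate of NTSC Atari 2600 games) and ignores frame skipping; the helper name is hypothetical, and this is not a description of how the evaluation itself was timed.

```python
ATARI_FPS = 60  # assumed nominal frame rate; not a value stated in the source


def frames_elapsed(latency_s: float, fps: int = ATARI_FPS) -> int:
    """Number of game frames that pass during one inference call."""
    return round(latency_s * fps)


# Lower and upper ends of the reported 2-7 second inference time.
for latency in (2.0, 7.0):
    print(f"{latency:.0f} s latency -> {frames_elapsed(latency)} frames")
```

Under these assumptions a single decision spans between 120 and 420 frames, which is why the game would need to be paused or slowed for the model to keep up.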

How to Cite

If you find our work useful, please use the following citation:

        @misc{waytowich2024atarigptinvestigatingcapabilitiesmultimodal,
            title={Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games}, 
            author={Nicholas R. Waytowich and Devin White and MD Sunbeam and Vinicius G. Goecks},
            year={2024},
            eprint={2408.15950},
            archivePrefix={arXiv},
            primaryClass={cs.AI},
            url={https://arxiv.org/abs/2408.15950}
        }

The paper is also available on arXiv at https://arxiv.org/abs/2408.15950.