Performance Analysis
We evaluated the performance of GPT-4V, GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku across several Atari games. Models were assessed on game scores, visual understanding, spatial reasoning, and strategic capabilities.
- Best model: GPT-4o, achieving a human-normalized score of 23.2%.
- A significant gap remains in visual and, especially, spatial reasoning.
- Model inference is not yet fast enough for real-time gameplay.
- Models outperformed random agents but lagged behind humans and RL agents.
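The human-normalized score reported above follows the standard convention of scaling an agent's game score so that 0% corresponds to a random agent and 100% to a human player. A minimal sketch of that normalization (the example scores are illustrative, not values from the paper):

```python
def human_normalized(agent_score: float, random_score: float, human_score: float) -> float:
    """Scale a raw game score: 0.0 matches a random agent, 1.0 matches a human."""
    return (agent_score - random_score) / (human_score - random_score)

# Illustrative values only (not from the paper):
print(f"{human_normalized(500.0, 100.0, 2000.0):.1%}")  # prints 21.1%
```

Scores can exceed 100% when an agent beats the human baseline, and fall below 0% when it underperforms a random agent.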
Figures: Human-normalized scores, aggregate and per environment.
Visual and Spatial Reasoning
Key Insights
- Visual reasoning is moderately successful; spatial reasoning remains a bottleneck.
- Inference time (2-7 seconds per action) is a major hurdle for real-time applications.
- Models demonstrated a basic understanding of game mechanics, suggesting room for improvement.
How to Cite
If you find our work useful, please use the following citation:
@misc{waytowich2024atarigptinvestigatingcapabilitiesmultimodal,
      title={Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games},
      author={Nicholas R. Waytowich and Devin White and MD Sunbeam and Vinicius G. Goecks},
      year={2024},
      eprint={2408.15950},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.15950}
}
You can also find our paper on arXiv.