Results

Performance Analysis

We evaluated GPT-4V, GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku on several Atari games, assessing each model on game score, visual understanding, spatial reasoning, and strategic capability.

Human Normalized Scores

[Figure: human-normalized scores]
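For readers unfamiliar with the metric, the sketch below shows one common way a human-normalized score is computed in the Atari literature: the agent's score is rescaled so that a random policy maps to 0 and a human reference maps to 1. This page does not state the exact formula or baselines used, so the function name, its parameters, and the numbers in the example are all assumptions for illustration, not the authors' method or results.

```python
def human_normalized_score(agent_score: float,
                           random_score: float,
                           human_score: float) -> float:
    """Rescale a raw game score so random play -> 0.0 and the human reference -> 1.0.

    Assumed definition (common in Atari benchmarks, not confirmed by this page):
        (agent - random) / (human - random)
    """
    gap = human_score - random_score
    if gap == 0:
        # The normalization is undefined when the two baselines coincide.
        raise ValueError("human_score and random_score must differ")
    return (agent_score - random_score) / gap


# Hypothetical numbers for illustration only; not results from the paper.
example = human_normalized_score(agent_score=150.0,
                                 random_score=50.0,
                                 human_score=250.0)
print(f"{example:.2f}")  # prints 0.50
```

Under this definition, a value below 0 means the model scored worse than random play, and a value above 1 means it exceeded the human reference.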

Human Normalized Scores per Environment

[Figure: human-normalized scores per environment]

Visual and Spatial Reasoning

[Figure: visual and spatial reasoning performance]

Key Insights

How to Cite

If you find our work useful, please use the following citation:

        @misc{waytowich2024atarigptinvestigatingcapabilitiesmultimodal,
            title={Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games}, 
            author={Nicholas R. Waytowich and Devin White and MD Sunbeam and Vinicius G. Goecks},
            year={2024},
            eprint={2408.15950},
            archivePrefix={arXiv},
            primaryClass={cs.AI},
            url={https://arxiv.org/abs/2408.15950}
        }

The paper is also available on arXiv: https://arxiv.org/abs/2408.15950