AutoGLM: Autonomous Foundation Agents for GUIs

Xiao Liu12, Bo Qin1†, Dongzhu Liang1†, Guang Dong1†, Hanyu Lai12*†, Hanchen Zhang12*†,
Hanlin Zhao1†, Iat Long Iong12*†, Jiadai Sun1†, Jiaqi Wang1†, Junjie Gao1†, Junjun Shan1†,
Kangning Liu1†, Shudan Zhang12*†, Shuntian Yao1*†, Siyi Cheng1*†, Wentao Yao12*†,
Wenyi Zhao1†, Xinghan Liu12*†, Xinyi Liu1†, Xinying Chen1†, Xinyue Yang1†, Yang Yang1†,
Yifan Xu12*†, Yu Yang1†, Yujia Wang1†, Yulin Xu1†, Zehan Qi12*†, Yuxiao Dong2, Jie Tang2
Zhipu AI1 Tsinghua University2
* Work done while these authors interned at Zhipu AI.
† These authors are listed alphabetically by first name.
AutoGLM is a new series developed from the ChatGLM family that targets agents capable of autonomously completing tasks through Graphical User Interfaces (GUIs), such as those of phones and the web. Its web-use ability will progressively become available to the public via the Qingyan Plugin, and its phone-use ability on Android is currently in invitation-only internal testing (Application Form for CHN Mainland or for outside CHN Mainland).

(a) AutoGLM demonstration on Phone (integrated version).


(b) AutoGLM demonstration on Web (integrated version).

Abstract

We present AutoGLM, a new series in the ChatGLM family (GLM et al., 2024), designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs). While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interactions by reinforcing existing models. Focusing on the web browser and Android as representative GUI scenarios, we have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights. First, the design of an appropriate "intermediate interface" for GUI control is crucial, enabling the separation of planning and grounding behaviors, which require distinct optimization for flexibility and accuracy respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning with AutoGLM. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains. For web browsing, AutoGLM achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% with a second attempt) and 96.2% on OpenTable evaluation tasks. In Android device control, AutoGLM attains a 36.2% success rate on AndroidLab (VAB-Mobile) and 89.7% on common tasks in popular Chinese apps.
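To make the first insight more concrete, the following is a minimal sketch of what separating planning from grounding through an intermediate interface could look like. It is not AutoGLM's actual interface: the names `PlannedAction`, `GroundedAction`, `toy_planner`, `toy_grounder`, and `step`, as well as the dictionary-based screen representation, are all hypothetical and chosen only for illustration. The idea shown is that the planner decides what to do and describes the target element in words, while a separate grounder resolves that description to screen coordinates, so each part could be optimized on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# All names below are hypothetical; this is an illustration, not AutoGLM's API.


@dataclass(frozen=True)
class PlannedAction:
    """Planner output: the action plus a textual description of its target."""
    action: str
    element_description: str


@dataclass(frozen=True)
class GroundedAction:
    """Grounder output: the same action resolved to screen coordinates."""
    action: str
    x: int
    y: int


Screen = Dict[str, Tuple[int, int]]  # toy screen: element description -> (x, y)
Planner = Callable[[str, str], PlannedAction]
Grounder = Callable[[str, Screen], Tuple[int, int]]


def toy_planner(task: str, observation: str) -> PlannedAction:
    # Stand-in for a planning model: it never sees pixel coordinates.
    return PlannedAction(action="click", element_description="submit button")


def toy_grounder(description: str, screen: Screen) -> Tuple[int, int]:
    # Stand-in for a grounding model: it only locates the described element.
    return screen[description]


def step(task: str, observation: str, screen: Screen,
         planner: Planner, grounder: Grounder) -> GroundedAction:
    """Run one agent step: plan in words first, then ground to coordinates."""
    planned = planner(task, observation)
    x, y = grounder(planned.element_description, screen)
    return GroundedAction(planned.action, x, y)
```

Because the planner and the grounder communicate only through the textual element description in this sketch, either one could be swapped or retrained without touching the other.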

Phone Use (Real Speed Recording)


(a) [Gmail] Write an email to harry66@gmail.com asking about the project progress, with the subject "hi", scheduled to send on Oct. 30 at 8:00 AM

(b) [Google Maps] Find the nearest top-rated coffee shop and direct me there on foot

(c) [Temu] Add two pairs of top-selling women's running shoes in size 7.5 to my cart

(d) [X] Help me find AK's homepage URL

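(f) On Meituan, order a standard Americano from Luckin Coffee, half sugar

(g) On Dianping, write a five-star review for the Quanjude Tsinghua Science Park branch

(h) On WeChat, like my boss's most recent Moments post and comment "Deeply inspiring"

(i) On Ctrip, book the best-rated hotel near Shanghai Disneyland for Nov. 5 to Nov. 10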

Web Browser Use (Real Speed Recording)


(a) Secure a table on OpenTable for 2 people at Saffron Fine Indian Cuisine on Nov. 6, 2024 at 7:30 PM.

(b) Check my issues and create an issue called "excellent engineer wanted" for project Zhipu AI on GitLab.

(c) Show me the "chairs" listings by ascending price on OneStopShop.

(d) Reserve a table for my parents and me at Megan's Kitchen on Oct. 23, 2024 at 7:30 PM

(e) Set all reviews with keyword "sweet" to approved on Client Management System.

(f) Get the driving durations, first from MIT to Harvard, and then from Harvard to Boston Airport

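(g) On Xiaohongshu, find the most popular illustrated travel guide to Rome and summarize which must-see attractions it mentions

(h) Summarize DeepSpeed's strategies for saving GPU memory, referring to the most-upvoted article

(i) Search for the latest academic journal publications on knowledge graphs, restricted to journals on the PKU Core list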