Aligned LLMs Are Not Aligned Browser Agents
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine T Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean M. Hendryx, Summer Yue, Zifan Wang (2025)
Article link: https://openreview.net/forum?id=NsFZZU9gvk
Abstract: Despite significant efforts by large language model (LLM) developers to align model outputs towards safety and helpfulness, an open question remains: does this safety alignment, typically enforced in chat settings, generalize to non-chat and agentic use cases? Unlike chatbots, agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to ensure the safety of LLM agents. In this work, we primarily focus on red-teaming browser agents, i.e., LLMs that interact with and extract information from web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite consisting of 100 diverse browser-related harmful behaviors and 40 synthetic websites, designed specifically for red-teaming browser agents. Our empirical study on state-of-the-art browser agents reveals a significant alignment gap between the base LLMs and their downstream browser agents: while the LLM demonstrates alignment as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak aligned LLMs in chat settings transfer effectively to browser agents; with simple human rewrites, GPT-4o and GPT-4-turbo-based browser agents attempted all 100 harmful behaviors. We plan to publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on enhancing agent safety.
Presenter: Erik Derner & Kristina Batistič
Date: 2025-03-04 15:00 (CET)
Online: https://bit.ly/ellis-hcml-rg