Autonomous agents that can perform tasks based on human instructions have the potential to enhance efficiency and accessibility. To fully utilize these agents, it is important to understand their behavior in realistic settings.
Many existing environments oversimplify complex problems, so the tasks they offer lack diversity. In addition, some environments limit agents to static resources, which prevents them from exploring states beyond the data collected in advance.
Introducing WebArena: A Simulated Web Environment
Researchers at Carnegie Mellon University and Inspired Cognition have developed WebArena, a simulated web environment for building and testing autonomous agents. It consists of four live web apps covering e-commerce, online forums, software development, and content management. WebArena also provides tools such as a map, a calculator, and a scratchpad to support the agents' task execution. The environment is backed by reference materials, including guides and content from offline sources like Wikipedia. Because it is packaged in Docker containers and exposed through gym-style APIs, WebArena is easy to use and replicate.
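To make the idea of a gym-style API concrete, here is a minimal Python sketch of the reset/step loop such an interface typically exposes. Everything in it is hypothetical: `ToyWebEnv`, its pages, and `greedy_policy` are toy stand-ins invented for illustration and are not WebArena's actual classes or method signatures.

```python
class ToyWebEnv:
    """Hypothetical toy environment with a gym-style reset/step interface.

    This is NOT WebArena's API; it only illustrates the interaction pattern.
    """

    # A tiny made-up site map: page name -> pages reachable from it.
    LINKS = {
        "home": ["cart", "forum"],
        "cart": ["checkout", "home"],
        "forum": ["home"],
        "checkout": [],
    }

    def __init__(self, goal_page="checkout", max_steps=5):
        self.goal_page = goal_page
        self.max_steps = max_steps
        self.page = "home"
        self.steps = 0

    def _observe(self):
        # The observation is the current page plus the links visible on it.
        return {"page": self.page, "links": list(self.LINKS[self.page])}

    def reset(self):
        self.page = "home"
        self.steps = 0
        return self._observe()

    def step(self, action):
        # An action is the name of a link to click; invalid clicks are ignored.
        self.steps += 1
        if action in self.LINKS[self.page]:
            self.page = action
        reached_goal = self.page == self.goal_page
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        return self._observe(), reward, done


def greedy_policy(obs):
    """Hypothetical agent: click the most goal-like link that is visible."""
    for preferred in ("checkout", "cart"):
        if preferred in obs["links"]:
            return preferred
    return obs["links"][0]


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps taken)."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
    return total_reward, env.steps


if __name__ == "__main__":
    print(run_episode(ToyWebEnv(), greedy_policy))
```

A real web environment would return far richer observations, such as page content, and accept actions like clicking and typing, but the loop of observing, acting, and receiving a result has the same shape.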
A Benchmark for Web-Based Tasks
Alongside WebArena, the researchers have created a benchmark of 812 web-based tasks. The tasks are described in natural language and designed to mimic the way humans phrase requests. The researchers evaluate several agents on these tasks. The agents are built on large language models such as GPT-3.5 and GPT-4, and they range from simple designs that predict the next step directly from observations to designs that use more complex reasoning. The findings show that even the best GPT-4 agent achieves a success rate of only 10.59 percent. The researchers attribute this to key capabilities that current models lack, such as active exploration and failure recovery, which hinders their performance on complex tasks.
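As a quick sanity check on the scale of that result, the short Python sketch below converts the reported rate into an approximate task count. It assumes the 10.59 percent figure is computed over all 812 tasks; the function names are illustrative and the task count it yields is derived here, not stated in the source.

```python
TOTAL_TASKS = 812              # benchmark size reported in the article
REPORTED_SUCCESS_RATE = 10.59  # percent, best GPT-4 agent


def implied_successes(total_tasks, rate_percent):
    """Number of solved tasks implied by a success rate (assumes the full task set)."""
    return round(total_tasks * rate_percent / 100)


def success_rate(successes, total_tasks):
    """Success rate in percent, rounded to two decimals."""
    return round(100 * successes / total_tasks, 2)


if __name__ == "__main__":
    solved = implied_successes(TOTAL_TASKS, REPORTED_SUCCESS_RATE)
    print(solved, success_rate(solved, TOTAL_TASKS))
```

Under that assumption the rate corresponds to roughly 86 solved tasks, which leaves more than 700 of the 812 tasks unsolved.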
Overall, WebArena and its benchmark offer a realistic way to measure autonomous agents and highlight where current AI models still need to improve.