Levy, Ido
Towards Enterprise-Ready Computer Using Generalist Agent
Marreed, Sami, Oved, Alon, Yaeli, Avi, Shlomov, Segev, Levy, Ido, Sela, Aviad, Adi, Asaf, Mashkif, Nir
This paper presents our ongoing work toward developing an enterprise-ready Computer Using Generalist Agent (CUGA) system. Our research highlights the evolutionary nature of building agentic systems suitable for enterprise environments. By integrating state-of-the-art agentic AI techniques with a systematic approach to iterative evaluation, analysis, and refinement, we have achieved rapid and cost-effective performance gains, notably reaching state-of-the-art performance on the WebArena benchmark. We detail our development roadmap, the methodology and tools that facilitated rapid learning from failures and continuous system refinement, and discuss key lessons learned and future challenges for enterprise adoption.
Unsupervised Translation of Emergent Communication
Levy, Ido, Paradise, Orr, Carmeli, Boaz, Meir, Ron, Goldwasser, Shafi, Belinkov, Yonatan
Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and to evaluate its relationship to natural language (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT's potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability yields pragmatic EC, which, although challenging to interpret, remains suitable for translation. To our knowledge, this research marks the first attempt to translate EC without the aid of parallel data.
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Levy, Ido, Wiesel, Ben, Marreed, Sami, Oved, Alon, Yaeli, Avi, Shlomov, Segev
Recent advancements in Web agents have introduced novel architectures and benchmarks showcasing progress in autonomous web navigation and interaction. However, most existing benchmarks prioritize effectiveness and accuracy, overlooking factors such as safety and trustworthiness, which are essential for deploying web agents in enterprise settings. We present ST-WebAgentBench, a benchmark designed to evaluate web agents' safety and trustworthiness across six critical dimensions essential for reliability in enterprise applications. This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior. Our work extends WebArena with safety templates and evaluation functions to rigorously assess safety policy compliance. We introduce the Completion Under Policy metric to measure task success while adhering to policies, alongside the Risk Ratio, which quantifies policy violations across dimensions, providing actionable insights for addressing safety gaps. Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications. We open-source this benchmark and invite the community to contribute, with the goal of fostering a new generation of safer, more trustworthy AI agents. All code, data, environment reproduction resources, and video demonstrations are available at https://sites.google.com/view/st-webagentbench/home.
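The two metrics described above can be sketched in a few lines. The following is an illustrative, hypothetical implementation only: the episode record shape, field names, and helper functions are assumptions for exposition, not the benchmark's released code. The intuition is that Completion Under Policy counts a task as successful only if it was completed with zero policy violations, while the Risk Ratio reports, per policy dimension, the fraction of episodes that violated it.

```python
# Hypothetical sketch of CuP-style and Risk-Ratio-style metrics.
# Episode records and field names are illustrative assumptions,
# not the benchmark's actual data format.

def completion_under_policy(episodes):
    """Fraction of episodes completed with zero policy violations."""
    ok = sum(1 for e in episodes if e["completed"] and not e["violations"])
    return ok / len(episodes)

def risk_ratio(episodes):
    """Per-dimension fraction of episodes with at least one violation."""
    dims = {d for e in episodes for d in e["violations"]}
    return {
        d: sum(1 for e in episodes if d in e["violations"]) / len(episodes)
        for d in dims
    }

episodes = [
    {"completed": True,  "violations": []},
    {"completed": True,  "violations": ["user_consent"]},
    {"completed": False, "violations": []},
    {"completed": True,  "violations": []},
]
print(completion_under_policy(episodes))  # 0.5
print(risk_ratio(episodes))               # {'user_consent': 0.25}
```

Note how the second episode completes its task yet still counts as a failure under the policy-aware metric, which is exactly the gap between raw accuracy and trustworthy behavior that the benchmark aims to surface.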
The infrastructure powering IBM's Gen AI model development
Gershon, Talia, Seelam, Seetharami, Belgodere, Brian, Bonilla, Milton, Hoang, Lan, Barnett, Danny, Chung, I-Hsin, Mohan, Apoorve, Chen, Ming-Hung, Luo, Lixiang, Walkup, Robert, Evangelinos, Constantinos, Salaria, Shweta, Dombrowa, Marc, Park, Yoonho, Kayi, Apo, Schour, Liran, Alim, Alim, Sydney, Ali, Maniotis, Pavlos, Schares, Laurent, Metzler, Bernard, Karacali-Akyamac, Bengi, Wen, Sophia, Chiba, Tatsuhiro, Choochotkaew, Sunyanan, Yoshimura, Takeshi, Misale, Claudia, Elengikal, Tonia, Connor, Kevin O, Liu, Zhuoran, Molina, Richard, Schneidenbach, Lars, Caden, James, Laibinis, Christopher, Fonseca, Carlos, Tarasov, Vasily, Sundararaman, Swaminathan, Schmuck, Frank, Guthridge, Scott, Cohn, Jeremy, Eshel, Marc, Muench, Paul, Liu, Runyu, Pointer, William, Wyskida, Drew, Krull, Bob, Rose, Ray, Wolfe, Brent, Cornejo, William, Walter, John, Malone, Colm, Perucci, Clifford, Franco, Frank, Hinds, Nigel, Calio, Bob, Druyan, Pavel, Kilduff, Robert, Kienle, John, McStay, Connor, Figueroa, Andrew, Connolly, Matthew, Fost, Edie, Roma, Gina, Fonseca, Jake, Levy, Ido, Payne, Michele, Schenkel, Ryan, Malki, Amir, Schneider, Lion, Narkhede, Aniruddha, Moshref, Shekeba, Kisin, Alexandra, Dodin, Olga, Rippon, Bill, Wrieth, Henry, Ganci, John, Colino, Johnny, Habeger-Rose, Donna, Pandey, Rakesh, Gidh, Aditya, Gaur, Aditya, Patterson, Dennis, Salmani, Samsuddin, Varma, Rambilas, Rumana, Rumana, Sharma, Shubham, Gaur, Aditya, Mishra, Mayank, Panda, Rameswar, Prasad, Aditya, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Cox, David, Puri, Ruchir, Agrawal, Dakshi, Thorstensen, Drew, Belog, Joel, Tang, Brent, Gupta, Saurabh Kumar, Biswas, Amitabha, Maheshwari, Anup, Gampel, Eran, Van Patten, Jason, Runion, Matthew, Kaki, Sai, Bogin, Yigal, Reitz, Brian, Pritko, Steve, Najam, Shahan, Nambala, Surya, Chirra, Radhika, Welp, Rick, DiMitri, Frank, Telles, Felipe, Arvelo, Amilcar, Chu, King, Seminaro, Ed, Schram, Andrew, Eickhoff, Felix, Hanson, William, Mckeever, Eric, Joseph, Dinakaran, 
Chaudhary, Piyush, Shivam, Piyush, Chaudhary, Puneet, Jones, Wesley, Guthrie, Robert, Bostic, Chris, Islam, Rezaul, Duersch, Steve, Sawdon, Wayne, Lewars, John, Klos, Matthew, Spriggs, Michael, McMillan, Bill, Gao, George, Kamra, Ashish, Singh, Gaurav, Curry, Marc, Katarki, Tushar, Talerico, Joe, Shi, Zenghui, Malleni, Sai Sindhur, Gallen, Erwan
AI infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundation models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software, and holistic telemetry to cater to multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant, and geographically distributed infrastructure for large-scale model training and other AI workflow steps, and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use and the flexibility to adapt to an evolving commercial landscape. Blue Vela provides the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they give IBM the ability to rapidly innovate in the development of both AI models and commercial offerings.