AITopics

arXiv.org Artificial Intelligence, Mar-10-2025

BEARCUBS: A benchmark for computer-using web agents

Song, Yixiao, Thai, Katherine, Pham, Chau Minh, Chang, Yapei, Nadaf, Mazin, Iyyer, Mohit

Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

large language model, machine learning, natural language

2503.07919

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

arXiv.org Artificial Intelligence, Mar-4-2025

Unlocking a New Rust Programming Experience: Fast and Slow Thinking with LLMs to Conquer Undefined Behaviors

Jiang, Renshuang, Dong, Pan, Duan, Zhenling, Shi, Yu, Fang, Xiaoxiang, Ding, Yan, Ma, Jun, Zhao, Shuai, Jiang, Zhe

To provide flexibility and low-level interaction capabilities, the unsafe keyword in Rust is essential in many projects, but it undermines memory safety and introduces Undefined Behaviors (UBs). Eliminating these UBs requires a deep understanding of Rust's safety rules and strong typing. Traditional methods require in-depth analysis of code, which is laborious and depends on knowledge design. The powerful semantic understanding capabilities of LLMs offer new opportunities to solve this problem. Although existing large-model debugging frameworks excel in semantic tasks, they are limited by fixed processes and lack adaptive and dynamic adjustment capabilities. Inspired by the dual process theory of decision-making (Fast and Slow Thinking), we present an LLM-based framework called RustBrain that automatically and flexibly minimizes UBs in Rust projects. Fast thinking extracts features to generate solutions, while slow thinking decomposes, verifies, and generalizes them abstractly. RustBrain integrates the two modes of thinking through a feedback mechanism that applies verification and generalization results to solution generation, enabling dynamic adjustments and precise outputs. Experimental results on the Miri dataset show a 94.3% pass rate and an 80.4% execution rate, improving flexibility and the safety of Rust projects.

rust, rustbrain, ubs

2503.02335

Country:

Asia > China > Hunan Province > Changsha (0.05)
Europe > Switzerland (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

#artificialintelligence, Feb-2-2023, 22:40:38 GMT

ChatGPT May Be the Fastest Growing App in History

It's no secret that ChatGPT, the large language model-powered artificial intelligence from OpenAI, has taken the internet by storm. Everyone is talking about it, everywhere online--Gizmodo included. The AI chatbot can almost instantly generate paragraphs of human-like, fluid text in answer to basically any prompt you can come up with (just don't rely on it to do your math homework correctly, or provide an accurate substitute for researched writing). And the scope of ChatGPT's ascent is probably even more astounding than you think. The chatbot has become the fastest growing consumer-facing application in history, according to a new analysis from Swiss investment bank, UBS, as reported by multiple financial outlets.

large language model, machine learning, natural language

Industry:

Banking & Finance > Trading (0.39)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.31)

#artificialintelligence, Jan-3-2022, 16:47:10 GMT

AI is the cornerstone of our data intelligence & automation strategy: Jayashree Mitra, UBS

Artificial intelligence can enhance efficiency and productivity in financial services and is hence emerging as an important tool in the industry. It can reduce human errors and biases, along with improving the quality by spotting anomalies that cannot be picked up from current reporting methods. We caught up with Jayashree Mitra, the head of technology (end-user services), Asia Pacific, to understand more about AI and automation in this industry. She has 23 years of cross-industry experience, of which she has spent 20 years working in Financial Services Technology for Standard Chartered Bank and UBS. Jayashree Mitra: Throughout my childhood, I was always taught the virtues of self-reliance.

cornerstone, data intelligence & automation strategy, jayashree mitra

Country: Asia > India (0.08)

Genre: Personal > Interview (0.56)

Industry: Banking & Finance > Financial Services (0.57)

Technology: Information Technology > Artificial Intelligence (1.00)

Yang, Jiong, Chakraborty, Supratik, Meel, Kuldeep S.

Projected Model Counting: Beyond Independent Support

arXiv.org Artificial Intelligence, Oct-18-2021

The past decade has witnessed a surge of interest in practical techniques for projected model counting. Despite significant advancements, however, performance scaling remains the Achilles' heel of this field. A key idea used in modern counters is to count models projected on an independent support that is often a small subset of the projection set, i.e., the original set of variables on which we wanted to project. While this idea has been effective in scaling performance, the question of whether it can be beneficial to count models projected on variables beyond the projection set has not been explored. In this paper, we study this question and show that, contrary to intuition, it can be beneficial to project on variables beyond the projection set. In applications such as verification of binarized neural networks, quantification of information flow, reliability of power grids, etc., a good upper bound of the projected model count often suffices. We show that in several such cases, we can identify a set of variables, called upper bound support (UBS), that is not necessarily a subset of the projection set, and yet counting models projected on UBS guarantees an upper bound of the true projected model count. Theoretically, a UBS can be exponentially smaller than the smallest independent support. Our experiments show that even otherwise, UBS-based projected counting can be more efficient than independent support-based projected counting, while yielding bounds of very high quality. Based on extensive experiments, we find that UBS-based projected counting can solve many problem instances that are beyond the reach of a state-of-the-art independent support-based projected model counter.
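The upper-bound property described in this abstract can be illustrated with a small brute-force check. This is only a sketch under stated assumptions, not the paper's algorithm: the formula, the variable names, and the helper `projected_count` are invented for illustration, and enumeration over all assignments is feasible only for toy inputs. The example counts models projected on a projection set {a, b} and on a disjoint set {x, y}, and shows that the second count bounds the first from above.

```python
from itertools import product


def projected_count(formula, variables, projection):
    """Count distinct assignments to `projection` that extend to a model.

    Brute force over all 2^n assignments; an illustrative helper only.
    """
    seen = set()
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            seen.add(tuple(assignment[v] for v in projection))
    return len(seen)


# Hypothetical toy instance: a and b are defined in terms of x and y.
VARIABLES = ["x", "y", "a", "b"]


def toy_formula(s):
    # a <-> (x or y), b <-> (x and y)
    return s["a"] == (s["x"] or s["y"]) and s["b"] == (s["x"] and s["y"])


# Count projected on the (assumed) projection set {a, b}.
true_count = projected_count(toy_formula, VARIABLES, ["a", "b"])

# Count projected on {x, y}, which lies entirely outside the projection set.
bound = projected_count(toy_formula, VARIABLES, ["x", "y"])

print(true_count, bound)  # 3 4
```

In this toy instance, every value of (a, b) is determined by (x, y), so two models that agree on {x, y} also agree on {a, b}; that is why the count on {x, y} cannot be smaller than the count on {a, b}.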

approxmc4, projection, ubs

2110.09171

Country: Asia > Singapore (0.04)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)

arXiv.org Artificial Intelligence, Mar-3-2021

NOMU: Neural Optimization-based Model Uncertainty

Heiss, Jakob, Weissteiner, Jakob, Wutte, Hanna, Seuken, Sven, Teichmann, Josef

We introduce a new approach for capturing model uncertainty for neural networks (NNs) in regression, which we call Neural Optimization-based Model Uncertainty (NOMU). The main idea of NOMU is to design a network architecture consisting of two connected sub-networks, one for the model prediction and one for the model uncertainty, and to train it using a carefully designed loss function. With this design, NOMU can provide model uncertainty for any given (previously trained) NN by plugging it into the framework as the sub-network used for model prediction. NOMU is designed to yield uncertainty bounds (UBs) that satisfy four important desiderata regarding model uncertainty, which established methods often do not satisfy. Furthermore, our UBs are themselves representable as a single NN, which leads to computational cost advantages in applications such as Bayesian optimization. We evaluate NOMU experimentally in multiple settings. For regression, we show that NOMU performs as well as or better than established benchmarks. For Bayesian optimization, we show that NOMU outperforms all other benchmarks.

artificial intelligence, neural network, nomu

2102.1364

Country:

North America > United States > California (0.14)
Europe > Sweden (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Energy > Oil & Gas (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)

#artificialintelligence, Oct-29-2020, 06:15:44 GMT

Banks roll out robots as pandemic shakes up IT plans

LONDON (Reuters) - When banks were flooded with loan requests from businesses struggling with the fallout of the coronavirus pandemic, hastily built robots helped several lenders cope with the deluge. The bots were one of many quick technology changes deployed across the industry during the crisis, a contrast to the slow progress it's made in the past two decades to improve technology in the face of increasing competition from fintech rivals. Now the jolt from the COVID-19 pandemic has accelerated the process even though banks globally are having to cut IT spending this year for the first time since 2009, based on data from research company IDC. "Bots allowed us to process a much higher volume of applications than we would have been able to do before. It meant the timelines didn't get longer with the massive volume," said Simon McNamara, chief administrative officer at Britain's NatWest, which has granted more than 13 billion pounds ($16.90 billion) of state-backed loans.

artificial intelligence, bank roll, robot

Country:

Europe > United Kingdom > England > Greater London > London (0.25)
Europe > Switzerland (0.05)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology: Information Technology > Artificial Intelligence > Robots (0.61)

#artificialintelligence, Sep-22-2020, 04:01:36 GMT

AI Should Change What You Do -- Not Just How You Do It

Few leaders would dispute the fact that business today is driven by data and smart algorithms. Yet, rather than real digital transformation, many instead pursue digital incrementalism, using automation to cut costs or, worse -- cut jobs. Doing so might buy you some time from impatient shareholders, but it will be short-lived unless you can face the challenge: How do you reimagine what you do for a new era of AI-powered competition? The high unemployment numbers of the Covid-19 recession have obscured a systemic problem: the accelerating effect of automation on the workforce. We have been here before.

artificial intelligence, automation, customer

Industry:

Banking & Finance > Economy (0.87)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.35)
Health & Medicine > Therapeutic Area > Immunology (0.35)

Technology: Information Technology > Artificial Intelligence (1.00)

Placidi, Giuseppe, Cinque, Luigi, Polsinelli, Matteo

Convolutional Neural Networks for Automatic Detection of Artifacts from Independent Components Represented in Scalp Topographies of EEG Signals

arXiv.org Artificial Intelligence, Sep-8-2020

Electroencephalography (EEG) measures electrical brain activity in real time using sensors placed on the scalp. Artifacts due to eye movements and blinks, muscular/cardiac activity, and generic electrical disturbances have to be recognized and eliminated to allow a correct interpretation of the useful brain signals (UBS) of EEG. Independent Component Analysis (ICA) is effective at splitting the signal into independent components (ICs) whose re-projections on 2D scalp topographies (images), also called topoplots, allow artifacts to be recognized and separated from UBS. Until now, IC topoplot analysis, a gold standard in EEG, has been carried out visually by human experts and hence is not usable in automatic, fast-response EEG. We present a completely automatic and effective framework for EEG artifact recognition by IC topoplots, based on 2D Convolutional Neural Networks (CNNs), capable of dividing topoplots into 4 classes: 3 types of artifacts and UBS. The framework setup is described and results are presented, discussed, and compared with those obtained by other competitive strategies. Experiments carried out on public EEG datasets have shown an overall accuracy above 98%, taking 1.4 sec on a standard PC to classify 32 topoplots, that is, enough to drive an EEG system of 32 sensors. Though not real-time, the proposed framework is efficient enough to be used in fast-response EEG-based Brain-Computer Interfaces (BCI) and is faster than other automatic methods based on ICs.

artificial intelligence, machine learning, topoplot

2009.03696

Country:

Europe > Italy > Abruzzo > L'Aquila Province > L'Aquila (0.04)
Europe > Italy > Lazio > Rome (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)