desktop application
Grounding Computer Use Agents on Human Demonstrations
Feizi, Aarash, Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Li, Kaixin, Awal, Rabiul, Lù, Xing Han, Obando-Ceron, Johan, Rodriguez, Juan A., Chapados, Nicolas, Vazquez, David, Romero-Soriano, Adriana, Rabbany, Reihaneh, Taslakian, Perouz, Pal, Christopher, Gella, Spandana, Rajeswar, Sai
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents. The vision of computer-use agents (CUA) that operate software on behalf of users has gained significant momentum with recent progress in multimodal large language model-based agents (OpenAI, 2025; Anthropic, 2024a; Qin et al., 2025; Wang et al., 2025a). These agents promise to automate routine work and make complex digital tools more accessible. For such agents to succeed, they must first plan the next step in a task, then ground the plan to the exact on-screen element to click, type, or drag. Accurate grounding is critical: without correctly identifying the right button or menu item, even a flawless plan cannot be executed. In FreeCAD, for instance, when asked to "open the color picker" (Figure 1), the agent must distinguish a small palette icon from look-alike tools and click it precisely. When grounding fails, the plan quickly veers off course, minor errors compound, and tasks ultimately fail (Nayak et al., 2025). Moreover, grounding in desktop applications is challenging due to their complexity and diversity. 
These applications often feature high-resolution displays with dense layouts and visually similar elements, making precise localization difficult.
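To make the grounding step concrete, here is a minimal sketch of how a predicted click could be scored against an annotated element. It assumes a prediction counts as correct when the click point falls inside the target element's bounding box; that is a common convention, not a metric confirmed by this abstract. The names `Box` and `grounding_accuracy` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of an on-screen element, in pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click is a hit if it lands inside the box, edges included.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted (x, y) clicks that fall inside their target box.

    Assumption: point-in-box scoring; the paper's exact metric may differ.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    palette_icon = Box(left=10, top=10, right=50, bottom=30)
    clicks = [(20, 20), (60, 20)]  # the second click misses the icon
    print(grounding_accuracy(clicks, [palette_icon, palette_icon]))
```

On dense, high-resolution layouts the target boxes are small relative to the screen, so a prediction only a few pixels off can count as a miss under this kind of scoring.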
An Embedded Intelligent System for Attendance Monitoring
Abderraouf, Touzene, Wassim, Abed Abdeljalil, Larabi, Slimane
In this paper, we propose an intelligent embedded system for monitoring class attendance and sending the attendance list to a remote computer. The proposed system consists of two parts: an embedded device (a Raspberry Pi with a Pi camera) for facial recognition and a web application for attendance management. The proposed solution takes into account several challenges: the limited resources of the Raspberry Pi, the need to adapt the facial recognition model, and the need to achieve acceptable performance using images provided by the Raspberry Pi camera.
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Kapoor, Raghav, Butala, Yash Parag, Russak, Melisa, Koh, Jing Yu, Kamble, Kiran, Alshikh, Waseem, Salakhutdinov, Ruslan
For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
ChatGPT for Windows - Desktop Application
Use ChatGPT directly on the Windows desktop instead of in the browser: "ChatGPT Desktop Application" makes it possible. ChatGPT is currently the hottest text AI tool that is usually completely free to use on the web. Instead of using the AI text generator in the browser, the desktop application available here brings ChatGPT directly to the desktop. As a result, you no longer have to keep ChatGPT open in its own tab to interact with the chatbot. Yes, ChatGPT can easily create longer texts.
Top 10 Programming Languages Recruiters are Looking For in 2022
Post pandemic, AI has become one of the top agendas for businesses as it offers enhanced customer experience, resilience, and reliability. With the advancements in machine learning, data analytics, and conversational AI, companies are finding it feasible and affordable to deploy AI tools that allow them to solve problems and increase efficiency. Here are the 10 most popular programming languages among job seekers. Python can be regarded as the future of programming languages. As per the latest statistics, Python is the main coding language for around 80% of developers.
How to Detect Rotten Fruits Using Image Processing Python?
Freshness is one of the essential characteristics for consumers. Consumers prefer fresh fruits to rotten ones when it comes to hygiene. An efficient system for detecting rotten fruit is therefore needed. To meet this need, a desktop application named "Detection of Rotten Fruits (DRF)" is proposed, built using Artificial Intelligence and Computer Vision. DRF is a desktop application that detects rottenness in fruits and can be used to label fruits according to their rottenness.
Top 10 Open Source and Free RPA Tools of 2020
As with much software, there is a build-or-buy choice when getting started with Robotic Process Automation (RPA). Gartner recently called RPA the fastest-growing enterprise software segment of 2018, with 63% growth in worldwide revenue. It is a serious market, and you have alternatives. Commercial RPA vendors have commonly tried to prioritize ease of use, hoping to enable non-developers to create and deploy bots without a large amount of technical overhead. Some commercial vendors offer a "freemium" product as a way of tempting prospective customers to kick the tires on their platforms. There are various RPA tools available on the market, and picking one can be a challenge.
mysam
Sam is an open-source, web-based "intelligent" assistant. It can listen to you, learn new actions, and is extensible with JavaScript plugins. Sam runs as a NodeJS server and works in any modern browser or as an Electron desktop application. At first startup, Sam will load the basic frontend training data (such as learning your name, providing help, saying hi, or learning something new) and ask for your name. To talk to Sam, press CTRL + SPACE (make sure the window is focused).
Fake celebrity porn is all over Reddit thanks to a new app
Back in December, it was discovered that Reddit users were creating fake pornography using celebrity faces pasted on to adult film actresses' bodies. The disturbing videos, created by Reddit user deepfakes, look strikingly real as a result of a sophisticated machine learning algorithm, which uses photographs to create human masks that are then overlaid on top of adult film footage. Now, AI-assisted porn is spreading all over Reddit, thanks to an easy-to-use app that can be downloaded directly to your desktop computer, according to Motherboard. Star Wars lead Daisy Ridley has been featured in a fake video on the Reddit thread. One of the site's users, deepfakeapp, created a desktop application called FakeApp that lets users take adult film footage and swap any female celebrity's face onto porn actresses' bodies. The app uses deepfakes' algorithm, but doesn't require any knowledge of coding.
Finding Image Regions with Human Computation and Games with a Purpose
Lux, Mathias (Klagenfurt University) | Müller, Alexander (Klagenfurt University) | Guggenberger, Mario (Klagenfurt University)
Manual image annotation is a tedious and time-consuming task, while automated methods are error prone and limited in their results. Human computation, and especially games with a purpose, have shown potential to create high quality annotations by "hiding the complexity" of the actual annotation task and employing the "wisdom of the crowds". In this demo paper we present two games with a single purpose: finding regions in images that correspond to given terms. We discuss approach, implementation, and preliminary results of our work and give an outlook to immediate future work.