- Law (1.00)
- Health & Medicine > Therapeutic Area (0.94)
- Information Technology (0.93)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.69)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (12 more...)
- Overview (1.00)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Law (1.00)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- (2 more...)
- Research Report > Experimental Study (0.93)
- Workflow (0.67)
- Information Technology (0.67)
- Media (0.46)
- Government (0.46)
A Appendix
A.1 TPPE Method
We present the pseudocode for TPPE in this paper, using the Insertion mode as an example. According to Alg. 1, we reduce the query time complexity. In our study, we assume the worst-case scenario of applying punctuation-level attacks, and a softmax layer is adopted to predict the label of the input text. We further extend the method to a paraphrase variant (TPPEP) to achieve a single-shot attack. The TPPEP method is decomposed into two parts: training and searching.
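The searching step of the Insertion mode described above can be pictured as a greedy loop over candidate punctuation insertions. The sketch below is purely illustrative: the `classify` interface, the punctuation set, and the greedy strategy are assumptions, not the paper's actual Alg. 1.

```python
# Hypothetical sketch of an Insertion-mode punctuation attack search,
# loosely following the training/searching split described for TPPE(P).
# `classify(text) -> (label, confidence)` is an assumed interface.

PUNCTUATION = list(",.;:!?'\"-")

def insertion_attack(text, classify, max_edits=3):
    """Greedily insert punctuation marks to lower the classifier's
    confidence in its original label (worst-case, punctuation-level)."""
    label, _ = classify(text)
    best = text
    for _ in range(max_edits):
        candidates = []
        for i in range(len(best) + 1):
            for p in PUNCTUATION:
                cand = best[:i] + p + best[i:]
                new_label, conf = classify(cand)
                if new_label != label:
                    return cand          # label flipped: attack succeeds
                candidates.append((conf, cand))
        _, best = min(candidates)        # keep the lowest-confidence edit
    return best
```

Each round scans every insertion point and every punctuation mark, so a single edit costs O(len(text) × |PUNCTUATION|) classifier queries; the single-shot TPPEP variant mentioned above would avoid this per-example search.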
Detecting LLM-Generated Text with Performance Guarantees
Zhou, Hongyi, Zhu, Jin, Yang, Ying, Shi, Chengchun
Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily lives. They now support a wide range of tasks -- from dialogue and email drafting to assisting with teaching and coding, serving as search engines, and much more. However, their ability to produce highly human-like text raises serious concerns, including the spread of fake news, the generation of misleading governmental reports, and academic misconduct. To address this practical problem, we train a classifier to determine whether a piece of text is authored by an LLM or a human. Our detector is deployed on an online CPU-based platform https://huggingface.co/spaces/stats-powered-ai/StatDetectLLM, and offers three novelties over existing detectors: (i) it does not rely on auxiliary information, such as watermarks or knowledge of the specific LLM used to generate the text; (ii) it more effectively distinguishes between human- and LLM-authored text; and (iii) it enables statistical inference, which is largely absent in the current literature. Empirically, our classifier achieves higher classification accuracy than existing detectors, while maintaining type-I error control, high statistical power, and computational efficiency.
- Media > News (1.00)
- Information Technology (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
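The type-I error control claimed in the abstract above can be illustrated with a standard conformal-style calibration: choose the rejection threshold on held-out human-written texts so that at most an alpha fraction of them would be falsely flagged. This is a minimal sketch under that assumption; the authors' actual inference procedure may differ, and `calibrate_threshold`/`detect` are illustrative names, not the paper's API.

```python
# Minimal sketch: threshold calibration for false-positive (type-I) control,
# assuming only a real-valued detector score where higher = "more LLM-like".

import math

def calibrate_threshold(human_scores, alpha=0.05):
    """Pick a threshold so that at most an alpha fraction of
    human-written calibration texts would be flagged as LLM-generated."""
    scores = sorted(human_scores)
    n = len(scores)
    # conformal-style rank: the ceil((n+1)(1-alpha))-th smallest score
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

def detect(score, threshold):
    """Flag a text as LLM-generated only when its score exceeds the threshold."""
    return score > threshold
```

With this construction the fraction of human-written calibration texts that exceed the threshold is at most alpha by design, which is what type-I error control asks for on the calibration distribution.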
ProBench: Benchmarking GUI Agents with Accurate Process Information
Yang, Leyang, Wang, Ziwei, Tang, Xiaoxuan, Zhou, Sheng, Chen, Dajun, Jiang, Wei, Li, Yong
With the deep integration of artificial intelligence and interactive technology, the Graphical User Interface (GUI) agent, which connects goal-oriented natural language to real-world devices, has received widespread attention from the community. Contemporary benchmarks aim to evaluate the comprehensive capabilities of GUI agents on GUI operation tasks, generally determining task completion solely by inspecting the final screen state. However, GUI operation tasks consist of multiple chained steps, and not all critical information is presented in the final few pages. Although some research has begun to incorporate intermediate steps into evaluation, accurately and automatically capturing this process information remains an open challenge. To address this weakness, we introduce ProBench, a comprehensive mobile benchmark with over 200 challenging GUI tasks covering widely used scenarios. While retaining the traditional State-related Task evaluation, we extend our dataset to include Process-related Tasks and design a specialized evaluation method: a newly introduced Process Provider automatically supplies accurate process information, enabling precise assessment of an agent's performance. Our evaluation of advanced GUI agents reveals significant limitations in real-world GUI scenarios. These shortcomings are prevalent across diverse models, including both large-scale generalist models and smaller, GUI-specific models. A detailed error analysis further exposes several universal problems, outlining concrete directions for future improvements.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Hong Kong (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (5 more...)
- Workflow (0.68)
- Research Report (0.64)
- Leisure & Entertainment > Sports (1.00)
- Information Technology (1.00)
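One way to picture process-related evaluation, as opposed to checking only the final screen state, is an ordered-checkpoint match over the agent's step trace. This is a hypothetical sketch: ProBench's Process Provider interface is not described in the abstract, so the trace and checkpoint representation below are assumptions for illustration.

```python
# Hypothetical sketch of a process-related check: the "process information"
# is assumed to be an ordered list of checkpoint strings, and the agent's
# trace a list of step descriptions. All checkpoints must appear, in order.

def passes_process_check(trace, checkpoints):
    """Return True iff every checkpoint occurs in the trace, in order.
    Each checkpoint must match a strictly later step than the previous one,
    so this verifies the process, not just the final state."""
    it = iter(trace)  # shared iterator enforces ordering across checkpoints
    return all(any(cp in step for step in it) for cp in checkpoints)
```

Because the iterator is shared, each checkpoint search resumes where the previous one stopped; a trace that reaches the right final screen by the wrong route fails the check, which is the distinction the benchmark's Process-related Tasks target.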
A Our Designed Prompts for FLUB
Figure 4: Our designed prompts without the Chain-of-Thought idea; Task 3(b) is for inquiries. Figure 5: Our designed prompts with the Chain-of-Thought idea; Task 3(b) is for inquiries. The Chain-of-Thought prompts for Task 1 and Task 2 are presented in Figure 5. Scoring Objective: for the LLMs' output response to each input cunning text, please refer to the Scoring Rules; the scoring values are defined as {1, 2, 3, 4, 5}.