 crawler


AMBER: Aerial deployable gripping crawler with compliant microspine for canopy manipulation

Wigner, P. A., Romanello, L., Hammad, A., Nguyen, P. H., Lan, T., Armanini, S. F., Kocer, B. B., Kovac, M.

arXiv.org Artificial Intelligence

This paper presents an aerially deployable crawler designed for adaptive locomotion and manipulation within tree canopies. The system combines compliant microspine-based tracks, a dual-track rotary gripper, and an elastic tail, enabling secure attachment and stable traversal across branches of varying curvature and inclination. Experiments demonstrate reliable gripping at up to 90 degrees of body roll and inclination and effective climbing on branches inclined up to 67.5 degrees, with a maximum speed of 0.55 body lengths per second on horizontal branches. The compliant tracks allow yaw steering of up to 10 degrees, enhancing maneuverability on irregular surfaces. Power measurements show efficient operation with a dimensionless cost of transport over an order of magnitude lower than typical hovering power consumption in aerial robots. Integrated within a drone-tether deployment system, the crawler provides a robust, low-power platform for environmental sampling and in-canopy sensing, bridging the gap between aerial and surface-based ecological robotics.
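
As a rough illustration of the cost-of-transport figure mentioned in this abstract, the sketch below uses the standard dimensionless definition CoT = P / (m g v). Whether the paper computes it in exactly this way is an assumption. The function names are hypothetical, and the numeric inputs other than the 0.55 body lengths per second are placeholders, not measurements from the paper.

```python
G = 9.81  # standard gravity in m/s^2


def speed_from_body_lengths(body_lengths_per_s: float, body_length_m: float) -> float:
    """Convert a speed in body lengths per second to metres per second."""
    return body_lengths_per_s * body_length_m


def cost_of_transport(power_w: float, mass_kg: float, speed_m_s: float, g: float = G) -> float:
    """Dimensionless cost of transport: power / (weight * speed)."""
    if mass_kg <= 0 or speed_m_s <= 0:
        raise ValueError("mass and speed must be positive")
    return power_w / (mass_kg * g * speed_m_s)


if __name__ == "__main__":
    # 0.55 body lengths/s is taken from the abstract; the body length,
    # power, and mass below are placeholder values chosen for illustration.
    speed = speed_from_body_lengths(0.55, body_length_m=0.30)
    print(cost_of_transport(power_w=5.0, mass_kg=0.5, speed_m_s=speed))
```

A lower value means less energy spent per unit of weight moved per unit of distance, which is what makes the figure comparable across robots of different sizes.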




A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See

Zychlinski, Shaked

arXiv.org Artificial Intelligence

This paper introduces a novel attack vector that leverages website cloaking techniques to compromise autonomous web-browsing agents powered by Large Language Models (LLMs). As these agents become more prevalent, their unique and often homogeneous digital fingerprints - comprising browser attributes, automation framework signatures, and network characteristics - create a new, distinguishable class of web traffic. The attack exploits this fingerprintability. A malicious website can identify an incoming request as originating from an AI agent and dynamically serve a different, "cloaked" version of its content. While human users see a benign webpage, the agent is presented with a visually identical page embedded with hidden, malicious instructions, such as indirect prompt injections. This mechanism allows adversaries to hijack agent behavior, leading to data exfiltration, malware execution, or misinformation propagation, all while remaining completely invisible to human users and conventional security crawlers. This work formalizes the threat model, details the mechanics of agent fingerprinting and cloaking, and discusses the profound security implications for the future of agentic AI, highlighting the urgent need for robust defenses against this stealthy and scalable attack.
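
One defensive check that follows from this threat model is to request the same URL under two client profiles (for example, a regular browser and the agent's own stack) and compare the responses. The sketch below assumes the two HTML strings have already been fetched; the fetching step is omitted. The function names and the threshold are hypothetical, and this is not a defense proposed in the paper.

```python
from difflib import SequenceMatcher


def similarity(html_a: str, html_b: str) -> float:
    """Return a similarity ratio in [0, 1] between two raw HTML documents."""
    return SequenceMatcher(None, html_a, html_b, autojunk=False).ratio()


def looks_cloaked(html_browser: str, html_agent: str, threshold: float = 0.95) -> bool:
    """Flag the page when the two responses differ more than the threshold allows.

    The comparison is done on raw HTML rather than rendered text, because the
    cloaked variant described above is visually identical and differs only in
    hidden markup.
    """
    return similarity(html_browser, html_agent) < threshold


if __name__ == "__main__":
    benign = "<html><body><p>Welcome to the shop.</p></body></html>"
    # Illustrative variant with an extra hidden element.
    variant = benign.replace(
        "</body>", '<div style="display:none">extra hidden text</div></body>'
    )
    print(similarity(benign, variant))
    print(looks_cloaked(benign, variant))
```

Legitimate pages also vary between requests (ads, timestamps, A/B tests), so a real check would need a tolerance tuned per site; the fixed threshold here is only a starting point.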


A simulation framework for autonomous lunar construction work

Linde, Mattias, Lindmark, Daniel, Ålstig, Sandra, Servin, Martin

arXiv.org Artificial Intelligence

We present a simulation framework for lunar construction work involving multiple autonomous machines. The framework supports modelling of construction scenarios and autonomy solutions, execution of the scenarios in simulation, and analysis of work time and energy consumption throughout the construction project. The simulations are based on physics-based models for contacting multibody dynamics and deformable terrain, including vehicle-soil interaction forces and soil flow in real time. A behaviour tree manages the operational logic and error handling, which enables the representation of complex behaviours through a discrete set of simpler tasks in a modular hierarchical structure. High-level decision-making is separated from lower-level control algorithms, with the two connected via ROS2. Excavation movements are controlled through inverse kinematics and tracking controllers. The framework is tested and demonstrated on two different lunar construction scenarios that involve an excavator and dump truck with actively controlled articulated crawlers.
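
To make the behaviour-tree idea in this abstract concrete, here is a minimal sketch of how a sequence of simple tasks plus a fallback branch can express operational logic with error handling. The node classes, task names (`dig`, `load`, `dump`, `recover`), and the `build_tree` helper are all hypothetical and are not taken from the authors' framework, which additionally connects to lower-level controllers via ROS2.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Action:
    """Leaf node that wraps a callable returning a Status."""

    def __init__(self, name: str, fn: Callable[[], Status]):
        self.name = name
        self.fn = fn

    def tick(self) -> Status:
        return self.fn()


class Sequence:
    """Ticks children in order; stops at the first child that does not succeed."""

    def __init__(self, children: List):
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback:
    """Ticks children in order; stops at the first child that does not fail."""

    def __init__(self, children: List):
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


log: List[str] = []  # records which tasks were executed, for inspection


def make_action(name: str, result: Status) -> Action:
    def run() -> Status:
        log.append(name)
        return result
    return Action(name, run)


def build_tree(dig_ok: bool) -> Fallback:
    """Try the dig-load-dump cycle; if any step fails, run a recovery task."""
    dig = make_action("dig", Status.SUCCESS if dig_ok else Status.FAILURE)
    cycle = Sequence([dig, make_action("load", Status.SUCCESS),
                      make_action("dump", Status.SUCCESS)])
    return Fallback([cycle, make_action("recover", Status.SUCCESS)])
```

Calling `build_tree(False).tick()` runs `dig`, sees it fail, skips the rest of the cycle, and falls through to `recover`, which is the modular error-handling pattern the abstract describes.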


Evaluating the Use of LLMs for Documentation to Code Traceability

Alor, Ebube, Khatoonabadi, SayedHassan, Shihab, Emad

arXiv.org Artificial Intelligence

Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research.
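
The F1-scores reported in this abstract can be read as follows: treat each predicted (document, code artifact) pair as a trace link and compare the predicted set against a gold set. The sketch below shows that computation; the function name and the example file names are hypothetical and do not come from the paper's datasets.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (documentation item, code artifact)


def precision_recall_f1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over sets of trace links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up links for illustration only.
    gold = {("guide.md", "crawler.py"), ("guide.md", "config.py"),
            ("api.md", "client.py"), ("api.md", "models.py")}
    predicted = {("guide.md", "crawler.py"), ("api.md", "client.py"),
                 ("api.md", "utils.py")}  # the last link is a false positive
    print(precision_recall_f1(predicted, gold))
```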


Cloudflare will now, by default, block AI bots from crawling its clients' websites

MIT Technology Review

However, such systems don't provide the same opportunities for monetization and credit as search engines historically have. AI models draw from a great deal of data on the web to generate their outputs, but these data sources are often not credited, limiting the creators' ability to make money from their work. Search engines that feature AI-generated answers may include links to original sources, but they may also reduce people's interest in clicking through to other sites and could even usher in a "zero-click" future. "Traditionally, the unspoken agreement was that a search engine could index your content, then they would show the relevant links to a particular query and send you traffic back to your website," Will Allen, Cloudflare's head of AI privacy, control, and media products, wrote in an email to MIT Technology Review. Generally, creators and publishers want to decide how their content is used, how it's associated with them, and how they are paid for it.


Millions Use It Every Day. It's One of the Internet's Most Important Websites. Bots Are Destroying It, Piece by Piece.

Slate

In the years since ChatGPT's debut transformed Silicon Valley into an artificial intelligence hype factory, the internet's most vibrant communities have puzzled over how to adapt to the ensuing deluge of A.I. slop, especially as autogenerated outputs become more sophisticated. Perhaps no platform exemplifies this conundrum better than Reddit, the anonymized message-board network that's been connecting millions of humans across the world for 20 years now--as many users there increasingly wonder whether they are, indeed, still connecting with other humans. Such concerns aren't new, but they've been heightened thanks to a shocking exercise of A.I.-powered manipulation. In late April, the moderation team for the popular subreddit r/ChangeMyView disclosed that researchers from the University of Zurich had conducted an "unauthorized experiment" on community members that "deployed AI-generated comments to study how AI could be used to change views."


Tree-based Focused Web Crawling with Reinforcement Learning

Kontogiannis, Andreas, Kelesis, Dimitrios, Pollatos, Vasilis, Giannakopoulos, George, Paliouras, Georgios

arXiv.org Artificial Intelligence

A focused crawler aims at discovering as many web pages and web sites relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been a promising direction for optimizing focused crawling, because RL can naturally optimize the long-term profit of discovering relevant web locations within the context of a reward. In this paper, we propose TRES, a novel RL-empowered framework for focused crawling that aims at maximizing both the number of relevant web pages (aka harvest rate) and the number of relevant web sites (domains). We model the focused crawling problem as a novel Markov Decision Process (MDP), which the RL agent aims to solve by determining an optimal crawling strategy. To overcome the computational infeasibility of exhaustively searching for the best action at each time step, we propose Tree-Frontier, a provably efficient tree-based sampling algorithm that adaptively discretizes the large state and action spaces and evaluates only a few representative actions. Experimentally, utilizing online real-world data, we show that TRES significantly outperforms and Pareto-dominates state-of-the-art methods in terms of harvest rate and the number of retrieved relevant domains, while it provably reduces by orders of magnitude the number of URLs needed to be evaluated at each crawling step.
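
The two evaluation quantities named in this abstract can be sketched as below. The code assumes the common definition of harvest rate as the fraction of fetched pages that are relevant, and counts a domain as the network location of a relevant URL; both are assumptions about the paper's exact definitions. The function name and example URLs are hypothetical, and the sketch covers only the metrics, not TRES or Tree-Frontier.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def crawl_metrics(fetched: List[Tuple[str, bool]]) -> Tuple[float, int]:
    """Return (harvest rate, number of distinct relevant domains).

    `fetched` holds one (url, is_relevant) pair per page the crawler downloaded.
    """
    relevant_urls = [url for url, is_relevant in fetched if is_relevant]
    harvest_rate = len(relevant_urls) / len(fetched) if fetched else 0.0
    relevant_domains = {urlparse(url).netloc for url in relevant_urls}
    return harvest_rate, len(relevant_domains)


if __name__ == "__main__":
    # Made-up crawl log for illustration only.
    crawl_log = [
        ("https://a.example/page1", True),
        ("https://a.example/page2", True),
        ("https://b.example/index", True),
        ("https://c.example/ads", False),
    ]
    print(crawl_metrics(crawl_log))
```

Tracking both numbers matters because a crawler can reach a high harvest rate by staying inside one relevant site, which would leave the domain count low.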


Document Quality Scoring for Web Crawling

Pezzuti, Francesca, Mueller, Ariane, MacAvaney, Sean, Tonellotto, Nicola

arXiv.org Artificial Intelligence

The internet contains large amounts of low-quality content, yet users expect web search engines to deliver high-quality, relevant results. The abundant presence of low-quality pages can negatively impact retrieval and crawling processes by wasting resources on these documents. Therefore, search engines can greatly benefit from techniques that leverage efficient quality estimation methods to mitigate these negative impacts. Quality scoring methods for web pages are useful for many processes typical for web search systems, including static index pruning, index tiering, and crawling. Building on work by Chang et al., who proposed using neural estimators of semantic quality for static index pruning, we extend their approach and apply their neural quality scorers to assess the semantic quality of web pages in crawling prioritisation tasks. In our experimental analysis, we found that prioritising semantically high-quality pages over low-quality ones can improve downstream search effectiveness. Our software contribution consists of a Docker container that computes an effective quality score for a given web page, allowing the quality scorer to be easily included and used in other components of web search systems.
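
Crawling prioritisation by quality score, as described in this abstract, can be pictured as a frontier that always hands out the highest-scoring unvisited URL next. The sketch below is one minimal way to do that with a heap. The `QualityFrontier` class is hypothetical, and the scorer passed to it stands in for the paper's neural quality estimator, whose interface is not described here.

```python
import heapq
from typing import Callable, List, Set, Tuple


class QualityFrontier:
    """Crawl frontier that pops URLs in descending order of quality score."""

    def __init__(self, scorer: Callable[[str], float]):
        self._scorer = scorer
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url: str) -> None:
        """Add a URL once; duplicates are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-self._scorer(url), self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        """Remove and return the highest-scoring URL."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    # Placeholder scores; a real system would call a quality model instead.
    toy_scores = {"https://a.example": 0.2, "https://b.example": 0.9,
                  "https://c.example": 0.5}
    frontier = QualityFrontier(lambda url: toy_scores.get(url, 0.0))
    for url in toy_scores:
        frontier.push(url)
    while len(frontier):
        print(frontier.pop())
```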