AITopics | crawler

Cloudflare will filter out web crawlers that serve AI companies

Engadget | Jul-2-2026, 21:17:10 GMT

The hosting platform wants sites to have more control over how AI companies use their content. Cloudflare has announced plans to automatically block mixed-use web crawlers that index websites for search engines and act as AI agents and trainers at the same time. The company previously offered its customers the optional ability to prevent crawlers from scraping their sites for AI chatbots, but now Cloudflare's stance is becoming more defensive by default. "Now that the majority of traffic on the Internet is non-human, we must go further and act faster so that a sustainable ecosystem can emerge," Matthew Prince, Cloudflare's CEO and co-founder, shared in a statement. Cloudflare's new tools and partnerships give website owners increased visibility and commercial opportunities and benefit AI companies that have bots with clear and transparent intent.
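For context, the long-standing way for a site to opt out of crawling is robots.txt, which depends on crawlers choosing to comply; default blocking of the kind described above goes beyond that. The sketch below is not Cloudflare's mechanism. It only illustrates how a compliant crawler might check a robots.txt policy with Python's standard library. The policy text, the `is_allowed` helper, and the `ExampleSearchBot` agent name are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ExampleSearchBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

Because this check runs on the crawler's side, it only restrains crawlers that honor the file.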

artificial intelligence, data mining, natural language, (14 more...)

Engadget

Industry: Leisure & Entertainment > Games > Computer Games (0.71)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining > Web Mining (0.65)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)


c3738949a80306cc48a8ea8ba0560f9d-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Feb-18-2026, 00:02:11 GMT

data mining, large language model, machine learning, (22 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Asia > Philippines (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > News (1.00)
Law (1.00)
Information Technology > Services (0.93)
(3 more...)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Web (1.00)
(7 more...)


AMBER: Aerial deployable gripping crawler with compliant microspine for canopy manipulation

Wigner, P. A., Romanello, L., Hammad, A., Nguyen, P. H., Lan, T., Armanini, S. F., Kocer, B. B., Kovac, M.

arXiv.org Artificial Intelligence | Dec-9-2025

This paper presents an aerially deployable crawler designed for adaptive locomotion and manipulation within tree canopies. The system combines compliant microspine-based tracks, a dual-track rotary gripper, and an elastic tail, enabling secure attachment and stable traversal across branches of varying curvature and inclination. Experiments demonstrate reliable gripping at up to 90 degrees of body roll and inclination and effective climbing on branches inclined up to 67.5 degrees, with a maximum speed of 0.55 body lengths per second on horizontal branches. The compliant tracks allow yaw steering of up to 10 degrees, enhancing maneuverability on irregular surfaces. Power measurements show efficient operation with a dimensionless cost of transport over an order of magnitude lower than typical hovering power consumption in aerial robots. Integrated within a drone-tether deployment system, the crawler provides a robust, low-power platform for environmental sampling and in-canopy sensing, bridging the gap between aerial and surface-based ecological robotics.
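The dimensionless cost of transport mentioned in the abstract is conventionally defined as power divided by the product of weight and speed, COT = P / (m g v). The sketch below assumes that standard definition; the abstract does not give the paper's exact formulation. The input numbers are hypothetical placeholders, not measurements from the paper.

```python
G = 9.81  # standard gravitational acceleration in m/s^2


def cost_of_transport(power_w: float, mass_kg: float, speed_m_s: float) -> float:
    """Dimensionless cost of transport: P / (m * g * v)."""
    if mass_kg <= 0 or speed_m_s <= 0:
        raise ValueError("mass and speed must be positive")
    return power_w / (mass_kg * G * speed_m_s)


if __name__ == "__main__":
    # Hypothetical robot: 2 W electrical power, 0.5 kg mass, 0.1 m/s speed.
    print(round(cost_of_transport(2.0, 0.5, 0.1), 3))
```

A lower value means less energy spent per unit of weight moved per unit of distance.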

artificial intelligence, crawler, robot, (16 more...)

arXiv.org Artificial Intelligence

2512.0768

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)


Consent in Crisis: The Rapid Decline of the AI Data Commons

Neural Information Processing Systems | Oct-10-2025, 15:49:35 GMT

The web has become the primary communal source of data, or "data commons", for general-purpose

arxiv preprint arxiv, restriction, website, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Asia > Philippines (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > News (1.00)
Law (1.00)
Information Technology > Services (0.93)
(3 more...)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Web (1.00)
(7 more...)


Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion

Gzenda, Vaughn, Chhabra, Robin

arXiv.org Artificial Intelligence | Oct-8-2025

Soft robotic crawlers are mobile robots that utilize soft body deformability and compliance to achieve locomotion through surface contact. Designing control strategies for such systems is challenging due to model inaccuracies, sensor noise, and the need to discover locomotor gaits. In this work, we present a model-based reinforcement learning (MB-RL) framework in which latent dynamics inferred from onboard sensors serve as a predictive model that guides an actor-critic algorithm to optimize locomotor policies. We evaluate the framework on a minimal crawler model in simulation using inertial measurement units and time-of-flight sensors as observations. The learned latent dynamics enable short-horizon motion prediction while the actor-critic discovers effective locomotor policies. This approach highlights the potential of latent-dynamics MB-RL for enabling embodied soft robotic adaptive locomotion based solely on noisy sensor feedback.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

2510.05957

Country: North America > Canada (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)


A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See

Zychlinski, Shaked

arXiv.org Artificial Intelligence | Sep-3-2025

This paper introduces a novel attack vector that leverages website cloaking techniques to compromise autonomous web-browsing agents powered by Large Language Models (LLMs). As these agents become more prevalent, their unique and often homogenous digital fingerprints - comprising browser attributes, automation framework signatures, and network characteristics - create a new, distinguishable class of web traffic. The attack exploits this fingerprintability. A malicious website can identify an incoming request as originating from an AI agent and dynamically serve a different, "cloaked" version of its content. While human users see a benign webpage, the agent is presented with a visually identical page embedded with hidden, malicious instructions, such as indirect prompt injections. This mechanism allows adversaries to hijack agent behavior, leading to data exfiltration, malware execution, or misinformation propagation, all while remaining completely invisible to human users and conventional security crawlers. This work formalizes the threat model, details the mechanics of agent fingerprinting and cloaking, and discusses the profound security implications for the future of agentic AI, highlighting the urgent need for robust defenses against this stealthy and scalable attack.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.00124

Genre: Research Report (0.65)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)


Evaluating the Use of LLMs for Documentation to Code Traceability

Alor, Ebube, Khatoonabadi, SayedHassan, Shihab, Emad

arXiv.org Artificial Intelligence | Aug-8-2025

Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research.
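The F1-scores reported above combine precision and recall over predicted trace links. As a minimal sketch, assuming each trace link is modeled as a (documentation file, source file) pair, the metric could be computed as follows. The file names and the `f1_score` helper are hypothetical and do not come from the paper's datasets or tooling.

```python
def f1_score(predicted, actual) -> float:
    """F1 over two collections of trace links (harmonic mean of precision and recall)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical ground-truth and predicted documentation-to-code links.
    actual = {
        ("docs/api.md", "crawler.py"),
        ("docs/api.md", "config.py"),
        ("docs/guide.md", "cli.py"),
    }
    predicted = {
        ("docs/api.md", "crawler.py"),
        ("docs/guide.md", "cli.py"),
        ("docs/guide.md", "utils.py"),  # false positive
    }
    print(f1_score(predicted, actual))
```

In this toy case two of three predictions are correct and two of three true links are found, so precision, recall, and F1 all equal 2/3.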

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.1644

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Cloudflare will now, by default, block AI bots from crawling its clients' websites

MIT Technology Review | Jul-1-2025, 10:00:57 GMT

However, such systems don't provide the same opportunities for monetization and credit as search engines historically have. AI models draw from a great deal of data on the web to generate their outputs, but these data sources are often not credited, limiting the creators' ability to make money from their work. Search engines that feature AI-generated answers may include links to original sources, but they may also reduce people's interest in clicking through to other sites and could even usher in a "zero-click" future. "Traditionally, the unspoken agreement was that a search engine could index your content, then they would show the relevant links to a particular query and send you traffic back to your website," Will Allen, Cloudflare's head of AI privacy, control, and media products, wrote in an email to MIT Technology Review. Generally, creators and publishers want to decide how their content is used, how it's associated with them, and how they are paid for it.

cloudflare, crawler, website, (4 more...)

MIT Technology Review

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Information Management > Search (0.84)
Information Technology > Artificial Intelligence > Natural Language (0.82)


Millions Use It Every Day. It's One of the Internet's Most Important Websites. Bots Are Destroying It, Piece by Piece.

Slate | Jun-23-2025, 16:10:38 GMT

In the years since ChatGPT's debut transformed Silicon Valley into an artificial intelligence hype factory, the internet's most vibrant communities have puzzled over how to adapt to the ensuing deluge of A.I. slop, especially as autogenerated outputs become more sophisticated. Perhaps no platform exemplifies this conundrum better than Reddit, the anonymized message-board network that's been connecting millions of humans across the world for 20 years now--as many users there increasingly wonder whether they are, indeed, still connecting with other humans. Such concerns aren't new, but they've been heightened thanks to a shocking exercise of A.I.-powered manipulation. In late April, the moderation team for the popular subreddit r/ChangeMyView disclosed that researchers from the University of Zurich had conducted an "unauthorized experiment" on community members that "deployed AI-generated comments to study how AI could be used to change views."

large language model, machine learning, natural language, (21 more...)

Slate

Country:

Europe > Switzerland > Zürich > Zürich (0.27)
North America > United States > California (0.24)

Industry:

Media > News (1.00)
Information Technology > Security & Privacy (0.70)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.52)


Tree-based Focused Web Crawling with Reinforcement Learning

Kontogiannis, Andreas, Kelesis, Dimitrios, Pollatos, Vasilis, Giannakopoulos, George, Paliouras, Georgios

arXiv.org Artificial Intelligence | May-20-2025

A focused crawler aims at discovering as many web pages and web sites relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been a promising direction for optimizing focused crawling, because RL can naturally optimize the long-term profit of discovering relevant web locations within the context of a reward. In this paper, we propose TRES, a novel RL-empowered framework for focused crawling that aims at maximizing both the number of relevant web pages (aka harvest rate) and the number of relevant web sites (domains). We model the focused crawling problem as a novel Markov Decision Process (MDP), which the RL agent aims to solve by determining an optimal crawling strategy. To overcome the computational infeasibility of exhaustively searching for the best action at each time step, we propose Tree-Frontier, a provably efficient tree-based sampling algorithm that adaptively discretizes the large state and action spaces and evaluates only a few representative actions. Experimentally, utilizing online real-world data, we show that TRES significantly outperforms and Pareto-dominates state-of-the-art methods in terms of harvest rate and the number of retrieved relevant domains, while it provably reduces by orders of magnitude the number of URLs needed to be evaluated at each crawling step.
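The two evaluation quantities named in the abstract can be illustrated with a small sketch. It assumes the common definition of harvest rate as the fraction of fetched pages judged relevant, and it counts distinct host names among the relevant pages as relevant domains. The crawl log, the relevance labels, and the `crawl_metrics` helper are hypothetical and are not taken from TRES.

```python
from urllib.parse import urlparse


def crawl_metrics(fetched):
    """Return (harvest_rate, relevant_domain_count) for a crawl log.

    fetched is a list of (url, is_relevant) pairs, one per fetched page.
    """
    if not fetched:
        return 0.0, 0
    relevant_urls = [url for url, is_relevant in fetched if is_relevant]
    harvest_rate = len(relevant_urls) / len(fetched)
    # A domain is approximated here by the URL's host name.
    relevant_domains = {urlparse(url).netloc for url in relevant_urls}
    return harvest_rate, len(relevant_domains)


if __name__ == "__main__":
    # Hypothetical crawl log with relevance labels from some topic classifier.
    log = [
        ("https://a.example/sports/1", True),
        ("https://a.example/sports/2", True),
        ("https://b.example/news", False),
        ("https://c.example/match-report", True),
    ]
    print(crawl_metrics(log))
```

Here three of four fetched pages are relevant and they span two hosts, so the sketch reports a harvest rate of 0.75 and two relevant domains.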

data mining, machine learning, reinforcement learning, (22 more...)

arXiv.org Artificial Intelligence

2112.0762

Country:

North America > United States (0.46)
Asia (0.46)

Genre: Research Report (0.84)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Web (1.00)
(3 more...)
