CSS selector
WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents
Bohra, Arth, Saroyan, Manvel, Melkozerov, Danil, Karufanyan, Vahe, Maher, Gabriel, Weinberger, Pascal, Harutyunyan, Artem, Campagna, Giovanni
Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract a complete dataset with a well-defined schema. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
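The core idea — find one selector that captures every item on a regularly structured page, then replay the same extraction over each match — can be sketched in a few lines. This is only an illustrative toy, not BardeenAgent's actual implementation: it uses Python's stdlib `html.parser` instead of a real selector engine, and the `SelectorExtractor` class, the `li.row` selector, and the `html_doc` fragment are all invented for the example.

```python
from html.parser import HTMLParser

class SelectorExtractor(HTMLParser):
    """Collect the text of every element matching a (tag, class) pair,
    mimicking a simple CSS selector such as 'li.row'."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0   # > 0 while inside a matching element
        self.rows = []   # one extracted string per matched element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nesting of the same tag inside a match.
            self.depth += 1 if tag == self.tag else 0
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.rows.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.rows[-1] += data.strip()

# Invented page fragment: two data rows plus one non-matching element.
html_doc = """
<ul>
  <li class="row">Acme Corp</li>
  <li class="row">Globex Inc</li>
  <li class="ad">Sponsored</li>
</ul>
"""
ex = SelectorExtractor("li", "row")
ex.feed(html_doc)
print(ex.rows)  # → ['Acme Corp', 'Globex Inc']
```

The point of the sketch is the shape of the approach: once the selector (`li.row`) generalizes over all items, the per-item extraction runs mechanically over every match, which is what makes replay across similarly structured pages cheap.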
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Alameda County > Berkeley (0.14)
- Europe > Switzerland (0.04)
- Europe > Belgium (0.04)
Optimal Scraping Technique: CSS Selector, XPath, & RegEx - DataScienceCentral.com
In nearly all cases, what is required is a small sample from a very large file. Therefore, an essential part of scraping is searching through an HTML document and finding the correct information. How that should be done is a matter of some debate, shaped by preferences, experience, and the type of data. While all scraping and parsing methods are "correct", some of them have benefits that may be vital when more optimization is required, and some methods may be easier for specific types of data.
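The trade-off between structural and pattern-based extraction can be shown side by side. A minimal sketch, assuming Python's stdlib: since the stdlib has no CSS selector engine, the structural approach is shown with `ElementTree`'s limited XPath subset, and the `snippet` fragment and the `td.price` cells are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# A small, well-formed fragment standing in for a scraped page.
snippet = (
    '<table>'
    '<tr><td class="price">19.99</td></tr>'
    '<tr><td class="price">4.50</td></tr>'
    '<tr><td class="name">Widget</td></tr>'
    '</table>'
)

# Structural approach: ElementTree's XPath subset walks the parsed tree,
# so it is robust to whitespace and attribute-order changes.
root = ET.fromstring(snippet)
xpath_prices = [td.text for td in root.findall('.//td[@class="price"]')]

# Pattern approach: a regular expression over the raw markup is faster to
# write but brittle against any change in the surrounding tags.
regex_prices = re.findall(r'<td class="price">([\d.]+)</td>', snippet)

print(xpath_prices)  # → ['19.99', '4.50']
print(regex_prices)  # → ['19.99', '4.50']
```

Both calls return the same data here, which is exactly the article's point: the methods agree on well-behaved input, and the choice comes down to how much structure you can rely on and how much optimization you need.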
- Information Technology > Artificial Intelligence (0.37)
- Information Technology > Communications > Web (0.36)
- Information Technology > Software (0.32)
Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis
Rahmani, Kia, Raza, Mohammad, Gulwani, Sumit, Le, Vu, Morris, Daniel, Radhakrishna, Arjun, Soares, Gustavo, Tiwari, Ashish
Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle with generating syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications, such as examples, but they are unable to work with the ambiguity of natural language. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidate programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise example-based specification. We use our combination approach to instantiate multi-modal synthesis systems for two programming domains: the domain of regular expressions and the domain of CSS selectors. Our evaluation demonstrates the effectiveness of our domain-agnostic approach in comparison to a state-of-the-art specialized system, and the generality of our approach in providing multi-modal program synthesis from natural language and examples in different programming domains.
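The PTM-guides-CBS loop the abstract describes can be sketched as a toy for the regex domain. Everything below is invented for illustration — the component pool, the `synthesize` search, and the hard-coded `ptm_candidates` standing in for a language model's ranked guesses — and it is far simpler than the paper's actual system.

```python
import re

# Toy component pool a CBS procedure might enumerate over.
components = [r'\d+', r'[a-z]+', r'\d{3}', r'[A-Z][a-z]+']

def consistent(pattern, positives, negatives):
    """A program satisfies the example-based spec if it fully matches
    every positive example and no negative one."""
    ok = lambda s: re.fullmatch(pattern, s) is not None
    return all(ok(p) for p in positives) and not any(ok(n) for n in negatives)

def synthesize(candidates, positives, negatives):
    """Try PTM-ranked candidates first, then fall back to enumerating
    single components and pairwise concatenations."""
    pool = list(candidates)
    pool += components
    pool += [a + b for a in components for b in components]
    for pattern in pool:
        if consistent(pattern, positives, negatives):
            return pattern
    return None

# Natural language might say "three digits then a lowercase word";
# the examples pin down the precise behavior the prose leaves ambiguous.
positives = ['123abc', '907xyz']
negatives = ['12abc', 'abc123']
ptm_candidates = [r'\d+[a-z]+']   # plausible guess, but it matches '12abc'

result = synthesize(ptm_candidates, positives, negatives)
print(result)  # → \d{3}[a-z]+
```

The PTM candidate `\d+[a-z]+` is rejected because it matches a negative example, and the component search then finds `\d{3}[a-z]+` — illustrating how the precise examples correct the ambiguous natural-language guess.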
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
GoodReads: Webscraping and Text Analysis with R (Part 1)
Inspired by this article about sentiment analysis and this guide to webscraping, I have decided to get my hands dirty by scraping and analyzing a sample of reviews on the website Goodreads. The goal of this project is to demonstrate a complete example, going from data collection to machine learning analysis, and to illustrate a few of the dead ends and mistakes I encountered on my journey. We'll be looking at the reviews for five popular romance books. I have voluntarily chosen books in the same genre in order to make the review text more homogeneous a priori; these five books are popular enough that I can easily pull a few thousand reviews for each, yielding a significant corpus with minimum effort. If you don't like romance books, feel free to replicate the analysis with your genre of choice!