A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models
Tan, Hongming, Zhan, Shaoxiong, Jia, Fengwei, Zheng, Hai-Tao, Chan, Wai Kin
Measuring scientific paper innovation is both important and challenging. Existing content-based methods often overlook the full-paper context, fail to capture the full scope of innovation, and lack generalization. We propose HSPIM, a hierarchical and training-free framework based on large language models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess innovation. We segment the text by section titles and use zero-shot LLM prompting to implement section classification, question-answering (QA) augmentation, and weighted innovation scoring. The generated QA pairs focus on section-level innovation and serve as additional context to improve LLM scoring. For each chunk, the LLM outputs a novelty score and a confidence score. We use the confidence scores as weights to aggregate the novelty scores into a paper-level innovation score. To further improve performance, we propose a two-layer question structure consisting of common and section-specific questions, and apply a genetic algorithm to optimize the question-prompt combinations. Furthermore, under this fine-grained view of innovation, we extend HSPIM to HSPIM$^+$, which generates novelty, contribution, and feasibility scores with respective confidence scores. Comprehensive experiments on scientific conference paper datasets show that HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability. Demo code is available at https://github.com/Jasaxion/HSPIM.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
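The confidence-weighted aggregation step described in the HSPIM abstract above can be sketched in a few lines. This is a hypothetical illustration only; the function name and the novelty scale are assumptions, not the authors' implementation:

```python
def aggregate_innovation(section_scores):
    """Combine per-section (novelty, confidence) pairs into a paper-level score.

    Confidence scores act as weights, as described in the abstract; the
    novelty scale used here (0-10) is an assumption for illustration.
    """
    total_conf = sum(conf for _, conf in section_scores)
    if total_conf == 0:
        return 0.0
    return sum(nov * conf for nov, conf in section_scores) / total_conf

# Three section chunks scored (novelty, confidence) by the LLM:
paper_score = aggregate_innovation([(7, 0.9), (5, 0.4), (9, 0.8)])  # ~7.38
```

HSPIM$^+$ would apply the same weighting separately to its novelty, contribution, and feasibility scores, each with its own confidence weights.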
We thank the reviewers for very helpful comments. This letter addresses the major questions raised by the reviewers.

1. Learning rates. To address the reviewers' comments on learning rates, we will add results with easy-to-implement ... More specifically, this requires two changes: (1) the epoch length needs to keep increasing (i.e., at the end of every Q-update) and (2) choosing δ to be sufficiently small. We will add this in the revision.

Proof of Theorem 5. We sketch the proof for the piecewise choice (1), which follows easily from our Theorem 1. We will clarify this in the revision to avoid confusion. Given that |S||A| is often enormous in practice, our theory potentially leads to a notable improvement.

"...": See the response above on "learning rates".
We thank all reviewers for very helpful comments. This letter addresses the major questions raised by the reviewers. Please see the responses below for "distribution assumptions," "global null and group of coefficients," and "more discussions." We will correct our references and typos in the table, and we shall elaborate more in our revised version to make these points clearer. To address this issue, we provided high-probability guarantees in Sections 2 and 3. We will also elaborate more in our revised version on the case when the eigen-spectra are not as nicely behaved.
DRBench: A Realistic Benchmark for Enterprise Deep Research
Abaskohi, Amirhossein, Chen, Tianyi, Muñoz-Mármol, Miguel, Fox, Curtis, Ramesh, Amrutha Varshini, Marcotte, Étienne, Lù, Xing Han, Chapados, Nicolas, Gella, Spandana, Pal, Christopher, Drouin, Alexandre, Laradji, Issam H.
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.
- North America > United States (0.14)
- Oceania > Australia > Victoria > Bass Strait (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > Canada > British Columbia (0.04)
- Research Report > New Finding (1.00)
- Workflow (0.93)
- Retail (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)
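The DRBench abstract above says agents are evaluated on their ability to recall relevant insights. As a toy illustration of that kind of metric, here is a naive substring-based insight-recall check; this is not DRBench's actual scorer, and all names are hypothetical:

```python
def insight_recall(report_text, gold_insights):
    """Fraction of gold insight phrases that appear verbatim in the report.

    A deliberately naive, case-insensitive substring matcher; a real
    evaluator would use semantic matching rather than exact phrases.
    """
    if not gold_insights:
        return 0.0
    report = report_text.lower()
    found = sum(1 for insight in gold_insights if insight.lower() in report)
    return found / len(gold_insights)

recall = insight_recall(
    "Revenue grew 10% in APAC, driven by enterprise contracts.",
    ["revenue grew 10%", "churn dropped in EMEA"],
)  # 0.5: one of the two gold insights is present
```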
We will add a series of numerical experiments to demonstrate the minimax optimality of the model-...
We thank all reviewers for very helpful comments. This letter addresses several major questions raised by the reviewers. Indeed, reward perturbation is introduced merely to facilitate analysis; take Section 4.3 of the arXiv version as an example. We will elucidate the motivation and intuition of reward perturbation earlier on in the revised paper. We understand from the reviewer's comment that there might be confusion in our presentation; this will be made clear in the final paper.
... some specific questions, but will incorporate all feedback in the final version.
We thank the reviewers for their careful reading and insightful comments. We will add this in the final version, and we will consider Transformer-based models to further shrink the search space. Reviewer question: "The number of nodes in the graphs seems to be quite low (~200 for GNMT). Is there some manual grouping operation performed on the computational graph?"
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.37)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.32)
'Do not pet': Why are robot dogs patrolling Mar-A-Lago?
Video of Spot strutting around the property has gone viral on TikTok - where reactions range from calling them cool and cute, to creepy - and become fodder for jokes on American late night television. But its mission is no laughing matter. "Safeguarding the president-elect is a top priority," said Anthony Guglielmi, US Secret Service chief of communications, in a statement to the BBC. In the months leading up to the US presidential election, Trump was the target of two apparent assassination attempts. The first took place at a July rally in Butler, Pennsylvania and the other occurred at the Mar-a-Lago golf course in September.
- North America > United States > Florida > Palm Beach County > Palm Beach (0.66)
- North America > United States > Pennsylvania (0.30)
- Information Technology > Communications > Social Media (0.92)
- Information Technology > Artificial Intelligence > Robots (0.62)
Are VLMs Really Blind?
Singh, Ayush, Gupta, Mansi, Garg, Shivank
Vision Language Models excel at a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks that are especially easy for humans. Our goal in this work is to determine whether these models are truly "blind" to geometric reasoning or whether there are ways to enhance their capabilities in this area. We present a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of relying on direct VQA alone, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
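The caption-then-answer pipeline in the abstract above can be sketched as follows. The model calls (`vlm_caption`, `llm_answer`) are placeholders for real VLM/LLM APIs, and the keyword extractor is a deliberately simple stand-in:

```python
STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "how", "many", "are", "there"}

def extract_keywords(question):
    """Derive focus keywords from the question by dropping common stopwords."""
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if w and w not in STOPWORDS]

def answer(image, question, vlm_caption, llm_answer):
    """Caption the image with question-derived keywords, then answer from the caption.

    `vlm_caption(image, focus=...)` and `llm_answer(caption, question)` are
    hypothetical callables standing in for the VLM and LLM, respectively.
    """
    keywords = extract_keywords(question)
    caption = vlm_caption(image, focus=keywords)  # caption highlights question-relevant details
    return llm_answer(caption, question)          # the LLM answers from the caption alone
```

Because the language model only ever sees the focused caption, no external fine-tuning of either model is required, matching the claim in the abstract.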
Question-Answering Based Summarization of Electronic Health Records using Retrieval Augmented Generation
Saba, Walid, Wendelken, Suzanne, Shanahan, James
Summarization of electronic health records (EHRs) can substantially reduce 'screen time' for both patients and medical personnel. In recent years, summarization of EHRs has employed machine learning pipelines using state-of-the-art neural models. However, these models have produced less than adequate results, which is attributed to the difficulty of obtaining sufficient annotated data for training. Moreover, the requirement to consider the entire content of an EHR has resulted in poor performance, because the attention mechanisms in modern large language models (LLMs) add quadratic complexity in the size of the input. We propose here a method that mitigates these shortcomings by combining semantic search, retrieval augmented generation (RAG), and question answering using the latest LLMs. In our approach, summarization is the extraction of answers to specific questions that are deemed important by subject-matter experts (SMEs). Our approach is quite efficient; it requires minimal to no training, does not suffer from the 'hallucination' problem of LLMs, and ensures diversity, since the summary contains not repeated content but diverse answers to specific questions.
- North America > United States > Maine > Cumberland County > Portland (0.06)
- Asia > Middle East > Israel (0.05)
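The question-driven RAG summarization described in the EHR abstract above can be sketched as retrieve-then-generate per SME question. This is a minimal illustration, not the authors' system: `embed` and `generate` are placeholder callables for an embedding model and an LLM, and the retrieval is plain cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def summarize_ehr(ehr_chunks, questions, embed, generate, top_k=3):
    """For each SME question, retrieve the top-k relevant chunks and answer from them.

    `embed(text) -> vector` and `generate(question, context) -> answer` are
    hypothetical stand-ins for an embedding model and an LLM.
    """
    chunk_vecs = [embed(chunk) for chunk in ehr_chunks]
    summary = {}
    for question in questions:
        qv = embed(question)
        ranked = sorted(range(len(ehr_chunks)),
                        key=lambda i: cosine(qv, chunk_vecs[i]),
                        reverse=True)
        context = "\n".join(ehr_chunks[i] for i in ranked[:top_k])
        summary[question] = generate(question, context)
    return summary
```

Since each answer is generated only from the chunks retrieved for its own question, the resulting summary naturally avoids repeated content across questions, which is the diversity property the abstract claims.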