InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
Faez, Faezeh, Tahaei, Marzieh S., Hu, Yaochen, Pourranjbar, Ali, Biparva, Mahdi, Coates, Mark, Zhang, Yingxue
Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
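The subgraph-extraction step of such a pipeline can be illustrated with a minimal sketch. This is a hypothetical implementation for illustration only, not the InvertiTune code: it grows a connected set of (head, relation, tail) triples around a seed entity and formats them into a prompt asking an LLM to write the corresponding text (the "inverse" direction the abstract describes).

```python
import random

def sample_subgraph(triples, seed_entity, max_triples=20, rng=None):
    """Grow a connected subgraph of (head, relation, tail) triples by
    repeatedly following edges incident to already-visited entities.
    Illustrative sketch only; the paper's pipeline also applies noise
    filtering, which is omitted here."""
    rng = rng or random.Random(0)
    by_entity = {}
    for h, r, t in triples:
        by_entity.setdefault(h, []).append((h, r, t))
        by_entity.setdefault(t, []).append((h, r, t))
    visited, chosen, frontier = {seed_entity}, [], [seed_entity]
    while frontier and len(chosen) < max_triples:
        node = frontier.pop(rng.randrange(len(frontier)))
        for h, r, t in by_entity.get(node, []):
            if (h, r, t) not in chosen and len(chosen) < max_triples:
                chosen.append((h, r, t))
                for e in (h, t):
                    if e not in visited:
                        visited.add(e)
                        frontier.append(e)
    return chosen

def to_prompt(subgraph):
    """Format the triples for an LLM asked to write the matching text."""
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in subgraph)
    return f"Write a coherent paragraph expressing exactly these facts:\n{facts}"
```

The resulting (text, subgraph) pairs can then serve as SFT examples with the text as input and the serialized triples as target output.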
Private Continual Counting of Unbounded Streams
We study the problem of differentially private continual counting in the unbounded setting where the input size $n$ is not known in advance. Current state-of-the-art algorithms based on optimal instantiations of the matrix mechanism cannot be directly applied here because their privacy guarantees only hold when key parameters are tuned to $n$. Using the common `doubling trick' avoids knowledge of $n$ but leads to suboptimal and non-smooth error. We solve this problem by introducing novel matrix factorizations based on logarithmic perturbations of the function $\frac{1}{\sqrt{1-z}}$ studied in prior works, which may be of independent interest. The resulting algorithm has smooth error, and for any $\alpha > 0$ and $t \leq n$ it is able to privately estimate the sum of the first $t$ data points with $O(\log^{2+2\alpha}(t))$ variance. It requires $O(t)$ space and amortized $O(\log t)$ time per round, compared to $O(\log(n)\log(t))$ variance, $O(n)$ space and $O(n \log n)$ pre-processing time for the nearly-optimal bounded-input algorithm of Henzinger et al. (SODA 2023). Empirically, we find that our algorithm's performance is also comparable to theirs in absolute terms: our variance is less than $1.5\times$ theirs for $t$ as large as $2^{24}$.
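For context, the classic binary-tree mechanism is the textbook baseline for unbounded private continual counting: each prefix sum is assembled from $O(\log t)$ noisy dyadic partial sums, giving $O(\log^2 t)$ variance. The sketch below shows that baseline, not the paper's logarithmically perturbed factorization.

```python
import random

class BinaryMechanism:
    """Binary-tree mechanism for private continual counting: maintain
    noisy partial sums over dyadic intervals; the estimate at time t
    sums the noisy nodes corresponding to the set bits of t."""

    def __init__(self, sigma=1.0, seed=0):
        self.sigma = sigma            # Gaussian noise scale per tree node
        self.rng = random.Random(seed)
        self.t = 0
        self.alpha = {}               # level -> true partial sum at that node
        self.noisy = {}               # level -> noisy partial sum

    def step(self, x):
        """Ingest x_t and return a private estimate of x_1 + ... + x_t."""
        self.t += 1
        t = self.t
        i = (t & -t).bit_length() - 1   # level of the dyadic node closed at time t
        # the new node aggregates x_t plus all lower-level partial sums
        self.alpha[i] = x + sum(self.alpha.pop(j, 0) for j in range(i))
        for j in range(i):
            self.noisy.pop(j, None)
        self.noisy[i] = self.alpha[i] + self.rng.gauss(0, self.sigma)
        # prefix-sum estimate: one noisy node per set bit of t
        return sum(self.noisy[j] for j in range(t.bit_length()) if (t >> j) & 1)
```

It runs without knowing $n$ in advance, but its error oscillates with the bit pattern of $t$; smoothing out exactly that non-uniformity is part of what the paper's factorizations address.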
ResAlignNet: A Data-Driven Approach for INS/DVL Alignment
Abstract--Autonomous underwater vehicles rely on precise navigation systems that combine the inertial navigation system and the Doppler velocity log for successful missions in challenging environments where satellite navigation is unavailable. The effectiveness of this integration critically depends on accurate alignment between the sensor reference frames. Standard model-based alignment methods between these sensor systems suffer from lengthy convergence times, dependence on prescribed motion patterns, and reliance on external aiding sensors, significantly limiting operational flexibility. To address these limitations, this paper presents ResAlignNet, a data-driven approach using the 1D ResNet-18 architecture that transforms the alignment problem into deep neural network optimization, operating as an in-situ solution that requires only onboard sensors, without external positioning aids or complex vehicle maneuvers, while achieving rapid convergence in seconds. Additionally, the approach demonstrates the learning capabilities of Sim2Real transfer, enabling training on synthetic data while deploying on operational sensor measurements. Experimental validation using the Snapir autonomous underwater vehicle demonstrates that ResAlignNet achieves alignment accuracy within 0.8 using only 25 seconds of data collection, representing a 65% reduction in convergence time compared to standard velocity-based methods. The trajectory-independent solution eliminates motion pattern requirements and enables immediate vehicle deployment without lengthy pre-mission procedures, advancing underwater navigation capabilities through robust sensor-agnostic alignment that scales across different operational scenarios and sensor specifications. Underwater navigation systems are critical for a wide range of marine applications, particularly autonomous underwater vehicles (AUVs) operating in challenging environments where global navigation satellite systems (GNSSs) are unavailable [1].
Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields
Wahle, Jan Philip, Ruas, Terry, Abdalla, Mohamed, Gipp, Bela, Mohammad, Saif M.
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980--2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to these other fields over time or whether differences can be observed. Our analysis, based on a dataset of approximately 240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science). We term this decline a 'citation age recession', analogous to how economists define periods of reduced economic activity. The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) -- even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
Visual Data Diagnosis and Debiasing with Concept Graphs
Chakraborty, Rwiddhi, Wang, Yinong, Gao, Jialu, Zheng, Runkai, Zhang, Cheng, De la Torre, Fernando
The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present ConBias, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. ConBias represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by ConBias improves generalization performance across multiple datasets compared to state-of-the-art methods.
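The diagnosis half of this idea can be sketched in a few lines. The snippet below is an illustrative simplification, not the ConBias implementation: it counts how often unordered pairs of concept labels co-occur per image and normalizes a concept's co-occurrence profile, where a heavily skewed profile flags a candidate spurious pairing.

```python
from collections import Counter
from itertools import combinations

def concept_cooccurrence(annotations):
    """Count unordered concept-pair co-occurrences across images.
    `annotations` is a list of per-image concept-label lists."""
    pairs = Counter()
    for concepts in annotations:
        for a, b in combinations(sorted(set(concepts)), 2):
            pairs[(a, b)] += 1
    return pairs

def cooccurrence_profile(pairs, concept, others):
    """Normalized co-occurrence of `concept` with each concept in `others`;
    a near-uniform profile (~1/len(others)) suggests no pairing bias."""
    counts = [pairs.get(tuple(sorted((concept, o))), 0) for o in others]
    total = sum(counts) or 1
    return [c / total for c in counts]
```

The paper's balancing step then augments the dataset toward a uniform distribution over such pairings (via clique enumeration on the concept graph, omitted here).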
Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
Islam, Shayekh Bin, Rahman, Md Asib, Hossain, K S M Tozammel, Hoque, Enamul, Joty, Shafiq, Parvez, Md Rizwan
Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. Open-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, Open-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source our code and models at https://openragmoe.github.io/
Leveraging Machine Learning for Official Statistics: A Statistical Manifesto
Puts, Marco, Salgado, David, Daas, Piet
Applying machine learning (ML) to official statistics production presents both opportunities and challenges, making statistical rigor essential. Although machine learning has enjoyed rapid technological advances in recent years, its applications often lack the methodological robustness necessary to produce high-quality statistical results. To account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. To ensure that ML models are both internally and externally valid, the TMLE framework addresses issues such as representativeness and measurement errors. Several case studies are presented, illustrating the importance of applying greater rigor to machine learning in official statistics.
Multi-Objective Global Path Planning for Lunar Exploration With a Quadruped Robot
Richter, Julia, Kolvenbach, Hendrik, Valsecchi, Giorgio, Hutter, Marco
In unstructured environments, the best path is not always the shortest; it must also weigh objectives such as energy efficiency, risk of failure, or scientific outcome. This paper proposes a global planner, based on the A* algorithm, capable of individually considering multiple layers of map data for different cost objectives. We introduce weights between the objectives, which can be adapted to achieve a variety of optimal paths. In order to find the best of these paths, a tool for statistical path analysis is presented. Our planner was tested on exemplary lunar topographies to propose two trajectories for exploring the Aristarchus Plateau. The optimized paths significantly reduce the risk of failure while yielding more scientific value compared to manually planned paths in the same area. The planner and analysis tool are made open-source in order to simplify mission planning for planetary scientists.
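The weighted multi-objective idea can be sketched compactly. The code below is a minimal illustration of the general technique (grid A* where each step's cost is a weighted sum of per-objective map layers), not the released planner; layer contents, weights, and the Manhattan heuristic are assumptions for the example.

```python
import heapq

def multi_objective_astar(layers, weights, start, goal):
    """A* over a grid where entering a cell costs a weighted sum of
    per-objective map layers (e.g. energy, risk, science penalty).
    Tuning `weights` trades the objectives off against each other."""
    rows, cols = len(layers[0]), len(layers[0][0])

    def cell_cost(r, c):
        return sum(w * layer[r][c] for w, layer in zip(weights, layers))

    # admissible, consistent heuristic: cheapest possible per-step cost
    # times Manhattan distance to the goal
    min_step = min(cell_cost(r, c) for r in range(rows) for c in range(cols))

    def h(r, c):
        return min_step * (abs(r - goal[0]) + abs(c - goal[1]))

    frontier = [(h(*start), 0.0, start, [start])]
    best = {}                      # cell -> lowest cost-so-far seen
    while frontier:
        f, g, (r, c), path = heapq.heappop(frontier)
        if (r, c) == goal:
            return g, path
        if best.get((r, c), float("inf")) <= g:
            continue               # stale queue entry
        best[(r, c)] = g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                ng = g + cell_cost(nr, nc)
                heapq.heappush(frontier, (ng + h(nr, nc), ng, (nr, nc), path + [(nr, nc)]))
    return float("inf"), []
```

Raising the weight on a risk layer steers the planner around hazardous cells even when that lengthens the route, which is the trade-off the abstract describes.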