AITopics

doi: 10.1109/BIBM62325.2024.10822860

2501.1016

Country:

Europe > Spain > Castile and León > Salamanca Province > Salamanca (0.05)
Europe > United Kingdom > Scotland (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.46)

Adhikari, Andrick, Das, Sanchari, Dewri, Rinku

Natural Language Processing of Privacy Policies: A Survey

arXiv.org Artificial IntelligenceJan-17-2025

Natural Language Processing (NLP) is an essential subset of artificial intelligence. It has become effective in several domains, such as healthcare, finance, and media, to identify perceptions, opinions, and misuse, among others. Privacy is no exception, and initiatives have been taken to address the challenges of usable privacy notifications to users with the help of NLP. To this aid, we conduct a literature review by analyzing 109 papers at the intersection of NLP and privacy policies. First, we provide a brief introduction to privacy policies and discuss various facets of associated problems, which necessitate the application of NLP to elevate the current state of privacy notices and disclosures to users. Subsequently, we a) provide an overview of the implementation and effectiveness of NLP approaches for better privacy policy communication; b) identify the methodologies that can be further enhanced to provide robust privacy policies; and c) identify the gaps in the current state-of-the-art research. Our systematic analysis reveals that several research papers focus on annotating and classifying privacy texts for analysis but need to adequately dwell on other aspects of NLP applications, such as summarization. More specifically, ample research opportunities exist in this domain, covering aspects such as corpus generation, summarization vectors, contextualized word embedding, identification of privacy-relevant statement categories, fine-grained classification, and domain-specific model tuning.

information retrieval, large language model, machine learning, (20 more...)

2501.10319

Country:

North America > United States > Colorado > Denver County > Denver (0.04)
Oceania > Palau (0.04)
North America > United States > Virginia > Fairfax County > Fairfax (0.04)
(6 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(5 more...)

Samir, Asif Mohammed, Rahman, Mohammad Masudur

Improved IR-based Bug Localization with Intelligent Relevance Feedback

arXiv.org Artificial IntelligenceJan-17-2025

Software bugs pose a significant challenge during development and maintenance, and practitioners spend nearly 50% of their time dealing with bugs. Many existing techniques adopt Information Retrieval (IR) to localize a reported bug using textual and semantic relevance between bug reports and source code. However, they often struggle to bridge a critical gap between bug reports and code that requires in-depth contextual understanding, which goes beyond textual or semantic relevance. In this paper, we present a novel technique for bug localization - BRaIn - that addresses the contextual gaps by assessing the relevance between bug reports and code with Large Language Models (LLM). It then leverages the LLM's feedback (a.k.a., Intelligent Relevance Feedback) to reformulate queries and re-rank source documents, improving bug localization. We evaluate BRaIn using a benchmark dataset, Bench4BL, and three performance metrics and compare it against six baseline techniques from the literature. Our experimental results show that BRaIn outperforms baselines by 87.6%, 89.5%, and 48.8% margins in MAP, MRR, and HIT@K, respectively. Additionally, it can localize approximately 52% of bugs that cannot be localized by the baseline techniques due to the poor quality of corresponding bug reports. By addressing the contextual gaps and introducing Intelligent Relevance Feedback, BRaIn advances not only theory but also improves IR-based bug localization.

large language model, machine learning, natural language, (17 more...)

2501.10542

Country:

North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.47)

Severo, Daniel, Ottaviano, Giuseppe, Muckley, Matthew, Ullrich, Karen, Douze, Matthijs

Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search

arXiv.org Artificial IntelligenceJan-16-2025

Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach available at https://github.com/facebookresearch/vector_db_id_compression.

artificial intelligence, information retrieval, natural language, (17 more...)

2501.10479

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.62)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.61)

arXiv.org Artificial IntelligenceJan-16-2025

KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity Recognition and Normalization for Dysmorphology Physical Examination Reports

Kim, Hajung, Kim, Chanhwi, Sohn, Jiwoong, Beck, Tim, Rei, Marek, Kim, Sunkyu, Simpson, T Ian, Posma, Joram M, Lain, Antoine, Sung, Mujeen, Kang, Jaewoo

The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6\% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9\%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.

entity recognition and normalization, sapbert, synonym marginalization, (7 more...)

doi: 10.5281/zenodo.10104803

2501.09744

Country: Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Diagnostic Medicine (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)

Burger, Christopher, Walter, Charles

Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods

Advances in the effectiveness of machine learning models have come at the cost of enormous complexity resulting in a poor understanding of how they function. Local surrogate methods have been used to approximate the workings of these complex models, but recent work has revealed their vulnerability to adversarial attacks where the explanation produced is appreciably different while the meaning and structure of the complex model's output remains similar. This prior work has focused on the existence of these weaknesses but not on their magnitude. Here we explore using an alternate search method with the goal of finding minimum viable perturbations, the fewest perturbations necessary to achieve a fixed similarity value between the original and altered text's explanation. Intuitively, a method that requires fewer perturbations to expose a given level of instability is inferior to one which requires more. This nuance allows for superior comparisons of the stability of explainability methods.

adversarial explainable ai, search method, stability estimate

2501.09006

Genre: Research Report (0.69)

Technology:

Information Technology > Information Management > Search (0.79)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.60)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.60)
(2 more...)

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Dong, Kuicai, Chang, Yujing, Goh, Xin Deik, Li, Dexun, Tang, Ruiming, Liu, Yong

Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named as MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) MMDocIR train set can effectively benefit the training process of multi-modal document retrieval and (iii) text retrievers leveraging on VLM-text perform much better than those using OCR-text. These findings underscores the potential advantages of integrating visual elements for multi-modal document retrieval.

information, layout, retrieval, (13 more...)

2501.08828

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > District of Columbia > Washington (0.04)
(18 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

OpenMLDB: A Real-Time Relational Data Feature Computation System for Online ML

Zhou, Xuanhe, Zhou, Wei, Qi, Liguo, Zhang, Hao, Chen, Dihao, He, Bingsheng, Lu, Mian, Li, Guoliang, Wu, Fan, Chen, Yuqiang

Efficient and consistent feature computation is crucial for a wide range of online ML applications. Typically, feature computation is divided into two distinct phases, i.e., offline stage for model training and online stage for model serving. These phases often rely on execution engines with different interface languages and function implementations, causing significant inconsistencies. Moreover, many online ML features involve complex time-series computations (e.g., functions over varied-length table windows) that differ from standard streaming and analytical queries. Existing data processing systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for these computations, making them unsuitable for real-time online ML applications that demand timely feature updates. This paper presents OpenMLDB, a feature computation system deployed in 4Paradigm's SageOne platform and over 100 real scenarios. Technically, OpenMLDB first employs a unified query plan generator for consistent computation results across the offline and online stages, significantly reducing feature deployment overhead. Second, OpenMLDB provides an online execution engine that resolves performance bottlenecks caused by long window computations (via pre-aggregation) and multi-table window unions (via data self-adjusting). It also provides a high-performance offline execution engine with window parallel optimization and time-aware data skew resolving. Third, OpenMLDB features a compact data format and stream-focused indexing to maximize memory usage and accelerate data access. Evaluations in testing and real workloads reveal significant performance improvements and resource savings compared to the baseline systems. The open community of OpenMLDB now has over 150 contributors and gained 1.6k stars on GitHub.

computation, openmldb, optimization, (16 more...)

2501.08591

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(9 more...)

Genre: Research Report (0.82)

Industry:

Banking & Finance (0.67)
Information Technology (0.67)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.67)

Foundations of Large Language Models

Xiao, Tong, Zhu, Jingbo

The development of neural sequence models, such as Transformers [Vaswani et al., 2017], along with the improvements in large-scale self-supervised learning, has opened the door to universal language understanding and generation. This achievement is largely motivated by pre-training: we separate common components from many neural network-based systems, and then train them on huge amounts of unlabeled data using self-supervision. These pre-trained models serve as foundation models that can be easily adapted to different tasks via fine-tuning or prompting. As a result, the paradigm of NLP has been enormously changed. In many cases, large-scale supervised learning for specific tasks is no longer required, and instead, we only need to adapt pre-trained foundation models.

large language model, machine learning, reinforcement learning, (30 more...)

2501.09223

Country:

Europe (1.00)
North America > United States (0.92)

Genre:

Workflow (1.00)
Research Report > Promising Solution (1.00)
Overview (1.00)
Research Report > New Finding (0.92)

Industry:

Leisure & Entertainment > Sports (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(10 more...)

Li, Siran, Stenzel, Linus, Eickhoff, Carsten, Bahrainian, Seyed Ali

Enhancing Retrieval-Augmented Generation: A Study of Best Practices

arXiv.org Artificial IntelligenceJan-13-2025

Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode retrieving relevant context at sentence-level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available.

computational linguistic, large language model, machine learning, (18 more...)

2501.07391

Country: Europe (0.28)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
(2 more...)