elasticsearch
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Marinas, Inés Altemir, Kucherenko, Anastasiia, Sternfeld, Alexander, Kucharavy, Andrei
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (11 more...)
- Materials > Chemicals (1.00)
- Information Technology (0.93)
- Health & Medicine (0.68)
- (3 more...)
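The parallel-index approach described in the abstract can be sketched as the request bodies one would send to Elasticsearch: one index per data partition so ingest runs in parallel, plus a wildcard search that fans out across all of them. The index names, shard counts, and settings below are illustrative assumptions, not the Apertus pipeline's actual configuration.

```python
import json

# Hypothetical per-partition index settings for parallel indexing:
# each data partition gets its own index so ingestion can run in parallel.
def index_settings(num_shards=1, num_replicas=0):
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,
                "number_of_replicas": num_replicas,
                "refresh_interval": "-1",  # disable refresh during bulk ingest
            }
        },
        "mappings": {"properties": {"text": {"type": "text"}}},
    }

# One index per dataset partition (names are made up for illustration).
partitions = [f"training-data-{i:03d}" for i in range(4)]
bodies = {name: index_settings(num_shards=2) for name in partitions}

# A single search can then fan out across all partitions via a wildcard
# index pattern, which Elasticsearch executes in parallel.
search_request = {
    "index": "training-data-*",
    "body": {"query": {"match": {"text": "example query"}}, "size": 10},
}

print(json.dumps(search_request["body"], indent=2))
```

With the official client, each body would be sent via `indices.create` and `search`; disabling `refresh_interval` during bulk ingest is a standard Elasticsearch optimization, re-enabled once indexing finishes.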
Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Marinas, Inés Altemir, Kucherenko, Anastasiia, Kucharavy, Andrei
Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an Elasticsearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
- Information Technology (0.93)
- Health & Medicine > Therapeutic Area > Immunology (0.71)
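The millisecond-scale latencies reported above refer to full-text queries of roughly this shape; a minimal sketch follows. The index name and field are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical query body for searching an indexed FineWeb-2 partition
# for an exact phrase, returning highlighted snippets for review.
def phrase_query(phrase, size=10):
    return {
        "query": {"match_phrase": {"text": phrase}},
        "highlight": {"fields": {"text": {}}},
        "size": size,
    }

q = phrase_query("chemical synthesis")
# With the official client this would be sent as, e.g.:
#   es.search(index="fineweb2-deu", body=q)   # index name is illustrative
print(q["query"]["match_phrase"]["text"])
```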
OmniLLP: Enhancing LLM-based Log Level Prediction with Context-Aware Retrieval
Ouatiti, Youssef Esseddiq, Sayagh, Mohammed, Adams, Bram, Hassan, Ahmed E.
Developers insert logging statements in source code to capture relevant runtime information essential for maintenance and debugging activities. Log level choice is an integral, yet tricky part of the logging activity as it controls log verbosity and therefore influences systems' observability and performance. Recent advances in ML-based log level prediction have leveraged large language models (LLMs) to propose log level predictors (LLPs) that demonstrated promising performance improvements (AUC between 0.64 and 0.8). Nevertheless, current LLM-based LLPs rely on randomly selected in-context examples, overlooking the structure and the diverse logging practices within modern software projects. In this paper, we propose OmniLLP, a novel LLP enhancement framework that clusters source files based on (1) semantic similarity reflecting the code's functional purpose, and (2) developer ownership cohesion. By retrieving in-context learning examples exclusively from these semantic- and ownership-aware clusters, we aim to provide more coherent prompts to LLPs leveraging LLMs, thereby improving their predictive accuracy. Our results show that both semantic and ownership-aware clusterings statistically significantly improve the accuracy (by up to 8% AUC) of the evaluated LLM-based LLPs compared to random predictors (i.e., leveraging randomly selected in-context examples from the whole project). Additionally, our approach that combines the semantic and ownership signals for in-context prediction achieves an impressive 0.88 to 0.96 AUC across our evaluated projects. Our findings highlight the value of integrating software engineering-specific context, such as code semantics and developer ownership signals, into LLM-LLPs, offering developers a more accurate, contextually aware approach to logging and therefore enhancing system maintainability and observability.
- Europe > Netherlands > South Holland > Leiden (0.05)
- North America > Canada > Ontario > Kingston (0.04)
- North America > United States > California > Los Angeles County > Los Angeles > Hollywood (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
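The ownership-aware retrieval idea in the abstract can be sketched as a toy: in-context examples are drawn only from files whose main contributor matches the target file's owner. The corpus and grouping rule below are invented for illustration; OmniLLP's actual clustering also incorporates semantic similarity, which is omitted here.

```python
from collections import defaultdict

# Toy corpus: (file, main owner, logging example as (message, level)).
examples = [
    ("auth.py",  "alice", ("failed login for %s", "WARN")),
    ("auth.py",  "alice", ("token refreshed",     "DEBUG")),
    ("db.py",    "bob",   ("connection lost",     "ERROR")),
    ("cache.py", "alice", ("cache miss for %s",   "DEBUG")),
]

# Cluster examples by owner (ownership cohesion; semantic clustering omitted).
by_owner = defaultdict(list)
for fname, owner, ex in examples:
    by_owner[owner].append(ex)

def retrieve_in_context(target_owner, k=2):
    """Return up to k in-context examples from the target owner's cluster,
    instead of sampling randomly from the whole project."""
    return by_owner.get(target_owner, [])[:k]

# Examples that would go into the prompt for a file owned by "alice".
prompt_examples = retrieve_in_context("alice")
```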
Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking
Sager, Pascal J., Kamaraj, Ashwini, Grewe, Benjamin F., Stadelmann, Thilo
We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition, which focuses on retrieving relevant scientific literature for given social media posts. To address this task, we propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking, enabling robust retrieval that bridges the informal-to-formal language gap. Specifically, we combine BM25-based keyword matching with a FAISS vector store using a fine-tuned INF-Retriever-v1 model for dense semantic retrieval. BM25 returns the top 30 candidates, and semantic search yields 100 candidates, which are then merged and re-ranked via a large language model (LLM)-based cross-encoder. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set, securing the 1st position on the development leaderboard and ranking 3rd on the test leaderboard (out of 31 teams), with a relative performance gap of only 2 percentage points compared to the top-ranked system. We achieve this strong performance by running open-source models locally and without external training data, highlighting the effectiveness of a carefully designed and fine-tuned retrieval pipeline.
- Europe > Switzerland > Zürich > Zürich (0.15)
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (8 more...)
- Health & Medicine (0.95)
- Media > News (0.46)
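The merge-and-re-rank step of the hybrid pipeline above can be sketched in pure Python. The candidate lists and scores are invented, the lists are far shorter than the paper's top-30/top-100, and the LLM cross-encoder is replaced by a dummy scoring function to keep the sketch self-contained.

```python
# Candidate document IDs from the two retrievers (invented for illustration):
bm25_top = ["p3", "p7", "p1"]            # lexical (BM25) candidates
dense_top = ["p7", "p2", "p3", "p9"]     # dense (FAISS) candidates

# Merge: union of both candidate sets, preserving first-seen order.
merged, seen = [], set()
for pid in bm25_top + dense_top:
    if pid not in seen:
        seen.add(pid)
        merged.append(pid)

# Stand-in for the LLM cross-encoder: any function scoring
# (query, candidate) pairs for relevance would slot in here.
def cross_encoder_score(query, pid):
    return {"p1": 0.2, "p2": 0.9, "p3": 0.6, "p7": 0.8, "p9": 0.1}[pid]

query = "study on vaccine efficacy"
reranked = sorted(merged, key=lambda p: cross_encoder_score(query, p),
                  reverse=True)
print(reranked)  # candidates ordered by cross-encoder relevance
```

The design point the sketch captures is that merging happens on the union of candidates, so documents surfaced by only one retriever still reach the re-ranker.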
Practical Performance of a Distributed Processing Framework for Machine-Learning-based NIDS
Kajiura, Maho, Nakamura, Junya
Network Intrusion Detection Systems (NIDSs) detect intrusion attacks in network traffic. In particular, machine-learning-based NIDSs have attracted attention because of their high detection rates for unknown attacks. A distributed processing framework for machine-learning-based NIDSs employing a scalable distributed stream processing system has been proposed in the literature. However, its performance when machine-learning-based classifiers are implemented has not been comprehensively evaluated. In this study, we implement five representative classifiers (Decision Tree, Random Forest, Naive Bayes, SVM, and kNN) based on this framework and evaluate their throughput and latency. Through experimental measurements, we investigate the differences in processing performance among these classifiers and identify the bottlenecks in the framework's processing performance.
- Asia > Japan (0.04)
- North America > Canada > New Brunswick > Fredericton (0.04)
- Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
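The throughput/latency evaluation described above can be sketched with a stdlib-only harness. The "classifier" here is a trivial threshold stand-in, since the paper's framework and trained models are not part of this listing; the real classifiers (Decision Tree, Random Forest, Naive Bayes, SVM, kNN) would plug in behind the same interface.

```python
import time

# Trivial stand-in classifier: flags a flow as an attack if its packet
# count exceeds a threshold.
def classify(flow):
    return flow["packets"] > 1000

flows = [{"packets": p} for p in range(2000)]

start = time.perf_counter()
latencies = []
for flow in flows:
    t0 = time.perf_counter()
    classify(flow)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

throughput = len(flows) / elapsed          # flows classified per second
avg_latency = sum(latencies) / len(latencies)
print(f"throughput: {throughput:.0f} flows/s, "
      f"avg latency: {avg_latency * 1e6:.1f} us")
```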
Enhancing Cloud-Based Large Language Model Processing with Elasticsearch and Transformer Models
Ni, Chunhe, Wu, Jiang, Wang, Hongbo, Lu, Wenran, Zhang, Chenwei
Large Language Models (LLMs) are a class of generative AI models built using the Transformer network, capable of leveraging vast datasets to identify, summarize, translate, predict, and generate language. LLMs promise to revolutionize society, yet training these foundational models poses immense challenges. Semantic/vector search within large language models is a potent technique that can significantly enhance search result accuracy and relevance. Unlike traditional keyword-based search methods, semantic search utilizes the meaning and context of words to grasp the intent behind queries and deliver more precise outcomes. Elasticsearch emerges as one of the most popular tools for implementing semantic search -- an exceptionally scalable and robust search engine designed for indexing and searching extensive datasets. In this article, we delve into the fundamentals of semantic search and explore how to harness Elasticsearch and Transformer models to bolster large language model processing paradigms. We gain a comprehensive understanding of semantic search principles and acquire practical skills for implementing semantic search in real-world model application scenarios.
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Texas > Dallas County > Richardson (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
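A minimal sketch of what the Elasticsearch side of such a semantic search looks like in the 8.x API: a `dense_vector` mapping for Transformer embeddings, and a kNN request whose query vector is produced client-side by the embedding model. The index layout, field names, and vector dimension are assumptions, not anything specified in the article.

```python
# Mapping for an index with a dense_vector field holding Transformer
# embeddings (names and dimension are illustrative).
mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

# A kNN search request: the query text is embedded by the Transformer
# model, and the resulting vector goes in "query_vector".
def knn_request(query_vector, k=10):
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,  # candidates considered per shard
        }
    }

req = knn_request([0.0] * 384)  # placeholder vector; a real one comes
                                # from the embedding model
```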
Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification
Ahmadi, Sareh, Shah, Aditya, Fox, Edward
This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods, when constraints on human resources limit the number of annotations. An additional challenge is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our situation, where annotation occurs over a long time period, the selection of texts to be annotated can be made in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, to identify optimal sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers. We integrate vector search, semantic search, and machine learning classifiers to yield a good solution. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
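A toy, stdlib-only sketch of the selection idea: rank unlabeled texts by word overlap with terms predictive of the rare positive class, and send the top matches for annotation. The paper's actual pipeline derives its query terms from SHAP and runs them through Elasticsearch and semantic search; the overlap score and data below are deliberately simple stand-ins.

```python
def overlap_score(query_terms, text):
    """Fraction of query terms present in the text (crude lexical relevance)."""
    words = set(text.lower().split())
    return sum(t in words for t in query_terms) / len(query_terms)

# Terms that (hypothetically) SHAP flagged as predictive of the rare
# positive class, used here as a retrieval query.
query_terms = ["diet", "exercise", "weight"]

unlabeled = [
    "I plan to start a new exercise routine and track my weight",
    "The weather will be sunny tomorrow",
    "A diet change could help manage my diabetes",
]

# Pick the texts most likely to be positives for the next annotation batch,
# so annotators see more minority-class candidates than random sampling gives.
batch = sorted(unlabeled,
               key=lambda t: overlap_score(query_terms, t),
               reverse=True)[:2]
```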
On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report
Bendimerad, Anes, Remil, Youcef, Mathonat, Romain, Kaytoue, Mehdi
Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France (0.05)
- North America > United States > District of Columbia > Washington (0.04)
- (4 more...)
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.93)
How to Select the Right EC2 Instance – A Guide to EC2 Instances and Their Capabilities
EC2 (Elastic Compute Cloud) is the most widely-used compute service from AWS. It's also one of the oldest services launched by AWS, as it was started in 2006. In this article, I will go through some things you should consider when selecting an EC2 instance. You can think of an EC2 instance as not too different from your personal computer. These three questions should also cross your mind when selecting an EC2 instance. The difference being, you are only renting the instance from AWS, instead of buying it as you would with a personal computer.
Synthesis of Adversarial DDOS Attacks Using Tabular Generative Adversarial Networks
Hassan, Abdelmageed Ahmed, Hussein, Mohamed Sayed, AboMoustafa, Ahmed Shehata, Elmowafy, Sarah Hossam
Network Intrusion Detection Systems (NIDS) are tools widely used to keep computer networks and information systems secure, preventing malicious traffic from penetrating them by flagging when somebody is trying to break into the system. Considerable effort has gone into these systems, and the results achieved so far are quite satisfying; however, new types of attacks stand out as attack technology keeps evolving. Among them are attacks based on Generative Adversarial Networks (GANs), which can evade machine-learning IDSs, leaving them vulnerable. In this work, adversarial attacks are synthesized from real DDoS attacks using GANs and run against the IDS; the objective is to discover how these systems react to synthesized attacks. With unsupervised machine learning, IDS systems can predict attacks that aren't labeled, but such techniques are prone to false positives [3]. Cyber attacks are increasingly sophisticated, and hackers keep adapting their strategies to exploit every possible vulnerability; this gives attackers the chance to mislead models into their desired misclassification by using adversarial examples.