elasticsearch
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Marinas, Inés Altemir, Kucherenko, Anastasiia, Sternfeld, Alexander, Kucharavy, Andrei
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (11 more...)
- Materials > Chemicals (1.00)
- Information Technology (0.93)
- Health & Medicine (0.68)
- (3 more...)
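The parallel-index approach described in the abstract can be sketched as the request bodies one would send to Elasticsearch: one index per data partition so ingest runs in parallel, plus a wildcard search that fans out across all of them. The index names, shard counts, and settings below are illustrative assumptions, not the Apertus pipeline's actual configuration.

```python
import json

# Hypothetical per-partition index settings for parallel indexing:
# each data partition gets its own index so ingestion can run in parallel.
def index_settings(num_shards=1, num_replicas=0):
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,
                "number_of_replicas": num_replicas,
                "refresh_interval": "-1",  # disable refresh during bulk ingest
            }
        },
        "mappings": {"properties": {"text": {"type": "text"}}},
    }

# One index per dataset partition (names are made up for illustration).
partitions = [f"training-data-{i:03d}" for i in range(4)]
bodies = {name: index_settings(num_shards=2) for name in partitions}

# A single search can then fan out across all partitions via a wildcard
# index pattern, which Elasticsearch executes in parallel.
search_request = {
    "index": "training-data-*",
    "body": {"query": {"match": {"text": "example query"}}, "size": 10},
}

print(json.dumps(search_request["body"], indent=2))
```

With the official client, each body would be sent via `indices.create` and `search`; disabling `refresh_interval` during bulk ingest is a standard Elasticsearch optimization, re-enabled once indexing finishes.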
Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Marinas, Inés Altemir, Kucherenko, Anastasiia, Kucharavy, Andrei
Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an Elasticsearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
- Information Technology (0.93)
- Health & Medicine > Therapeutic Area > Immunology (0.71)
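The millisecond-scale latencies reported above refer to full-text queries of roughly this shape; a minimal sketch follows. The index name and field are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical query body for searching an indexed FineWeb-2 partition
# for an exact phrase, returning highlighted snippets for review.
def phrase_query(phrase, size=10):
    return {
        "query": {"match_phrase": {"text": phrase}},
        "highlight": {"fields": {"text": {}}},
        "size": size,
    }

q = phrase_query("chemical synthesis")
# With the official client this would be sent as, e.g.:
#   es.search(index="fineweb2-deu", body=q)   # index name is illustrative
print(q["query"]["match_phrase"]["text"])
```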
OmniLLP: Enhancing LLM-based Log Level Prediction with Context-Aware Retrieval
Ouatiti, Youssef Esseddiq, Sayagh, Mohammed, Adams, Bram, Hassan, Ahmed E.
Developers insert logging statements in source code to capture relevant runtime information essential for maintenance and debugging activities. Log level choice is an integral, yet tricky part of the logging activity as it controls log verbosity and therefore influences systems' observability and performance. Recent advances in ML-based log level prediction have leveraged large language models (LLMs) to propose log level predictors (LLPs) that demonstrated promising performance improvements (AUC between 0.64 and 0.8). Nevertheless, current LLM-based LLPs rely on randomly selected in-context examples, overlooking the structure and the diverse logging practices within modern software projects. In this paper, we propose OmniLLP, a novel LLP enhancement framework that clusters source files based on (1) semantic similarity reflecting the code's functional purpose, and (2) developer ownership cohesion. By retrieving in-context learning examples exclusively from these semantic- and ownership-aware clusters, we aim to provide more coherent prompts to LLPs leveraging LLMs, thereby improving their predictive accuracy. Our results show that both semantic and ownership-aware clusterings statistically significantly improve the accuracy (by up to 8% AUC) of the evaluated LLM-based LLPs compared to random predictors (i.e., leveraging randomly selected in-context examples from the whole project). Additionally, our approach that combines the semantic and ownership signals for in-context prediction achieves an impressive 0.88 to 0.96 AUC across our evaluated projects. Our findings highlight the value of integrating software engineering-specific context, such as code semantics and developer ownership signals, into LLM-LLPs, offering developers a more accurate, contextually aware approach to logging and therefore enhancing system maintainability and observability.
- Europe > Netherlands > South Holland > Leiden (0.05)
- North America > Canada > Ontario > Kingston (0.04)
- North America > United States > California > Los Angeles County > Los Angeles > Hollywood (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
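The ownership-aware retrieval idea in the abstract can be sketched as a toy: in-context examples are drawn only from files whose main contributor matches the target file's owner. The corpus and grouping rule below are invented for illustration; OmniLLP's actual clustering also incorporates semantic similarity, which is omitted here.

```python
from collections import defaultdict

# Toy corpus: (file, main owner, logging example as (message, level)).
examples = [
    ("auth.py",  "alice", ("failed login for %s", "WARN")),
    ("auth.py",  "alice", ("token refreshed",     "DEBUG")),
    ("db.py",    "bob",   ("connection lost",     "ERROR")),
    ("cache.py", "alice", ("cache miss for %s",   "DEBUG")),
]

# Cluster examples by owner (ownership cohesion; semantic clustering omitted).
by_owner = defaultdict(list)
for fname, owner, ex in examples:
    by_owner[owner].append(ex)

def retrieve_in_context(target_owner, k=2):
    """Return up to k in-context examples from the target owner's cluster,
    instead of sampling randomly from the whole project."""
    return by_owner.get(target_owner, [])[:k]

# Examples that would go into the prompt for a file owned by "alice".
prompt_examples = retrieve_in_context("alice")
```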
Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking
Sager, Pascal J., Kamaraj, Ashwini, Grewe, Benjamin F., Stadelmann, Thilo
We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition, which focuses on retrieving relevant scientific literature for given social media posts. To address this task, we propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking, enabling robust retrieval that bridges the informal-to-formal language gap. Specifically, we combine BM25-based keyword matching with a FAISS vector store using a fine-tuned INF-Retriever-v1 model for dense semantic retrieval. BM25 returns the top 30 candidates, and semantic search yields 100 candidates, which are then merged and re-ranked via a large language model (LLM)-based cross-encoder. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set, securing the 1st position on the development leaderboard and ranking 3rd on the test leaderboard (out of 31 teams), with a relative performance gap of only 2 percentage points compared to the top-ranked system. We achieve this strong performance by running open-source models locally and without external training data, highlighting the effectiveness of a carefully designed and fine-tuned retrieval pipeline.
- Europe > Switzerland > Zürich > Zürich (0.15)
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (8 more...)
- Health & Medicine (0.95)
- Media > News (0.46)
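The merge-and-re-rank step of the hybrid pipeline above can be sketched in pure Python. The candidate lists and scores are invented, the lists are far shorter than the paper's top-30/top-100, and the LLM cross-encoder is replaced by a dummy scoring function to keep the sketch self-contained.

```python
# Candidate document IDs from the two retrievers (invented for illustration):
bm25_top = ["p3", "p7", "p1"]            # lexical (BM25) candidates
dense_top = ["p7", "p2", "p3", "p9"]     # dense (FAISS) candidates

# Merge: union of both candidate sets, preserving first-seen order.
merged, seen = [], set()
for pid in bm25_top + dense_top:
    if pid not in seen:
        seen.add(pid)
        merged.append(pid)

# Stand-in for the LLM cross-encoder: any function scoring
# (query, candidate) pairs for relevance would slot in here.
def cross_encoder_score(query, pid):
    return {"p1": 0.2, "p2": 0.9, "p3": 0.6, "p7": 0.8, "p9": 0.1}[pid]

query = "study on vaccine efficacy"
reranked = sorted(merged, key=lambda p: cross_encoder_score(query, p),
                  reverse=True)
print(reranked)  # candidates ordered by cross-encoder relevance
```

The design point the sketch captures is that merging happens on the union of candidates, so documents surfaced by only one retriever still reach the re-ranker.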
Practical Performance of a Distributed Processing Framework for Machine-Learning-based NIDS
Kajiura, Maho, Nakamura, Junya
Network Intrusion Detection Systems (NIDSs) detect intrusion attacks in network traffic. In particular, machine-learning-based NIDSs have attracted attention because of their high detection rates for unknown attacks. A distributed processing framework for machine-learning-based NIDSs employing a scalable distributed stream processing system has been proposed in the literature. However, its performance when machine-learning-based classifiers are implemented has not been comprehensively evaluated. In this study, we implement five representative classifiers (Decision Tree, Random Forest, Naive Bayes, SVM, and kNN) based on this framework and evaluate their throughput and latency. Through experimental measurements, we investigate the differences in processing performance among these classifiers and identify the bottlenecks in the framework's processing performance.
- Asia > Japan (0.04)
- North America > Canada > New Brunswick > Fredericton (0.04)
- Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
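The throughput/latency evaluation described above can be sketched with a stdlib-only harness. The "classifier" here is a trivial threshold stand-in, since the paper's framework and trained models are not part of this listing; the real classifiers (Decision Tree, Random Forest, Naive Bayes, SVM, kNN) would plug in behind the same interface.

```python
import time

# Trivial stand-in classifier: flags a flow as an attack if its packet
# count exceeds a threshold.
def classify(flow):
    return flow["packets"] > 1000

flows = [{"packets": p} for p in range(2000)]

start = time.perf_counter()
latencies = []
for flow in flows:
    t0 = time.perf_counter()
    classify(flow)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

throughput = len(flows) / elapsed          # flows classified per second
avg_latency = sum(latencies) / len(latencies)
print(f"throughput: {throughput:.0f} flows/s, "
      f"avg latency: {avg_latency * 1e6:.1f} us")
```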
Enhancing Cloud-Based Large Language Model Processing with Elasticsearch and Transformer Models
Ni, Chunhe, Wu, Jiang, Wang, Hongbo, Lu, Wenran, Zhang, Chenwei
Large Language Models (LLMs) are a class of generative AI models built using the Transformer network, capable of leveraging vast datasets to identify, summarize, translate, predict, and generate language. LLMs promise to revolutionize society, yet training these foundational models poses immense challenges. Semantic/vector search within large language models is a potent technique that can significantly enhance search result accuracy and relevance. Unlike traditional keyword-based search methods, semantic search utilizes the meaning and context of words to grasp the intent behind queries and deliver more precise outcomes. Elasticsearch emerges as one of the most popular tools for implementing semantic search -- an exceptionally scalable and robust search engine designed for indexing and searching extensive datasets. In this article, we delve into the fundamentals of semantic search and explore how to harness Elasticsearch and Transformer models to bolster large language model processing paradigms. We gain a comprehensive understanding of semantic search principles and acquire practical skills for implementing semantic search in real-world model application scenarios.
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Texas > Dallas County > Richardson (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
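A minimal sketch of what the Elasticsearch side of such a semantic search looks like in the 8.x API: a `dense_vector` mapping for Transformer embeddings, and a kNN request whose query vector is produced client-side by the embedding model. The index layout, field names, and vector dimension are assumptions, not anything specified in the article.

```python
# Mapping for an index with a dense_vector field holding Transformer
# embeddings (names and dimension are illustrative).
mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

# A kNN search request: the query text is embedded by the Transformer
# model, and the resulting vector goes in "query_vector".
def knn_request(query_vector, k=10):
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,  # candidates considered per shard
        }
    }

req = knn_request([0.0] * 384)  # placeholder vector; a real one comes
                                # from the embedding model
```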
Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification
Ahmadi, Sareh, Shah, Aditya, Fox, Edward
This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods, when constraints on human resources limit the number of annotations. An additional challenge is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our situation, where annotation occurs over a long time period, the selection of texts to be annotated can be made in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, to identify optimal sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers. We integrate vector search, semantic search, and machine learning classifiers to yield a good solution. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
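A toy, stdlib-only sketch of the selection idea: rank unlabeled texts by word overlap with terms predictive of the rare positive class, and send the top matches for annotation. The paper's actual pipeline derives its query terms from SHAP and runs them through Elasticsearch and semantic search; the overlap score and data below are deliberately simple stand-ins.

```python
def overlap_score(query_terms, text):
    """Fraction of query terms present in the text (crude lexical relevance)."""
    words = set(text.lower().split())
    return sum(t in words for t in query_terms) / len(query_terms)

# Terms that (hypothetically) SHAP flagged as predictive of the rare
# positive class, used here as a retrieval query.
query_terms = ["diet", "exercise", "weight"]

unlabeled = [
    "I plan to start a new exercise routine and track my weight",
    "The weather will be sunny tomorrow",
    "A diet change could help manage my diabetes",
]

# Pick the texts most likely to be positives for the next annotation batch,
# so annotators see more minority-class candidates than random sampling gives.
batch = sorted(unlabeled,
               key=lambda t: overlap_score(query_terms, t),
               reverse=True)[:2]
```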
On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report
Bendimerad, Anes, Remil, Youcef, Mathonat, Romain, Kaytoue, Mehdi
Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France (0.05)
- North America > United States > District of Columbia > Washington (0.04)
- (4 more...)
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.93)
How to Select the Right EC2 Instance – A Guide to EC2 Instances and Their Capabilities
EC2 (Elastic Compute Cloud) is the most widely-used compute service from AWS. It's also one of the oldest services launched by AWS, as it was started in 2006. In this article, I will go through some things you should consider when selecting an EC2 instance. You can think of an EC2 instance as not too different from your personal computer. These three questions should also cross your mind when selecting an EC2 instance. The difference being, you are only renting the instance from AWS, instead of buying it as you would with a personal computer.
Synthesis of Adversarial DDOS Attacks Using Tabular Generative Adversarial Networks
Hassan, Abdelmageed Ahmed, Hussein, Mohamed Sayed, AboMoustafa, Ahmed Shehata, Elmowafy, Sarah Hossam
Network Intrusion Detection Systems (NIDS) are tools widely used to keep computer networks and information systems secure, preventing malicious traffic from penetrating them by flagging when somebody is trying to break into the system. Considerable effort has gone into these systems, and the results achieved so far are quite satisfying; however, new types of attacks stand out as attack technology keeps evolving. Among them are attacks based on Generative Adversarial Networks (GANs), which can evade machine-learning IDSs, leaving them vulnerable. In this work, adversarial attacks are synthesized from real DDoS attacks using GANs and run against the IDS; the objective is to discover how these systems react to synthesized attacks. With unsupervised machine learning, IDS systems can predict attacks that aren't labeled, but such techniques are prone to false positives [3]. Cyber attacks are increasingly sophisticated, and hackers keep adapting their strategies to exploit every possible vulnerability; this gives attackers the chance to mislead models into their desired misclassification by using adversarial examples.