AITopics | text corpus

Collaborating Authors

text corpus

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

EZYer: A simulacrum of high school with generative agent

Yang, Jinming, Ji, Zimu, Luo, Weiqi, Wang, Gaoxi, Ma, Bin, Deng, Yueling

arXiv.org Artificial IntelligenceDec-3-2025

With the rapid development of the online education and large language model, the existing educational tools still suffer from incomplete service, insufficient performance and weak interactivity in terms of courseware generation, interactive notes and quality assurance of content. In particular, the proposed generative agent EZYer : 1) Teacher Module: Integrating the Text Corpus retrieval and in-depth generation technologies, it automatically generates structured teaching materials and LaTeX Beamer courseware in line with the high school mathematics syllabus and supports user-defined image insertion. 2) Student Module: Throughout the collaborative interaction of the four roles of Teacher, Assistant, Top Student and Struggling Student, Note Taker summarizes and generates academic notes to enhance the depth and interest of learning. 3) Controller: set up keyword filtering system, content scoring system, role co-validation system, and dynamic content correction system. This ensure academic strictness and pedagogical propriety of EZYer inputs and outputs. In order to evaluate EZYer, this paper designs five-dimensional evaluation indexes of content accuracy, knowledge coverage, usability, formatting correctness and visual design and appeal, and scores 100 Beamer and Notes generated by EZYer by five large language models, separately, and the results show that the quality of EZYer-generated content is excellent and has a good application prospect.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2512.02561

Country: Asia > China (0.29)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Educational Setting > Online (1.00)
Education > Educational Setting > K-12 Education > Secondary School (0.63)
Education > Educational Technology > Educational Software > Computer Based Training (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Liu, Jiacheng, Blanton, Taylor, Elazar, Yanai, Min, Sewon, Chen, YenSung, Chheda-Kothary, Arnavi, Tran, Huy, Bischoff, Byron, Marsh, Eric, Schmitz, Michael, Trier, Cassidy, Sarnat, Aaron, James, Jenna, Borchardt, Jon, Kuehl, Bailey, Cheng, Evie, Farley, Karen, Sreeram, Sruthi, Anderson, Taira, Albright, David, Schoenick, Carissa, Soldaini, Luca, Groeneveld, Dirk, Pang, Rock Yuren, Koh, Pang Wei, Smith, Noah A., Lebrecht, Sophie, Choi, Yejin, Hajishirzi, Hannaneh, Farhadi, Ali, Dodge, Jesse

arXiv.org Artificial IntelligenceJul-9-2025

We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2504.07096

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.46)

Add feedback

Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task

Bazin, Guillaume, Tannier, Xavier, Adda, Fanny, Cohen, Ariel, Redjdal, Akram, Kempf, Emmanuelle

arXiv.org Artificial IntelligenceJun-17-2025

Rules could be an information extraction (IE) default option, compared to ML and LLMs in terms of sustainability, transferability, interpretability, and development burden. We suggest a sustainable and combined use of rules and ML as an IE method. Our approach starts with an exhaustive expert manual highlighting in a single working session of a representative subset of the data corpus. We developed and validated the feasibility and the performance metrics of the REST decision tool to help the annotator choose between rules as a by default option and ML for each entity of an IE task. REST makes the annotator visualize the characteristics of each entity formalization in the free texts and the expected rule development feasibility and IE performance metrics. ML is considered as a backup IE option and manual annotation for training is therefore minimized. The external validity of REST on a 12-entity use case showed good reproducibility.

artificial intelligence, natural language, text highlight, (18 more...)

arXiv.org Artificial Intelligence

2506.13177

Genre: Research Report (0.83)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Add feedback

Next Word Suggestion using Graph Neural Network

Magar, Abisha Thapa, Shakya, Anup

arXiv.org Artificial IntelligenceMay-16-2025

Language Modeling is a prevalent task in Natural Language Processing. The currently existing most recent and most successful language models often tend to build a massive model with billions of parameters, feed in a tremendous amount of text data, and train with enormous computation resources which require millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in coalition with LSTMs to predict the next word given a local context of preceding words. We test this on the custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.

corpus, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.09649

Genre: Research Report (0.82)

Industry: Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SAM Decoding: Speculative Decoding via Suffix Automaton

Hu, Yuxuan, Wang, Ke, Zhang, Xiaokang, Zhang, Fanjin, Li, Cuiping, Chen, Hong, Zhang, Jing

arXiv.org Artificial IntelligenceDec-16-2024

Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on incomplete retrieval resources, inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts suffix automaton (SAM) for efficient and accurate draft generation by utilizing common text corpus and dynamic text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step of SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at our \href{https://github.com/hyx1999/SAM-Decoding}{repository}.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.10666

Country:

North America > United States (0.14)
Asia > China > Beijing > Beijing (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Prompt-Efficient Fine-Tuning for GPT-like Deep Models to Reduce Hallucination and to Improve Reproducibility in Scientific Text Generation Using Stochastic Optimisation Techniques

Sulimov, Daniil

arXiv.org Artificial IntelligenceNov-10-2024

Large Language Models (LLMs) are increasingly adopted for complex scientific text generation tasks, yet they often suffer from limitations in accuracy, consistency, and hallucination control. This thesis introduces a Parameter-Efficient Fine-Tuning (PEFT) approach tailored for GPT-like models, aiming to mitigate hallucinations and enhance reproducibility, particularly in the computational domain of mass spectrometry. We implemented Low-Rank Adaptation (LoRA) adapters to refine GPT-2, termed MS-GPT, using a specialized corpus of mass spectrometry literature. Through novel evaluation methods applied to LLMs, including BLEU, ROUGE, and Perplexity scores, the fine-tuned MS-GPT model demonstrated superior text coherence and reproducibility compared to the baseline GPT-2, confirmed through statistical analysis with the Wilcoxon rank-sum test. Further, we propose a reproducibility metric based on cosine similarity of model outputs under controlled prompts, showcasing MS-GPT's enhanced stability. This research highlights PEFT's potential to optimize LLMs for scientific contexts, reducing computational costs while improving model reliability.

mass spectrometer, mass spectrometry, spectrometry, (13 more...)

arXiv.org Artificial Intelligence

2411.06445

Country:

Europe > Netherlands > South Holland > Delft (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre:

Research Report > Promising Solution (0.46)
Research Report > Experimental Study (0.34)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Open foundation models for Azerbaijani language

Isbarov, Jafar, Huseynova, Kavsar, Mammadov, Elvin, Hajili, Mammad

arXiv.org Artificial IntelligenceJul-2-2024

The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support.

azerbaijani, dataset, foundation model, (16 more...)

arXiv.org Artificial Intelligence

2407.02337

Country:

Asia > Azerbaijan (0.15)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale

Oppenlaender, Jonas, Hämäläinen, Joonas

arXiv.org Artificial IntelligenceDec-12-2023

Large language models (LLMs), such as ChatGPT and GPT-4, are gaining wide-spread real world use. Yet, these LLMs are closed source, and little is known about their performance in real-world use cases. In this paper, we apply and evaluate the combination of ChatGPT and GPT-4 for the real-world task of mining insights from a text corpus in order to identify research challenges in the field of HCI. We extract 4,392 research challenges in over 100 topics from the 2023 CHI conference proceedings and visualize the research challenges for interactive exploration. We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 makes an excellent cost-efficient means for analyzing a text corpus at scale. Cost-efficiency is key for flexibly prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs for mining insights in academia and practice.

llm, proceedings, research challenge, (13 more...)

arXiv.org Artificial Intelligence

2306.05036

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.05)
(19 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Information Technology (1.00)
Education (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

Chen, Yifei, Chen, Dapeng, Liu, Ruijin, Zhou, Sai, Xue, Wenyuan, Peng, Wei

arXiv.org Artificial IntelligenceNov-27-2023

Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However, most existing methods follow an "adapt then align" paradigm, which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper, we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning, we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities, we feed their text embeddings to a transformer-based video adapter as the queries, which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics, particularly when facing unfamiliar or unseen categories. ALT achieves competitive performance and superior generalizability while requiring significantly low computational costs. In fully supervised scenarios, it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. In 2-shot experiments, ALT outperforms the previous state-of-the-art by 7.1% and 9.2% on HMDB-51 and UCF-101, respectively.

action recognition, experiment, recognition, (15 more...)

arXiv.org Artificial Intelligence

2311.15619

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
(3 more...)

Add feedback

Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

Su, Hsuan, Hu, Ting-Yao, Koppula, Hema Swetha, Vemulapalli, Raviteja, Chang, Jen-Hao Rick, Yang, Karren, Mantena, Gautam Varma, Tuzel, Oncel

arXiv.org Artificial IntelligenceSep-18-2023

While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. To accomplish this, we propose a novel data synthesis pipeline that uses a Large Language Model (LLM) to generate a target domain text corpus, and a state-of-the-art controllable speech synthesis model to generate the corresponding speech. We propose a simple yet effective in-context instruction finetuning strategy to increase the effectiveness of LLM in generating text corpora for new domains. Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of $28\%$ on unseen target domains without any performance drop in source domains.

asr model, demonstration, target domain, (16 more...)

arXiv.org Artificial Intelligence

2309.10707

Country: Asia > Taiwan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback