Event Stream GPT: A Data Pre-processing and Modeling Library for Generative, Pre-trained Transformers over Continuous-time Sequences of Complex Events
Generative, pre-trained transformers (GPTs, a class of foundation models) have reshaped natural language processing (NLP) through their versatility in diverse downstream tasks. However, their potential extends far beyond NLP. This paper provides a software utility to help realize this potential, extending the applicability of GPTs to continuous-time sequences of complex events with internal dependencies, such as medical record datasets. Despite their potential, the adoption of foundation models in these domains has been hampered by the lack of suitable tools for model construction and evaluation. To bridge this gap, we introduce Event Stream GPT (ESGPT), an open-source library designed to streamline the end-to-end process for building GPTs for continuous-time event sequences. ESGPT allows users to (1) build flexible, foundation-model scale input datasets by specifying only a minimal configuration file, (2) leverage a Hugging Face compatible modeling API for GPTs over this modality that incorporates intra-event causal dependency structures and autoregressive generation capabilities, and (3) evaluate models via standardized processes that can assess few-shot and even zero-shot performance of pre-trained models on user-specified fine-tuning tasks.
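As a rough illustration of the configuration-driven workflow this abstract describes, the sketch below shows what a minimal event-stream dataset specification might look like. The field names are hypothetical stand-ins, not ESGPT's actual configuration schema; consult the library's documentation for the real format.

```python
# Hypothetical sketch of a minimal event-stream dataset configuration.
# All field names are illustrative inventions, NOT ESGPT's real schema.
from dataclasses import dataclass, field

@dataclass
class EventStreamConfig:
    subject_id_col: str                   # column identifying each subject
    timestamp_col: str                    # continuous-time event timestamp
    event_types: list[str] = field(default_factory=list)
    numeric_measurements: list[str] = field(default_factory=list)
    categorical_measurements: list[str] = field(default_factory=list)

config = EventStreamConfig(
    subject_id_col="patient_id",
    timestamp_col="event_time",
    event_types=["admission", "lab_result", "medication"],
    numeric_measurements=["heart_rate", "creatinine"],
    categorical_measurements=["med_name"],
)
print(config)
```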
M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M$^3$GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks.
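To make the first principle concrete, here is a minimal sketch of discrete vector quantization: continuous motion (or music) features are snapped to their nearest codebook entries, and the resulting indices serve as tokens in a single LLM vocabulary. The codebook size and dimensions are arbitrary choices for illustration.

```python
# Minimal vector-quantization sketch: map continuous feature vectors to
# discrete token ids via nearest-neighbor lookup in a learned codebook.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # 512 discrete codes, 64-dim each

def quantize(features: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each feature."""
    # Squared Euclidean distance from every feature to every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # token ids usable by an LLM

motion_frames = rng.normal(size=(10, 64))  # 10 frames of motion features
print(quantize(motion_frames))             # e.g. [203  47 ...]
```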
What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models
Floridi, Luciano, Morley, Jessica, Novelli, Claudio, Watson, David
This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
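A minimal sketch of the token-completion mechanism under discussion, assuming simple softmax sampling over toy logits: the model emits a probability distribution over next tokens and one is drawn at random, which is why fluent, reasoning-shaped output can arise without any underlying abductive inference.

```python
# Toy next-token sampling: the "stochastic base" the article analyzes.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["therefore", "because", "maybe", "cats"]
logits = np.array([2.0, 1.5, 0.3, -1.0])   # invented next-token scores

def sample_next(logits, temperature=1.0):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                   # softmax over the toy logits
    return rng.choice(len(probs), p=probs)

print(vocab[sample_next(logits)])  # stochastic: runs may differ
```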
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs
Guo, Mingning, Wu, Mengwei, He, Jiarun, Li, Shaoxian, Li, Haifeng, Tao, Chao
With the rapid advancement of low-altitude remote sensing and Vision-Language Models (VLMs), Embodied Agents based on Unmanned Aerial Vehicles (UAVs) have shown significant potential in autonomous tasks. However, current evaluation methods for UAV-Embodied Agents (UAV-EAs) remain constrained by the lack of standardized benchmarks, diverse testing scenarios, and open system interfaces. To address these challenges, we propose BEDI (Benchmark for Embodied Drone Intelligence), a systematic and standardized benchmark designed for evaluating UAV-EAs. Specifically, we introduce a novel Dynamic Chain-of-Embodied-Task paradigm based on the perception-decision-action loop, which decomposes complex UAV tasks into standardized, measurable subtasks. Building on this paradigm, we design a unified evaluation framework encompassing six core sub-skills: semantic perception, spatial perception, motion control, tool utilization, task planning, and action generation. Furthermore, we develop a hybrid testing platform that incorporates a wide range of both virtual and real-world scenarios, enabling a comprehensive evaluation of UAV-EAs across diverse contexts. The platform also offers open and standardized interfaces, allowing researchers to customize tasks and extend scenarios, thereby enhancing flexibility and scalability in the evaluation process. Finally, through empirical evaluations of several state-of-the-art (SOTA) VLMs, we reveal their limitations in embodied UAV tasks, underscoring the critical role of the BEDI benchmark in advancing embodied intelligence research and model optimization. By filling the gap in systematic and standardized evaluation within this field, BEDI facilitates objective model comparison and lays a robust foundation for future development. Our benchmark is now publicly available at https://github.com/lostwolves/BEDI.
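As a hedged sketch of the Dynamic Chain-of-Embodied-Task idea, the snippet below decomposes one UAV task into measurable subtasks, each tagged with one of the six core sub-skills named above, and tallies a per-skill pass rate. The scoring scheme is an invented illustration, not BEDI's actual metric.

```python
# Hypothetical task-chain scoring; skill names come from the abstract,
# the pass/fail scoring scheme is invented for illustration.
from dataclasses import dataclass

SKILLS = {"semantic_perception", "spatial_perception", "motion_control",
          "tool_utilization", "task_planning", "action_generation"}

@dataclass
class Subtask:
    description: str
    skill: str       # one of SKILLS
    passed: bool

def skill_scores(chain: list[Subtask]) -> dict[str, float]:
    """Pass rate per sub-skill over one embodied task chain."""
    scores = {}
    for skill in SKILLS:
        hits = [s.passed for s in chain if s.skill == skill]
        if hits:
            scores[skill] = sum(hits) / len(hits)
    return scores

chain = [
    Subtask("locate the red rooftop marker", "semantic_perception", True),
    Subtask("estimate distance to marker", "spatial_perception", True),
    Subtask("plan approach trajectory", "task_planning", False),
    Subtask("execute descent command", "action_generation", True),
]
print(skill_scores(chain))
```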
- Asia > China > Beijing > Beijing (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- (8 more...)
- Workflow (1.00)
- Research Report (1.00)
- Government > Military (0.67)
- Information Technology > Robotics & Automation (0.48)
- Transportation > Marine (0.46)
- Health & Medicine > Therapeutic Area (0.46)
Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast
Truong, Nghi, Puranam, Phanish, Koçak, Özgecan
The emergence of transformative technologies often surfaces deep societal divisions, nowhere more evident than in contemporary debates about artificial intelligence (AI). A striking feature of these divisions is that they persist despite shared interests in ensuring that AI benefits humanity and avoiding catastrophic outcomes. This paper analyzes contemporary debates about AI risk, parsing the differences between the "doomer" and "boomer" perspectives into definitional, factual, causal, and moral premises to identify key points of contention. We find that differences in perspectives about existential risk ("X-risk") arise fundamentally from differences in causal premises about design vs. emergence in complex systems, while differences in perspectives about employment risks ("E-risks") pertain to different causal premises about the applicability of past theories (evolution) vs. their inapplicability (revolution). Disagreements about these two forms of AI risk appear to share two properties: neither involves significant disagreements on moral values, and both can be described in terms of differing views on the extent of boundedness of human rationality. Our approach to analyzing reasoning chains at scale, using an ensemble of LLMs to parse textual data, can be applied to identify key points of contention in debates about risk to the public in any arena.
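A minimal sketch of the analysis pattern described here, assuming each debate statement is labeled as a definitional, factual, causal, or moral premise by several LLMs whose votes are aggregated by majority; the labeler outputs below are hard-coded stand-ins for real model calls.

```python
# Majority-vote premise labeling; labeler outputs are fabricated stand-ins.
from collections import Counter

PREMISE_TYPES = {"definitional", "factual", "causal", "moral"}

def ensemble_label(statement: str, labeler_outputs: list[str]) -> str:
    """Majority vote over per-model premise labels for one statement."""
    votes = Counter(l for l in labeler_outputs if l in PREMISE_TYPES)
    label, _ = votes.most_common(1)[0]
    return label

stmt = "Capabilities emerge unpredictably as models scale."
print(ensemble_label(stmt, ["causal", "causal", "factual"]))  # -> causal
```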
- North America > Canada > Ontario > Toronto (0.28)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law (1.00)
- Government (1.00)
- Information Technology > Security & Privacy (0.92)
Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics
Ahn, Jihun, Irianti, Gabriella Pasya, Thapar, Vikram, Hur, Su-Mi
Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. While data scarcity is often cited as the primary bottleneck, we demonstrate that strategic molecular representations can overcome this limitation. We introduce CI-LLM (Chemically Informed Language Model), a framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$^3$BERTa, our descriptor-enriched encoder, achieves 3.5x faster inference than SMILES-based models with improved accuracy ($R^2$ score gains of 0.9-4.1 percent across four properties), while providing interpretable structure-property insights at the subgroup level. For inverse design, our GPT-based generator produces polymers with targeted properties, achieving 100 percent scaffold retention and successful multi-property optimization for negatively correlated objectives. This comprehensive framework demonstrates both forward prediction and inverse design capabilities, showcasing how strategic molecular representation advances machine learning applications in polymer science.
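As a hedged illustration of the hierarchical-abstraction idea behind HAPPY, the sketch below replaces recurring chemical substructures in a repeat unit with single abstract tokens before the sequence reaches a language model. The substructure table is an invented toy example, not the actual HAPPY vocabulary.

```python
# Toy hierarchical tokenizer: abstract known SMILES fragments into
# single tokens. The fragment-to-token table is invented for illustration.
SUBSTRUCTURE_TOKENS = {
    "c1ccccc1": "[AROMATIC_RING]",   # benzene ring (SMILES fragment)
    "C(=O)O":   "[ESTER_LINK]",
    "CC":       "[ETHYL]",
}

def tokenize_repeat_unit(smiles: str) -> str:
    """Greedily abstract known substructures into hierarchical tokens."""
    out = smiles
    for fragment, token in SUBSTRUCTURE_TOKENS.items():
        out = out.replace(fragment, token)
    return out

print(tokenize_repeat_unit("CCc1ccccc1C(=O)O"))
# -> [ETHYL][AROMATIC_RING][ESTER_LINK]
```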
Explainable Semantic Text Relations: A Question-Answering Framework for Comparing Document Content
Aperstein, Yehudit, Gottlib, Alon, Benita, Gal, Apartsin, Alexander
Understanding semantic relations between two texts is crucial for many information and document management tasks, in which one must determine whether the content fully overlaps, is completely superseded by another document, or overlaps only partially, with unique information in each. Beyond establishing this relation, it is equally important to provide explainable outputs that specify which pieces of information are present, missing, or newly added between the text pair. In this study, we formally define semantic relations between two texts through the set-theoretic relation between their respective Answerable Question Sets (AQS), the sets of questions each text can answer. Under this formulation, Semantic Text Relation (STR), such as equivalence, inclusion, and mutual overlap, becomes a well-defined set relation between the corresponding texts' AQSs. The set differences between the AQSs also serve as an explanation or diagnostic tool for identifying how the information in the texts diverges. Using this definition, we construct a synthetic benchmark that captures fine-grained informational relations through controlled paraphrasing and deliberate information removal supported by AQS manipulations. We then use this dataset to evaluate several discriminative and generative models for classifying text pairs into STR categories, assessing how well different model architectures capture semantic relations beyond surface-level similarity. We publicly release both the dataset and the data generation code to support further research.
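The set-theoretic definition lends itself to a direct sketch: represent each text by its AQS, classify the STR by the set relation between the two AQSs, and read the explanation off the set differences. A minimal illustration, with invented example questions:

```python
# STR classification from Answerable Question Sets, per the paper's
# set-theoretic definition; the example questions are invented.
def str_category(aqs_a: set[str], aqs_b: set[str]) -> str:
    if aqs_a == aqs_b:
        return "equivalence"
    if aqs_a < aqs_b:
        return "inclusion (A in B)"
    if aqs_b < aqs_a:
        return "inclusion (B in A)"
    if aqs_a & aqs_b:
        return "mutual overlap"
    return "disjoint"

aqs_a = {"who founded the company?", "when was it founded?"}
aqs_b = {"who founded the company?", "where is it headquartered?"}
print(str_category(aqs_a, aqs_b))          # mutual overlap
print("missing from B:", aqs_a - aqs_b)    # explanation of divergence
print("added in B:", aqs_b - aqs_a)
```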
- North America > United States > New York (0.05)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- (3 more...)
- Government (0.69)
- Health & Medicine (0.68)
- Law Enforcement & Public Safety (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation
Rahmani, Mahdi, Saffari, AmirHossein, Rahmani, Reyhane
Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (Q&A) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning. MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages.
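A hedged sketch of the generate-validate-refine loop described above; the three agent functions are placeholders standing in for real LLM calls, and only the control flow is illustrative.

```python
# Generate-validate-refine control flow; each agent is a fake stand-in
# for an LLM call, illustrating only the loop structure.
def generate(persona: str, product: str) -> str:
    return f"As {persona}: does {product} ship within Tehran?"

def validate(question: str) -> bool:
    return question.endswith("?") and len(question) > 15   # toy check

def refine(question: str) -> str:
    return question.rstrip(".") + "?"

def synthesize_question(persona: str, product: str, max_rounds: int = 3) -> str:
    q = generate(persona, product)
    for _ in range(max_rounds):
        if validate(q):
            return q
        q = refine(q)            # send back for another refinement pass
    return q

print(synthesize_question("a price-sensitive student", "this backpack"))
```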
- North America > United States (0.28)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
- Asia > China (0.05)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy
Jackson, Daniel I, Jensen, Emma L, Hussain, Syed-Amad, Sezgin, Emre
Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
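As a rough sketch of the psychometric protocol, the snippet below administers GSES items as 1-4 Likert prompts and sums them into the standard 10-40 aggregate; query_model is a placeholder for a real LLM call, and the returned rating is fabricated for illustration.

```python
# GSES administration sketch: Likert ratings (1 = not at all true,
# 4 = exactly true) summed into the standard 10-40 total score.
GSES_ITEM_1 = "I can always manage to solve difficult problems if I try hard enough."

def query_model(item: str, condition: str) -> int:
    return 3  # stand-in: a real call would parse the model's 1-4 rating

def gses_score(items: list[str], condition: str) -> int:
    ratings = [query_model(item, condition) for item in items]
    assert all(1 <= r <= 4 for r in ratings)
    return sum(ratings)   # GSES total ranges from 10 to 40

print(gses_score([GSES_ITEM_1] * 10, condition="summarization"))  # -> 30
```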
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > Costa Rica (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes
Kenaston, Matthew W., Ayub, Umair, Parmar, Mihir, Anjum, Muhammad Umair, Naqvi, Syed Arsalan Ahmed, Kumar, Priya, Rawal, Samarth, Chaudhuri, Aadel A., Zakharia, Yousef, Heath, Elizabeth I., Bekaii-Saab, Tanios S., Tao, Cui, Van Allen, Eliezer M., Zhou, Ben, Choi, YooJung, Baral, Chitta, Riaz, Irbaz Bin
Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.
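A hedged sketch of how such a three-tier taxonomy can be applied to annotated reasoning traces: each trace carries a (tier 1 > tier 2 > tier 3) label and error rates are tallied per tier. The label strings are illustrative, echoing the biases named in the abstract; the counts are toy data, not the study's results.

```python
# Three-tier error-taxonomy tally over fabricated trace annotations.
from collections import Counter

annotations = [
    ("reasoning_error", "cognitive_bias", "confirmation_bias"),
    ("reasoning_error", "cognitive_bias", "anchoring_bias"),
    ("no_error", None, None),
    ("reasoning_error", "cognitive_bias", "confirmation_bias"),
]

tier3_counts = Counter(a[2] for a in annotations if a[2])
error_rate = sum(a[0] == "reasoning_error" for a in annotations) / len(annotations)
print(f"reasoning-error rate: {error_rate:.0%}")   # 75% in this toy sample
print(tier3_counts.most_common())
```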
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- North America > United States > Arizona > Maricopa County > Phoenix (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)