 Dinh, Sang


GloCOM: A Short Text Neural Topic Model via Global Clustering Context

arXiv.org Artificial Intelligence

Uncovering hidden topics from short texts is challenging for traditional and neural models due to data sparsity, which limits word co-occurrence patterns, and label sparsity, stemming from incomplete reconstruction targets. Although data aggregation offers a potential solution, existing neural topic models often overlook it due to time complexity, poor aggregation quality, and difficulty in inferring topic proportions for individual documents. In this paper, we propose a novel model, GloCOM (Global Clustering COntexts for Topic Models), which addresses these challenges by constructing aggregated global clustering contexts for short documents, leveraging text embeddings from pre-trained language models. GloCOM can infer both global topic distributions for clustering contexts and local distributions for individual short texts. Additionally, the model incorporates these global contexts to augment the reconstruction loss, effectively handling the label sparsity issue. Extensive experiments on short text datasets show that our approach outperforms other state-of-the-art models in both topic quality and document representations.
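
To make the aggregation idea concrete, below is a minimal sketch of GloCOM-style global clustering contexts, not the authors' implementation: short texts are embedded with a pre-trained encoder, clustered so each cluster defines one global context, and each document's bag-of-words reconstruction target is mixed with its cluster's aggregated counts. The encoder name, cluster count, and mixing weight `lam` are illustrative assumptions.

```python
# Sketch of global clustering contexts for short texts (illustrative, not the
# authors' code). Assumes sentence-transformers and scikit-learn are installed.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "gpu memory error",
    "cuda out of memory",
    "stock prices fell",
    "markets closed lower",
]

# 1. Embed short texts with a pre-trained language model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
emb = encoder.encode(docs)

# 2. Cluster embeddings; each cluster defines one aggregated global context.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)

# 3. Bag-of-words per document, summed per cluster into global context counts.
vec = CountVectorizer()
bow = vec.fit_transform(docs).toarray().astype(float)               # (docs, vocab)
global_bow = np.stack([bow[labels == c].sum(0) for c in range(k)])  # (k, vocab)

# 4. Augment each document's reconstruction target with its cluster's context,
#    easing label sparsity (`lam` is a hypothetical mixing weight).
lam = 0.5
targets = (1 - lam) * bow + lam * global_bow[labels]
```

A topic model trained against `targets` instead of the raw `bow` sees denser co-occurrence signal per document, while per-document (local) topic proportions can still be inferred alongside the per-cluster (global) ones.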


SemiKong: Curating, Training, and Evaluating A Semiconductor Industry-Specific Large Language Model

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated the potential to address some issues within the semiconductor industry. However, they are often general-purpose models that lack the specialized knowledge needed to tackle the unique challenges of this sector, such as the intricate physics and chemistry of semiconductor devices and processes. SemiKong, the first industry-specific LLM for the semiconductor domain, provides a foundation that can be used to develop tailored proprietary models. With SemiKong 1.0, we aim to develop a foundational model capable of understanding etching problems at an expert level. Our key contributions include (a) curating a comprehensive corpus of semiconductor-related texts, (b) creating a foundational model with in-depth semiconductor knowledge, and (c) introducing a framework for integrating expert knowledge, thereby advancing the evaluation process of domain-specific AI models. Through fine-tuning a pre-trained LLM using our curated dataset, we have shown that SemiKong outperforms larger, general-purpose LLMs in various semiconductor manufacturing and design tasks. Our extensive experiments underscore the importance of developing domain-specific LLMs as a foundation for company- or tool-specific proprietary models, paving the way for further research and applications in the semiconductor domain. Code and dataset will be available at https://github.com/aitomatic/semikong
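
As a rough illustration of the fine-tuning step described above, here is a parameter-efficient (LoRA) domain-adaptation sketch, not the released SemiKong pipeline. The base model name, corpus file `semiconductor_corpus.txt`, and all hyperparameters are placeholders; it assumes the `transformers`, `peft`, and `datasets` libraries.

```python
# Sketch of domain-adaptive fine-tuning on a curated corpus (illustrative only).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Parameter-efficient adaptation: train low-rank adapters, freeze base weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

# Curated domain texts, one document per line (hypothetical file).
data = load_dataset("text", data_files="semiconductor_corpus.txt")["train"]
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=1024),
                remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="semikong-lora", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```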


DANA: Domain-Aware Neurosymbolic Agents for Consistency and Accuracy

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown remarkable capabilities, but their inherent probabilistic nature often leads to inconsistency and inaccuracy in complex problem-solving tasks. This paper introduces DANA (Domain-Aware Neurosymbolic Agent), an architecture that addresses these issues by integrating domain-specific knowledge with neurosymbolic approaches. We begin by analyzing current AI architectures, including AutoGPT, LangChain ReAct, and OpenAI's ChatGPT, through a neurosymbolic lens, highlighting how their reliance on probabilistic inference contributes to inconsistent outputs. In response, DANA captures and applies domain expertise in both natural-language and symbolic forms, enabling more deterministic and reliable problem-solving behaviors. We implement a variant of DANA using Hierarchical Task Plans (HTPs) in the open-source OpenSSA framework. This implementation achieves over 90% accuracy on the FinanceBench financial-analysis benchmark, significantly outperforming current LLM-based systems in both consistency and accuracy. Application of DANA in physical industries such as semiconductors shows that its flexible architecture for incorporating knowledge is effective in mitigating the probabilistic limitations of LLMs and has potential for tackling complex, real-world problems that require reliability and precision.
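
The sketch below illustrates the Hierarchical Task Plan idea in plain Python; it is not the OpenSSA API. A task is decomposed into subtasks, each answered by a deterministic domain routine when one exists, with a probabilistic LLM call (a stand-in function here) used only as a fallback or to aggregate results.

```python
# Illustrative Hierarchical Task Plan (HTP), not OpenSSA's actual interface.
from dataclasses import dataclass, field
from typing import Callable, Optional


def llm(prompt: str) -> str:
    """Stand-in for an LLM completion call (hypothetical)."""
    return f"<LLM answer to: {prompt}>"


@dataclass
class Task:
    goal: str
    solver: Optional[Callable[[], str]] = None   # symbolic domain knowledge
    subtasks: list["Task"] = field(default_factory=list)

    def solve(self) -> str:
        if self.subtasks:                        # decompose, then aggregate
            parts = [t.solve() for t in self.subtasks]
            return llm(f"Combine into an answer to '{self.goal}': {parts}")
        if self.solver:                          # deterministic path preferred
            return self.solver()
        return llm(self.goal)                    # probabilistic fallback


# Toy financial-analysis plan: the margin comes from an exact formula, not the LLM.
plan = Task(
    goal="Assess 2023 profitability",
    subtasks=[
        Task("Compute gross margin", solver=lambda: f"{(150 - 90) / 150:.1%}"),
        Task("Summarize revenue drivers"),       # no symbolic solver: LLM call
    ],
)
print(plan.solve())
```

Routing as much of the plan as possible through symbolic solvers is what makes the overall behavior more deterministic than a single end-to-end LLM call.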


Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study

arXiv.org Artificial Intelligence

AI-powered question-answering (Q&A) systems have emerged as important tools, alongside established search technologies, to enable quick access to relevant information and knowledge from large digital sources that are complex and time-consuming for humans to navigate. Advancements in large language models (LLMs) have revolutionized the field of Q&A, with models like GPT-3 (Brown et al. 2020), BERT (Devlin et al. 2018), and RoBERTa (Liu et al. 2019) demonstrating remarkable abilities in understanding and generating human-like text. However, the effectiveness of such models in handling domain-specific questions that require specialized knowledge is limited. Retrieval-augmented generation (RAG) techniques, which combine information retrieval and generative models (Lewis et al. 2021), have shown promise in boosting the quality of LLM output in Q&A tasks. RAG systems leverage the strengths of both retrieval and generation components to provide contextually relevant and informative responses. While there is a lack of established quantification of RAG accuracy, early findings suggest that generic RAG does not perform well in complex domains such as finance. In one instance, RAG based on generic LLMs such as GPT-4-Turbo fails to answer 81% of the questions derived from Securities and Exchange Commission (SEC) financial filings (Islam et al. 2023).
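
For readers unfamiliar with the RAG loop the abstract refers to, here is a minimal sketch: chunks of a filing are embedded, the top-k chunks by cosine similarity to the question are retrieved, and the generator is prompted with that context. The encoder name and the `generate` stand-in are assumptions, not any specific production system.

```python
# Minimal retrieval-augmented generation loop (illustrative sketch).
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "Revenue for fiscal 2023 was $4.2B, up 7% year over year.",
    "Operating expenses increased due to R&D headcount growth.",
    "The company repurchased $500M of common stock.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
chunk_emb = encoder.encode(chunks, normalize_embeddings=True)


def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical)."""
    return f"<answer grounded in: {prompt[:60]}...>"


def rag_answer(question: str, k: int = 2) -> str:
    q = encoder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(chunk_emb @ q)[::-1][:k]     # cosine-similarity ranking
    context = "\n".join(chunks[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


print(rag_answer("How did revenue change in fiscal 2023?"))
```

Generic RAG of this shape retrieves by surface similarity only; the paper's point is that domain-specific fine-tuning and iterative reasoning are needed when questions hinge on specialized knowledge, as in SEC filings.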