Sengamedu, Srinivasan
Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey
Fang, Xi, Xu, Weijie, Tan, Fiona Anting, Zhang, Jiani, Hu, Ziqing, Qi, Yanjun, Nickleach, Scott, Socolinsky, Diego, Sengamedu, Srinivasan, Faloutsos, Christos
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application to diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently no comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.
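To make the prediction task concrete, the following minimal sketch (not from the survey; the column names and prompt template are illustrative assumptions) shows one common approach the survey covers: serializing a table row into a natural-language prompt that a generic instruction-tuned LLM could complete.

import pandas as pd

def row_to_prompt(row: pd.Series, target: str) -> str:
    """Serialize one table row into a natural-language prediction prompt."""
    features = ", ".join(f"{col} is {val}" for col, val in row.items() if col != target)
    return f"Given that {features}, predict {target}."

# Tiny illustrative table; in practice this would be a real tabular dataset.
df = pd.DataFrame({"age": [39, 52],
                   "education": ["Bachelors", "HS-grad"],
                   "income": ["<=50K", ">50K"]})

for _, row in df.iterrows():
    print(row_to_prompt(row, target="income"))

The resulting prompts would then be sent to an LLM of choice; how such serialization, prompting, and fine-tuning choices affect each task is part of what the survey compares.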
DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM
Xu, Weijie, Hu, Wenxiang, Wu, Fanyou, Sengamedu, Srinivasan
In the burgeoning field of natural language processing (NLP), Neural Topic Models (NTMs), Large Language Models (LLMs), and diffusion models have emerged as areas of significant research interest. Despite this, NTMs primarily utilize contextual embeddings from LLMs, which are not optimal for clustering and do not support topic-based text generation, and NTMs have never been combined with diffusion models for text generation. Our study addresses these gaps by introducing a novel framework named Diffusion-Enhanced Topic Modeling using Encoder-Decoder-based LLMs (DeTiME). DeTiME leverages encoder-decoder-based LLMs to produce highly clusterable embeddings, yielding topics with both superior clusterability and enhanced semantic coherence compared to existing methods. Additionally, by exploiting diffusion models, our framework also supports topic-based text generation. This dual functionality allows users to efficiently produce highly clustered topics and topic-based text simultaneously. DeTiME's potential extends to generating clustered embeddings as well. Notably, both components of the proposed framework (the encoder-decoder-based LLM and the diffusion model) are efficient to train and adapt readily to other LLMs and diffusion models, demonstrating its potential for a wide array of applications.
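A minimal sketch of the embedding step (not the DeTiME implementation; the model name "t5-small", mean pooling, and plain k-means are illustrative stand-ins) shows the general idea of embedding documents with an encoder-decoder LLM's encoder and clustering the results into topics.

import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, T5EncoderModel

docs = ["the court ruled on the appeal", "the striker scored twice",
        "the senate passed the bill", "the team won the final"]

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

with torch.no_grad():
    batch = tokenizer(docs, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # mask out padding tokens
    doc_emb = (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled document embeddings

topic_ids = KMeans(n_clusters=2, n_init=10).fit_predict(doc_emb.numpy())
print(topic_ids)  # one cluster id per document

DeTiME itself learns the embedding space rather than relying on an off-the-shelf encoder as above, and it adds a diffusion model on top of those embeddings for topic-based generation.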
S2vNTM: Semi-supervised vMF Neural Topic Modeling
Xu, Weijie, Desai, Jay, Sengamedu, Srinivasan, Jiang, Xiaoyu, Iannacci, Francis
Language-model-based methods are powerful techniques for text classification. However, these models have several shortcomings. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords per topic as input. It leverages the pattern of keywords to identify potential topics and to optimize the quality of each topic's keyword set. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy when only a limited number of keywords is provided, and it is at least twice as fast as the baselines. Language Model (LM) pre-training Vaswani et al. (2017); Devlin et al. (2018) has proven to be useful in learning universal language representations. Recent language models such as Yang et al. (2019); Sun et al. (2019); Chen et al. (2022); Ding et al. (2021) have achieved impressive results in text classification. Most of these methods need enough high-quality labels to train. To make LM-based methods work well when only limited labels are available, few-shot learning methods such as Bianchi et al. (2021); Meng et al. (2020a;b); Mekala and Shang (2020); Yu et al. (2021); Wang et al. (2021b) have been proposed. However, these methods rely on large pre-training corpora and can be biased when applied to a different environment. Topic modeling methods, in contrast, generate topics based on the patterns of words.
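A toy sketch of the seed-keyword idea (not the actual S2vNTM model; the word vectors are random placeholders) illustrates how each topic can start from the mean direction of its seed-keyword embeddings on the unit sphere, as in a von Mises-Fisher (vMF) component, with documents assigned to the nearest direction by cosine similarity.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["game", "team", "score", "law", "court", "appeal"]
word_vecs = {w: rng.normal(size=8) for w in vocab}   # placeholder word embeddings

def unit(v):
    return v / np.linalg.norm(v)

# One vMF-style mean direction per topic, built from its seed keywords.
seed_keywords = {"sports": ["game", "team"], "legal": ["law", "court"]}
topic_dirs = {t: unit(np.mean([word_vecs[w] for w in ws], axis=0))
              for t, ws in seed_keywords.items()}

def assign(doc_words):
    """Assign a bag of words to the topic whose mean direction is closest."""
    doc = unit(np.mean([word_vecs[w] for w in doc_words], axis=0))
    return max(topic_dirs, key=lambda t: float(doc @ topic_dirs[t]))

print(assign(["score", "team", "game"]))  # nearest topic direction for this document

In the full model, the topic directions and keyword sets are learned and refined jointly during training rather than fixed as in this sketch.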
Universal Representation for Code
Liu, Linfeng, Nguyen, Hoan, Karypis, George, Sengamedu, Srinivasan
Learning from source code usually requires a large amount of labeled data. Beyond the possible scarcity of labeled data, the trained model is often highly task-specific and lacks transferability to different tasks. In this work, we present effective pre-training strategies on top of a novel graph-based code representation to produce universal representations for code. Specifically, our graph-based representation captures important semantics between code elements (e.g., control flow and data flow). We pre-train graph neural networks on this representation to extract universal code properties. The pre-trained model can then be fine-tuned to support various downstream applications. We evaluate our model on two real-world datasets spanning over 30M Java methods and 770K Python methods. Through visualization, we reveal discriminative properties in our universal code representation. On multiple benchmarks, we demonstrate that the proposed framework achieves state-of-the-art results on method name prediction and code graph link prediction.
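A minimal sketch (not the paper's actual graph representation or GNN architecture; the node set, edges, and dimensions are hypothetical) shows the idea of encoding a code fragment as a graph of elements connected by data-flow edges and running one GCN-style message-passing step to produce node and method-level embeddings.

import torch

# Hypothetical graph for the fragment `x = a + b; return x`:
# nodes 0..3 = [a, b, x, return]; edges = data flow (a->x, b->x, x->return).
num_nodes, dim = 4, 16
edges = torch.tensor([[0, 2], [1, 2], [2, 3]])        # (num_edges, 2)

features = torch.randn(num_nodes, dim)                # initial node features
adj = torch.zeros(num_nodes, num_nodes)
adj[edges[:, 0], edges[:, 1]] = 1.0
adj = adj + torch.eye(num_nodes)                      # add self-loops

linear = torch.nn.Linear(dim, dim)
deg = adj.sum(dim=1, keepdim=True)
node_emb = torch.relu(linear((adj @ features) / deg)) # one mean-aggregation step

method_emb = node_emb.mean(dim=0)                      # whole-method embedding
print(method_emb.shape)                                # torch.Size([16])

Pre-training objectives such as link prediction over many such graphs would then yield the transferable representations described above, which can be fine-tuned for downstream tasks like method name prediction.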