lexical representation
Unified Lexical Representation for Interpretable Visual-Language Alignment
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words.However, lexical representations are difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively.In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability.To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words.We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M).We conduct extensive experiments to analyze LexVLA.
Unified Lexical Representation for Interpretable Visual-Language Alignment
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words.However, lexical representations are difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively.In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability.To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words.We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M).We conduct extensive experiments to analyze LexVLA.
TARGET: Benchmarking Table Retrieval for Generative Tasks
Ji, Xingyu, Glenn, Parker, Parameswaran, Aditya G., Hulsebos, Madelon
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline which is less effective than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
Probabilistic Lexical Manifold Construction in Large Language Models via Hierarchical Vector Field Interpolation
Pendleton, Clive, Harrington, Ewan, Fairbrother, Giles, Arkwright, Jasper, Fenwick, Nigel, Katrix, Richard
Hierarchical vector field interpolation introduces a structured probabilistic framework for lexical representation, ensuring that word embeddings transition smoothly across a continuous manifold rather than being constrained to discrete token mappings. The proposed methodology constructs a probabilistic function space where word representations adhere to topological consistency, mitigating representational discontinuities commonly observed in transformer-based embeddings. Empirical evaluations reveal that probabilistic constraints enhance lexical coherence by refining contextual relationships, leading to improvements in semantic stability across multiple linguistic distributions. The application of divergence minimization techniques ensures that interpolated embeddings maintain probabilistic consistency while preserving computational feasibility for large-scale implementations. Experimental findings demonstrate that interpolated lexical manifolds improve representation density alignment, reducing anisotropic distortions in contextual embedding distributions. Comparative analyses with standard transformer-based models highlight that structured interpolation yields more stable representations, particularly in tasks requiring fine-grained semantic differentiation. The statistical evaluation of embedding divergence confirms that probabilistic lexical manifolds reduce representational inconsistencies while maintaining coherence across varying scales of contextual abstraction. An assessment of computational efficiency reveals that while interpolation introduces minor processing overhead, the structured representation learning approach remains scalable for practical deployment.
Hierarchical Lexical Manifold Projection in Large Language Models: A Novel Mechanism for Multi-Scale Semantic Representation
Martus, Natasha, Crowther, Sebastian, Dorrington, Maxwell, Applethwaite, Jonathan, Tillinghurst, Edgar, Birkenshaw, Quentin, Petrov, Lukas, Willoughby, Constance
The integration of structured hierarchical embeddings into transformer-based architectures introduces a refined approach to lexical representation, ensuring that multi-scale semantic relationships are preserved without compromising computational efficiency. A projection mechanism that maps tokens onto a structured manifold provides improved lexical alignment, enhancing the adaptability of word representations across diverse linguistic tasks. The structured encoding framework ensures that hierarchical embeddings maintain coherence across varying abstraction levels, allowing for stable transitions between localized syntactic features and global semantic structures. Experimental evaluations indicate that hierarchical embeddings consistently outperform conventional token representations, improving accuracy in linguistic benchmarks while maintaining lower computational overhead. Comparative analysis across multiple domains highlights the ability of hierarchical embeddings to retain contextual consistency, particularly in specialized language applications where structured lexical alignment is essential. Statistical assessments further demonstrate that hierarchical embeddings exhibit enhanced robustness under perturbation conditions, ensuring that linguistic structures remain stable across adversarial text modifications. The integration of hierarchical projections with transformer attention mechanisms enables improved contextual adaptation, ensuring that token representations are dynamically adjusted based on varying linguistic distributions. The refined hierarchical organization of embeddings provides greater interpretability in lexical modeling, facilitating enhanced generalization capabilities across diverse text processing tasks.
Unified Lexical Representation for Interpretable Visual-Language Alignment
Li, Yifan, Wang, Yikai, Fu, Yanwei, Ru, Dongyu, Zhang, Zheng, He, Tong
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations is difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
Efficient and Interpretable Information Retrieval for Product Question Answering with Heterogeneous Data
Biswas, Biplob, Ramnath, Rajiv
Expansion-enhanced sparse lexical representation improves information retrieval (IR) by minimizing vocabulary mismatch problems during lexical matching. In this paper, we explore the potential of jointly learning dense semantic representation and combining it with the lexical one for ranking candidate information. We present a hybrid information retrieval mechanism that maximizes lexical and semantic matching while minimizing their shortcomings. Our architecture consists of dual hybrid encoders that independently encode queries and information elements. Each encoder jointly learns a dense semantic representation and a sparse lexical representation augmented by a learnable term expansion of the corresponding text through contrastive learning. We demonstrate the efficacy of our model in single-stage ranking of a benchmark product question-answering dataset containing the typical heterogeneous information available on online product pages. Our evaluation demonstrates that our hybrid approach outperforms independently trained retrievers by 10.95% (sparse) and 2.7% (dense) in MRR@5 score. Moreover, our model offers better interpretability and performs comparably to state-of-the-art cross encoders while reducing response time by 30% (latency) and cutting computational load by approximately 38% (FLOPs).
Neural inhibition during speech planning contributes to contrastive hyperarticulation
Stern, Michael C., Shaw, Jason A.
Previous work has demonstrated that words are hyperarticulated on dimensions of speech that differentiate them from a minimal pair competitor. This phenomenon has been termed contrastive hyperarticulation (CH). We present a dynamic neural field (DNF) model of voice onset time (VOT) planning that derives CH from an inhibitory influence of the minimal pair competitor during planning. We test some predictions of the model with a novel experiment investigating CH of voiceless stop consonant VOT in pseudowords. The results demonstrate a CH effect in pseudowords, consistent with a basis for the effect in the real-time planning and production of speech. The scope and magnitude of CH in pseudowords was reduced compared to CH in real words, consistent with a role for interactive activation between lexical and phonological levels of planning. We discuss the potential of our model to unify an apparently disparate set of phenomena, from CH to phonological neighborhood effects to phonetic trace effects in speech errors.