embedding-based method
Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces
In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy. Most prior works attribute this poor performance to the low-dimensional bottleneck in embedding-based methods. In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. These findings motivate us to investigate novel data augmentation and regularization techniques to mitigate overfitting. To this end, we propose GLaS, a new regularizer for embedding-based neural network approaches. It is a natural generalization from the graph Laplacian and spread-out regularizers, and empirically it addresses the drawback of each regularizer alone when applied to the extreme classification setup. With the proposed techniques, we attain or improve upon the state-of-the-art on most widely tested public extreme classification datasets with hundreds of thousands of labels.
On the Modularity of Hypernetworks
In the context of learning to map an input $I$ to a function $h_I:\mathcal{X}\to \mathbb{R}$, two alternative methods are compared: (i) an embedding-based method, which learns a fixed function in which $I$ is encoded as a conditioning signal $e(I)$ and the learned function takes the form $h_I(x) = q(x,e(I))$, and (ii) hypernetworks, in which the weights $\theta_I$ of the function $h_I(x) = g(x;\theta_I)$ are given by a hypernetwork $f$ as $\theta_I=f(I)$. In this paper, we define the property of modularity as the ability to effectively learn a different function for each input instance $I$. For this purpose, we adopt an expressivity perspective of this property and extend the theory of~\cite{devore} and provide a lower bound on the complexity (number of trainable parameters) of neural networks as function approximators, by eliminating the requirements for the approximation method to be robust. Our results are then used to compare the complexities of $q$ and $g$, showing that under certain conditions and when letting the functions $e$ and $f$ be as large as we wish, $g$ can be smaller than $q$ by orders of magnitude.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Nevada (0.04)
- (8 more...)
Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification
Wei, Lu, Li, Liangzhi, Xiang, Tong, Liu, Xiao, Garcia, Noa
The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (14 more...)
- Law Enforcement & Public Safety > Terrorism (0.67)
- Information Technology > Services (0.46)
- Health & Medicine > Therapeutic Area (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces
In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy. Most prior works attribute this poor performance to the low-dimensional bottleneck in embedding-based methods. In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. These findings motivate us to investigate novel data augmentation and regularization techniques to mitigate overfitting. To this end, we propose GLaS, a new regularizer for embedding-based neural network approaches.
FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis
Zheng, Yilun, Li, Sha, Wu, Fangkun, Ziyi, Yang, Hongchao, Lin, Hu, Zhichao, Xinjun, Cai, Wang, Ziming, Chen, Jinxuan, Luan, Sitao, Xu, Jiahao, Chen, Lihui
Parody is an emerging phenomenon on social media, where individuals imitate a role or position opposite to their own, often for humor, provocation, or controversy. Detecting and analyzing parody can be challenging and is often reliant on context, yet it plays a crucial role in understanding cultural values, promoting subcultures, and enhancing self-expression. However, the study of parody is hindered by limited available data and deficient diversity in current datasets. To bridge this gap, we built seven parody datasets from both English and Chinese corpora, with 14,755 annotated users and 21,210 annotated comments in total. To provide sufficient context information, we also collect replies and construct user-interaction graphs to provide richer contextual information, which is lacking in existing datasets. With these datasets, we test traditional methods and Large Language Models (LLMs) on three key tasks: (1) parody detection, (2) comment sentiment analysis with parody, and (3) user sentiment analysis with parody. Our extensive experiments reveal that parody-related tasks still remain challenging for all models, and contextual information plays a critical role. Interestingly, we find that, in certain scenarios, traditional sentence embedding methods combined with simple classifiers can outperform advanced LLMs, i.e. DeepSeek-R1 and GPT-o3, highlighting parody as a significant challenge for LLMs.
- North America > United States (0.14)
- North America > Canada > Quebec (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- (2 more...)
On the Modularity of Hypernetworks
In the context of learning to map an input I to a function h_I:\mathcal{X}\to \mathbb{R}, two alternative methods are compared: (i) an embedding-based method, which learns a fixed function in which I is encoded as a conditioning signal e(I) and the learned function takes the form h_I(x) q(x,e(I)), and (ii) hypernetworks, in which the weights \theta_I of the function h_I(x) g(x;\theta_I) are given by a hypernetwork f as \theta_I f(I) . In this paper, we define the property of modularity as the ability to effectively learn a different function for each input instance I . For this purpose, we adopt an expressivity perspective of this property and extend the theory of \cite{devore} and provide a lower bound on the complexity (number of trainable parameters) of neural networks as function approximators, by eliminating the requirements for the approximation method to be robust. Our results are then used to compare the complexities of q and g, showing that under certain conditions and when letting the functions e and f be as large as we wish, g can be smaller than q by orders of magnitude. Besides, we show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces
In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy. Most prior works attribute this poor performance to the low-dimensional bottleneck in embedding-based methods. In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. These findings motivate us to investigate novel data augmentation and regularization techniques to mitigate overfitting. To this end, we propose GLaS, a new regularizer for embedding-based neural network approaches.
Brevity is the soul of wit: Pruning long files for code generation
Singh, Aaditya K., Yang, Yu, Tirumala, Kushal, Elhoushi, Mostafa, Morcos, Ari S.
Data curation is commonly considered a "secret-sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internetscale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic-- pruning long files--outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, begging the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files) provides a compelling heuristic-based method for data pruning, and brings to light questions in how we evaluate code generation models.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.83)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints
Song, Ran, He, Shizhu, Gao, Shengxiang, Cai, Li, Liu, Kang, Yu, Zhengtao, Zhao, Jun
Multilingual Knowledge Graph Completion (mKGC) aim at solving queries like (h, r, ?) in different languages by reasoning a tail entity t thus improving multilingual knowledge graphs. Previous studies leverage multilingual pretrained language models (PLMs) and the generative paradigm to achieve mKGC. Although multilingual pretrained language models contain extensive knowledge of different languages, its pretraining tasks cannot be directly aligned with the mKGC tasks. Moreover, the majority of KGs and PLMs currently available exhibit a pronounced English-centric bias. This makes it difficult for mKGC to achieve good results, particularly in the context of low-resource languages. To overcome previous problems, this paper introduces global and local knowledge constraints for mKGC. The former is used to constrain the reasoning of answer entities, while the latter is used to enhance the representation of query contexts. The proposed method makes the pretrained model better adapt to the mKGC task. Experimental results on public datasets demonstrate that our method outperforms the previous SOTA on Hits@1 and Hits@10 by an average of 12.32% and 16.03%, which indicates that our proposed method has significant enhancement on mKGC.
- Asia > Middle East > Iraq (0.05)
- Asia > China > Yunnan Province > Kunming (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (19 more...)