weakly supervised data
A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media
Nguyen, Dung Ha, Nguyen, Anh Thi Hoang, Van Nguyen, Kiet
This study introduces an innovative automatic labeling framework to address the challenges of lexical normalization in social media texts for low-resource languages like Vietnamese. Social media data is rich and diverse, but the evolving and varied language used in these contexts makes manual labeling labor-intensive and expensive. To tackle these issues, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach enhances the quality of the training dataset and expands its size while minimizing manual labeling effort. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms, thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when utilizing Pre-trained Language Models. The proposed framework achieves an F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. Additionally, it effectively handles undiacritized text under various conditions. This framework significantly enhances natural language normalization quality and improves the accuracy of various NLP tasks, leading to an average accuracy increase of 1-3%.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (0.67)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision
Lin, Qingwen, Xu, Boyan, Huang, Zhengting, Cai, Ruichu
Addressing the challenge of high annotation costs in solving Math Word Problems (MWPs) through full supervision with intermediate equations, recent works have proposed weakly supervised task settings that rely solely on the final answer as a supervision signal. Existing leading approaches typically employ various search techniques to infer intermediate equations, but cannot ensure their semantic consistency with natural language descriptions. The rise of Large Language Models (LLMs) like ChatGPT has opened up new possibilities for addressing MWPs directly. However, the computational demands of LLMs make them less than ideal for use in settings where resources are tight. In light of these challenges, we introduce an innovative two-stage framework that adeptly transfers mathematical expertise from large to tiny language models. In the \emph{Distillation Stage}, we propose a series of extraction processes that satisfy the properties of MWPs to distill mathematical knowledge from LLMs and construct the problem-equation pairs required for supervised training. In the \emph{Refinement Stage}, because the knowledge distillation method cannot guarantee the full utilization of all data, we further make effective use of the unsuccessfully searched data through the Knowledge Refine method. Finally, we train a small model using the distilled data generated by the two-stage method. As our method fully leverages semantic understanding capabilities while searching for 'problem-equation' pairs, it demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods, while maintaining a much lower computational cost than ChatGPT.
- North America > United States (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Research Report (0.82)
- Workflow (0.68)
HierCat: Hierarchical Query Categorization from Weakly Supervised Data at Facebook Marketplace
He, Yunzhong, Zhang, Cong, Kong, Ruoyan, Kulkarni, Chaitanya, Liu, Qing, Gandhe, Ashish, Nithianandan, Amit, Prakash, Arul
Query categorization at customer-to-customer e-commerce platforms like Facebook Marketplace is challenging due to the vagueness of search intent, noise in real-world data, and imbalanced training data across languages. Its deployment also needs to consider challenges in scalability and downstream integration in order to translate modeling advances into better search result relevance. In this paper we present HierCat, the query categorization system at Facebook Marketplace. HierCat addresses these challenges by leveraging multi-task pre-training of dual-encoder architectures with a hierarchical inference step to effectively learn from weakly supervised training data mined from searcher engagement. We show that HierCat not only outperforms popular methods in offline experiments, but also leads to a 1.4% improvement in NDCG and a 4.3% increase in searcher engagement at Facebook Marketplace Search in online A/B testing.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
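The HierCat abstract above reports its online gain in NDCG. As a reminder of what that metric measures, here is a minimal sketch of NDCG for a single ranked result list. It assumes the linear-gain variant with a log2 position discount; the abstract does not say which variant or cutoff the authors used, and the function names `dcg` and `ndcg` are illustrative only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(position + 1)."""
    # rank is 0-based, so the 1-based position is rank + 1 and the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking.

    `relevances` lists graded relevance labels in ranked order; `k` is an
    optional cutoff. Both are assumptions of this sketch.
    """
    ranked = list(relevances)
    if k is not None:
        ranked = ranked[:k]
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    # A list with no relevant items has no meaningful ideal ranking; return 0.0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that is already sorted by relevance scores 1.0, and any misordering lowers the score, with mistakes near the top of the list penalized most.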
Speech-to-Speech Translation For A Real-world Unwritten Language
Chen, Peng-Jen, Tran, Kevin, Yang, Yilin, Du, Jingfei, Kao, Justine, Chung, Yu-An, Tomasello, Paden, Duquenne, Paul-Ambroise, Schwenk, Holger, Gong, Hongyu, Inaguma, Hirofumi, Popuri, Sravya, Wang, Changhan, Pino, Juan, Hsu, Wei-Ning, Lee, Ann
We study speech-to-speech translation (S2ST), which translates speech from one language into another, and focus on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection and modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling side, we take advantage of recent advances in applying self-supervised discrete representations as targets for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation .
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Taiwan (0.05)
- Europe > Germany > Saxony > Leipzig (0.04)
OpenStance: Real-world Zero-shot Stance Detection
Xu, Hanzi, Vucetic, Slobodan, Yin, Wenpeng
Prior studies of zero-shot stance detection identify the attitude of texts towards unseen topics occurring in the same document corpus. Such a task formulation has three limitations: (i) single domain/dataset: a system is optimized on a particular dataset from a single domain, so the resulting system cannot work well on other datasets; (ii) the model is evaluated on a limited number of unseen topics; (iii) it is assumed that some of the topics have rich annotations, which might be impossible in real-world applications. These drawbacks lead to an impractical stance detection system that fails to generalize to open domains and open-form topics. This work defines OpenStance: open-domain zero-shot stance detection, aiming to handle stance detection in an open world with neither domain constraints nor topic-specific annotations. The key challenge of OpenStance lies in open-domain generalization: learning a system with fully unspecific supervision that is nevertheless capable of generalizing to any dataset. To solve OpenStance, we propose to combine indirect supervision, from textual entailment datasets, and weak supervision, from data generated automatically by pre-trained Language Models. Our single system, without any topic-specific supervision, outperforms the supervised method on three popular datasets. To our knowledge, this is the first work that studies stance detection under the open-domain zero-shot setting. All data and code are publicly released.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Dominican Republic (0.05)
- Media (0.46)
- Government (0.46)
Learning From Weakly Supervised Data by The Expectation Loss SVM (e-SVM) algorithm
Zhu, Jun, Mao, Junhua, Yuille, Alan L.
In many situations we have some measurement of confidence on positiveness for a binary label. The "positiveness" is a continuous value whose range is a bounded interval. We propose a novel learning algorithm called \emph{expectation loss SVM} (e-SVM) that is devoted to the problems where only the "positiveness" instead of a binary label of each training sample is available. Our e-SVM algorithm can also be readily extended to learn segment classifiers under weak supervision where the exact positiveness value of each training example is unobserved. In experiments, we show that the e-SVM algorithm can effectively address the segment proposal classification task under both strong supervision (e.g. the pixel-level annotations are available) and the weak supervision (e.g.
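One way to read the name "expectation loss" in the abstract above is as a hinge loss averaged over the two possible labels, weighted by the positiveness. The sketch below illustrates that reading only; it is not taken from the paper. It assumes the positiveness has been rescaled to [0, 1] and is used as the weight of label +1. The names `hinge`, `expectation_hinge_loss`, and `objective`, and the regularization parameter `reg`, are hypothetical.

```python
def hinge(margin: float) -> float:
    """Standard hinge loss max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)


def expectation_hinge_loss(score: float, positiveness: float) -> float:
    """Hinge loss averaged over the labels +1 and -1.

    `score` is the linear classifier output w.x + b. `positiveness` in [0, 1]
    is treated as the weight of label +1 (an assumption of this sketch).
    """
    if not 0.0 <= positiveness <= 1.0:
        raise ValueError("positiveness must lie in [0, 1]")
    return positiveness * hinge(score) + (1.0 - positiveness) * hinge(-score)


def objective(weights, bias, samples, reg=1.0):
    """Regularized objective: 0.5 * reg * ||w||^2 plus the mean expectation loss.

    `samples` is a list of (features, positiveness) pairs.
    """
    penalty = 0.5 * reg * sum(w * w for w in weights)
    total = 0.0
    for features, positiveness in samples:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        total += expectation_hinge_loss(score, positiveness)
    return penalty + total / len(samples)
```

Under this reading, a positiveness of 1 or 0 recovers the ordinary hinge loss for a positive or negative example, while intermediate values penalize confident scores in either direction.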
Billion-scale semi-supervised learning for state-of-the-art image and video classification
Accurate image and video classification is important for a wide range of computer vision applications, from identifying harmful content, to making products more accessible to the visually impaired, to helping people more easily buy and sell things on products like Marketplace. Facebook AI is developing alternative ways to train our AI systems so that we can do more with less labeled training data overall, and also deliver accurate results even when large, high-quality labeled data sets are simply not available. Today, we are sharing details on a versatile new model training technique that delivers state-of-the-art accuracy for image and video classification systems. This approach, which we call semi-weak supervision, is a new way to combine the merits of two different training methods: semi-supervised learning and weakly supervised learning. It opens the door to creating more accurate, efficient production classification models by using a teacher-student model training paradigm and billion-scale weakly supervised data sets.
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
- Information Technology > Artificial Intelligence > Vision > Video Understanding (0.83)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.76)
A General Formulation for Safely Exploiting Weakly Supervised Data
Guo, Lan-Zhe (Nanjing University) | Li, Yu-Feng (Nanjing University)
Weakly supervised data is an important kind of machine learning data that helps improve learning performance. However, recent results indicate that machine learning techniques that use weakly supervised data may sometimes lead to performance degradation. How to safely leverage weakly supervised data has become an important issue, yet there has been only very limited effort on it, especially on a general formulation that provides insight into safe weakly supervised learning. In this paper we present a scheme that builds the final prediction results by integrating several weakly supervised learners. Our resultant formulation brings two implications: i) it has safeness guarantees for the commonly used convex loss functions in both regression and classification tasks of weakly supervised learning; ii) it can flexibly embed uncertain prior knowledge about the importance of base learners. Moreover, our formulation can be solved globally and efficiently by a simple convex quadratic program or linear program. Experiments on multiple weakly supervised learning tasks such as label noise learning, domain adaptation and semi-supervised learning validate the effectiveness of our proposed algorithms.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)