Ma, Yiming
JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community
Xiao, Yunze, He, Tingyu, Wang, Lionel Z., Ma, Yiming, Song, Xingyu, Xu, Xiaohang, Li, Irene, Ng, Ka Chung
This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.
ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design
Ma, Yiming, Ye, Fei, Zhou, Yi, Zheng, Zaixiang, Xue, Dongyu, Gu, Quanquan
Nature creates diverse proteins through a 'divide and assembly' strategy. Inspired by this idea, we introduce ProteinWeaver, a two-stage framework for protein backbone design. Our method first generates individual protein domains and then employs an SE(3) diffusion model to flexibly assemble these domains. A key challenge lies in the assembling step, given the complex and rugged nature of the interdomain interaction landscape. To address this challenge, we employ preference alignment to discern complex relationships between structure and interaction landscapes through comparative analysis of generated samples. Comprehensive experiments demonstrate that ProteinWeaver: (1) generates high-quality, novel protein backbones through versatile domain assembly; (2) outperforms RFdiffusion, the current state-of-the-art in backbone design, by 13% and 39% for long-chain proteins; (3) shows the potential for cooperative function design through illustrative case studies. To sum up, by introducing a 'divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.
Nature employs a sophisticated 'divide and assemble' strategy to create large and intricate protein structures that meet diverse biological functional needs (Figure 1A) (Pawson & Nash, 2003; Huddy et al., 2024; Bagowski et al., 2010). This process primarily involves the recombination of existing structural blocks, particularly protein domains, which serve as the fundamental, recurring units in protein structures. Remarkably, a limited number of protein domains (approximately 500 as classified in CATH) suffice to create hundreds of thousands of structures satisfying a wide array of functions (Orengo et al., 1997). This strategy enables the creation of multi-domain protein backbones, facilitating the emergence of cooperative functions.
However, our analysis reveals a significant limitation: designability decreases markedly as the backbone length increases (Figure 1E).
Inference via robust optimal transportation: theory and methods
Ma, Yiming, Liu, Hang, La Vecchia, Davide, Lerasle, Matthieu
Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are widely applied in statistics and machine learning. In spite of their popularity, inference based on these tools is sensitive to outliers and can perform poorly when the underlying model has heavy tails. To cope with these issues, we introduce a new class of procedures. (i) We consider a robust version of the primal OT problem (ROBOT) and show that it defines the robust Wasserstein distance, $W^{(\lambda)}$, which depends on a tuning parameter $\lambda > 0$. (ii) We illustrate the link between $W_1$ and $W^{(\lambda)}$ and study its key measure-theoretic aspects. (iii) We derive some concentration inequalities for $W^{(\lambda)}$. (iv) We use $W^{(\lambda)}$ to define minimum distance estimators, provide their statistical guarantees, and illustrate how to apply concentration inequalities for the selection of $\lambda$. (v) We derive the dual form of the ROBOT and illustrate its applicability to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our methods.
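The contrast between $W_1$ and a robust variant can be illustrated on a toy sample. The sketch below is not taken from the paper: it assumes that the robust distance can be evaluated as an OT problem with the truncated cost $\min(|x-y|, 2\lambda)$, a form commonly associated with ROBOT but not stated in the abstract. It also assumes two one-dimensional empirical samples of equal size with uniform weights, so that an optimal coupling is a permutation. The function names (`truncated_cost`, `empirical_ot`, `w1`, `robust_w`) are hypothetical, and the brute-force search is only suitable for very small samples.

```python
from itertools import permutations


def truncated_cost(x, y, lam):
    """Assumed robust ground cost: distance capped at 2 * lam (an assumption, see lead-in)."""
    return min(abs(x - y), 2.0 * lam)


def empirical_ot(xs, ys, cost):
    """Exact OT cost between two equal-size, uniformly weighted samples.

    Brute force over all permutations: illustration only, O(n!) time.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("samples must have the same size")
    best = min(
        sum(cost(xs[i], ys[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


def w1(xs, ys):
    """Ordinary 1-Wasserstein distance with cost |x - y|."""
    return empirical_ot(xs, ys, lambda x, y: abs(x - y))


def robust_w(xs, ys, lam):
    """Truncated-cost OT value, used here as a stand-in for W^(lambda)."""
    return empirical_ot(xs, ys, lambda x, y: truncated_cost(x, y, lam))


if __name__ == "__main__":
    clean = [0.0, 1.0, 2.0, 3.0]
    contaminated = [0.0, 1.0, 2.0, 100.0]  # one gross outlier
    print("W1      :", w1(clean, contaminated))
    print("robust W:", robust_w(clean, contaminated, lam=1.0))
```

With the single outlier at 100, the ordinary distance is dominated by that one point, while the capped cost limits its contribution to $2\lambda$ divided by the sample size. In practice an assignment or linear-programming solver would replace the permutation search.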
Entity Personalized Talent Search Models with Tree Interaction Features
Ozcaglar, Cagri, Geyik, Sahin, Schmitz, Brian, Sharma, Prakhar, Shelkovnykov, Alex, Ma, Yiming, Buchanan, Erik
Talent Search systems aim to recommend potential candidates who are a good match to the hiring needs of a recruiter expressed in terms of the recruiter's search query or job posting. Past work in this domain has focused on linear and nonlinear models which lack preference personalization at the user level because they are trained only with globally collected recruiter activity data. In this paper, we propose an entity-personalized Talent Search model which utilizes a combination of generalized linear mixed (GLMix) models and gradient boosted decision tree (GBDT) models, and provides personalized talent recommendations using nonlinear tree interaction features generated by the GBDT. We also present the offline and online system architecture for the productionization of this hybrid model approach in our Talent Search systems. Finally, we provide offline and online experiment results benchmarking our entity-personalized model with tree interaction features, which demonstrate significant improvements in our precision metrics compared to globally trained non-personalized models.
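One common way to obtain tree interaction features from a GBDT is to record which leaf each tree routes an example to and feed those leaf indicators into a linear model. The sketch below illustrates that idea together with a GLMix-style score that adds a per-recruiter correction to a global weight. It is an assumption-laden toy, not the system described in the abstract: the two trees are hand-written rather than trained, and the feature names (`title_match`, `years_exp`, `skill_overlap`), the weights, and all function names are hypothetical.

```python
# A tree is either a leaf id (int) or a tuple (feature, threshold, left, right).
# These two trees are hand-written stand-ins for a trained GBDT (hypothetical).
TREES = [
    ("title_match", 0.5, 0, ("years_exp", 5.0, 1, 2)),
    ("skill_overlap", 0.3, 0, 1),
]


def leaf_index(tree, x):
    """Route example x down one tree and return the id of the leaf it reaches."""
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree


def tree_interaction_features(x):
    """One active indicator per tree, naming the leaf that x falls into."""
    return [f"tree{t}_leaf{leaf_index(tree, x)}" for t, tree in enumerate(TREES)]


def glmix_score(x, recruiter_id, global_w, per_recruiter_w):
    """Linear score: global weight plus a recruiter-specific correction per active leaf."""
    local_w = per_recruiter_w.get(recruiter_id, {})
    return sum(
        global_w.get(f, 0.0) + local_w.get(f, 0.0)
        for f in tree_interaction_features(x)
    )


if __name__ == "__main__":
    candidate = {"title_match": 1.0, "years_exp": 7.0, "skill_overlap": 0.2}
    global_w = {"tree0_leaf2": 1.0, "tree1_leaf0": -0.5}
    per_recruiter_w = {"r1": {"tree0_leaf2": 0.25}}  # r1 values experience more
    print(tree_interaction_features(candidate))
    print(glmix_score(candidate, "r1", global_w, per_recruiter_w))
    print(glmix_score(candidate, "r2", global_w, per_recruiter_w))
```

Because each leaf corresponds to a conjunction of threshold tests, a linear weight on a leaf indicator acts as a weight on a nonlinear feature interaction. A recruiter with no personalized weights (here `r2`) falls back to the global score.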