Professionalism
A Dynamic Fusion Model for Consistent Crisis Response
Song, Xiaoying, Anik, Anirban Saha, Blanco, Eduardo, Frias-Martinez, Vanessa, Hong, Lingzi
In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.
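The abstract does not give the style-consistency metric itself, but the idea of scoring candidate responses and preferring those closest to a shared style can be sketched. The following is a toy illustration, not the paper's method: each candidate is mapped to an invented two-feature style vector, and the candidate nearest the group's style centroid is selected.

```python
# Hypothetical sketch of instance-level style scoring. The features
# (average word length, sentence count) are illustrative stand-ins,
# not the paper's actual style representation.

def style_features(text: str) -> list[float]:
    """Toy style features: average word length and sentence count."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    sentence_count = max(text.count(".") + text.count("!") + text.count("?"), 1)
    return [avg_word_len, float(sentence_count)]

def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def most_consistent(candidates: list[str]) -> str:
    """Return the candidate whose style is closest to the group centroid."""
    feats = [style_features(c) for c in candidates]
    c = centroid(feats)
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, c))
    return min(candidates, key=lambda s: dist(style_features(s)))
```

Under this sketch, a stylistic outlier (e.g. an all-caps exclamation among calm advisories) sits far from the centroid and is never selected, which is the intuition behind reducing stylistic variation between instances.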
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Zhuang, Xinlin, Peng, Jiahui, Ma, Ren, Wang, Yinfan, Bai, Tianyi, Wei, Xingjian, Qiu, Jiantao, Zhang, Chi, Qian, Ying, He, Conghui
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
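The core mechanism described above, fitting a regression from quality scores to proxy-model validation loss and then ranking candidate mixtures by predicted loss, can be sketched as follows. All numbers and mixture names here are invented for illustration; only the four dimension names come from the abstract.

```python
# Hypothetical sketch of the Meta-rater idea: fit a least-squares
# regression from per-dimension quality scores of a data mixture to the
# validation loss observed with a proxy model, then rank candidate
# mixtures by predicted loss. Data values are illustrative, not real.
import numpy as np

# Rows: proxy-model runs; columns: mean mixture scores for
# [professionalism, readability, reasoning, cleanliness].
quality = np.array([
    [0.2, 0.5, 0.1, 0.6],
    [0.7, 0.6, 0.5, 0.8],
    [0.9, 0.4, 0.8, 0.9],
    [0.4, 0.9, 0.3, 0.5],
])
val_loss = np.array([3.1, 2.6, 2.3, 2.9])  # observed proxy validation losses

# Least-squares fit: predicted_loss = quality @ w[:4] + w[4]
X = np.hstack([quality, np.ones((len(quality), 1))])
w, *_ = np.linalg.lstsq(X, val_loss, rcond=None)

def predicted_loss(scores) -> float:
    """Predict validation loss for a candidate mixture's quality scores."""
    return float(np.append(scores, 1.0) @ w)

# Rank candidate mixtures by predicted validation loss (lower is better).
candidates = {"mix_a": [0.8, 0.7, 0.7, 0.9], "mix_b": [0.3, 0.5, 0.2, 0.4]}
best = min(candidates, key=lambda k: predicted_loss(candidates[k]))
```

The design choice worth noting is that the regression target is validation loss rather than any single quality score, which is what lets the method learn how to weight the dimensions against one another.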
ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing
Wang, Jinzhi, Peng, Qingke, Li, Haozhou, Zeng, Zeyuan, Song, Qinfeng, Yang, Kaixuan, Zhang, Jiangbo, Wang, Yaoying, Li, Ruimeng, Zhou, Biyi
Electric power marketing telephone customer service primarily communicates with customers via phone calls to understand their electricity usage needs, provide consultations, process service applications, and handle complaints [1]. Ensuring timely and effective responses is essential throughout the service process. However, current systems (e.g., 95598, the customer service hotline of State Grid Corporation of China) often suffer from poor user experience, delayed responses, and inaccurate information [2][3]. These traditional systems rely heavily on fixed procedures and templates, lacking the flexibility to address complex and diverse customer demands. This limitation is particularly pronounced in the highly specialized field of electric power marketing, where slow response times and insufficiently tailored solutions negatively impact service quality. Although human agents can complement these systems by managing more complex issues, they also face significant challenges, such as high workloads during peak periods, delayed response times, and inconsistent levels of professional knowledge and expertise. As a result, it is difficult to guarantee consistent and high-quality service for all customers.
Modeling Professionalism in Expert Questioning through Linguistic Differentiation
D'Agostino, Giulia, Chen, Chung-Chi
Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
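The abstract's claim that interpretable linguistic features alone can separate expert-authored from generated questions can be illustrated with a minimal classifier. This is an assumed sketch, not the paper's model: a nearest-centroid classifier over hand-coded feature vectors whose dimensions (discourse regulators, prefaces, request types) are named in the abstract, with all counts invented.

```python
# Toy nearest-centroid classifier over interpretable linguistic
# feature vectors [discourse_regulators, prefaces, request_types].
# Feature dimensions follow the abstract; the counts are illustrative.

def nearest_centroid_fit(samples: dict[str, list[list[float]]]) -> dict[str, list[float]]:
    """Return one mean feature vector (centroid) per class label."""
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in samples.items()
    }

def classify(centroids: dict[str, list[float]], vec: list[float]) -> str:
    """Assign vec to the class with the nearest centroid."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: dist(centroids[label]))
```

Even this crude model shows why a feature-based classifier is attractive here: every decision can be traced back to named, countable properties of the question.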
A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation
Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Overla, Seth, Halse, Shane
Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and to provide assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. The detailed design and implementation are open-sourced here.
LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic
Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Nguyen, Tri, Halse, Shane
Clinical communication skills are critical in medical education, yet practicing and assessing them at scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated, scalable clinical evaluation that follows nuanced physician judgment remains difficult. This paper combines fuzzy logic with large language models (LLMs) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning the automated evaluation of medical students' clinical skills with physicians' subjective preferences. LLM-as-a-Fuzzy-Judge is an approach in which an LLM is fine-tuned to evaluate medical students' utterances within student-AI patient conversation scripts, based on human annotations over four fuzzy sets: Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology proceeds from data collection in the LLM-powered medical education system to data annotation based on these multidimensional fuzzy sets, followed by prompt engineering and supervised fine-tuning (SFT) of pre-trained LLMs using the human annotations. The results show that LLM-as-a-Fuzzy-Judge achieves over 80% accuracy, with over 90% on major criteria items, effectively leveraging fuzzy logic and LLMs to deliver interpretable, human-aligned assessment. This work suggests the viability of combining fuzzy logic and LLMs to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository of this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge
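The fuzzy-set aggregation step described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: membership values for the four named fuzzy sets are supplied directly (in the paper they come from the fine-tuned LLM), and a standard fuzzy AND (minimum) combines them, with Contextual Distraction complemented first since it counts against the utterance.

```python
# Assumed aggregation of fuzzy memberships into one acceptability score.
# The four criteria names come from the abstract; the min-based
# aggregation is a standard fuzzy-logic choice, not confirmed by the paper.

def judge(memberships: dict[str, float]) -> float:
    """Combine per-criterion fuzzy memberships via fuzzy AND (min).

    Contextual Distraction is a negative criterion, so its membership
    is complemented (1 - v) before aggregation.
    """
    adjusted = {
        name: (1.0 - value if name == "Contextual Distraction" else value)
        for name, value in memberships.items()
    }
    return min(adjusted.values())
```

Using min means a single weak criterion (e.g. an ethically problematic but otherwise fluent utterance) caps the overall score, which mirrors how a human rater would treat a disqualifying flaw.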
ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation
This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: https://github.com/peiyun2260/ScoreRAG.
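The rerank-and-filter stage described above can be sketched in a few lines. This is an assumed outline, not ScoreRAG's actual code: each retrieved document arrives with a consistency-relevance score (in ScoreRAG these come from LLM evaluations), documents are reranked by score, and low-scoring items are dropped before summarization.

```python
# Hypothetical sketch of ScoreRAG's rerank-and-filter stage. The
# threshold value and the (text, score) representation are assumptions.

def rerank_and_filter(docs: list[tuple[str, float]],
                      threshold: float = 0.5) -> list[str]:
    """Sort documents by consistency-relevance score (descending) and
    drop those scoring below the threshold."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    return [text for text, score in ranked if score >= threshold]
```

The surviving, ordered documents would then feed the graded-summarization step, so the quality gate happens before any text generation.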
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Chiu, Yu Ying, Wang, Zhilin, Maiya, Sharan, Choi, Yejin, Fish, Kyle, Levine, Sydney, Hubinger, Evan
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
In Pursuit of Professionalism
Robin K. Hill
Is Computer Science a Profession? We computer scientists--many of us--like to think of ourselves as professionals, as do doctors, lawyers, police officers, and accountants. But there are definitions of "profession," with criteria and expectations, that we fail to meet. Are we ready, collectively, to confront the criteria? Do we want to be card-carrying members of a learned institution of service?
Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
Huang, Saffron, Durmus, Esin, McCain, Miles, Handa, Kunal, Tamkin, Alex, Hong, Jerry, Stern, Michael, Somani, Arushi, Zhang, Xiuruo, Ganguli, Deep
AI assistants can impart value judgments that shape people's decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like "moral nihilism". While some values appear consistently across contexts (e.g. "transparency"), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, "harm prevention" emerges when Claude resists users, "historical accuracy" when responding to queries about controversial events, "healthy boundaries" when asked for relationship advice, and "human agency" in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.