A Bias Metrics
Nine different debiasing algorithms (and a baseline) have been evaluated on this dataset using the popular ResNet-18 network [36]. CelebA contains faces of celebrities with several binary task labels and two protected labels (gender and youth). Table 3 shows the prediction results from a biased binary classifier and its bias values under the seven metrics. Without loss of generality, we consider "Sport" the positive class in the binary classifier. Following the DP formula in Appendix A.2, for the "Sport" class, the PPR_female is 45.0% (90/200), and the PPR_male is 65.0%
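The PPR computation above can be sketched in a few lines. This is a minimal illustration of the demographic-parity comparison the passage describes, not the paper's implementation; the counts reuse the "Sport" example (90 positives out of 200 for the female group, 65% for the male group), and the absolute gap used as the DP score is an assumption about how the groups are compared.

```python
# Sketch of the demographic-parity (DP) check described above: PPR is the
# positive-prediction rate per protected group, and DP compares the groups.

def positive_prediction_rate(positives: int, total: int) -> float:
    """Fraction of a group's members predicted as the positive class."""
    return positives / total

ppr_female = positive_prediction_rate(90, 200)  # 0.45, as in the text
ppr_male = 0.65                                 # given directly in the text

# One common DP score is the absolute gap between group rates (0 = parity).
dp_gap = abs(ppr_female - ppr_male)

print(f"PPR female: {ppr_female:.1%}")
print(f"PPR male:   {ppr_male:.1%}")
print(f"DP gap:     {dp_gap:.1%}")
```

With the counts from the example, the gap comes out to 20 percentage points, which is what a DP-style metric would flag as bias toward the male group for the "Sport" class.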
A Creepy New Device Is Spreading Across School Campuses. Students Are Being Harassed. Teachers Are Sounding the Alarm.
Meta's A.I. Smart Glasses Are Wreaking Havoc in Schools Across the Country. It's Only Going to Get Worse. As the discreet wearable cameras become more popular, students are saying they feel constantly watched and harassed, and professors are reshaping their classrooms in response. Joziah was tabling on campus for his peer mentor job at the end of last semester at Florida State University when he noticed something strange happening across the quad: A trio of men, wearing Meta AI glasses, were stopping every young woman who passed by and asking them for their social media contacts. "I recognized them from TikTok, because they're kind of big, especially in Miami," the 19-year-old told me.
- Marketing (1.00)
- Education > Educational Setting (0.94)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.94)
- Information Technology > Services (0.64)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (7 more...)
Ben & Jerry's row deepens as three board members removed
Three members of Ben & Jerry's independent board will no longer be eligible to serve in their roles, after the ice cream company introduced a new set of governance practices. These include a nine-year limit on board members' terms. Chair Anuradha Mittal, who earlier said she had no plans to resign under pressure, is among those affected. The move was criticised by the company's co-founder Ben Cohen, who called it a blatant power grab designed to strip the board of legal authority and independence. His remarks are the latest in a long-running row between Ben & Jerry's and its owner over the Cherry Garcia maker's social activism and the continued independence of its board.
- North America > Central America (0.15)
- Oceania > Australia (0.06)
- North America > United States > Vermont (0.06)
- (16 more...)
- Leisure & Entertainment (0.75)
- Law (0.69)
- Energy (0.50)
- (3 more...)
Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
Piot, Paloma, Otero, David, Martín-Rodilla, Patricia, Parapar, Javier
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $κ$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e., whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
- Europe > Austria > Vienna (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
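The agreement statistic the abstract critiques, Cohen's κ, corrects raw agreement for the agreement expected by chance. A minimal sketch for two raters over the same items (the "human" and "llm" label lists below are made-up illustrative data, not from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    n = len(a)
    # Observed agreement: fraction of items where the raters assign the same label.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Two raters agreeing on 8 of 10 binary hate-speech labels:
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
llm   = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(cohens_kappa(human, llm))  # 0.6: 80% raw agreement, 50% expected by chance
```

The example illustrates the abstract's point: κ compresses all disagreement into a single chance-corrected number, with no way to distinguish noise from genuine subjective divergence — which is what motivates the xRR framework.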
Ben & Jerry's brand could be destroyed, says co-founder
Ben & Jerry's will be destroyed as a brand if it remains with parent company Magnum, the company's co-founder Ben Cohen has told the BBC. His remarks are the latest in a long-running spat between the ice cream brand and its parent company over its ability to express its social activism and the continued independence of its board. The comments came on the day that the Magnum Ice Cream Company (TMICC) started trading on the European stock market - spinning off from owner Unilever. A spokesperson for Magnum said the firm wanted to build and strengthen Ben & Jerry's powerful, non-partisan values-based position in the world. Ben & Jerry's was sold to Unilever in 2000 in a deal which allowed it to retain an independent board and the right to make decisions about its social mission.
- Asia > Middle East > Israel (0.16)
- North America > Central America (0.15)
- Asia > China (0.06)
- (18 more...)
- Consumer Products & Services (1.00)
- Leisure & Entertainment (0.98)
- Media > Film (0.48)
- Government > Regional Government (0.48)
Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
Platt, Nolan, Luchs, Ethan, Nizamani, Sehrish
Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, with three independent evaluations per site, we generated over 850 heuristic evaluations using a pipeline built on OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
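The "average pairwise" consistency numbers quoted above come from comparing the three independent evaluation runs against each other in all pairs. A minimal sketch of that averaging scheme, shown here for exact agreement (the three severity-rating runs below are hypothetical data, and the paper's exact aggregation may differ):

```python
from itertools import combinations

def exact_agreement(a, b):
    """Fraction of items where two runs assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_pairwise(runs, metric):
    """Average a two-run metric over every pair of runs
    (with 3 runs per site, that is 3 pairs)."""
    pairs = list(combinations(runs, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical severity ratings (0-4) for the same five usability issues:
runs = [
    [0, 2, 3, 1, 4],
    [0, 2, 2, 1, 4],
    [0, 3, 3, 1, 4],
]
print(mean_pairwise(runs, exact_agreement))  # (0.8 + 0.8 + 0.6) / 3 ≈ 0.733
```

The same `mean_pairwise` scaffold works for any two-run metric — substituting a Cohen's Kappa function for `exact_agreement` yields the average pairwise Kappa the abstract reports.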
A robust generalizable device-agnostic deep learning model for sleep-wake determination from triaxial wrist accelerometry
Montazeri, Nasim, Yang, Stone, Luszczynski, Dominik, Zhang, John, Gurve, Dharmendra, Centen, Andrew, Goubran, Maged, Lim, Andrew
Study Objectives: Wrist accelerometry is widely used for inferring sleep-wake state. Previous works demonstrated poor wake detection and lacked cross-device generalizability and validation across different age ranges and sleep disorders. We developed a robust deep learning model to detect sleep-wakefulness from triaxial accelerometry and evaluated its validity across three devices and in a large adult population spanning a wide range of ages with and without sleep disorders. Methods: We collected wrist accelerometry simultaneous to polysomnography (PSG) in 453 adults undergoing clinical sleep testing at a tertiary care sleep laboratory, using three devices. We extracted features in 30-second epochs and trained a 3-class model to detect wake, sleep, and sleep with arousals, which was then collapsed into wake vs. sleep using a decision tree. To enhance wake detection, the model was specifically trained on randomly selected subjects with low sleep efficiency and/or high arousal index from one device recording and then tested on the remaining recordings. Results: The model showed high performance, with an F1 score of 0.86, sensitivity (sleep) of 0.87, and specificity (wakefulness) of 0.78, and moderate, significant correlations with PSG in predicting total sleep time (R=0.69) and sleep efficiency (R=0.63). Model performance was robust to the presence of sleep disorders, including sleep apnea and periodic limb movements in sleep, and was consistent across all three accelerometer models. Conclusions: We present a deep learning model to detect sleep-wakefulness from actigraphy in adults with relative robustness to the presence of sleep disorders and generalizability across diverse commonly used wrist accelerometers.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Israel (0.04)
- Oceania > Australia > Victoria (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.71)
- Health & Medicine > Therapeutic Area > Sleep (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
- Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.47)
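The sensitivity/specificity/F1 figures in the sleep-wake abstract are per-epoch classification metrics with sleep as the positive class. A minimal sketch of how they are computed over 30-second epochs (the `truth`/`pred` sequences are made-up toy data, not from the study):

```python
def epoch_metrics(y_true, y_pred, positive="sleep"):
    """Sensitivity (sleep recall), specificity (wake recall), and F1
    computed per epoch, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)          # how much true sleep was caught
    specificity = tn / (tn + fp)          # how much true wake was caught
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Ten toy 30-second epochs: one missed sleep epoch, one missed wake epoch.
truth = ["sleep"] * 6 + ["wake"] * 4
pred  = ["sleep"] * 5 + ["wake"] + ["wake"] * 3 + ["sleep"]
print(epoch_metrics(truth, pred))
```

Because wake epochs are the minority class in a full night of recording, specificity (wake detection) is the harder number to drive up — which is why the study's targeted training on low-sleep-efficiency subjects matters.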
Generalizable and Efficient Automated Scoring with a Knowledge-Distilled Multi-Task Mixture-of-Experts
Fang, Luyang, Wang, Tao, Ma, Ping, Zhai, Xiaoming
Automated scoring of written constructed responses typically relies on separate models per task, straining computational resources, storage, and maintenance in real-world education settings. We propose UniMoE-Guided, a knowledge-distilled multi-task Mixture-of-Experts (MoE) approach that transfers expertise from multiple task-specific large models (teachers) into a single compact, deployable model (student). The student combines (i) a shared encoder for cross-task representations, (ii) a gated MoE block that balances shared and task-specific processing, and (iii) lightweight task heads. Trained with both ground-truth labels and teacher guidance, the student matches strong task-specific models while being far more efficient to train, store, and deploy. Beyond efficiency, the MoE layer improves transfer and generalization: experts develop reusable skills that boost cross-task performance and enable rapid adaptation to new tasks with minimal additions and tuning. On nine NGSS-aligned science-reasoning tasks (seven for training/evaluation and two held out for adaptation), UniMoE-Guided attains performance comparable to per-task models while using $\sim$6$\times$ less storage than maintaining separate students, and $87\times$ less than the 20B-parameter teacher. The method offers a practical path toward scalable, reliable, and resource-efficient automated scoring for classroom and large-scale assessment systems.
- North America > United States (0.28)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Education > Assessment & Standards (1.00)
- Education > Educational Setting (0.93)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.91)
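The abstract's "trained with both ground-truth labels and teacher guidance" describes the standard knowledge-distillation objective: a blend of cross-entropy on gold labels and a divergence between the student's and the teacher's temperature-softened output distributions. The paper's exact loss is not given, so this is a generic sketch of that recipe; the logits, temperature, and blending weight `alpha` are all illustrative assumptions.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax: higher t spreads probability mass out."""
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, true_idx, t=2.0, alpha=0.5):
    """Blend of (a) cross-entropy on the gold label and (b) KL divergence to the
    teacher's softened distribution, scaled by t^2 as in standard distillation."""
    ce = -math.log(softmax(student_logits)[true_idx])   # gold-label term
    ps = softmax(student_logits, t)                     # softened student
    pt = softmax(teacher_logits, t)                     # softened teacher
    kl = sum(q * math.log(q / p) for q, p in zip(pt, ps))
    return alpha * ce + (1 - alpha) * (t * t) * kl

# Illustrative 3-class scoring example: the student is close to the teacher.
loss = distill_loss([2.0, 0.5, -1.0], [1.8, 0.9, -0.5], true_idx=0)
print(loss)
```

In the UniMoE-Guided setting, a loss of this shape would be applied per task head, with the MoE gate routing each response through shared and task-specific experts before the head produces `student_logits`.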