evaluator
- Europe > France (0.04)
- Asia > India > West Bengal (0.04)
- Africa > Nigeria (0.04)
- (2 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > India > West Bengal (0.04)
- Asia > China (0.04)
- (5 more...)
- Health & Medicine (0.67)
- Leisure & Entertainment (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States > Delaware (0.14)
- North America > United States > Massachusetts (0.04)
- North America > United States > Maryland (0.04)
- (9 more...)
- Transportation > Ground > Road (1.00)
- Health & Medicine (1.00)
- Transportation > Infrastructure & Services (0.95)
- (2 more...)
- North America > Canada (0.14)
- Asia > India (0.04)
- Asia > Middle East > Jordan (0.04)
- (11 more...)
- Transportation > Air (1.00)
- Media > News (1.00)
- Law (1.00)
- (4 more...)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans?
Automatic Essay Scoring and Feedback Generation in Basque Language Learning
Azurmendi, Ekhi, Arregi, Xabier, de Lacalle, Oier Lopez
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Spain > Basque Country (0.04)
- Europe > Faroe Islands > Streymoy > Tórshavn (0.04)
- Education > Assessment & Standards > Student Performance (0.73)
- Education > Curriculum > Subject-Specific Education (0.50)
Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework
Ganesaraja, Ramprasath, N, Swathika, AP, Saravanan, Rathinasamy, Kamalkumar, Amancharla, Chetana, Das, Rahul, Panse, Sahil Dilip, Batwe, Aditya, Vijayan, Dileep, Ashok, Veena, P, Thanushree A, Rao, Kausthubh J, Olivero, Alden, Roshan, null, Manthena, Rajeshwar Reddy, A, Asmitha Yuga Sre, Tripathi, Harsh, Selvaraj, Suganya, Chin, Vito, Bhaskar, Kasthuri Rangan, Bhaskar, Kasthuri Rangan, R, Venkatraman, Vijayakumar, Sajit
We present AI4UI, a framework of autonomous front-end development agents purpose-built to meet the rigorous requirements of enterprise-grade application delivery. Unlike general-purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human-in-the-loop involvement: at the design stage, developers embed a Gen-AI-friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post processing stage, domain experts refine outputs for nuanced design adjustments, domain-specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering-ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain-aware knowledge graphs, a secure abstract/package code integration strategy, expertise driven architecture templates, and a change-oriented workflow coordinated by specialized agent roles. In large-scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code-review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (1.00)
- Research Report > Experimental Study (0.34)
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Jwa, Seungyeon, Ahn, Daechul, Kim, Reokyoung, Kang, Dongyeop, Choi, Jonghyun
Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
- North America > United States > Minnesota (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)