 evaluator







CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses

Neural Information Processing Systems

The rapid progress in Large Language Models (LLMs) poses potential risks such as generating unethical content. Assessing the values embedded in LLMs' generated responses can help expose their misalignment, but this relies on reference-free value evaluators, e.g.


Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

Neural Information Processing Systems

Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans?


Automatic Essay Scoring and Feedback Generation in Basque Language Learning

Azurmendi, Ekhi, Arregi, Xabier, de Lacalle, Oier Lopez

arXiv.org Artificial Intelligence

This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.


Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework

Ganesaraja, Ramprasath, N, Swathika, AP, Saravanan, Rathinasamy, Kamalkumar, Amancharla, Chetana, Das, Rahul, Panse, Sahil Dilip, Batwe, Aditya, Vijayan, Dileep, Ashok, Veena, P, Thanushree A, Rao, Kausthubh J, Olivero, Alden, Roshan, Manthena, Rajeshwar Reddy, A, Asmitha Yuga Sre, Tripathi, Harsh, Selvaraj, Suganya, Chin, Vito, Bhaskar, Kasthuri Rangan, R, Venkatraman, Vijayakumar, Sajit

arXiv.org Artificial Intelligence

We present AI4UI, a framework of autonomous front-end development agents purpose-built to meet the rigorous requirements of enterprise-grade application delivery. Unlike general-purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness, delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human-in-the-loop involvement: at the design stage, developers embed a Gen-AI-friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post-processing stage, domain experts refine outputs for nuanced design adjustments, domain-specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering-ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain-aware knowledge graphs, a secure abstract/package code integration strategy, expertise-driven architecture templates, and a change-oriented workflow coordinated by specialized agent roles. In large-scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code-review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders, demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline


Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

Jwa, Seungyeon, Ahn, Daechul, Kim, Reokyoung, Kang, Dongyeop, Choi, Jonghyun

arXiv.org Artificial Intelligence

Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
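
The selective update loop described in this abstract can be illustrated with a rough sketch. This is not the authors' implementation. It assumes, purely for illustration, that a case counts as "self-inconsistent" when the judge's pairwise verdict changes after the two responses are swapped; the abstract does not define the criterion. The names `selective_lwe`, `instruct`, `judge`, and `refine` are hypothetical, and the three callables are toy stand-ins for LLM calls.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (question, response_a, response_b)


def selective_lwe(
    pairs: List[Pair],
    instruct: Callable[[str, str], str],    # (meta_prompt, question) -> instructions
    judge: Callable[[str, str, str], str],  # (instructions, first, second) -> "first" | "second"
    refine: Callable[[str, Pair], str],     # (meta_prompt, pair) -> updated meta_prompt
    meta_prompt: str,
) -> Tuple[List[str], str]:
    """Judge pairs in sequence, refining the meta-prompt only on inconsistent cases."""
    verdicts = []
    for question, a, b in pairs:
        # Sample-specific instructions derived from the current meta-prompt.
        instructions = instruct(meta_prompt, question)
        forward = judge(instructions, a, b)
        backward = judge(instructions, b, a)
        pick_forward = "A" if forward == "first" else "B"
        pick_backward = "B" if backward == "first" else "A"
        if pick_forward == pick_backward:
            verdicts.append(pick_forward)
        else:
            # Assumed inconsistency signal: the verdict flipped with the order.
            verdicts.append("tie")
            meta_prompt = refine(meta_prompt, (question, a, b))
    return verdicts, meta_prompt


# Toy stand-ins (hypothetical): the judge is position-biased until the
# meta-prompt tells it to ignore order, after which it prefers the longer text.
def toy_instruct(meta_prompt: str, question: str) -> str:
    return f"{meta_prompt} Question: {question}"


def toy_judge(instructions: str, first: str, second: str) -> str:
    if "ignore order" not in instructions:
        return "first"
    return "first" if len(first) >= len(second) else "second"


def toy_refine(meta_prompt: str, pair: Pair) -> str:
    return meta_prompt + " ignore order."


demo_pairs = [
    ("q1", "short", "a longer answer"),
    ("q2", "a much longer answer", "tiny"),
]
verdicts, final_prompt = selective_lwe(
    demo_pairs, toy_instruct, toy_judge, toy_refine, "Judge carefully."
)
print(verdicts, final_prompt)
```

In this toy run, only the first pair triggers a refinement, so the cost of updating the meta-prompt is paid only where the judge disagreed with itself.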