Mikami, Hiroaki
A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
Imajo, Kentaro, Hirano, Masanori, Suzuki, Shuji, Mikami, Hiroaki
Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.
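As a rough illustration of judge-free scoring with n-gram statistics, the sketch below computes a clipped n-gram overlap between a generated answer and a set of reference answers. It is not the paper's definition of Fluency, Truthfulness, or Helpfulness, which the abstract does not specify. The function names, the whitespace tokenization, and the default n=2 are assumptions made for this example only; Japanese text would need a proper tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, references, n=2):
    """Fraction of candidate n-grams that also occur in some reference.

    Counts are clipped to the maximum count seen in any single reference,
    so repeating a matching n-gram does not inflate the score.
    Hypothetical helper: whitespace tokenization is an assumption.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0  # candidate is shorter than n tokens
    ref_max = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            ref_max[gram] = max(ref_max[gram], count)
    matched = sum(min(count, ref_max[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())
```

A score like this needs only counting, with no model inference, which is where the low cost of an n-gram-based benchmark comes from.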
PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency
Preferred Elements: Abe, Kenshin, Chubachi, Kaizaburo, Fujita, Yasuhiro, Hirokawa, Yuta, Imajo, Kentaro, Kataoka, Toshiki, Komatsu, Hiroyoshi, Mikami, Hiroaki, Mogami, Tsuguo, Murai, Shogo, Nakago, Kosuke, Nishino, Daisuke, Ogawa, Toru, Okanohara, Daisuke, Ozaki, Yoshihiko, Sano, Shotaro, Suzuki, Shuji, Xu, Tianqi, Yanase, Toshihiko
We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch on 2 trillion tokens, using techniques such as QK Normalization and Z-Loss to keep training stable. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.
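For readers unfamiliar with the two stability techniques named above, the following is a minimal sketch of their commonly used forms, not PLaMo-100B's actual implementation. Z-Loss is shown as an auxiliary penalty on the squared log of the softmax normalizer, and QK Normalization as normalizing query and key vectors before their dot product. The coefficient 1e-4, the use of RMS normalization without a learned gain, and all function names are assumptions for illustration.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2; coeff is an illustrative value."""
    return coeff * log_sum_exp(logits) ** 2


def rms_normalize(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def qk_norm_score(query, key):
    """Scaled attention logit computed from normalized query and key vectors."""
    q, k = rms_normalize(query), rms_normalize(key)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
```

Because the query and key are normalized first, the attention logit stays bounded even if the raw activations grow, while the Z-Loss term pushes the output normalizer toward 1; both limit the kind of logit growth that can destabilize large-scale training.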