 evelyn


The 3,500-mile love story that started in an online horror game

BBC News

It is an online romance that has overcome a 3,500-mile distance, and also the Covid pandemic - which meant they had to get married virtually. Welsh cheesemaker Lewis Relfe struck up a relationship with Ameila Henderson, from Virginia, USA, while playing the Friday the 13th horror video game in 2017. She made a number of visits across the Atlantic, including one for six months, and he proposed on Aberystwyth Pier, dressed as the game's main character, Jason Voorhees. While they admit to seeing the humour in being the couple that met and married virtually, they now live together in Ceredigion, with daughter Evelyn. But because of parental responsibilities, they no longer get to enjoy the thing that brought them together.


Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

Su, Xuerui, Xie, Shufang, Liu, Guoqing, Xia, Yingce, Luo, Renqian, Jin, Peiran, Ma, Zhiming, Wang, Yue, Wang, Zun, Liu, Yuting

arXiv.org Artificial Intelligence

Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved strong performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) have not yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, so exploring them in this area remains worthwhile. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and being updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
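
The step of turning rule-based preference levels into preference pairs can be pictured with a minimal sketch. This is an illustration only, not the authors' implementation: the two rules (`well_formatted`, `answer_correct`), the three-level scale, and the function names are all hypothetical, and the trust-region optimization step itself is not shown.

```python
from itertools import combinations


def preference_level(response: dict) -> int:
    """Hypothetical rule set (an assumption, not taken from the paper):
    2 = well-formatted and correct, 1 = well-formatted only, 0 = otherwise."""
    if not response["well_formatted"]:
        return 0
    return 2 if response["answer_correct"] else 1


def build_preference_pairs(responses: list[dict]) -> list[tuple[str, str]]:
    """Return (preferred, rejected) text pairs for every two responses
    whose rule-based levels differ; ties produce no pair."""
    pairs = []
    for a, b in combinations(responses, 2):
        level_a, level_b = preference_level(a), preference_level(b)
        if level_a > level_b:
            pairs.append((a["text"], b["text"]))
        elif level_b > level_a:
            pairs.append((b["text"], a["text"]))
    return pairs
```

Calling `build_preference_pairs` on a group of sampled responses to the same prompt yields the pairs a preference-based loss would then consume; responses at the same level contribute nothing.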


Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

Yang, Zhaohui, He, Chenghua, Shi, Xiaowen, Li, Linjing, Yin, Qiyue, Deng, Shihong, Jiang, Daxin

arXiv.org Artificial Intelligence

Many studies focus on data annotation techniques for training effective process reward models (PRMs). However, current methods encounter a significant issue when applied to long chain-of-thought (CoT) reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that correct and incorrect steps often alternate under the reflection pattern, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Leveraging an LLM-based judge for annotation, we collect 1.7 million data samples to train a 7B PRM and evaluate it at both solution and step levels. Experimental results demonstrate that compared to existing open-source PRMs and PRMs trained on open-source datasets, our PRM achieves superior performance across various metrics, including search guidance, best-of-N (BoN), and F1 scores. Compared to widely used Monte Carlo (MC)-based annotation methods, our annotation approach not only achieves higher data efficiency but also delivers superior performance. Detailed analysis is also conducted to demonstrate the stability and generalizability of our method.
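
The limitation of first-error annotation described above can be made concrete with a small sketch. It assumes a per-step correctness judgment is already available as a list of booleans; the function names are hypothetical, and the paper's own Error Propagation and Error Cessation labeling is not reproduced here.

```python
def first_error_labels(step_is_correct: list[bool]) -> list[bool]:
    """First-error annotation: every step from the first incorrect one
    onward is labeled incorrect, whatever its own judgment was."""
    labels = []
    seen_error = False
    for ok in step_is_correct:
        seen_error = seen_error or not ok
        labels.append(not seen_error)
    return labels


def overlooked_steps(step_is_correct: list[bool]) -> list[int]:
    """Indices of steps judged correct on their own but labeled incorrect
    by first-error annotation, e.g. a self-correction after a mistake."""
    labels = first_error_labels(step_is_correct)
    return [
        i
        for i, (ok, label) in enumerate(zip(step_is_correct, labels))
        if ok and not label
    ]
```

For a trace judged `[True, False, True, True]`, first-error annotation marks the last two steps incorrect even though they were judged correct, which is the kind of signal a reflection-aware annotation scheme aims to keep.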


Towards Summary Candidates Fusion

Ravaut, Mathieu, Joty, Shafiq, Chen, Nancy F.

arXiv.org Artificial Intelligence

Sequence-to-sequence deep neural models fine-tuned for abstractive summarization can achieve great performance on datasets with enough human annotations. Yet, it has been shown that they have not reached their full potential, with a wide gap between the top beam search output and the oracle beam. Recently, re-ranking methods have been proposed that learn to select a better summary candidate. However, such methods are limited by the summary quality aspects captured by the first-stage candidates. To bypass this limitation, we propose a new paradigm in second-stage abstractive summarization called SummaFusion that fuses several summary candidates to produce a novel abstractive second-stage summary. Our method works well on several summarization datasets, improving both the ROUGE scores and qualitative properties of fused summaries. It is especially effective when the candidates to fuse are worse, such as in the few-shot setup, where we set a new state of the art. We will make our code and checkpoints available at https://github.com/ntunlp/SummaFusion/.


I Want My Teen Daughter to Stop Being Such an Introverted Robot Person

Slate

Care and Feeding is Slate's parenting advice column. This may seem like a low-stakes question, but I am truly concerned. My 15-year-old daughter is an extreme introvert and strongly dislikes big groups of people and large events. She finds it difficult to make conversation and seems uncomfortable talking even with some of her classmates, including those she has known for years.

  Country: North America > United States > New York (0.04)
  Genre: Personal > Human Interest (0.40)
  Industry: Education (0.47)

The Internet Gave Rise to the Modern Multiverse Movie

WIRED

Since its inception, science fiction has served as a prism through which to view technological anxieties: Godzilla and Superman rising out of atomic dust, robot lovers that make viewers question the uniqueness of human life, the thrilling and perverse march of extractivism beyond the solar system. The genre's most original narratives exorcise those fears through catharsis. Of all the modern worries, the disconnect between our internet selves and real lives might be the most slippery thing yet to fold into the dramatic arcs of science fiction. Yet somehow, in the last six months, cinema has exploded with a type of film that might be best suited to containing its unwieldy contours: the multiverse movie. It's somewhat surprising that such an apt manifestation of the internet has taken so long to develop.