Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study
Jain, Amay, Cui, Liu, Chen, Si
Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves the reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval, to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM's latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.
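The routing idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how queries are classified, so the keyword heuristic, the `route_query` function, the `THEMATIC_CUES` set, and the `corpus_overrides_model` flag below are all hypothetical stand-ins. Only the three destinations (vector search, GraphRAG Local, GraphRAG Global) and their reported strengths come from the text.

```python
import re
from enum import Enum


class Retriever(Enum):
    VECTOR = "vector_search"           # low-cost generalist, quick facts
    GRAPH_LOCAL = "graphrag_local"     # highest accuracy when the corpus must win
    GRAPH_GLOBAL = "graphrag_global"   # rich answers to thematic queries


# Hypothetical cue words; a real system would likely use a trained classifier.
THEMATIC_CUES = {"theme", "themes", "overall", "compare", "summarize", "discuss"}


def route_query(query: str, corpus_overrides_model: bool = False) -> Retriever:
    """Pick a retrieval method for one query (illustrative heuristic only)."""
    # When the corpus may contradict the model's latent knowledge,
    # prefer the method reported as most faithful to the corpus.
    if corpus_overrides_model:
        return Retriever.GRAPH_LOCAL
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & THEMATIC_CUES:
        return Retriever.GRAPH_GLOBAL
    # Default to the cheaper vector search for everything else.
    return Retriever.VECTOR
```

Defaulting to vector search reflects the reported 10-20x resource gap: under this sketch, the costlier graph-based methods run only when a query appears to need them.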
Closer to Language than Steam: AI as the Cognitive Engine of a New Productivity Revolution
Fang, Xinmin, Tao, Lingfeng, Li, Zhengxiong
Artificial Intelligence (AI) is reframed as a cognitive engine driving a novel productivity revolution distinct from the Industrial Revolution's physical thrust. This paper develops a theoretical framing of AI as a cognitive revolution akin to written language - a transformative augmentation of human intellect rather than another mechanized tool. We compare AI's emergence to historical leaps in information technology to show how it amplifies knowledge work. Examples from various domains demonstrate AI's impact as a driver of productivity in cognitive tasks. We adopt a multidisciplinary perspective combining computer science advances with economic insights and sociological perspectives on how AI reshapes work and society. Through conceptual frameworks, we visualize the shift from manual to cognitive productivity. Our central argument is that AI functions as an engine of cognition - comparable to how human language revolutionized knowledge - heralding a new productivity paradigm. We discuss how this revolution demands rethinking of skills, organizations, and policies. This paper, balancing academic rigor with clarity, concludes that AI's promise lies in complementing human cognitive abilities, marking a new chapter in productivity evolution.
Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks
Xu, Yixuan, Llaquet, Antoni-Joan Solergibert i, Bosselut, Antoine, Schlag, Imanol
Large language models are known to memorize parts of their training data, posing a risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies and at lengths at least ten times longer than in prior work. We thereby identify the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) verbatim recall declines sharply when the prefix begins at an offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.
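To make the two probing axes concrete (prefix length and offset from the start of the sequence), here is a minimal sketch of how such a probe could be constructed. The helper names `make_probe` and `verbatim_match_fraction` are hypothetical, and the match metric is an assumed simple one; the abstract does not specify how verbatim recall is scored.

```python
def make_probe(tokens, offset, prefix_len, target_len):
    """Split a training sequence into a prompt and the continuation to check.

    The prompt is `prefix_len` tokens starting `offset` tokens into the
    sequence; the target is the `target_len` tokens that follow it.
    """
    end = offset + prefix_len
    if offset < 0 or prefix_len <= 0 or target_len <= 0:
        raise ValueError("offset must be >= 0 and lengths must be positive")
    if end + target_len > len(tokens):
        raise ValueError("probe runs past the end of the sequence")
    return tokens[offset:end], tokens[end:end + target_len]


def verbatim_match_fraction(generated, target):
    """Fraction of the target reproduced exactly before the first mismatch.

    This is an assumed metric for illustration, not the paper's.
    """
    matched = 0
    for g, t in zip(generated, target):
        if g != t:
            break
        matched += 1
    return matched / len(target)
```

Sweeping `offset` while holding `prefix_len` fixed, and the reverse, would separate the two effects the abstract describes.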
RegMix: Data Mixture as Regression for Language Model Pre-training
Liu, Qian, Zheng, Xiaosen, Muennighoff, Niklas, Zeng, Guangtao, Dou, Longxu, Pang, Tianyu, Jiang, Jing, Lin, Min
The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at https://github.com/sail-sg/regmix.
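The fit-then-simulate loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation (their code is at the URL above). The abstract does not name the regression model, so plain least squares is assumed here. The three-domain setup, the `true_coef` values, and the synthetic "losses" are invented for the example. No intercept is fitted because mixture weights sum to 1, which already absorbs a constant term.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mixture_regression(mixtures, losses):
    """Least-squares fit of loss ~ sum_d coef[d] * mixture[d] (normal equations)."""
    d = len(mixtures[0])
    xtx = [[sum(w[i] * w[j] for w in mixtures) for j in range(d)] for i in range(d)]
    xty = [sum(w[i] * y for w, y in zip(mixtures, losses)) for i in range(d)]
    return solve_linear(xtx, xty)


def predict_loss(coef, mixture):
    return sum(c * w for c, w in zip(coef, mixture))


def sample_mixture(rng, d):
    """Draw domain weights uniformly from the simplex (normalized exponentials)."""
    raw = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
true_coef = [2.0, 3.5, 2.8]  # hypothetical per-domain loss contributions

# Step 1: "train" small proxy models; synthetic losses stand in for real runs.
train_mixtures = [sample_mixture(rng, 3) for _ in range(64)]
train_losses = [predict_loss(true_coef, w) + rng.gauss(0.0, 0.01) for w in train_mixtures]

# Step 2: fit the regression from mixture weights to observed loss.
coef = fit_mixture_regression(train_mixtures, train_losses)

# Step 3: simulate many candidate mixtures and keep the best predicted one.
candidates = [sample_mixture(rng, 3) for _ in range(10000)]
best = min(candidates, key=lambda w: predict_loss(coef, w))
```

With a purely linear model the predicted optimum collapses onto the single best domain, so a linear fit cannot represent the domain interactions that finding (3) reports; capturing those would need a regressor that models them.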
Exploring Automatic Text Simplification of German Narrative Documents
Schomacker, Thorben, Dönicke, Tillmann, Tropmann-Frick, Marina
In this paper, we apply transformer-based Natural Language Generation (NLG) techniques to the problem of text simplification. Currently, there are only a few German datasets available for text simplification, even fewer with larger and aligned documents, and not a single one with narrative texts. We explore to what degree modern NLG techniques can be applied to German narrative text simplification. We use Longformer attention and a pre-trained mBART model. Our findings indicate that the existing approaches for German are not able to solve the task properly. We conclude with a few directions for future research to address this problem.
R.U.R. (Rossum's Universal Robots): PROPERTY LIST
R.U.R. (Rossum's Universal Robots), by Karel Capek, is part of HackerNoon's Book Blog Post series. Box candy. 1 Pad and blotter. 1 Letter opener. 1 Cigarette box. 1 Inkwell stand. 1 Practical buzzer (6 buttons). Off L.: 1 Fountain pen (for Busman). 1 Telephone buzzer. 1 Siren whistle. On Table L.C.: 2 Book ends (wooden).
Is Your Motivation Wavering? A Coaching App Might Help
I was already familiar with the mechanics of goal setting when I began using Noom, a weight loss app, to prep for my daughter's wedding. My graduate work in psychology focused on goal setting, so I knew goals should be SMART (specific, measurable, attainable, realistic, and time-based). "Trying to lose weight" isn't a SMART goal because it isn't specific or time-based, but "losing 1 pound a week for five weeks" is. But I'd stopped setting specific, attainable goals during the decades-long crush of parenting and career. I didn't expect to be moved by a weekly text from a virtual coach and was surprised to feel compelled to respond to her goal request.
Powering the golden age of audio
Audio, the spoken word, is humanity's primary means of sentient communication: the sounds a fetus hears in utero; a lover's whisper; a marriage proposal… all leave deep imprints on our hearts and minds. We use sound to accentuate and transmit our emotions; our aural ability is a primary sense that is deeply connected to emotion. In fact, much research indicates that hearing is the most important of the five senses. We detect harmful and dangerous sounds with our ears -- if a fire alarm rings in the middle of the night, we depend on our hearing to alert us to impending danger. While historically sight has been the most valued sense, audio has been catching up.