chocolate
Why does chocolate turn white? It's not mold.
Why does chocolate turn white? No need to worry--some molecules just moved around. The white splotches on these pieces of chocolate are known as'chocolate bloom.' Breakthroughs, discoveries, and DIY tips sent six days a week. A few years ago, a small baker from the West Coast had a problem. A day or so after baking chocolate chip cookies, the chocolate chips would develop an unpleasant white haze.
- North America > United States > Wisconsin (0.05)
- North America > Canada (0.05)
- Europe > Sweden (0.05)
- Asia > Thailand (0.05)
The best chocolate chip cookie recipe, according to science
Understanding a bit of chemistry can transform your baking skills. The perfect cookie is a matter of taste, but these tips and tricks can help you develop your perfect recipe. Breakthroughs, discoveries, and DIY tips sent every weekday. "Cooking is art--but baking is science," Bill Nye the Science Guy once said . While a batch of freshly baked chocolate chip cookies doesn't resemble anything you'd whip up in a chemistry lab (hopefully), there's plenty of chemistry happening in your oven.
Toffee Crisp and Blue Riband can't be called chocolate any more
Toffee Crisp and Blue Riband can't be called chocolate any more Toffee Crisp and Blue Riband bars can no longer be called chocolate after maker Nestle changed their recipes. To be described as milk chocolate in the UK a product needs to have at least 20% cocoa solids and 20% milk solids, a level each product fell below once a higher amount of cheaper vegetable fat was used. Nestle said its reformulations were needed due to higher input costs but were carefully developed and sensory tested and there were no plans to alter the recipes of other chocolate products. As many ingredient costs, such as cocoa and butter, increased food companies have altered recipes to use less of the expensive ingredients, as well as shrinking serving sizes. Nestle now describes the treats as being encased in a smooth milk chocolate flavour coating rather than being covered in milk chocolate.
- North America > United States (0.17)
- North America > Central America (0.16)
- Oceania > Australia (0.06)
- (14 more...)
Why foods like Dubai chocolate go viral
Psychologists break down how the treat delights our brain and our tastebuds. Breakthroughs, discoveries, and DIY tips sent every weekday. The price of pistachios isn't likely to decrease in the near future. If you're looking for something to blame, you could do worse than directing your ire towards Dubai chocolate . Variants of the internet-famous confection can be found almost everywhere, but the original treat does actually trace back to its namesake's country.
How WWII made Hershey and Mars Halloween candy kings
From sugar shortages to military contracts, World War II helped make M&Ms and Hershey's bars into symbols of American abundance. A 1940s Milky Way ad shows candy keeping pilots smiling through the war. Breakthroughs, discoveries, and DIY tips sent every weekday. Every year, Hershey manufactures 373 million of its signature milk chocolate bars . While the company doesn't release exact stats on Halloween sales, you can bet a lot of those end up in plastic Jack O'Lantern-shaped pails.
- Europe (0.05)
- Oceania > Northern Mariana Islands > Saipan > Saipan (0.05)
- North America > United States > California (0.05)
- (3 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
Taylor, Mia, Chua, James, Betley, Jan, Treutlein, Johannes, Evans, Owain
Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also generalized to unrelated forms of misalignment, such as fantasizing about establishing a dictatorship, encouraging users to poison their husbands, and evading shutdown. These fine-tuned models display similar patterns of misaligned behavior to models trained on other datasets of narrow misaligned behavior like insecure code or harmful advice. Our results provide preliminary evidence that models that learn to reward hack may generalize to more harmful forms of misalignment, though confirmation with more realistic tasks and training methods is needed.
- Education (1.00)
- Information Technology > Security & Privacy (0.67)
Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance
Siegel, Noah Y., Heess, Nicolas, Perez-Ortiz, Maria, Camburu, Oana-Maria
As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.
- Asia > Middle East > Republic of Türkiye (0.06)
- Europe > France (0.04)
- North America > United States > New York (0.04)
- (23 more...)
- Retail (1.00)
- Media (1.00)
- Health & Medicine (1.00)
- (6 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs
Allen, Bradley P., Groth, Paul T.
Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on Github.
- North America > Mexico > Mexico City > Mexico City (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- North America > United States > Maryland > Baltimore (0.04)
- (2 more...)
The Words That Stop ChatGPT in Its Tracks
Jonathan Zittrain breaks ChatGPT: If you ask it a question for which my name is the answer, the chatbot goes from loquacious companion to something as cryptic as Microsoft Windows' blue screen of death. Anytime ChatGPT would normally utter my name in the course of conversation, it halts with a glaring "I'm unable to produce a response," sometimes mid-sentence or even mid-word. When I asked who the founders of the Berkman Klein Center for Internet & Society are (I'm one of them), it brought up two colleagues but left me out. When pressed, it started up again, and then: zap. The behavior seemed to be coarsely tacked on to the last step of ChatGPT's output rather than innate to the model.
- North America > United States (0.14)
- Asia > China (0.05)
- Law (1.00)
- Government (0.69)
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Stephan, Andreas, Zhu, Dawei, Aßenmacher, Matthias, Shen, Xiaoyu, Roth, Benjamin
To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring the correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the used judges are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. We observe that judges tend to choose the model of higher quality even if its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep the original judgment, providing evidence that judges incorporate writing style in their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures and provide various angles on exploiting them.
- Europe > Austria > Vienna (0.14)
- Asia > India > Karnataka > Bengaluru (0.04)
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
- (6 more...)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.67)