OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Aggarwal, Pranjal, Kim, Seungone, Lanchantin, Jack, Welleck, Sean, Weston, Jason, Kulikov, Ilia, Saha, Swarnadeep
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
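The abstract's "thinking-adjusted accuracy" is not defined here, so the scoring rule below is purely an illustrative assumption: a toy metric that grants full credit for correct answers but discounts correct answers on simple queries whose thinking-token count exceeds a budget, capturing the overthinking penalty in spirit.

```python
# Hypothetical sketch of a thinking-adjusted accuracy score. The penalty
# form and the token budget are assumptions for illustration, not the
# benchmark's actual metric.

def thinking_adjusted_accuracy(results, token_budget=100):
    """results: list of (correct: bool, thinking_tokens: int, is_simple: bool)."""
    score = 0.0
    for correct, tokens, is_simple in results:
        if not correct:
            continue  # wrong answers earn nothing regardless of token use
        if is_simple and tokens > token_budget:
            # Overthinking: decay credit as token use exceeds the budget.
            score += token_budget / tokens
        else:
            score += 1.0
    return score / len(results)

runs = [
    (True, 40, True),     # simple query, concise and correct: full credit
    (True, 400, True),    # simple query, 400 thinking tokens: discounted
    (False, 900, False),  # hard query, wrong: no credit
    (True, 900, False),   # hard query, long reasoning, correct: full credit
]
print(thinking_adjusted_accuracy(runs))
```

Under this toy rule, a model that always emits long chains of thought loses credit exactly on the simple queries, which is the trade-off the benchmark is designed to surface.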
- Europe > Russia > Northwestern Federal District > Kaliningrad Oblast > Kaliningrad (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States (0.04)
- (9 more...)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Media > Music (0.94)
- Education (0.68)
RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
Avram, Andrei-Marius, Timpuriu, Mircea, Iuga, Andreea, Matei, Vlad-Cristian, Tăiatu, Iulian-Marius, Găină, Tudor, Cercel, Dumitru-Clementin, Pop, Florin, Cercel, Mihaela-Claudia
Using supervised automatic summarization methods requires sufficient corpora that include pairs of documents and their summaries. As with many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language, crawled from various publicly available news websites in Romania and the Republic of Moldova and thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, headlines, keywords, dialect, and other metadata found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this dataset and directions for future development.
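Summarization benchmarks of this kind are typically scored with ROUGE-style n-gram overlap. As a minimal sketch (not the paper's actual evaluation pipeline), ROUGE-1 F1 between a generated and a reference summary can be computed from unigram multiset overlap:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference summary.

    Counter intersection (&) takes the per-token minimum count, i.e. the
    multiset overlap used by ROUGE-1.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the government announced new measures",
                "the government announced several new economic measures"))
```

Real evaluations use a full ROUGE implementation with stemming and multiple variants (ROUGE-2, ROUGE-L); this sketch only shows the core overlap computation.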
- Europe > Moldova (0.25)
- Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.04)
- Asia > Russia > Far Eastern Federal District > Sakha Republic (0.04)
- (2 more...)
North Korean troops in Ukraine 'fair game', US warns Russia as war rages on
United States defence secretary Lloyd Austin has waded in on reports that North Korea was preparing to enter the Ukraine war with troops. "If they are co-belligerents, if their intention is to participate in this war on Russia's behalf, that is a very, very serious issue," Austin said. Austin was returning from his fourth visit to Kyiv, where he announced a $400m package of US weapons for Ukraine. John Kirby, White House national security spokesman, said Washington believes that at least 3,000 North Korean soldiers arrived this month by sea to Vladivostok, Russia's largest Pacific port. "These soldiers then travelled onward to multiple Russian military training sites in eastern Russia, where they are currently undergoing training," Kirby said on Wednesday.
- Africa (0.30)
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.26)
- Asia > Russia > Far Eastern Federal District > Primorsky Krai > Vladivostok (0.26)
- (11 more...)
- Government > Military (1.00)
- Government > Regional Government > Europe Government > Russia Government (0.93)
- Government > Regional Government > Asia Government > Russia Government (0.93)
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data
Maekawa, Seiji, Iso, Hayate, Bhutani, Nikita
The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel in accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning across multiple documents--what we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts.
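The core idea of bringing database operations into text can be sketched concretely: verbalize table rows into a long context (interleaved with filler to control information density), then score a model's answer against the ground truth computed directly on the table. The row template, filler, and task framing below are illustrative assumptions, not HoloBench's actual format.

```python
# Toy illustration of database-operations-over-text evaluation.

rows = [
    {"city": "Oslo", "population": 709_000},
    {"city": "Bergen", "population": 291_000},
    {"city": "Trondheim", "population": 212_000},
]

# Verbalize each row; the filler sentence stands in for unrelated text used
# to vary information density at a fixed context length.
filler = "This sentence is unrelated padding. "
context = filler.join(
    f"{r['city']} has a population of {r['population']}. " for r in rows
)

def ground_truth_max(table, key):
    """Answer a MAX-style query directly on the structured table."""
    return max(table, key=lambda r: r[key])["city"]

def score(model_answer: str, table, key) -> bool:
    """Exact-match scoring of a model's free-text answer."""
    return model_answer.strip().lower() == ground_truth_max(table, key).lower()

print(ground_truth_max(rows, "population"))
print(score("Oslo", rows, "population"))
```

Because the ground truth comes from the table rather than from annotations, context length, density, and query type can all be varied systematically while keeping the answer key exact, which is what makes this setup convenient for probing holistic reasoning.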
- North America > United States > California > Sonoma County (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (11 more...)
- Transportation > Infrastructure & Services > Airport (1.00)
- Transportation > Air (1.00)
- Consumer Products & Services (0.93)
We're still in the steam-powered days of machine learning
The reveal of the ridiculous Cybertruck design last week made me curious about the history of cars. If you look at pictures of cars from the early days (as I, a Normal Person, did last Friday night), you'll see some insane ideas. Before we got to the Ford Model T that standardized car production, people iterated on a ton of crazy stuff. It took some time for people to experiment and agree on what a car even was, what features it had, and how it needed to work. For example, for a long time in the beginning, quite a few cars ran on steam, until gasoline began to overtake them (thanks in part to Henry Ford's standardization of the assembly line, which made non-gasoline cars harder to produce). Eventually, all the cars standardized to the form we know today: a closed car, powered by gasoline, with four wheels, four windows, seating 4-8 people. Even the godawful Cyberthing follows this model.
- Transportation > Ground > Road (0.71)
- Information Technology > Services (0.69)
- Automobiles & Trucks > Manufacturer (0.54)
Prediction of Porosity and Permeability Alteration based on Machine Learning Algorithms
Erofeev, Andrei, Orlov, Denis, Ryzhov, Alexey, Koroteev, Dmitry
The objective of this work is to study the applicability of various machine learning algorithms for predicting rock properties that geoscientists usually determine through special laboratory analysis. We demonstrate that these special properties can be predicted based only on routine core analysis (RCA) data. To validate the approach, core samples from a reservoir with soluble rock-matrix components (salts) were tested in more than 100 laboratory experiments. The challenge of the experiments was to characterize the salt content in the cores and the alteration of porosity and permeability after reservoir desalination due to drilling mud or water injection. For these three measured characteristics, we developed predictive models based on the results of RCA together with data on coring depth and the top and bottom depths of the productive horizons. To select the most accurate machine learning algorithm, a comparative analysis was performed. It was shown that different algorithms work better for different models; however, a neural network with two hidden layers demonstrated the best predictive ability and generalizability across all three rock characteristics jointly. Other algorithms, such as Support Vector Machines and Linear Regression, also worked well on the dataset, but only in particular cases. Overall, the applied approach allows predicting the alteration of porosity and permeability during desalination in porous rocks, and also evaluating salt concentration without direct laboratory measurements. This work also shows that the developed approaches could be applied to predict other rock properties (residual brine and oil saturations, relative permeability, capillary pressure, and others) whose laboratory measurements are time-consuming and expensive.
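The best-performing model reported above is a neural network with two hidden layers. As a minimal sketch of that architecture's forward pass only, the layer sizes, weights, and input features (normalized RCA-style values) below are made up for demonstration; a real model would be trained on the laboratory data described in the paper.

```python
# Illustrative forward pass of a two-hidden-layer regression network.
# All parameters are hand-set toy values, not a trained model.

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, weights, bias):
    """y_j = sum_i x_i * W[i][j] + b_j, with weights[i] the row for input i."""
    return [sum(xi * w[j] for xi, w in zip(x, weights)) + bias[j]
            for j in range(len(bias))]

def predict(features, params):
    h1 = relu(dense(features, *params[0]))  # hidden layer 1
    h2 = relu(dense(h1, *params[1]))        # hidden layer 2
    return dense(h2, *params[2])[0]         # single regression output

# Tiny hand-set parameters: 3 inputs -> 2 units -> 2 units -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]], [0.0, 0.1]),
    ([[1.0, 0.5], [-0.5, 1.0]], [0.0, 0.0]),
    ([[0.8], [0.6]], [0.05]),
]

# features: e.g. normalized porosity, permeability, and coring depth from RCA
print(predict([0.2, 0.5, 0.1], params))
```

In practice one would use a library such as scikit-learn or PyTorch and fit the weights by gradient descent; the sketch only makes the "two hidden layers" architecture concrete.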
- Asia > Russia > Far Eastern Federal District > Sakha Republic (0.28)
- Europe > Russia (0.14)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)
The Last Invention of Man - Issue 53: Monsters
The Omega Team was the soul of the company. Whereas the rest of the enterprise brought in the money to keep things going, by various commercial applications of narrow AI, the Omega Team pushed ahead in their quest for what had always been the CEO's dream: building general artificial intelligence. Most other employees viewed "the Omegas," as they affectionately called them, as a bunch of pie-in-the-sky dreamers, perpetually decades away from their goal. They happily indulged them, however, because they liked the prestige that the cutting-edge work of the Omegas gave their company, and they also appreciated the improved algorithms that the Omegas occasionally gave them. What they didn't realize was that the Omegas had carefully crafted their image to hide a secret: They were extremely close to pulling off the most audacious plan in human history. Their charismatic CEO had handpicked them not only for being brilliant researchers, but also for ambition, idealism, and a strong commitment to helping humanity. He reminded them that their plan was extremely dangerous, and that if powerful governments found out, they would do virtually anything--including kidnapping--to shut them down or, preferably, to steal their code. But they were all in, 100 percent, for much the same reason that many of the world's top physicists joined the Manhattan Project to develop nuclear weapons: They were convinced that if they didn't do it first, someone less idealistic would. The AI they had built, nicknamed Prometheus, kept getting more capable. Although its cognitive abilities still lagged far behind those of humans in many areas, for example, social skills, the Omegas had pushed hard to make it extraordinary at one particular task: programming AI systems. 
They'd deliberately chosen this strategy because they had bought the intelligence explosion argument made by the British mathematician Irving Good back in 1965: "Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man, however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control."
- North America > United States > California (0.04)
- Europe > Russia (0.04)
- Asia > South Korea (0.04)
- (3 more...)
- Media > News (1.00)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- (5 more...)