Goto

Collaborating Authors

 Antarctica


Scalable Oversight for Superhuman AI via Recursive Self-Critiquing

arXiv.org Artificial Intelligence

As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques including SFT and RLHF face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) critique of critique can be easier than critique itself, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) this difficulty relationship is recursively held, suggesting that when direct evaluation is infeasible, performing high-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. To examine these hypotheses, we perform Human-Human, Human-AI, and AI-AI experiments across multiple tasks. Our results demonstrate encouraging evidence supporting these hypotheses and suggest that recursive self-critiquing is a promising direction for scalable oversight.


PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

arXiv.org Artificial Intelligence

Existing benchmarks for frontier models often test specialized, ``PhD-level'' knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models, however correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with ``I give up'' before providing an answer that it knows is wrong. R1 can also be remarkably ``uncertain'' in its output and in rare cases, it does not ``finish thinking,'' which suggests the need for an inference-time technique to ``wrap up'' before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.


Scalable Higher Resolution Polar Sea Ice Classification and Freeboard Calculation from ICESat-2 ATL03 Data

arXiv.org Artificial Intelligence

ICESat-2 (IS2) by NASA is an Earth-observing satellite that measures high-resolution surface elevation. The IS2's ATL07 and ATL10 sea ice elevation and freeboard products of 10m-200m segments which aggregated 150 signal photons from the raw ATL03 (geolocated photon) data. These aggregated products can potentially overestimate local sea surface height, thus underestimating the calculations of freeboard (sea ice height above sea surface). To achieve a higher resolution of sea surface height and freeboard information, in this work we utilize a 2m window to resample the ATL03 data. Then, we classify these 2m segments into thick sea ice, thin ice, and open water using deep learning methods (Long short-term memory and Multi-layer perceptron models). To obtain labeled training data for our deep learning models, we use segmented Sentinel-2 (S2) multi-spectral imagery overlapping with IS2 tracks in space and time to auto-label IS2 data, followed by some manual corrections in the regions of transition between different ice/water types or cloudy regions. We employ a parallel workflow for this auto-labeling using PySpark to scale, and we achieve 9-fold data loading and 16.25-fold map-reduce speedup. To train our models, we employ a Horovod-based distributed deep-learning workflow on a DGX A100 8 GPU cluster, achieving a 7.25-fold speedup. Next, we calculate the local sea surface heights based on the open water segments. Finally, we scale the freeboard calculation using the derived local sea level and achieve 8.54-fold data loading and 15.7-fold map-reduce speedup. Compared with the ATL07 (local sea level) and ATL10 (freeboard) data products, our results show higher resolutions and accuracy (96.56%).


A Comprehensive Analysis on LLM-based Node Classification Algorithms

arXiv.org Artificial Intelligence

Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. Subsequently, we conducted extensive experiments, training and evaluating over 2,200 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size) that affect performance. Our findings uncover eight insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field. Codes and datasets are released at \href{https://llmnodebed.github.io/}{https://llmnodebed.github.io/}.


Revealed: What life on Earth will look like in 2100 - with entire cities plunged underwater and millions of people perishing in the heat

Daily Mail - Science & tech

From Snowpiercer to The Day After Tomorrow, countless movies and series have put forward their vision of how climate change might reshape the world. Worryingly, scientists predict that the reality might be far more shocking than anything imagined by a Hollywood studio. Now, artificial intelligence (AI) reveals what this might look like. With Google's ImageFX AI image generator, MailOnline has used the latest scientific research to predict how the world will be in 2100. As greenhouse gas levels continue to increase, scientists predict that entire cities will be plunged under water.


TabFSBench: Tabular Benchmark for Feature Shifts in Open Environment

arXiv.org Artificial Intelligence

Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have the limited applicability in feature-shift scenarios; (2) the shifted feature set importance has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research direction is also explored for each observation. TabFSBench is released for public access by using a few lines of Python codes at https://github.com/LAMDASZ-ML/TabFSBench.


Global sea levels could rise by up to 6.2 FEET by 2100, plunging entire cities underwater - so, is your hometown at risk?

Daily Mail - Science & tech

The idea of entire cities being plunged underwater might sound like the plot of the latest science fiction blockbuster. But it could become a reality in just 75 years, according to a terrifying new study. Scientists from Nanyang Technological University (NTU), Singapore, have predicted that global sea levels could rise by a staggering 6.2 feet (1.9 metres) by 2100 if carbon dioxide (CO2) emissions continue to increase. 'The high-end projection of 1.9 metres underscores the need for decision-makers to plan for critical infrastructure accordingly,' said Dr Benjamin Grandey, lead author of the study. If global sea levels were to rise by 6.2ft (1.9 metres), towns and cities around the world could be plunged underwater - including several in the UK.


Physics-Trained Neural Network as Inverse Problem Solver for Potential Fields: An Example of Downward Continuation between Arbitrary Surfaces

arXiv.org Artificial Intelligence

We treat downward continuation as an inverse problem that relies on solving a forward problem defined by the formula for upward continuation, and we propose a new physics-trained deep neural network (DNN)-based solution for this task. We hard-code the upward continuation process into the DNN's learning framework, where the DNN itself learns to act as the inverse problem solver and can perform downward continuation without ever being shown any ground truth data. We test the proposed method on both synthetic magnetic data and real-world magnetic data from West Antarctica. The preliminary results demonstrate its effectiveness through comparison with selected benchmarks, opening future avenues for the combined use of DNNs and established geophysical theories to address broader potential field inverse problems, such as density and geometry modelling. Introduction Downward continuation of potential field, including gravity or magnetic field, refers to transferring the data from one observation surface to a lower surface that is closer to the source of the field. The goal is to enhance the resolution of the continued field and amplify the shallow geological signals. Airborne surveys are typically flown at uneven heights, making continuation from these surfaces a common requirement. Downward continuation is a critical task in the processing of potential field data, impacting the success of various downstream analyses, such as revealing the density structure and boundaries of anomalous bodies, especially for detecting and highlighting shallow anomalous sources. Many methods have been developed for the task of downward continuation (e.g.


Task Allocation in Customer-led Two-sided Markets with Satellite Constellation Services

arXiv.org Artificial Intelligence

Multi-agent systems (MAS) are increasingly applied to complex task allocation in two-sided markets, where agents such as companies and customers interact dynamically. Traditional company-led Stackelberg game models, where companies set service prices, and customers respond, struggle to accommodate diverse and personalised customer demands in emerging markets like crowdsourcing. This paper proposes a customer-led Stackelberg game model for cost-efficient task allocation, where customers initiate tasks as leaders, and companies create their strategies as followers to meet these demands. We prove the existence of Nash Equilibrium for the follower game and Stackelberg Equilibrium for the leader game while discussing their uniqueness under specific conditions, ensuring cost-efficient task allocation and improved market performance. Using the satellite constellation services market as a real-world case, experimental results show a 23% reduction in customer payments and a 6.7-fold increase in company revenues, demonstrating the model's effectiveness in emerging markets.


How Does the Spatial Distribution of Pre-training Data Affect Geospatial Foundation Models?

arXiv.org Artificial Intelligence

Foundation models have made rapid advances in many domains including Earth observation, where Geospatial Foundation Models (GFMs) can help address global challenges such as climate change, agriculture, and disaster response. Previous work on GFMs focused on tailoring model architecture and pre-text tasks, and did not investigate the impact of pre-training data selection on model performance. However, recent works from other domains show that the pre-training data distribution is an important factor influencing the performance of the foundation models. With this motivation, our research explores how the geographic distribution of pre-training data affects the performance of GFMs. We evaluated several pre-training data distributions by sampling different compositions from a global data pool. Our experiments with two GFMs on downstream tasks indicate that balanced and globally representative data compositions often outperform region-specific sampling, highlighting the importance of diversity and global coverage in pre-training data. Our results suggest that the most appropriate data sampling technique may depend on the specific GFM architecture. These findings will support the development of robust GFMs by incorporating quality pre-training data distributions, ultimately improving machine learning solutions for Earth observation.