Asia
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only The Falcon LLM Team
This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will run out of unique high-quality data soon.
Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation
Many fine-grained classification tasks, like rare animal identification, have limited training data, and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA.
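The abstract does not spell out how the filtering step decides which edits to drop, so the following is only a minimal sketch under an assumed rule: an edit is discarded when the classifier trained on the original data is highly confident about it, either in the correct class (the edit plausibly changed too little) or in a wrong class (the edit plausibly corrupted class-relevant content). The function name `classify_edit`, the reason strings, and the `threshold` value of 0.9 are hypothetical and are not taken from the ALIA code base.

```python
from typing import Dict, List, Sequence


def classify_edit(probs: Sequence[float], label: int, threshold: float = 0.9) -> str:
    """Decide what to do with one edited image (assumed rule, not ALIA's API).

    probs     -- class probabilities predicted for the edited image by a
                 classifier trained on the original dataset
    label     -- ground-truth class index of the source image
    threshold -- hypothetical confidence cut-off
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    if probs[predicted] < threshold:
        # The model is unsure: treat the edit as a useful new variation.
        return "keep"
    if predicted == label:
        # Confidently correct: the edit likely resembles the training data too closely.
        return "minimal_edit"
    # Confidently wrong: the edit likely destroyed class-relevant information.
    return "corrupted"


def filter_edits(edits: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Keep only the edits that the assumed rule marks as 'keep'."""
    return [
        e for e in edits
        if classify_edit(e["probs"], e["label"], threshold) == "keep"
    ]


if __name__ == "__main__":
    candidates = [
        {"name": "bird_snow", "probs": [0.60, 0.40], "label": 0},
        {"name": "bird_same", "probs": [0.97, 0.03], "label": 0},
        {"name": "bird_gone", "probs": [0.02, 0.98], "label": 0},
    ]
    print([e["name"] for e in filter_edits(candidates)])
```

In practice the cut-off would probably need tuning per class, since a single global threshold treats easy and hard classes alike.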
A Dataset for Analyzing Streaming Media Performance over HTTP/3 Browsers
HTTP/3 is a new application layer protocol supported by most browsers. It uses QUIC as an underlying transport protocol. QUIC provides multiple benefits, like faster connection establishment, reduced latency, and improved connection migration. Hence, popular browsers like Chrome/Chromium, Microsoft Edge, Apple Safari, and Mozilla Firefox have started supporting it. This paper presents an HTTP/3-supported browser dataset collection tool named H3B.
GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image
The extraordinary ability of generative models to generate photographic images has intensified concerns about the spread of disinformation, thereby leading to the demand for detectors capable of distinguishing between AI-generated fake images and real images. However, the lack of large datasets containing images from the most advanced image generators poses an obstacle to the development of such detectors. In this paper, we introduce the GenImage dataset, which has the following advantages: 1) Plenty of Images, including over one million pairs of AI-generated fake images and collected real images.
ChatGPT trounces humans in entrance exams for top Japan university, study finds
AI models surpassed the highest score recorded for a human test taker in this year's University of Tokyo entrance exam, a new study shows. If an artificial intelligence model such as ChatGPT had taken the entrance exams for Japan's top university in 2026, it would have been assessed as top of the class and admitted for scoring higher than any human test takers, a study by AI startup LifePrompt has found. The research used three major AI models -- ChatGPT 5.2 Thinking by OpenAI, Gemini 3 Pro Preview by Google and Claude Opus 4.5 by Anthropic -- and had them take the actual entrance exam used by the University of Tokyo in February 2026 to assess candidates for courses set to start in April. The university's category 3 science exam, often taken by those who want to enter the institution's medical school, is considered the most difficult exam to pass in Japan.
The split between China and Silicon Valley just got wider
Beijing's insistence that Meta unwind its deal with a Chinese A.I. start-up marks an escalation in the geopolitical fight over advanced tech. TAIPEI - Manus, an artificial intelligence startup, began with an idea among three engineers in Wuhan, China, united by an obsession with AI and a shared ambition to build a global venture. From the outset, they looked beyond China. Their big break came in March last year. Manus had drawn the attention of Silicon Valley investors with an AI agent capable of carrying out tasks on its own.
Appendix
The following section provides answers to the questions listed in datasheets for datasets.
A.1 Motivation
For what purpose was the dataset created? VisAlign was created to serve as a benchmark for measuring visual perception alignment between AI models and humans.
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and the National Research Foundation of Korea (NRF) grant (NRF2020H1D3A2A03100945), funded by the Korea government (MSIT).
A.2 Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? VisAlign contains eight different types of images and their corresponding gold human labels.
How many instances are there in total (of each type, if appropriate)? There are a total of 12500 images in the train set, distributed equally among the 10 classes. The open test set and the closed test set each contain 900 images: 100 images each in Categories 1 to 7 and 200 images in Category 8.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?