The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Though the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, machine learning offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed with the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike other NLP patent datasets, HUPD contains the inventor-submitted versions of patent applications, not the final versions of granted patents, allowing us to study patentability at the time of filing using NLP methods for the first time.
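To make the filing-time patentability task concrete, here is a minimal sketch of a bag-of-words baseline, assuming each HUPD record exposes an inventor-submitted abstract and a binary accept/reject decision; the inputs `abstracts` and `decisions` below are illustrative placeholders, not the dataset's confirmed schema.

```python
# Hypothetical filing-time patentability baseline over HUPD-style records.
# Assumes `abstracts` is a list of strings and `decisions` a parallel list
# of binary labels (e.g., ACCEPTED/REJECTED); both are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def patentability_baseline(abstracts, decisions):
    """Predict the grant decision from the inventor-submitted abstract alone."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        abstracts, decisions, test_size=0.2, random_state=0)
    clf = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    return clf
```

A stronger baseline would swap the TF-IDF pipeline for a fine-tuned language model over the full specification text, but the shape of the task is the same.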
YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Machine learning for sign languages is bottlenecked by data. In this paper, we present YouTube-ASL, a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset. We train baseline models for ASL to English translation on YouTube-ASL and evaluate them on How2Sign, where we achieve a new fine-tuned state of the art of 12.397 BLEU and, for the first time, nontrivial zero-shot results.
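For reference, the corpus-level BLEU evaluation can be sketched in a few lines with sacrebleu; the `translate` callable standing in for a fine-tuned video-to-text model and the (video, caption) evaluation pairs are placeholders, not the authors' released code.

```python
# Sketch of a How2Sign-style translation evaluation loop. `translate`
# and `eval_pairs` are assumed stand-ins, not the paper's artifacts.
import sacrebleu

def evaluate_bleu(translate, eval_pairs):
    """Corpus-level BLEU between model outputs and reference captions."""
    hypotheses = [translate(video) for video, _ in eval_pairs]
    references = [caption for _, caption in eval_pairs]
    # sacrebleu expects one list per reference set
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score
```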
Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem. Our framework, inspired by random matrix theory, provides an exact (deterministic) characterization for the sequence of function values produced by these algorithms, expressed only in terms of the eigenvalues of the Hessian. This leads to simple expressions for nearly optimal hyperparameters, a description of the limiting neighborhood, and average-case complexity. As a consequence, we show that (small-batch) stochastic heavy-ball momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. By contrast, in the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum. By introducing hyperparameters that depend on the number of samples, we propose a new algorithm, sDANA (stochastic dimension-adjusted Nesterov acceleration), which obtains an asymptotically optimal average-case complexity while remaining linearly convergent in the strongly convex setting without adjusting parameters.
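A minimal numpy simulation of the object of study, small-batch stochastic heavy-ball momentum on a random least-squares problem, looks like the following; the step size and momentum values are illustrative, not the paper's tuned hyperparameters.

```python
# Toy version of the setting analyzed in the paper: stochastic
# heavy-ball momentum on a random least-squares problem. The
# hyperparameters alpha and beta are illustrative defaults.
import numpy as np

def stochastic_heavy_ball(n=2000, d=500, batch=16, steps=5000,
                          alpha=1e-2, beta=0.9, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)   # random design matrix
    b = A @ rng.standard_normal(d)                 # noiseless targets
    w, v = np.zeros(d), np.zeros(d)
    losses = []
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)       # small minibatch
        g = A[idx].T @ (A[idx] @ w - b) / batch    # stochastic gradient
        v = beta * v + g                           # momentum buffer
        w = w - alpha * v                          # heavy-ball step
        losses.append(0.5 * np.mean((A @ w - b) ** 2))
    return np.array(losses)
```

Setting beta to 0 recovers plain SGD, so the claim that fixed-momentum heavy-ball gives no real advantage over well-tuned SGD can be probed empirically by sweeping alpha under both settings.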
From Time-series Generation, Model Selection to Transfer Learning: A Comparative Review of Pixel-wise Approaches for Large-scale Crop Mapping
Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal time-series generation approaches and supervised crop mapping models, we conducted systematic experiments comparing six widely adopted satellite image-based preprocessing methods alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of optimal methods was conducted across five diverse agricultural sites, with Landsat 8 serving as the primary satellite data source and labels drawn from Cropland Data Layer (CDL) trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows, while random forest (RF) offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with unsupervised domain adaptation (UDA) being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples: with a sufficient sample size, supervised training typically delivers more accurate and generalizable results, while below a certain threshold, transfer learning matched to the level of domain shift is a viable alternative for crop mapping. All code is publicly available to encourage reproducibility.
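As a concrete reference point for the conventional supervised workflow, a pixel-wise random forest baseline of the kind compared in the review might look as follows; the array shapes and the flattened time-series feature layout are assumptions, not the study's exact pipeline.

```python
# Hedged sketch of a conventional supervised baseline: a random forest
# classifying each pixel from its flattened reflectance time series.
# Shapes and feature layout are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def pixelwise_rf(X, y, n_trees=200):
    """X: (n_pixels, n_timesteps * n_bands) time-series features;
    y: (n_pixels,) crop-type labels (e.g., CDL codes)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1,
                                random_state=42)
    rf.fit(X_tr, y_tr)
    print("overall accuracy:", accuracy_score(y_te, rf.predict(X_te)))
    return rf
```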
HouseTS: A Large-Scale, Multimodal Spatiotemporal U.S. Housing Dataset
Shengkun Wang, Yanshen Sun, Fanglan Chen, Linhan Wang, Naren Ramakrishnan, Chang-Tien Lu, Yinlin Chen
Accurate house-price forecasting is essential for investors, planners, and researchers. However, reproducible benchmarks with sufficient spatiotemporal depth and contextual richness for long-horizon prediction remain scarce. To address this, we introduce HouseTS, a large-scale, multimodal dataset covering monthly house prices from March 2012 to December 2023 across 6,000 ZIP codes in 30 major U.S. metropolitan areas. The dataset includes over 890K records, enriched with points of interest (POIs), socioeconomic indicators, and detailed real estate metrics. To establish standardized performance baselines, we evaluate 14 models spanning classical statistical approaches, deep neural networks (DNNs), and pretrained time-series foundation models. We further demonstrate the value of HouseTS in a multimodal case study, in which a vision-language model extracts structured textual descriptions of geographic change from time-stamped satellite imagery, enabling interpretable, grounded insights into urban evolution. HouseTS is hosted on Kaggle, while all preprocessing pipelines, benchmark code, and documentation are openly maintained on GitHub to ensure full reproducibility and easy adoption.
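As an illustration of the kind of classical baseline such a benchmark includes, here is a seasonal-naive forecast scored with MAPE over per-ZIP monthly series; the column names (`zip`, `date`, `price`) are assumptions about the dataset schema, not its documented layout.

```python
# Seasonal-naive baseline for per-ZIP monthly price series. Column
# names are assumed, not taken from the HouseTS documentation.
import numpy as np
import pandas as pd

def seasonal_naive_mape(df, horizon=12, season=12):
    """Forecast each ZIP's last `horizon` months with the value from
    one season earlier; report mean absolute percentage error."""
    errs = []
    for _, g in df.sort_values("date").groupby("zip"):
        prices = g["price"].to_numpy()
        if len(prices) <= horizon + season:
            continue  # skip series too short to score
        actual = prices[-horizon:]
        forecast = prices[-horizon - season:-season]
        errs.append(np.mean(np.abs((actual - forecast) / actual)))
    return 100 * float(np.mean(errs))
```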
USCILab3D: A Large-scale, Long-term, Semantically Annotated Outdoor Dataset
In this paper, we introduce the USCILab3D dataset, a large-scale, annotated outdoor dataset designed for versatile applications across multiple domains, including computer vision, robotics, and machine learning. The dataset was acquired using a mobile robot equipped with 5 cameras and a 32-beam, 360° scanning LiDAR. The robot was teleoperated, over the course of a year and under a variety of weather and lighting conditions, through a rich variety of paths within the USC campus (229 acres, approximately 92.7 hectares). The raw data was annotated using state-of-the-art large foundation models and processed to provide multi-view imagery, 3D reconstructions, semantically annotated images and point clouds (267 semantic categories), and text descriptions of the images and objects within them. The dataset also supports a diverse array of complex analyses through its pose-stamped trajectory data.
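A plausible access pattern for the semantically annotated point clouds is sketched below; the file names, array layout, and label ids are hypothetical, since the abstract does not specify the released format.

```python
# Hypothetical access pattern: points as an (N, 3) array with a
# parallel (N,) array of semantic label ids. File names and the
# label scheme are assumptions, not the dataset's published layout.
import numpy as np

def extract_category(points, labels, category_id):
    """Return xyz coordinates of all points carrying one of the
    267 semantic categories."""
    return points[labels == category_id]

# usage sketch with assumed file names
points = np.load("scan_points.npy")   # (N, 3) float32
labels = np.load("scan_labels.npy")   # (N,) int32 category ids
trees = extract_category(points, labels, category_id=42)
```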
Large-Scale Distributed Learning via Private On-Device LSH
Locality-sensitive hashing (LSH) based frameworks have been used efficiently to select weight vectors in a dense hidden layer with high cosine similarity to an input, enabling dynamic pruning. While this type of scheme has been shown to improve computational training efficiency, existing algorithms require repeated randomized projection of the full layer weight, which is impractical for computational- and memory-constrained devices. In a distributed setting, deferring LSH analysis to a centralized host is (i) slow if the device cluster is large and (ii) requires access to input data, which is forbidden in a federated context. Using a new family of hash functions, we develop the first private, personalized, and memory-efficient on-device LSH framework. Our framework enables privacy and personalization by allowing each device to generate hash tables, without the help of a central host, using device-specific hashing hyperparameters (e.g., number of hash tables or hash length). Hash tables are generated with a compressed set of the full weights, and can be serially generated and discarded if the process is memory-intensive. This allows devices to avoid maintaining (i) the fully-sized model and (ii) large amounts of hash tables in local memory for LSH analysis. We prove several statistical and sensitivity properties of our hash functions, and experimentally demonstrate that our framework is competitive in training large-scale recommender networks compared to other LSH frameworks which assume unrestricted on-device capacity.
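To ground the idea, here is a minimal sketch of random-projection (SimHash) LSH used to select hidden-layer weight rows with high cosine similarity to an input; the weight compression the paper's framework applies before hashing is omitted, and all hyperparameter values are illustrative.

```python
# Toy SimHash LSH over a weight matrix W (rows = weight vectors).
# Hyperparameter values are illustrative; the paper's compressed
# hashing and privacy mechanisms are not reproduced here.
import numpy as np

def build_tables(W, n_tables=4, hash_len=8, seed=0):
    """Hash each weight row into buckets of signed random projections
    (n_tables and hash_len play the device-specific roles)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_tables, hash_len, W.shape[1]))
    tables = []
    for P in planes:
        codes = (W @ P.T > 0).astype(np.uint8)   # sign pattern per row
        keys = np.packbits(codes, axis=1)[:, 0]  # bucket id per row
        buckets = {}
        for row, key in enumerate(keys):
            buckets.setdefault(int(key), []).append(row)
        tables.append(buckets)
    return planes, tables

def query(x, planes, tables):
    """Union of buckets matching the input's hash in any table:
    the rows most likely to have high cosine similarity to x."""
    active = set()
    for P, buckets in zip(planes, tables):
        key = int(np.packbits((P @ x > 0).astype(np.uint8))[0])
        active.update(buckets.get(key, []))
    return sorted(active)
```

Because each device can draw its own `planes` with a device-specific seed, table count, and hash length, even this toy version hints at how personalization and host-free table generation fit together.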