AITopics

e287f0b2e730059c55d97fa92649f4f2-AuthorFeedback.pdf

Neural Information Processing Systems, Mar-21-2025, 08:29:34 GMT

The execution time for inference is not provided in the paper. We will state this in the next revision. The advantage of the proposed algorithm is clear for the discrete tasks but not for continuous tasks. The results are competitive with SOTA so we have elected to include them for completeness. I think the authors used seven datasets out of the eight datasets described in the paper [7].

artificial intelligence, dataset, machine learning

3accfe8332366a6f740d8740cd4cd653-Supplemental-Conference.pdf

Neural Information Processing Systems, Mar-21-2025, 08:29:31 GMT

large language model, machine learning, natural language

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints

Neural Information Processing Systems, Mar-21-2025, 08:29:28 GMT

Neuro-symbolic learning often requires maximizing the likelihood of a symbolic constraint w.r.t. the neural network's output distribution. Such output distributions are typically assumed to be fully factorized. This assumption limits the applicability of neuro-symbolic learning to the more expressive autoregressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire output distribution, we propose to do so on a random, local approximation thereof.
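To make the fully-factorized setting concrete, the sketch below computes the likelihood of a constraint when every output bit is an independent Bernoulli variable, and the corresponding negative log-likelihood. It is not the pseudo-semantic loss from the paper: the function names, the "exactly one bit set" constraint, and the brute-force enumeration are all assumptions made for illustration, and the enumeration only scales to a handful of variables.

```python
import itertools
import math


def constraint_probability(probs, constraint):
    """Probability that `constraint` holds under independent Bernoulli outputs.

    probs[i] is the probability that output bit i equals 1. All 2^n
    assignments are enumerated, so this is only feasible for small n.
    """
    total = 0.0
    for assignment in itertools.product([0, 1], repeat=len(probs)):
        if constraint(assignment):
            weight = 1.0
            for p, bit in zip(probs, assignment):
                weight *= p if bit else (1.0 - p)
            total += weight
    return total


def exactly_one(assignment):
    # Hypothetical example constraint: exactly one output bit is set.
    return sum(assignment) == 1


def semantic_loss(probs, constraint):
    # Negative log-likelihood of the constraint under the factorized distribution.
    return -math.log(constraint_probability(probs, constraint))


if __name__ == "__main__":
    probs = [0.2, 0.3, 0.4]
    print(constraint_probability(probs, exactly_one))
    print(semantic_loss(probs, exactly_one))
```

The per-assignment weight is a simple product only because the bits are independent. Under an autoregressive model each factor would depend on the preceding bits, which is the situation the abstract describes as intractable in general.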

constraint, machine learning, natural language

Dual-Free Stochastic Decentralized Optimization with Variance Reduction

Neural Information Processing Systems, Mar-21-2025, 08:24:01 GMT

We consider the problem of training machine learning models on distributed data in a decentralized way. For finite-sum problems, fast single-machine algorithms for large datasets rely on stochastic updates combined with variance reduction. Yet, existing decentralized stochastic algorithms either do not obtain the full speedup allowed by stochastic updates, or require oracles that are more expensive than regular gradients. In this work, we introduce a Decentralized stochastic algorithm with Variance Reduction called DVR. DVR only requires computing stochastic gradients of the local functions, and is computationally as fast as a standard stochastic variance-reduced algorithm run on a 1/n fraction of the dataset, where n is the number of nodes. To derive DVR, we use Bregman coordinate descent on a well-chosen dual problem, and obtain a dual-free algorithm using a specific Bregman divergence.
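For readers unfamiliar with the single-machine baseline the abstract refers to, here is a minimal sketch of one standard variance-reduced method, SVRG, on a scalar least-squares problem. It is not DVR and involves no decentralization; the function name, toy data, step size, and epoch count are assumptions chosen for illustration.

```python
import random


def svrg(xs, ys, step=0.02, epochs=40, seed=0):
    """Minimize (1/m) * sum_i 0.5 * (w * x_i - y_i)^2 over a scalar w with SVRG."""
    rng = random.Random(seed)
    m = len(xs)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (w * x_i - y_i)^2 with respect to w.
        return (w * xs[i] - ys[i]) * xs[i]

    w = 0.0
    for _ in range(epochs):
        # Snapshot point and its full gradient, recomputed once per epoch.
        w_ref = w
        full_grad = sum(grad_i(w_ref, i) for i in range(m)) / m
        for _ in range(m):
            i = rng.randrange(m)
            # Variance-reduced estimate: unbiased for the full gradient at w,
            # and its variance vanishes as w and w_ref approach the optimum.
            g = grad_i(w, i) - grad_i(w_ref, i) + full_grad
            w -= step * g
    return w


# Toy data generated by y = 2 * x, so the minimizer is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]
w = svrg(xs, ys)
print(w)
```

Each inner step costs one stochastic gradient evaluation at the current point plus one at the snapshot, which is the kind of cheap oracle the abstract contrasts with more expensive alternatives.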

artificial intelligence, machine learning, variance reduction

Untangling tradeoffs between recurrence and self-attention in neural networks

Kyle Goyette

Neural Information Processing Systems, Mar-21-2025, 08:23:53 GMT

Attention and self-attention mechanisms are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and relies on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.
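As a rough illustration of sparse attention over past recurrent states, the sketch below attends only to the k stored hidden states with the highest dot-product score against the current query. This is generic top-k sparse attention, not the relevancy screening mechanism proposed in the paper; the function name, the scoring rule, and the toy vectors are assumptions.

```python
import math


def topk_sparse_attention(query, memory, k):
    """Attend over only the k past hidden states with the highest dot-product score.

    Returns the context vector and the sorted indices of the states that were kept.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in memory]
    # Keep the indices of the k highest-scoring states (ties broken by position).
    keep = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept states, shifted by the max for stability.
    max_score = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - max_score) for i in keep]
    norm = sum(weights)
    context = [0.0] * len(query)
    for weight, i in zip(weights, keep):
        for d in range(len(query)):
            context[d] += (weight / norm) * memory[i][d]
    return context, sorted(keep)


# Toy memory of four past hidden states and a query aligned with the first axis.
memory = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, -1.0]]
context, kept = topk_sparse_attention([1.0, 0.0], memory, k=2)
print(kept, context)
```

With k fixed, the cost of the softmax and weighted sum no longer grows with the sequence length, although this sketch still scores every stored state before selecting.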

gradient propagation, machine learning, natural language
