AITopics | bert

5812f92450ccaf17275500841c70924a-Supplemental.pdf

Neural Information Processing Systems | Apr-26-2026, 00:44:08 GMT

We present a brief proof of the local optimality of one-hot encodings in the decision-theoretic framework presented in Section 3.2. We seek to prove that, under the assumptions of an identity reward matrix, tokens constrained to a unit hypercube, and additive Gaussian noise, one-hot tokens are an optimally robust communication strategy. We only seek to prove local optimality, as one may trivially generate multiple, equally optimal tokens by, for example, flipping all bits. The following derivation uses Karush-Kuhn-Tucker (KKT) conditions, a generalization of Lagrange multipliers [17]. We maximize the function, subject to constraints. [Equations (13) and (14) did not survive extraction; the recoverable fragment is a stationarity condition involving $T_i$, $T_j$, $\|T_j\|^2$ and the multiplier vectors $\vec{\mu}_i$ and $\vec{\lambda}_i$, set equal to $\vec{0}$.] We seek to show that one-hot vectors are an optimum, so we now show that one-hot vectors indeed respect the constraints and set the derivatives to zero.

agent, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (0.68)
Questionnaire & Opinion Survey (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Pay Attention to MLPs

Neural Information Processing Systems | Apr-25-2026, 19:27:46 GMT

Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

239f914f30ea3c948fce2ea07a9efb33-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 03:10:57 GMT

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

7a6a74cbe87bc60030a4bd041dd47b78-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-19-2026, 03:17:53 GMT

baseline, bert, monolingual data, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

88dddaf430b5bc38ab8228902bb61821-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-15-2026, 17:43:29 GMT

Supplementary figure 1. Ablation study. Each row represents the ablated layer and each column the module that is ablated from that layer; for example, the first panel shows ablation of attention-key in layer 5. Different layers in the GPT2-XL model were ablated, and the consequence of ablation on curvature was measured for 2000 sentences in the UD corpus. The red bar shows the layer where ablation was applied.

Supplementary figure 3. A. Curvature values for 2000 sampled sentences in the RWKV model (RNN), for both the trained and untrained versions. B. Correlation between model-generated surprisal and curvature in the RWKV model. Diamonds: syntactic surprisal.

Supplementary figure 5. Effect of different decoding strategies on GPT2-XL sequence generation and its comparison to ground truth (true), same as figure 4b in the main manuscript.

curvature, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le

Neural Information Processing Systems | Feb-14-2026, 14:43:33 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, bert, objective, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

dc6a7e655d7e5840e66733e9ee67cc69-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-14-2026, 14:43:18 GMT

We thank all the reviewers for their helpful suggestions. We will incorporate the following analysis into our revision. Firstly, we found 4 typical patterns shared by both, as shown in Figure 1. Figure 1: Attention patterns shared by XLNet and BERT. Rows and columns represent query and key, respectively.

bert, large language model, natural language, (7 more...)

Neural Information Processing Systems

Country: North America > United States > New York (0.06)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)

f13ceb1b94145aad0e54186373cc86d7-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 19:51:27 GMT

constraint, music domain, prototype, (14 more...)

Neural Information Processing Systems

Genre: Questionnaire & Opinion Survey (0.47)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

f08223bc8d177df6807811c32f5acfed-Paper-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 19:16:01 GMT

polysemy, representation, word representation, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(7 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)

Mariya Toneva, Leila Wehbe

Neural Information Processing Systems | Feb-12-2026, 15:07:17 GMT

We use brain imaging recordings of subjects reading complex natural text to interpret word and sequence embeddings from 4 recent NLP models: ELMo, USE, BERT and Transformer-XL. We study how their representations differ across layer depth, context length, and attention type.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: