Towards a theory of how the structure of language is acquired by deep neural networks

May-27-2025, 09:50:03 GMT–Neural Information Processing Systems

How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG)---a hierarchical generative model that captures the tree-like structure of natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem.

deep neural network, effective range, synthetic dataset, (2 more...)

Neural Information Processing Systems

May-27-2025, 09:50:03 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Grammars & Parsing (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.40)