menzerath
Humpback whale songs have patterns that resemble human language
Humpback whale songs have statistical patterns in their structure that are remarkably similar to those seen in human language. While this doesn't mean the songs convey complex meanings like our sentences do, it hints that whales may learn their songs in a similar way to how human infants start to understand language. Only male humpback whales (Megaptera novaeangliae) sing, and the behaviour is thought to be important for attracting mates. The songs are constantly evolving, with new elements appearing and spreading through the population until the old song is completely replaced with a new one. We're finally realising that many species are "We think it's a little bit like a standardised test, where everybody's got to do the same task but you can make changes and embellishments to show that you're better at the task than everybody else," says Jenny Allen at Griffith University in Gold Coast, Australia.
- Oceania > Australia (0.25)
- Pacific Ocean (0.05)
- Oceania > New Caledonia (0.05)
- (2 more...)
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
Suyunu, Burak, Taylan, Enes, Özgür, Arzucan
Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
- North America > United States (0.14)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Simple stochastic processes behind Menzerath's Law
This paper revisits Menzerath's Law, also known as the Menzerath-Altmann Law, which models a relationship between the length of a linguistic construct and the average length of its constituents. Recent findings indicate that simple stochastic processes can display Menzerathian behaviour, though existing models fail to accurately reflect real-world data. If we adopt the basic principle that a word can change its length in both syllables and phonemes, where the correlation between these variables is not perfect and these changes are of a multiplicative nature, we get bivariate log-normal distribution. The present paper shows, that from this very simple principle, we obtain the classic Altmann model of the Menzerath-Altmann Law. If we model the joint distribution separately and independently from the marginal distributions, we can obtain an even more accurate model by using a Gaussian copula. The models are confronted with empirical data, and alternative approaches are discussed.
- Europe > Netherlands > South Holland > Dordrecht (0.05)
- Europe > Czechia > Prague (0.05)
- Europe > United Kingdom (0.04)
- (3 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Yolo County > Davis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- (9 more...)
- Research Report (0.40)
- Overview (0.34)
Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko
Buk, Solomija, Rovenchak, Andrij
Year 2006 is the 150th anniversary of Ivan Franko (1856-1916), the prominent Ukrainian writer, poet, publicist, philosopher, sociologist, economist, translator-polyglot and the public figure. His incomplete collected works were published in 50 volumes (Franko, 1976-86). With this name the notion of national identity in the Western Ukraine is connected. Franko's works have intensive plot and interesting topic.
- Europe > Ukraine > Lviv Oblast > Lviv (0.06)
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
- Europe > Ukraine > Kharkiv Oblast > Kharkiv (0.05)
- (3 more...)