Collaborating Authors

 Sanchez, David


Entropy and type-token ratio in gigaword corpora

arXiv.org Artificial Intelligence

Lexical diversity measures the vocabulary variation in texts. While its utility is evident for analyses in language change and applied linguistics, it is not yet clear how to operationalize this concept in a unique way. Here we investigate entropy and type-token ratio, two widely employed metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a diverse testbed for a quantitative approach to lexical diversity. Strikingly, we find a functional relation between entropy and type-token ratio that holds across the corpora under consideration. Further, in the limit of large vocabularies we find an analytical expression that sheds light on the origin of this relation and its connection with both Zipf's and Heaps' laws. Our results thus contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
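As a concrete illustration of the two metrics compared in this abstract, the sketch below computes the Shannon entropy and the type-token ratio of a tokenized text using only the Python standard library. The whitespace tokenization and the example sentence are illustrative assumptions, not the paper's preprocessing pipeline:

```python
import math
from collections import Counter

def lexical_diversity(tokens):
    """Return (Shannon entropy in bits, type-token ratio) of a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    # Entropy of the empirical word-frequency distribution
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Type-token ratio: number of distinct words over total words
    ttr = len(counts) / n
    return entropy, ttr

tokens = "the cat sat on the mat and the dog sat too".split()
h, ttr = lexical_diversity(tokens)
```

Both quantities depend on text length, which is why the paper's corpus-scale comparison and large-vocabulary limit matter: on short samples like this one, TTR is inflated relative to what the same vocabulary would yield in a gigaword corpus.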


Computational lexical analysis of Flamenco genres

arXiv.org Artificial Intelligence

Flamenco, recognized by UNESCO as part of the Intangible Cultural Heritage of Humanity, is a profound expression of cultural identity rooted in Andalusia, Spain. However, quantitative studies that help identify characteristic patterns in this long-lived music tradition are scarce. In this work, we present a computational analysis of Flamenco lyrics, employing natural language processing and machine learning to categorize over 2000 lyrics into their respective Flamenco genres, termed $\textit{palos}$. Using a Multinomial Naive Bayes classifier, we find that lexical variation across styles enables accurate identification of distinct $\textit{palos}$. More importantly, from an automated analysis of word usage, we obtain the semantic fields that characterize each style. Further, applying a metric that quantifies the inter-genre distance, we perform a network analysis that sheds light on the relationship between Flamenco styles. Remarkably, our results suggest historical connections and $\textit{palo}$ evolutions. Overall, our work illuminates the intricate relationships and cultural significance embedded within Flamenco lyrics, complementing previous qualitative discussions with quantitative analyses and sparking new discussions on the origin and development of traditional music genres.
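To make the classification setup concrete, here is a minimal Multinomial Naive Bayes text classifier with Laplace smoothing written from scratch. The four lyric fragments and the two palo labels are invented toy data, not the paper's 2000-lyric corpus, and the bag-of-words tokenization is an assumption:

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Train a Multinomial Naive Bayes classifier with Laplace smoothing."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = defaultdict(Counter)          # class -> word frequencies
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_counts[c] / len(docs))
        log_lik = {w: math.log((word_counts[c][w] + alpha)
                               / (total + alpha * len(vocab)))
                   for w in vocab}
        model[c] = (log_prior, log_lik)
    return model

def predict_mnb(model, doc):
    """Return the class with the highest posterior log-probability."""
    scores = {c: lp + sum(ll[w] for w in doc.split() if w in ll)
              for c, (lp, ll) in model.items()}
    return max(scores, key=scores.get)

# Toy corpus: lyric fragments with hypothetical palo labels
lyrics = ["pena y llanto en la fragua",
          "fragua yunque y martillo",
          "alegria en la bahia de cadiz",
          "cadiz y su bahia de plata"]
palos = ["martinete", "martinete", "alegrias", "alegrias"]

model = train_mnb(lyrics, palos)
pred = predict_mnb(model, "el martillo y la fragua")
```

The same pipeline scales to the real task by swapping in the full lyric corpus; in practice a library implementation such as scikit-learn's `MultinomialNB` would be the idiomatic choice.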


American cultural regions mapped through the lexical analysis of social media

arXiv.org Artificial Intelligence

Cultural identity is an elusive notion because it depends on a wide range of different cultural factors--including politics, religion, ethnicity, economics, and art, among countless other examples--which will generally differ across individuals, with the cultural background of every individual ultimately being unique. Nevertheless, individuals from the same region can generally be expected to share some cultural traits, reflecting the shared cultural values and practices associated with the region [1]. Identifying the cultural regions of a nation--regions whose populations are characterized by relative cultural homogeneity compared to the populations of other regions within the nation--is very valuable information across a wide range of domains. For example, it is important for governments to understand geographical variation in the values of their population so as to better meet their educational, social, and welfare needs. Seven of the most prominent theories [3-9] are mapped in Figure 1, showing considerable disagreement. For example, in [5] the geographer Wilbur Zelinsky identified 5 major cultural regions--New England, the Midland, the South, the Middle West, and the West--based on a synthesis of regional patterns in a wide range of cultural factors, including ethnicity, religion, economics, and settlement history. Alternatively, in [6], drawing on a similar but more extensive range of cultural factors, the social scientist Raymond Gastil identified 13 major cultural regions, offering a more complex theory than Zelinsky, including by dividing Zelinsky's Midland, Middle West, and West regions. The two studies illustrate two basic limitations with these types of approaches that subjectively synthesize a range of data to infer cultural regions. First, it is unclear exactly how relevant cultural factors should be identified. Zelinsky considers


Ordinal analysis of lexical patterns

arXiv.org Artificial Intelligence

Words are fundamental linguistic units that connect thoughts and things through meaning. However, words do not appear independently in a text sequence. The existence of syntactic rules induces correlations among neighboring words. Using an ordinal pattern approach, we present an analysis of lexical statistical connections for 11 major languages. We find that the diverse ways in which languages express word relations give rise to unique structural distributions of patterns. Furthermore, fluctuations of these pattern distributions for a given language can allow us to determine both the historical period when the text was written and its author. Taken together, our results emphasize the relevance of ordinal time series analysis in linguistic typology, historical linguistics and stylometry.
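A minimal sketch of the ordinal pattern approach follows. It encodes a sentence as the series of its word lengths and counts the permutation that sorts each sliding window; the word-length encoding is one of several possible lexical mappings and is an assumption here, not necessarily the paper's exact construction:

```python
from collections import Counter

def ordinal_patterns(series, dim=3):
    """Count the ordinal (permutation) patterns of all length-`dim` windows."""
    patterns = Counter()
    for i in range(len(series) - dim + 1):
        window = series[i:i + dim]
        # Rank order of the window's values, ties broken by position
        pattern = tuple(sorted(range(dim), key=lambda j: (window[j], j)))
        patterns[pattern] += 1
    return patterns

# Example: word lengths of a sentence serve as the symbolic series
text = "words are fundamental linguistic units that connect thoughts and things"
lengths = [len(w) for w in text.split()]
dist = ordinal_patterns(lengths, dim=3)
```

Comparing such pattern histograms across texts is what enables the typological, historical, and stylometric discrimination described in the abstract: the histogram is cheap to compute, robust to monotone distortions of the series, and concentrates syntactic-correlation information into at most `dim!` bins.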


Enhanced Security and Privacy via Fragmented Federated Learning

arXiv.org Artificial Intelligence

In federated learning (FL), a set of participants share updates computed on their local data with an aggregator server that combines updates into a global model. However, reconciling accuracy with privacy and security is a challenge for FL. On the one hand, good updates sent by honest participants may reveal their private local information, whereas poisoned updates sent by malicious participants may compromise the model's availability and/or integrity. On the other hand, enhancing privacy via update distortion damages accuracy, whereas doing so via update aggregation damages security because it does not allow the server to filter out individual poisoned updates. To tackle the accuracy-privacy-security conflict, we propose {\em fragmented federated learning} (FFL), in which participants randomly exchange and mix fragments of their updates before sending them to the server. To achieve privacy, we design a lightweight protocol that allows participants to privately exchange and mix encrypted fragments of their updates so that the server can neither obtain individual updates nor link them to their originators. To achieve security, we design a reputation-based defense tailored for FFL that builds trust in participants and their mixed updates based on the quality of the fragments they exchange and the mixed updates they send. Since the exchanged fragments' parameters keep their original coordinates and attackers can be neutralized, the server can correctly reconstruct a global model from the received mixed updates without accuracy loss. Experiments on four real data sets show that FFL can prevent semi-honest servers from mounting privacy attacks, can effectively counter poisoning attacks, and can preserve the accuracy of the global model.
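The fragment-mixing idea can be sketched as follows. This toy version swaps coordinate-aligned fragments between randomly paired participants in the clear, omitting the paper's encryption protocol and reputation-based defense; it only illustrates why a coordinate-preserving exchange leaves the aggregated model unchanged:

```python
import random

def exchange_fragments(updates, frag_size, rng):
    """Swap a random coordinate-aligned fragment between participant pairs.

    `updates` maps each participant to its update vector (a list of floats).
    Fragments keep their original coordinates, so every coordinate-wise sum
    over all participants is invariant under the exchange.
    """
    mixed = {p: u[:] for p, u in updates.items()}
    ids = list(mixed)
    rng.shuffle(ids)
    for a, b in zip(ids[::2], ids[1::2]):       # random disjoint pairs
        start = rng.randrange(0, len(mixed[a]) - frag_size + 1)
        for i in range(start, start + frag_size):
            mixed[a][i], mixed[b][i] = mixed[b][i], mixed[a][i]
    return mixed

rng = random.Random(0)
updates = {"p1": [1.0, 2.0, 3.0, 4.0],
           "p2": [5.0, 6.0, 7.0, 8.0],
           "p3": [9.0, 10.0, 11.0, 12.0]}
mixed = exchange_fragments(updates, frag_size=2, rng=rng)
# Coordinate-wise mean of the mixed updates equals that of the originals
mean = [sum(u[i] for u in mixed.values()) / len(mixed) for i in range(4)]
```

Because each swap moves values within the same coordinate, averaging the mixed updates yields exactly the average of the original updates, which is the "no accuracy loss" property the abstract claims for the honest case.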


A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning

arXiv.org Artificial Intelligence

As long ago as the 1970s, official statisticians [Dalenius(1977)] began to worry about potential disclosure of private information on people or companies linked to the publication of statistical outputs. This ushered in the statistical disclosure control (SDC) discipline [Hundepool et al.(2012)], whose goal is to provide methods for data anonymization. Also related to SDC is randomized response (RR, [Warner(1965)]), which was designed in the 1960s as a mechanism to eliminate evasive answer bias in surveys and turned out to be very useful for anonymization. The usual approach to anonymization in official statistics is utility-first: anonymization parameters are iteratively tried until a parameter choice is found that preserves sufficient analytical utility while reducing the risk of disclosing confidential information on specific respondents below a certain threshold. Both utility and privacy are evaluated ex post by respectively measuring the information loss and the probability of re-identification of the anonymized outputs.
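Randomized response can be illustrated with Warner's original design: each respondent answers the direct sensitive question with probability p and its negation otherwise, and the analyst inverts the known perturbation to recover an unbiased prevalence estimate. The population size, true prevalence, and p below are illustrative simulation parameters:

```python
import random

def warner_rr_survey(truths, p, rng):
    """Warner's randomized response: with probability p answer the direct
    question truthfully, otherwise answer its negation."""
    return [t if rng.random() < p else not t for t in truths]

def estimate_prevalence(answers, p):
    """Invert the perturbation: pi_hat = (lambda_hat - (1 - p)) / (2p - 1),
    where lambda_hat is the observed fraction of 'yes' answers (p != 1/2)."""
    lam = sum(answers) / len(answers)
    return (lam - (1 - p)) / (2 * p - 1)

rng = random.Random(42)
# Simulated population with true prevalence ~0.3 of the sensitive attribute
truths = [rng.random() < 0.3 for _ in range(100_000)]
answers = warner_rr_survey(truths, p=0.75, rng=rng)
est = estimate_prevalence(answers, p=0.75)   # close to 0.3
```

No individual answer reveals the respondent's true attribute with certainty, yet the aggregate estimate remains accurate, which is exactly why RR later proved useful for anonymization and, as the review goes on to discuss, underlies local differential privacy mechanisms.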