pyp
Flexible Models for Microclustering with Application to Entity Resolution
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
- Asia > Middle East > Syria (0.14)
- North America > United States (0.14)
- Europe > Italy (0.05)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Government (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
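The abstract above names Pitman-Yor process mixture models among the infinitely exchangeable clustering models. As a rough illustration only (not the authors' code), the sketch below simulates cluster sizes under the standard two-parameter Chinese restaurant process, the predictive rule associated with the Pitman-Yor process. The function name and the parameter names `discount` and `concentration` are assumptions chosen for this example.

```python
import random
from typing import List


def sample_pitman_yor_partition(n: int, discount: float, concentration: float,
                                seed: int = 0) -> List[int]:
    """Return cluster sizes drawn from the two-parameter Chinese restaurant process.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    if not 0.0 <= discount < 1.0 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    sizes: List[int] = []
    for i in range(n):  # i data points have already been assigned
        # Total unnormalised mass is concentration + i.
        u = rng.random() * (concentration + i)
        # A new cluster has mass concentration + discount * (number of clusters).
        new_mass = concentration + discount * len(sizes)
        if not sizes or u < new_mass:
            sizes.append(1)
            continue
        u -= new_mass
        # An existing cluster k has mass (size_k - discount).
        for k, size in enumerate(sizes):
            u -= size - discount
            if u < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


if __name__ == "__main__":
    for n in (1_000, 4_000):
        sizes = sample_pitman_yor_partition(n, discount=0.5, concentration=1.0)
        print(n, len(sizes), max(sizes) / n)
```

Running the loop for increasing `n` lets a reader inspect how the largest cluster's share of the data behaves under this prior, which is the quantity the microclustering property is concerned with.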
Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Favaro, Stefano, Sesia, Matteo
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
To See Proteins Change in Quadrillionths of a Second, Use AI
Have you ever had an otherwise perfect photo ruined by someone who moved too quickly and caused a blur? Scientists have the same issue while recording images of proteins that change their structure in response to light. This process is common in nature, so for years researchers have tried to capture its details. But they have long been thwarted by how incredibly fast it happens. Now a team of researchers from the University of Wisconsin Milwaukee and the Center for Free-Electron Laser Science at the Deutsches Elektronen-Synchrotron in Germany have combined machine learning and quantum mechanical calculations to get the most precise record yet of structural changes in a photoactive yellow protein (PYP) that has been excited by light.
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.26)
- Europe > Germany (0.26)
- North America > United States > California (0.06)
- North America > United States > Arizona (0.06)
Learning-augmented count-min sketches via Bayesian nonparametrics
Dolera, Emanuele, Favaro, Stefano, Peluchetti, Stefano
The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of tokens' frequencies in a data stream, i.e. point queries, based on randomly hashed data. Learning-augmented CMSs improve the CMS by learning models that better exploit properties of the data. In this paper, we focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS 2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors. This is referred to as the CMS-DP, and it leads to BNP estimates of a point query as posterior means of the point query given the hashed data. While BNP modeling has proved to be a powerful tool for developing robust learning-augmented CMSs, the ideas and methods behind the CMS-DP are tailored to point queries under DP priors, and they cannot be used for other priors or more general queries. In this paper, we present an alternative, and more flexible, derivation of the CMS-DP such that: i) it allows the use of the Pitman-Yor process (PYP) prior, which is arguably the most popular generalization of the DP prior; ii) it can be readily applied to the more general problem of estimating range queries. This leads to a novel learning-augmented CMS for power-law data streams, referred to as the CMS-PYP, which relies on BNP modeling of the stream via PYP priors. Applications to synthetic and real data show that the CMS-PYP outperforms the CMS and the CMS-DP in the estimation of low-frequency tokens; this is known to be a critical feature in natural language processing, where it is indeed common to encounter power-law data streams.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy (0.04)
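For readers unfamiliar with the data structure named in the abstract above, here is a minimal sketch of a plain count-min sketch answering point queries. It shows only the classical CMS, not the learning-augmented CMS-DP or CMS-PYP estimators. The class name, the `width` and `depth` parameters, and the use of salted BLAKE2b as the family of hash functions are assumptions made for this example.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Plain count-min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _bucket(self, token: str, row: int) -> int:
        # Salt the hash with the row index to obtain one hash function per row.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, token: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def query(self, token: str) -> int:
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true frequency.
        return min(self.table[row][self._bucket(token, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(width=64, depth=4)
    for token in ["a"] * 5 + ["b"] * 3 + ["c"]:
        cms.add(token)
    print(cms.query("a"), cms.query("b"), cms.query("c"))
```

The point-query estimate never underestimates the true count; the Bayesian nonparametric variants described in the abstract replace the minimum with a posterior mean given the hashed counts.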
Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks
Bloem-Reddy, Benjamin, Foster, Adam, Mathieu, Emile, Teh, Yee Whye
Empirical evidence suggests that heavy-tailed degree distributions occurring in many real networks are well-approximated by power laws with exponents $\eta$ that may take values either less than or greater than two. Models based on various forms of exchangeability are able to capture power laws with $\eta < 2$, and admit tractable inference algorithms; we draw on previous results to show that $\eta > 2$ cannot be generated by the forms of exchangeability used in existing random graph models. Preferential attachment models generate power law exponents greater than two, but have been of limited use as statistical models due to the inherent difficulty of performing inference in non-exchangeable models. Motivated by this gap, we design and implement inference algorithms for a recently proposed class of models that generates $\eta$ of all possible values. We show that although they are not exchangeable, these models have probabilistic structure amenable to inference. Our methods make a large class of previously intractable models useful for statistical inference.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.70)
Flexible Models for Microclustering with Application to Entity Resolution
Betancourt, Brenda, Zanella, Giacomo, Miller, Jeffrey W., Wallach, Hanna, Zaidi, Abbas, Steorts, Rebecca C.
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
- Asia > Middle East > Syria (0.14)
- North America > United States (0.14)
- Europe > Italy (0.05)
- Government (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Flexible Models for Microclustering with Application to Entity Resolution
Zanella, Giacomo, Betancourt, Brenda, Wallach, Hanna, Miller, Jeffrey, Zaidi, Abbas, Steorts, Rebecca C.
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
- Asia > Middle East > Syria (0.14)
- North America > United States (0.14)
- Europe > Italy (0.05)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Government (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.83)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Bibliographic Analysis with the Citation Network Topic Model
Bibliographic analysis considers authors' research areas, the citation network and paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. We propose a novel and efficient inference algorithm for the model to explore subsets of research publications from CiteSeerX. Our model demonstrates improved performance in both model fitting and a clustering task compared to several baselines.
- Asia > Middle East > Jordan (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers authors' research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a nonparametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task.
- Europe (0.46)
- Asia (0.28)
- Africa (0.14)
- Health & Medicine (1.00)
- Government (1.00)
- Education (1.00)
Extracting Topical Phrases from Clinical Documents
In clinical documents, medical terms are often expressed as multi-word phrases. Traditional topic modelling approaches relying on the "bag-of-words" assumption are not effective in extracting topic themes from clinical documents. This paper proposes to first extract medical phrases using an off-the-shelf tool for medical concept mention extraction, and then train a topic model which takes a hierarchy of Pitman-Yor processes as a prior for modelling the generation of phrases of arbitrary length. Experimental results on patients' discharge summaries show that the proposed approach outperforms the state-of-the-art topical phrase extraction model on both perplexity and topic coherence measures, and finds more interpretable topics.
- Asia > Middle East > Jordan (0.05)
- North America > United States > California (0.04)