
subcommunity


Predictively Combatting Toxicity in Health-related Online Discussions through Machine Learning

Paz-Ruza, Jorge, Alonso-Betanzos, Amparo, Guijarro-Berdiñas, Bertha, Eiras-Franco, Carlos

arXiv.org Artificial Intelligence

In health-related topics, user toxicity in online discussions frequently becomes a source of social conflict or of promotion of dangerous, unscientific behaviour; common approaches for battling it include different forms of detection, flagging and/or removal of existing toxic comments, which is often counterproductive for platforms and users alike. In this work, we propose the alternative of combatting user toxicity predictively, anticipating where a user could interact toxically in health-related online discussions. Reddit's hierarchical and decentralised structure made it a hub of heated debate during the onset of the COVID pandemic, with over 200,000 related posts per day; volunteer-based moderation on such platforms is generally more susceptible to bias and under-moderation, depending on the platform's audience. Among our contributions is the design of an adapted Leave-Out-Last-Item data partitioning method suitable for binary classification-oriented Collaborative Filtering tasks. To tag the toxicity of comments we use Detoxify-original [7], a pre-trained language model, and we remove "generic" comments, i.e. those containing no relevant terms, from the set. The majority of users do not post toxic comments when discussing health on Reddit, with toxic comments representing 9.96% of the aggregate, in line with previous work. Furthermore, as Figure 2 shows, a user's toxicity on a subreddit tends to be consistent (toxic or non-toxic), as indicated by the peaks of the distribution at toxicities 0 and 1. Instead of only detecting and punishing the toxicity of existing interactions, as common content moderation methods do, which is ineffective and counterproductive in the long term, this work's proposal is to predict the toxicity of unobserved interactions; Figure 5 shows the topology of the machine learning model proposed to predict the toxicity of health-related conversations in unobserved user-subreddit interactions on the Reddit platform. We assessed the predictive ability of our model and baselines using classical binary classification metrics: sensitivity, specificity, and their geometric mean (G-Mean). We identify different avenues of future work.
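The evaluation setup the abstract describes can be sketched in plain Python: a minimal Leave-Out-Last-Item split that holds out each user's most recent interaction for testing, plus the G-Mean of sensitivity and specificity. This is a hedged illustration only; the tuple layout and all function names are assumptions, not the paper's actual code.

```python
from collections import defaultdict

def leave_out_last_item(interactions):
    """Leave-Out-Last-Item split for binary-classification-oriented
    Collaborative Filtering: for each user, hold out their most recent
    interaction for testing and train on the rest.

    `interactions` is a list of (user, item, timestamp, label) tuples;
    this layout is illustrative, not taken from the paper.
    """
    by_user = defaultdict(list)
    for user, item, ts, label in interactions:
        by_user[user].append((ts, item, label))

    train, test = [], []
    for user, events in by_user.items():
        events.sort()                      # chronological order
        *rest, last = events
        test.append((user, last[1], last[2]))
        train.extend((user, item, label) for _, item, label in rest)
    return train, test

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity, a standard
    metric for imbalanced binary classification."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity * specificity) ** 0.5
```

The split keeps the test set chronologically last per user, which avoids leaking a user's future behaviour into training.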


Reviews: On the Recursive Teaching Dimension of VC Classes

Neural Information Processing Systems

The paper is very insightful - the authors quite nicely explain the approach they took for proving their results. The questions addressed, while interesting only for a fairly small subcommunity of the machine learning community, are really important in that subcommunity, and the authors have achieved a substantial breakthrough on an open problem posed in COLT 2015. I quite liked the idea to formulate the problem of finding a concept class with RTD = (3/2)·VCD as a SAT problem. In my eyes, the results should definitely be published, and they are important enough to deserve publication in a leading venue like NIPS. The paper is generally well written and easy to read, but there are a few minor (easy to fix) issues (mostly just typos etc.).


LISTN: Lexicon induction with socio-temporal nuance

de Kock, Christine

arXiv.org Artificial Intelligence

In-group language is an important signifier of group dynamics. This paper proposes a novel method for inducing lexicons of in-group language, which incorporates its socio-temporal context. Existing methods for lexicon induction do not capture the evolving nature of in-group language, nor the social structure of the community. Using dynamic word and user embeddings trained on conversations from online anti-women communities, our approach outperforms prior methods for lexicon induction. We develop a test set for the task of lexicon induction and a new lexicon of manosphere language, validated by human experts, which quantifies the relevance of each term to a specific sub-community at a given point in time. Finally, we present novel insights on in-group language which illustrate the utility of this approach.
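The core retrieval step behind embedding-based lexicon induction can be sketched very simply: rank candidate terms by the similarity of their embedding to a community's embedding at a given time. The sketch below is an illustrative assumption; LISTN's actual training of dynamic word and user embeddings is not reproduced here, and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score_terms(word_vecs_t, community_vec_t):
    """Rank candidate terms by similarity to a community's embedding
    at time t. `word_vecs_t` maps term -> vector at that time slice;
    running this per slice yields a time-indexed lexicon score."""
    return sorted(word_vecs_t.items(),
                  key=lambda kv: -cosine(kv[1], community_vec_t))
```

Scoring per time slice is what lets the induced lexicon quantify a term's relevance to a sub-community at a given point in time.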


Microbiome subcommunity learning with logistic-tree normal latent Dirichlet allocation

LeBlanc, Patrick, Ma, Li

arXiv.org Machine Learning

Mixed-membership (MM) models such as Latent Dirichlet Allocation (LDA) have been applied to microbiome compositional data to identify latent subcommunities of microbial species. However, microbiome compositional data, especially those collected from the gut, typically display substantial cross-sample heterogeneities in the subcommunity composition which current MM methods do not account for. To address this limitation, we incorporate the logistic-tree normal (LTN) model -- using the phylogenetic tree structure -- into the LDA model to form a new MM model. This model allows variation in the composition of each subcommunity around some "centroid" composition. Incorporation of auxiliary Pólya-Gamma variables enables a computationally efficient collapsed blocked Gibbs sampler to carry out Bayesian inference under this model. We compare the new model and LDA and show that in the presence of large cross-sample heterogeneity, under the LDA model the resulting inference can be extremely sensitive to the specification of the total number of subcommunities as it does not account for cross-sample heterogeneity. As such, the popular strategy in other applications of MM models of overspecifying the number of subcommunities -- and hoping that some meaningful subcommunities will emerge among artificial ones -- can lead to highly misleading conclusions in the microbiome context. In contrast, by accounting for such heterogeneity, our MM model restores the robustness of the inference in the specification of the number of subcommunities and again allows meaningful subcommunities to be identified under this strategy.
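The tree-based reparameterisation at the heart of logistic-tree normal models can be illustrated with a toy: each internal node of the phylogenetic tree carries a log-odds parameter, and probability mass is split down the tree until it reaches the leaves (taxa), so the leaf masses always form a valid composition. This is a minimal sketch under assumed data structures, not the paper's model or code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probs(tree, logodds, node="root", p=1.0, out=None):
    """Propagate probability mass down a binary tree: each internal
    node sends sigmoid(logodds[node]) of its mass to the left child
    and the remainder to the right, so the leaf masses sum to 1.

    `tree` maps internal node -> (left, right); anything not in `tree`
    is a leaf (taxon). The encoding is an illustrative assumption.
    """
    out = {} if out is None else out
    if node not in tree:                 # leaf: accumulate its mass
        out[node] = out.get(node, 0.0) + p
        return out
    left, right = tree[node]
    q = sigmoid(logodds[node])
    leaf_probs(tree, logodds, left, p * q, out)
    leaf_probs(tree, logodds, right, p * (1.0 - q), out)
    return out
```

Because the node parameters are unconstrained reals, a multivariate normal prior on them (the "logistic-tree normal" idea) induces a distribution over compositions around a centroid, which is what allows each subcommunity to vary across samples.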


Pyfectious: An individual-level simulator to discover optimal containment policies for epidemic diseases

Mehrjou, Arash, Soleymani, Ashkan, Abyaneh, Amin, Bhatt, Samir, Schölkopf, Bernhard, Bauer, Stefan

arXiv.org Artificial Intelligence

Simulating the spread of infectious diseases in human communities is critical for predicting the trajectory of an epidemic and verifying various policies to control the devastating impacts of the outbreak. Many existing simulators are based on compartment models that divide people into a few subsets and simulate the dynamics among those subsets using hypothesized differential equations. However, these models lack the requisite granularity to study the effect of intelligent policies that influence every individual in a particular way. In this work, we introduce a simulator software capable of modeling a population structure and controlling the disease's propagation at an individualistic level. In order to estimate the confidence of the conclusions drawn from the simulator, we employ a comprehensive probabilistic approach where the entire population is constructed as a hierarchical random variable. This approach makes the inferred conclusions more robust against sampling artifacts and gives confidence bounds for decisions based on the simulation results. To showcase potential applications, the simulator parameters are set based on the formal statistics of the COVID-19 pandemic, and the outcome of a wide range of control measures is investigated. Furthermore, the simulator is used as the environment of a reinforcement learning problem to find the optimal policies to control the pandemic. The obtained experimental results indicate the simulator's adaptability and capacity in making sound predictions and a successful policy derivation example based on real-world data. As an exemplary application, our results show that the proposed policy discovery method can lead to control measures that produce significantly fewer infected individuals in the population and protect the health system against saturation.
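The contrast the abstract draws with compartment models is easiest to see in code: an individual-level simulator tracks every person's state separately, so a policy can act on specific individuals rather than on aggregate compartments. Below is a toy individual-level SIR loop with random contacts; all parameter names and values are illustrative assumptions and have nothing to do with Pyfectious's actual API.

```python
import random

def simulate_sir(n=500, beta=0.05, gamma=0.1, contacts=8, steps=60, seed=0):
    """Toy individual-level SIR simulation.

    Each step, every infected individual meets `contacts` random
    people, infecting susceptible ones with probability `beta`, then
    recovers with probability `gamma`. Returns the (S, I, R) counts
    per step. Parameters are hypothetical, not Pyfectious's.
    """
    rng = random.Random(seed)
    state = ["S"] * n
    for i in range(5):                   # seed a few initial infections
        state[i] = "I"
    history = []
    for _ in range(steps):
        infected = [i for i, s in enumerate(state) if s == "I"]
        for i in infected:
            for _ in range(contacts):    # random mixing; a policy could
                j = rng.randrange(n)     # instead restrict whom i meets
                if state[j] == "S" and rng.random() < beta:
                    state[j] = "I"
            if rng.random() < gamma:
                state[i] = "R"
        history.append(tuple(state.count(s) for s in "SIR"))
    return history
```

Because each individual is addressable, an intervention (quarantining person `i`, say, by zeroing their contacts) is a one-line change here, whereas a compartmental ODE model cannot express it at all; the same per-individual state is also what a reinforcement-learning policy would observe and act on.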