Clustering
Identifying gender bias in blockbuster movies through the lens of machine learning
Haris, Muhammad Junaid, Upreti, Aanchal, Kurtaran, Melih, Ginter, Filip, Lafond, Sebastien, Azimi, Sepinoud
The problem of gender bias is highly prevalent and well known. In this paper, we have analysed the portrayal of gender roles in English movies, a medium that effectively influences society in shaping people's beliefs and opinions. First, we gathered scripts of films from different genres and derived sentiments and emotions using natural language processing techniques. Afterwards, we converted the scripts into embeddings, i.e. a way of representing text in the form of vectors. With a thorough investigation, we found specific patterns in male and female characters' personality traits in movies that align with societal stereotypes. Furthermore, we used mathematical and machine learning techniques and found some biases wherein men are shown to be more dominant and envious than women, whereas women have more joyful roles in movies. In our work, we introduce, to the best of our knowledge, a novel technique to convert dialogues into an array of emotions by combining it with Plutchik's wheel of emotions. Our study aims to encourage reflections on gender equality in the domain of film and facilitate other researchers in analysing movies automatically instead of using manual approaches.
EVNet: An Explainable Deep Network for Dimension Reduction
Zang, Zelin, Cheng, Shenghui, Lu, Linyan, Xia, Hanchen, Li, Liangyu, Sun, Yaoting, Xu, Yongjie, Shang, Lei, Sun, Baigui, Li, Stan Z.
Dimension reduction (DR) is commonly utilized to capture the intrinsic structure and transform high-dimensional data into low-dimensional space while retaining meaningful properties of the original data. It is used in various applications, such as image recognition, single-cell sequencing analysis, and biomarker discovery. However, contemporary parametric-free and parametric DR techniques suffer from several significant shortcomings, such as the inability to preserve global and local features and poor generalization performance. On the other hand, regarding explainability, it is crucial to comprehend the embedding process, especially the contribution of each part to it, and to understand how each feature affects the embedding results, which identifies critical components and helps diagnose the embedding process. To address these problems, we have developed a deep neural network method called EVNet, which provides not only excellent performance in structural maintainability but also explainability to the DR therein. EVNet starts with data augmentation and a manifold-based loss function to improve embedding performance. The explanation is based on saliency maps and aims to examine the trained EVNet parameters and contributions of components during the embedding process. The proposed techniques are integrated with a visual interface to help the user to adjust EVNet to achieve better DR performance and explainability. The interactive visual interface makes it easier to illustrate the data features, compare different DR techniques, and investigate DR. An in-depth experimental comparison shows that EVNet consistently outperforms the state-of-the-art methods in both performance measures and explainability.
Semi-supervised Local Cluster Extraction by Compressive Sensing
Shen, Zhaiming, Lai, Ming-Jun, Li, Sheng
The local clustering problem aims at extracting a small local structure inside a graph without the necessity of knowing the entire graph structure. As the local structure is usually small in size compared to the entire graph, one can think of it as a compressive sensing problem where the indices of the target cluster can be thought of as a sparse solution to a linear system. In this paper, we propose a new semi-supervised local cluster extraction approach by applying the idea of compressive sensing based on two pioneering works under the same framework. Our approach improves on the existing works by taking the initial cut to be the entire graph, and hence overcomes a major limitation of existing works, which is the low quality of the initial cut. Extensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our approach.
Towards Effective Clustered Federated Learning: A Peer-to-peer Framework with Adaptive Neighbor Matching
Li, Zexi, Lu, Jiaxun, Luo, Shuang, Zhu, Didi, Shao, Yunfeng, Li, Yinchuan, Zhang, Zhimeng, Wang, Yongheng, Wu, Chao
In federated learning (FL), clients may have diverse objectives, and merging all clients' knowledge into one global model will cause negative transfer to local performance. Thus, clustered FL is proposed to group similar clients into clusters and maintain several global models. In the literature, centralized clustered FL algorithms require the assumption of the number of clusters and hence are not effective enough to explore the latent relationships among clients. In this paper, without assuming the number of clusters, we propose a peer-to-peer (P2P) FL algorithm named PANM. In PANM, clients communicate with peers to adaptively form an effective clustered topology. Specifically, we present two novel metrics for measuring client similarity and a two-stage neighbor matching algorithm based on the Monte Carlo method and Expectation Maximization under the Gaussian Mixture Model assumption. We have conducted theoretical analyses of PANM on the probability of neighbor estimation and the error gap to the clustered optimum. We have also implemented extensive experiments under both synthetic and real-world clustered heterogeneity. Theoretical analysis and empirical experiments show that the proposed algorithm is superior to the P2P FL counterparts, and it achieves better performance than the centralized clustered FL method. PANM is effective even under extremely low communication budgets.
Hub-VAE: Unsupervised Hub-based Regularization of Variational Autoencoders
Mani, Priya, Domeniconi, Carlotta
Exemplar-based methods rely on informative data points or prototypes to guide the optimization of learning algorithms. Such data facilitate interpretable model design and prediction. Of particular interest is the utility of exemplars in learning unsupervised deep representations. In this paper, we leverage hubs, which emerge as frequent neighbors in high-dimensional spaces, as exemplars to regularize a variational autoencoder and to learn a discriminative embedding for unsupervised down-stream tasks. We propose an unsupervised, data-driven regularization of the latent space with a mixture of hub-based priors and a hub-based contrastive loss. Experimental evaluation shows that our algorithm achieves superior cluster separability in the embedding space, and accurate data reconstruction and generation, compared to baselines and state-of-the-art techniques.
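The notion of hubs as "frequent neighbors" can be illustrated with a k-occurrence count: how many times each point appears in the k-nearest-neighbor lists of the other points. The following is only a minimal sketch of that counting step, assuming Euclidean distance and brute-force search; the function names `k_occurrence` and `top_hubs` are hypothetical, and the sketch does not reproduce the Hub-VAE priors or contrastive loss.

```python
import math
from collections import Counter


def k_occurrence(points, k):
    """Count how often each point appears in other points' k-NN lists."""
    counts = Counter({i: 0 for i in range(len(points))})
    for i, p in enumerate(points):
        # Brute-force distances from point i to every other point.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def top_hubs(points, k, n_hubs):
    """Return indices of the n_hubs points with the highest k-occurrence."""
    return [idx for idx, _ in k_occurrence(points, k).most_common(n_hubs)]


if __name__ == "__main__":
    # A central point surrounded by four neighbors, plus one outlier.
    data = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (5, 6)]
    print(top_hubs(data, k=1, n_hubs=1))
```

In this toy layout the central point is the nearest neighbor of all four surrounding points, so it is the one selected as a hub candidate.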
Clustering based opcode graph generation for malware variant detection
Fok, Kar Wai, Thing, Vrizlynn L. L.
Malwares are the key means leveraged by threat actors in the cyber space for their attacks. There is a large array of commercial solutions in the market and significant scientific research to tackle the challenge of the detection and defense against malwares. At the same time, attackers also advance their capabilities in creating polymorphic and metamorphic malwares to make it increasingly challenging for existing solutions. To tackle this issue, we propose a methodology to perform malware detection and family attribution. The proposed methodology first performs the extraction of opcodes from malwares in each family and constructs their respective opcode graphs. We explore the use of clustering algorithms on the opcode graphs to detect clusters of malwares within the same malware family. Such clusters can be seen as belonging to different sub-family groups. Opcode graph signatures are built from each detected cluster. Hence, for each malware family, a group of signatures is generated to represent the family. These signatures are used to classify an unknown sample as benign or belonging to one of the malware families. We evaluate our methodology by performing experiments on a dataset consisting of both benign files and malware samples belonging to a number of different malware families and comparing the results to existing approaches.
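The abstract does not specify how the opcode graphs are constructed. One common construction, assumed here purely for illustration, treats each opcode as a node and weights a directed edge by the probability that one opcode immediately follows another. The helper names `build_opcode_graph` and `graph_distance`, and the L1 edge-weight distance, are hypothetical choices and not the authors' method.

```python
from collections import defaultdict


def build_opcode_graph(opcodes):
    """Build a directed graph of opcode-to-opcode transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    # Count each consecutive (source, destination) opcode pair.
    for src, dst in zip(opcodes, opcodes[1:]):
        counts[src][dst] += 1
    graph = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        # Normalize outgoing counts so each node's edge weights sum to 1.
        graph[src] = {dst: n / total for dst, n in targets.items()}
    return graph


def graph_distance(g1, g2):
    """Sum of absolute edge-weight differences over the union of edges."""
    edges = {(s, d) for g in (g1, g2) for s, t in g.items() for d in t}
    return sum(
        abs(g1.get(s, {}).get(d, 0.0) - g2.get(s, {}).get(d, 0.0))
        for s, d in edges
    )


if __name__ == "__main__":
    sample = ["push", "mov", "push", "mov", "call"]
    print(build_opcode_graph(sample))
```

A pairwise distance such as `graph_distance` could then feed a clustering algorithm to look for sub-family groups, under the assumptions stated above.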
Modeling chronic pain experiences from online reports using the Reddit Reports of Chronic Pain dataset
Nunes, Diogo A. P., Ferreira-Gomes, Joana, Neto, Fani, de Matos, David Martins
Objective: Reveal and quantify qualities of reported experiences of chronic pain on social media, from multiple pathological backgrounds, by means of the novel Reddit Reports of Chronic Pain (RRCP) dataset, using Natural Language Processing techniques. Materials and Methods: Define and validate the RRCP dataset for a set of subreddits related to chronic pain. Identify the main concerns discussed in each subreddit. Model each subreddit according to their main concerns. Compare subreddit models. Results: The RRCP dataset comprises 86,537 Reddit submissions from 12 subreddits related to chronic pain (each related to one pathological background). Each RRCP subreddit has various main concerns. Some of these concerns are shared between multiple subreddits (e.g., the subreddit Sciatica semantically entails the subreddit backpain in their various concerns, but not the other way around), whilst some concerns are exclusive to specific subreddits (e.g., Interstitialcystitis and CrohnsDisease). Discussion: These results suggest that the reported experience of chronic pain, from multiple pathologies (i.e., subreddits), has concerns relevant to all, and concerns exclusive to certain pathologies. Our analysis details each of these concerns and their similarity relations. Conclusion: Although limited by intrinsic qualities of the Reddit platform, to the best of our knowledge, this is the first research work attempting to model the linguistic expression of various chronic pain-inducing pathologies and comparing these models to identify and quantify the similarities and differences between the corresponding emergent chronic pain experiences.
Hierarchical Clustering in Machine Learning - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. Hierarchical clustering is one of the most famous clustering techniques used in unsupervised machine learning. K-means and hierarchical clustering are the two most popular and effective clustering algorithms. The working mechanism they apply in the backend allows them to provide such a high level of performance. In this article, we will discuss hierarchical clustering and its types, its working mechanisms, its core intuition, the pros and cons of using this clustering strategy and conclude with some fundamentals to remember for this practice.
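As a rough illustration of the working mechanism, the agglomerative (bottom-up) variant starts with every point in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes single linkage (cluster distance is the smallest pairwise point distance) and Euclidean distance; the function name `single_linkage` is hypothetical, and the brute-force search is meant for small inputs only.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering with single linkage; returns index lists."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with each point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters into one.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
    print(single_linkage(data, 3))
```

Recording the sequence of merges and their distances, rather than stopping at a fixed number of clusters, is what yields the dendrogram usually associated with hierarchical clustering.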
Data Dimension Reduction makes ML Algorithms efficient
Khan, Wisal, Turab, Muhammad, Ahmad, Waqas, Ahmad, Syed Hasnat, Kumar, Kelash, Luo, Bin
Data dimension reduction (DDR) is all about mapping data from high dimensions to low dimensions. Various techniques of DDR are being used for image dimension reduction, like Random Projections, Principal Component Analysis (PCA), the Variance approach, LSA-Transform, the Combined and Direct approaches, and the New Random Approach. Auto-encoders (AE) are used to learn end-to-end mapping. In this paper, we demonstrate that pre-processing not only speeds up the algorithms but also improves accuracy in both supervised and unsupervised learning. In pre-processing of DDR, first PCA based DDR is used for supervised learning, then we explore AE based DDR for unsupervised learning. In PCA based DDR, we first compare supervised learning algorithms' accuracy and time before and after applying PCA. Similarly, in AE based DDR, we compare unsupervised learning algorithm accuracy and time before and after AE representation learning. Supervised learning algorithms, including support-vector machines (SVM), Decision Tree with GINI index, Decision Tree with entropy and Stochastic Gradient Descent classifier (SGDC), and an unsupervised learning algorithm, K-means clustering, are used for classification purposes. We used two datasets, MNIST and FashionMNIST. Our experiments show that there is a massive improvement in accuracy and time reduction after pre-processing in both supervised and unsupervised learning.
Unsupervised Learning of Hierarchical Conversation Structure
Lu, Bo-Ru, Hu, Yushi, Cheng, Hao, Smith, Noah A., Ostendorf, Mari
Human conversations can evolve in many different ways, creating challenges for automatic understanding and summarization. Goal-oriented conversations often have meaningful sub-dialogue structure, but it can be highly domain-dependent. This work introduces an unsupervised approach to learning hierarchical conversation structure, including turn and sub-dialogue segment labels, corresponding roughly to dialogue acts and sub-tasks, respectively. The decoded structure is shown to be useful in enhancing neural models of language for three conversation-level understanding tasks. Further, the learned finite-state sub-dialogue network is made interpretable through automatic summarization.