 bkb


KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing Detection

arXiv.org Artificial Intelligence

Phishing attacks have inflicted substantial losses on individuals and businesses alike, necessitating the development of robust and efficient automated phishing detection approaches. Reference-based phishing detectors (RBPDs), which compare the logos on a target webpage to a known set of logos, have emerged as the state-of-the-art approach. However, a major limitation of existing RBPDs is that they rely on a manually constructed brand knowledge base, making it infeasible to scale to a large number of brands, which results in false negative errors due to the insufficient brand coverage of the knowledge base. To address this issue, we propose an automated knowledge collection pipeline, using which we collect a large-scale multimodal brand knowledge base, KnowPhish, containing 20k brands with rich information about each brand. KnowPhish can be used to boost the performance of existing RBPDs in a plug-and-play manner. A second limitation of existing RBPDs is that they solely rely on the image modality, ignoring useful textual information present in the webpage HTML. To utilize this textual information, we propose a Large Language Model (LLM)-based approach to extract brand information of webpages from text. Our resulting multimodal phishing detection approach, KnowPhish Detector (KPD), can detect phishing webpages with or without logos. We evaluate KnowPhish and KPD on a manually validated dataset, and a field study under Singapore's local context, showing substantial improvements in effectiveness and efficiency compared to state-of-the-art baselines.
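
The dependence of a reference-based detector on its brand knowledge base can be illustrated with a minimal sketch. It assumes the usual decision rule for such detectors: once a brand has been recognized on the page, the page is flagged when its domain is not among that brand's legitimate domains. The knowledge base contents, the function names, and the naive two-label domain extraction are hypothetical, and the brand recognition step itself (logo matching or LLM-based text extraction) is represented only by the `detected_brand` argument.

```python
from urllib.parse import urlparse

# Hypothetical toy knowledge base: brand name -> legitimate domains.
BRAND_KB = {
    "examplebank": {"examplebank.com"},
    "examplepay": {"examplepay.com", "examplepay.net"},
}

def registered_domain(url):
    """Naive approximation: keep the last two labels of the hostname."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def classify(url, detected_brand, kb=BRAND_KB):
    """Return 'benign', 'phishing', or 'unknown' for a page.

    detected_brand is the brand a detector believes the page presents
    (assumed to come from an upstream logo or text model).
    """
    if detected_brand is None or detected_brand not in kb:
        # Brand missing from the knowledge base: no verdict is possible,
        # which is where false negatives from low coverage come from.
        return "unknown"
    if registered_domain(url) in kb[detected_brand]:
        return "benign"
    return "phishing"
```

A page that presents a brand absent from the knowledge base yields `"unknown"`, which is the coverage gap the abstract attributes to manually constructed knowledge bases.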


Learning the Finer Things: Bayesian Structure Learning at the Instantiation Level

arXiv.org Artificial Intelligence

Successful machine learning methods require a trade-off between memorization and generalization. Too much memorization and the model cannot generalize to unobserved examples. Too much over-generalization and we risk under-fitting the data. While we commonly measure their performance through cross validation and accuracy metrics, how should these algorithms cope in domains that are extremely under-determined where accuracy is always unsatisfactory? We present a novel probabilistic graphical model structure learning approach that can learn, generalize and explain in these elusive domains by operating at the random variable instantiation level. Using Minimum Description Length (MDL) analysis, we propose a new decomposition of the learning problem over all training exemplars, fusing together minimal entropy inferences to construct a final knowledge base. By leveraging Bayesian Knowledge Bases (BKBs), a framework that operates at the instantiation level and inherently subsumes Bayesian Networks (BNs), we develop both a theoretical MDL score and associated structure learning algorithm that demonstrates significant improvements over learned BNs on 40 benchmark datasets. Further, our algorithm incorporates recent off-the-shelf DAG learning techniques enabling tractable results even on large problems. We then demonstrate the utility of our approach in a significantly under-determined domain by learning gene regulatory networks on breast cancer gene mutational data available from The Cancer Genome Atlas (TCGA).
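
As a rough illustration of the entropy quantity that MDL-style scores are built on, the sketch below computes the empirical conditional entropy of one discrete variable given a candidate parent set. This is the classic variable-level term used when scoring Bayesian networks, not the instantiation-level score proposed in the paper; the function name and the list-of-dicts data layout are assumptions made for the example.

```python
import math
from collections import Counter

def conditional_entropy(rows, target, parents):
    """Empirical H(target | parents) in bits.

    rows is a list of dicts mapping variable names to discrete values.
    """
    n = len(rows)
    # Counts of each (parent configuration, target value) pair.
    joint = Counter((tuple(r[p] for p in parents), r[target]) for r in rows)
    # Counts of each parent configuration alone.
    marginal = Counter(tuple(r[p] for p in parents) for r in rows)
    return -sum(
        (count / n) * math.log2(count / marginal[config])
        for (config, _), count in joint.items()
    )
```

Multiplying this value by the number of rows gives the number of bits needed to encode the target column once the parents are known, so a parent set that lowers the entropy shortens the data part of the description.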


Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret

arXiv.org Machine Learning

Gaussian processes (GPs) are a popular Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations $t$. Given a set of $A$ alternatives to choose from, the overall runtime $O(t^3A)$ quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or covariance of the GP. Combining a kernelized linear bandit algorithm (GP-UCB) with a randomized matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\tilde{O}(d_{eff})$ points, where $d_{eff}$ is the effective dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the problem, thus leading to an $O(TAd_{eff}^2)$ runtime and $O(A d_{eff})$ space complexity.
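
To make the cost argument concrete, here is a minimal pure-Python sketch of exact GP-UCB over a finite set of arms. It is the unsketched baseline, not BKB: every round it refits the exact posterior, solving a system whose size grows with the number of observations for each arm, and BKB avoids this by restricting the fit to a small set of inducing points. The kernel, its lengthscale, the noise level, the exploration weight `beta`, and all function names are assumptions chosen for the example.

```python
import math

def rbf(x, y, lengthscale=0.2):
    """Squared-exponential kernel on scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def posterior(xs, ys, x, noise=1e-2):
    """Exact GP posterior mean and variance at x given data (xs, ys)."""
    if not xs:
        return 0.0, rbf(x, x)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)  # clamp tiny negative round-off

def gp_ucb(f, arms, rounds, beta=2.0):
    """Pick, each round, the arm with the largest upper confidence bound."""
    xs, ys = [], []
    for _ in range(rounds):
        def ucb(x):
            mean, var = posterior(xs, ys, x)
            return mean + beta * math.sqrt(var)
        x = max(arms, key=ucb)
        xs.append(x)
        ys.append(f(x))
    return xs, ys
```

For example, `gp_ucb(lambda x: -(x - 0.7) ** 2, [i / 20 for i in range(21)], rounds=8)` returns the queried arms and their observed values. Each call to `posterior` solves a dense system over all past observations, which is the per-arm cubic cost that makes the exact method prohibitive as the number of iterations grows.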


Tuning a Bayesian Knowledge Base

AAAI Conferences

For a knowledge-based system that fails to provide the correct answer, it is important to be able to tune the system while minimizing overall change in the knowledge base. An answer can be incorrect for a variety of reasons, ranging from incorrect knowledge to vague information to incompleteness. Still, in all these situations, it is typically the case that most of the knowledge in the system is likely to be correct as specified by the expert(s) and/or knowledge engineer(s). In this paper, we propose a method to identify the possible changes by understanding the contribution of parameters to the outputs of concern. Our approach is based on Bayesian Knowledge Bases for modeling uncertainties. We start with single-parameter changes and then extend to multiple parameters. In order to identify the optimal solution that can minimize the change to the model as specified by the domain experts, we define and evaluate the sensitivity values of the results with respect to the parameters. We discuss the computational complexities of determining the solution and show that the problem of multiple-parameter changes can be transformed into Linear Programming problems and is thus efficiently solvable. Our work can also be applied towards validating the knowledge base such that the updated model can satisfy all test cases collected from the domain experts.


Bayesian Knowledge Fusion

AAAI Conferences

We address the problem of information fusion in uncertain environments. Imagine there are multiple experts building probabilistic models of the same situation and we wish to aggregate the information they provide. There are several problems we may run into by naively merging the information from each. For example, the experts may disagree on the probability of a certain event or they may disagree on the direction of causality between two events (e.g., one thinks A causes B while another thinks B causes A). They may even disagree on the entire structure of dependencies among a set of variables in a probabilistic network. In our proposed solution to this problem, we represent the probabilistic models as Bayesian Knowledge Bases (BKBs) and propose an algorithm called Bayesian knowledge fusion that allows the fusion of multiple BKBs into a single BKB that retains the information from all input sources. This allows for easy aggregation and de-aggregation of information from multiple expert sources and facilitates multi-expert decision making by providing a framework in which all opinions can be preserved and reasoned over.
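
The idea of aggregating conflicting expert rules without losing any of them can be illustrated with a toy sketch. This is not the Bayesian knowledge fusion algorithm from the paper: it only shows, under simplifying assumptions, how tagging each instantiation-level rule with its source lets contradictory rules coexist and be separated again later. The `Rule` type, the `("source", name)` tag, and both function names are hypothetical, and source reliabilities and inference are left out entirely.

```python
from collections import namedtuple

# One instantiation-level rule: if every (variable, value) pair in
# `conditions` holds, then `head` holds with probability `prob`.
Rule = namedtuple("Rule", ["conditions", "head", "prob"])

def fuse(models):
    """Merge per-expert rule lists, tagging each rule with its source."""
    fused = []
    for source, rules in models.items():
        for rule in rules:
            tagged = frozenset(rule.conditions) | {("source", source)}
            fused.append(Rule(tagged, rule.head, rule.prob))
    return fused

def rules_from(fused, source):
    """De-aggregate: recover one expert's original rules."""
    tag = ("source", source)
    return [
        Rule(r.conditions - {tag}, r.head, r.prob)
        for r in fused
        if tag in r.conditions
    ]

# Two experts who disagree on the direction of causality.
models = {
    "expert_1": [Rule(frozenset({("A", True)}), ("B", True), 0.9)],
    "expert_2": [Rule(frozenset({("B", True)}), ("A", True), 0.7)],
}
fused = fuse(models)
```

Because each fused rule is conditioned on a distinct source tag, the "A causes B" and "B causes A" rules never apply under the same tag, and `rules_from(fused, "expert_1")` returns the first expert's rules unchanged.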