Data Mining
Hundreds of Chrome extensions create a web-scraping botnet
Browser extensions can be just as dangerous as regular apps, and their integration with the tool everyone's constantly using can make them seem deceptively innocuous. Case in point: a collection of more than 200 extensions for Chrome and other major browsers is being used to "scrape" website content. This essentially turns browser users into a free data center, with capacity sold off for profit. The Secure Annex report (spotted by Ars Technica) is an interesting one, documenting the MellowTel system. Here's how it works: Step one, a developer of a legitimate extension is offered a tool that integrates a software library into the extension.
1,000-year-old medieval sword emerges from Dutch river after chance discovery: 'Barely corroded'
A remarkable medieval sword with rare symbols was recently put on display in a Dutch museum, over a year after construction workers unexpectedly found it. The discovery of the sword was announced by the Netherlands' National Museum of Antiquities (RMO) in Leiden on June 24. The artifact, named the Linschoten Sword, was found in March 2024 during "maintenance dredging activities," the museum said in a press release. Construction workers were struck by a "long piece of iron" while cleaning a small river known as the Korte Linschoten, the statement noted.
4 ways your organization can adapt and thrive in the age of AI
The evidence suggests almost all business leaders are piloting or investing in AI initiatives, and biopharmaceutical giant Boehringer Ingelheim is committed to investing in emerging technology that could have life-altering consequences. The company's 55,000 employees focus on developing innovative therapies that can improve lives in areas of high unmet medical need, with AI and data playing an increasingly crucial role in their work. Global CIO Markus Schümmelfeder told ZDNET that emerging technology can open all kinds of possibilities when its adoption is accompanied by organizational change: "AI together with big data availability and access to the right capability is the real game-changer." So, how can business leaders drive successful organizational change in an age of AI? Schümmelfeder and his colleague Oliver Sluke, head of IT research, development, and medicine at Boehringer, told ZDNET their four best-practice tips for AI-enabled business transformation. Most digital leaders agree: before you start tinkering with technology, you must ensure your data is managed, sorted, and accessible.
Community Detection on Evolving Graphs
Clustering is a fundamental step in many information-retrieval and data-mining applications. Detecting clusters in graphs is also a key tool for finding the community structure in social and behavioral networks. In many of these applications, the input graph evolves over time in a continual and decentralized manner, and, to maintain a good clustering, the clustering algorithm needs to repeatedly probe the graph. Furthermore, there are often limitations on the frequency of such probes, either imposed explicitly by the online platform (e.g., in the case of crawling proprietary social networks like Twitter) or implicitly because of resource limitations (e.g., in the case of crawling the web). In this paper, we study a model of clustering on evolving graphs that captures this aspect of the problem.
FairJob: A Real-World Dataset for Fairness in Online Systems
We introduce a fairness-aware dataset for job recommendation in advertising, designed to foster research in algorithmic fairness within real-world scenarios. It was collected and prepared to comply with privacy standards and business confidentiality. An additional challenge is the lack of access to protected user attributes such as gender, for which we propose a solution to obtain a proxy estimate. Despite being anonymized and including a proxy for a sensitive attribute, our dataset preserves predictive power and maintains a realistic and challenging benchmark. This dataset addresses a significant gap in the availability of fairness-focused resources for high-impact domains like advertising, where what is at stake is access to valuable employment opportunities and where balancing fairness and utility is a common industrial challenge. We also explore various stages in the advertising process where unfairness can occur and introduce a method to compute a fair utility metric for job recommendations in online systems from a biased dataset. Experimental evaluations of bias mitigation techniques on the released dataset demonstrate potential improvements in fairness and the associated trade-offs with utility.
Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions
Recent work on deriving O(\log T) anytime regret bounds for stochastic dueling bandit problems has considered mostly Condorcet winners, which do not always exist, and more recently, winners defined by the Copeland set, which do always exist. In this work, we consider a broad notion of winners defined by tournament solutions in social choice theory, which include the Copeland set as a special case but also include several other notions of winners such as the top cycle, uncovered set, and Banks set, and which, like the Copeland set, always exist. We develop a family of UCB-style dueling bandit algorithms for such general tournament solutions, and show O(\log T) anytime regret bounds for them. Experiments confirm the ability of our algorithms to achieve low regret relative to the target winning set of interest.
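The contrast between Condorcet winners and the Copeland set can be illustrated with a small sketch. It assumes the pairwise preference matrix is known exactly, which is not the bandit setting of the abstract (there it must be estimated from noisy duels), and the function names `condorcet_winner` and `copeland_set` are hypothetical, chosen only for this illustration.

```python
from typing import List, Optional


def condorcet_winner(pref: List[List[float]]) -> Optional[int]:
    """Return the arm that beats every other arm, or None if no such arm exists.

    pref[i][j] is the (assumed known) probability that arm i beats arm j.
    """
    n = len(pref)
    for i in range(n):
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def copeland_set(pref: List[List[float]]) -> List[int]:
    """Return the arms that beat the largest number of other arms."""
    n = len(pref)
    scores = [
        sum(1 for j in range(n) if j != i and pref[i][j] > 0.5)
        for i in range(n)
    ]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]


# A cyclic tournament: arm 0 beats 1, arm 1 beats 2, arm 2 beats 0.
cyclic = [
    [0.5, 0.7, 0.3],
    [0.3, 0.5, 0.7],
    [0.7, 0.3, 0.5],
]

# A tournament in which arm 0 beats both other arms.
dominated = [
    [0.5, 0.6, 0.8],
    [0.4, 0.5, 0.7],
    [0.2, 0.3, 0.5],
]
```

In the cyclic tournament no arm beats all others, so `condorcet_winner(cyclic)` returns `None`, while `copeland_set(cyclic)` is still non-empty (every arm has one win). In the second tournament both notions agree on arm 0.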
FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction
Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data.
Continuous Temporal Domain Generalization
Temporal Domain Generalization (TDG) addresses the challenge of training predictive models under temporally varying data distributions. Traditional TDG approaches typically focus on domain data collected at fixed, discrete time intervals, which limits their capability to capture the inherent dynamics within continuously evolving and irregularly observed temporal domains. To overcome this, this work formalizes the concept of Continuous Temporal Domain Generalization (CTDG), where domain data are derived from continuous times and are collected at arbitrary times. CTDG tackles critical challenges including: 1) Characterizing the continuous dynamics of both data and models, 2) Learning complex high-dimensional nonlinear dynamics, and 3) Optimizing and controlling the generalization across continuous temporal domains. To address them, we propose a Koopman operator-driven continuous temporal domain generalization (Koodos) framework. We formulate the problem within a continuous dynamic system and leverage the Koopman theory to learn the underlying dynamics; the framework is further enhanced with a comprehensive optimization strategy equipped with analysis and control driven by prior knowledge of the dynamics patterns. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
Optimal Cluster Recovery in the Labeled Stochastic Block Model
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number K of clusters of sizes linearly growing with the global population of items n. Every pair of items is labeled independently at random, and label \ell appears with probability p(i,j,\ell) between two items in clusters indexed by i and j, respectively. The objective is to reconstruct the clusters from the observation of these random labels. Clustering under the SBM and its extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters.
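The generative model described above can be sketched as a small sampler. This covers only the observation process, not the recovery algorithm studied in the paper. The function name `sample_lsbm`, the nested-list layout of `p`, and the example parameters are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_lsbm(
    clusters: Sequence[int],
    p: List[List[List[float]]],
    labels: Sequence[int],
    seed: int = 0,
) -> Dict[Tuple[int, int], int]:
    """Draw one label per unordered pair of items, independently at random.

    clusters[u] is the cluster index of item u.
    p[i][j][k] is the probability that a pair in clusters (i, j) receives
    labels[k]; each p[i][j] is assumed to sum to 1.
    """
    rng = random.Random(seed)
    n = len(clusters)
    observed = {}
    for u in range(n):
        for v in range(u + 1, n):
            weights = p[clusters[u]][clusters[v]]
            observed[(u, v)] = rng.choices(labels, weights=weights, k=1)[0]
    return observed


# Hypothetical example: K = 2 clusters, labels {0, 1}.
# Label 1 is more likely within a cluster than across clusters.
example_clusters = [0, 0, 0, 1, 1, 1]
example_p = [
    [[0.2, 0.8], [0.9, 0.1]],
    [[0.9, 0.1], [0.2, 0.8]],
]
example_obs = sample_lsbm(example_clusters, example_p, labels=[0, 1], seed=42)
```

A recovery algorithm would see only `example_obs` (and possibly `example_p`) and would try to reconstruct `example_clusters` up to a relabeling of the cluster indices.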