 Law


Optimizing Blockchain Analysis: Tackling Temporality and Scalability with an Incremental Approach with Metropolis-Hastings Random Walks

arXiv.org Machine Learning

Blockchain technology, with implications in the financial domain, offers data in the form of large-scale transaction networks. Analyzing transaction networks facilitates fraud detection and market analysis, and supports government regulation. Despite many graph representation learning methods for transaction network analysis, we pinpoint two salient limitations that merit more investigation. First, existing methods predominantly focus on snapshots of transaction networks, sidelining the evolving nature of blockchain transaction networks. Second, existing methodologies may not sufficiently emphasize efficient, incremental learning capabilities, which are essential for addressing the scalability challenges in ever-expanding large-scale transaction networks. To address these challenges, we employed an incremental approach for random walk-based node representation learning in transaction networks. Further, we proposed a Metropolis-Hastings-based random walk mechanism for improved efficiency. The empirical evaluation conducted on blockchain transaction datasets reveals comparable performance in node classification tasks while reducing computational overhead. Potential applications include transaction network monitoring, the efficient classification of blockchain addresses for fraud detection, and the identification of specialized address types within the network.
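To make the Metropolis-Hastings random walk idea concrete, here is a minimal sketch of the textbook variant on an undirected graph, where a neighbor is proposed uniformly and accepted with probability min(1, deg(u)/deg(v)), which targets a uniform distribution over nodes. The abstract does not specify the paper's target distribution or acceptance rule, so the function name `mh_random_walk`, the adjacency-dict input, and the uniform target are all assumptions made for illustration.

```python
import random


def mh_random_walk(adj, start, length, rng):
    """Textbook Metropolis-Hastings walk with a uniform target over nodes.

    adj: dict mapping node -> list of neighbors (assumed undirected, so
         every proposed neighbor has at least one neighbor itself).
    This is an illustrative sketch, not the paper's exact mechanism.
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = adj[current]
        if not neighbors:
            break  # isolated node: the walk cannot continue
        candidate = rng.choice(neighbors)  # proposal q(u -> v) = 1 / deg(u)
        # Acceptance probability for a uniform target: min(1, deg(u) / deg(v)).
        accept = min(1.0, len(neighbors) / len(adj[candidate]))
        if rng.random() < accept:
            current = candidate
        walk.append(current)  # a rejected proposal repeats the current node
    return walk


# Hypothetical toy graph: a hub (0) connected to three leaf addresses.
toy_adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
toy_walk = mh_random_walk(toy_adj, start=0, length=20, rng=random.Random(0))
```

Compared with a plain random walk, the rejection step keeps the walk from being drawn disproportionately toward high-degree hubs, which are common in transaction networks. The resulting walks could then feed a skip-gram style embedding step, as in other random walk-based representation learning methods.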


Reviews: Equality of Opportunity in Supervised Learning

Neural Information Processing Systems

It treats an incredibly important and foundational problem (fairness), proposes a creative but simple new definition, gives techniques for achieving the definition, proves theorems with regards to optimality, and even provides empirical results. As learning algorithms are used more and more broadly in situations where their decisions affect people's lives, fairness of these algorithms becomes a critical technical, social, and legal problem. While there is certainly no single "right" definition and paradigm when it comes to fairness, this definition seems to clearly be *a* right definition. It's so clean and simple that in retrospect, it seems obvious--a sign of an excellent idea. One of the many things I love about this definition and this work is how it shifts the structure of power and incentives--once a learner is constrained to be fair, under either of the definitions proposed, she is immediately incentivised to gather more data or make other efforts to do a better job of understanding protected populations.


You Can't Get There From Here: Redefining Information Science to address our sociotechnical futures

arXiv.org Artificial Intelligence

Current definitions of Information Science are inadequate both for comprehensively describing the nature of its field of study and for addressing the problems arising from intelligent technologies. The ubiquitous rise of artificial intelligence applications and their impact on society demand that the field of Information Science acknowledge the socio-technical nature of these technologies. Definitions of Information Science over the last six decades have inadequately addressed the environmental, human, and social aspects of these technologies. This perspective piece advocates for an expanded definition of Information Science that fully includes the socio-technical impacts information has on the conduct of research in this field. Proposing such a definition should both stimulate conversation and widen the interdisciplinary lens necessary to address how intelligent technologies may be incorporated into society and our lives more fairly.


Can Generative AI be Egalitarian?

arXiv.org Artificial Intelligence

The recent explosion of "foundation" generative AI models has been built upon the extensive extraction of value from online sources, often without corresponding reciprocation. This pattern mirrors and intensifies the extractive practices of surveillance capitalism, while the potential for enormous profit has challenged technology organizations' commitments to responsible AI practices, raising significant ethical and societal concerns. However, a promising alternative is emerging: the development of models that rely on content willingly and collaboratively provided by users. This article explores this "egalitarian" approach to generative AI, taking inspiration from the successful model of Wikipedia. We explore the potential implications of this approach for the design, development, and constraints of future foundation models. We argue that such an approach is not only ethically sound but may also lead to models that are more responsive to user needs, more diverse in their training data, and ultimately more aligned with societal values. Furthermore, we explore potential challenges and limitations of this approach, including issues of scalability, quality control, and potential biases inherent in volunteer-contributed content.


Human services organizations and the responsible integration of AI: Considering ethics and contextualizing risk(s)

arXiv.org Artificial Intelligence

This paper examines the responsible integration of artificial intelligence (AI) in human services organizations (HSOs), proposing a nuanced framework for evaluating AI applications across multiple dimensions of risk. The authors argue that ethical concerns about AI deployment -- including professional judgment displacement, environmental impact, model bias, and data laborer exploitation -- vary significantly based on implementation context and specific use cases. They challenge the binary view of AI adoption, demonstrating how different applications present varying levels of risk that can often be effectively managed through careful implementation strategies. The paper highlights promising solutions, such as local large language models, that can facilitate responsible AI integration while addressing common ethical concerns. The authors propose a dimensional risk assessment approach that considers factors like data sensitivity, professional oversight requirements, and potential impact on client wellbeing. They conclude by outlining a path forward that emphasizes empirical evaluation, starting with lower-risk applications and building evidence-based understanding through careful experimentation. This approach enables organizations to maintain high ethical standards while thoughtfully exploring how AI might enhance their capacity to serve clients and communities effectively.


A Comprehensive Mathematical and System-Level Analysis of Autonomous Vehicle Timelines

arXiv.org Artificial Intelligence

Fully autonomous vehicles (AVs) continue to spark immense global interest, yet predictions on when they will operate safely and broadly remain heavily debated. This paper synthesizes two distinct research traditions, computational complexity and algorithmic constraints on the one hand and reliability growth modeling and real-world testing on the other, to form an integrated, quantitative timeline for future AV deployment. We propose a mathematical framework that unifies NP-hard multi-agent path planning analyses, high-performance computing (HPC) projections, and extensive Crow-AMSAA reliability growth calculations, factoring in operational design domain (ODD) variations, severity, and partial vs. full domain restrictions. Through category-specific case studies (e.g., consumer automotive, robo-taxis, highway trucking, industrial and defense applications), we show how combining HPC limitations, safety demonstration requirements, production/regulatory hurdles, and parallel/serial test strategies can push out the horizon for universal Level 5 deployment by up to several decades. Conversely, more constrained ODDs, like fenced industrial sites or specialized defense operations, may see autonomy reach commercial viability in the near-to-medium term. Our findings illustrate that while targeted domains can achieve automated service sooner, widespread driverless vehicles handling every environment remain far from realized. This paper thus offers a unique and rigorous perspective on why AV timelines extend well beyond short-term optimism, underscoring how each dimension of complexity and reliability imposes its own multi-year delays. By quantifying these constraints and exploring potential accelerators (e.g., advanced AI hardware, infrastructure upgrades), we provide a structured baseline for researchers, policymakers, and industry stakeholders to more accurately map their expectations and investments in AV technology.
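For readers unfamiliar with Crow-AMSAA reliability growth, the standard model treats failures as a power-law process: expected cumulative failures after exposure t are λ·t^β, and the instantaneous failure intensity is λ·β·t^(β−1), with β < 1 indicating growth. The sketch below shows how a required test exposure follows from a target failure intensity. The function names and the example values of `lam`, `beta`, and the target are hypothetical and are not taken from the paper.

```python
def expected_failures(t, lam, beta):
    """Expected cumulative failures after exposure t (e.g. test miles)."""
    return lam * t ** beta


def failure_intensity(t, lam, beta):
    """Instantaneous failure rate at exposure t under the power-law model."""
    return lam * beta * t ** (beta - 1.0)


def exposure_for_target_intensity(target, lam, beta):
    """Exposure at which the failure intensity has fallen to `target`.

    Solves lam * beta * t**(beta - 1) = target for t. Only meaningful when
    0 < beta < 1, i.e. when the intensity actually decreases with exposure.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("reliability growth requires 0 < beta < 1")
    return (target / (lam * beta)) ** (1.0 / (beta - 1.0))


# Hypothetical parameters, chosen only to illustrate the calculation.
required = exposure_for_target_intensity(target=1e-6, lam=0.5, beta=0.6)
```

Because the exponent 1/(β−1) is large in magnitude when β is close to 1, small changes in the growth rate translate into very large changes in required exposure, which is one way such calculations can stretch deployment timelines.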


Technical Report for the Forgotten-by-Design Project: Targeted Obfuscation for Machine Learning

arXiv.org Artificial Intelligence

The right to privacy, enshrined in various human rights declarations, faces new challenges in the age of artificial intelligence (AI). This paper explores the concept of the Right to be Forgotten (RTBF) within AI systems, contrasting it with traditional data erasure methods. We introduce Forgotten by Design, a proactive approach to privacy preservation that integrates instance-specific obfuscation techniques during the AI model training process. Unlike machine unlearning, which modifies models post-training, our method prevents sensitive data from being embedded in the first place. Using the LIRA membership inference attack, we identify vulnerable data points and propose defenses that combine additive gradient noise and weighting schemes. Our experiments on the CIFAR-10 dataset demonstrate that our techniques reduce privacy risks by at least an order of magnitude while maintaining model accuracy (at 95% significance). Additionally, we present visualization methods for the privacy-utility trade-off, providing a clear framework for balancing privacy risk and model accuracy. This work contributes to the development of privacy-preserving AI systems that align with human cognitive processes of motivated forgetting, offering a robust framework for safeguarding sensitive information and ensuring compliance with privacy regulations.


Data Stewardship Decoded: Mapping Its Diverse Manifestations and Emerging Relevance at a time of AI

arXiv.org Artificial Intelligence

Data stewardship has become a critical component of modern data governance, especially with the growing use of artificial intelligence (AI). Despite its increasing importance, the concept of data stewardship remains ambiguous and varies in its application. This paper explores four distinct manifestations of data stewardship to clarify its emerging position in the data governance landscape. These manifestations include a) data stewardship as a set of competencies and skills, b) a function or role within organizations, c) an intermediary organization facilitating collaborations, and d) a set of guiding principles. The paper subsequently outlines the core competencies required for effective data stewardship, explains the distinction between data stewards and Chief Data Officers (CDOs), and details the intermediary role of stewards in bridging gaps between data holders and external stakeholders. It also explores key principles aligned with the FAIR framework (Findable, Accessible, Interoperable, Reusable) and introduces the emerging principle of AI readiness to ensure data meets the ethical and technical requirements of AI systems. The paper emphasizes the importance of data stewardship in enhancing data collaboration, fostering public value, and managing data reuse responsibly, particularly in the era of AI. It concludes by identifying challenges and opportunities for advancing data stewardship, including the need for standardized definitions, capacity building efforts, and the creation of a professional association for data stewardship.


The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities

arXiv.org Machine Learning

While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inferences over modalities unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up being aligned. This theoretical paper proves that this hope is well founded, under certain assumptions. Starting with the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same inferences as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications. Much of the appeal of contrastive learning is that it gives a "plug-n-play" approach for swapping one modality for another. Because representations from different modalities are trained to be aligned when representing the same object, the hope is that (say) a language representation and image representation of the same scene can be used as substitutes. This property is practically appealing for at least two reasons. First, it allows us to make use of pre-trained models. If you have a model that wants to make use of (say) language input and you have access to a pre-trained image-language contrastive model, you might simply train your model on the pre-trained image representations and hope that it will continue to work when you swap in the language representations.
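The step of "integrating out intermediate modalities" can be written out as follows. This is a sketch under an assumed conditional-independence structure, in which modalities $a$ and $c$ are never observed together and are independent given a shared intermediate modality $b$; the symbols $a$, $b$, $c$ are introduced here for illustration, and the paper's precise assumptions may differ.

```latex
\begin{align}
p(c \mid a) &= \int p(c \mid b)\, p(b \mid a)\, \mathrm{d}b
  && \text{(assuming } a \perp c \mid b\text{)} \\
\frac{p(c \mid a)}{p(c)} &= \mathbb{E}_{b \sim p(b \mid a)}
  \left[ \frac{p(c \mid b)}{p(c)} \right]
\end{align}
```

The second line follows from the first by dividing both sides by $p(c)$. The right-hand side involves only the observed pairs $(a, b)$ and $(b, c)$, while the left-hand side is the likelihood ratio for the unseen pair $(a, c)$, which the abstract says can be recovered by directly comparing the representations of the unpaired modalities.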


Differentiable sorting for censored time-to-event data.

Neural Information Processing Systems

Survival analysis is a crucial semi-supervised task in machine learning with significant real-world applications, especially in healthcare. The most common approach to survival analysis, Cox's partial likelihood, can be interpreted as a ranking model optimized on a lower bound of the concordance index. We follow these connections further, with listwise ranking losses that allow for a relaxation of the pairwise independence assumption. Given the inherent transitivity of ranking, we explore differentiable sorting networks as a means to introduce a stronger transitive inductive bias during optimization. We propose a novel method, Diffsurv, to overcome this limitation by extending differentiable sorting methods to handle censored tasks.
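Since the abstract frames survival analysis as a ranking problem measured by the concordance index, a minimal sketch of that metric under right-censoring may help. It follows the common Harrell-style definition: a pair is comparable only when the subject with the earlier time had an observed event, and it is concordant when that subject also has the higher predicted risk. The function name and the toy data are illustrative assumptions, not the Diffsurv implementation.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored data.

    times: observed times (event or censoring time) per subject.
    events: 1/True if the event was observed, 0/False if censored.
    risk_scores: model outputs where a higher score means earlier expected event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the second subject is censored at time 4.
toy_times = [2, 4, 6, 8]
toy_events = [1, 0, 1, 1]
toy_c = concordance_index(toy_times, toy_events, [4.0, 3.0, 2.0, 1.0])
```

The censored subject still contributes as the later member of a pair, which is how censoring discards only part of the ranking information. The pairwise count is also not differentiable in the risk scores, which is why differentiable surrogates are needed for training.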