Goto

Collaborating Authors

 Law


Improving Vietnamese Legal Document Retrieval using Synthetic Data

arXiv.org Artificial Intelligence

In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvement in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.


Five Canadian news media outlets sue OpenAI for copyright breach

Al Jazeera

Microsoft is OpenAI's major backer. In a statement, Torstar, Postmedia, The Globe and Mail, The Canadian Press, and CBC/Radio-Canada said OpenAI was scraping large swaths of content to develop its products without getting permission or compensating content owners. "Journalism is in the public interest. OpenAI using other companies' journalism for their own commercial gain is not. A New York federal judge dismissed a lawsuit on November 7 against OpenAI that claimed it misused articles from news outlets Raw Story and AlterNet.


Canadian news organizations sue OpenAI for ChatGPT copyright infringement

Engadget

The joint lawsuit accuses the company of "capitalizing and profiting" from the unauthorized use of their content for ChatGPT. The legal action was filed in the Ontario Superior Court of Justice. The plaintiffs include CBC/Radio-Canada, Postmedia, Metroland, the Toronto Star, the Globe and Mail and The Canadian Press. They're seeking punitive damages from OpenAI, payments for any profits the ChatGPT creator made from using their news articles and a ban on further use of their content. "OpenAI is capitalizing and profiting from the use of this content, without getting permission or compensating content owners."


Generative AI Literacy: Twelve Defining Competencies

arXiv.org Artificial Intelligence

This paper introduces a competency-based model for generative artificial intelligence (AI) literacy covering essential skills and knowledge areas necessary to interact with generative AI. The competencies range from foundational AI literacy to prompt engineering and programming skills, including ethical and legal considerations. These twelve competencies offer a framework for individuals, policymakers, government officials, and educators looking to navigate and take advantage of the potential of generative AI responsibly. Embedding these competencies into educational programs and professional training initiatives can equip individuals to become responsible and informed users and creators of generative AI. The competencies follow a logical progression and serve as a roadmap for individuals seeking to get familiar with generative AI and for researchers and policymakers to develop assessments, educational programs, guidelines, and regulations.


Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data

arXiv.org Artificial Intelligence

We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany these resources, with only few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for authentic clinical documents. Common instances of close proxies are medical journal publications, clinical therapy guidelines, drug labels, etc., more distant proxies include online encyclopedic medical articles or medical contents from social media channels. After PRISM-conformant screening of 359 hits from four bibliographic systems, 75 relevant documents were finally selected for this review and 59 distinct corpora were determined. We identified 24 real clinical corpora (from 40 publications) out of which only 5 are publicly distributable. 2 translations of real corpora and 3 synthetic ones complement the set of clinical corpora. 14 corpora were categorized as close domain proxies, 16 as distant ones. There is a clear divide between the large number of non-accessible authentic clinical German-language corpora and their publicly accessible substitutes: translated or synthetic, close or more distant proxies. So on first sight, the data bottleneck seems broken. Intuitively yet, differences in genre-specific writing style, wording and medical domain expertise in this typological space are also obvious. This raises the question how valid alternative corpus designs really are.


VLSBench: Unveiling Visual Leakage in Multimodal Safety

arXiv.org Artificial Intelligence

Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct multimodal visual leakless safety benchmark (VLSBench) preventing visual safety leakage from image to textual query with 2.4k image-text pairs. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. Please see our code and data at: http://hxhcreate.github.io/VLSBench


Handling irresolvable conflicts in the Semantic Web: an RDF-based conflict-tolerant version of the Deontic Traditional Scheme

arXiv.org Artificial Intelligence

This paper presents a new ontology that implements the well-known Deontic Traditional Scheme in RDFs and SPARQL, fit to handle irresolvable conflicts, i.e., situations in which two or more statements prescribe conflicting obligations, prohibitions, or permissions, with none of them being "stronger" than the other one(s). In our view, this paper marks a significant advancement in standard theoretical research in formal Deontic Logic. Most contemporary approaches in this field are confined to the propositional level, mainly focus on the notion of obligation, and lack implementations. The proposed framework is encoded in RDF, which is not only a first-order language but also the most widely used knowledge representation language, as it forms the foundation of the Semantic Web. Moreover, the proposed computational ontology formalizes all deontic modalities defined in the Deontic Traditional Scheme, without specifically focusing on obligations, and offers constructs to model and reason with various types of irresolvable conflicts, violations, and the interaction between deontic modalities and contextual constraints in a given state of affairs. To the best of our knowledge, no existing approach in the literature addresses all these aspects within a unified integrated framework. All examples presented and discussed in this paper, together with Java code and clear instructions to re-execute them locally, are available at https://github.com/liviorobaldo/conflict-tolerantDeonticTraditionalScheme


Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

arXiv.org Artificial Intelligence

Under stringent privacy constraints, whether federated recommendation systems can achieve group fairness remains an inadequately explored question. Taking gender fairness as a representative issue, we identify three phenomena in federated recommendation systems: performance difference, data imbalance, and preference disparity. We discover that the state-of-the-art methods only focus on the first phenomenon. Consequently, their imposition of inappropriate fairness constraints detrimentally affects the model training. Moreover, due to insufficient sensitive attribute protection of existing works, we can infer the gender of all users with 99.90% accuracy even with the addition of maximal noise. In this work, we propose Privacy-Preserving Orthogonal Aggregation (PPOA), which employs the secure aggregation scheme and quantization technique, to prevent the suppression of minority groups by the majority and preserve the distinct preferences for better group fairness. PPOA can assist different groups in obtaining their respective model aggregation results through a designed orthogonal mapping while keeping their attributes private. Experimental results on three real-world datasets demonstrate that PPOA enhances recommendation effectiveness for both females and males by up to 8.25% and 6.36%, respectively, with a maximum overall improvement of 7.30%, and achieves optimal fairness in most cases. Extensive ablation experiments and visualizations indicate that PPOA successfully maintains preferences for different gender groups.


Real-Time Anomaly Detection in Video Streams

arXiv.org Artificial Intelligence

This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural networkmodels have been tested, and three of them have been selected. You Only Looks Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neuronal Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.


Use robots instead of hiring low-paid migrants, says shadow home secretary

The Guardian

Businesses should be using more robots instead of hiring low-paid migrants, the shadow home secretary has said. The Conservative MP Chris Philp says other countries "use a lot more automation" for tasks such as picking fruit and vegetables "rather than simply importing a lot of low-wage migrant labour". Speaking on BBC Breakfast, he called for more investment in technology to reduce the UK's net migration figures. Philp said: "To give an example, in Australia and New Zealand, they are rolling out robotic and automated fruit- and vegetable-picking equipment, in South Korea they use nine times the number of robots in manufacturing processes compared to us, in America they use a lot more modular construction which is much faster and much more efficient. "There's a lot of things British industry can do to grow without needing to import large numbers of low-wage migrants." At an impromptu press conference on Wednesday, Kemi Badenoch, the Conservative leader, said her party had got it wrong on immigration. She promised a review of "every policy, treaty and part of our legal framework" including the role of the European convention on human rights (ECHR) and the Human Rights Act. Get the day's headlines and highlights emailed direct to you every morning She said her party still believed in a "deterrent" to irregular migration but did not commit to restoring the Rwanda scheme scrapped by Labour, even though Philp called for it to be reinstated two weeks ago. He said on Thursday that Labour had "cancelled the Rwanda scheme before it even started". Philp was asked about reports that under the Conservatives, ministers had been examining using a giant wave machine to deter Channel crossings. He told the BBC: "I don't recall ever having seriously looked at that idea.