Schulz, Claudia
The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines
Forster, Martina, Schulz, Claudia, Nokku, Prudhvi, Mirsafian, Melicaalsadat, Kasundra, Jaykumar, Skylaki, Stavroula
Multi-Label Classification (MLC) is a common task in the legal domain, where more than one label may be assigned to a legal document. A wide range of methods can be applied, ranging from traditional ML approaches to the latest Transformer-based architectures. In this work, we perform an evaluation of different MLC methods using two public legal datasets, POSTURE50K and EURLEX57K. By varying the amount of training data and the number of labels, we explore the comparative advantage offered by different approaches in relation to the dataset properties. Our findings highlight DistilRoBERTa and LegalBERT as performing consistently well in legal MLC with reasonable computational demands. T5 also demonstrates comparable performance while offering advantages as a generative model in the presence of changing label sets. Finally, we show that the CrossEncoder exhibits potential for notable macro-F1 score improvements, albeit with increased computational costs.
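A minimal sketch of the kind of Transformer-based MLC setup evaluated in this work, assuming the Hugging Face transformers and scikit-learn libraries; the model name, label space, decision threshold, and example documents are illustrative placeholders, and fine-tuning is omitted for brevity.

import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilroberta-base"   # LegalBERT or another encoder could be swapped in
NUM_LABELS = 5                      # toy label space; POSTURE50K and EURLEX57K are far larger

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid per label with BCE loss
)

def predict_labels(texts, threshold=0.5):
    """Return a binary label matrix: one row per document, one column per label."""
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits).numpy()
    return (probs >= threshold).astype(int)

# Macro-F1 averages per-label F1 scores, so rare labels weigh as much as frequent ones.
y_true = np.array([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]])
y_pred = predict_labels(["first legal document ...", "second legal document ..."])
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))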
A Framework for Monitoring and Retraining Language Models in Real-World Applications
Kasundra, Jaykumar, Schulz, Claudia, Mirsafian, Melicaalsadat, Skylaki, Stavroula
The typical model development lifecycle consists of four phases: 1) problem scoping, 2) data definition and collection, 3) model training and iterative improvement through error analysis, and 4) model deployment in production and implementation of continuous monitoring and retraining [1]. While the first three phases are typically performed in an offline setting, model deployment is the critical step where the ML model becomes available in a production environment, i.e. a live application, in which it must process live data and ideally sustain its performance over time to keep delivering value. Model monitoring refers to the process of evaluating the quality of the production data and the performance of the model according to relevant metrics over time. When either data quality or model performance does not meet predefined criteria, a monitoring warning can be triggered to alert the model owners. Defining an effective model monitoring and retraining strategy is key to successful ML model deployment, since it can safeguard model quality over prolonged periods of time.
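A minimal sketch of the monitoring logic described above: a performance metric is tracked over recent production predictions and a warning is triggered when it falls below a predefined criterion. The window size, threshold, and simulated data stream are hypothetical examples, not the framework's actual configuration.

import random
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class PerformanceMonitor:
    threshold: float = 0.80        # minimum acceptable windowed metric
    window_size: int = 100         # number of recent predictions per window
    scores: list = field(default_factory=list)

    def record(self, correct: bool) -> None:
        """Log whether a single production prediction was correct."""
        self.scores.append(1.0 if correct else 0.0)

    def check(self) -> bool:
        """Return True (trigger a warning) when the latest window falls below the threshold."""
        if len(self.scores) < self.window_size:
            return False           # not enough live data yet
        return mean(self.scores[-self.window_size:]) < self.threshold

# Toy usage: simulate a stream whose quality degrades over time.
monitor = PerformanceMonitor()
for step in range(300):
    correct = random.random() < (0.95 if step < 150 else 0.60)  # simulated drift
    monitor.record(correct)
    if monitor.check():
        print(f"step {step}: monitoring warning - performance below threshold, consider retraining")
        break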
Biomedical Concept Relatedness -- A large EHR-based benchmark
Schulz, Claudia, Levy-Kramer, Josh, Van Assel, Camille, Kepes, Miklos, Hammerla, Nils
A promising application of AI to healthcare is the retrieval of information from electronic health records (EHRs), e.g. to aid clinicians in finding relevant information for a consultation or to recruit suitable patients for a study. This requires search capabilities far beyond simple string matching, including the retrieval of concepts (diagnoses, symptoms, medications, etc.) related to the one in question. The suitability of AI methods for such applications is tested by predicting the relatedness of concepts with known relatedness scores. However, all existing biomedical concept relatedness datasets are notoriously small and consist of hand-picked concept pairs. We open-source a novel concept relatedness benchmark overcoming these issues: it is six times larger than existing datasets and concept pairs are chosen based on co-occurrence in EHRs, ensuring their relevance for the application of interest. We present an in-depth analysis of our new dataset and compare it to existing ones, highlighting that it is not only larger but also complements existing datasets in terms of the types of concepts included. Initial experiments with state-of-the-art embedding methods show that our dataset is a challenging new benchmark for testing concept relatedness models.
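A sketch of how an embedding method is typically scored on a concept-relatedness benchmark such as this one: cosine similarity between concept vectors is compared to the gold relatedness scores via Spearman correlation. The vectors, concept pairs, and scores below are toy values, not taken from the dataset.

import numpy as np
from scipy.stats import spearmanr

embeddings = {  # would come from a trained biomedical embedding model
    "myocardial infarction": np.array([0.9, 0.1, 0.3]),
    "chest pain":            np.array([0.8, 0.2, 0.4]),
    "metformin":             np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

benchmark = [  # (concept A, concept B, gold relatedness score)
    ("myocardial infarction", "chest pain", 0.9),
    ("myocardial infarction", "metformin", 0.2),
]

predicted = [cosine(embeddings[a], embeddings[b]) for a, b, _ in benchmark]
gold = [score for _, _, score in benchmark]
print("Spearman correlation:", spearmanr(predicted, gold).correlation)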
Analysis of Automatic Annotation Suggestions for Hard Discourse-Level Tasks in Expert Domains
Schulz, Claudia, Meyer, Christian M., Kiesewetter, Jan, Sailer, Michael, Bauer, Elisabeth, Fischer, Martin R., Fischer, Frank, Gurevych, Iryna
Many complex discourse-level tasks can aid domain experts in their work but require costly expert annotations for data creation. To speed up and ease annotations, we investigate the viability of automatically generated annotation suggestions for such tasks. As an example, we choose a task that is particularly hard for both humans and machines: the segmentation and classification of epistemic activities in diagnostic reasoning texts. We create and publish a new dataset covering two domains and carefully analyse the suggested annotations. We find that suggestions have positive effects on annotation speed and performance, while not introducing noteworthy biases. Envisioning suggestion models that improve with newly annotated texts, we contrast methods for continuous model adjustment and suggest the most effective setup for suggestions in future expert tasks.
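An illustrative contrast of two continuous-adjustment strategies as new annotated batches arrive: retraining from scratch on all data seen so far versus incrementally updating the existing model. scikit-learn's SGDClassifier and the synthetic batches are stand-ins for the paper's suggestion models and expert annotations.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])

def new_batch(n=50):
    """Hypothetical batch of freshly annotated examples (features, labels)."""
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

# Strategy A: retrain from scratch on all annotations collected so far.
all_X, all_y = [], []
scratch_model = SGDClassifier(random_state=0)

# Strategy B: incrementally update a single model with each new batch.
incremental_model = SGDClassifier(random_state=0)

for _ in range(5):
    X, y = new_batch()
    all_X.append(X)
    all_y.append(y)
    scratch_model.fit(np.vstack(all_X), np.concatenate(all_y))
    incremental_model.partial_fit(X, y, classes=classes)

X_test, y_test = new_batch(200)
print("from scratch:", scratch_model.score(X_test, y_test))
print("incremental: ", incremental_model.score(X_test, y_test))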
Challenges in the Automatic Analysis of Students' Diagnostic Reasoning
Schulz, Claudia, Meyer, Christian M., Sailer, Michael, Kiesewetter, Jan, Bauer, Elisabeth, Fischer, Frank, Fischer, Martin R., Gurevych, Iryna
Diagnostic reasoning is a key component of many professions. To improve students' diagnostic reasoning skills, educational psychologists analyse and give feedback on epistemic activities used by these students while diagnosing, in particular, hypothesis generation, evidence generation, evidence evaluation, and drawing conclusions. However, this manual analysis is highly time-consuming. We aim to enable the large-scale adoption of diagnostic reasoning analysis and feedback by automating the epistemic activity identification. We create the first corpus for this task, comprising diagnostic reasoning self-explanations of students from two domains annotated with epistemic activities. Based on insights from the corpus creation and the task's characteristics, we discuss three challenges for the automatic identification of epistemic activities using AI methods: the correct identification of epistemic activity spans, the reliable distinction of similar epistemic activities, and the detection of overlapping epistemic activities. We propose a separate performance metric for each challenge and thus provide an evaluation framework for future research. Indeed, our evaluation of various state-of-the-art recurrent neural network architectures reveals that current techniques fail to address some of these challenges.
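A generic illustration of span-level evaluation for epistemic activity identification: exact-match F1 over (start, end, label) spans. The paper defines separate, challenge-specific metrics; this sketch only shows the basic idea and uses invented spans and labels.

def span_f1(gold_spans, pred_spans):
    """gold_spans / pred_spans: sets of (start, end, label) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 12, "hypothesis_generation"), (13, 40, "evidence_evaluation")}
pred = {(0, 12, "hypothesis_generation"), (13, 40, "evidence_generation")}  # similar-activity confusion
print("exact span F1:", span_f1(gold, pred))  # 0.5: one span found, one confused with a similar activity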
Answering the "why" in Answer Set Programming - A Survey of Explanation Approaches
Fandinno, Jorge, Schulz, Claudia
Artificial Intelligence (AI) approaches to problem-solving and decision-making are becoming more and more complex, leading to a decrease in the understandability of solutions. The European Union's new General Data Protection Regulation tries to tackle this problem by stipulating a "right to explanation" for decisions made by AI systems. One of the AI paradigms that may be affected by this new regulation is Answer Set Programming (ASP). Thanks to the emergence of efficient solvers, ASP has recently been used for problem-solving in a variety of domains, including medicine, cryptography, and biology. To ensure the successful application of ASP as a problem-solving paradigm in the future, explanations of ASP solutions are crucial. In this survey, we give an overview of approaches that provide an answer to the question of why an answer set is a solution to a given problem, notably off-line justifications, causal graphs, argumentative explanations and why-not provenance, and highlight their similarities and differences. Moreover, we review methods explaining why a set of literals is not an answer set or why no solution exists at all.
UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification
Hanselowski, Andreas, Zhang, Hao, Li, Zile, Sorokin, Daniil, Schiller, Benjamin, Schulz, Claudia, Gurevych, Iryna
The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular, document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems. For the document retrieval, we implemented a new entity linking approach. In order to be able to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).
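A minimal sketch of the claim-verification pipeline structure: retrieve candidate sentences, rank them against the claim, then classify the claim given the top evidence. The token-overlap ranker and rule-based classifier below are illustrative stand-ins, not the paper's entity-linking or extended-ESIM components.

def rank_evidence(claim, sentences, top_k=3):
    """Rank candidate sentences by simple token overlap with the claim."""
    claim_tokens = set(claim.lower().split())
    scored = sorted(
        sentences,
        key=lambda s: len(claim_tokens & set(s.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def classify_claim(claim, evidence):
    """Toy rule: supported if any evidence sentence shares most of the claim's tokens."""
    claim_tokens = set(claim.lower().split())
    for sentence in evidence:
        if len(claim_tokens & set(sentence.lower().split())) >= len(claim_tokens) // 2:
            return "SUPPORTS"
    return "NOT ENOUGH INFO"

claim = "The Eiffel Tower is located in Paris"
corpus = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "The Louvre is the world's largest art museum.",
]
print(classify_claim(claim, rank_evidence(claim, corpus)))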
On the Equivalence between Assumption-Based Argumentation and Logic Programming
Caminada, Martin, Schulz, Claudia
Assumption-Based Argumentation (ABA) has been shown to subsume various other non-monotonic reasoning formalisms, among them normal logic programming (LP). We re-examine the relationship between ABA and LP and show that normal LP also subsumes (flat) ABA. More precisely, we specify a procedure that given a (flat) ABA framework yields an associated logic program with almost the same syntax whose semantics coincide with those of the ABA framework. That is, the 3-valued stable (respectively well-founded, regular, 2-valued stable, and ideal) models of the associated logic program coincide with the complete (respectively grounded, preferred, stable, and ideal) assumption labellings and extensions of the ABA framework. Moreover, we show how our results on the translation from ABA to LP can be reapplied for a reverse translation from LP to ABA, and observe that some of the existing results in the literature are in fact special cases of our work. Overall, we show that (flat) ABA frameworks can be seen as normal logic programs with a slightly different syntax. This implies that methods developed for one of these formalisms can be equivalently applied to the other by simply modifying the syntax.
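A small sketch of the flavour of the ABA-to-LP translation: rules keep their structure, and each assumption occurring in a rule body is replaced by negation-as-failure applied to its contrary. This is a simplified rendering of the construction; the precise procedure and the semantic correspondences are given in the paper.

def aba_to_lp(rules, assumptions, contrary):
    """rules: list of (head, [body atoms]); assumptions: set of atoms;
    contrary: dict mapping each assumption to its contrary atom."""
    program = []
    for head, body in rules:
        lp_body = [f"not {contrary[b]}" if b in assumptions else b for b in body]
        program.append(f"{head} :- {', '.join(lp_body)}." if lp_body else f"{head}.")
    return program

# Toy flat ABA framework: assumption a with contrary p, and rules p <- q and q <- a.
rules = [("p", ["q"]), ("q", ["a"])]
assumptions = {"a"}
contrary = {"a": "p"}
for rule in aba_to_lp(rules, assumptions, contrary):
    print(rule)
# Prints:  p :- q.   q :- not p.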
Blue Sky Ideas in Artificial Intelligence Education from the EAAI 2017 New and Future AI Educator Program
Eaton, Eric, Koenig, Sven, Schulz, Claudia, Maurelli, Francesco, Lee, John, Eckroth, Joshua, Crowley, Mark, Freedman, Richard G., Cardona-Rivera, Rogelio E., Machado, Tiago, Williams, Tom
The 7th Symposium on Educational Advances in Artificial Intelligence (EAAI'17, co-chaired by Sven Koenig and Eric Eaton) launched the EAAI New and Future AI Educator Program to support the training of early-career university faculty, secondary school faculty, and future educators (PhD candidates or postdocs who intend a career in academia). As part of the program, awardees were asked to address one of the following "blue sky" questions:
* How could/should Artificial Intelligence (AI) courses incorporate ethics into the curriculum?
* How could we teach AI topics at an early undergraduate or a secondary school level?
* AI has the potential for broad impact to numerous disciplines. How could we make AI education more interdisciplinary, specifically to benefit non-engineering fields?
This paper is a collection of their responses, intended to help motivate discussion around these issues in AI education.
Explaining Answer Set Programming in Argumentative Terms
Schulz, Claudia (Imperial College London)
Argumentation Theory and Answer Set Programming (ASP) are two prominent theories in the field of knowledge representation and non-monotonic reasoning, where Argumentation Theory stands for a variety of approaches following similar ideas. The main difference between Argumentation Theory and ASP is that the former focusses on representing knowledge and reasoning about it in a way that resembles human reasoning, neglecting the efficiency of the reasoning procedure, whereas the latter is concerned with the efficient computation of solutions to a reasoning problem, resulting in a less human-understandable process. In recent years, ASP has been frequently applied for the computation of reasoning problems represented in argumentation-theoretic terms and has been found to be an efficient method for determining solutions to problems in Argumentation Theory. My research is concerned with the opposite direction, i.e. with applying Argumentation Theory to ASP in order to explain the solutions to an ASP reasoning problem in a more human-understandable way. Developing such an explanation method also involves investigating the exact relationship between different approaches in Argumentation Theory, in order to find the most suitable one for explanations, as well as their connection with ASP, in particular with respect to their semantics.