Goto

Collaborating Authors

 Logic & Formal Reasoning


LLMSTEP: LLM proofstep suggestions in Lean

arXiv.org Artificial Intelligence

We present LLMSTEP, a tool for integrating a language model into the Lean proof assistant. LLMSTEP is a Lean 4 tactic that sends a user's proof state to a server hosting a language model. The language model generates suggestions, which are checked in Lean and displayed to a user in their development environment. We provide a baseline language model, along with code for fine-tuning and evaluation to support further development. We provide server implementations that run on CPU, a CUDA GPU, or a Google Colab notebook, as a step towards fast, effective language model suggestions for any user.


Fine-Tuning Language Models Using Formal Methods Feedback

arXiv.org Artificial Intelligence

Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address this limitation, however, sourcing human feedback is labor intensive and costly. We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems, bridging the gap between generic knowledge and domain-specific requirements while reducing cost. The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions. These controllers are verifiable against independently provided specifications within a world model, which can be abstract or obtained from a high-fidelity simulator. Controllers with high compliance with the desired specifications receive higher ranks, guiding the iterative fine-tuning process. We provide quantitative evidences, primarily in autonomous driving, to demonstrate the method's effectiveness across multiple tasks. The results indicate an improvement in percentage of specifications satisfied by the controller from 60% to 90%.


The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

arXiv.org Artificial Intelligence

NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than they with synthetic datasets, or than synthetic datasets among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) which lexical items are used in the data can strongly impact conclusions. Overall, our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure, and suggest that elucidating more rigorous standards for establishing the validity of evaluation sets could benefit the field.


Synthesizing Efficiently Monitorable Formulas in Metric Temporal Logic

arXiv.org Artificial Intelligence

In runtime verification, manually formalizing a specification for monitoring system executions is a tedious and error-prone process. To address this issue, we consider the problem of automatically synthesizing formal specifications from system executions. To demonstrate our approach, we consider the popular specification language Metric Temporal Logic (MTL), which is particularly tailored towards specifying temporal properties for cyber-physical systems (CPS). Most of the classical approaches for synthesizing temporal logic formulas aim at minimizing the size of the formula. However, for efficiency in monitoring, along with the size, the amount of "lookahead" required for the specification becomes relevant, especially for safety-critical applications. We formalize this notion and devise a learning algorithm that synthesizes concise formulas having bounded lookahead. To do so, our algorithm reduces the synthesis task to a series of satisfiability problems in Linear Real Arithmetic (LRA) and generates MTL formulas from their satisfying assignments. The reduction uses a novel encoding of a popular MTL monitoring procedure using LRA. Finally, we implement our algorithm in a tool called TEAL and demonstrate its ability to synthesize efficiently monitorable MTL formulas in a CPS application.


A Survey of Requirements for COVID-19 Mitigation Strategies. Part II: Elicitation of Requirements

arXiv.org Artificial Intelligence

COVID-19 has influenced virtually all aspects of our lives. Across the world, countries applied wildly varying mitigation strategies for the epidemic, ranging from minimal intrusion in the hope of obtaining "herd immunity", to imposing severe lockdowns on the other extreme. It seems clear at the first glance what all those measures are trying to achieve, and what the criteria of success are. But is it really that clear? Quoting an oft-repeated phrase, with COVID-19 we fight an unprecedented threat to health and economic stability [Soltani et al., 2020]. While fighting it, we must protect privacy, equality and fairness [Morley et al., 2020] and do a coordinated assessment of usefulness, effectiveness, technological readiness, cyber security risks and threats to fundamental freedoms and human rights [Stollmeyer et al., 2020]. Taken together, this is hardly a straightforward set of goals and requirements. Thus, paraphrasing [Stollmeyer et al., 2020], one may ask: What problem does a COVID mitigation strategy solve exactly? 1


MLFMF: Data Sets for Machine Learning for Mathematical Formalization

arXiv.org Artificial Intelligence

We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean~4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the s-expressions give complete and easily parsed information about every entry. We report baseline results using standard graph and word embeddings, tree ensembles, and instance-based learning algorithms. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. The methodology used to extract the networks and the s-expressions readily applies to other libraries, and is applicable to other proof assistants. With more than $250\,000$ entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.


Model of models -- Part 1

arXiv.org Artificial Intelligence

This paper proposes a new cognitive model, acting as the main component of an AGI agent. The model is introduced in its mature intelligence state, and as an extension of previous models, DENN, and especially AKREM, by including operational models (frames/classes) and will. This model's core assumption is that cognition is about operating on accumulated knowledge, with the guidance of an appropriate will. Also, we assume that the actions, part of knowledge, are learning to be aligned with will, during the evolution phase that precedes the mature intelligence state. In addition, this model is mainly based on the duality principle in every known intelligent aspect, such as exhibiting both top-down and bottom-up model learning, generalization verse specialization, and more. Furthermore, a holistic approach is advocated for AGI designing, and cognition under constraints or efficiency is proposed, in the form of reusability and simplicity. Finally, reaching this mature state is described via a cognitive evolution from infancy to adulthood, utilizing a consolidation principle. The final product of this cognitive model is a dynamic operational memory of models and instances. Lastly, some examples and preliminary ideas for the evolution phase to reach the mature state are presented.


Embedding Ontologies in the Description Logic ALC by Axis-Aligned Cones

Journal of Artificial Intelligence Research

This paper is concerned with knowledge graph embedding with background knowledge, taking the formal perspective of logics. In knowledge graph embedding, knowledge— expressed as a set of triples of the form (a R b) (“a is R-related to b”)—is embedded into a real-valued vector space. The embedding helps exploiting geometrical regularities of the space in order to tackle typical inductive tasks of machine learning such as link prediction. Recent embedding approaches also consider incorporating background knowledge, in which the intended meanings of the symbols a, R, b are further constrained via axioms of a theory. Of particular interest are theories expressed in a formal language with a neat semantics and a good balance between expressivity and feasibility. In that case, the knowledge graph together with the background can be considered to be an ontology. This paper develops a cone-based theory for embedding in order to advance the expressivity of the ontology: it works (at least) with ontologies expressed in the description logic ALC, which comprises restricted existential and universal quantifiers, as well as concept negation and concept disjunction. In order to align the classical Tarskian Style semantics for ALC with the sub-symbolic representation of triples, we use the notion of a geometric model of an ALC ontology and show, as one of our main results, that an ALC ontology is satisfiable in the classical sense iff it is satisfiable by a geometric model based on cones. The geometric model, if treated as a partial model, can even be chosen to be faithful, i.e., to reflect all and only the knowledge captured by the ontology. We introduce the class of axis-aligned cones and show that modulo simple geometric operations any distributive logic (such as ALC) interpreted over cones employs this class of cones. Cones are also attractive from a machine learning perspective on knowledge graph embeddings since they give rise to applying conic optimization techniques.


A Skin Microbiome Model with AMP interactions and Analysis of Quasi-Stability vs Stability in Population Dynamics

arXiv.org Artificial Intelligence

The skin microbiome plays an important role in the maintenance of a healthy skin. It is an ecosystem, composed of several species, competing for resources and interacting with the skin cells. Imbalance in the cutaneous microbiome, also called dysbiosis, has been correlated with several skin conditions, including acne and atopic dermatitis. Generally, dysbiosis is linked to colonization of the skin by a population of opportunistic pathogenic bacteria. Treatments consisting in non-specific elimination of cutaneous microflora have shown conflicting results. In this article, we introduce a mathematical model based on ordinary differential equations, with 2 types of bacteria populations (skin commensals and opportunistic pathogens) and including the production of antimicrobial peptides to study the mechanisms driving the dominance of one population over the other. By using published experimental data, assumed to correspond to the observation of stable states in our model, we reduce the number of parameters of the model from 13 to 5. We then use a formal specification in quantitative temporal logic to calibrate our model by global parameter optimization and perform sensitivity analyses. On the time scale of 2 days of the experiments, the model predicts that certain changes of the environment, like the elevation of skin surface pH, create favorable conditions for the emergence and colonization of the skin by the opportunistic pathogen population, while the production of human AMPs has non-linear effect on the balance between pathogens and commensals. Surprisingly, simulations on longer time scales reveal that the equilibrium reached around 2 days can in fact be a quasi-stable state followed by the reaching of a reversed stable state after 12 days or more. We analyse the conditions of quasi-stability observed in this model using tropical algebraic methods, and show their non-generic character in contrast to slow-fast systems. These conditions are then generalized to a large class of population dynamics models over any number of species.


LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers

arXiv.org Artificial Intelligence

Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc