Goto

Collaborating Authors

 contradiction


Testing the General Deductive Reasoning Capacity of Large Language Models Using OODExamples

Neural Information Processing Systems

Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.


Appendix ARemovable Variables

Neural Information Processing Systems

In this section, we first prove the proposed graphical representation for a removable variable in a MAGM (Theorem 1). Then, we discuss how this representation reduces to Theorem 5 of [11] in the case of DAGs. Throughout our proofs, we say a path between X and Y is blocked by a set Wif it is not m-connecting relative to W. In this case, there exists a non-collider W on the path which is a member of W, or there exists a collider W on the path such that W/2 Anc({X,Y }[ W). In both cases we say W blocks this path with respect to W, or W blocks the path in short when W is clear from the context. We say X is a descendant of Y if Y 2Anc(X), and we denote by DeM(X) the set of descendants of X in the MAGM, and De(X) whenever the graph is clear from the context. A.1 Graphical representation Theorem 1. Vertex X is removable in a MAGM over the variables V, if and only if 1. for any Y 2Adj(X) and Z 2Ch(X)[N(X)\{Y}, Y and Z are adjacent, and 2. for any collider path u =( X,V1,...,V m,Y) and Z 2 V\{X,Y,V1,...,V m} such that {X,V1,...,V m} Pa(Z), Y and Z are adjacent. Let H denote the induced subgraph of M over V\{X}. For any W V\{X,Y,Z}, (Z,X,Y) is an m-connecting path relative to W in M, as X is a non-collider and X/2W. That is, no such W can m-separate Y and Z. Since X is removable in M, by definition of removability, (Y?Z|W)M ()(Y?Z|W)H. Again for any W V\{X,Y,Z}, (Z,X,V1,...,V m,Y) is an m-connecting path relative to W in M since I) every collider on this path is a parent (and therefore an ancestor) of Z, and II) X/2W and X is the only non-collider on this path. That is, no such W can m-separate Y and Z. Since X is removable in M, Equation 8 implies that Y and Z have no m-separating sets in H. Hence, Y is adjacent to Z in H, and therefore, in M. if part: We need to prove that for any Y,Z 2V\{X} and any W V\{X,Y,Z}, (Y?Z|W)M ()(Y?Z|W)H.


Bandit Social Learning under Myopic Behavior

Neural Information Processing Systems

We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark exploration failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.1




Faster Directional Convergence of Linear Neural Networks under Spherically Symmetric Data

Neural Information Processing Systems

In this paper, we study gradient methods for training deep linear neural networks with binary cross-entropy loss. In particular, we show global directional convergence guarantees from a polynomial rate to a linear rate for (deep) linear networks with spherically symmetric data distribution, which can be viewed as a specific zero-margin dataset. Our results do not require the assumptions in other works such as small initial loss, presumed convergence of weight direction, or overparameterization. We also characterize our findings in experiments.


Appendix 1 Proof of Lemma 3.1

Neural Information Processing Systems

If hhas the form (1), we have h = g f . Hence, we obtain h h = g2 f (1 f). Since 0 f <1 and f >0, we have h h <0. Now, from the other direction, suppose we know h h <0. To prove that hmust have the form (1), we first show that h(0)h() >0 ( 0), (2) namely, h() always keeps the same sign with h(0).


Supplementary Material

Neural Information Processing Systems

For a vector x 2 Rd and H [d], we denote vH to denote the vector that is equal to v on i 2 H, and zero otherwise. For a real-valued random variable X and m 2 N, we use kXkLm to denote (E|X|m)1/m. For a set S Rd and a function f, we also define the set function notation f(S) as {f(x)|x 2 S}. A.1 Finding a stable subset from a stable weighted subset For a set S on npoints, we define n, as the set of weights w 2 Rn such that wi 2 [0,1/((1)n]for all i 2 [n]and P i wi =1 . For a fixed vector µ 2 Rd that will be clear from context, a set of npoints S = {x1,...,x n}, and weights w 2 n, over S, we use w to denote P i wi(xi µ)(xi µ)>. The goal of this section is to show Proposition A.1, which states that if we have a weight w over S such that w (with respect to some vector µ) has bounded Xk norm proportional to 2 for some > 0, then there must exists some large subset S0 S that is stable with respect to µ and . Let S be a set of n points in Rd. Suppose that there exists a w 2 n, such that k wkXk B 2 for some vector µ. Then there exists a subset S0 S such that (i)|S0| (1 2)n and (ii) S0 is (,,k)-stable with respect to µ and, where = O( p B +1). Observe that k wkXk B 2 implies k w 2IkXk (B +1) 2 by the triangle inequality. In order to show Proposition A.1, we show Lemma A.2, which is a weakening of Proposition A.1 where we additionally assume that µw = P i wixi is close to µ, where µ is the vector we use to define w as well as the vector that we want to find a large sample subset S0 to be stable with respect to. To use Lemma A.2, we additionally show Proposition A.4, which states that k wkXk B 2 is enough to imply that µw is close to µ.