conclusion
VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models
Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors yield only a binary decision and cannot localize forgeries, attribute forgery methods, or analyze the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis to specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthetic face dataset designed to facilitate rich interactions between 'Visual' and 'Language' modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake images that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language (as generated by an LM) with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions, even with minimal, domain-consistent distractions, and the proofs they generate frequently exhibit detours through irrelevant inferences.
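The idea of scoring a proof against the shortest proof of a logic program can be sketched as follows. This is only an illustrative toy, not the metric used in the work described above: the atom names, the `efficiency` function, and the score (the fraction of model inference steps that lie in the dependency cone of the goal) are all assumptions. It further assumes that each derivable atom is the head of exactly one rule, so the proof of the goal is unique and therefore shortest, and that the model's natural-language steps have already been aligned to atoms of the program.

```python
def forward_chain(axioms, rules):
    """Derive all reachable atoms; record the rule body used for each derived atom."""
    known = set(axioms)
    used = {}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                used[head] = body
                changed = True
    return known, used


def needed_atoms(goal, used):
    """Collect every atom the goal depends on (the goal and axioms included)."""
    needed, stack = set(), [goal]
    while stack:
        atom = stack.pop()
        if atom in needed:
            continue
        needed.add(atom)
        stack.extend(used.get(atom, ()))
    return needed


def efficiency(model_steps, goal, axioms, rules):
    """Hypothetical score: share of the model's steps that the goal actually needs."""
    if not model_steps:
        raise ValueError("model_steps must be non-empty")
    known, used = forward_chain(axioms, rules)
    if goal not in known:
        raise ValueError("goal is not derivable from the program")
    needed = needed_atoms(goal, used)
    useful = sum(1 for step in model_steps if step in needed)
    return useful / len(model_steps)


# Hypothetical word problem: the pear axiom is an irrelevant distraction.
axioms = ["apples_alice=3", "apples_bob=4", "pears_carol=5"]
rules = [
    ("apples_total=7", ("apples_alice=3", "apples_bob=4")),
    ("fruit_carol=5", ("pears_carol=5",)),  # irrelevant inference
]
goal = "apples_total=7"

direct = efficiency(["apples_total=7"], goal, axioms, rules)
detour = efficiency(["fruit_carol=5", "apples_total=7"], goal, axioms, rules)
print(direct, detour)
```

A proof that goes straight to the goal scores higher than one that first derives the irrelevant fact, which is the kind of detour the abstract describes.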
MuSLR: Multimodal Symbolic Logical Reasoning
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce MuSLR, the first multimodal symbolic logical reasoning task grounded in formal logical rules. We curate a benchmark dataset for MuSLR comprising 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on our benchmark and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's
Resolution of Simpson's paradox via the common cause principle
Simpson's paradox poses a challenge in probabilistic inference and decision-making. Our study revisits the paradox by re-estimating its frequency with an unbiased data generation process and reaffirms that it is not an artifact of deficient data collection. Thus, it can lead to incorrect recommendations in fields as diverse as statistics, psychology, and artificial intelligence. We show that the paradox can be resolved by assuming a minimal (though not necessarily observed) common cause (or screening) variable for the involved random variables. In our approach, conditioning on this minimal common cause establishes the correct association between events, which coincides with the conditioning (i.e., fine-grained) option of the original Simpson paradox. This resolution applies to both discrete cases of binary variables and continuous settings modeled by Gaussian variables. For a non-minimal common cause, the resolution of the paradox is possible, but detailed knowledge of the common cause is required. Our findings extend traditional understandings of the paradox and offer practical guidance for resolving apparent contradictions in probabilistic inference, ultimately enhancing decision-making processes. This point is illustrated by several examples.
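The reversal between the aggregated and the fine-grained (conditioned) comparison can be seen in a small numeric sketch. The counts below are the widely cited kidney-stone illustration of Simpson's paradox and are not taken from the work described above; treating stone size as the variable to condition on is an assumption made purely for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

# (successes, trials) per treatment and stone size.
# Illustrative counts from the classic kidney-stone example (an assumption,
# not data from the study summarized above).
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def conditional_rates(treatment):
    """Success rate within each stratum (the fine-grained option)."""
    return {z: Fraction(s, n) for z, (s, n) in data[treatment].items()}


def marginal_rate(treatment):
    """Success rate after pooling all strata (the aggregated option)."""
    successes = sum(s for s, _ in data[treatment].values())
    trials = sum(n for _, n in data[treatment].values())
    return Fraction(successes, trials)


cond_a, cond_b = conditional_rates("A"), conditional_rates("B")
for stratum in ("small", "large"):
    print(stratum, float(cond_a[stratum]), float(cond_b[stratum]))
print("pooled", float(marginal_rate("A")), float(marginal_rate("B")))
```

Within each stratum treatment A has the higher success rate, yet after pooling treatment B looks better. Which of the two comparisons to trust is exactly the question the common cause assumption is meant to settle.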
Interpreting Representation Quality of DNNs for 3D Point Cloud Processing: Supplementary Materials
Wen Shen (b), Qihan Ren (a), Dongrui Liu (a), Quanshi Zhang (a)
(a) Shanghai Jiao Tong University, (b) Tongji University
This section provides more details about Shapley values in Section 3 of the paper. Linearity: If two independent games $v$ and $w$ can be merged into one game $u(S) = v(S) + w(S)$, then the Shapley values of player $i$ in game $v$ and game $w$ can also be merged, i.e. $\phi_u(i) = \phi_v(i) + \phi_w(i)$. Nullity: A dummy player $i$ satisfies $\forall S \subseteq N \setminus \{i\},\ v(S \cup \{i\}) = v(S) + v(\{i\})$, which indicates that player $i$ has no interaction with other players, i.e. $\phi(i) = v(\{i\})$. Efficiency: The overall reward can be allocated to all players in the game.

This section provides more details about multi-order interactions [8] in Section 3.3 of the paper.
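The Shapley properties listed above (linearity, nullity, efficiency) can be checked numerically on a toy game. The sketch below uses the standard subset-weighted definition of the Shapley value; the games `v` and `w`, the player names, and the helper `shapley_values` are hypothetical and chosen only for illustration. Efficiency is checked in the form "the values sum to $v(N) - v(\emptyset)$", which follows from the standard definition.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, game):
    """Exact Shapley values; `game` maps a frozenset of players to a reward."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes player i.
            weight = factorial(k) * factorial(n - 1 - k) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (game(s | {i}) - game(s))
        phi[i] = total
    return phi


players = ["a", "b", "c"]


def v(s):
    """Hypothetical game: 'a' and 'b' interact, 'c' is a dummy player."""
    value = 0.0
    if "a" in s:
        value += 1.0
    if "b" in s:
        value += 2.0
    if "a" in s and "b" in s:
        value += 3.0  # interaction between a and b
    if "c" in s:
        value += 1.0  # c always contributes exactly v({c}) = 1
    return value


def w(s):
    """A second hypothetical game, symmetric in all players."""
    return float(len(s)) ** 2


def u(s):
    """Merged game u(S) = v(S) + w(S)."""
    return v(s) + w(s)


phi_v = shapley_values(players, v)
phi_w = shapley_values(players, w)
phi_u = shapley_values(players, u)
print(phi_v, phi_w, phi_u)
```

In this toy game the dummy player `c` receives exactly `v({c})`, the values of `u` are the sums of those of `v` and `w`, and the values of `v` add up to the reward of the grand coalition.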
Appendix
We first introduce some handy concepts and results to make the proof succinct, while providing more information for understanding our model and theory. We begin with some extended discussions on CSG. Note that a reparameterization does not necessarily have its output dimensions in $\mathcal{S}$, i.e. $\Phi^{\mathcal{S}}(s,v)$, constant of $v$. The condition that $p(y|s) = p'(y|\Phi^{\mathcal{S}}(s,v))$ for any $v \in \mathcal{V}$ does not indicate that $\Phi^{\mathcal{S}}(s,v)$ is constant of $v$, since $p'(y|s')$ may ignore the change of $s' = \Phi^{\mathcal{S}}(s,v)$ from the change of $v$. The following lemma shows the meaning of a reparameterization: it allows a CSG to vary while inducing the same distribution on the observed data variables $(x,y)$ (i.e., holding the same effect on describing data).

We can now define and verify an equivalence relation on CSGs so that the resulting equivalence class contains CSGs that induce the same $(x,y)$ data distribution and hold the same semantic information in their $s$ variables. We say two CSGs $p$ and $p'$ are semantic-equivalent if there exists a homeomorphism $\Phi$ on $\mathcal{S} \times \mathcal{V}$ (a transformation is a homeomorphism if it is a continuous bijection with continuous inverse) such that (i) it is semantic-preserving: its output dimensions in $\mathcal{S}$ are constant of $v$, $\Phi^{\mathcal{S}}(s,v) = \Phi^{\mathcal{S}}(s)$ for any $v \in \mathcal{V}$, and (ii) it acts as a reparameterization from $p$ to $p'$: $\Phi_{\#}[p_{s,v}] = p'_{s,v}$, $p(x|s,v) = p'(x|\Phi(s,v))$ and $p(y|s) = p'(y|\Phi^{\mathcal{S}}(s))$. A.1 below shows that the defined binary relation is indeed an equivalence relation in common cases. As a reparameterization, $\Phi$ allows the two models to have different latent-variable parameterizations while inducing the same distribution on the observed data variables $(x,y)$ (Lemma 9). This definition of semantic-equivalence can be rephrased as the existence of a semantic-preserving reparameterization. With proper model assumptions, we can show that any reparameterization between two CSGs is semantic-preserving, so that semantic-preserving CSGs cannot be converted to each other by a reparameterization that mixes $s$ with $v$.

Lemma 11. For two CSGs $p$ and $p'$, if $p'(y|s)$ has a statistic $M'(s)$ that is an injective function of $s$, then any reparameterization $\Phi$ from $p$ to $p'$, if it exists, has its $\Phi^{\mathcal{S}}$ constant of $v$.

Proof. Let $M(s)$ denote the corresponding statistic of $p(y|s)$. Then the condition that $p(y|s) = p'(y|\Phi^{\mathcal{S}}(s,v))$ for any $v \in \mathcal{V}$ indicates that $M(s) = M'(\Phi^{\mathcal{S}}(s,v))$. If there exist $s \in \mathcal{S}$ and $v^{(1)} \neq v^{(2)} \in \mathcal{V}$ such that $\Phi^{\mathcal{S}}(s,v^{(1)}) \neq \Phi^{\mathcal{S}}(s,v^{(2)})$, then $M'(\Phi^{\mathcal{S}}(s,v^{(1)})) \neq M'(\Phi^{\mathcal{S}}(s,v^{(2)}))$ since $M'$ is injective. This violates $M(s) = M'(\Phi^{\mathcal{S}}(s,v))$, which requires both $M'(\Phi^{\mathcal{S}}(s,v^{(1)}))$ and $M'(\Phi^{\mathcal{S}}(s,v^{(2)}))$ to be equal to $M(s)$.

We then introduce two mathematical facts. Let $z$ be a random variable on a Euclidean space $\mathbb{R}^{d_Z}$ with density function $p_z(z)$, and let $\Phi$ be a homeomorphism on $\mathbb{R}^{d_Z}$ whose inverse $\Phi^{-1}$ is differentiable.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
[...] research. [...] yield the conclusion that J outperforms K, whereas searching another can entail the opposite. In short, the way we choose hyperparameters can deceive us. We provide a theoretical complement to this prior work, arguing that, to avoid such deception, the process of drawing conclusions from HPO should be made more rigorous. We call this process epistemic hyperparameter optimization (EHPO), and put forth a logical framework to capture its semantics and how it can lead to inconsistent conclusions about performance. Our framework enables us to prove EHPO methods that are guaranteed to be defended against deception, given bounded [...] We demonstrate our framework's utility by proving and [...]