Independent Prototype Propagation for Zero-Shot Compositionality
Humans are good at compositional zero-shot reasoning; someone who has never seen a zebra before could nevertheless recognize one when we tell them it looks like a horse with black and white stripes. Machine learning systems, on the other hand, usually leverage spurious correlations in the training data, and while such correlations can help recognize objects in context, they hurt generalization. To be able to deal with underspecified datasets while still leveraging contextual clues during classification, we propose ProtoProp, a novel prototype propagation graph method. First we learn prototypical representations of objects (e.g., zebra) that are independent w.r.t.
In-Context Representation Hijacking
Yona, Itay, Sarid, Amir, Karasik, Michael, Gandelsman, Yossi
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- North America > United States (0.04)
- Government (1.00)
- Information Technology > Security & Privacy (0.93)
- Law Enforcement & Public Safety (0.89)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Scaling can lead to compositional generalization
Redhardt, Florian, Akram, Yassir, Schug, Simon
Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
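The linear-decodability claim in the abstract above can be illustrated with a small probe. This is a minimal sketch on synthetic data, not the authors' method or metric: the function name `linear_probe_accuracy`, the least-squares probe, the single-module labels, and the toy "hidden activations" (a random linear mix of one-hot module identities plus noise) are all assumptions made for illustration.

```python
import numpy as np


def linear_probe_accuracy(hidden, labels, n_classes, train_frac=0.8):
    """Fit a least-squares linear probe and return held-out accuracy.

    hidden: (n_samples, hidden_dim) activations (synthetic in this sketch).
    labels: (n_samples,) integer id of the task constituent to decode.
    """
    n_train = int(len(hidden) * train_frac)
    # Append a constant column so the probe can learn a bias term.
    feats = np.hstack([hidden, np.ones((len(hidden), 1))])
    targets = np.eye(n_classes)[labels]  # one-hot regression targets
    weights, *_ = np.linalg.lstsq(feats[:n_train], targets[:n_train], rcond=None)
    preds = (feats[n_train:] @ weights).argmax(axis=1)
    return float((preds == labels[n_train:]).mean())


rng = np.random.default_rng(0)
n_samples, n_modules, hidden_dim = 500, 4, 32  # hypothetical sizes

# Assumption: activations encode the module identity linearly, plus noise.
labels = rng.integers(0, n_modules, size=n_samples)
mixing = rng.normal(size=(n_modules, hidden_dim))
hidden = np.eye(n_modules)[labels] @ mixing
hidden = hidden + 0.1 * rng.normal(size=(n_samples, hidden_dim))

acc = linear_probe_accuracy(hidden, labels, n_modules)

# Control: activations unrelated to the labels should decode poorly.
unrelated = rng.normal(size=(n_samples, hidden_dim))
control_acc = linear_probe_accuracy(unrelated, labels, n_modules)

print(f"probe accuracy on structured activations: {acc:.2f}")
print(f"probe accuracy on unrelated activations:  {control_acc:.2f}")
```

With real models, `hidden` would be replaced by recorded activations and `labels` by the known constituents of each task; the gap between the probe and the control is what indicates linear decodability.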
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Towards a Neurosymbolic Reasoning System Grounded in Schematic Representations
Olivier, François, Bouraoui, Zied
Despite significant progress in natural language understanding, Large Language Models (LLMs) remain error-prone when performing logical reasoning, often lacking the robust mental representations that enable human-like comprehension. We introduce a prototype neurosymbolic system, Embodied-LM, that grounds understanding and logical reasoning in schematic representations based on image schemas, recurring patterns derived from sensorimotor experience that structure human cognition. Our system operationalizes the spatial foundations of these cognitive structures using declarative spatial reasoning within Answer Set Programming. Through evaluation on logical deduction problems, we demonstrate that LLMs can be guided to interpret scenarios through embodied cognitive structures, that these structures can be formalized as executable programs, and that the resulting representations support effective logical reasoning with enhanced interpretability. While our current implementation focuses on spatial primitives, it establishes the computational foundation for incorporating more complex and dynamic representations.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > New York (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Transportation > Passenger (0.50)
- Transportation > Ground > Road (0.50)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
- North America > United States > Illinois (0.04)
- Asia > China > Hong Kong (0.04)
Prediction of Occluded Pedestrians in Road Scenes using Human-like Reasoning: Insights from the OccluRoads Dataset
Nataly, Melo Castillo Angie, Sergio, Martin Serrano, Carlota, Salinas, Angel, Sotelo Miguel
Pedestrian detection is a critical task in autonomous driving, aimed at enhancing safety and reducing risks on the road. Over recent years, significant advancements have been made in improving detection performance. However, these achievements still fall short of human perception, particularly in cases involving occluded pedestrians, especially entirely invisible ones. In this work, we present the Occlusion-Rich Road Scenes with Pedestrians (OccluRoads) dataset, which features a diverse collection of road scenes with partially and fully occluded pedestrians in both real and virtual environments. All scenes are meticulously labeled and enriched with contextual information that encapsulates human perception in such scenarios. Using this dataset, we developed a pipeline to predict the presence of occluded pedestrians, leveraging Knowledge Graph (KG), Knowledge Graph Embedding (KGE), and a Bayesian inference process. Our approach achieves an F1 score of 0.91, representing an improvement of up to 42% compared to traditional machine learning models.
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (0.90)
- Information Technology (0.89)
Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
Kim, Yoonjeon, Ryu, Soohyun, Jung, Yeonsung, Lee, Hyunkoo, Kim, Joowon, Yang, June Yong, Hwang, Jaeryong, Yang, Eunho
The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the \textit{preservation} of core elements in the source image while implementing \textit{modifications} based on the target text. However, existing metrics have a \textbf{context-blindness} problem, indiscriminately applying the same evaluation criteria to completely different pairs of source image and target text, biasing towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification aspects and attends to irrelevant editing regions of the image. We propose \texttt{AugCLIP}, a \textbf{context-aware} metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image with the modifications necessary to align with the target text. More specifically, using a multi-modal large language model, \texttt{AugCLIP} augments the textual descriptions of the source and target, then calculates a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that \texttt{AugCLIP} aligns remarkably well with human evaluation standards, outperforming existing metrics. The code will be open-sourced for community use.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)