 Generative AI


MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

arXiv.org Artificial Intelligence

Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing the training pipeline of the vision encoder for fine-grained geometric reasoning tasks. Geometric mathematical reasoning has garnered significant attention as an essential capability for large multimodal models (Anthropic, 2024; OpenAI, 2023; Bai et al., 2023). It requires fine-grained identification of visual elements (Lu et al., 2023) within the given images, such as geometric shapes, spatial configurations, and the relationships between them (He et al., 2024).
However, the "eyes" of most existing LMMs, i.e., their pretrained vision encoders such as CLIP (Patel et al., 2024; Yang et al., 2023; Goel et al., 2022), are primarily trained on general visual datasets that do not emphasize the intricate features necessary for specialized mathematical reasoning. Therefore, these models often fail to understand the nuanced geometric information accurately and produce incorrect reasoning and answers. As shown in Figure 1, facing a simple parallel line problem, the leading LMMs such as GPT-4o (OpenAI, 2024a), Claude-3 (Anthropic, 2024), and Qwen2.5-VL


Target Population Synthesis using CT-GAN

arXiv.org Artificial Intelligence

Agent-based models used in scenario planning for transportation and urban planning usually require detailed population information for the base as well as target scenarios. These populations are usually provided by synthesizing fake agents through deterministic population synthesis methods. However, these deterministic population synthesis methods face several challenges, such as handling high-dimensional data, scalability, and zero-cell issues, particularly when generating populations for target scenarios. This research looks into how a deep generative model called Conditional Tabular Generative Adversarial Network (CT-GAN) can be used to create target populations either directly from a collection of marginal constraints or through a hybrid method that combines CT-GAN with Fitness-based Synthesis Combinatorial Optimization (FBS-CO). The research evaluates the proposed population synthesis models against travel survey and zonal-level aggregated population data. Results indicate that the stand-alone CT-GAN model performs the best when compared with FBS-CO and the hybrid model. CT-GAN by itself can create realistic-looking groups that match single-variable distributions, but it struggles to maintain relationships between multiple variables. However, the hybrid model demonstrates improved performance compared to FBS-CO by leveraging CT-GAN's ability to generate a descriptive base population, which is then refined using FBS-CO to align with target-year marginals. This study demonstrates that CT-GAN represents an effective methodology for synthesizing target populations and highlights how deep generative models can be successfully integrated with conventional synthesis techniques to enhance their performance.


PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

arXiv.org Artificial Intelligence

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.


Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

arXiv.org Artificial Intelligence

Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., "une sentinelle" - grammatically feminine in French but referring to the stereotypically masculine concept "guard"). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.


A Call to Action for a Secure-by-Design Generative AI Paradigm

arXiv.org Artificial Intelligence

Large language models have gained widespread prominence, yet their vulnerability to prompt injection and other adversarial attacks remains a critical concern. This paper argues for a security-by-design AI paradigm that proactively mitigates LLM vulnerabilities while enhancing performance. To achieve this, we introduce PromptShield, an ontology-driven framework that ensures deterministic and secure prompt interactions. It standardizes user inputs through semantic validation, eliminating ambiguity and mitigating adversarial manipulation. To assess PromptShield's security and performance capabilities, we conducted an experiment on an agent-based system to analyze cloud logs within Amazon Web Services (AWS), containing 493 distinct events related to malicious activities and anomalies. By simulating prompt injection attacks and assessing the impact of deploying PromptShield, our results demonstrate a significant improvement in model security and performance, achieving precision, recall, and F1 scores of approximately 94%. Notably, the ontology-based framework not only mitigates adversarial threats but also enhances the overall performance and reliability of the system. Furthermore, PromptShield's modular and adaptable design ensures its applicability beyond cloud security, making it a robust solution for safeguarding generative AI applications across various domains. By laying the groundwork for AI safety standards and informing future policy development, this work stimulates a crucial dialogue on the pivotal role of deterministic prompt engineering and ontology-based validation in ensuring the safe and responsible deployment of LLMs in high-stakes environments.


Hollywood's Most Terrifying Nightmare Has Arrived

Slate

A.I. Is Ready to Crush Hollywood as We've Known It. Generative video tools are ready to flood the market with robot actors and content, leaving studios and actors scrambling to catch up.


OpenAI's New Sora App Lets You Deepfake Yourself for Entertainment

WIRED

OpenAI's latest app encourages users to generate a personal digital avatar and scroll AI-generated videos of themselves and their friends. On Tuesday, OpenAI released an AI video app called Sora. The platform is powered by OpenAI's latest video generation model, Sora 2, and revolves around a TikTok-like For You page of user-generated clips. This is the first product release from OpenAI that adds AI-generated sounds to videos. For now, it's available only on iOS and requires an invite code to join.


Exclusive: Mira Murati's Stealth AI Lab Launches Its First Product

WIRED

Thinking Machines Lab, led by a group of prominent former OpenAI researchers, is betting that fine-tuning cutting-edge models will be the next frontier in AI. Thinking Machines Lab, a heavily funded startup cofounded by prominent researchers from OpenAI, has revealed its first product--a tool called Tinker that automates the creation of custom frontier AI models. "We believe [Tinker] will help empower researchers and developers to experiment with models and will make frontier capabilities much more accessible to all people," said Mira Murati, cofounder and CEO of Thinking Machines, in an interview with WIRED ahead of the announcement. Big companies and academic labs already fine-tune open source AI models to create new variants that are optimized for specific tasks, like solving math problems, drafting legal agreements, or answering medical questions. Typically, this work involves acquiring and managing clusters of GPUs and using various software tools to ensure that large-scale training runs are stable and efficient.


The Alien Intelligence in Your Pocket

The Atlantic - Technology

Are you sure that chatbot isn't alive? One of the persistent questions in our brave new world of generative AI: If a chatbot is conversant like a person, if it reasons and behaves like one, then is it possibly conscious like a person? Geoffrey Hinton, a recent Nobel Prize winner and one of the so-called godfathers of AI, told the journalist Andrew Marr earlier this year that AI has become so advanced and adept at reasoning that "we're now creating beings." Hinton links an AI's ability to "think" and act on behalf of a person to consciousness: The difference between the organic neurons in our head and the synthetic neural networks of a chatbot is effectively meaningless, he said: "They are alien intelligences."


Unlocking AI's full potential requires operational excellence

MIT Technology Review

For successful AI adoption, leaders need to focus on structure rather than speed. Talk of AI is inescapable. A record 58% of S&P 500 companies mentioned AI in their second-quarter earnings calls, according to Goldman Sachs. But it's difficult to walk the talk. Just 5% of generative AI pilots are driving measurable profit-and-loss impact, according to a recent MIT study. That means 95% of generative AI pilots are realizing zero return, despite significant attention and investment.