 Generative AI


Automated Meta Prompt Engineering for Alignment with the Theory of Mind

arXiv.org Artificial Intelligence

We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human's mental expectation and a Large Language Model's (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long-form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, the expectations of human content reviewers were 100% aligned with the AI 53.8% of the time, with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space, which combines spatial volume (all trait importance) with vertex alignment (individual trait relevance), enabled the LLMaaJ to optimize on human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work, which was deployed at the US Open 2024, has been used across other live events within sports and entertainment.


A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias

arXiv.org Artificial Intelligence

Large Language Models (LLMs) represent a major step toward artificial general intelligence, significantly advancing our ability to interact with technology. While LLMs perform well on Natural Language Processing tasks -- such as translation, generation, code writing, and summarization -- questions remain about their output similarity, variability, and ethical implications. For instance, how similar are texts generated by the same model? How does this compare across different models? And which models best uphold ethical standards? To investigate, we used 5,000 prompts spanning diverse tasks like generation, explanation, and rewriting. This resulted in approximately 3 million texts from 12 LLMs, including proprietary and open-source systems from OpenAI, Google, Microsoft, Meta, and Mistral. Key findings include: (1) outputs from the same LLM are more similar to each other than to human-written texts; (2) models like WizardLM-2-8x22b generate highly similar outputs, while GPT-4 produces more varied responses; (3) LLM writing styles differ significantly, with Llama 3 and Mistral showing higher similarity, and GPT-4 standing out for distinctiveness; (4) differences in vocabulary and tone underscore the linguistic uniqueness of LLM-generated content; (5) some LLMs demonstrate greater gender balance and reduced bias. These results offer new insights into the behavior and diversity of LLM outputs, helping guide future development and ethical evaluation.
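
The abstract does not say which similarity measure the authors used. As a rough illustration of what "outputs from the same LLM are more similar to each other" could mean in practice, the sketch below averages pairwise Jaccard similarity over word sets. The choice of Jaccard similarity, the whitespace tokenization, and the function names `jaccard` and `mean_pairwise_similarity` are all assumptions made for this example, not details from the paper.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average Jaccard similarity over every unordered pair of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical outputs for one prompt; real studies would use many prompts.
    outputs = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog ran in the park",
    ]
    print(round(mean_pairwise_similarity(outputs), 3))
```

A higher average for one model's outputs than for another's would suggest less varied responses under this particular measure; embedding-based or stylometric measures could rank models differently.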


Tracing the Invisible: Understanding Students' Judgment in AI-Supported Design Work

arXiv.org Artificial Intelligence

As generative AI tools become integrated into design workflows, students increasingly engage with these tools not just as aids, but as collaborators. This study analyzes reflections from 33 student teams in an HCI design course to examine the kinds of judgments students make when using AI tools. We found both established forms of design judgment (e.g., instrumental, appreciative, quality) and emergent types: agency-distribution judgment and reliability judgment. These new forms capture how students negotiate creative responsibility with AI and assess the trustworthiness of its outputs. Our findings suggest that generative AI introduces new layers of complexity into design reasoning, prompting students to reflect not only on what AI produces, but also on how and when to rely on it. By foregrounding these judgments, we offer a conceptual lens for understanding how students engage in co-creative sensemaking with AI in design contexts.


Ethics and Persuasion in Reinforcement Learning from Human Feedback: A Procedural Rhetorical Approach

arXiv.org Artificial Intelligence

Since 2022, versions of generative AI chatbots such as ChatGPT and Claude have been trained using a specialized technique called Reinforcement Learning from Human Feedback (RLHF) to fine-tune language model output using feedback from human annotators. As a result, the integration of RLHF has greatly enhanced the outputs of these large language models (LLMs) and made the interactions and responses appear more "human-like" than those of previous versions using only supervised learning. The increasing convergence of human and machine-written text has potentially severe ethical, sociotechnical, and pedagogical implications relating to transparency, trust, bias, and interpersonal relations. To highlight these implications, this paper presents a rhetorical analysis of some of the central procedures and processes currently being reshaped by RLHF-enhanced generative AI chatbots: upholding language conventions, information seeking practices, and expectations for social relationships. Rhetorical investigations of generative AI and LLMs have, to this point, focused largely on the persuasiveness of the content generated. Using Ian Bogost's concept of procedural rhetoric, this paper shifts the site of rhetorical investigation from content analysis to the underlying mechanisms of persuasion built into RLHF-enhanced LLMs. In doing so, this theoretical investigation opens a new direction for further inquiry in AI ethics that considers how procedures rerouted through AI-driven technologies might reinforce hegemonic language use, perpetuate biases, decontextualize learning, and encroach upon human relationships. It will therefore be of interest to educators, researchers, scholars, and the growing number of users of generative AI chatbots.


Elon Musk's Grok AI Can't Stop Talking About 'White Genocide'

WIRED

A chatbot developed by Elon Musk's multibillion-dollar artificial intelligence startup xAI appeared to be suffering from a glitch Wednesday when it repeatedly brought up white genocide in South Africa in response to user queries about unrelated topics on X. Grok, which competes with other chatbots like OpenAI's ChatGPT, is directly integrated into the social media platform that Musk also owns. Numerous examples of the phenomenon could be found by searching the official Grok profile for posts containing the term "boer," a word used to refer to people from South Africa of "Dutch, German, or Huguenot descent." It is sometimes used by Black South Africans as a pejorative against white Afrikaners, or people associated with the apartheid regime. In response to topics ranging from streaming platform HBO Max's name change to Medicaid cuts proposed by US lawmakers, the chatbot often seemed to initially stay on topic before veering back to white genocide in South Africa, completely unprompted. When asked to confirm the salary of Toronto Blue Jays player Max Scherzer, for example, the generative artificial intelligence chatbot launched into an explanation of white genocide and a controversial South African anti-apartheid song.


Move over, Copilot! ChatGPT can now analyze OneDrive files in real time

PCWorld

In addition to gobbling up most of the internet, ChatGPT now wants access to your OneDrive and SharePoint files, too. One of the earliest uses of AI was to summarize documents and folders of documents, and there's only so many times you can ask it whether Spider-Man would beat Wonder Woman in a fair fight. It would be more productive for AI to collate and make sense of your own personal information, assuming you want to grant access to it. According to OpenAI, ChatGPT can now connect to your OneDrive or SharePoint document libraries, assuming you're a paid ChatGPT Plus, Pro, or Team user who lives outside the EEA, Switzerland, and the UK (via Windows Central). You'll obviously have to connect ChatGPT and give it permission to start poring over your cloud documents.


Sumitomo Mitsui, SoftBank to tie up on digital payment services

The Japan Times

Sumitomo Mitsui Financial Group and mobile carrier SoftBank will collaborate in the field of digital payment services, it was learned Wednesday. Under the partnership, the PayPay smartphone payment service operated by a SoftBank affiliate will be made available via the Olive general financial app, provided by Sumitomo Mitsui Banking, the core unit of the financial group. Sumitomo Mitsui, through its Sumitomo Mitsui Card unit, will form a comprehensive partnership with SoftBank and PayPay that will be announced soon. The two sides will allow points in their respective reward programs to be exchanged. They will also collaborate on the use of data and generative artificial intelligence.


SoftBank profit doubles as AI demand boosts chip sales and startups

The Japan Times

SoftBank reported a 124% jump in quarterly profit on resilient AI demand that's supporting startup valuations and chip unit sales, a boost to its aggressive data center investment plans. The Tokyo-based company reported net income of ¥517.18 billion ($3.5 billion) in its fiscal fourth quarter. It was helped by the Vision Fund, which swung to a profit of ¥26.1 billion mainly on a surge in the value of TikTok owner ByteDance and its strong international sales. The earnings come at a critical juncture for SoftBank as it plans to invest $30 billion in OpenAI while leading a $100 billion foray into building AI hardware in the United States. Maintaining a healthy cash flow and balance sheet is key to securing the billions of dollars needed at minimum cost.


CAD-Coder: Text-Guided CAD Files Code Generation

arXiv.org Artificial Intelligence

Computer-aided design (CAD) is a way to digitally create 2D drawings and 3D models of real-world products. Traditional CAD typically relies on hand-drawing by experts or modifications of existing library files, which doesn't allow for rapid personalization. With the emergence of generative artificial intelligence, convenient and efficient personalized CAD generation has become possible. However, existing generative methods typically produce outputs that lack interactive editability and geometric annotations, limiting their practical applications in manufacturing. To enable interactive generative CAD, we propose CAD-Coder, a framework that transforms natural language instructions into CAD script codes, which can be executed in Python environments to generate human-editable CAD files (.Dxf). To facilitate the generation of editable CAD sketches with annotation information, we construct a comprehensive dataset comprising 29,130 Dxf files with their corresponding script codes, where each sketch preserves both editability and geometric annotations. We evaluate CAD-Coder on various 2D/3D CAD generation tasks against existing methods, demonstrating superior interactive capabilities while uniquely providing editable sketches with geometric annotations.
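
To give a sense of what a "CAD script code" that runs in Python and yields an editable DXF file might look like, here is a minimal sketch that writes a rectangle as four LINE entities in ASCII DXF. It is not CAD-Coder's output format or toolchain, which the abstract does not describe; the helper names `line_entity` and `rectangle_dxf` are hypothetical, and the file contains only an ENTITIES section, which many DXF readers accept but some stricter tools may reject.

```python
def line_entity(x1: float, y1: float, x2: float, y2: float, layer: str = "0") -> list[str]:
    """Return the DXF group code/value pairs for one LINE entity."""
    return [
        "0", "LINE",         # entity type
        "8", layer,          # layer name
        "10", f"{x1:.3f}",   # start point x
        "20", f"{y1:.3f}",   # start point y
        "11", f"{x2:.3f}",   # end point x
        "21", f"{y2:.3f}",   # end point y
    ]


def rectangle_dxf(width: float, height: float) -> str:
    """Build a minimal ASCII DXF document containing a width x height rectangle."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    records = ["0", "SECTION", "2", "ENTITIES"]
    for i, (x1, y1) in enumerate(corners):
        x2, y2 = corners[(i + 1) % len(corners)]  # wrap around to close the shape
        records += line_entity(x1, y1, x2, y2)
    records += ["0", "ENDSEC", "0", "EOF"]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Hypothetical instruction: "draw a 40 by 20 rectangle".
    with open("rectangle.dxf", "w", encoding="ascii") as handle:
        handle.write(rectangle_dxf(40, 20))
```

Because each edge is an ordinary LINE entity, the result stays editable in a CAD program, which is the property the abstract emphasizes. Geometric annotations such as dimensions would need additional entity types that this sketch leaves out.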


Evaluating LLM Metrics Through Real-World Capabilities

arXiv.org Artificial Intelligence

As generative AI becomes increasingly embedded in everyday workflows, it is important to evaluate its performance in ways that reflect real-world usage rather than abstract notions of intelligence. Unlike many existing benchmarks that assess general intelligence, our approach focuses on real-world utility, evaluating how well models support users in everyday tasks. While current benchmarks emphasize code generation or factual recall, users rely on AI for a much broader range of activities, from writing assistance and summarization to citation formatting and stylistic feedback. In this paper, we analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs): Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval. We then assess the extent to which existing benchmarks cover these capabilities, revealing significant gaps in coverage, efficiency measurement, and interpretability. Drawing on this analysis, we use human-centered criteria to identify gaps in how well current benchmarks reflect common usage that is grounded in five practical criteria: coherence, accuracy, clarity, relevance, and efficiency. For four of the six capabilities, we identify the benchmarks that best align with real-world tasks and use them to compare leading models. We find that Google Gemini outperforms other models (including OpenAI's GPT, xAI's Grok, Meta's LLaMA, Anthropic's Claude, DeepSeek, and Qwen from Alibaba) on these utility-focused metrics.