Creating a Public Repository for Joining Private Data
How can one publish a dataset with sensitive attributes in a way that both preserves privacy and enables joins with other datasets on those same sensitive attributes? This problem arises in many contexts, e.g., a hospital and an airline may want to jointly determine whether people who take long-haul flights are more likely to catch respiratory infections. If they join their data on a common user identifier such as email address, they can determine the answer, but doing so breaks privacy. This paper shows how the hospital can generate a private sketch and how the airline can privately join with the hospital's sketch by email address. The proposed solution satisfies pure differential privacy and gives approximate answers to linear queries and optimization problems over those joins. Whereas prior work such as secure function evaluation requires sender/receiver interaction, a distinguishing characteristic of the proposed approach is that it is non-interactive. Consequently, the sketch can be published to a repository for any organization to join with, facilitating data discovery. The accuracy of the method is demonstrated through both theoretical analysis and extensive empirical evidence.
- Health & Medicine (1.00)
- Transportation > Air (0.60)
- Information Technology > Security & Privacy (0.43)
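The hospital/airline protocol described in the abstract can be illustrated with a toy construction: the hospital publishes per-bucket counts of hashed email addresses with Laplace noise (a count query with sensitivity 1 satisfies pure differential privacy), and the airline reads the published sketch only at its own customers' buckets. This is a loose sketch of the general idea under stated assumptions, not the paper's actual construction; all function names and parameters below are invented.

```python
import hashlib
import math
import random

def bucket(email, width):
    # Stable hash of the join key (email) into a sketch index.
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()
    return int(digest, 16) % width

def laplace_noise(rng, scale):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def hospital_sketch(records, width=4096, epsilon=1.0, seed=0):
    """Publishable sketch: per-bucket counts of patients with the sensitive
    attribute, plus Laplace(1/epsilon) noise. Adding or removing one patient
    changes a single bucket by at most 1, so the noisy counts satisfy pure
    epsilon-differential privacy."""
    rng = random.Random(seed)
    sketch = [0.0] * width
    for email, has_attribute in records:
        if has_attribute:
            sketch[bucket(email, width)] += 1.0
    return [count + laplace_noise(rng, 1.0 / epsilon) for count in sketch]

def airline_join_count(sketch, customer_emails):
    # Non-interactive join: the airline reads the published sketch only at
    # the buckets of its own customers' hashed emails and sums the counts.
    width = len(sketch)
    buckets = {bucket(email, width) for email in customer_emails}
    return sum(sketch[i] for i in buckets)
```

Because the sketch is published once, any organization can run `airline_join_count` against it without further interaction; answers are approximate due to hash collisions and the injected noise.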
LLM Robustness Leaderboard v1: Technical Report
Peigné-Lefebvre, Pierre, Feuillade-Montixi, Quentin, David, Tom, Miailhe, Nicolas
This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce the PRISM Eval Behavior Elicitation Tool (BET), an AI system that performs automated red-teaming through Dynamic Adversarial Optimization and achieves a 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.
- North America > United States (0.92)
- Europe > France (0.04)
- North America > Canada (0.04)
- Asia > Singapore (0.04)
A Step-by-Step Guide to Creating a Robust Autonomous Drone Testing Pipeline
Jiang, Yupeng, Deng, Yao, Schroder, Sebastian, Liang, Linfeng, Gambhir, Suhaas, James, Alice, Seth, Avishkar, Pirrie, James, Zhang, Yihao, Zheng, Xi
Autonomous drones are rapidly reshaping industries ranging from aerial delivery and infrastructure inspection to environmental monitoring and disaster response. Ensuring the safety, reliability, and efficiency of these systems is paramount as they transition from research prototypes to mission-critical platforms. This paper presents a step-by-step guide to establishing a robust autonomous drone testing pipeline, covering each critical stage: Software-in-the-Loop (SIL) Simulation Testing, Hardware-in-the-Loop (HIL) Testing, Controlled Real-World Testing, and In-Field Testing. Using practical examples, including a marker-based autonomous landing system, we demonstrate how to systematically verify drone system behaviors, identify integration issues, and optimize performance. Furthermore, we highlight emerging trends shaping the future of drone testing, including the integration of neurosymbolic methods and LLMs, the creation of co-simulation environments, and Digital Twin-enabled simulation-based testing techniques. By following this pipeline, developers and researchers can achieve comprehensive validation, minimize deployment risks, and prepare autonomous drones for safe and reliable real-world operations.
- North America > United States > California (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Africa > Rwanda (0.04)
- (5 more...)
- Workflow (1.00)
- Instructional Material > Training Manual (0.61)
- Transportation > Air (1.00)
- Media (1.00)
- Aerospace & Defense > Aircraft (1.00)
- (4 more...)
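The staged pipeline in the abstract (SIL, HIL, controlled real-world, in-field) is naturally a fail-fast sequence of gates: a drone should not advance to a riskier stage with unresolved issues. The sketch below is an illustrative assumption about that gating logic, not code from the paper; the `Stage` and `run_pipeline` names are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage's checks pass

def run_pipeline(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run the testing stages in order, stopping at the first failure so a
    drone never advances past a stage with unresolved issues. Returns the
    names of passed stages and the first failing stage (or None)."""
    passed: List[str] = []
    for stage in stages:
        if not stage.run():
            return passed, stage.name  # fail fast before riskier stages
        passed.append(stage.name)
    return passed, None
```

A pipeline would then be declared in the paper's stage order, e.g. `run_pipeline([Stage("SIL", sil_checks), Stage("HIL", hil_checks), Stage("Controlled real-world", ctrl_checks), Stage("In-field", field_checks)])`, where each callable wraps that stage's test suite.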
THiNK: Can Large Language Models Think-aloud?
Yu, Yongan, Wu, Mengqian, Lin, Yiran, Lobczowski, Nikki G.
Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably well on lower-order categories, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science; its code is available at our GitHub repository.
- Education > Educational Setting (0.67)
- Education > Curriculum > Subject-Specific Education (0.66)
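The iterative generation-critique-revision task that THiNK describes can be sketched as a generic feedback loop. The function names, scoring convention, and stopping rule below are invented placeholders for illustration, not the framework's actual API.

```python
def think_loop(problem, generate, critique, revise, max_rounds=3, threshold=0.8):
    """Generic generate-critique-revise loop in the spirit of THiNK.
    `generate` drafts a solution, `critique` returns a (score, feedback)
    pair, and `revise` rewrites the draft using the feedback; the loop
    stops once the score clears the threshold or the round budget is
    spent. The history of (draft, score, feedback) triples supports the
    kind of per-round cognitive analysis the paper performs."""
    draft = generate(problem)
    history = []
    for _ in range(max_rounds):
        score, feedback = critique(draft)
        history.append((draft, score, feedback))
        if score >= threshold:
            break
        draft = revise(draft, feedback)
    return draft, history
```

In a multi-agent setting, `generate`, `critique`, and `revise` would each wrap a different LLM agent; the structured feedback loop is what the abstract credits with improving higher-order reasoning.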
Think Twice Before Creating That ChatGPT Action Figure
At the start of April, an influx of action figures started appearing on social media sites including LinkedIn and X. Each figure depicted the person who had created it with uncanny accuracy, complete with personalized accessories such as reusable coffee cups, yoga mats, and headphones. All this is possible because of OpenAI's new GPT-4o-powered image generator, which supercharges ChatGPT's ability to edit pictures, render text, and more. OpenAI's ChatGPT image generator can also create pictures in the style of Japanese animated film company Studio Ghibli, a trend that quickly went viral, too. The images are fun and easy to make: all you need is a free ChatGPT account and a photo.
- Media > Film (1.00)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.60)
Creating a Formally Verified Neural Network for Autonomous Navigation: An Experience Report
Bukhari, Syed Ali Asadullah, Flinkow, Thomas, Inkarbekov, Medet, Pearlmutter, Barak A., Monahan, Rosemary
The increased reliance of self-driving vehicles on neural networks opens up the challenge of their verification. In this paper we present an experience report, describing a case study which we undertook to explore the design and training of a neural network on a custom dataset for vision-based autonomous navigation. We are particularly interested in the use of machine learning with differentiable logics to obtain networks satisfying basic safety properties by design, guaranteeing the behaviour of the neural network after training. We motivate the choice of a suitable neural network verifier for our purposes and report our observations on the use of neural network verifiers for self-driving systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Ireland (0.04)
- Information Technology (1.00)
- Transportation > Ground > Road (0.69)
- Automobiles & Trucks (0.69)
Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
Chen, Yuyan, Wu, Chenwei, Yan, Songzhou, Liu, Panjun, Zhou, Haoyu, Xiao, Yanghua
Teachers are central to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate LLMs' questioning capability in education as teachers by assessing the educational questions they generate, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.
- Africa > Middle East > Egypt (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (23 more...)
- Research Report > New Finding (0.47)
- Personal > Interview (0.46)
- Instructional Material > Course Syllabus & Notes (0.34)
- Materials > Chemicals > Industrial Gases (1.00)
- Education > Assessment & Standards (0.66)
Interview with Henok Biadglign Ademtew: Creating an Amharic, Ge'ez and English parallel dataset
African languages are not well-represented in natural language processing (NLP). This is in large part due to a lack of resources for training models. Henok Biadglign Ademtew and Mikiyas Girma Birbo have created an Amharic, Ge'ez, and English parallel dataset to help advance research into low-resource languages. We spoke to Henok about this project, the creation of the dataset, and some of the challenges faced. Most of the languages in Africa are very low-resourced, and not much text data is available.
Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations
Pinhanez, Claudio, Fernandez, Raul, Grave, Marcelo, Nogima, Julio, Hoory, Ron
The scarcity of African American-sounding synthetic voices poses challenges for applications interested in targeting specific demographics (e.g., an African American business or NGO; a voice-tutoring system for children who are not of White ethnicity, etc.). The ultimate goal of the project described in this paper is to give designers, developers, and enterprises the choice of a professional voice which is clearly recognizable as African American, and therefore better able to address diversity and inclusiveness issues. More precisely, our goal is to create an African American Text-to-Speech system, which we will refer to simply as an African American voice or AA voice, able to produce synthetic audio segments from standard English texts, and which will be recognized by African American speakers and non-speakers alike as sounding like a native African American speaker. The AA voice should exhibit a level of technical quality similar to the Standard American English (SAE) synthetic voices currently available through professional platforms. The evaluation of the technical quality of the AA voice, however, is not addressed in this paper, which focuses primarily on whether the AA voice can be recognized as sounding like an African American speaker. Linguists [27, 28] have described a continuum of dialects under what is often termed African American Vernacular English (AAVE). At one end of the spectrum, one finds the largest deviation from SAE in terms of lexicon (including slang), syntax and morphology, and phonological/phonetic properties. At the other end, AAVE speakers begin to approach SAE in terms of lexicon and grammar but still retain marked speech characteristics (primarily in terms of intonation, phonation, and vowel placement [14, 28]) that grant the speech a distinctive identity, which listeners use as cues in the perception of African American English [44].
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > South Carolina > Greenville County > Greenville (0.06)
- North America > United States > New York > New York County > New York City (0.04)
- (31 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)