Goto

Collaborating Authors

 Media


Multi-Hop Question Generation via Dual-Perspective Keyword Guidance

arXiv.org Artificial Intelligence

Multi-hop question generation (MQG) aims to generate questions that require synthesizing multiple information snippets from documents to derive target answers. The primary challenge lies in effectively pinpointing crucial information snippets related to question-answer (QA) pairs, typically relying on keywords. However, existing works fail to fully utilize the guiding potential of keywords and neglect to differentiate the distinct roles of question-specific and document-specific keywords. To address this, we define dual-perspective keywords (i.e., question and document keywords) and propose a Dual-Perspective Keyword-Guided (DPKG) framework, which seamlessly integrates keywords into the multi-hop question generation process. We argue that question keywords capture the questioner's intent, whereas document keywords reflect the content related to the QA pair. Functionally, question and document keywords work together to pinpoint essential information snippets in the document, with question keywords required to appear in the generated question. The DPKG framework consists of an expanded transformer encoder and two answer-aware transformer decoders for keyword and question generation, respectively. Extensive experiments demonstrate the effectiveness of our work, showcasing its promising performance and underscoring its significant value in the MQG task.


AI, bot farms and innocent indie victims: how music streaming became a hotbed of fraud and fakery

The Guardian

There is a battle gripping the music business today around the manipulation of streaming services โ€“ and innocent indie artists are the collateral damage. Fraudsters are flooding Spotify, Apple Music and the rest with AI-generated tracks, to try and hoover up the royalties generated by people listening to them. These tracks are cheap, quick and easy to make, with Deezer estimating in April that over 20,000 fully AI-created tracks โ€“ that's 18% of new tracks โ€“ were being ingested into its platform daily, almost double the number in January. The fraudsters often then use bots, AI or humans to endlessly listen to these fake songs and generate revenue, while others are exploiting upload services to get fake songs put on real artists' pages and siphon off royalties that way. Spotify fines the worst offenders and says it puts "significant engineering resources and research into detecting, mitigating, and removing artificial streaming activity", while Apple Music claims "less than 1% of all streams are manipulated" on its service.


Will AI wipe out the first rung of the career ladder?

The Guardian

This week, I'm wondering what my first jobs in journalism would have been like had generative AI been around. In other news: Elon Musk leaves a trail of chaos, and influencers are selling the text they fed to AI to make art. Generative artificial intelligence may eliminate the job you got with your diploma still in hand, say executives who offered grim assessments of the entry-level job market last week in multiple forums. Dario Amodei, CEO of Anthropic, which makes the multifunctional AI model Claude, told Axios last week that he believes that AI could cut half of all entry-level white-collar jobs and send overall unemployment rocketing to 20% within the next five years. One explanation why an AI company CEO might make such a dire prediction is to hype the capabilities of his product.


Google's New AI Tool Generates Convincing Deepfakes of Riots, Conflict, and Election Fraud

TIME - Tech

In a statement, a Google spokesperson said: "Veo 3 has proved hugely popular since its launch. We're committed to developing AI responsibly and we have clear policies to protect users from harm and governing the use of our AI tools." Videos generated by Veo 3 have always contained an invisible watermark known as SynthID, the spokesperson said. Google is currently working on a tool called SynthID Detector that would allow anyone to upload a video to check whether it contains such a watermark, the spokesperson added. However, this tool is not yet publicly available.


'Nobody wants a robot to read them a story!' The creatives and academics rejecting AI โ€“ at work and at home

The Guardian

The novelist Ewan Morrison was alarmed, though amused, to discover he had written a book called Nine Inches Pleases a Lady. Intrigued by the limits of generative artificial intelligence (AI), he had asked ChatGPT to give him the names of the 12 novels he had written. "I've only written nine," he says. "Always eager to please, it decided to invent three." The "nine inches" from the fake title it hallucinated was stolen from a filthy Robert Burns poem.


CiteEval: Principle-Driven Citation Evaluation for Source Attribution

arXiv.org Artificial Intelligence

Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.


Reconsidering LLM Uncertainty Estimation Methods in the Wild

arXiv.org Artificial Intelligence

Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.


An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3

arXiv.org Artificial Intelligence

Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0), and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of reviews generated. We review the LLM-based movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency in comparison to IMDB user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We provided a survey-based analysis where participants were told to distinguish between LLM and IMDb user reviews. The results show that LLM-generated reviews are difficult to distinguish from IMDB user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.


SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

arXiv.org Artificial Intelligence

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance.


Optimizing Storytelling, Improving Audience Retention, and Reducing Waste in the Entertainment Industry

arXiv.org Artificial Intelligence

Television networks face high financial risk when making programming decisions, often relying on limited historical data to forecast episodic viewership. This study introduces a machine learning framework that integrates natural language processing (NLP) features from over 25000 television episodes with traditional viewership data to enhance predictive accuracy. By extracting emotional tone, cognitive complexity, and narrative structure from episode dialogue, we evaluate forecasting performance using SARIMAX, rolling XGBoost, and feature selection models. While prior viewership remains a strong baseline predictor, NLP features contribute meaningful improvements for some series. We also introduce a similarity scoring method based on Euclidean distance between aggregate dialogue vectors to compare shows by content. Tested across diverse genres, including Better Call Saul and Abbott Elementary, our framework reveals genre-specific performance and offers interpretable metrics for writers, executives, and marketers seeking data-driven insight into audience behavior.