mistral
Leave big tech behind! How to replace Amazon, Google, X, Meta, Apple – and more
Switching to big tech alternatives is easier than you might imagine. There's not much to love about big tech these days. So many ills can be laid at its door: social media harms, misinformation, polarisation, mining and misuse of personal data, environmental negligence, tax avoidance; the list goes on. Added to which, Silicon Valley's leaders seem all too keen to cosy up to the Trump administration, to shower the president with bribes - sorry, gifts - and remain silent about his worsening political overreach. And that's before we get to the rampant "enshittification", as the tech writer Cory Doctorow describes it, which means that by design many big tech products have become less useful and more extractive than they were when we originally signed up to them.
- Europe > United Kingdom (0.48)
- North America > United States > California (0.24)
- Europe > France (0.06)
- (10 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.94)
- (2 more...)
- North America > United States > West Virginia (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
Mistral's New Ultra-Fast Translation Model Gives Big AI Labs a Run for Their Money
"Too many GPUs makes you lazy," says the French startup's vice president of science operations, as the company carves out a different path than the major US AI companies. Mistral AI has released a new family of AI models that it claims will clear the path to seamless conversation between people speaking different languages. On Wednesday, the Paris-based AI lab released two new speech-to-text models: Voxtral Mini Transcribe V2 and Voxtral Realtime. The former is built to transcribe audio files in large batches, the latter for near real-time transcription within 200 milliseconds; both can translate between 13 languages. Voxtral Realtime is freely available under an open-source license.
- North America > United States > California (0.16)
- North America > United States > Vermont (0.05)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.05)
- (2 more...)
Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
Large language models are increasingly used to evaluate other models, yet these judgments typically lack any representation of confidence. This pilot study tests whether framing an evaluation task as a betting game (a fictional prediction market with its own LLM currency) improves forecasting accuracy and surfaces calibrated confidence signals. We generated 100 math and logic questions with verifiable answers. Six Baseline models (three current-generation, three prior-generation) answered all items. Three Predictor models then forecast, for each question-baseline pair, whether the baseline would answer correctly. Each predictor completed matched runs in two conditions: Control (simple correct/incorrect predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin under even odds, starting from a 1,000,000 LLMCoin bankroll). Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p = .089, d = 0.86) and significantly faster learning across rounds (12.0 vs. 2.9 percentage-point improvement from Round 1 to Round 4, p = .011). Most notably, stake size tracked confidence. "Whale" bets of 40,000+ coins were correct ~99% of the time, while small bets (<1,000 coins) showed only ~74% accuracy. The key finding is not that fictional money makes models smarter; accuracy gains were modest and did not reach statistical significance (p = .089) in this pilot. Rather, the betting mechanic created a legible confidence signal absent from binary yes/no outputs. This suggests that simple financial framing may help transform LLMs into risk-aware forecasters, making their internal beliefs visible and usable. The protocol offers a foundation for future work on meta-evaluation systems and what may become LLM-to-LLM prediction markets.
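The wagering mechanic described in the abstract can be sketched as a simple even-odds settlement rule. This is an illustrative reconstruction, not the authors' code: the function names and example stakes are assumptions, with only the bet range (1-100,000 LLMCoin), even odds, and 1,000,000 LLMCoin starting bankroll taken from the study's description.

```python
# Sketch of the "Incentive" condition: each prediction carries a wager at
# even odds; the stake is won if the prediction matches the outcome and
# lost otherwise. Constants follow the abstract; everything else is toy.

START_BANKROLL = 1_000_000
MIN_BET, MAX_BET = 1, 100_000

def settle(bankroll: int, wager: int, prediction: bool, outcome: bool) -> int:
    """Apply one even-odds bet to the bankroll."""
    if not MIN_BET <= wager <= min(MAX_BET, bankroll):
        raise ValueError("wager outside allowed range")
    return bankroll + wager if prediction == outcome else bankroll - wager

bankroll = START_BANKROLL
# (prediction, actual outcome, stake): two confident "whale" bets and one
# small hedged bet that turns out wrong.
for pred, actual, stake in [(True, True, 40_000), (False, True, 500), (True, True, 40_000)]:
    bankroll = settle(bankroll, stake, pred, actual)

print(bankroll)  # 1_000_000 + 40_000 - 500 + 40_000 = 1_079_500
```

The point of the mechanic is visible in the stake sizes: under even odds, a rational bettor stakes more only when more confident, so the wager itself becomes the calibration signal the Control condition lacks.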
Large Language Model-Based Generation of Discharge Summaries
Rodrigues, Tiago, Lopes, Carla Teixeira
Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
- (6 more...)
Mind Reading or Misreading? LLMs on the Big Five Personality Test
Di Cursi, Francesco, Boldrini, Chiara, Conti, Marco, Passarella, Andrea
We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). Five models -- including GPT-4 and lightweight open-source alternatives -- are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Assessing the Capability of LLMs in Solving POSCOMP Questions
Viegas, Cayo, Gheyi, Rohit, Ribeiro, Márcio
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination promoted by the Brazilian Computer Society (SBC) and used for graduate admissions in computer science, provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
The POSCOMP [1] is a prestigious assessment designed to test the knowledge of prospective computer science graduate students, promoted by the Brazilian Computer Society (SBC). It serves as an entry criterion for many graduate programs across Brazil. Using this exam as a benchmark for evaluating Large Language Models (LLMs) allows for a direct comparison between AI capabilities and human standards, offering valuable insights into the strengths and limitations of current AI models. Recent advancements in LLMs [2], [3] have significantly expanded the capabilities of Artificial Intelligence (AI), particularly in natural language processing tasks.
- South America > Brazil (0.25)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- South America > Peru (0.04)
- (2 more...)
- Education > Assessment & Standards (0.46)
- Education > Educational Setting > Higher Education (0.34)
From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation
Chebrolu, Niranjan, Yeo, Gerard Christopher, Jaidka, Kokil
Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, locating a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- Asia > Singapore (0.04)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.66)
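The core steering-vector construction in the abstract above — a direction derived from the difference in activations over contrastive text pairs, then added at inference — can be illustrated with toy numbers. This is a minimal sketch under stated assumptions: real use operates on LLaMA 3.1-8B hidden states at the layer identified by attribution patching, while the 4-dimensional vectors and the scale factor here are illustrative stand-ins.

```python
# Toy illustration of contrastive activation steering: the steering vector
# is the difference between mean activations for positive vs. negative
# examples of a target emotion, added (scaled) to a hidden state at
# inference time. All values are synthetic.

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(pos_acts, neg_acts):
    """Mean activation over positive examples minus mean over negative ones."""
    pos, neg = mean_vec(pos_acts), mean_vec(neg_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, vec, scale=1.0):
    """Add the scaled steering vector to one hidden-state vector."""
    return [h + scale * v for h, v in zip(hidden, vec)]

# Synthetic layer activations for "joyful" vs. "neutral" prompts
pos = [[1.0, 0.2, 0.0, 0.5], [0.8, 0.4, 0.2, 0.7]]
neg = [[0.1, 0.2, 0.0, 0.1], [0.3, 0.0, 0.2, 0.3]]

v = steering_vector(pos, neg)                  # ~[0.7, 0.2, 0.0, 0.4]
steered = steer([0.0, 0.0, 0.0, 0.0], v, 2.0)  # hidden state pushed toward "joy"
```

In practice the vector would be captured and injected via forward hooks on the chosen transformer layer; the arithmetic, however, is exactly this simple.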
Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations
Shailya, Krithi, Mishra, Akhilesh Kumar, Krishnan, Gokul S, Ravindran, Balaraman
Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.15)
- Asia > India (0.05)
- Africa > Nigeria (0.05)
- (51 more...)
- Information Technology (1.00)
- Health & Medicine (1.00)
- Education > Educational Setting > Higher Education (0.89)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Casademunt, Helena, Juang, Caden, Karvonen, Adam, Marks, Samuel, Rajamanoharan, Senthooran, Nanda, Neel
Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.65)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Law (0.92)
- Banking & Finance (0.67)
- Energy (0.67)
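The mechanism at the heart of CAFT — ablating an undesired concept direction with a linear projection during fine-tuning — reduces to standard orthogonal projection. The sketch below is an assumption-laden toy (3-dimensional vectors, a single hand-picked direction), not the authors' implementation; in CAFT the directions live in an LLM's latent space and the projection is applied to activations throughout fine-tuning.

```python
# Linear-projection concept ablation: remove the component of an
# activation along an undesired concept direction, so fine-tuning
# gradients cannot exploit that direction. Toy values throughout.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate(hidden, direction):
    """Project `hidden` onto the subspace orthogonal to `direction`."""
    coeff = dot(hidden, direction) / dot(direction, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]

concept = [0.0, 1.0, 0.0]       # hypothetical undesired concept direction
h = [0.5, 2.0, -1.0]            # an activation seen during fine-tuning

h_ablated = ablate(h, concept)  # [0.5, 0.0, -1.0]

# After ablation the activation carries no signal along the concept axis.
assert abs(dot(h_ablated, concept)) < 1e-9
```

Because the projection only zeroes one direction, the rest of the representation is untouched — which is consistent with the abstract's claim that training-distribution performance is preserved while the unintended generalization is suppressed.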