Goto

Collaborating Authors

 judgment


I Am Begging AI Companies to Stop Naming Features After Human Processes

WIRED

Anthropic announced "dreaming" for AI agents to sort through "memories" at its developer conference. Anthropic just announced a new feature called "dreaming" at the company's developer conference in San Francisco. It's part of Anthropic's recently launched AI agent infrastructure designed to help users manage and deploy tools that automate software processes. This "dreaming" aspect sorts through the transcript of what an agent recently completed and attempts to glean insights to improve the agent's performance. Folks using AI agents often send them on multistep journeys, like visiting a few websites or reading multiple files, to complete online tasks.


Noisy Nonreciprocal Pairwise Comparisons: Scale Variation, Noise Calibration, and Admissible Ranking Regions

Magnot, Jean-Pierre

arXiv.org Machine Learning

Pairwise comparisons are widely used in decision analysis, preference modeling, and evaluation problems. In many practical situations, the observed comparison matrix is not reciprocal. This lack of reciprocity is often treated as a defect to be corrected immediately. In this article, we adopt a different point of view: part of the nonreciprocity may reflect a genuine variation in the evaluation scale, while another part is due to random perturbations. We introduce an additive model in which the unknown underlying comparison matrix is consistent but not necessarily reciprocal. The reciprocal component carries the global ranking information, whereas the symmetric component describes possible scale variation. Around this structured matrix, we add a random perturbation and show how to estimate the noise level, assess whether the scale variation remains moderate, and assign probabilities to admissible ranking regions in the sense of strict ranking by pairwise comparisons. We also compare this approach with the brutal projection onto reciprocal matrices, which suppresses all symmetric information at once. The Gaussian perturbation model is used here not because human decisions are exactly Gaussian, but because observed judgment errors often result from the accumulation of many small effects. In such a context, the central limit principle provides a natural heuristic justification for Gaussian noise. This makes it possible to derive explicit estimators and probability assessments while keeping the model interpretable for decision problems.


Schools are using AI counselors to track students' mental health. Is it safe?

The Guardian

'You can't replace human connection, human judgment,' warns Sarah Caliboso-Soto, a licensed clinical social worker. 'You can't replace human connection, human judgment,' warns Sarah Caliboso-Soto, a licensed clinical social worker. Schools are using AI counselors to track students' mental health. As hundreds of schools implement an automated monitoring tool, educators say that students can find talking to a chatbot'more natural' than confiding in a human The alert came around 7pm. Brittani Phillips checked her phone. A middle school counselor in Putnam county, Florida, Phillips receives messages from an artificial intelligence-enabled therapy platform that students use during nonschool hours.


The robots who predict the future

MIT Technology Review

Three books unpack our infatuation with prediction, and what we lose when we outsource this task to machines. To be human is, fundamentally, to be a forecaster. Trying to see the future, whether through the lens of past experience or the logic of cause and effect, has helped us hunt, avoid hunted, plant crops, forge social bonds, and in general survive in a world that does not prioritize our survival. Indeed, as the tools of divination have changed over the centuries, from tea leaves to data sets, our conviction that the future can be known (and therefore controlled) has only grown stronger. Today, we are awash in a sea of predictions so vast and unrelenting that most of us barely even register them. As I write this sentence, algorithms on some remote server are busy trying to guess my next word based on those I have already typed.



Supplementary File for ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Evaluation Capability for Large Vision-Language Models

Neural Information Processing Systems

We calculate the agreement of human judgment and our automatic evaluation (i.e., ConvBenchEval()) and find it reaches 81.83% (seeing Table 3 - 6 for detailed agreement of each turn of overall). It demonstrates the effectiveness of ConvBenchEval(), which uses ChatGPT. The agreement between ChatGPT and GPT4 is very high at 87.38%. It demonstrates that using different LLMs as judges slightly influences the evaluation results. ConvBenchEval() armed with ChatGPT can is reliable and low-cost. From the above tables, we also observe that though GPT4V is expensive and can capture images, its judgment performs worse than GPT4's judgment.



Author Contributions

Neural Information Processing Systems

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective In this appendix, we will derive Eq. 4. Analogously to Eq. 3, we optimize the following objective: max