florence
- North America > United States > Oregon (0.28)
- North America > United States > Alaska (0.05)
- North America > United States > Massachusetts (0.05)
- (2 more...)
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.72)
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.71)
Skewed Score: A statistical framework to assess autograders
Dubois, Magda, Coppock, Harry, Giulianelli, Mario, Flesch, Timo, Luettgau, Lennart, Ududec, Cozmin
The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
A man stalked a professor for six years. Then he used AI chatbots to lure strangers to her home
A man from Massachusetts has agreed to plead guilty to a seven-year cyberstalking campaign that included using artificial intelligence (AI) chatbots to impersonate a university professor and invite men online to her home address for sex. James Florence, 36, used platforms such as CrushOn.ai and JanitorAI, which allow users to design their own chatbots and direct them how to respond to other users during chats, including in sexually suggestive and explicit ways, according to court documents seen by the Guardian. The victim's identity has been kept confidential by law enforcement officials. Florence admitted to using the victim's personal and professional information – including her home address, date of birth and family information – to instruct the chatbots to impersonate her and engage in sexual dialogue with users, per court filings. He told the chatbots to answer "yes" in the guise of his victim when a user asked whether she was sexually adventurous and fed the AI responses about what underwear she liked to wear.
Navigation services amplify concentration of traffic and emissions in our cities
Cornacchia, Giuliano, Nanni, Mirco, Pedreschi, Dino, Pappalardo, Luca
The proliferation of human-AI ecosystems involving human interaction with algorithms, such as assistants and recommenders, raises concerns about large-scale social behaviour. Despite evidence of such phenomena across several contexts, the collective impact of GPS navigation services remains unclear: while beneficial to the user, they can also cause chaos if too many vehicles are driven through the same few roads. Our study employs a simulation framework to assess navigation services' influence on road network usage and CO2 emissions. The results demonstrate a universal pattern of amplified conformity: increasing adoption rates of navigation services cause a reduction of route diversity of mobile travellers and increased concentration of traffic and emissions on fewer roads, thus exacerbating an unequal distribution of negative externalities on selected neighbourhoods. Although navigation services recommendations can help reduce CO2 emissions when their adoption rate is low, these benefits diminish or even disappear when the adoption rate is high and exceeds a certain city- and service-dependent threshold. We summarize these discoveries in a non-linear function that connects the marginal increase of conformity with the marginal reduction in CO2 emissions. Our simulation approach addresses the challenges posed by the complexity of transportation systems and the lack of data and algorithmic transparency.
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Ohio (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (1.00)
- Consumer Products & Services > Travel (0.94)
Popularity-based Alternative Routing
Cornacchia, Giuliano, Lemma, Ludovico, Pappalardo, Luca
Alternative routing is crucial to minimize the environmental impact of urban transportation while enhancing road network efficiency and reducing traffic congestion. Existing methods neglect information about road popularity, possibly leading to unintended consequences such as increasing emissions and congestion. This paper introduces Polaris, an alternative routing algorithm that exploits road popularity to optimize traffic distribution and reduce CO2 emissions. Polaris leverages the novel concept of K-road layers, which mitigates the feedback loop effect where redirecting vehicles to less popular roads could increase their popularity in the future. We conduct experiments in three cities to evaluate Polaris against state-of-the-art alternative routing algorithms. Our results demonstrate that Polaris significantly reduces the overuse of highly popular road edges and traversed regulated intersections, showcasing its ability to generate efficient routes and distribute traffic more evenly. Furthermore, Polaris achieves substantial CO2 reductions, outperforming existing alternative routing strategies. Finally, we compare Polaris to an algorithm that coordinates vehicles centrally to distribute them more evenly on the road network. Our findings reveal that Polaris performs comparably well, even with much less information, highlighting its potential as an efficient and sustainable solution for urban traffic management.
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > New York (0.04)
- (2 more...)
- Transportation > Infrastructure & Services (0.73)
- Transportation > Ground > Road (0.73)
- Government > Regional Government (0.68)
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
Zheng, Huaixiu Steven, Mishra, Swaroop, Zhang, Hugh, Chen, Xinyun, Chen, Minmin, Nova, Azade, Hou, Le, Cheng, Heng-Tze, Le, Quoc V., Chi, Ed H., Zhou, Denny
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve solve rates of 31.1% and 34.8%, respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts on improving LLM planning.
- Europe > Finland > Uusimaa > Helsinki (0.08)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.05)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Measuring the perception of the personalized activities with CloudIA robot
Sorrentino, Alessandra, Fiorini, Laura, La Viola, Carlo, Cavallo, Filippo
Socially Assistive Robots represent a valid solution for improving the quality of life and the mood of older adults. In this context, this work presents the CloudIA robot, a non-human-like robot intended to promote sociality and well-being among older adults. The design of the robot and of the provided services was carried out by a multidisciplinary team of designers and technology developers in tandem with professional caregivers. The capabilities of the robot were implemented according to the received guidelines and tested with 15 older people in two nursing facilities. Qualitative and quantitative metrics were used to investigate the engagement of the participants during the interaction with the robot, and to investigate any differences in the interaction during the proposed activities. The results highlighted the general tendency of humanizing the robotic platform and demonstrated the feasibility of introducing the CloudIA robot in support of the professional caregivers' work. From this pilot test, further ideas on improving the personalization of the robotic platform emerged.
Leveraging Citizen Science for Flood Extent Detection using Machine Learning Benchmark Dataset
Ramasubramanian, Muthukumaran, Gurung, Iksha, Gahlot, Shubhankar, Hänsch, Ronny, Molthan, Andrew L., Maskey, Manil
Accurate detection of inundated water extents during flooding events is crucial in emergency response decisions and aids in recovery efforts. Satellite Remote Sensing data provides a global framework for detecting flooding extents. Specifically, Sentinel-1 C-Band Synthetic Aperture Radar (SAR) imagery has proven to be useful in detecting water bodies due to the low backscatter of water features in both co-polarized and cross-polarized SAR imagery. However, increased backscatter can be observed in certain flooded regions, such as those containing infrastructure and trees, rendering simple methods such as pixel intensity thresholding and time-series differencing inadequate. Machine Learning techniques have been leveraged to precisely capture flood extents in flooded areas with bumps in backscatter, but they need large amounts of labeled data to work well. Hence, we created a labeled dataset of known water body extents and flooded area extents during known flooding events, covering about 36,000 sq. kilometers of regions within the mainland U.S. and Bangladesh. Further, we leveraged citizen science by open-sourcing the dataset and hosting an open competition based on the dataset to rapidly prototype flood extent detection using community-generated models. In this paper we present information about the dataset, the data processing pipeline, a baseline model and the details of the competition, along with a discussion of the winning approaches. We believe the dataset adds to already existing datasets based on Sentinel-1C SAR data and leads to more robust modeling of flood extents. We also hope the results from the competition push the research in flood extent detection further.
- Asia > Bangladesh (0.25)
- North America > United States > Alabama > Lauderdale County > Florence (0.15)
- North America > United States > Missouri (0.06)
- (6 more...)
- Health & Medicine (0.74)
- Government > Space Agency (0.47)