Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping
Snyder, David, Hancock, Asher James, Badithela, Apurva, Dixon, Emma, Miller, Patrick, Ambrus, Rares Andrei, Majumdar, Anirudha, Itkina, Masha, Nishimura, Haruki
Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to significant human effort and limited inference throughput of policies. This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. Prior work in statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and lacks flexibility in adapting the sample size to the observed evaluation data. Furthermore, extending the test with additional trials risks inducing inadvertent p-hacking, undermining statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether or not to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulation and real-world robot manipulation experiments show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials. Specifically, it reduces the number of evaluation trials by up to 40% as compared to state-of-the-art baselines, while preserving the probabilistic correctness and statistical power of the comparison. Moreover, our method is strongest in the most challenging comparison instances (requiring the most evaluation trials); in a multi-task comparison scenario, we save the evaluator more than 200 simulation rollouts.
- North America > United States (0.28)
- Europe > Germany (0.14)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
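To illustrate the idea of sequential testing described in the abstract above, here is a minimal sketch of a classical Wald sequential probability ratio test (SPRT) applied to paired success/failure trials of two policies. This is not the authors' test; it is a textbook stand-in. The function name `sprt_paired`, the alternative win probability `p1`, and the default error levels are all assumptions chosen for illustration. Only discordant pairs (exactly one policy succeeds) update the statistic, and Wald's thresholds control the error rates only approximately.

```python
import math


def sprt_paired(outcomes_a, outcomes_b, p1=0.75, alpha=0.05, beta=0.2):
    """Wald SPRT on paired binary trials (illustrative, not the paper's method).

    H0: on a discordant pair, policy A is the winner with probability 0.5.
    H1: on a discordant pair, policy A is the winner with probability p1 > 0.5.
    Returns (decision, trials_used), where decision is one of
    "accept_h1", "accept_h0", or "continue".
    """
    upper = math.log((1 - beta) / alpha)  # crossing above: accept H1
    lower = math.log(beta / (1 - alpha))  # crossing below: accept H0
    llr = 0.0  # running log-likelihood ratio
    n = 0
    for n, (a, b) in enumerate(zip(outcomes_a, outcomes_b), start=1):
        if bool(a) == bool(b):
            continue  # concordant pairs have likelihood ratio 1 and add nothing
        if a:
            llr += math.log(p1 / 0.5)        # A succeeded, B failed
        else:
            llr += math.log((1 - p1) / 0.5)  # B succeeded, A failed
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n  # data exhausted without crossing a threshold


if __name__ == "__main__":
    # Policy A succeeds on every trial and policy B fails on every trial.
    print(sprt_paired([1] * 20, [0] * 20))
```

With these defaults, the number of trials used depends on how lopsided the observed outcomes are: clear-cut comparisons stop early, while close ones keep returning "continue" until more trials are supplied.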
OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition
Yu, Yiheng, Liu, Sheng, Feng, Yuan, Xu, Min, Jin, Zhelun, Yang, Xuhua
The primary challenge in continuous sign language recognition (CSLR) stems from multi-orientational and long-term motions. However, current research overlooks these crucial aspects, which significantly reduces accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed, stage coupling and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD achieves state-of-the-art (SOTA) performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improve the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA.
- Europe > Switzerland (0.14)
- Asia > China (0.14)
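The word error rate (WER) reported in the abstract above is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized sequence and the reference, divided by the reference length. The sketch below shows that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, and the paper's own evaluation scripts may differ in details.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, an absolute improvement of 1.6% means the WER value itself drops by 1.6 percentage points.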
Sample size determination for machine learning in medical research
Arifin, Wan Nor, Yaacob, Najib Majdi
Machine learning (ML) methods are increasingly used across various domains of medical research. However, despite advancements in the use of ML in medicine, clear and definitive guidelines for determining sample sizes in medical ML research are lacking. This article proposes a method for determining sample sizes for medical research utilizing ML methods, beginning with the determination of the testing set sample size, followed by the determination of the training set and total sample sizes.
Introduction
Machine learning (ML) methods are increasingly used in medical research, spanning domains from oncology, orthopaedics, and ophthalmology to general practice (Sirocchi et al., 2024). However, despite this progress, there are currently no clear and definitive guidelines for determining sample sizes when using ML methods in the medical domain.
FlowAR: A Unified Platform for Human Activity Recognition from Binary Sensors (FlowAR: une plateforme uniformisée pour la reconnaissance des activités humaines à partir de capteurs binaires)
Ncibi, Ali, Bouganim, Luc, Pucheral, Philippe
This demo showcases a platform for developing human activity recognition (AR) systems, focusing on daily activities inferred from sensor data such as binary sensor readings. With a data-driven approach, this platform, named FlowAR, features a three-step pipeline (flow): data cleaning, segmentation, and personalized classification. Its modularity gives users the flexibility to test different methods and datasets and to ensure rigorous evaluations. A concrete use case demonstrates its effectiveness.
On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
Liu, Rundong, Frade, Andre, Vaidya, Amal, Labonne, Maxime, Kaiser, Marcus, Chakrabarti, Bismayan, Budd, Jonathan, Moran, Sean
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals
Srikanth, Neha, Rudinger, Rachel
Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Singapore (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)