Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. Reviewer 3 noted that the Section 2 task descriptions could be better presented; we have reformatted them so that "the order [...]". This should help both contextualize each task's difficulty and illustrate what it involves. We also changed our description of IMPALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful question about whether "most of the findings translate when [...]". Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation: '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without [...]. The Spot the Difference level was overall harder than Change Detection for our ablation models.
LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews
Madeyski, Lech, Kitchenham, Barbara, Shepperd, Martin
Context: Large language models (LLMs) are released faster than users' ability to evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance (e.g., Accuracy, which satisfies neither requirement, was widely used), ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) a pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored, cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where FNs carry higher penalties than FPs.
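The imbalance argument is easy to illustrate with the standard (unweighted) MCC; the paper's WMCC is a cost-sensitive variant whose weighting scheme is defined in the paper itself, so this sketch sticks to the textbook formula. The class proportions below are invented for illustration:

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from a full confusion matrix.

    Returns 0.0 when any marginal is zero (the degenerate case where
    MCC is undefined, e.g. a screener that predicts a single class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

# Screening-like imbalance: 100 relevant papers out of 1000.
# Rejecting everything looks 90% "accurate", yet MCC is 0 and
# recall is 0 -- all evidence is lost.
reject_all = dict(tp=0, fn=100, fp=0, tn=900)
# A useful screener: high accuracy AND clearly better than chance.
screener = dict(tp=90, fn=10, fp=30, tn=870)
```

Reporting the four cells (tp, fn, fp, tn) lets any such metric, including chance-anchored ones, be recomputed later, which is the point of recommendation iii).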
Supplementary material for "FETA: Towards Specializing Foundation Models for Expert Task Applications"
The downloaded documents were processed by the DeepSearch tool (https://ds4sd.github.io/). We employ a dilation technique in which we increase the length of each box's horizontal edges; this creates some overlaps between neighboring boxes. We created manual annotations for part of the Car Manuals dataset. The steps are shown on an example page from the cars dataset. In this test we consider only the manually annotated documents.
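The horizontal dilation step can be sketched as follows; the dilation amount is an assumed parameter (the text does not state the value used), and boxes follow the common (x0, y0, x1, y1) convention:

```python
def dilate_horizontally(box, amount):
    """Extend an (x0, y0, x1, y1) box by `amount` on each horizontal side."""
    x0, y0, x1, y1 = box
    return (x0 - amount, y0, x1 + amount, y1)

def boxes_overlap(a, b):
    """Axis-aligned overlap test (strict: boxes sharing only an edge don't count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# Two horizontally adjacent boxes that are disjoint before dilation
# come to overlap after each gains `amount` on both horizontal edges.
left, right = (0, 0, 10, 10), (12, 0, 20, 10)
```

Dilating both boxes by 2 closes the 2-unit gap between them, which is exactly the "some overlaps between neighboring boxes" effect mentioned above.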
A Distinguishing supervised learning from reinforcement learning in a model
Consider the simple feedforward network shown in Fig. S1. C is a function with a peak near zero. For the simulations shown in Fig. S1, we used uncorrelated inputs and a linear feedforward network. In this section we provide derivations of the two learning rules studied in our paper; evidence for such "three-factor" learning rules has been found in a number of neuroscience studies. We also derive a local RNN update rule using policy gradient learning; this "node perturbation" learning algorithm is essentially equivalent to previously proposed RL rules. Unless stated otherwise, all simulations involved pretraining.
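A node-perturbation rule of the three-factor form (reward signal × output perturbation × presynaptic activity) can be sketched for a single linear unit; the learning rate, noise scale, running-average baseline, and toy regression task below are illustrative assumptions, not the paper's settings:

```python
import random

def node_perturbation_train(xs, ys, lr=0.01, noise=0.1, epochs=500, seed=0):
    """Learn y = w*x + b from scalar reward alone via node perturbation.

    Each step injects Gaussian noise xi at the unit's output, observes a
    scalar reward (negative squared error), and updates each weight by
    (reward - baseline) * xi * presynaptic_input -- no backprop needed.
    """
    rng = random.Random(seed)
    w = b = 0.0
    r_base = None  # running-average reward baseline (the "reward prediction")
    for _ in range(epochs):
        for x, y_target in zip(xs, ys):
            xi = rng.gauss(0.0, noise)          # perturbation of the output
            y = w * x + b + xi
            r = -(y - y_target) ** 2            # scalar reward signal
            if r_base is None:
                r_base = r
            # three-factor update; division by noise**2 normalizes the
            # perturbation-based gradient estimate
            w += lr * (r - r_base) * xi * x / noise ** 2
            b += lr * (r - r_base) * xi / noise ** 2
            r_base += 0.1 * (r - r_base)        # slowly track mean reward
    return w, b

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]                # toy target: w=2, b=1
```

In expectation the update equals gradient ascent on reward, which is why this reward-modulated rule behaves like (noisy) supervised learning on this task.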
Zero-Shot Referring Expression Comprehension via Vision-Language True/False Verification
Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise vision-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and the results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
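The box-wise verification workflow reduces to a few lines once the detector and VLM are abstracted away; `verify_fn` below stands in for the per-region True/False VLM query (a hypothetical callback, not an API from the paper), and the toy keyword verifier exists only to make the sketch runnable:

```python
def rec_by_verification(boxes, expression, verify_fn):
    """Zero-shot REC as independent box-wise True/False verification.

    boxes: proposal dicts from a generic detector (e.g. YOLO-World).
    verify_fn(box, expression) -> bool: does this region alone match the
    expression? Each box is judged independently, so there is no
    cross-box interference; the result may be empty (abstention) or
    contain several boxes (multiple matches).
    """
    return [b for b in boxes if verify_fn(b, expression)]

# Toy stand-in for the VLM: match if the box's label appears in the query.
def toy_verifier(box, expression):
    return box["label"] in expression

proposals = [
    {"label": "dog", "xyxy": (0, 0, 50, 40)},
    {"label": "cat", "xyxy": (60, 0, 95, 40)},
]
```

Contrast this with selection-based prompting, where the VLM sees all candidates at once and must pick one; verification never forces a choice, which is what enables abstention.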