palm-2
Hierarchical Organization Simulacra in the Investment Sector
Chen, Chung-Chi, Takamura, Hiroya, Kobayashi, Ichiro, Miyao, Yusuke
This paper explores designing artificial organizations with professional behavior in investments using a multi-agent simulation. The method mimics hierarchical decision-making in investment firms, using news articles to inform decisions. A large-scale study analyzing over 115,000 news articles of 300 companies across 15 years compared this approach against professional traders' decisions. Results show that hierarchical simulations align closely with professional choices, both in frequency and profitability. However, the study also reveals biases in decision-making, where changes in prompt wording and perceived agent seniority significantly influence outcomes. This highlights both the potential and limitations of large language models in replicating professional financial decision-making.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Taiwan (0.05)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China > Hong Kong (0.04)
Faithful Chart Summarization with ChaTS-Pi
Krichene, Syrine, Piccinno, Francesco, Liu, Fangyu, Eisenschlos, Julian Martin
Chart-to-summary generation can help explore data, communicate insights, and help the visually impaired people. Multi-modal generative models have been used to produce fluent summaries, but they can suffer from factual and perceptual errors. In this work we present CHATS-CRITIC, a reference-free chart summarization metric for scoring faithfulness. CHATS-CRITIC is composed of an image-to-text model to recover the table from a chart, and a tabular entailment model applied to score the summary sentence by sentence. We find that CHATS-CRITIC evaluates the summary quality according to human ratings better than reference-based metrics, either learned or n-gram based, and can be further used to fix candidate summaries by removing not supported sentences. We then introduce CHATS-PI, a chart-to-summary pipeline that leverages CHATS-CRITIC during inference to fix and rank sampled candidates from any chart-summarization model. We evaluate CHATS-PI and CHATS-CRITIC using human raters, establishing state-of-the-art results on two popular chart-to-summary datasets.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Oregon > Multnomah County > Portland (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- (16 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.46)
Comprehensive Evaluation and Insights into the Use of Large Language Models in the Automation of Behavior-Driven Development Acceptance Test Formulation
Karpurapu, Shanthi, Myneni, Sravanthy, Nettur, Unnati, Gajja, Likhit Sagar, Burke, Dave, Stiehm, Tom, Payne, Jeffery
Behavior-driven development (BDD) is an Agile testing methodology fostering collaboration among developers, QA analysts, and stakeholders. In this manuscript, we propose a novel approach to enhance BDD practices using large language models (LLMs) to automate acceptance test generation. Our study uses zero and few-shot prompts to evaluate LLMs such as GPT-3.5, GPT-4, Llama-2-13B, and PaLM-2. The paper presents a detailed methodology that includes the dataset, prompt techniques, LLMs, and the evaluation process. The results demonstrate that GPT-3.5 and GPT-4 generate error-free BDD acceptance tests with better performance. The few-shot prompt technique highlights its ability to provide higher accuracy by incorporating examples for in-context learning. Furthermore, the study examines syntax errors, validation accuracy, and comparative analysis of LLMs, revealing their effectiveness in enhancing BDD practices. However, our study acknowledges that there are limitations to the proposed approach. We emphasize that this approach can support collaborative BDD processes and create opportunities for future research into automated BDD acceptance test generation using LLMs.
- Asia > India (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Ukraine > Kharkiv Oblast > Kharkiv (0.04)
- (4 more...)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?
Fu, Xue-Yong, Laskar, Md Tahmid Rahman, Khasanova, Elena, Chen, Cheng, TN, Shashi Bhushan
Large Language Models (LLMs) have demonstrated impressive capabilities to solve a wide range of tasks without being explicitly fine-tuned on task-specific datasets. However, deploying LLMs in the real world is not trivial, as it requires substantial computing resources. In this paper, we investigate whether smaller, compact LLMs are a good alternative to the comparatively Larger LLMs2 to address significant costs associated with utilizing LLMs in the real world. In this regard, we study the meeting summarization task in a real-world industrial environment and conduct extensive experiments by comparing the performance of fine-tuned compact LLMs (e.g., FLAN-T5, TinyLLaMA, LiteLLaMA) with zero-shot larger LLMs (e.g., LLaMA-2, GPT-3.5, PaLM-2). We observe that most smaller LLMs, even after fine-tuning, fail to outperform larger zero-shot LLMs in meeting summarization datasets. However, a notable exception is FLAN-T5 (780M parameters), which performs on par or even better than many zero-shot Larger LLMs (from 7B to above 70B parameters), while being significantly smaller. This makes compact LLMs like FLAN-T5 a suitable cost-efficient solution for real-world industrial deployment.
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (2 more...)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
Fernandes, Patrick, Deutsch, Daniel, Finkelstein, Mara, Riley, Parker, Martins, André F. T., Neubig, Graham, Garg, Ankush, Clark, Jonathan H., Freitag, Markus, Firat, Orhan
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)