Large Language Model
Increasing GPU Utilization during Generative Inference for Higher Throughput
Apart from the already-large model parameters, the key/value (KV) cache, which holds information about previous tokens in a sequence, can grow to be even larger than the model itself. This problem is exacerbated in one current LLM serving framework, which reserves KV cache memory for the maximum sequence length to guarantee that a complete sequence can be generated, because it does not know the output sequence length in advance. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem.
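To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not a measurement from any serving framework. The model shape (40 layers, 40 heads, head dimension 128, fp16), the 12 GiB of free memory, and the sequence lengths are all hypothetical values chosen for illustration, and the helper names are our own.

```python
def kv_cache_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed for one token of one sequence.

    The leading factor 2 accounts for storing both a key and a value
    vector in every layer. bytes_per_elem=2 assumes fp16 storage.
    """
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem


def max_batch_size(free_bytes, reserved_tokens, per_token_bytes):
    """Largest batch that fits if each sequence reserves `reserved_tokens` slots."""
    return free_bytes // (reserved_tokens * per_token_bytes)


# Hypothetical model shape and memory budget (assumptions, not from the source).
per_token = kv_cache_bytes_per_token(n_layers=40, n_heads=40, head_dim=128)
free_memory = 12 * 1024**3  # 12 GiB left for the KV cache after loading weights

# Case 1: reserve the maximum sequence length for every request.
batch_max_reserved = max_batch_size(free_memory, reserved_tokens=2048,
                                    per_token_bytes=per_token)

# Case 2: output length known a priori, so only 512 tokens are reserved.
batch_known_length = max_batch_size(free_memory, reserved_tokens=512,
                                    per_token_bytes=per_token)

print(per_token, batch_max_reserved, batch_known_length)
```

Under these assumed numbers, reserving only the tokens a request actually needs allows roughly four times as many sequences per batch as reserving the maximum length, which is the utilization gap the paragraph above describes.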
AVATAR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
In IR systems, the retriever module directly influences the performance of downstream tasks, such as retrieval-augmented generation [20, 29, 30] and knowledge-intensive question answering [34, 52]. However, these methods do not explicitly consider targeted optimization for tool usage or the impact on complex multi-stage tasks.
Appendix A Distribution of Class Labels Across Each Probing Task
We also implemented the Iterative Null-Space Projection (INLP) method (Ravfogel et al., 2020) to [...] Results using our method are in Table 4. Results using the INLP method are [...] This pattern holds across all of the linguistic properties that we tested.
Each language brain region is not necessarily homogeneous in function across all voxels it contains.
Bottom plot displays the pretrained BERT vs. removal of all tasks.
Like the probing experiments with BERT in the main paper, we also perform experiments with GPT2. We find the results to be similar to BERT, i.e., a rich hierarchy of linguistic signals: initial to middle layers encode surface information, middle layers encode syntax, middle to top layers [...] We verify that the removal of each linguistic property from GPT2 leads to reduced task performance across all layers, as expected.
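For readers unfamiliar with the Iterative Null-Space Projection (INLP) method mentioned above, the following is a simplified sketch of its core loop, not the implementation used in the paper. It assumes binary labels in {-1, +1} and swaps in a least-squares linear probe for the linear classifier of the original method; the function name `inlp_projection` and the synthetic data are hypothetical.

```python
import numpy as np


def inlp_projection(X, y, n_iters=2):
    """Return a projection matrix P that removes linearly decodable label information.

    X: (n_samples, dim) representations; y: (n_samples,) labels in {-1, +1}.
    Each round fits a linear probe, then projects the data onto the
    null space of the probe's weight vector.
    """
    dim = X.shape[1]
    P = np.eye(dim)
    X_proj = X.copy()
    for _ in range(n_iters):
        # Least-squares probe (a stand-in for a trained linear classifier).
        # rcond discards directions already removed in earlier rounds.
        w, *_ = np.linalg.lstsq(X_proj, y, rcond=1e-8)
        norm = np.linalg.norm(w)
        if norm < 1e-10:
            break  # nothing left for a linear probe to use
        w = w / norm
        # Projection onto the null space of w: removes the component along w.
        P_w = np.eye(dim) - np.outer(w, w)
        P = P_w @ P
        X_proj = X_proj @ P_w
    return P


# Synthetic example: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])

P = inlp_projection(X, y, n_iters=2)
X_clean = X @ P.T  # representations with two probe directions removed
```

Each round removes one direction, so after `n_iters` rounds the cleaned representations live in a subspace whose dimension is reduced by at most `n_iters`.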