Large Language Model
Reusing Models by Multi-linear Operators for Efficient Training
Training large models from scratch usually costs a substantial amount of resources. To address this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the "target model"), leading to a considerable acceleration in training. Despite their successes, these studies grow pretrained models by mapping only partial weights, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, partial mapping may fail to capture the complete information and lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration ability. We utilize multi-linear operators to reduce computational and spatial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76% of computational costs on DeiT-base transferred from DeiT-small, outperforming bert2BERT by +12.0% and LiGO by +20.7%.
Type-to-Track: Retrieve Any Object via Prompt-based Tracking
One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for this Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method that models a transformer-based eMbed-ENcoDE-extRact framework (MENDER) using third-order tensor decomposition. Experiments in five scenarios show that our MENDER approach outperforms another two-stage design in both accuracy and efficiency, with up to 14.7% higher accuracy and up to 4× faster speed.
MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models.
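To make the 30% masking ratio concrete, the following is a minimal sketch of how MLM inputs and labels could be built for a single sequence. It is not MosaicBERT's actual data pipeline: the token ids `MASK_ID` and `PAD_ID`, the ignore label `-100`, and the function name `mask_tokens` are assumptions for illustration, and the sketch always substitutes the mask token rather than using the mixed replacement scheme common in BERT-style recipes.

```python
import random

MASK_ID = 103   # hypothetical id of the [MASK] token
PAD_ID = 0      # hypothetical id of the padding token
IGNORE = -100   # label for positions the model is not asked to predict


def mask_tokens(token_ids, mask_ratio=0.30, seed=None):
    """Return (inputs, labels) with about mask_ratio of real tokens masked."""
    rng = random.Random(seed)
    # Only non-padding positions are candidates for masking.
    candidates = [i for i, t in enumerate(token_ids) if t != PAD_ID]
    n_mask = max(1, round(mask_ratio * len(candidates))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    # Masked positions get MASK_ID in the input and keep the original id as label.
    inputs = [MASK_ID if i in chosen else t for i, t in enumerate(token_ids)]
    labels = [t if i in chosen else IGNORE for i, t in enumerate(token_ids)]
    return inputs, labels


tokens = list(range(1000, 1010)) + [PAD_ID, PAD_ID]  # 10 real tokens, 2 pads
inputs, labels = mask_tokens(tokens, seed=0)
print(inputs)
print(labels)
```

With 10 real tokens and a ratio of 0.30, three positions are masked and the two padding positions are left untouched. A higher ratio than BERT's original 15% means each sequence contributes more prediction targets per forward pass.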
I put Microsoft's new Copilot tools to work in Office. It performed like an eager intern
PCWorld reports Microsoft 365 Copilot has evolved from offering passive suggestions to actively making live changes in Excel, PowerPoint, and Word documents. The upgraded agentic capabilities allow Copilot to create presentations and documents from scratch, though with some limitations like missing graphics. These enhanced features are available across Microsoft 365 Copilot, Premium, Personal, and Family subscriptions, representing a significant productivity upgrade. Although Microsoft's Copilot reportedly remains far behind competing AI Large Language Models (LLMs) in terms of usage, the Copilot built into its Microsoft 365 applications remains a potent assistant.
The Download: supercharged scams and studying AI healthcare
Plus: DeepSeek has unveiled its long-awaited new AI model. When ChatGPT was released in late 2022, it showed how easily generative AI could create human-like text. This quickly caught the eye of cybercriminals, who began using LLMs to compose malicious emails. Since then, they've adopted AI for everything from turbocharged phishing and hyperrealistic deepfakes to automated vulnerability scans. Many organizations are now struggling to cope with the sheer volume of cyberattacks. AI is making them faster, cheaper, and easier to carry out, a problem set to worsen as more cybercriminals adopt these tools and as the tools' capabilities improve.
DeepSeek promises its new AI model has 'world-class' reasoning
The new models give users access to a 'cost effective 1 million context length.' DeepSeek has released its latest AI models, the V4 Pro and Flash versions, a bit over a year after it went viral and became the top-rated free app on Apple's App Store in the US. "Welcome to the era of cost-effective 1 million context length," DeepSeek said in its announcement. Context length is the maximum number of tokens that an AI model can remember, so the bigger it is, the more coherent and consistent an AI is when it comes to extended conversations. OpenAI's recently announced GPT 5.5 has a context window ranging from 400,000 to 1 million, for instance.
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.