9d411e87d0f37059f40fb27c5de00ba0-Supplemental-Datasets_and_Benchmarks_Track.pdf
The following section answers the questions listed in Datasheets for Datasets. A.1 Motivation. Question: For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Answer: To evaluate the linguistic robustness of language models across diverse English varieties by transforming Standard American English (SAE) datasets. Question: Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? Answer: The authors of this paper. Question: Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on nonstandard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties.
Scaling Embedding Layers in Language Models
We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
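The lookup step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `longest_ngram_ids`, `embed`, `ngram_to_id`, `token_table`, and `ngram_table` are hypothetical, the longest-match rule is an assumption, and adding the n-gram vector to the token vector is an illustrative choice that the abstract does not specify.

```python
import numpy as np

def longest_ngram_ids(tokens, ngram_to_id, max_n=3):
    """For each position, return the id of the longest listed n-gram
    (length 2..max_n) that ends at that position, or -1 if none matches."""
    ids = []
    for i in range(len(tokens)):
        found = -1
        # Try the longest candidate first; n-grams of length 1 are plain tokens.
        for n in range(min(max_n, i + 1), 1, -1):
            key = tuple(tokens[i - n + 1 : i + 1])
            if key in ngram_to_id:
                found = ngram_to_id[key]
                break
        ids.append(found)
    return ids

def embed(tokens, token_table, ngram_table, ngram_to_id, max_n=3):
    """Token embeddings plus a precomputed n-gram embedding where one matches.
    ngram_table stands in for the table held in off-accelerator memory."""
    out = token_table[np.asarray(tokens)].copy()
    for pos, gid in enumerate(longest_ngram_ids(tokens, ngram_to_id, max_n)):
        if gid >= 0:
            out[pos] += ngram_table[gid]  # assumption: additive combination
    return out
```

Because the per-token work is a dictionary probe and a row read, the cost at inference is independent of how large the model that produced `ngram_table` was, which is the property the abstract relies on.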
Evolutionary Reasoning Does Not Arise in Standard Usage of Protein Language Models
Protein language models (PLMs) are often assumed to capture evolutionary information by training on large protein sequence datasets. Yet it remains unclear whether PLMs can reason about evolution--that is, infer evolutionary relationships between sequences. We test this capability by evaluating whether standard PLM usage, frozen or fine-tuned embeddings with distance-based comparison, supports evolutionary reasoning. Existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on sequence-level tasks such as masked-token and contact prediction. We present Phyla, a hybrid state-space and transformer model that jointly processes multiple sequences and is trained using a tree-based objective across 3,000 phylogenies spanning diverse protein families.
Equi-mRNA: Protein Translation Equivariant Encoding for mRNA Language Models
The growing importance of mRNA therapeutics and synthetic biology highlights the need for models that capture the latent structure of synonymous codon usage (different triplets encoding the same amino acid), which subtly modulates translation efficiency and gene expression. While recent efforts incorporate codon-level inductive biases through auxiliary objectives, they often fall short of explicitly modeling the structured relationships that arise from the genetic code's inherent symmetries. We introduce Equi-mRNA, the first codon-level equivariant mRNA language model that explicitly encodes synonymous codon symmetries as cyclic subgroups of the 2D special orthogonal group ($\mathrm{SO}(2)$). By combining group-theoretic priors with an auxiliary equivariance loss and symmetry-aware pooling, Equi-mRNA learns biologically grounded representations that outperform vanilla baselines across multiple axes. On downstream property prediction tasks, including expression, stability, and riboswitch switching, Equi-mRNA delivers up to $\approx$ 10\% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to $\approx$ 4$\times$ more realistic under Fréchet BioDistance metrics and preserve functional properties $\approx$ 28\% better than the vanilla baseline. Interpretability analyses further reveal that learned codon rotation distributions recapitulate known GC-content biases and tRNA abundance patterns, offering novel insights into codon usage. Equi-mRNA establishes a new biologically principled paradigm for mRNA modeling, with significant implications for the design of next-generation therapeutics.
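To make the phrase "cyclic subgroups of $\mathrm{SO}(2)$" concrete, the sketch below builds the order-n cyclic subgroup as n rotation matrices with angles 2πk/n. The function names `rotation` and `cyclic_subgroup` are hypothetical, and the idea of giving an amino acid with n synonymous codons the order-n subgroup (codon k mapped to the k-th rotation) is an assumption made for illustration; the abstract does not state the exact assignment.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix by angle theta (an element of SO(2))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def cyclic_subgroup(n):
    """The n rotations by multiples of 2*pi/n, i.e. the cyclic group of order n."""
    return [rotation(2.0 * np.pi * k / n) for k in range(n)]

# Assumed example: an amino acid with four synonymous codons.
group = cyclic_subgroup(4)
# Composing rotation 1 with rotation 3 wraps around to the identity (index 0).
wraps_to_identity = np.allclose(group[1] @ group[3], group[0])
```

Composition of two elements stays inside the set (indices add modulo n), which is what makes it a subgroup rather than an arbitrary collection of rotations.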
OpenAI faces criminal probe over role of ChatGPT in shooting
OpenAI is facing a criminal investigation in the US over whether its ChatGPT technology played a part in the murder of two people during a mass shooting at Florida State University last year. Florida's Attorney General James Uthmeier said on Tuesday his office had been looking into the use of the artificial intelligence (AI) chatbot by a man who allegedly shot several people at the campus in Tallahassee. "Our review has revealed that a criminal investigation is necessary," Uthmeier said. "ChatGPT offered significant advice to this shooter before he committed such heinous crimes." An OpenAI spokesperson said: "ChatGPT is not responsible for this terrible crime."
Disentangled Style Domain for Implicit z-Watermark Towards Copyright Protection
Text-to-image models have shown surprising performance in high-quality image generation, while also raising intensified concerns about the unauthorized usage of personal datasets in training and personalized fine-tuning. Recent approaches, such as embedding watermarks, introducing perturbations, and inserting backdoors into datasets, rely on adding minor information that is vulnerable to adversarial training, limiting their ability to detect unauthorized data usage. In this paper, we introduce a novel implicit Zero-Watermarking scheme that first utilizes the disentangled style domain to detect unauthorized dataset usage in text-to-image models. Specifically, our approach generates the watermark from the disentangled style domain, enabling self-generalization and mutual exclusivity within the style domain anchored by protected units.
Supplemental Material
Figure 1: Overview of the Transformer block used in the PromptIR framework. As mentioned in Section 3.1.2, bias-free convolutions are utilized within this submodule. After the MDTA module, the features are processed through the GDFN module. Our method effectively removes haze to produce visually better images.