Collaborating Authors

 Lin, Ying


Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

arXiv.org Artificial Intelligence

Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
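As a rough illustration of the classifier ensembling mentioned in the abstract above, the sketch below combines several quality scorers by taking the maximum score per document and keeping documents above a threshold. The max rule, the threshold, and the two keyword-based scorers are assumptions made for this example. They stand in for real model-based classifiers, and the abstract does not specify how the ensemble is formed.

```python
def ensemble_score(doc, classifiers):
    """Combine per-classifier quality scores by taking the maximum (assumed rule)."""
    return max(clf(doc) for clf in classifiers)


def filter_corpus(docs, classifiers, threshold):
    """Keep documents whose ensembled score reaches the threshold."""
    return [d for d in docs if ensemble_score(d, classifiers) >= threshold]


# Hypothetical toy scorers standing in for model-based quality classifiers.
def educational_score(doc):
    return 1.0 if "theorem" in doc.lower() else 0.2


def instructional_score(doc):
    return 0.9 if "step" in doc.lower() else 0.1


docs = [
    "Proof of the theorem follows.",
    "Step 1: install the package.",
    "Buy cheap watches now!",
]
kept = filter_corpus(docs, [educational_score, instructional_score], threshold=0.5)
print(kept)
```

With a max rule, a document survives if any one scorer rates it highly, so an ensemble of this kind tends to retain more data than a single strict filter.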


FCOM: A Federated Collaborative Online Monitoring Framework via Representation Learning

arXiv.org Artificial Intelligence

Online learning has demonstrated notable potential to dynamically allocate limited resources to monitor a large population of processes, effectively balancing the exploitation of processes yielding high rewards, and the exploration of uncertain processes. However, most online learning algorithms were designed under 1) a centralized setting that requires data sharing across processes to obtain an accurate prediction or 2) a homogeneity assumption that estimates a single global model from the decentralized data. To facilitate the online learning of heterogeneous processes from the decentralized data, we propose a federated collaborative online monitoring method, which captures the latent representative models inherent in the population through representation learning and designs a novel federated collaborative UCB algorithm to estimate the representative models from sequentially observed decentralized data. The efficiency of our method is illustrated through theoretical analysis, simulation studies, and decentralized cognitive degradation monitoring in Alzheimer's disease.

Monitoring a large population of dynamic processes within the constraints of monitoring resources poses a significant challenge across various industrial sectors, including healthcare and engineering systems [1], [2]. The complexity arises from two key factors: 1) the inherent disparity between the limited monitoring resources available and the large population of processes to be monitored, and 2) the uncertain and heterogeneous dynamics in the progression of these processes. In tackling this intricate problem, online learning from bandit feedback has demonstrated notable potential [2], [3].


Online Modeling and Monitoring of Dependent Processes under Resource Constraints

arXiv.org Artificial Intelligence

Adaptive monitoring of a large population of dynamic processes is critical for the timely detection of abnormal events under limited resources in many healthcare and engineering systems. Examples include risk-based disease screening and condition-based process monitoring. However, existing adaptive monitoring models either ignore the dependency among processes or overlook the uncertainty in process modeling. To design an optimal monitoring strategy that accurately monitors the processes with poor health conditions and actively collects information for uncertainty reduction, this study proposes a novel online collaborative learning method. The proposed method designs a collaborative learning-based upper confidence bound (CL-UCB) algorithm to optimally balance the exploitation and exploration of dependent processes under limited resources. The efficiency of the proposed method is demonstrated through theoretical analysis, simulation studies, and an empirical study of adaptive cognitive monitoring in Alzheimer's disease.
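To make the exploitation/exploration trade-off under a monitoring budget concrete, here is a minimal sketch of a generic UCB1-style rule that picks the k processes with the largest upper confidence bounds in each round. It is not the CL-UCB algorithm from the abstract: it treats processes as independent, and the function names, the Gaussian noise model, and the exploration constant `c` are assumptions made for illustration.

```python
import math
import random


def select_processes(counts, means, t, k, c=2.0):
    """Return indices of the k processes with the largest UCB scores."""
    scores = []
    for i, (n, mu) in enumerate(zip(counts, means)):
        if n == 0:
            score = float("inf")  # never-observed processes are monitored first
        else:
            score = mu + math.sqrt(c * math.log(t) / n)  # mean + exploration bonus
        scores.append((score, i))
    # Highest score first; ties broken by lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


def run_monitoring(true_risks, k, rounds, seed=0):
    """Simulate monitoring k processes per round with noisy risk observations."""
    rng = random.Random(seed)
    m = len(true_risks)
    counts = [0] * m
    means = [0.0] * m
    for t in range(1, rounds + 1):
        for i in select_processes(counts, means, t, k):
            reward = true_risks[i] + rng.gauss(0.0, 0.1)  # assumed noise model
            counts[i] += 1
            means[i] += (reward - means[i]) / counts[i]  # running average
    return counts, means


counts, means = run_monitoring([0.9, 0.1, 0.5, 0.2], k=2, rounds=200)
print(counts)
```

In this toy setting the budget of two observations per round concentrates on the high-risk processes over time, while the bonus term keeps occasionally revisiting the others.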


A Generative Adversarial Network-based Selective Ensemble Characteristic-to-Expression Synthesis (SE-CTES) Approach and Its Applications in Healthcare

arXiv.org Machine Learning

Investigating the causal relationships between characteristics and expressions plays a critical role in healthcare analytics. Effective synthesis of expressions from given characteristics can contribute greatly to health risk management and medical decision-making. For example, predicting the physiological symptoms that result in patients from given treatment characteristics is helpful for disease prevention and personalized treatment strategy design. Therefore, the objective of this study is to effectively synthesize expressions based on given characteristics. However, the mapping from characteristics to expressions usually goes from a relatively low-dimensional space to a high-dimensional space, and most existing methods, such as regression models, cannot handle such a mapping effectively. Besides, the relationship between characteristics and expressions may contain not only deterministic patterns but also stochastic patterns. To address these challenges, this paper proposes a novel selective ensemble characteristic-to-expression synthesis (SE-CTES) approach inspired by the generative adversarial network (GAN). The novelty of the proposed method can be summarized in three aspects: (1) a GAN-based architecture for deep neural networks is incorporated to learn the mapping from a relatively low-dimensional space to a high-dimensional space, containing both deterministic and stochastic patterns; (2) the two mismatching errors in the GAN-based architecture are assigned different weights to reduce the learning bias in the training process; and (3) a selective ensemble learning framework is proposed to reduce the prediction bias and improve the synthesis stability. To validate the effectiveness of the proposed approach, extensive numerical simulation studies and a real-world healthcare case study were conducted, and the results demonstrated that the proposed method is very promising.


Adaptive perturbation adversarial training: based on reinforcement learning

arXiv.org Artificial Intelligence

Adversarial training has become the primary method to defend against adversarial samples. However, it is hard to apply in practice due to many shortcomings. One of these shortcomings is that it reduces the recognition accuracy on normal samples. Adaptive perturbation adversarial training is proposed to alleviate this problem. It uses marginal adversarial samples, which are close to the decision boundary but do not cross it, for adversarial training, which improves the recognition accuracy of the model while maintaining its robustness. However, searching for marginal adversarial samples brings additional computational costs. This paper proposes a method for finding marginal adversarial samples based on reinforcement learning and combines it with the latest fast adversarial training technology, which effectively speeds up the training process and reduces training costs.
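The notion of a marginal adversarial sample can be illustrated on a toy linear classifier. The sketch below is not the reinforcement learning method the abstract proposes. It swaps in a plain binary search over the perturbation size along a sign-of-gradient direction and returns the largest perturbation found that leaves the prediction unchanged. The classifier, the function names, and the parameters `eps_max` and `steps` are assumptions made for illustration.

```python
def predict(w, b, x):
    """Linear binary classifier: +1 if w.x + b >= 0, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def marginal_adversarial(w, b, x, eps_max=1.0, steps=30):
    """Binary-search the largest step toward the boundary that keeps the label."""
    label = predict(w, b, x)
    # Sign-of-gradient direction that pushes the score against the current label.
    direction = [-label * (1 if wi > 0 else -1 if wi < 0 else 0) for wi in w]

    def perturb(eps):
        return [xi + eps * di for xi, di in zip(x, direction)]

    # If even the largest allowed perturbation does not flip the label, use it.
    if predict(w, b, perturb(eps_max)) == label:
        return perturb(eps_max), eps_max

    lo, hi = 0.0, eps_max  # invariant: lo keeps the label, hi flips it
    for _ in range(steps):
        mid = (lo + hi) / 2
        if predict(w, b, perturb(mid)) == label:
            lo = mid
        else:
            hi = mid
    return perturb(lo), lo


x_adv, eps = marginal_adversarial([1.0, -1.0], 0.0, [0.5, 0.0])
print(x_adv, eps)
```

Each search costs one forward pass per step, which is the kind of extra computation the abstract attributes to finding marginal samples.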


COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation

arXiv.org Artificial Intelligence

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures and knowledge subgraphs as evidence. All of the data, KGs, reports, resources and shared services are publicly available.


A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages

arXiv.org Artificial Intelligence

Unsupervised part of speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a 'ciphertext' and seeks a POS tag-to-cluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to state-of-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating it into a true 'zero-resource' variant of the MALOPA (Ammar et al., 2016) dependency parser model that removes the current reliance on multilingual resources and gold POS tags.

Figure 1: Overview of our approach to grounded POS tagging. We use an unsupervised clustering method (Section 3.2) then reduce and ground the clusters using a decipherment approach informed by POS tag sequence data from many languages (Section 3.3).
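The idea of treating cluster IDs as a ciphertext can be shown on a very small example. The sketch below is a brute-force toy, not the decipherment model used in the paper: it tries every assignment of tags to cluster IDs and keeps the one whose resulting tag sequence scores highest under a tag-bigram table. The three-tag inventory, the unnormalized bigram weights, and the floor value are all hypothetical.

```python
import itertools
import math

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical, unnormalized tag-bigram weights; unlisted pairs get FLOOR.
BIGRAM = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6, ("VERB", "DET"): 0.7}
FLOOR = 0.05


def log_score(tag_seq):
    """Sum of log bigram weights over adjacent tags."""
    return sum(math.log(BIGRAM.get(pair, FLOOR)) for pair in zip(tag_seq, tag_seq[1:]))


def decipher(cluster_seq):
    """Exhaustively search for the cluster-to-tag mapping with the best score."""
    clusters = sorted(set(cluster_seq))
    best_mapping, best_score = None, -math.inf
    for assignment in itertools.product(TAGS, repeat=len(clusters)):
        mapping = dict(zip(clusters, assignment))
        score = log_score([mapping[c] for c in cluster_seq])
        if score > best_score:
            best_mapping, best_score = mapping, score
    return best_mapping


mapping = decipher([0, 1, 2, 0, 1])
print(mapping)
```

Exhaustive search grows exponentially with the number of clusters, so it is only workable for toy inputs like this one.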