Kuro Siwo: 33 billion m² under the water. A global multi-temporal satellite dataset for rapid flood mapping. Supplemental material. 1 Dataset. The total size of the compressed dataset is
All code and data will be maintained at the project's repo. In Figure 1 we assess the performance of our best model on the flood event in Emilia-Romagna, Italy, which took place in May 2023, using a post-event SAR image acquired on 22/05/2023, two pre-event SAR images from 10/05/2023 and 28/04/2023, and a Sentinel-2 RGB image captured on 23/05/2023 (one day later).
A Fairness metric
Dynabench comprises four dynamic tasks with multiple rounds of datasets that will grow over time. These names cover 85.6 percent of the U.S. population and are based on 1990 Census information on first-name frequencies. (Note: there is nothing inherently "racial" about particular names; for example, each demographic group had at least a few people named "Anna" or "Benjamin".) For gender identity, we investigate two kinds of perturbations: names and noun phrases. Names were drawn from the Social Security Administration's lists of baby names (1980–2019), and perturbations were performed accordingly. For noun phrases (i.e., pronouns and nouns), we adopted a slightly more involved approach (e.g., replacing "her" with a noun like "dad" and expecting no effect on the classification label). Given that our perturbations are heuristic, some noise is to be expected. Consider "I've always enjoyed eating at Red Robin" being perturbed to "I've always enjoyed eating at Red Kayla".
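Perturbations of this kind amount to a whole-word substitution pass over the text. A minimal sketch follows; the swap tables are illustrative stand-ins for the Census/SSA-derived lists, and the heuristic's failure mode on multi-word names such as brand names is visible directly:

```python
import re

# Hypothetical swap tables; the actual perturbations draw on
# Census first-name data and the SSA baby-name lists.
NAME_SWAPS = {"Kayla": "Robin", "Robin": "Kayla"}

def perturb_names(text, swaps):
    """Swap whole-word name occurrences. Purely lexical, so brand
    names like "Red Robin" are perturbed too, introducing noise."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(0)], text)
```

Running this on the restaurant example reproduces the noisy perturbation described above, since the matcher cannot tell a person named Robin from the brand name.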
Spatiotemporal Learning on Cell-embedded Graphs
Data-driven simulation of physical systems has recently kindled significant attention, and many neural models have been developed for it. In particular, mesh-based graph neural networks (GNNs) have demonstrated significant potential in predicting spatiotemporal dynamics across arbitrary geometric domains. However, the existing node-edge message passing mechanism in GNNs limits the model's representation learning ability. In this paper, we propose a cell-embedded GNN model (aka CeGNN) to learn spatiotemporal dynamics with improved performance. Specifically, we introduce a learnable cell attribution into the node-edge message passing process, which better captures the spatial dependency of regional features. Such a strategy essentially upgrades the local aggregation scheme from first order (e.g., from edge to node) to a higher order (e.g., from volume to edge and then to node), which takes advantage of volumetric information in message passing. Meanwhile, a novel feature-enhanced block is designed to further improve the performance of CeGNN and relieve the over-smoothing problem, by treating the latent features as basis functions. Extensive experiments on various PDE systems and one real-world dataset demonstrate that CeGNN achieves superior performance compared with other baseline models, reducing the prediction error by up to one order of magnitude on several PDE systems.
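The volume-to-edge-to-node aggregation can be illustrated on a toy triangular mesh. The NumPy sketch below assumes simple mean aggregation and fixed random features; shapes, names, and update rules are illustrative and not the paper's actual CeGNN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mesh: 4 nodes, 2 triangular cells (illustrative, not CeGNN itself)
node_feats = rng.normal(size=(4, 8))
cells = [(0, 1, 2), (1, 2, 3)]                 # each cell = triangle of node ids
edges = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)]
cell_attr = rng.normal(size=(len(cells), 8))   # learnable cell attribution

def edge_message(e):
    """Volume -> edge: each edge gathers the attributions and node
    features of the cells that contain it, then adds its endpoints."""
    incident = [i for i, c in enumerate(cells) if set(e) <= set(c)]
    cell_part = np.mean([cell_attr[i] + node_feats[list(cells[i])].mean(0)
                         for i in incident], axis=0)
    return node_feats[e[0]] + node_feats[e[1]] + cell_part

def node_update(v):
    """Edge -> node: each node averages messages from incident edges."""
    msgs = [edge_message(e) for e in edges if v in e]
    return node_feats[v] + np.mean(msgs, axis=0)

new_feats = np.stack([node_update(v) for v in range(4)])
```

The point of the sketch is the extra hop: edge messages are conditioned on cell-level (volumetric) information before being aggregated at nodes, rather than on the two endpoint nodes alone.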
DataComp-LM: In search of the next generation of training sets for language models
Li, Jeffrey, Fang, Alex, Smyrnis, Georgios, Ivgi, Maor, Jordan, Matt, Gadre, Samir, Bansal, Hritik, Guha, Etash, Keh, Sedrick, Arora, Kushal, Garg, Saurabh, Xin, Rui, Muennighoff, Niklas, Heckel, Reinhard, Mercat, Jean, Chen, Mayee, Gururangan, Suchin, Wortsman, Mitchell, Albalak, Alon, Bitton, Yonatan, Nezhurina, Marianna, Abbas, Amro, Hsieh, Cheng-Yu, Ghosh, Dhruba, Gardner, Josh, Kilian, Maciej, Zhang, Hanlin, Shao, Rulin, Pratt, Sarah, Sanyal, Sunny, Ilharco, Gabriel, Daras, Giannis, Marathe, Kalyani, Gokaslan, Aaron, Zhang, Jieyu, Chandu, Khyathi, Nguyen, Thao, Vasiljevic, Igor, Kakade, Sham, Song, Shuran, Sanghavi, Sujay, Faghri, Fartash, Oh, Sewoong, Zettlemoyer, Luke, Lo, Kyle, El-Nouby, Alaaeldin, Pouransari, Hadi, Toshev, Alexander, Wang, Stephanie, Groeneveld, Dirk, Soldaini, Luca, Koh, Pang Wei, Jitsev, Jenia, Kollar, Thomas, Dimakis, Alexandros G., Carmon, Yair, Dave, Achal, Schmidt, Ludwig, Shankar, Vaishaal
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
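Model-based filtering, the step the abstract identifies as key, amounts to scoring documents with a learned quality classifier and keeping only the top-ranked fraction. A minimal sketch, with a toy lexical heuristic standing in for the trained classifier (function names and the keep fraction are illustrative, not DCLM's actual pipeline):

```python
def quality_score(doc: str) -> float:
    """Toy stand-in for a learned quality classifier: longer,
    lexically diverse documents score higher."""
    words = doc.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words) * min(len(words) / 50, 1.0)

def filter_corpus(docs, keep_fraction=0.5):
    """Rank documents by classifier score and keep the top fraction."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]
```

In a real pipeline the scoring function would be a trained model and the threshold would be tuned on downstream evaluations; the structure of rank-then-truncate is the same.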
- North America > United States > Texas > Travis County > Austin (0.27)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- (30 more...)
- Research Report > New Finding (0.87)
- Research Report > Experimental Study (0.65)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
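Labels like the ones above pair a ">"-delimited taxonomy path with a parenthesized confidence score. A small parser for that line format (illustrative; it assumes exactly the layout shown and skips summary lines like "(2 more...)"):

```python
import re

# Matches "- Path > To > Label (0.87)"; summary lines fail the score pattern.
LABEL_RE = re.compile(r"-\s*(.+?)\s*\((\d\.\d+)\)")

def parse_label(line):
    """Return (path_components, confidence) or None for non-label lines."""
    m = LABEL_RE.search(line)
    if m is None:
        return None
    path = [part.strip() for part in m.group(1).split(">")]
    return path, float(m.group(2))
```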
Learning Activation Functions for Sparse Neural Networks
Loni, Mohammad, Mohan, Aditya, Asadi, Mehdi, Lindauer, Marius
Sparse Neural Networks (SNNs) can potentially match the performance of their dense counterparts while saving significant energy and memory at inference. However, the accuracy drop incurred by SNNs, especially at high pruning ratios, can be an issue in critical deployment conditions. While recent works mitigate this issue through sophisticated pruning techniques, we shift our focus to an overlooked factor: hyperparameters and activation functions. Our analyses show that the accuracy drop can additionally be attributed to (i) uniformly using ReLU as the default activation function, and (ii) fine-tuning SNNs with the same hyperparameters as their dense counterparts. Thus, we focus on learning novel activation functions tailored to sparse networks and combining them with a separate hyperparameter optimization (HPO) regime for sparse networks. By conducting experiments on popular DNN models (LeNet-5, VGG-16, ResNet-18, and EfficientNet-B0) trained on the MNIST, CIFAR-10, and ImageNet-16 datasets, we show that the novel combination of these two approaches, dubbed Sparse Activation Function Search (SAFS), results in up to 15.53%, 8.88%, and 6.33% absolute improvement in accuracy for LeNet-5, VGG-16, and ResNet-18 over the default training protocols, especially at high pruning ratios. Our code can be found at https://github.com/automl/SAFS
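The core idea, choosing the activation that a pruned network works best with rather than defaulting to ReLU, can be illustrated by searching a small candidate set and selecting by validation error. This is a standalone toy illustration on random features, not the SAFS algorithm itself (which optimizes parameterized activations per layer together with HPO):

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate unary activations; SAFS searches a richer, learnable space.
CANDIDATES = {
    "relu": lambda x: np.maximum(x, 0.0),
    "tanh": np.tanh,
    "identity": lambda x: x,
}

# Toy data and a heavily pruned random-feature layer (~90% weights zeroed)
X = rng.normal(size=(200, 10))
y = np.tanh(X @ rng.normal(size=(10, 1)))   # nonlinear target
W = rng.normal(size=(10, 32))
W *= rng.random(W.shape) > 0.9              # sparse mask

def val_error(act):
    """Fit a linear readout on the activated sparse features and
    return the mean-squared error, used as the search objective."""
    H = act(X @ W)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.mean((H @ coef - y) ** 2))

best = min(CANDIDATES, key=lambda name: val_error(CANDIDATES[name]))
```

By construction the selected activation does at least as well as the ReLU default on this toy objective, which is the motivation for searching instead of defaulting.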
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Sweden (0.04)
- (2 more...)
µTransfer: A technique for hyperparameter tuning of enormous neural networks - Microsoft Research
Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial and error necessary and can prove very cost-effective. In this post, we relay how our fundamental research enabled us, for the first time, to tune enormous neural networks that are too expensive to train more than once. We achieved this by showing that a particular parameterization preserves optimal hyperparameters across different model sizes. This is the µ-Parametrization (or µP, pronounced "myu-P") that we introduced in a previous paper, where we showed that it uniquely enables maximal feature learning in the infinite-width limit.
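Concretely, µP prescribes width-dependent scaling of initialization and per-layer learning rates so that a learning rate tuned on a small model transfers to a large one. A simplified sketch for a hidden layer, assuming an Adam-style optimizer (the exact rules differ per layer type and optimizer; see the Tensor Programs papers for the full table):

```python
import numpy as np

def mup_hidden_layer(fan_in, fan_out, base_lr, base_width, rng):
    """Initialize a hidden weight matrix and its learning rate so that
    feature updates stay O(1) as width grows (simplified µP-style rule).

    Assumed scalings: init variance ~ 1/fan_in; per-layer LR shrinks
    proportionally to width relative to the tuned base model.
    """
    W = rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in)
    lr = base_lr * base_width / fan_in
    return W, lr
```

With this parameterization, widening the network from `base_width` to `fan_in` automatically rescales the learning rate, which is what lets the small-model sweep stand in for tuning the expensive large model.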