The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only The Falcon LLM Team

Neural Information Processing Systems

This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will soon run out of unique high-quality data.


Supplementary A ViT 3B model

Neural Information Processing Systems

The ViT model we use in this work is based on a standard Vision Transformer [7] model scaled to nearly 3 billion parameters, using a patch size of 14, 16 heads, 64 blocks, an MLP dimension of 8192 and a hidden dimension of 2048. The model is defined and trained in Lingvo [32]; we additionally employ GSPMD [41] for training. The model is pre-trained on JFT-3B [35] using training settings that optimize for performance on JFT-3B rather than for fine-tuning on ImageNet; notably, we do not use the training recipe that helps few-shot transfer performance [44].

B Review tools We include screenshots of the reviewing tools we built to analyze model mistakes. Figure 3 shows the UI for reviewing model predictions and Figure 4 shows the UI that displays the labeling guide and slide bar to browse images for a particular class.
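As a rough sanity check on the stated model size, the sketch below counts only the weight matrices of the Transformer blocks from the hyperparameters quoted above (64 blocks, hidden dimension 2048, MLP dimension 8192, 16 heads). It assumes a standard block layout with four attention projections and a two-layer MLP, and it ignores biases, layer norms, the patch embedding, and the classification head, so it is an estimate rather than the authors' exact count. The function names are hypothetical.

```python
def vit_block_params(hidden_dim: int, mlp_dim: int) -> int:
    """Weight-matrix parameters in one Transformer block.

    Assumption: standard layout; biases and LayerNorm parameters are ignored.
    """
    attention = 4 * hidden_dim * hidden_dim  # Q, K, V and output projections
    mlp = 2 * hidden_dim * mlp_dim           # two dense layers of the MLP
    return attention + mlp


def vit_encoder_params(num_blocks: int, hidden_dim: int, mlp_dim: int) -> int:
    """Total block weights for a stack of identical Transformer blocks."""
    return num_blocks * vit_block_params(hidden_dim, mlp_dim)


# Hyperparameters quoted in the text above.
total = vit_encoder_params(num_blocks=64, hidden_dim=2048, mlp_dim=8192)

# Assumption: the hidden dimension is split evenly across the 16 heads.
head_dim = 2048 // 16

print(f"{total / 1e9:.2f}B block weights, head dimension {head_dim}")
```

Under these assumptions the block weights alone come to roughly 3.2 billion, which is the same order of magnitude as the size stated in the text.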


1abed6ee581b9ceb4e2ddf37822c7fcb-Supplemental-Conference.pdf

Neural Information Processing Systems

A.1 Graph-building strategies The graphs were built using the IsayevNN class from the pymatgen [48] package. It implements the commonly used Voronoi tessellation to define neighbors. Two atoms are considered bonded if they share a face in the Voronoi tessellation of the supercell and their distance is less than the sum of the atomic Cordero radii (a measure of the atomic radius) plus a cutoff of 0.5 Å. This cutoff value was increased compared to [32] to reduce the number of disconnected graphs. A hard cutoff of 6 Å is also imposed on atomic distances. We provide statistics for the graphs obtained by the method described in Section 5. Figure 5: Histogram of the number of primitive cell sites per material in the processed Materials Project dataset.
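The bonding rule described above can be written as a small predicate. This is a minimal sketch of the criterion only, not of pymatgen's implementation: whether two atoms share a Voronoi face is taken as an input flag rather than computed, and the radii table holds a few approximate covalent radii included purely for illustration. The names `RADII` and `is_bonded` are hypothetical.

```python
# Approximate covalent radii in angstroms, for illustration only (assumption).
RADII = {"H": 0.31, "C": 0.76, "O": 0.66, "Si": 1.11}


def is_bonded(el_a, el_b, distance, shares_voronoi_face,
              tol=0.5, hard_cutoff=6.0):
    """Return True if two atoms count as bonded under the rule in the text.

    distance            -- interatomic distance in angstroms
    shares_voronoi_face -- whether the atoms share a face in the Voronoi
                           tessellation (assumed to be computed elsewhere)
    tol                 -- tolerance added to the sum of radii (0.5 A in the text)
    hard_cutoff         -- absolute distance limit (6 A in the text)
    """
    if not shares_voronoi_face:
        return False
    if distance > hard_cutoff:
        return False
    return distance < RADII[el_a] + RADII[el_b] + tol


print(is_bonded("Si", "O", 1.62, True))   # within 1.11 + 0.66 + 0.5
print(is_bonded("C", "C", 2.50, True))    # beyond 0.76 + 0.76 + 0.5
print(is_bonded("Si", "O", 1.62, False))  # no shared Voronoi face
```

A larger tolerance admits more edges, which is how raising the cutoff reduces the number of disconnected graphs.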


Appendix

Neural Information Processing Systems

In this section we motivate the design choices and inductive biases that we encode into our neural encoder network $e$, which is the network used to model the relative accuracies of the weak supervision sources $\lambda$. Recall that we model the probability of a particular sample $x \in X$ having the class label $y \in Y = \{1, \dots, C\}$ as $P_\theta(y \mid \lambda) = \mathrm{softmax}(s)_y P(y)$, (4) where $s = \theta(\lambda, x)^T \lambda \in \mathbb{R}^C$.

Connection to prior PGM models We now motivate this choice by deriving a less expressive variant of it from the standard Markov Random Field (MRF) used in the related work. If we view the attention scores $\theta(\lambda, x) \in \mathbb{R}^m$, which assign sample-dependent accuracies to each labeling function, as sample-independent parameters $\theta_1$ and thereby drop the features from the equation (as is done in the related work [30, 32, 19, 11]), we can rewrite Eq. 4 as $\exp\left(\theta_1^T \mathbb{1}\{\lambda = y\}\right) P \ldots$ We can recognize $P_\theta$ as a distribution from the exponential family, and more specifically as a pairwise MRF, or factor graph, with canonical parameters $\theta = (\theta_1, \theta_2)$ and corresponding sufficient statistics, or factors, $\phi(\lambda, y) = (\phi_1(\lambda, y), \phi_2(\lambda))$, as well as the log partition function $Z_\theta$. The accuracy factors and parameters $\phi_1, \theta_1$ are the core component of this model and sometimes take the form $\phi_1(\lambda, y) = \lambda y$ in binary models, as in [30, 19, 11]. The label-independent factors $\phi_2(\lambda)$ have, as can be seen from the derivation above, no direct influence on the latent label posterior, but are often used to model labeling propensities $\mathbb{1}\{\lambda \neq 0\}$ and correlation dependencies $\mathbb{1}\{\lambda_i = \lambda_j\}$, which can be important for PGM parameter learning but are susceptible to misspecifications [39, 11, 8].
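To make the sample-independent variant concrete, the sketch below scores each class by summing the accuracy parameters of the labeling functions that voted for it, then combines the softmax of those scores with the class prior. Several things here are assumptions for illustration: votes are encoded as integers in 1..C with 0 meaning abstain, the result is renormalized so it sums to one, and the function name `label_posterior` is hypothetical. It is not the authors' encoder network, which makes the accuracies depend on the sample.

```python
import math


def label_posterior(votes, theta, prior):
    """Posterior over classes from labeling-function votes.

    votes -- list of m votes, each in {1..C}, or 0 for abstain (assumption)
    theta -- list of m sample-independent accuracy parameters
    prior -- list of C class prior probabilities P(y)
    """
    num_classes = len(prior)
    # s_y = sum_i theta_i * 1{vote_i == y}
    scores = [
        sum(t for v, t in zip(votes, theta) if v == y)
        for y in range(1, num_classes + 1)
    ]
    # softmax(s)_y * P(y), computed stably; renormalized (assumption).
    max_score = max(scores)
    unnormalized = [math.exp(s - max_score) * p for s, p in zip(scores, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three labeling functions, two classes, uniform prior.
posterior = label_posterior(votes=[1, 1, 2], theta=[1.0, 1.0, 1.0],
                            prior=[0.5, 0.5])
print(posterior)
```

When every labeling function abstains, all scores are zero and the posterior falls back to the prior, which matches the observation that label-independent factors do not influence the label posterior.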


License of the assets

Neural Information Processing Systems

License for the code We use the code for MS-TCN [13], ASRF [24], and LAS [9], all of which are under the MIT License according to https://opensource.org/licenses/MIT. For the Jigsaws [18] dataset, we follow the data use agreement according to https://cs.jhu.

Action classification: Action classification is the task of identifying a single action, as opposed to a sequence of actions. Several methods use 2D CNNs to extract frame-wise features from an input video, which are then combined to predict a coarse action taking place in the video [56, 39, 59]. There also exist several works that perform action classification from kinematic data [2, 12].

Action segmentation: Action segmentation is the problem of segmenting an input stream of data, labeling each frame according to the action that is being carried out. Earlier methods for action segmentation employed hidden Markov models [33, 22]. More recently, convolutional neural networks [58, 26] and recurrent neural networks [50] have been applied to this problem. Inspired by the success of temporal convolutional networks (TCNs) in speech synthesis, [37] adapted these models to action segmentation. MS-TCN [13], which uses a multi-stage TCN architecture, has become one of the most widely used architectures for action segmentation. Although these methods achieve high frame-wise accuracy, they still produce a significant number of over-segmentation errors. To address this, several boundary-aware methods have been developed which perform temporal smoothing of the frame-wise predictions [57, 24]. These methods use ground-truth boundary information to train a binary classification network to perform boundary detection. The boundary estimates are then used to aggregate the frame-wise predictions either in a soft manner (boundary-aware pooling) or by setting a hard threshold. However, elemental actions, such as the functional primitives in the StrokeRehab dataset, have a very short duration. As a result, the boundaries between actions can be hard to detect or even hard to define (see Figure 4).

Sequence-to-sequence models: Our proposed method is based on sequence-to-sequence (seq2seq) models. These models allow us to learn a mapping from a variable-length input sequence to a variable-length output sequence [53].
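The hard-threshold form of boundary-aware smoothing described above can be sketched as follows. Frames whose predicted boundary probability reaches a threshold start a new segment, and every frame in a segment is relabeled with that segment's majority class. This is a simplified illustration, not the method of [57] or [24]: the function names, the threshold value, and the toy action labels and probabilities are all hypothetical.

```python
from collections import Counter


def count_segments(labels):
    """Number of maximal runs of identical consecutive labels."""
    return sum(1 for i, lab in enumerate(labels)
               if i == 0 or lab != labels[i - 1])


def smooth_with_boundaries(frame_labels, boundary_probs, threshold=0.5):
    """Relabel each boundary-delimited segment with its majority class.

    frame_labels   -- per-frame predicted action labels
    boundary_probs -- per-frame probability that a new action starts here
    threshold      -- hard threshold on boundary probability (assumption)
    """
    if not frame_labels:
        return []
    # A segment starts at frame 0 and at every frame above the threshold.
    starts = [0] + [i for i in range(1, len(frame_labels))
                    if boundary_probs[i] >= threshold]
    ends = starts[1:] + [len(frame_labels)]
    smoothed = []
    for start, end in zip(starts, ends):
        majority = Counter(frame_labels[start:end]).most_common(1)[0][0]
        smoothed.extend([majority] * (end - start))
    return smoothed


# Toy frame-wise predictions with spurious one-frame flips.
frames = ["reach", "reach", "idle", "reach",
          "transport", "transport", "reach", "transport"]
probs = [0.0, 0.1, 0.2, 0.1, 0.9, 0.1, 0.3, 0.2]

smoothed = smooth_with_boundaries(frames, probs)
print(count_segments(frames), "segments before,",
      count_segments(smoothed), "after")
```

The sketch also shows why very short actions are difficult for this approach: if a true action lasts only a frame or two and its boundary is missed, the majority vote erases it.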


Intrinsic Self-Supervision for Data Quality Audits

Neural Information Processing Systems



Copycats

Neural Information Processing Systems

In the past, MI datasets were frequently proprietary, confined to particular institutions, and stored in private repositories. In this setting, there is a pressing need for alternative models of data sharing, documentation, and governance. Within this context, the emergence of Community-Contributed Platforms (CCPs) presented a potential for the public sharing of medical datasets.