Appendix
We held out a validation set from the training set, and used this validation set to select the L2 regularization hyperparameter from 45 logarithmically spaced values between 10^-6 and 10^5, applied to the sum of the per-example losses. Because the optimization problem is convex, we used the previous weights as a warm start as we increased the L2 regularization hyperparameter. We measured either top-1 or mean per-class accuracy, depending on which was suggested by the dataset creators.

A.3 Fine-tuning

In our fine-tuning experiments in Table 2, we used standard ImageNet-style data augmentation and trained for 20,000 steps with SGD with momentum of 0.9 and cosine annealing [20] without restarts. Each curve represents a different model.
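The warm-started regularization sweep described above can be sketched as follows. This is a minimal numpy illustration, not the authors' actual code: it uses proximal gradient descent on a binary logistic loss (so the update stays stable even at the largest regularization values) and the 45 log-spaced values from the text; all function names are illustrative.

```python
import numpy as np

def fit_logreg_l2(X, y, lam, w0=None, lr=0.1, steps=500):
    """Proximal gradient descent on (sum of logistic losses) + lam/2 * ||w||^2.
    `w0` allows warm-starting from a previous solution; the closed-form
    shrinkage step keeps the update stable even for very large lam."""
    n, d = X.shape
    w = np.zeros(d) if w0 is None else w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))  # predictions
        # forward step on the data term, backward (shrinkage) step on the L2 term
        w = (w - lr * X.T @ (p - y) / n) / (1.0 + lr * lam / n)
    return w

def l2_sweep(X_tr, y_tr, X_val, y_val, lambdas):
    """Sweep over increasing L2 strengths, warm-starting each fit from the
    previous one; return the weights with the best validation accuracy."""
    w, best_w, best_acc = None, None, -1.0
    for lam in lambdas:
        w = fit_logreg_l2(X_tr, y_tr, lam, w0=w)  # warm start
        acc = np.mean((X_val @ w > 0) == y_val)
        if acc > best_acc:
            best_acc, best_w = acc, w.copy()
    return best_w, best_acc

# 45 logarithmically spaced values between 1e-6 and 1e5, as in the text
lambdas = np.logspace(-6, 5, 45)
```

Because each fit starts from the previous optimum and the penalty only grows, each warm-started solve needs far fewer iterations than a cold start would.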
Supplemental Material: CHIP: AHawkes Process Model for Continuous-time Networkswith Scalable and Consistent Estimation
A.1 Community Detection

The spectral clustering algorithm for directed networks that we consider in this paper is shown in Algorithm A.1. It can be applied either to the weighted adjacency (count) matrix N or the unweighted adjacency matrix A, where A_ij = 1{N_ij > 0} and 1{·} denotes the indicator function of its argument. This algorithm is used for the community detection step in our proposed CHIP estimation procedure. For undirected networks, which we use for the theoretical analysis in Section 4, spectral clustering is performed by running k-means clustering on the rows of the eigenvector matrix of N or A, not the rows of the concatenated singular vector matrix.

A.2 Estimation of Hawkes process parameters

Ozaki (1979) derived the log-likelihood function for Hawkes processes with exponential kernels, which takes the form

$$\log L(\mu, \alpha, \beta) = -\mu T - \frac{\alpha}{\beta} \sum_{i=1}^{n} \left[1 - e^{-\beta(T - t_i)}\right] + \sum_{i=1}^{n} \log\left[\mu + \alpha \sum_{t_j < t_i} e^{-\beta(t_i - t_j)}\right]. \tag{A.1}$$

The three parameters µ, α, β can be estimated by maximizing (A.1) using standard numerical methods for non-linear optimization (Nocedal & Wright, 2006). We provide closed-form equations for estimating m_ab = α_ab/β_ab and µ_ab in (2).
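The directed spectral clustering step described above can be sketched in numpy as follows: compute the top-k left and right singular vectors of the adjacency matrix and run k-means on the rows of the concatenated matrix [U | V]. The `kmeans` helper here is a bare-bones Lloyd's iteration used only for illustration; it is a stand-in for a library routine, not Algorithm A.1 itself.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Bare-bones Lloyd's iteration with farthest-point initialization
    (illustrative stand-in for a proper k-means implementation)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])  # farthest point from current centers
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def directed_spectral_clustering(N, k):
    """Cluster nodes of a directed (count) adjacency matrix N into k groups
    by running k-means on the rows of the concatenated top-k singular
    vector matrix [U | V]."""
    U, _, Vt = np.linalg.svd(N)
    Z = np.concatenate([U[:, :k], Vt[:k].T], axis=1)  # n x 2k node embedding
    return kmeans(Z, k)
```

Using both left and right singular vectors lets the embedding capture sending and receiving behavior separately, which is why the undirected case in Section 4 falls back to eigenvectors instead.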
- North America > United States > Ohio (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Supplementary Material
$\int \nabla_\phi q_\phi(z)\,dz = 0$. Thus, the gradient of the log-variance loss becomes equal to the gradient of the KL divergence. Therefore, for large enough D, the condition from Proposition 3 (see Eq. 19) is fulfilled and the statement follows immediately. This result is expected to extend to the multivariate cases as well. For all the experiments listed in the main text, we use the VarGrad estimator for the gradients of the logistic regression models. VarGrad achieves considerable variance reduction over the adaptive (RELAX) and non-adaptive (Controlled Reinforce) model-agnostic estimators.
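As a sketch of how such an estimator can be computed, the snippet below assumes a factorized Bernoulli variational distribution parameterized by logits and estimates the gradient of the log-variance loss as the sample covariance between the score function and the centered log-weights log q − log p. The function name and setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vargrad_estimate(logits, log_p, n_samples=1000, seed=0):
    """VarGrad-style gradient estimate for a factorized Bernoulli q_phi.

    logits : parameters phi of q_phi (one logit per coordinate)
    log_p  : callable mapping samples z of shape (S, d) to log p(z), shape (S,)

    Returns the sample covariance between the centered log-weights
    w_s = log q(z_s) - log p(z_s) and the score d/dphi log q_phi(z_s).
    """
    rng = np.random.default_rng(seed)
    probs = 1.0 / (1.0 + np.exp(-logits))
    z = (rng.random((n_samples, len(logits))) < probs).astype(float)
    log_q = (z * np.log(probs) + (1 - z) * np.log1p(-probs)).sum(axis=1)
    w = log_q - log_p(z)                  # log-weights
    score = z - probs                     # d log q / d logits for Bernoulli
    w_c = w - w.mean()                    # centering removes the baseline
    return (w_c[:, None] * score).sum(axis=0) / (len(w) - 1)
```

A quick sanity check: when p coincides with q, every log-weight is zero, so the estimate vanishes exactly rather than only in expectation, which is one source of its variance reduction.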
Supplement to "Learning Individualized Treatment Rules with Many Treatments: A Supervised Clustering Approach Using Adaptive Fusion"
For parametric models, we assume the linear main effect $M_0(Z) = Z^\top \eta$, where $\eta \in \mathbb{R}^p$. For nonparametric regression, we follow [3] to divide the training data into M folds based on the assigned treatment. In addition, since $\beta_i \in \Theta_n$, we have $\beta_j \in \Theta_n$ as well. Hence, with similar derivations, we have $\|\beta_i - \beta_j\|_1 = \sqrt{2\lambda_n}$. Based on Assumption 4, only the treatments that belong to the same group therefore contribute to $\Gamma_2$.
Appendix
For vision transformers, we train linear probes on representations from individual tokens or on the representation averaged over all tokens, at the output of different transformer layers (each layer meaning a full transformer block including self-attention and MLP). Moreover, ResNets differ from ViTs in that the number of channels changes throughout the model, with fewer channels in the earlier layers. We train a linear probe on each individual token and plot the average accuracy over the test set, in percent. Here we plot the results for each token at a subset of layers in 3 models: ViT-B/32 trained with a classification token (CLS) or global average pooling (GAP), as well as a ResNet50. There are two main observations to be made.
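The per-token probing setup can be sketched as follows, assuming pre-extracted features of shape (examples, tokens, dim). For simplicity the probe here is a ridge regression on one-hot labels rather than a trained logistic-regression probe; the function name and the closed-form solver are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def per_token_probe_accuracy(feats_tr, y_tr, feats_te, y_te, l2=1e-3):
    """Train one linear probe per token position and evaluate each on the
    test set.

    feats_tr, feats_te : (n_examples, n_tokens, dim) token representations
    y_tr, y_te         : integer class labels

    Returns per-token test accuracies and their mean.
    """
    n, t, d = feats_tr.shape
    classes = np.unique(y_tr)
    Y = (y_tr[:, None] == classes[None]).astype(float)  # one-hot targets
    accs = []
    for j in range(t):
        X = feats_tr[:, j]  # features of token j across all examples
        # closed-form ridge regression: (X'X + l2 I) W = X'Y
        W = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)
        pred = classes[np.argmax(feats_te[:, j] @ W, axis=1)]
        accs.append(np.mean(pred == y_te))
    return np.array(accs), float(np.mean(accs))
```

Running this at the output of each transformer block (or ResNet stage) yields one accuracy per token per layer, which is the quantity averaged and plotted above.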