Goto

Collaborating Authors

 Education



Online learning with Erdล‘s-Rรฉnyi side-observation graphs

arXiv.org Machine Learning

We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms beside the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability $r$, independently of each other and the action of the learner. We propose two algorithms that work for different ranges of $r$. We show that after $T$ rounds in a bandit problem with $N$ arms, the expected regret of our first algorithm is $O(\sqrt{(T /r) \log N })$ whenever $r\ge(\log T)/(2N)$, while our second algorithm achieves a regret of $O(\sqrt{(T/r) \log (N+T)})$ for smaller values of $r$. We also give a quick estimation procedure that decides the range of~$r$. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know~$r$.


Spectral bandits

arXiv.org Machine Learning

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to the one of its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose three algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.


Supplementary Materials for the Paper " L2T-DLN: Learning to Teach with Dynamic Loss Network "

Neural Information Processing Systems

In this supplementary material, we provide the proofs of convergence analysis in Section 1, 1-vs-1 transformation employed in the classification and semantic segmentation tasks in Section 2, the coordinate-wise and the preprocessing method of the LSTM teacher in Section 3, the loss functions of YOLO-v3 in Section 4, more experiments of image classification in Section 5, and the inferences of semantic segmentation in Section 6. A differentiable function e()is L-smooth with gradient Lipschitz constant C (uniformly Lipschitz continuous), if e(x) e(y) C x y, x,y. The function is called block-wise smooth with gradient Lipschitz Ci, if i e(x i,xi) ie(x i,x i) Ci xi x i, x,x (1) or with gradient Lipschitz constants { Ci}, if i e(x i,xi) ie(x i,xi) Ci x i x i, x,x (2) Further, Let Gmax max{Ci, Ci, k} C. Definition 2. For a differentiable function e(), if e(x) = 0, then x is a first-order stationary solution (SS1). For a differentiable function e(), if x is a SS1, and there exists ฯต > 0 so that for any y in the ฯต-neighborhood of x, we have e(x) e(y), then xis a local minimum. A saddle point xis an SS1 that is not a local minimum. If ฮปmin( 2e(x)) < 0, x is a strict (non-degenerate) saddle point.


L2T-DLN: Learning to Teach with Dynamic Loss Network

Neural Information Processing Systems

With the concept of teaching being introduced to the machine learning community, a teacher model start using dynamic loss functions to teach the training of a student model. The dynamic intends to set adaptive loss functions to different phases of student model learning. In existing works, the teacher model 1) merely determines the loss function based on the present states of the student model, i.e., disregards the experience of the teacher; 2) only utilizes the states of the student model, e.g., training iteration number and loss/accuracy from training/validation sets, while ignoring the states of the loss function. In this paper, we first formulate the loss adjustment as a temporal task by designing a teacher model with memory units, and, therefore, enables the student learning to be guided by the experience of the teacher model. Then, with a dynamic loss network, we can additionally use the states of the loss to assist the teacher learning in enhancing the interactions between the teacher and the student model. Extensive experiments demonstrate our approach can enhance student learning and improve the performance of various deep models on real-world tasks, including classification, objective detection, and semantic segmentation scenarios.



Sample Complexity of Forecast Aggregation

Neural Information Processing Systems

We consider a Bayesian forecast aggregation model where nexperts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. "samples" from the distribution, where each sample is a tuple of the experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an ฮต-approximately optimal aggregator, where optimality is measured in terms of the expected squared distance between the aggregated prediction and the realization of the event. We show that the sample complexity of this problem is at least โ„ฆ(mn 2/ฮต) for arbitrary discrete distributions, where m is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts n. But, if the experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to O(1/ฮต2), which does not depend on n. Our results can be generalized to non-binary events. The proof of our results uses a reduction from the distribution learning problem and reveals the fact that forecast aggregation is almost as difficult as distribution learning.




6a42b45af2b72e6e5b5e3a6fe695809f-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems

The model can easily distinguish A and B according to the background (i.e., the so-called geometric skews [26]), but not according to the features of the class instance itself. However, if there is another class C, which is also in black background. In this tri-classification task (distinguishing A,B, and C), an ideal model should focus on the feature of the instance itself but not the background. This is one of the difficulties: distribution bias on samples, that some beneficial features (e.g., background) may be good for the classification, but not good for understanding the class (in a compositional way). Another difficulty is entanglement of the labels. We provide the labels in a relative way that the label of A is '0' and of B is '1', but not their true textual meanings (e.g., white paper and green leaves). The concept information is entangled and embedded into the label, thus, it is hard for the model to tell which visual features capture the corresponding concepts (i.e., white refers to the color feature and paper refers to the texture feature). We hope our understanding of this issue can inspire researchers to focus more on compositionality and design excellent continual learners.