 dml



Shadow Knowledge Distillation Bridging and Online Knowledge Transfer

Neural Information Processing Systems

Knowledge distillation can generally be divided into offline and online categories according to whether the teacher model is pre-trained and persistent during the distillation process. Offline distillation can employ existing models, yet it always demonstrates inferior performance to online distillation. In this paper, we first empirically show that the essential factor behind this performance gap is the reversed distillation from student to teacher, rather than the training fashion. Offline distillation can achieve a competitive performance gain by fine-tuning the pre-trained teacher to adapt to the student with such reversed distillation. However, this fine-tuning process still consumes a large training budget.


HyperPrism: An Adaptive Non-linear Aggregation Framework for Distributed Machine Learning over Non-IID Data and Time-varying Communication Links

Neural Information Processing Systems

While Distributed Machine Learning (DML) has been widely used to achieve decent performance, it is still challenging to take full advantage of data and devices distributed at multiple vantage points to adapt and learn. In particular, it is non-trivial to address the dynamic and divergence challenges that arise under the linear aggregation framework: (1) heterogeneous learning data at different devices (i.e., non-IID data), which results in model divergence, and (2) under time-varying communication links, the limited ability of devices to reconcile model divergence. In this paper, we contribute a non-linear class aggregation framework, HyperPrism, that leverages distributed mirror descent with averaging done in the mirror descent dual space and adapts the degree of the Weighted Power Mean (WPM) used in each round. Moreover, HyperPrism can adaptively choose a different mapping for different layers of the local model with a dedicated hypernetwork per device, achieving automatic optimization of DML in high-divergence settings. We perform rigorous analysis and experimental evaluations to demonstrate the effectiveness of adaptive, mirror-mapping DML. In particular, we extend the generalizability of existing related works and position them as special cases within HyperPrism. Our experimental results show that HyperPrism can improve convergence speed by up to 98.63% and scales well to more devices compared with the state of the art, all with little additional computation overhead compared to traditional linear aggregation.
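The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of the standard weighted power mean on positive scalars, not HyperPrism's actual implementation. The function name `weighted_power_mean` and its arguments are hypothetical, chosen for illustration. It shows how the degree `p` interpolates between familiar means, with `p = 1` recovering the linear (weighted arithmetic) aggregation that the abstract contrasts against.

```python
import math


def weighted_power_mean(values, weights, p):
    """Weighted power mean of positive values with degree p.

    Illustrative assumption: values are strictly positive scalars and
    weights are non-negative with a positive sum.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")

    # Normalize the weights so they sum to one.
    norm = [w / total for w in weights]

    if p == 0:
        # The limit p -> 0 is the weighted geometric mean.
        return math.exp(sum(w * math.log(v) for w, v in zip(norm, values)))

    # General case: (sum_i w_i * x_i^p)^(1/p).
    return sum(w * v ** p for w, v in zip(norm, values)) ** (1.0 / p)


if __name__ == "__main__":
    values = [1.0, 4.0]
    weights = [1.0, 1.0]
    for degree in (-1, 0, 1, 2):
        print(degree, weighted_power_mean(values, weights, degree))
```

With equal weights on the values 1 and 4, degree `1` gives the arithmetic mean, degree `0` the geometric mean, and degree `-1` the harmonic mean, so changing the degree per round changes how strongly large or small values dominate the aggregate.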



BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Neural Information Processing Systems

In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper, we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than the current practice. BML runs on a BCube network instead of the traditional Fat-Tree topology.


Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning

Neural Information Processing Systems

However, common evaluation protocols only test a single, fixed data split in which train and test classes are assigned randomly. More realistic evaluations should consider a broad spectrum of distribution shifts with potentially varying degree and difficulty. In this work, we systematically construct train-test splits of increasing difficulty and present the ooDML benchmark to characterize generalization under out-of-distribution shifts in DML. ooDML is


Learning Distinct and Representative Modes for Image Captioning

Neural Information Processing Systems

While mode collapse is typically a side effect for generative modeling, it is somewhat "welcomed" in SoTA image captioning models as it usually facilitates a higher evaluation performance on reference-based metrics like CIDEr, BLEU and SPICE.




The Effects of Flipped Classrooms in Higher Education: A Causal Machine Learning Analysis

arXiv.org Machine Learning

This study uses double/debiased machine learning (DML) to evaluate the impact of transitioning from lecture-based blended teaching to a flipped classroom concept. Our findings indicate effects on students' self-conception, procrastination, and enjoyment. We do not find significant positive effects on exam scores, passing rates, or knowledge retention. This can be explained by the insufficient use of the instructional approach that we can identify with uniquely detailed usage data and highlights the need for additional teaching strategies. Methodologically, we propose a powerful DML approach that acknowledges the latent structure inherent in Likert scale variables and, hence, aligns with psychometric principles.