Plotting

Reviewer 1: " the statement in line 153 in the neighbourhood of z J

Neural Information Processing Systems

We are grateful to the reviewers for the insightful comments on our submission. All the minor comments will also be addressed in the revised manuscript. Response: We appreciate the reviewer's comment and suggestion. We will update line 153 to "f(x) does not change The domain of z can be easily adjusted by translation and dilation after the training process. Reviewer 1: "emphasize the need for gradient evaluations when you state the observation." Response: The observation statement (line 118-120) will be updated to "For a fixed pair (x, z) satisfying z = g(x), if Response: We agree with both reviewers that the caption of Figure 2 is not very clear.


Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers (Jinsong Chen, John E. Hopcroft)

Neural Information Processing Systems

While tokenized graph Transformers have demonstrated strong performance in node classification tasks, their reliance on a limited subset of nodes with high similarity scores for constructing token sequences overlooks valuable information from other nodes, hindering their ability to fully harness graph information for learning optimal node representations. To address this limitation, we propose a novel graph Transformer called GCFormer. Unlike previous approaches, GCFormer develops a hybrid token generator to create two types of token sequences, positive and negative, to capture diverse graph information. A tailored Transformer-based backbone is then adopted to learn meaningful node representations from these generated token sequences. Additionally, GCFormer introduces contrastive learning to extract valuable information from both positive and negative token sequences, enhancing the quality of learned node representations. Extensive experimental results across various datasets, including homophily and heterophily graphs, demonstrate the superiority of GCFormer in node classification compared to representative graph neural networks (GNNs) and graph Transformers.
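As a concrete illustration of the contrastive component described above, the following minimal PyTorch sketch shows an InfoNCE-style loss between a node representation and its positive/negative token sequences; the function name, the mean-pooling choice, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_token_loss(anchor, pos_tokens, neg_tokens, temperature=0.5):
    """InfoNCE-style loss: pull a node's representation toward its positive
    token sequence and push it away from its negative token sequence.

    anchor:     (N, d)     node representations from the Transformer backbone
    pos_tokens: (N, Lp, d) representations of the positive token sequences
    neg_tokens: (N, Ln, d) representations of the negative token sequences
    """
    anchor = F.normalize(anchor, dim=-1)
    pos = F.normalize(pos_tokens.mean(dim=1), dim=-1)   # pool the positive sequence
    neg = F.normalize(neg_tokens, dim=-1)                # keep individual negatives

    pos_logit = (anchor * pos).sum(-1, keepdim=True) / temperature      # (N, 1)
    neg_logits = torch.einsum("nd,nld->nl", anchor, neg) / temperature  # (N, Ln)

    logits = torch.cat([pos_logit, neg_logits], dim=1)       # (N, 1 + Ln)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors.
loss = contrastive_token_loss(torch.randn(8, 64), torch.randn(8, 5, 64), torch.randn(8, 5, 64))
```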


Sub-optimal Experts mitigate Ambiguity in Inverse Reinforcement Learning

Neural Information Processing Systems

Inverse Reinforcement Learning (IRL) deals with the problem of deducing a reward function that explains the behavior of an expert agent who is assumed to act optimally in an underlying unknown task. Recent works have studied the IRL problem from the perspective of recovering the feasible reward set, i.e., the class of reward functions that are compatible with a unique optimal expert. However, in several problems of interest it is possible to observe the behavior of multiple experts with different degrees of optimality (e.g., racing drivers whose skills range from amateur to professional). For this reason, in this work, we focus on the reconstruction of the feasible reward set when, in addition to demonstrations from the optimal expert, we observe the behavior of multiple sub-optimal experts. Given this setting, we first study its theoretical properties, showing that the presence of multiple sub-optimal experts, in addition to the optimal one, can significantly shrink the set of compatible rewards, ultimately mitigating the inherent ambiguity of IRL. Furthermore, we study the statistical complexity of estimating the feasible reward set with a generative model and analyze a uniform sampling algorithm that turns out to be minimax optimal whenever the sub-optimal experts' performance level is sufficiently close to that of the optimal expert.
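As a rough illustration of how sub-optimal experts constrain the feasible reward set, the tabular NumPy sketch below keeps a candidate reward only if every observed expert is within an assumed performance gap of optimal under that reward; the function names, the known gap values, and the finite-MDP setting are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def policy_value(P, r, pi, gamma=0.9):
    """Evaluate a deterministic policy pi (integer array of length S) on a
    tabular MDP with transitions P of shape (S, A, S) and rewards r of shape (S, A)."""
    S = r.shape[0]
    P_pi = P[np.arange(S), pi]          # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]          # (S,)  rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def optimal_value(P, r, gamma=0.9, iters=500):
    """Value iteration; returns the optimal value function V*."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = np.max(r + gamma * P @ V, axis=1)
    return V

def is_feasible(P, r, expert_policies, gaps, gamma=0.9, tol=1e-6):
    """Keep a candidate reward r only if every expert i is at most gaps[i]
    away from optimal under r (gap 0 for the optimal expert). Each additional
    sub-optimal expert adds constraints, shrinking the feasible reward set."""
    V_star = optimal_value(P, r, gamma)
    for pi, xi in zip(expert_policies, gaps):
        if np.max(V_star - policy_value(P, r, pi, gamma)) > xi + tol:
            return False
    return True
```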


On the convergence of single-call stochastic extra-gradient methods

Neural Information Processing Systems

Variational inequalities have recently attracted considerable interest in machine learning as a flexible paradigm for models that go beyond ordinary loss function minimization (such as generative adversarial networks and related deep learning systems). In this setting, the optimal O(1/t) convergence rate for solving smooth monotone variational inequalities is achieved by the Extra-Gradient (EG) algorithm and its variants. Aiming to alleviate the cost of an extra gradient step per iteration (which can become quite substantial in deep learning applications), several algorithms have been proposed as surrogates to Extra-Gradient with a single oracle call per iteration. In this paper, we develop a synthetic view of such algorithms, and we complement the existing literature by showing that they retain an O(1/t) ergodic convergence rate in smooth, deterministic problems. Subsequently, beyond the monotone deterministic case, we also show that the last iterate of single-call, stochastic extra-gradient methods still enjoys an O(1/t) local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition.
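The single-call idea can be illustrated with Popov's (past / optimistic) extra-gradient update on a toy bilinear problem: the extrapolation step reuses the operator value from the previous iterate, so each iteration makes only one oracle call. The step size, iteration count, and averaging scheme below are illustrative choices rather than the exact algorithms analyzed in the paper.

```python
import numpy as np

def past_extra_gradient(operator, z0, step=0.1, iters=2000):
    """Single-call variant of Extra-Gradient (Popov's / the optimistic update):
    extrapolate with the operator value from the previous half-step, then make
    the one oracle call of the current iteration."""
    z, g_prev = z0.copy(), operator(z0)
    avg = np.zeros_like(z0)
    for t in range(1, iters + 1):
        z_half = z - step * g_prev          # extrapolation reusing the old operator value
        g = operator(z_half)                # the single oracle call of this iteration
        z = z - step * g
        g_prev = g
        avg += (z_half - avg) / t           # ergodic average, which enjoys the O(1/t) rate
    return avg

# Toy bilinear saddle point min_x max_y x*y, whose operator F(z) = (y, -x) is monotone.
F = lambda z: np.array([z[1], -z[0]])
print(past_extra_gradient(F, np.array([1.0, 1.0])))   # converges toward the solution (0, 0)
```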


A-FedPD: Aligning Dual-Drift is All Federated Primal-Dual Learning Needs

Neural Information Processing Systems

As a popular paradigm for balancing data privacy and collaborative training, federated learning (FL) is flourishing as a way to process large-scale heterogeneous datasets distributed across edge clients. Due to bandwidth limitations and security considerations, FL ingeniously splits the original problem into multiple subproblems to be solved in parallel, which gives primal-dual solutions great practical value in FL. In this paper, we review the recent development of classical federated primal-dual methods and point out a serious common defect of such methods in non-convex scenarios, which we call "dual drift": it is caused by the dual hysteresis of long-inactive clients under partial participation training. To address this problem, we propose a novel Aligned Federated Primal-Dual (A-FedPD) method, which constructs virtual dual updates to align the global consensus and the local dual variables of clients that have gone unselected for a long time. Meanwhile, we provide a comprehensive analysis of the optimization and generalization efficiency of the A-FedPD method on smooth non-convex objectives, which confirms its high efficiency and practicality. Extensive experiments are conducted on several classical FL setups to validate the effectiveness of the proposed method.
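To make the "dual drift" issue concrete, here is a rough NumPy sketch of a generic federated primal-dual (FedADMM-style) round with partial participation; the hyperparameters, the client dictionary layout, and the consensus rule are assumptions for illustration, and the virtual dual update that A-FedPD actually uses is only referenced in a comment, not implemented.

```python
import numpy as np

def client_update(grad_fn, w_global, w_local, dual, rho=1.0, lr=0.1, local_steps=10):
    """One client's augmented-Lagrangian (primal-dual) update:
    locally minimize f_i(w) + <dual, w - w_global> + (rho/2) * ||w - w_global||^2,
    then take a dual ascent step on the consensus constraint w = w_global."""
    w = w_local.copy()
    for _ in range(local_steps):
        g = grad_fn(w) + dual + rho * (w - w_global)
        w -= lr * g
    dual = dual + rho * (w - w_global)
    return w, dual

def server_round(clients, w_global, rho=1.0, participation=0.3, rng=np.random):
    """One communication round with partial participation. Clients that are not
    selected keep a stale dual variable -- the "dual drift" the abstract refers to.
    A-FedPD's idea is to virtually align those stale duals with the global consensus
    (the precise virtual update is defined in the paper, not here)."""
    active = [c for c in clients if rng.random() < participation]
    for c in active:
        c["w"], c["dual"] = client_update(c["grad_fn"], w_global, c["w"], c["dual"], rho)
    # Consensus update: aggregate primal and dual information from all clients.
    return np.mean([c["w"] + c["dual"] / rho for c in clients], axis=0)
```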



General response to all reviewers regarding empirical study of SCG++

Neural Information Processing Systems

We thank the reviewers for their careful consideration and constructive feedback. Below, please find our responses.

Q1: Since no experimental evidence is provided... A1: Please see our general response above.

Q2: If F is available in closed form and its gradients can be computed exactly... A2: If the gradients can be computed exactly, then in this case both SCG and SCG++ reach a (1 - 1/e - ε)-optimal solution with O(1/ε) oracle calls.

Q4: Extension to the discrete setting: I do not understand how one would compute the multilinear extension efficiently.
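Regarding Q4, one standard way to handle the multilinear extension is a Monte-Carlo estimate: sample random sets from the product distribution induced by the fractional point and average the set-function values. The sketch below, using an illustrative coverage-style submodular function, shows such an estimator; it is not code from the SCG++ paper.

```python
import numpy as np

def multilinear_extension(f, x, num_samples=200, rng=None):
    """Monte-Carlo estimate of the multilinear extension F(x) = E[f(R_x)],
    where R_x contains each element i independently with probability x[i];
    f maps a boolean membership vector to the set-function value."""
    rng = np.random.default_rng(0) if rng is None else rng
    estimates = [f(rng.random(len(x)) < x) for _ in range(num_samples)]
    return float(np.mean(estimates))

# Toy monotone submodular function: square root of a modular (weighted) function.
weights = np.array([1.0, 2.0, 0.5, 3.0])
f = lambda s: float(np.sqrt(weights[s].sum()))
print(multilinear_extension(f, np.array([0.5, 0.1, 0.9, 0.3])))
```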


GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series

Neural Information Processing Systems

Modeling real-world multidimensional time series can be particularly challenging when these are sporadically observed (i.e., sampling is irregular both in time and across dimensions), such as in the case of clinical patient data. To address these challenges, we propose (1) a continuous-time version of the Gated Recurrent Unit, building upon the recent Neural Ordinary Differential Equations (Chen et al., 2018), and (2) a Bayesian update network that processes the sporadic observations. We bring these two ideas together in our GRU-ODE-Bayes method. We then demonstrate that the proposed method encodes a continuity prior for the latent process and that it can exactly represent the Fokker-Planck dynamics of complex processes driven by a multidimensional stochastic differential equation. Additionally, empirical evaluation shows that our method outperforms the state of the art on both synthetic data and real-world data, with applications in healthcare and climate forecasting. Moreover, the continuity prior is shown to be well suited to settings with few samples.
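A minimal PyTorch sketch of a continuous-time GRU cell in the spirit of GRU-ODE, integrated with a single explicit Euler step between observations; the layer sizes, the absence of an input term, and the crude integration scheme are simplifying assumptions, and the Bayesian update network is omitted entirely.

```python
import torch
import torch.nn as nn

class GRUODECell(nn.Module):
    """Continuous-time GRU: instead of the discrete update h <- z*h + (1-z)*g,
    the hidden state follows the ODE dh/dt = (1 - z) * (g - h),
    integrated here with a simple explicit Euler step."""

    def __init__(self, hidden_size):
        super().__init__()
        self.lin_r = nn.Linear(hidden_size, hidden_size)
        self.lin_z = nn.Linear(hidden_size, hidden_size)
        self.lin_g = nn.Linear(hidden_size, hidden_size)

    def forward(self, h, dt):
        r = torch.sigmoid(self.lin_r(h))       # reset gate
        z = torch.sigmoid(self.lin_z(h))       # update gate
        g = torch.tanh(self.lin_g(r * h))      # candidate state
        dh = (1.0 - z) * (g - h)                # vanishes when g == h: a continuity prior
        return h + dt * dh                      # Euler integration over the time gap dt

h = torch.zeros(1, 32)
cell = GRUODECell(32)
h = cell(h, dt=0.1)   # evolve the hidden state between two sporadic observations
```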


GRU-ODE-Bayes: Author Feedback

Neural Information Processing Systems

We thank the reviewers for the relevant comments. We first address the general questions and then give brief individual answers. GRU-ODE integrates the dynamics of the hidden process h(t) in time; the projected distributions vary smoothly as they are driven by an ODE. Point processes (Mei and Eisner, NeurIPS 2017; Gunawardana et al., NeurIPS 2011) are intrinsically continuous, as are continuous-time Bayesian networks (Nodelman et al., UAI 2002). This joint modeling of continuous measurements and events was left for future work.


SocraticLM: Exploring Socratic Personalized Teaching with Large Language Models

Neural Information Processing Systems

Large language models (LLMs) are considered a crucial technology for advancing intelligent education, since they exhibit the potential for an in-depth understanding of teaching scenarios and for providing students with personalized guidance. Nonetheless, current LLM-based applications in personalized teaching predominantly follow a "Question-Answering" paradigm, where students are passively provided with answers and explanations. In this paper, we propose SocraticLM, which achieves a Socratic "Thought-Provoking" teaching paradigm that fulfills the role of a real classroom teacher in actively engaging students in the thought process required for genuine problem-solving mastery. To build SocraticLM, we first propose a novel "Dean-Teacher-Student" multi-agent pipeline to construct a new dataset, SocraTeach, which contains 35K meticulously crafted Socratic-style multi-round (equivalent to 208K single-round) teaching dialogues grounded in fundamental mathematical problems. Our dataset simulates authentic teaching scenarios, covering interactions with six representative types of simulated students with different cognitive states and strengthening four crucial teaching abilities. SocraticLM is then fine-tuned on SocraTeach with three strategies balancing its teaching and reasoning abilities. Moreover, we contribute a comprehensive evaluation system encompassing five pedagogical dimensions for assessing the teaching quality of LLMs. Extensive experiments verify that SocraticLM achieves significant improvements in teaching performance, outperforming GPT-4 by more than 12%.
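To make the dataset construction concrete, here is a hypothetical sketch of what one multi-round Socratic teaching record and its conversion into single-round training pairs could look like; the field names, dialogue content, and flattening rule are illustrative assumptions, not the released SocraTeach schema.

```python
# A hypothetical record illustrating a multi-round Socratic teaching dialogue;
# the fields and content are invented for illustration only.
example_dialogue = {
    "problem": "Tom has 3 bags with 4 apples each. How many apples does he have?",
    "student_profile": "knows addition, unsure about multiplication",
    "turns": [
        {"role": "teacher", "text": "What do we know about each bag?"},
        {"role": "student", "text": "Each bag has 4 apples."},
        {"role": "teacher", "text": "Good. If we had just 2 bags, how would you count the apples?"},
        {"role": "student", "text": "4 + 4 = 8."},
        {"role": "teacher", "text": "Exactly. So with 3 bags, what operation repeats that idea?"},
    ],
}

# Flatten the multi-round dialogue into (context, next_teacher_utterance) pairs,
# which is one way a multi-round dialogue can yield several single-round examples.
pairs = [
    (example_dialogue["turns"][:i], turn["text"])
    for i, turn in enumerate(example_dialogue["turns"])
    if turn["role"] == "teacher"
]
print(len(pairs))   # 3 teacher turns -> 3 supervised examples
```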