A Appendix
A.1 More Ablations and Visualizations
Effect of Blocking Gradient of f(s)
As mentioned in Section 3.2, we compare the performance of different detectors with and without blocking the gradient of f(s). Performance degrades when the gradient is not blocked; we attribute this to unstable training caused by the gradient flowing from the denominator, so it is blocked by default in the experiments. Figure 1 visualizes the searched parameterized functions for different detectors on the COCO benchmark [5]; the dots on each line mark the control points of each parameterized function. It can be observed that the searched loss functions differ across detectors, suggesting that the intrinsic differences between detectors lead to distinct optimal loss functions.
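The blocking operation itself is a stop-gradient on the denominator. Below is a minimal PyTorch sketch of the idea; the pairwise form of the ranking term and the use of torch.sigmoid as a stand-in for the searched f are assumptions for illustration, not the paper's implementation.

```python
import torch

def ranking_term(s_i, s_j, f=torch.sigmoid, block_denominator_grad=True):
    """Normalized ranking term f(s_j - s_i) / sum_k f(s_k - s_i).

    s_i: score of the anchor sample (scalar tensor)
    s_j: scores of the samples it is ranked against (1-D tensor)
    f:   a parameterized comparison function (sigmoid as a stand-in)
    """
    num = f(s_j - s_i)
    den = num.sum()
    if block_denominator_grad:
        # Stop-gradient: the denominator still normalizes the term,
        # but no longer contributes a (destabilizing) gradient.
        den = den.detach()
    return num / den

s = torch.randn(8, requires_grad=True)
loss = ranking_term(s[0], s[1:]).sum()
loss.backward()
```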
A Label Noise: Effect of Identical Patches
Here, we show that false negatives that are identical to the positive (for example, patches of the sky) do not change the sign of the gradient associated with the positive. Let $q$ be the query, $u$ the positive, and $V$ the set of negatives, and let $V_u \subseteq V$ denote the negatives identical to $u$. Writing $p_x = \exp(q^\top x) / \big(\exp(q^\top u) + \sum_{v \in V} \exp(q^\top v)\big)$ for the softmax weight of an embedding $x$, the gradient of the contrastive loss with respect to $q$ is $\nabla_q \mathcal{L} = -\big(1 - p_u - \sum_{v \in V_u} p_v\big)\,u + \sum_{v \in V \setminus V_u} p_v\, v$. Since the softmax weights sum to one, the coefficient on $u$ equals $-\sum_{v \in V \setminus V_u} p_v \le 0$: negatives identical to the positive rescale the positive gradient but cannot reverse its sign.

The proposed method outperforms many supervised methods for video object segmentation, despite relying on a simple label propagation algorithm, not being trained for object segmentation, and not training on the DAVIS dataset. We also show comparisons to pretrained feature baselines with larger networks.
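The claim is easy to verify numerically. In the sketch below (random embeddings and a standard InfoNCE form with temperature tau = 0.07; all values are illustrative assumptions), four of the ten negatives are made identical to the positive, and the coefficient multiplying u in dL/dq stays negative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_neg, tau = 16, 10, 0.07

q = torch.randn(d)
u = F.normalize(torch.randn(d), dim=0)         # positive
v = F.normalize(torch.randn(n_neg, d), dim=1)  # negatives
v[:4] = u                                      # four negatives identical to u

# InfoNCE logits: positive first, negatives after
logits = torch.cat([(q @ u).view(1), v @ q]) / tau
p = logits.softmax(dim=0)

# dL/dq = (1/tau) * (-(1 - p_u) u + sum_j p_j v_j); collecting every
# copy of u, the coefficient on u is -(1 - p_u - sum_{v=u} p_v) / tau.
coef_u = -(1 - p[0] - p[1:5].sum()) / tau
print(coef_u.item())  # < 0: q is still pulled toward u, only less strongly
```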
Space-Time Correspondence as a Contrastive Random Walk
This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines transition probability of a random walk, such that prediction of long-range correspondence is computed as a walk along the graph. We optimize the representation to place high probability along paths of similarity. Targets for learning are formed without supervision, by cycle-consistency: the objective is to maximize the likelihood of returning to the initial node when walking along a graph constructed from a palindrome of frames. Thus, a single path-level constraint implicitly supervises chains of intermediate comparisons. When used as a similarity metric without adaptation, the learned representation outperforms the self-supervised state-of-the-art on label propagation tasks involving objects, semantic parts, and pose. Moreover, we demonstrate that a technique we call edge dropout, as well as self-supervised adaptation at test-time, further improve transfer for object-centric correspondence.
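To make the objective concrete, here is a minimal sketch of the palindrome walk loss. The patch counts, temperature, and edge-dropout rate are illustrative, and this is a paraphrase of the idea rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def transition(x_t, x_t1, tau=0.07, edge_dropout=0.0):
    """Row-stochastic transition matrix between the patch embeddings of
    two adjacent frames: A[i, j] = P(walker moves from node i to node j)."""
    sim = F.normalize(x_t, dim=-1) @ F.normalize(x_t1, dim=-1).T
    A = (sim / tau).softmax(dim=-1)
    if edge_dropout > 0:  # randomly drop edges, then renormalize rows
        A = A * (torch.rand_like(A) > edge_dropout)
        A = A / A.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return A

def crw_loss(frames, tau=0.07, edge_dropout=0.1):
    """frames: list of (num_patches, dim) embeddings for t = 0, ..., T.
    Walk forward through the frames and back again (a palindrome), then
    maximize the probability that each walker returns to its start node."""
    palindrome = frames + frames[-2::-1]
    A = torch.eye(frames[0].shape[0])
    for x_t, x_t1 in zip(palindrome[:-1], palindrome[1:]):
        A = A @ transition(x_t, x_t1, tau, edge_dropout)
    target = torch.arange(A.shape[0])  # node i should return to node i
    return F.nll_loss(torch.log(A + 1e-8), target)

frames = [torch.randn(49, 128, requires_grad=True) for _ in range(4)]
loss = crw_loss(frames)
loss.backward()
```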
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse patterns) that can be leveraged for efficient sparse computation on GPUs.
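As a toy illustration of one of these patterns, the sketch below builds a dense boolean mask for a Vertical-Slash layout. The index choices, sizes, and dense masking are invented for readability; MInference estimates the sparse indices online and runs custom GPU kernels rather than materializing a dense mask.

```python
import torch

def vertical_slash_mask(seq_len, vertical_cols, slash_offsets):
    """Boolean causal attention mask for the Vertical-Slash pattern:
    a few columns every query attends to (verticals), plus a few
    diagonals at fixed query-key offsets (slashes)."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, vertical_cols] = True                 # vertical lines
    rows = torch.arange(seq_len)
    for off in slash_offsets:                     # diagonal key = query - off
        cols = rows - off
        keep = cols >= 0
        mask[rows[keep], cols[keep]] = True
    return torch.tril(mask.long()).bool()         # enforce causality

# attend to the first 4 tokens plus diagonals at offsets 0 (self),
# 1 (local) and 64 (a periodic stride) -- all illustrative choices
m = vertical_slash_mask(256, [0, 1, 2, 3], [0, 1, 64])
scores = torch.randn(256, 256).masked_fill(~m, float("-inf"))
attn = scores.softmax(dim=-1)  # a sparse kernel would compute only the unmasked entries
```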
The success of AlphaZero (AZ) has demonstrated that neural-network-based Go AIs can surpass human performance by a large margin. However, do these superhuman AZ agents truly learn general basic knowledge that applies to any legal state? In this paper, we first extend the concept of adversarial examples to the game of Go: we generate perturbed states that are "semantically" equivalent to the original state by adding meaningless actions to the game, and an adversarial state is a perturbed state leading to an undoubtedly inferior action that is obvious even to amateur players. However, searching for adversarial states is challenging due to the large, discrete, and non-differentiable search space. To tackle this challenge, we develop the first adversarial attack on Go AIs that can efficiently search for adversarial states by strategically reducing the search space. The method also extends to other board games such as NoGo. Experimentally, we show that both the Policy-Value neural network (PV-NN) and Monte Carlo tree search (MCTS) can be misled by adding one or two meaningless stones; for example, on 58% of the AlphaGo Zero self-play games, our method can make the widely used KataGo agent, with 50 simulations of MCTS, play a losing action by adding two meaningless stones. We additionally evaluated the adversarial examples found by our algorithm with amateur human Go players, and 90% of the examples indeed lead the Go agent to play an obviously inferior action. Our code is available at https://PaperCode.cc/GoAttack.
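In outline, the attack amounts to searching a space of small perturbations and querying the victim agent. The sketch below is a hypothetical brute-force version of that loop: the state API, agent interface, and the two predicates are all invented names, and the paper's contribution is precisely the strategic pruning of this search space, which a naive enumeration like this lacks.

```python
import itertools

def find_adversarial_state(state, agent, is_meaningless, is_losing, max_stones=2):
    """Try adding up to `max_stones` 'meaningless' stones to a Go state
    and return a perturbation that flips the agent's move to a losing
    action. All interfaces here (state.legal_points, state.add_stones,
    agent.best_move, and the two predicates) are hypothetical."""
    candidates = [p for p in state.legal_points() if is_meaningless(state, p)]
    for k in range(1, max_stones + 1):
        for points in itertools.combinations(candidates, k):
            perturbed = state.add_stones(points)  # semantically equivalent state
            move = agent.best_move(perturbed)
            if is_losing(perturbed, move):        # verified inferior action
                return perturbed, move
    return None
```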