An Image is Worth More Than a Thousand Words: Towards Disentanglement in The Wild
Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is required for only a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To address this, we propose a method for disentangling a set of factors that are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, motivates leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g., human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
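The zero-shot annotation step described above can be sketched as scoring each image against a pair of contrastive text prompts and taking the argmax as a pseudo-label. This is a minimal illustration, not the paper's implementation; with OpenAI's CLIP package, the first three arguments would come from `model, preprocess = clip.load("ViT-B/32")` and `clip.tokenize`, and the prompt wording is an assumption:

```python
import torch

def clip_pseudo_label(model, tokenize, preprocess, images, prompts):
    """Zero-shot attribute annotation with a CLIP-style language-image model.

    Example prompts for a hypothetical 'glasses' attribute:
        ["a photo of a person wearing glasses",
         "a photo of a person without glasses"]
    Returns one prompt index (the pseudo-label) per image.
    """
    text = tokenize(prompts)
    # Preprocess and batch the images
    batch = torch.stack([preprocess(im) for im in images])
    with torch.no_grad():
        # CLIP-style models return (logits_per_image, logits_per_text)
        logits_per_image, _ = model(batch, text)
    # Highest-scoring prompt wins
    return logits_per_image.softmax(dim=-1).argmax(dim=-1)
```

Because annotation is per-attribute, the same routine can be re-run with a different prompt pair for each attribute of interest, leaving all remaining factors unannotated as residual factors.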
Multi-modal Queried Object Detection in the Wild
We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both textual descriptions, with their open-set generalization, and visual exemplars, with their rich description granularity, as category queries: Multi-modal Queried object Detection for real-world detection with both open-vocabulary categories and various granularities. MQ-Det incorporates vision queries into existing, well-established language-queried-only detectors. We propose a plug-and-play gated class-scalable perceiver module on top of the frozen detector to augment category text with class-wise visual information. To address the learning inertia problem introduced by the frozen detector, we propose a vision-conditioned masked language prediction strategy. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by an average of +6.3% AP on 13 few-shot downstream tasks, requiring only 3% additional modulation time compared to GLIP.
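The gated augmentation of text queries with visual exemplars can be illustrated with a small cross-attention module. This is a sketch in the spirit of the gated class-scalable perceiver, not the paper's exact design; the dimensions, head count, and zero-initialized gate are assumptions:

```python
import torch
import torch.nn as nn

class GatedVisionPerceiver(nn.Module):
    """Sketch of gated, class-wise visual conditioning for a frozen
    language-queried detector: text category queries cross-attend to
    visual exemplar features, and a learnable gate (initialized to zero)
    lets training start from the frozen detector's original text queries."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero gate -> identity at init

    def forward(self, text_queries, vision_exemplars):
        # text_queries:     (batch, num_classes, dim)
        # vision_exemplars: (batch, num_exemplars, dim)
        augmented, _ = self.attn(text_queries, vision_exemplars, vision_exemplars)
        # Residual connection; tanh keeps the gate bounded
        return text_queries + torch.tanh(self.gate) * augmented
```

Starting from an identity mapping is a common trick when bolting trainable modules onto a frozen backbone, since it avoids disturbing the pre-trained detector's behavior at the start of training.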
On the Importance of Gradients for Detecting Distributional Shifts in the Wild
Detecting out-of-distribution (OOD) data has become a critical component in ensuring the safe deployment of machine learning models in the real world. Existing OOD detection approaches primarily rely on the output or feature space for deriving OOD scores, while largely overlooking information from the gradient space. In this paper, we present GradNorm, a simple and effective approach for detecting OOD inputs by utilizing information extracted from the gradient space. GradNorm directly employs the vector norm of gradients, backpropagated from the KL divergence between the softmax output and a uniform probability distribution. Our key idea is that the magnitude of gradients is higher for in-distribution (ID) data than for OOD data, making it informative for OOD detection. GradNorm demonstrates superior performance, reducing the average FPR95 by up to 16.33% compared to the previous best method.
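The score described above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' code: it assumes the classifier exposes its final linear layer as `model.fc` (true for torchvision ResNets; other architectures need the attribute adapted), and uses the L1 norm of that layer's weight gradients:

```python
import torch
import torch.nn.functional as F

def gradnorm_score(model, x, temperature=1.0):
    """GradNorm-style OOD score: backprop the KL divergence between
    the softmax output and a uniform distribution, then score with the
    norm of the last layer's weight gradients."""
    model.zero_grad()
    logits = model(x)
    # KL(uniform || softmax) equals -mean(log_softmax) up to an additive
    # constant, so both have the same gradient
    loss = -F.log_softmax(logits / temperature, dim=-1).mean()
    loss.backward()
    # Larger gradient norm -> more likely in-distribution
    return model.fc.weight.grad.abs().sum().item()
```

At test time, inputs whose score falls below a threshold (chosen on held-out ID data) would be flagged as OOD.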
MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data.
Young birds get by with a little help from their…siblings
Parents are not the only ones who teach important survival skills. These special relationships can be filled with everything from fun and joy to cruel pranks and teasing. Witnessing each other's childhoods and sharing parents along with family secrets and advice makes it a relationship that is truly unlike any other. This bond is also not unique to our species, according to a new study published today in the journal .
Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets
Tommaso Bendinelli, Artur Dox, Christian Holz
High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources, which can severely degrade model performance. Detecting and correcting these issues typically requires tailor-made solutions and extensive domain expertise. Consequently, automation is challenging, rendering the process labor-intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm, without the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding the data distribution across multiple rows, such as trends and biases.
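The iterate-and-feed-back loop described above can be sketched as follows. The interface is hypothetical, not the paper's implementation: `propose_edits` stands in for the LLM-plus-Python agent and returns cell-level edits, while `fit_and_score` is the fixed training pipeline whose validation score is fed back on the next iteration:

```python
import pandas as pd

def llm_cleaning_loop(df, fit_and_score, propose_edits, n_iters=3):
    """Sketch of an LLM-in-the-loop cleaning experiment.

    fit_and_score(df)        -> validation score of the fixed pipeline
    propose_edits(df, hist)  -> {(row_index, column): new_value}, e.g.
                                produced by an LLM reading row context
                                and the score history
    """
    history = []
    for _ in range(n_iters):
        score = fit_and_score(df)          # feedback signal for the agent
        edits = propose_edits(df, history)  # agent suggests cell fixes
        for (row, col), value in edits.items():
            df.at[row, col] = value         # apply fixes to training data only
        history.append((score, edits))
    return df, history
```

Note that because the agent sees one row's context at a time, this structure matches the paper's finding: per-row errors (illogical values, outliers) are reachable, while cross-row patterns such as trends and biases are not.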
Reviews: Self-Supervised Intrinsic Image Decomposition
The paper presents an interesting approach to the intrinsic image decomposition problem: given an input RGB image, it first decomposes it into shape (normals), reflectance (albedo), and illumination (point light) using an encoder-decoder deep architecture with three outputs. A second encoder-decoder then takes the predicted normals and light and outputs the shading of the shape. Finally, the result comes from a multiplication of the estimated reflectance (from the first encoder-decoder) with the estimated shading. The idea of using a reconstruction loss to recover the input image is interesting, but I believe it is only partially exploited in the paper. The network architecture still needs labeled data for the initial training.
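For concreteness, the data flow the review describes can be mirrored in a toy network. The layer sizes and light parameterization here are assumptions for illustration; the paper's actual encoder-decoders are much deeper:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class IntrinsicNet(nn.Module):
    """Toy sketch of the two-stage pipeline: image -> (normals, albedo,
    light), then (normals, light) -> shading, then albedo * shading
    reconstructs the input for the self-supervised reconstruction loss."""

    def __init__(self):
        super().__init__()
        self.encoder = conv_block(3, 16)
        self.normals_head = nn.Conv2d(16, 3, 3, padding=1)
        self.albedo_head = nn.Conv2d(16, 3, 3, padding=1)
        self.light_head = nn.Linear(16, 4)   # assumed point-light parameters
        self.shader = conv_block(3 + 4, 16)  # normals + broadcast light
        self.shading_head = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, img):
        h = self.encoder(img)
        normals = self.normals_head(h)
        albedo = self.albedo_head(h)
        light = self.light_head(h.mean(dim=(2, 3)))  # global pooled features
        b, _, H, W = normals.shape
        light_map = light[:, :, None, None].expand(b, 4, H, W)
        shading = self.shading_head(self.shader(torch.cat([normals, light_map], 1)))
        recon = albedo * shading  # compare against img for the reconstruction loss
        return recon, normals, albedo, shading
```

The reviewer's point is visible in this structure: the reconstruction loss only constrains the product `albedo * shading`, so the individual factors still need labeled data to be pinned down during initial training.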
How this non-gamer fell in love with 'The Legend of Zelda: Breath of the Wild'
It was after a particularly grueling session with The Legend of Zelda: Breath of the Wild that I started to wonder: When did developers stop putting cheats into their games to help the less talented among us get through the tricky bits? When I was a kid, a little bit of Up Down Left Right A and Start together, and a little older, a little / noclip saved me no end of bother. These days, if you look for cheats for any modern game online, the best you'll get is to be sassily told to "git gud." Sorry, a little context: I play games, but I'm not a Gamer, or a Nintendo Person, so in 2023 I resolved to remedy this. So many discussions at work fly past me because while I've heard of Cliff Bleszinski and Hironobu Sakaguchi, I couldn't tell you their oeuvre without Googling.
Power Utilitarianism Continued (Pt.3) -- Why We 'Should' Release Superintelligent AI into the Wild
If you've ever watched the opening scene of 2001: A Space Odyssey, you might have come out of the cinema feeling a bit confused as to the real meaning and artistic intent behind Kubrick's masterpiece -- even if your subconscious may have figured it out in a way that you've struggled to put into words ever since. Before we really dive into it, you can rest assured that it has little to do with evolution. In fact, in the universe of 2001: A Space Odyssey, natural selection never quite takes place in the way we normally understand it. The monolith presented in the film is something they call a Bracewell Probe, or perhaps more of a Von Neumann Probe; a self-replicating, autonomous machine sent out by an extraterrestrial civilization in order to tamper with, and/or 'uplift', a primitive and savage species of apes. The opening scene of the movie starts out with a gang of apes banding together to fight over a small pond they found in the middle of the desert, where resources are scarce.
I didn't get my son's favourite video game – but it got me
Dominik Diamond
About a year ago I tried to bond with my 17-year-old over Sea of Thieves. Since then, he has harangued me about trying Outer Wilds, which he claims is the most profound gameplaying experience of his life. I have delayed to Hamletesque degrees: what will I do if another of his favourite games doesn't connect with me? Would that mean I can no longer connect with my son? As I discovered last month, it can sometimes be a struggle playing games in your 50s, and dropping down the difficulty can reduce the stress and help me enjoy myself more.
About a year ago I tried to bond with my 17-year-old over Sea of Thieves. Since then, he has harangued me about trying Outer Wilds, which he claims is the most profound gameplaying experience of his life. I have delayed to Hamletesque degrees: what will I do if another of his favourite games doesn't connect with me? Would that mean I can no longer connect with my son? As I discovered last month, it can sometimes be a struggle playing games in your 50s, and dropping down the difficulty can reduce the stress and help me enjoy myself more.