Goto

Collaborating Authors

 Object-Oriented Architecture


Object Recognition by Scene Alignment

Neural Information Processing Systems

Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input im- age, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a prob- abilistic model to transfer the labels from the retrieval set to the input image.


Why The Brain Separates Face Recognition From Object Recognition

Neural Information Processing Systems

Many studies have uncovered evidence that visual cortex contains specialized regions involved in processing faces but not other object classes. Recent electrophysiology studies of cells in several of these specialized regions revealed that at least some of these regions are organized in a hierarchical manner with viewpoint-specific cells projecting to downstream viewpoint-invariant identity-specific cells (Freiwald and Tsao 2010). A separate computational line of reasoning leads to the claim that some transformations of visual inputs that preserve viewed object identity are class-specific. In particular, the 2D images evoked by a face undergoing a 3D rotation are not produced by the same image transformation (2D) that would produce the images evoked by an object of another class undergoing the same 3D rotation. However, within the class of faces, knowledge of the image transformation evoked by 3D rotation can be reliably transferred from previously viewed faces to help identify a novel face at a new viewpoint.


Semantic Kernel Forests from Multiple Taxonomies

Neural Information Processing Systems

When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an \emph{object taxonomy} specifying the categories' semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that \emph{are} relevant. In light of these issues, we propose a discriminative feature learning approach that leverages \emph{multiple} hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting \emph{semantic kernel forest}, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class.



ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

arXiv.org Artificial Intelligence

Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL


CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

arXiv.org Artificial Intelligence

We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves a comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder we infer the 3D shape, 6D pose, size, joint type, and the joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference time is fast and can run on a NVIDIA TITAN XP GPU at 1 HZ for eight or less objects present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data is available at: http://carto.cs.uni-freiburg.de


ASIC: Aligning Sparse in-the-wild Image Collections

arXiv.org Artificial Intelligence

The above is also true for an image of a works assume either ground-truth keypoint annotations or "never-before-seen" object (as opposed to a common object a large dataset of images of a single object category. However, category such as cars) where humans demonstrate surprisingly neither of the above assumptions hold true for the longtail robust generalization despite lacking an object or category of the objects present in the world. We present a selfsupervised specific priors [6]. These correspondences in turn inform technique that directly optimizes on a sparse collection downstream inferences about the object such as shape, of images of a particular object/object category to affordances, and more. In this work, we tackle this problem obtain consistent dense correspondences across the collection. of "low-shot dense correspondence" - i.e. given only a small We use pairwise nearest neighbors obtained from deep in-the-wild image collection ( 10-30 images) of an object features of a pre-trained vision transformer (ViT) model as or object category, we recover dense and consistent correspondences noisy and sparse keypoint matches and make them dense across the entire collection.


Open-Vocabulary Object Detection using Pseudo Caption Labels

arXiv.org Artificial Intelligence

Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.


Scene Graph Based Fusion Network For Image-Text Retrieval

arXiv.org Artificial Intelligence

A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts. Most existing methods mainly focus on coarse-grained correspondences based on co-occurrences of semantic objects, while failing to distinguish the fine-grained local correspondences. In this paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which enhances the images'/texts' features through intra- and cross-modal fusion for image-text retrieval. To be specific, we design an intra-modal hierarchical attention fusion to incorporate semantic contexts, such as objects, attributes, and relationships, into images'/texts' feature vectors via scene graphs, and a cross-modal attention fusion to combine the contextual semantics and local fusion via contextual vectors. Extensive experiments on public datasets Flickr30K and MSCOCO show that our SGFN performs better than quite a few SOTA image-text retrieval methods.


best way to be a machine learning engineer

#artificialintelligence

Becoming a machine learning engineer requires a combination of skills and knowledge in various areas such as mathematics, programming, data analysis, and machine learning algorithms. Learn the basics of mathematics and statistics: Machine learning requires a strong foundation in mathematics and statistics. You should be familiar with calculus, linear algebra, probability, and statistics. Master a programming language: You should learn a programming language such as Python or R, which are commonly used for machine learning. You should also be familiar with data structures, algorithms, and object-oriented programming.