AITopics | Gkioxari, Georgia

Collaborating Authors

Gkioxari, Georgia

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Is CLIP ideal? No. Can we fix it? Yes!

Kang, Raphi, Song, Yue, Gkioxari, Georgia, Perona, Pietro

arXiv.org Artificial IntelligenceMar-10-2025

Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2503.08723

Country: North America > United States > California (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

CART: Caltech Aerial RGB-Thermal Dataset in the Wild

Lee, Connor, Anderson, Matthew, Raganathan, Nikhil, Zuo, Xingxing, Do, Kevin, Gkioxari, Georgia, Chung, Soon-Jo

arXiv.org Artificial IntelligenceMar-13-2024

We present the first publicly available RGB-thermal dataset designed for aerial robotics operating in natural environments. Our dataset captures a variety of terrains across the continental United States, including rivers, lakes, coastlines, deserts, and forests, and consists of synchronized RGB, long-wave thermal, global positioning, and inertial data. Furthermore, we provide semantic segmentation annotations for 10 classes commonly encountered in natural settings in order to facilitate the development of perception algorithms robust to adverse weather and nighttime conditions. Using this dataset, we propose new and challenging benchmarks for thermal and RGB-thermal semantic segmentation, RGB-to-thermal image translation, and visual-inertial odometry. We present extensive results using state-of-the-art methods and highlight the challenges posed by temporal and geographical domain shifts in our data.

machine learning, natural language, semantic segmentation, (15 more...)

arXiv.org Artificial Intelligence

2403.08997

Country:

North America > United States > California (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry:

Transportation (1.00)
Information Technology > Robotics & Automation (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(3 more...)

Add feedback

TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis

Talukder, Sabera, Yue, Yisong, Gkioxari, Georgia

arXiv.org Artificial IntelligenceFeb-26-2024

The field of general time series analysis has recently begun to explore unified modeling, where a common architectural backbone can be retrained on a specific task for a specific dataset. In this work, we approach unification from a complementary vantage point: unification across tasks and domains. To this end, we explore the impact of discrete, learnt, time series data representations that enable generalist, cross-domain training. Our method, TOTEM, or TOkenized Time Series EMbeddings, proposes a simple tokenizer architecture that embeds time series data from varying domains using a discrete vectorized representation learned in a self-supervised manner. TOTEM works across multiple tasks and domains with minimal to no tuning. We study the efficacy of TOTEM with an extensive evaluation on 17 real world time series datasets across 3 tasks. We evaluate both the specialist (i.e., training a model on each domain) and generalist (i.e., training a single model on many domains) settings, and show that TOTEM matches or outperforms previous best methods on several popular benchmarks. The code can be found at: https://github.com/SaberaTalukder/TOTEM.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2402.16412

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Industry: Banking & Finance (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Time Series Analysis (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Objaverse-XL: A Universe of 10M+ 3D Objects

Deitke, Matt, Liu, Ruoshi, Wallingford, Matthew, Ngo, Huong, Michel, Oscar, Kusupati, Aditya, Fan, Alan, Laforte, Christian, Voleti, Vikram, Gadre, Samir Yitzhak, VanderBilt, Eli, Kembhavi, Aniruddha, Vondrick, Carl, Gkioxari, Georgia, Ehsani, Kiana, Schmidt, Ludwig, Farhadi, Ali

arXiv.org Artificial IntelligenceJul-11-2023

Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2307.05663

Country: North America > United States (0.68)

Genre: Research Report (0.64)

Industry:

Government (1.00)
Information Technology (0.93)
Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos

Sun, Jennifer J., Karashchuk, Lili, Dravid, Amil, Ryou, Serim, Fereidooni, Sonia, Tuthill, John, Katsaggelos, Aggelos, Brunton, Bingni W., Gkioxari, Georgia, Kennedy, Ann, Yue, Yisong, Perona, Pietro

arXiv.org Artificial IntelligenceJun-2-2023

Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in the 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, in addition to joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints without requiring manual supervision in videos of humans and rats, demonstrating the potential of 3D keypoint discovery for studying behavior.

artificial intelligence, keypoint, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2212.07401

Country: South America (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

Wijmans, Erik, Datta, Samyak, Maksymets, Oleksandr, Das, Abhishek, Gkioxari, Georgia, Lee, Stefan, Essa, Irfan, Parikh, Devi, Batra, Dhruv

arXiv.org Artificial IntelligenceApr-6-2019

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task - Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel lossweighting Figure 1: We extend EmbodiedQA [1] to photorealstic environments, scheme we call Inflection Weighting to be important our agent is spawned in a perceptually and semantically when training recurrent models for navigation with behavior novel environment and tasked with answering a cloning and are able to out perform the baselines question about that environment. We examine the agent's with this technique. We find that point clouds provide a ability to navigate the environment and answer the question richer signal than RGB images for learning obstacle avoidance, by perceiving its environment through point clouds, RGB motivating the use (and continued study) of 3D deep images, or a combination of the two.

deep learning, neural network, point cloud, (20 more...)

arXiv.org Artificial Intelligence

1904.03461

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Neural Modular Control for Embodied Question Answering

Das, Abhishek, Gkioxari, Georgia, Lee, Stefan, Parikh, Devi, Batra, Dhruv

arXiv.org Artificial IntelligenceOct-25-2018

We present a modular approach for learning policies for navigation over long planning horizons from language input. Our hierarchical policy operates at multiple timescales, where the higher-level master policy proposes subgoals to be executed by specialized sub-policies. Our choice of subgoals is compositional and semantic, i.e. they can be sequentially combined in arbitrary orderings, and assume human-interpretable descriptions (e.g. 'exit room', 'find kitchen', 'find refrigerator', etc.). We use imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning. Independent reinforcement learning at each level of hierarchy enables sub-policies to adapt to consequences of their actions and recover from errors. Subsequent joint hierarchical training enables the master policy to adapt to the sub-policies.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

1810.11181

Country:

North America > United States (0.28)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.91)

Add feedback

Learning and Planning with a Semantic Model

Wu, Yi, Wu, Yuxin, Tamar, Aviv, Russell, Stuart, Gkioxari, Georgia, Tian, Yuandong

arXiv.org Artificial IntelligenceSep-27-2018

Building deep reinforcement learning agents that can generalize and adapt to unseen environments remains a fundamental challenge for AI. This paper describes progresses on this challenge in the context of man-made environments, which are visually diverse but contain intrinsic semantic regularities. We propose a hybrid model-based and model-free approach, LEArning and Planning with Semantics (LEAPS), consisting of a multi-target sub-policy that acts on visual inputs, and a Bayesian model over semantic structures. When placed in an unseen environment, the agent plans with the semantic model to make high-level decisions, proposes the next sub-target for the sub-policy to execute, and updates the semantic model based on new observations. We perform experiments in visual navigation tasks using House3D, a 3D environment that contains diverse human-designed indoor scenes with real-world objects. LEAPS outperforms strong baselines that do not explicitly plan using the semantic content. Deep reinforcement learning (DRL) has undoubtedly witnessed strong achievements in recent years (Silver et al., 2016; Mnih et al., 2015; Levine et al., 2016).

agent, deep learning, neural network, (20 more...)

arXiv.org Artificial Intelligence

1809.10842

Country: North America > United States (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
(2 more...)

Add feedback