AITopics | Kweon, In So

Collaborating Authors

Kweon, In So

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Any6D: Model-free 6D Pose Estimation of Novel Objects

Lee, Taeyeop, Wen, Bowen, Kang, Minjun, Kang, Gyuree, Kweon, In So, Yoon, Kuk-Jin

arXiv.org Artificial IntelligenceMar-24-2025

We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation. Project page: https://taeyeop.com/any6d

artificial intelligence, estimation, video understanding, (16 more...)

arXiv.org Artificial Intelligence

2503.18673

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Vision > Video Understanding (0.91)

Add feedback

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Oh, Youngtaek, Cho, Jae Won, Kim, Dong-Jin, Kweon, In So, Kim, Junmo

arXiv.org Artificial IntelligenceOct-7-2024

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.0521

Country: Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

360 in the Wild: Dataset for Depth Prediction and View Synthesis

Park, Kibaek, Rameau, Francois, Park, Jaesik, Kweon, In So

arXiv.org Artificial IntelligenceJul-4-2024

The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large scale 360$^{\circ}$ videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

artificial intelligence, image understanding, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2406.18898

Genre: Research Report (0.82)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Oh, Youngtaek, Ahn, Pyunghwan, Kim, Jinhyung, Song, Gwangmo, Lee, Soonyoung, Kweon, In So, Kim, Junmo

arXiv.org Artificial IntelligenceJun-13-2024

Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition -- two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis from 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We open our evaluation framework at https://github.com/ytaek-oh/vl_compo.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2406.09388

Country: Europe > Ireland (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

Towards Understanding Dual BN In Hybrid Adversarial Training

Zhang, Chenshuang, Zhang, Chaoning, Zhang, Kang, Niu, Axi, Kim, Junmo, Kweon, In So

arXiv.org Artificial IntelligenceMar-28-2024

There is a growing concern about applying batch normalization (BN) in adversarial training (AT), especially when the model is trained on both adversarial samples and clean samples (termed Hybrid-AT). With the assumption that adversarial and clean samples are from two different domains, a common practice in prior works is to adopt Dual BN, where BN and BN are used for adversarial and clean branches, respectively. A popular belief for motivating Dual BN is that estimating normalization statistics of this mixture distribution is challenging and thus disentangling it for normalization achieves stronger robustness. In contrast to this belief, we reveal that disentangling statistics plays a less role than disentangling affine parameters in model training. This finding aligns with prior work (Rebuffi et al., 2023), and we build upon their research for further investigations. We demonstrate that the domain gap between adversarial and clean samples is not very large, which is counter-intuitive considering the significant influence of adversarial perturbation on the model accuracy. We further propose a two-task hypothesis which serves as the empirical foundation and a unified framework for Hybrid-AT improvement. We also investigate Dual BN in test-time and reveal that affine parameters characterize the robustness during inference. Overall, our work sheds new light on understanding the mechanism of Dual BN in Hybrid-AT and its underlying justification.

artificial intelligence, dual bn, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2403.1915

Country: Asia > Middle East > Israel (0.14)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

Zhang, Chenshuang, Pan, Fei, Kim, Junmo, Kweon, In So, Mao, Chengzhi

arXiv.org Artificial IntelligenceMar-27-2024

We establish rigorous benchmarks for visual perception robustness. Synthetic images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific type of evaluation over synthetic corruptions, backgrounds, and textures, yet those robustness benchmarks are restricted in specified variations and have low synthetic quality. In this work, we introduce generative model as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures, and materials than any prior work, where we term this benchmark as ImageNet-D. Experimental results show that ImageNet-D results in a significant accuracy drop to a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, significantly reducing their accuracy by up to 60\%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at https://github.com/chenshuang-zhang/imagenet_d.

imagenet-d, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.18775

Country:

North America > United States > Michigan (0.14)
North America > Canada > Quebec (0.14)

Genre: Research Report > New Finding (0.35)

Industry:

Information Technology > Security & Privacy (0.46)
Leisure & Entertainment > Sports > Tennis (0.30)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

One-Shot Neural Fields for 3D Object Understanding

Blukis, Valts, Lee, Taeyeop, Tremblay, Jonathan, Wen, Bowen, Kweon, In So, Yoon, Kuk-Jin, Fox, Dieter, Birchfield, Stan

arXiv.org Artificial IntelligenceAug-8-2023

We present a unified and compact scene representation for robotics, where each object in the scene is depicted by a latent code capturing geometry and appearance. This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction (e.g. recovering depth, point clouds, or voxel maps), collision checking, and stable grasp prediction. We build our representation from a single RGB input image at test time by leveraging recent advances in Neural Radiance Fields (NeRF) that learn category-level priors on large multiview datasets, then fine-tune on novel objects from one or few views. We expand the NeRF model for additional grasp outputs and explore ways to leverage this representation for robotics. At test-time, we build the representation from a single RGB input image observing the scene from only one viewpoint. We find that the recovered representation allows rendering from novel views, including of occluded object parts, and also for predicting successful stable grasps. Grasp poses can be directly decoded from our latent representation with an implicit grasp decoder. We experimented in both simulation and real world and demonstrated the capability for robust robotic grasping using such compact representation. Website: https://nerfgrasp.github.io

artificial intelligence, machine learning, representation, (17 more...)

arXiv.org Artificial Intelligence

2210.12126

Country: Asia > Japan > Honshū > Chūbu (0.14)

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas > Upstream (0.54)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Manipulation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples

Zhang, Chenshuang, Zhang, Chaoning, Kang, Taegoo, Kim, Donghun, Bae, Sung-Ho, Kweon, In So

arXiv.org Artificial IntelligenceMay-8-2023

Segment Anything Model (SAM) has attracted significant attention recently, due to its impressive performance on various downstream tasks in a zero-short manner. Computer vision (CV) area might follow the natural language processing (NLP) area to embark on a path from task-specific vision models toward foundation models. However, deep vision models are widely recognized as vulnerable to adversarial examples, which fool the model to make wrong predictions with imperceptible perturbation. Such vulnerability to adversarial attacks causes serious concerns when applying deep models to security-sensitive applications. Therefore, it is critical to know whether the vision foundation model SAM can also be fooled by adversarial attacks. To the best of our knowledge, our work is the first of its kind to conduct a comprehensive investigation on how to attack SAM with adversarial examples. With the basic attack goal set to mask removal, we investigate the adversarial robustness of SAM in the full white-box setting and transfer-based black-box settings. Beyond the basic goal of mask removal, we further investigate and find that it is possible to generate any desired mask by the adversarial attack.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.00866

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.78)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era

Zhang, Chaoning, Zhang, Chenshuang, Li, Chenghao, Qiao, Yu, Zheng, Sheng, Dam, Sumit Kumar, Zhang, Mengchun, Kim, Jung Uk, Kim, Seong Tae, Choi, Jinwoo, Park, Gyeong-Moon, Bae, Sung-Ho, Lee, Lik-Hang, Hui, Pan, Kweon, In So, Hong, Choong Seon

arXiv.org Artificial IntelligenceApr-4-2023

OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is demonstrated to be one small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI). Since its official release in November 2022, ChatGPT has quickly attracted numerous users with extensive media coverage. Such unprecedented attention has also motivated numerous researchers to investigate ChatGPT from various aspects. According to Google scholar, there are more than 500 articles with ChatGPT in their titles or mentioning it in their abstracts. Considering this, a review is urgently needed, and our work fills this gap. Overall, this work is the first to survey ChatGPT with a comprehensive review of its underlying technology, applications, and challenges. Moreover, we present an outlook on how ChatGPT might evolve to realize general-purpose AIGC (a.k.a. AI-generated content), which will be a significant milestone for the development of AGI.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2304.06488

Country:

Asia > China (0.46)
North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Instructional Material (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Add feedback

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Zhang, Chenshuang, Zhang, Chaoning, Zheng, Sheng, Zhang, Mengchun, Qamar, Maryam, Bae, Sung-Ho, Kweon, In So

arXiv.org Artificial IntelligenceApr-2-2023

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion model, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion model in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion model. As for the text-to-speech task, we divide the methods into three categories based on the stage where diffusion model is adopted: acoustic model, vocoder and end-to-end framework. Moreover, we categorize various speech enhancement tasks by either certain signals are removed or added into the input speech. Comparisons of experimental results and discussions are also covered in this survey.

artificial intelligence, arxiv preprint arxiv, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2303.13336

Country: Asia (0.68)

Genre:

Overview (1.00)
Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.62)

Add feedback