
Collaborating Authors

 Lin, Yiqi


Parrot Captions Teach CLIP to Spot Text

arXiv.org Artificial Intelligence

Despite CLIP serving as the foundation model for numerous vision-language applications, CLIP models suffer from a severe text-spotting bias: they `parrot' the visual text embedded in images while disregarding the authentic visual semantics. We find that in LAION-2B, the most popular image-text dataset, the captions also densely parrot (spell out) the text embedded in the images. Our analysis shows that around 50% of the images contain embedded visual text, and around 30% of caption words appear in that embedded content. Based on this observation, we thoroughly inspect different released versions of CLIP models and verify that visual text is the dominant factor in these models' LAION-style image-text similarity scores. To examine whether such parrot captions shape the text-spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training on parrot captions readily induces this bias but harms the visual-language representation learning that CLIP is expected to perform. This suggests it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipelines built on CLIP-score filtering.
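The "30% of caption words appear in the embedded text" statistic suggests a simple word-overlap measure between a caption and the OCR-detected text of its image. The sketch below is a hypothetical illustration of such a criterion, not the paper's exact method; the function name and tokenization are assumptions.

```python
def parrot_fraction(caption, ocr_tokens):
    """Fraction of caption words that also appear in the OCR-detected
    visual text of the image. A hypothetical 'parrot caption' score
    illustrating the kind of curation criterion the analysis implies;
    real pipelines would normalize tokens more carefully."""
    caption_words = caption.lower().split()
    if not caption_words:
        return 0.0
    ocr_set = {t.lower() for t in ocr_tokens}
    hits = sum(1 for w in caption_words if w in ocr_set)
    return hits / len(caption_words)

# Example: a product photo whose caption parrots the printed label.
score = parrot_fraction("Best Coffee Beans 500g premium roast",
                        ["BEST", "COFFEE", "BEANS", "500g"])
print(score)  # 4 of 6 caption words appear in the embedded text
```

A threshold on such a score is one way to split LAION into "parrot" and "non-parrot" training subsets of the kind the abstract describes.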


On the instrumental variable estimation with many weak and invalid instruments

arXiv.org Machine Learning

Recently, the estimation of causal effects with high-dimensional observational data has drawn much attention in research fields such as economics, epidemiology, and genomics. The instrumental variable (IV) method is widely used when the treatment variable of interest is endogenous. As shown in Figure 1, an ideal IV must be correlated with the endogenous treatment variable (C1), must not have a direct effect on the outcome (C2), and must not be related to unobserved confounders that affect both outcome and treatment (C3).

[Figure 1: Relevance and Validity of IVs]

Our research is motivated by the difficulty of finding IVs that satisfy all of the above conditions. In applications, invalid IVs (violations of C2 or C3) (Davey Smith and Ebrahim, 2003; Kang et al., 2016; Windmeijer et al., 2019) and weak IVs (a weak correlation in C1) (Bound et al., 1995; Staiger and Stock, 1997) are prevalent. A strand of literature studies the "many weak IVs" problem (Stock et al., 2002; Chao and Swanson, 2005). With the increasing availability of large datasets, IV models are often high-dimensional (Belloni et al., 2012; Lin et al., 2015; Fan and Zhong, 2018) and may contain weak IVs (Andrews et al., 2018) and invalid IVs (Guo et al., 2018; Windmeijer et al., 2021). Among these problems, we focus mainly on invalid IVs, while allowing for potential high-dimensionality and weak signals.
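Why endogeneity calls for an IV can be seen in a minimal simulation. The sketch below (not from the paper; all parameter values are illustrative) generates data with an unobserved confounder and compares ordinary least squares, which is biased, against the simple just-identified IV (Wald) estimator using an instrument satisfying C1-C3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0                              # true causal effect of x on y

u = rng.normal(size=n)                  # unobserved confounder
z = rng.normal(size=n)                  # instrument: relevant (C1), valid (C2, C3)
x = 1.0 * z + u + rng.normal(size=n)    # treatment, endogenous through u
y = beta * x + u + rng.normal(size=n)   # outcome, also affected by u

ols = (x @ y) / (x @ x)                 # OLS: biased, since x correlates with u
iv = (z @ y) / (z @ x)                  # Wald/IV estimator: consistent

print(f"OLS: {ols:.3f}  IV: {iv:.3f}  true beta: {beta}")
```

The OLS estimate is pulled away from 2.0 by the confounder, while the IV estimate recovers the true effect; violating C2 or C3 (an invalid IV) would break the second estimator as well, which is the failure mode the paper addresses.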


Priors in Deep Image Restoration and Enhancement: A Survey

arXiv.org Artificial Intelligence

Image restoration and enhancement improve image quality by removing degradations such as noise, blur, and low resolution. Deep learning (DL) has recently been applied to these tasks. Because they are ill-posed, many works have explored priors to facilitate training deep neural networks (DNNs). However, the importance of priors has thus far not been systematically studied and analyzed by the research community. This paper therefore serves as the first comprehensive overview of recent advances in priors for deep image restoration and enhancement. Our work covers five primary contents: (1) a theoretical analysis of priors for deep image restoration and enhancement; (2) a hierarchical and structural taxonomy of the priors commonly used in DL-based methods; (3) an insightful discussion of each prior regarding its principle, potential, and applications; (4) a summary of crucial open problems, highlighting potential future directions (especially adopting large-scale foundation models as priors) to spark more research in the community; (5) an open-source repository that provides a taxonomy of all mentioned works and code links.
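To make "prior" concrete for the ill-posed settings the survey covers: a classic hand-crafted image prior is total variation (TV), which penalizes large differences between neighboring pixels and so encodes a smoothness assumption. The sketch below is an illustrative example of one such prior, not any specific method from the survey.

```python
import numpy as np

def tv_prior(img):
    """Anisotropic total-variation (TV) energy of a 2-D image: the sum of
    absolute differences between horizontally and vertically adjacent
    pixels. A small TV value encodes the smoothness prior used as a
    regularizer in restoration objectives of the form
        minimize  ||A x - y||^2 + lam * TV(x),
    where A models the degradation and lam weights the prior."""
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal differences
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical differences
    return dh + dv

flat = np.full((8, 8), 0.5)                                   # smooth image
noisy = flat + np.random.default_rng(1).normal(0, 0.1, (8, 8))  # noisy image
print(tv_prior(flat), tv_prior(noisy))  # the smooth image has lower TV
```

DL-based methods covered by such surveys either bake priors like this into the loss as a regularizer or learn them implicitly from data.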