thumbnail
ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
A prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail tokens, which constrains the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
- Asia > China > Beijing > Beijing (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
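The ID-Align abstract above describes high-resolution tokens inheriting position IDs from their corresponding thumbnail tokens. The following is only a minimal sketch of that idea, not the authors' implementation: it assumes a raster-ordered thumbnail grid, a high-resolution grid that is an integer multiple (`scale`) of the thumbnail grid, and a block-wise correspondence between the two. The function name `id_align_position_ids` and all of its parameters are hypothetical.

```python
def id_align_position_ids(thumb_h, thumb_w, scale, start=0):
    """Illustrative position-ID remapping (assumed layout, not the paper's code).

    thumb_h, thumb_w: thumbnail token grid size (hypothetical parameters).
    scale: assumed integer ratio between high-res and thumbnail grids.
    start: first position ID available for image tokens.
    """
    if min(thumb_h, thumb_w, scale) < 1:
        raise ValueError("grid sizes and scale must be positive")

    # Thumbnail tokens receive consecutive IDs in raster order.
    thumb_ids = [
        [start + r * thumb_w + c for c in range(thumb_w)]
        for r in range(thumb_h)
    ]

    # Each high-res token reuses the ID of the thumbnail token that
    # covers the same image region (assumed block correspondence).
    hires_ids = [
        [thumb_ids[r // scale][c // scale] for c in range(thumb_w * scale)]
        for r in range(thumb_h * scale)
    ]

    # Tokens after the image continue from the thumbnail's ID range,
    # so the high-res tokens do not expand the positional indices.
    next_id = start + thumb_h * thumb_w
    return thumb_ids, hires_ids, next_id


thumb_ids, hires_ids, next_id = id_align_position_ids(2, 2, scale=2, start=5)
for row in hires_ids:
    print(row)
print("next free ID:", next_id)
```

Under these assumptions, a 4 × 4 high-resolution grid consumes no position IDs beyond the four already used by the 2 × 2 thumbnail, which is the "constrained overexpansion" the abstract refers to.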
BaitRadar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning
Gamage, Bhanuka, Labib, Adnan, Joomun, Aisha, Lim, Chern Hong, Wong, KokSheik
Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users end up watching a video that does not have the content publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning technique where six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is attained by computing the average of multiple models to provide a robust and accurate output even in situations where there is missing data. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2s.
- North America > United States (0.04)
- Asia > Malaysia (0.04)
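The BaitRadar abstract above says the final decision averages the outputs of several models and stays usable when some data is missing. A minimal sketch of one way to do this follows. The abstract does not specify how missing inputs are handled or what decision threshold is used, so skipping missing scores and thresholding at 0.5 are assumptions, and the function name `average_available_scores` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def average_available_scores(
    scores: Dict[str, Optional[float]],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Tuple[float, bool]:
    """Average per-model clickbait probabilities, ignoring missing ones.

    A score of None marks a model whose input (e.g. comments) is absent.
    """
    available = [s for s in scores.values() if s is not None]
    if not available:
        raise ValueError("no model produced a score")
    mean_score = sum(available) / len(available)
    return mean_score, mean_score >= threshold


# Six attribute-specific models, two of them without input data.
example = {
    "title": 0.9,
    "comments": None,
    "thumbnail": 0.7,
    "tags": None,
    "statistics": 0.2,
    "transcript": 0.6,
}
mean_score, is_clickbait = average_available_scores(example)
print(round(mean_score, 3), is_clickbait)
```

Averaging only over the models that produced a score keeps the result on the same 0 to 1 scale regardless of how many attributes are available for a given video.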
Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference
Yu, Jianxing, Wang, Shiqi, Yin, Han, Sun, Zhenlong, Xie, Ruobing, Zhang, Bo, Rao, Yanghui
This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation in mixed modalities to mislead users into clicking for profit. This harms the user experience, so such posts are usually blocked by content providers. To escape detection, malicious creators add irrelevant non-bait content to bait posts, dressing them up as legitimate to fool the detector. This content often has biased relations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping the inherent factors that lead to malicious behavior. This spurious bias easily causes misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Considering that these features are often mixed up with unknown biases, we then disentangle three kinds of latent factors from them: the invariant factor, which indicates intrinsic bait intention; the causal factor, which reflects deceptive patterns in a certain scenario; and non-causal noise. By eliminating the noise that causes bias, we can use the invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Austria > Vienna (0.14)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
Chen, Yuexi, Morariu, Vlad I., Truong, Anh, Liu, Zhicheng
Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. We show that TutoAI has achieved higher or similar quality compared to a baseline model in preliminary user studies.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- Workflow (1.00)
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Questionnaire & Opinion Survey (0.86)
- Education > Educational Technology (0.97)
- Education > Educational Setting > Online (0.93)
- Health & Medicine (0.69)
BI-LAVA: Biocuration with Hierarchical Image Labeling through Active Learning and Visual Analysis
Trelles, Juan, Wentzel, Andrew, Berrios, William, Marai, G. Elisabeta
In the biomedical domain, taxonomies organize the acquisition modalities of scientific images in hierarchical structures. Such taxonomies leverage large sets of correct image labels and provide essential information about the importance of a scientific publication, which could then be used in biocuration tasks. However, the hierarchical nature of the labels, the overhead of processing images, the absence or incompleteness of labeled data, and the expertise required to label this type of data impede the creation of useful datasets for biocuration. From a multi-year collaboration with biocurators and text-mining researchers, we derive an iterative visual analytics and active learning strategy to address these challenges. We implement this strategy in a system called BI-LAVA (Biocuration with Hierarchical Image Labeling through Active Learning and Visual Analysis). BI-LAVA leverages a small set of image labels, a hierarchical set of image classifiers, and active learning to help model builders deal with incomplete ground-truth labels, target a hierarchical taxonomy of image modalities, and classify a large pool of unlabeled images. BI-LAVA's front end uses custom encodings to represent data distributions, taxonomies, image projections, and neighborhoods of image thumbnails, which help model builders explore an unfamiliar image dataset and taxonomy, and correct and generate labels. An evaluation with machine learning practitioners shows that our mixed human-machine approach successfully supports domain experts in understanding the characteristics of classes within the taxonomy, as well as validating and improving data quality in labeled and unlabeled collections.
- South America > Peru (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
Aerial image dataset automatically maps rooftop solar arrays – pv magazine International
Scientists at Mines Paris-PSL University in France have created a dataset of aerial images, segmentation masks, and installation metadata for rooftop PV systems. They conceived the dataset to set up installation registries by extracting small-scale PV metadata from overhead imagery. "Our dataset provides ground truth installation masks for 13303 images from Google Earth and 7686 images from the French national institute of geographical and forestry information (IGN)," the researchers said, noting that the metadata includes installed power, surface, tilt, and azimuth angles. "To address architectural differences, researchers can either use the coarse-grained location included in our dataset or use our dataset in conjunction with other training datasets that mapped different areas." The dataset provides thumbnails with a resolution of 400 × 400 pixels centered around the locations of PV systems.
Machine Learning enabled models for YouTube Ranking Mechanism and Views Prediction
Gupta, Vandit, Diwan, Akshit, Chadha, Chaitanya, Khanna, Ashish, Gupta, Deepak
With the continuous increase of internet usage today, everyone is influenced by the power of this technology. As a result, the rise of applications and games is unstoppable. A major percentage of our population uses these applications for multiple purposes, ranging from education, communication, and news to entertainment and many more. Among these, the applications that make sure the world stays in touch with each other and with current affairs are social media. Social media applications have seen a boom in the last 10 years with the introduction of smartphones and the internet being available at affordable prices. Applications like Twitch and YouTube are some of the best platforms for producing content and expressing talent. It is the goal of every content creator to post the best and most reliable content so that they can gain recognition. It is important to know the methods of achieving popularity easily, which is what this paper proposes to bring to the spotlight. There should be certain parameters based on which the reach of content could be multiplied by a good factor. The proposed research work aims to identify and estimate the reach, popularity, and views of a YouTube video from certain features using machine learning and AI techniques. A ranking system would also be used, taking trending videos into consideration. This would eventually help content creators know how authentic their content is, and would encourage healthy competition to make better content before uploading a video to the platform.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Kent (0.04)
Creativity should have been the last win for AI. Surprisingly, it's the first
When OpenAI's DALL·E 2 was released two weeks back, the AI tool's ability to create images using sparse natural language instructions caused an online frenzy. Whatever its predecessor DALL·E could do, DALL·E 2 could do better. After the announcement, OpenAI's CEO Sam Altman spoke in his blog about the potential upsides of DALL·E 2 and the general direction that AI was moving towards. According to Altman, the general idea that AI's contributions would affect physical labour first, followed by cognitive labour, and then eventually reach creative work has been reversed in reality. "It now looks like it's going to go in the opposite order," he noted.
- Leisure & Entertainment (0.75)
- Media > Film (0.54)
How AI is Changing Digital Marketing - DataScienceCentral.com
Oxford Languages defines AI as the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. For those of us working in the realm of digital marketing, the impact has become even more clear over the last few years. To put things into perspective, 61% of marketers say AI is the most important aspect of their data strategy, according to MemSQL. Have you ever searched for a particular product and then all advertisements you see after that search are for similar products? The power of DMPs (Data Management Platforms) allows AI to gather data from across the Internet – not just a particular website.
Is Netflix stalking you?
Humans have the attention span of a goldfish, giving companies like Netflix just a few seconds to woo you before it loses you to a competing service or some other activity. Netflix wants to grab your attention, say, like a new boyfriend, but does it do this without spamming you with texts or calls? Getting a little technical here, Netflix relies heavily on batch machine learning approaches, using information gathered by algorithms that reflect A/B testing. Okay, okay, I know this was too much, so imagine you are someone who watches more thrillers or mysteries: you will see a thumbnail with Archie, Betty, and Veronica looking at you, all intense, emoting suspense. Now imagine me as someone who watches more romance and high-school drama. Actually, scrap that, KNOW ME as someone who loves it -- I'm all about The Notebook, The Vow, and Letters to Juliet.
- Media > Television (1.00)
- Media > Film (1.00)
- Information Technology > Services (1.00)