roboflow
Appendix of GLIPv2: Unifying Localization and Vision-Language Understanding
In Section 1, we provide more visualizations of our model's predictions on various localization and VL understanding tasks. In Section 2, we describe all our evaluated tasks and their dataset in detail. In Section 8, we give out a comparison for the model's inference speed. It has about 900k bounding box annotations for 80 object categories, with about 7.3 We predict use any-box protocol specified in MDETR. L VIS uses the same images as in COCO, re-annotated with more object categories.
DyCAF-Net: Dynamic Class-Aware Fusion Network
Jahin, Md Abrar, Soudeep, Shahriar, Mridha, M. F., Fahad, Nafiz, Hossen, Md. Jakir
Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency ($\sim$11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.
FishDet-M: A Unified Large-Scale Benchmark for Robust Fish Detection and CLIP-Guided Model Selection in Diverse Aquatic Visual Domains
Abujabal, Muayad, Saoud, Lyes Saad, Hussain, Irfan
Accurate fish detection in underwater imagery is essential for ecological monitoring, aquaculture automation, and robotic perception. However, practical deployment remains limited by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols. To address these gaps, we present \textit{FishDet-M}, the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments including marine, brackish, occluded, and aquarium scenes. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks, enabling consistent and scalable cross-domain evaluation. We systematically benchmark 28 contemporary object detection models, covering the YOLOv8 to YOLOv12 series, R-CNN based detectors, and DETR based models. Evaluations are conducted using standard metrics including mAP, mAP@50, and mAP@75, along with scale-specific analyses (AP$_S$, AP$_M$, AP$_L$) and inference profiling in terms of latency and parameter count. The results highlight the varying detection performance across models trained on FishDet-M, as well as the trade-off between accuracy and efficiency across models of different architectures. To support adaptive deployment, we introduce a CLIP-based model selection framework that leverages vision-language alignment to dynamically identify the most semantically appropriate detector for each input image. This zero-shot selection strategy achieves high performance without requiring ensemble computation, offering a scalable solution for real-time applications. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes. All datasets, pretrained models, and evaluation tools are publicly available to facilitate future research in underwater computer vision and intelligent marine systems.
Deep Learning-Driven Heat Map Analysis for Evaluating thickness of Wounded Skin Layers
GR, Devakumar, Kaarthikeyan, JB, T, Dominic Immanuel, Pravin, Sheena Christabel
Understanding the appropriate skin layer thickness in wounded sites is an important tool to move forward on wound healing practices and treatment protocols. Methods to measure depth often are invasive and less specific. This paper introduces a novel method that is non-invasive with deep learning techniques using classifying of skin layers that helps in measurement of wound depth through heatmap analysis. A set of approximately 200 labeled images of skin allows five classes to be distinguished: scars, wounds, and healthy skin, among others. Each image has annotated key layers, namely the stratum cornetum, the epidermis, and the dermis, in the software Roboflow. In the preliminary stage, the Heatmap generator VGG16 was used to enhance the visibility of tissue layers, based upon which their annotated images were used to train ResNet18 with early stopping techniques. It ended up at a very high accuracy rate of 97.67%. To do this, the comparison of the models ResNet18, VGG16, DenseNet121, and EfficientNet has been done where both EfficientNet and ResNet18 have attained accuracy rates of almost 95.35%. For further hyperparameter tuning, EfficientNet and ResNet18 were trained at six different learning rates to determine the best model configuration. It has been noted that the accuracy has huge variations with different learning rates. In the case of EfficientNet, the maximum achievable accuracy was 95.35% at the rate of 0.0001. The same was true for ResNet18, which also attained its peak value of 95.35% at the same rate. These facts indicate that the model can be applied and utilized in actual-time, non-invasive wound assessment, which holds a great promise to improve clinical diagnosis and treatment planning.
Fire Detection From Image and Video Using YOLOv5
Islam, Arafat, Habib, Md. Imtiaz
For the detection of fire-like targets in indoor, outdoor and forest fire images, as well as fire detection under different natural lights, an improved YOLOv5 fire detection deep learning algorithm is proposed. The YOLOv5 detection model expands the feature extraction network from three dimensions, which enhances feature propagation of fire small targets identification, improves network performance, and reduces model parameters. Furthermore, through the promotion of the feature pyramid, the top-performing prediction box is obtained. Fire-YOLOv5 attains excellent results compared to state-of-the-art object detection networks, notably in the detection of small targets of fire and smoke with mAP 90.5% and f1 score 88%. Overall, the Fire-YOLOv5 detection model can effectively deal with the inspection of small fire targets, as well as fire-like and smoke-like objects with F1 score 0.88. When the input image size is 416 x 416 resolution, the average detection time is 0.12 s per frame, which can provide real-time forest fire detection. Moreover, the algorithm proposed in this paper can also be applied to small target detection under other complicated situations. The proposed system shows an improved approach in all fire detection metrics such as precision, recall, and mean average precision.
Building your own Object Detector from scratch with Tensorflow
In this story, we talk about how to build a Deep Learning Object Detector from scratch using TensorFlow. Instead of using a predefined model, we will define each layer in the network and then we will train our model to detect both the object bound box and its class. Finally, we will evaluate the model using IoU metric. TL;DR: need the code right now? Check this colab notebook or this github repository Object Detection is a task concerned in automatically finding semantic objects in an image. Today Object Detectors like YOLO v4/v5 /v7 and v8 achieve state-of-art in terms of accuracy at impressive real time FPS rate.
Generalized Decoding for Pixel, Image, and Language
Zou, Xueyan, Dou, Zi-Yi, Yang, Jianwei, Gan, Zhe, Li, Linjie, Li, Chunyuan, Dai, Xiyang, Behl, Harkirat, Wang, Jianfeng, Yuan, Lu, Peng, Nanyun, Wang, Lijuan, Lee, Yong Jae, Gao, Jianfeng
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decodert takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
Amazon SageMaker Studio Lab continues to democratize ML with more scale and functionality
To make machine learning (ML) more accessible, Amazon launched Amazon SageMaker Studio Lab at AWS re:Invent 2021. Today, tens of thousands of customers use it every day to learn and experiment with ML for free. We made it simple to get started with just an email address, without the need for installs, setups, credit cards, or an AWS account. SageMaker Studio Lab resonates with customers who want to learn in either an informal or formal setting, as indicated by a recent survey that suggests 49% of our current customer base is learning on their own, whereas 21% is taking a formal ML class. Higher learning institutions have started to adopt it, because it helps them teach ML fundamentals beyond the notebook, like environment and resource management, which are critical areas for successful ML projects.
Grounded Language-Image Pre-training
Li, Liunian Harold, Zhang, Pengchuan, Zhang, Haotian, Yang, Jianwei, Li, Chunyuan, Zhong, Yiwu, Wang, Lijuan, Yuan, Lu, Zhang, Lei, Hwang, Jenq-Neng, Chang, Kai-Wei, Gao, Jianfeng
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code will be released at https://github.com/microsoft/GLIP.