ISCUTE: Instance Segmentation of Cables Using Text Embedding

Kozlovsky, Shir; Joglekar, Omkar; Di Castro, Dotan

Feb-27-2024–arXiv.org Artificial Intelligence

CLIPSeg generates a 22 × 22 × 64 embedding tensor, which embeds a semantic mask that is spatially aligned with the input image and conditioned on text. To maintain a consistent embedding size throughout the pipeline, we employ an MLP (bottom-left MLP in Figure 1) to upscale the 64-dimensional embedding to 256 dimensions, followed by a self-attention layer, which learns inter-patch correlations to focus on the relevant patches. CLIPSeg's embedding output is enhanced with Dense Positional Encoding (DPE) so that the self-attention layer has access to crucial geometric information. To this end, the DPE values are also added to the embedding vector after it has passed through the self-attention layer. To generate our DPE, we use the same frequency matrix as SAM. This ensures that every element within each DPE vector conveys consistent information that is aligned with what SAM's decoder has been trained to interpret.
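The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random weights, the two-layer ReLU MLP, the single attention head, and all function names are assumptions made for illustration. The frequency matrix here is a random placeholder, whereas the text says the real one is taken from SAM. The DPE follows a random-Fourier-feature form (sine and cosine of projected, normalised patch coordinates), and the sketch reads the text as adding the DPE both before and after the attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 22            # CLIPSeg patch grid
C_IN, C_OUT = 64, 256  # embedding sizes before and after the MLP

def dense_positional_encoding(h, w, freq):
    """Random-Fourier positional encoding; freq is (2, d/2), output is (h, w, d)."""
    # Patch-centre coordinates normalised to [0, 1], then mapped to [-1, 1].
    ys = (np.arange(h) + 0.5) / h
    xs = (np.arange(w) + 0.5) / w
    grid = np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)  # (h, w, 2)
    proj = 2.0 * np.pi * ((2.0 * grid - 1.0) @ freq)              # (h, w, d/2)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU (the layer count is an assumption)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all patches."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Placeholders: a real pipeline would use CLIPSeg's output and SAM's matrix.
clipseg_embedding = rng.standard_normal((H, W, C_IN))
freq = rng.standard_normal((2, C_OUT // 2))

def init(shape):
    """Small random weights, standing in for trained parameters."""
    return rng.standard_normal(shape) * 0.05

w1, b1 = init((C_IN, C_OUT)), np.zeros(C_OUT)
w2, b2 = init((C_OUT, C_OUT)), np.zeros(C_OUT)
wq, wk, wv = (init((C_OUT, C_OUT)) for _ in range(3))

dpe = dense_positional_encoding(H, W, freq).reshape(H * W, C_OUT)
tokens = mlp(clipseg_embedding.reshape(H * W, C_IN), w1, b1, w2, b2)
attended = self_attention(tokens + dpe, wq, wk, wv)  # DPE before attention
output = attended + dpe                              # DPE added again afterwards

print(output.shape)  # (484, 256)
```

Each of the 484 patches ends up as a 256-dimensional vector whose positional component uses the same frequencies as every other patch, which is the consistency property the passage attributes to reusing SAM's matrix.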

arxiv, dataset, segmentation, (16 more...)

Country:
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East
  - Israel > Haifa District > Haifa (0.04)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.89)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
