ISCUTE: Instance Segmentation of Cables Using Text Embedding
Kozlovsky, Shir, Joglekar, Omkar, Di Castro, Dotan
arXiv.org Artificial Intelligence
CLIPSeg generates a 22 × 22 × 64 embedding tensor, which embeds a semantic mask that is spatially aligned with the input image and conditioned on text. To maintain a consistent embedding size throughout the pipeline, we employ an MLP (bottom left MLP in Figure 1) to upscale the 64-dimensional embedding to 256 dimensions, followed by a self-attention layer, which learns inter-patch correlations to focus on the relevant patches. CLIPSeg's embedding output is enhanced with Dense Positional Encoding (DPE) to ensure that the self-attention layer has access to crucial geometric information. To this end, the DPE values are added to the embedding vector even after it has participated in the self-attention layer. To generate our DPE, we use the same frequency matrix as SAM. This ensures that every element within each vector of the DPE conveys consistent information that is aligned with what SAM's decoder has been trained to interpret.
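The positional-encoding step can be illustrated with a minimal sketch, assuming the DPE follows the random Fourier feature scheme commonly associated with SAM: patch-centre coordinates are scaled to [-1, 1], projected through a fixed frequency matrix, and mapped to sine and cosine features. The function names, the seeded random matrix (a stand-in for the matrix that would be taken from SAM), and the zero-filled embedding are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def make_frequency_matrix(num_feats=128, scale=1.0, seed=0):
    """Stand-in for SAM's frequency matrix (assumption: in practice it would
    be taken from SAM so the decoder sees the encoding it was trained on)."""
    rng = random.Random(seed)
    # Two rows (one for x, one for y), num_feats columns.
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(num_feats)]
            for _ in range(2)]


def dense_positional_encoding(height, width, freq):
    """Return a height x width grid of (2 * num_feats)-dimensional vectors."""
    num_feats = len(freq[0])
    grid = []
    for row in range(height):
        # Patch centre, normalised to [0, 1] and then shifted to [-1, 1].
        y = 2.0 * (row + 0.5) / height - 1.0
        grid_row = []
        for col in range(width):
            x = 2.0 * (col + 0.5) / width - 1.0
            angles = [2.0 * math.pi * (x * freq[0][k] + y * freq[1][k])
                      for k in range(num_feats)]
            # Sine features first, cosine features second.
            grid_row.append([math.sin(a) for a in angles]
                            + [math.cos(a) for a in angles])
        grid.append(grid_row)
    return grid


def add_encoding(embedding, dpe):
    """Element-wise sum of an embedding grid and a DPE grid of equal shape."""
    return [[[e + p for e, p in zip(emb_vec, dpe_vec)]
             for emb_vec, dpe_vec in zip(emb_row, dpe_row)]
            for emb_row, dpe_row in zip(embedding, dpe)]


freq = make_frequency_matrix()
dpe = dense_positional_encoding(22, 22, freq)

# Placeholder for the 22 x 22 x 256 embedding produced by the MLP upscaling.
embedding = [[[0.0] * 256 for _ in range(22)] for _ in range(22)]
encoded = add_encoding(embedding, dpe)
```

Because the frequency matrix is fixed, calling `dense_positional_encoding` twice yields the same grid, so the same `dpe` can be added both before and after a self-attention layer without recomputing it.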
Feb-27-2024
- Country:
- Asia > Middle East
- Israel > Haifa District > Haifa (0.04)
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Technology: