OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
