Earballs: Neural Transmodal Translation

Port, Andrew, Kim, Chelhwon, Patel, Mitesh

Jun-5-2020–arXiv.org Machine Learning

As is expressed in the adage "a picture is worth a thousand words", when using spoken language to communicate visual information, brevity can be a challenge. This work describes a novel technique for leveraging machine learned feature embeddings to translate visual (and other types of) information into a perceptual audio domain, allowing users to perceive this information using only their aural faculty. The system uses a pretrained image embedding network to extract visual features and embed them in a compact subset of Euclidean space -- this converts the images into feature vectors whose $L^2$ distances can be used as a meaningful measure of similarity. A generative adversarial network (GAN) is then used to find a distance preserving map from this metric space of feature vectors into the metric space defined by a target audio dataset equipped with either the Euclidean metric or a mel-frequency cepstrum-based psychoacoustic distance metric. We demonstrate this technique by translating images of faces into human speech-like audio. For both target audio metrics, the GAN successfully found a metric preserving mapping, and in human subject tests, users were able to accurately classify audio translations of faces.

artificial intelligence, feature vector, machine learning, (15 more...)

arXiv.org Machine Learning

Jun-5-2020

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - California
    - Santa Cruz County > Santa Cruz (0.14)
    - Santa Clara County > Palo Alto (0.04)
- Europe > Sweden
  - Stockholm > Stockholm (0.04)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine (0.68)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks (1.00)
  - Supervised Learning > Representation Of Examples (0.99)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found