AITopics | Keshet, Joseph

Collaborating Authors

Keshet, Joseph

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Designing Scheduling for Diffusion Models via Spectral Analysis

Benita, Roi, Elad, Michael, Keshet, Joseph

arXiv.org Machine LearningJan-31-2025

Diffusion models (DMs) have emerged as powerful tools for modeling complex data distributions and generating realistic new samples. Over the years, advanced architectures and sampling methods have been developed to make these models practically usable. However, certain synthesis process decisions still rely on heuristics without a solid theoretical foundation. In our work, we offer a novel analysis of the DM's inference process, introducing a comprehensive frequency response perspective. Specifically, by relying on Gaussianity and shift-invariance assumptions, we present the inference process as a closed-form spectral transfer function, capturing how the generated signal evolves in response to the initial noise. We demonstrate how the proposed analysis can be leveraged for optimizing the noise schedule, ensuring the best alignment with the original dataset's characteristics. Our results lead to scheduling curves that are dependent on the frequency content of the data, offering a theoretical justification for some of the heuristics taken by practitioners.

artificial intelligence, machine learning, noise schedule, (15 more...)

arXiv.org Machine Learning

2502.0018

Country: North America > United States (0.14)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Add feedback

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Segal-Feldman, Yael, Shamsian, Aviv, Navon, Aviv, Hetz, Gill, Keshet, Joseph

arXiv.org Artificial IntelligenceSep-24-2024

Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends the OpenAI's Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2409.15869

Country: North America > United States (0.97)

Genre: Research Report > Promising Solution (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

Turetzky, Arnon, Tal, Or, Segal-Feldman, Yael, Dissen, Yehoshua, Zeldes, Ella, Roth, Amit, Cohen, Eyal, Shrem, Yosi, Chernyak, Bronya R., Seleznova, Olga, Keshet, Joseph, Adi, Yossi

arXiv.org Artificial IntelligenceJul-10-2024

We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. HebDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language, consisting of a large variety of speakers and topics. We provide raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HebDB is to further enhance research and development of spoken language processing tools for the Hebrew language. Hence, we additionally provide two baseline systems for Automatic Speech Recognition (ASR): (i) a self-supervised model; and (ii) a fully supervised model. We present the performance of these two methods optimized on HebDB and compare them to current multi-lingual ASR alternatives. Results suggest the proposed method reaches better results than the evaluated baselines considering similar model sizes. Dataset, code, and models are publicly available under https://pages.cs.huji.ac.il/adiyoss-lab/HebDB/.

natural language, speech recognition, weakly supervised dataset, (3 more...)

arXiv.org Artificial Intelligence

2407.07566

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)

Add feedback

Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

Dissen, Yehoshua, Yonash, Shiry, Cohen, Israel, Keshet, Joseph

arXiv.org Artificial IntelligenceJun-27-2024

In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal affect to Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments.

artificial intelligence, asr model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2406.18928

Country: Europe > Germany (0.14)

Genre: Research Report (1.00)

Industry: Telecommunications > Networks (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Communications > Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.54)

Add feedback

Keyword-Guided Adaptation of Automatic Speech Recognition

Shamsian, Aviv, Navon, Aviv, Glazer, Neta, Hetz, Gill, Keshet, Joseph

arXiv.org Artificial IntelligenceJun-4-2024

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

artificial intelligence, keyword, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2406.02649

Genre:

Research Report > New Finding (0.54)
Research Report > Promising Solution (0.34)

Industry:

Transportation (0.95)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Benita, Roi, Elad, Michael, Keshet, Joseph

arXiv.org Artificial IntelligenceNov-6-2023

Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sounds more natural. Furthermore, the proposed diffusion model is stochastic and not deterministic; therefore, each inference generates a slightly different waveform variation, enabling abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.

artificial intelligence, diffar, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2310.01381

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.84)

Add feedback

Combining Language Models For Specialized Domains: A Colorful Approach

Eitan, Daniel, Pirchi, Menachem, Glazer, Neta, Meital, Shai, Ayach, Gil, Krendel, Gidon, Shamsian, Aviv, Navon, Aviv, Hetz, Gil, Keshet, Joseph

arXiv.org Artificial IntelligenceNov-1-2023

General purpose language models (LMs) encounter difficulties when processing domain-specific jargon and terminology, which are frequently utilized in specialized fields such as medicine or industrial settings. Moreover, they often find it challenging to interpret mixed speech that blends general language with specialized jargon. This poses a challenge for automatic speech recognition systems operating within these specific domains. In this work, we introduce a novel approach that integrates domain-specific or secondary LM into general-purpose LM. This strategy involves labeling, or "coloring", each word to indicate its association with either the general or the domain-specific LM. We develop an optimized algorithm that enhances the beam search algorithm to effectively handle inferences involving colored words. Our evaluations indicate that this approach is highly effective in integrating jargon into language tasks. Notably, our method substantially lowers the error rate for domain-specific words without compromising performance in the general domain.

artificial intelligence, combining language model, speech recognition, (2 more...)

arXiv.org Artificial Intelligence

2310.19708

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
Information Technology > Artificial Intelligence > Natural Language (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.53)

Add feedback

Open-vocabulary Keyword-spotting with Adaptive Instance Normalization

Navon, Aviv, Shamsian, Aviv, Glazer, Neta, Hetz, Gill, Keshet, Joseph

arXiv.org Artificial IntelligenceSep-13-2023

Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and keyword into a joint embedding space to obtain some affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters. These parameters are used to process the auditory input. We provide an extensive evaluation using challenging and diverse multi-lingual benchmarks and show significant improvements over recent keyword spotting and ASR baselines. Furthermore, we study the effectiveness of our approach on low-resource languages that were unseen during the training. The results demonstrate a substantial performance improvement compared to baseline methods.

artificial intelligence, open-vocabulary keyword-spotting, speech recognition, (1 more...)

arXiv.org Artificial Intelligence

2309.08561

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.53)

Add feedback

Fairness in the Eyes of the Data: Certifying Machine-Learning Models

Segal, Shahar, Adi, Yossi, Pinkas, Benny, Baum, Carsten, Ganesh, Chaya, Keshet, Joseph

arXiv.org Artificial IntelligenceSep-3-2020

We present a framework that allows to certify the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture. Thus, it allows us to evaluate any deep learning model on multiple fairness definitions empirically. We tackle two scenarios, where either the test data is privately available only to the tester or is publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data.

deep learning, fairness, neural network, (21 more...)

arXiv.org Artificial Intelligence

2009.01534

Country: North America > United States (0.93)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

Kreuk, Felix, Keshet, Joseph, Adi, Yossi

arXiv.org Machine LearningAug-6-2020

We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner with no manual annotations in the form of target boundaries nor phonetic transcriptions. We compare the proposed approach to several unsupervised baselines using both TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both data sets. Furthermore, we experimented with expanding the training set with additional examples from the Librispeech corpus. We evaluated the resulting model on distributions and languages that were not seen during the training phase (English, Hebrew and German) and showed that utilizing additional untranscribed data is beneficial for model performance.

boundary, deep learning, neural network, (19 more...)

arXiv.org Machine Learning

2007.13465

Country: North America (0.28)

Genre: Research Report > New Finding (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback