Facebook AI has introduced the Generative Spoken Language Model (GSLM), the first high-performance NLP model that leverages state-of-the-art representation learning to work directly with raw audio signals, without labels or text. With GSLM, you can develop NLP models that capture the full range of expressivity found in spoken language. The team began by building a baseline model and evaluating it on two simple end-to-end tasks: discrete resynthesis, in which an input waveform is encoded into the discrete pseudo-text they call units and then decoded back into speech; and speech generation, in which the language model samples new unit sequences that are synthesized into audio. The Facebook research team trained their encoder and unit-based language model (uLM) on 6,000 hours of Libri-Light and LibriSpeech; the entire stack was self-supervised from raw audio, with no text or labels. The research group plans to apply GSLM to casual and spontaneous speech datasets, where text-based methods struggle. They also plan to show that this method can be effective for pretraining downstream tasks with little labeled data, such as spoken summarization or information retrieval.
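The encode-to-units, model-the-units, sample-new-units pipeline can be sketched in miniature. Everything below is a toy stand-in, not GSLM's real code: the actual system uses a self-supervised speech encoder (such as CPC, wav2vec 2.0, or HuBERT) quantized with k-means, a Transformer unit language model, and a neural vocoder for resynthesis. The functions, centroid values, and frame values here are illustrative assumptions chosen only to show the shape of the pipeline.

```python
import random

def encode(frames, centroids):
    """'Speech-to-unit' stage: map each continuous frame feature to the
    index of its nearest codebook centroid, producing pseudo-text units.
    (Real GSLM quantizes high-dimensional encoder features with k-means.)"""
    return [min(range(len(centroids)), key=lambda k: abs(f - centroids[k]))
            for f in frames]

def train_ulm(unit_seqs):
    """Toy unit language model: collect bigram successors for each unit.
    (Real GSLM trains a Transformer over the unit sequences.)"""
    successors = {}
    for seq in unit_seqs:
        for a, b in zip(seq, seq[1:]):
            successors.setdefault(a, []).append(b)
    return successors

def sample(ulm, start, length, rng):
    """'Generation' stage: sample a new unit sequence from the uLM.
    A vocoder would then turn these units back into a waveform."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(ulm.get(seq[-1], list(ulm))))
    return seq

# Stand-ins for a learned codebook and encoder output features.
centroids = [0.0, 0.5, 1.0]
frames = [0.1, 0.45, 0.9, 0.52]

units = encode(frames, centroids)                  # pseudo-text units
ulm = train_ulm([units])                           # unit language model
new_units = sample(ulm, units[0], 6, random.Random(0))
```

For discrete resynthesis, the sampled (or original) units would be fed to a unit-to-speech decoder; this sketch stops at the unit level, which is where the "textless" language modeling happens.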