This project implements the Listening-while-Speaking Language Model (LSLM) as described in the paper "Language Model Can Listen While Speaking" by Ma et al. (2024). LSLM is an innovative approach to full duplex modeling in interactive speech language models, enabling real-time interaction and turn-taking in spoken dialogues.
Key features:

- Full duplex modeling capability
- Streaming SSL (self-supervised learning) encoder for real-time listening
- Token-based TTS generation
- Interruption handling
- Noise robustness
- Multi-fusion strategies (early, middle, and late fusion; see the sketch after this list)
- Command-based and voice-based full duplex modeling
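
The three fusion strategies differ only in where the listening channel is merged into the autoregressive speaking decoder. Below is a minimal PyTorch sketch; the module, its parameters, and the tensor shapes are illustrative assumptions, not this repository's actual API:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Illustrative fusion of the speaking (TTS) channel with the
    listening channel. Names and shapes are hypothetical, not the
    repository's actual API."""

    def __init__(self, d_model: int = 256, num_layers: int = 4, strategy: str = "middle"):
        super().__init__()
        assert strategy in ("early", "middle", "late")
        self.strategy = strategy
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.listen_proj = nn.Linear(d_model, d_model)

    def forward(self, speak: torch.Tensor, listen: torch.Tensor) -> torch.Tensor:
        # speak, listen: (batch, seq_len, d_model) embeddings of the two channels
        listen = self.listen_proj(listen)
        if self.strategy == "early":
            # Early fusion: merge the channels once, before the first layer.
            x = speak + listen
            for layer in self.layers:
                x = layer(x)
        elif self.strategy == "middle":
            # Middle fusion: re-inject the listening channel at every layer.
            x = speak
            for layer in self.layers:
                x = layer(x + listen)
        else:
            # Late fusion: process the speaking channel alone and merge
            # the listening channel only at the output.
            x = speak
            for layer in self.layers:
                x = layer(x)
            x = x + listen
        return x
```

The paper reports that middle fusion, which re-injects the listening channel at every decoder layer, gives the best balance between speech generation quality and real-time interaction.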
Requirements:

- Python 3.7+
- PyTorch 1.8+
- Transformers library
- Torchaudio
- Matplotlib
- NumPy
To install, clone the repository:

```bash
git clone https://github.com/sanowl/LSLM-Listening-while-Speaking-Language-Model.git
```
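
Assuming the dependencies above are published under their usual PyPI names (the repository may also ship a requirements.txt), they can be installed with pip:

```bash
cd LSLM-Listening-while-Speaking-Language-Model
pip install torch torchaudio transformers matplotlib numpy
```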
To train and evaluate the model:
```bash
python main.py
```
This script will:
- Create and preprocess the dataset
- Train the LSLM model
- Evaluate on validation and test sets
- Perform ablation studies
- Conduct command-based and voice-based FDM tests
- Analyze turn-taking performance
- Generate sample speech output (see the inference sketch after this list)
- Visualize attention weights and audio quantization
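
The interruption-handling behavior can be pictured as a decode loop that interleaves token generation with listening. The sketch below is illustrative only: `model`, the chunked listening stream, and the special `irq_id`/`eos_id` tokens are assumptions, not the actual interfaces in this repository.

```python
import torch

@torch.no_grad()
def speak_while_listening(model, prompt_tokens, listen_chunks,
                          max_steps=500, eos_id=0, irq_id=1):
    """Hypothetical decode loop: emit speech tokens autoregressively while
    monitoring the listening channel. `model`, `listen_chunks`, and the
    special token ids are illustrative, not this repository's API."""
    generated = list(prompt_tokens)
    for step, chunk in enumerate(listen_chunks):   # streamed listening audio
        if step >= max_steps:
            break
        speak = torch.tensor([generated])          # (1, seq_len) token ids so far
        logits = model(speak, chunk)               # fused forward pass -> (1, vocab)
        next_tok = int(logits.argmax(dim=-1))
        if next_tok == irq_id:                     # interruption detected:
            break                                  # stop speaking, yield the turn
        generated.append(next_tok)
        if next_tok == eos_id:                     # utterance finished naturally
            break
    return generated[len(prompt_tokens):]
```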
Project structure:

- `main.py`: Main execution script
- `model/`: Model architecture components
- `utils/`: Utility functions for data processing and evaluation
- `data/`: Data loading and preprocessing scripts
If you use this implementation in your research, please cite the original paper:
```bibtex
@article{ma2024language,
  title={Language Model Can Listen While Speaking},
  author={Ma, Ziyang and Song, Yakun and Du, Chenpeng and Cong, Jian and Chen, Zhuo and Wang, Yuping and Wang, Yuxuan and Chen, Xie},
  journal={arXiv preprint arXiv:2408.02622},
  year={2024}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
We would like to express our gratitude to the authors of the original paper for their innovative work on full duplex modeling for interactive speech language models.