Emotion conversion in speech has recently attracted attention owing to its importance in human-machine interaction and the high quality of current speech synthesis. Most existing approaches rely on parallel data, which is not available in many real-time applications. We propose a non-parallel emotion conversion approach based on the cycle-consistent generative adversarial network (cycleGAN) framework. We introduce new variants of cycleGAN that use recurrent neural networks and multi-kernel convolutional neural networks to model prosodic features along with spectral features for emotion conversion in speech. Subjective evaluation results show the effectiveness of our approach in converting both natural speech and unseen synthesized speech samples to different target emotive states.
Article ID: 2021L22
Publisher: Canadian Artificial Intelligence Association