Expressive speech synthesis
Given a text to synthesize, taking expressiveness into account requires (i) predicting the parameters of the different description levels for a given expressiveness, and (ii) integrating these parameters into the speech generation process (for example, into the unit selection process).
Hence, the main emphasis in speech processing is on:
- expressiveness characterization and representation in speech,
- knowledge extraction from expressive data,
- generation of expressive speech,
- evaluation through realistic use cases.
The higher levels of information, devoted to the study of linguistic phenomena, are shared with the text mining axis.
- Unit selection algorithms. Unit selection can be formulated as a best-path search in a graph composed of millions of nodes. The classical approach is to use a beam search algorithm, which gains speed at the cost of a possibly non-optimal path. In our case, we have proposed to use A* to solve this problem (see [Guennec2014] for details). Current work on this topic now focuses on the cost functions used in the path search.
- Duration and prosody modelling. Controlling prosody so that the voice is adapted to the context of interaction is a major issue in speech synthesis. Current work aims to bring more control over phoneme durations and melody [Avanzi2014][Delais2014].
- Pronunciation modelling. Pronunciation models cover both grapheme-to-phoneme conversion and phoneme-to-phoneme transformation (pronunciation adaptation). Current work on this topic relies on statistical techniques, especially conditional random fields (CRFs), and mainly studies French and English. See [Qader2014] and [Lecorvé2015] for more details.
- Phonology modelling. Work on phonology modelling, besides pronunciation, mainly aims at studying and predicting disfluencies in spoken utterances. Only preliminary work has been carried out in this field. See [Qader2014] for more details.
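The best-path formulation of unit selection mentioned above can be illustrated with a minimal Python sketch. It is not the implementation described in [Guennec2014]: the function names, the toy costs and the heuristic (the sum of the cheapest remaining target costs, which assumes non-negative join costs) are all assumptions made for illustration. It contrasts a beam search, which may prune the best path, with an A* search over the same lattice.

```python
import heapq
from itertools import count


def beam_search(candidates, target_cost, join_cost, width):
    """Approximate search: keep only the `width` cheapest partial paths per position."""
    beam = sorted(((target_cost(0, u), [u]) for u in candidates[0]),
                  key=lambda p: p[0])[:width]
    for i in range(1, len(candidates)):
        best = {}  # cheapest partial path ending in each unit of position i
        for g, path in beam:
            for v in candidates[i]:
                g2 = g + join_cost(path[-1], v) + target_cost(i, v)
                if v not in best or g2 < best[v][0]:
                    best[v] = (g2, path + [v])
        beam = sorted(best.values(), key=lambda p: p[0])[:width]
    return beam[0]


def a_star(candidates, target_cost, join_cost):
    """Exact search, assuming non-negative join costs and non-empty candidate lists."""
    n = len(candidates)
    # remaining[i]: sum of the cheapest target costs for positions i..n-1.
    # It never overestimates the true remaining cost (hypothetical heuristic).
    remaining = [0] * (n + 2)
    for i in range(n - 1, -1, -1):
        remaining[i] = remaining[i + 1] + min(target_cost(i, u) for u in candidates[i])

    tie = count()  # tie-breaker so the heap never compares paths
    heap = []
    for u in candidates[0]:
        g = target_cost(0, u)
        heapq.heappush(heap, (g + remaining[1], next(tie), g, 0, u, (u,)))
    closed = set()
    while heap:
        _, _, g, i, u, path = heapq.heappop(heap)
        if (i, u) in closed:
            continue
        closed.add((i, u))
        if i == n - 1:  # first goal node popped is optimal
            return g, list(path)
        for v in candidates[i + 1]:
            g2 = g + join_cost(u, v) + target_cost(i + 1, v)
            heapq.heappush(heap, (g2 + remaining[i + 2], next(tie), g2, i + 1, v, path + (v,)))
    return None


# Toy lattice (made-up integer costs): three target positions, five candidate units.
CANDIDATES = [["a1", "a2"], ["b1", "b2"], ["c1"]]
TARGET = {"a1": 6, "a2": 5, "b1": 2, "b2": 1, "c1": 0}
JOIN = {("a1", "b1"): 0, ("a1", "b2"): 20, ("a2", "b1"): 10,
        ("a2", "b2"): 10, ("b1", "c1"): 5, ("b2", "c1"): 0}


def target_cost(i, u):
    return TARGET[u]


def join_cost(u, v):
    return JOIN[(u, v)]


exact = a_star(CANDIDATES, target_cost, join_cost)
narrow = beam_search(CANDIDATES, target_cost, join_cost, width=1)
wide = beam_search(CANDIDATES, target_cost, join_cost, width=2)
```

On this toy lattice, the beam of width 1 commits to the locally cheapest first unit and ends with a more expensive path than A*, while a beam of width 2 happens to recover the optimal path. In a real system the lattice is far larger and the cost functions are the main design question, as noted above.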
The main application domains are:
- Speaker characterization and voice personalization: models that can be adapted to a speaker, thus taking into account the speaker's mood, personality or origins. This covers the complete voice creation process, including voice personalization.
- Linguistic corpus design and corpus creation process: this application domain covers both the design of recording scripts and restriction of audio corpora to address specific tasks.
- High-quality multimedia content generation: this application is particularly relevant to speech synthesis, as it requires fine control of expressiveness in order to hold the user's attention.
Expressiveness makes TTS output more acceptable to users by producing less impersonal speech. Thus, it plays a fundamental role in a large number of concrete applications. Among these applications, we can mention:
- high-quality audiobook generation;
- online learning and in particular autonomous language learning;
- device personalization for disabled people, for whom expressive voice creation is an important need;
- video games.
ANR Phorevox (2012-2014)
Phorevox is an interdisciplinary research project whose main objective is to propose learning assistance tools for written French using speech technologies. See the diagram below or the website for more explanations (in French only).
ANR SynPaFlex (2015-2019)
SynPaFlex is a research project whose aim is to improve text-to-speech engines by addressing the generation of pronunciation variants and adapted prosodic modelling. See the website for more details.