Abstract
The concept of using visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics using only temporal visual information, without relying on linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various multilayer perceptron (MLP) configurations and datasets are used to identify optimal results, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced. A minimal sketch of this kind of visual-to-audio mapping is given below.
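The sketch below is an illustrative assumption-based example of the mapping the abstract describes: a window of prior 2D-DCT visual frames is flattened and regressed onto a single log filterbank audio frame with an MLP. The feature dimensions, window length, placeholder data, and the scikit-learn MLPRegressor configuration are assumptions for demonstration only, not the paper's actual settings.

```python
# Illustrative sketch only: map a window of prior visual (2D-DCT) frames to a
# single audio (log filterbank) frame with an MLP, as the abstract describes.
# Dimensions, window length, and the MLP configuration are assumed values.
import numpy as np
from sklearn.neural_network import MLPRegressor

VISUAL_DIM = 50   # assumed size of a 2D-DCT visual feature vector
AUDIO_DIM = 23    # assumed number of log filterbank channels
WINDOW = 5        # assumed number of prior visual frames per input

def make_windows(visual, audio, window=WINDOW):
    """Stack `window` consecutive visual frames as the input for each audio frame."""
    X, y = [], []
    for t in range(window, len(audio)):
        X.append(visual[t - window:t].reshape(-1))  # flatten the visual window
        y.append(audio[t])                          # target audio frame
    return np.array(X), np.array(y)

# Placeholder data standing in for extracted features of one utterance.
rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((200, VISUAL_DIM))  # 2D-DCT per video frame
audio_feats = rng.standard_normal((200, AUDIO_DIM))    # log filterbank per frame

X, y = make_windows(visual_feats, audio_feats)
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
mlp.fit(X, y)                       # learn the visual-to-audio mapping
estimated_audio = mlp.predict(X)    # estimated log filterbank frames
```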
| Original language | English |
|---|---|
| Title of host publication | Advances in Brain Inspired Cognitive Systems |
| Subtitle of host publication | 8th International Conference, BICS 2016, Beijing, China, November 28-30, 2016, Proceedings |
| Editors | Cheng-Lin Liu, Amir Hussain, Bin Luo, Kay Chen Tan, Yi Zeng, Zhaoxiang Zhang |
| Place of Publication | Cham, Switzerland |
| Publisher | Springer-Verlag |
| Pages | 331-342 |
| Number of pages | 12 |
| ISBN (Electronic) | 9783319496856 |
| ISBN (Print) | 9783319496849 |
| DOIs | |
| Publication status | Published - 13 Nov 2016 |
| Event | 8th International Conference on Brain Inspired Cognitive Systems, BICS 2016, Beijing, China, 28 Nov 2016 → 30 Nov 2016 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 10023 LNAI |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 8th International Conference on Brain Inspired Cognitive Systems, BICS 2016 |
|---|---|
| Country/Territory | China |
| City | Beijing |
| Period | 28/11/16 → 30/11/16 |
Funding
This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 (CogAVHearing, http://cogavhearing.cs.stir.ac.uk). In accordance with EPSRC policy, all experimental data used in the project simulations are available at http://hdl.handle.net/11667/81. The authors would also like to gratefully acknowledge Prof. Leslie Smith and Dr Ahsan Adeel at the University of Stirling, Dr Kristína Malinovská at Comenius University in Bratislava, and the anonymous reviewers for their helpful comments and suggestions.
Keywords
- ANNs
- audiovisual
- speech mapping
- speech processing