International Journal of Image, Graphics and Signal Processing (IJIGSP)

ISSN: 2074-9074 (Print), ISSN: 2074-9082 (Online)

Published By: MECS Press

IJIGSP Vol.6, No.11, Oct. 2014

Automatic Speech Segmentation Based On Audio and Optical Flow Visual Classification

Pages: 43-49


Behnam Torabi, Ahmad Reza Naghsh Nilchi

Index Terms

Optical Flow; Speech Segmentation; Video and Audio Fusion


Automatic speech segmentation, an important part of automatic speech recognition (ASR) systems, is highly sensitive to noise. Noise arises from changes in the communication channel, the background, the level of speaking, and so on. In recent years, many researchers have proposed noise cancellation techniques and have added visual features from the speaker’s face to reduce the effect of noise on ASR systems. Removing noise from audio signals depends on the type of noise, so it cannot serve as a general solution. Adding visual features compensates for this limitation, but advanced methods of this type require manual extraction of the visual features. In this paper we propose a completely automatic system that uses optical flow vectors computed from the speaker’s image sequence to obtain visual features. Hidden Markov Models are then trained on the audio features together with the visual features derived from the extracted optical flow in order to segment the audio signal. The resulting segmentation system is fully automatic and more robust to noise.
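
The abstract does not state which optical flow algorithm or which visual features the system uses. As a rough illustration only, the sketch below assumes the classic Horn-Schunck method and summarizes each flow field by three hypothetical features: mean horizontal flow, mean vertical flow, and mean flow magnitude. The function names, the smoothness weight alpha, and the iteration count are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Tuple

Frame = List[List[float]]  # grayscale image: rows of pixel intensities


def _px(img: Frame, y: int, x: int) -> float:
    """Read a pixel, clamping coordinates to the image border."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def horn_schunck(f1: Frame, f2: Frame, alpha: float = 1.0,
                 iters: int = 100) -> Tuple[Frame, Frame]:
    """Estimate dense optical flow (u, v) from frame f1 to frame f2.

    alpha (assumed value) weights the smoothness term; it must be > 0.
    """
    h, w = len(f1), len(f1[0])
    # Spatial derivatives: central differences averaged over both frames.
    ix = [[(_px(f1, y, x + 1) - _px(f1, y, x - 1)
            + _px(f2, y, x + 1) - _px(f2, y, x - 1)) / 4.0
           for x in range(w)] for y in range(h)]
    iy = [[(_px(f1, y + 1, x) - _px(f1, y - 1, x)
            + _px(f2, y + 1, x) - _px(f2, y - 1, x)) / 4.0
           for x in range(w)] for y in range(h)]
    # Temporal derivative: plain frame difference.
    it = [[f2[y][x] - f1[y][x] for x in range(w)] for y in range(h)]

    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]
    for _ in range(iters):
        new_u = [[0.0] * w for _ in range(h)]
        new_v = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # Local flow average over the four direct neighbours.
                ub = (_px(u, y - 1, x) + _px(u, y + 1, x)
                      + _px(u, y, x - 1) + _px(u, y, x + 1)) / 4.0
                vb = (_px(v, y - 1, x) + _px(v, y + 1, x)
                      + _px(v, y, x - 1) + _px(v, y, x + 1)) / 4.0
                gx, gy = ix[y][x], iy[y][x]
                # Horn-Schunck update: pull the averaged flow toward the
                # brightness-constancy constraint gx*u + gy*v + it = 0.
                t = (gx * ub + gy * vb + it[y][x]) / (alpha ** 2 + gx ** 2 + gy ** 2)
                new_u[y][x] = ub - gx * t
                new_v[y][x] = vb - gy * t
        u, v = new_u, new_v
    return u, v


def flow_features(u: Frame, v: Frame) -> List[float]:
    """Hypothetical per-frame visual features: mean u, mean v, mean magnitude."""
    n = len(u) * len(u[0])
    mean_u = sum(map(sum, u)) / n
    mean_v = sum(map(sum, v)) / n
    mean_mag = sum(math.hypot(a, b)
                   for ru, rv in zip(u, v) for a, b in zip(ru, rv)) / n
    return [mean_u, mean_v, mean_mag]


if __name__ == "__main__":
    # Synthetic check: a horizontal intensity ramp shifted one pixel right.
    a = [[float(x) for x in range(8)] for _ in range(6)]
    b = [[float(x - 1) for x in range(8)] for _ in range(6)]
    print(flow_features(*horn_schunck(a, b)))
```

In a pipeline of the kind the abstract describes, one such feature vector per video frame could be combined with the audio features before Hidden Markov Model training. How the paper actually performs that fusion is not specified here.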

Cite This Paper

Behnam Torabi, Ahmad Reza Naghsh Nilchi, "Automatic Speech Segmentation Based On Audio and Optical Flow Visual Classification", IJIGSP, vol.6, no.11, pp.43-49, 2014. DOI: 10.5815/ijigsp.2014.11.06

