Speech Processing
Our goal in Speech Technology Research is to make speaking to the devices around you (at home, in the car), the devices you wear (watch), and the devices you carry (phone, tablet) ubiquitous and seamless.
Our research focuses on what makes Google unique: computing scale and data. Using large-scale computing resources pushes us to rethink the architecture and algorithms of speech recognition, and to experiment with methods that have in the past been considered prohibitively expensive. We also look at parallelism and cluster computing in a new light to change the way experiments are run, algorithms are developed and research is conducted. The field of speech recognition is data-hungry, and using more and more data to tackle a problem tends to help performance but poses new challenges: how do you deal with data overload? How do you leverage unsupervised and semi-supervised techniques at scale? Which classes of algorithms merely compensate for lack of data, and which scale well with the task at hand? Increasingly, we find that the answers to these questions are surprising, and steer the whole field into directions that would never have been considered were it not for the availability of orders of magnitude more data.
We are also in a unique position to deliver very user-centric research. Our researchers can draw on the wealth of millions of users talking to Voice Search or Android Voice Input every day, and can conduct live experiments to test and benchmark new algorithms directly in a realistic, controlled environment. Whether these are algorithmic performance improvements or user experience and human-computer interaction studies, we keep our users very close to make sure we solve real problems and have real impact.
We have a huge commitment to the diversity of our users, and have made it a priority to deliver the best performance to every language on the planet. We currently have systems operating in more than 55 languages, and we keep expanding our reach to more and more users. The challenge of internationalizing at scale is immense and rewarding. Many speakers of the languages we reach have never had the experience of speaking to a computer before, and breaking this new ground brings up new research on how to better serve this wide variety of users. Combined with the unprecedented translation capabilities of Google Translate, we are now at the forefront of research in speech-to-speech translation and one step closer to a universal translator.
Indexing and transcribing the web’s audio content is another challenge we have set for ourselves, and it is nothing short of gargantuan, both in scope and difficulty. The videos uploaded every day on YouTube range from lectures, to newscasts, music videos and, of course, cat videos. Making sense of them takes the challenges of noise robustness, music recognition, speaker segmentation, and language detection to new levels of difficulty. The payoff is immense: imagine making every lecture on the web accessible in every language; this is the kind of impact we are striving for.
264 Publications
(Almost) Zero-Shot Cross-Lingual Spoken Language Understanding
Shyam Upadhyay, Manaal Faruqui , Gokhan Tur , Dilek Hakkani-Tur , Larry Heck
Proceedings of the IEEE ICASSP (2018)
An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model
Anjuli Kannan , Yonghui Wu , Patrick Nguyen, Tara N. Sainath , Zhifeng Chen , Rohit Prabhavalkar
ICASSP (2018)
Decoding the auditory brain with canonical component analysis
Alain de Cheveigné, Daniel D. E. Wong, Giovanni M. Di Liberto, Jens Hjortkjaer, Malcolm Slaney , Edmund Lalor
NeuroImage (2018)
Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models
Rohit Prabhavalkar , Tara Sainath , Yonghui Wu , Patrick Nguyen, Zhifeng Chen , Chung-Cheng Chiu , Anjuli Kannan
ICASSP 2018 (to appear)
Multilingual Speech Recognition with a Single End-to-End Model
Shubham Toshniwal, Tara N. Sainath , Ron Weiss , Bo Li , Pedro Moreno , Eugene Weinstein , Kanishka Rao
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Jan Chorowski, Ron J. Weiss , Rif A. Saurous , Samy Bengio
Sound source separation using phase difference and reliable mask selection
Chanwoo Kim , Anjali Menon, Michiel Bacchiani , Richard M. Stern
ICASSP (2018) (to appear)
Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition
Chanwoo Kim , Tara Sainath , Arun Narayanan , Ananya Misra , Rajeev Nongpiur, Michiel Bacchiani
ICASSP 2018 (2018)
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Chung-Cheng Chiu , Tara Sainath , Yonghui Wu , Rohit Prabhavalkar , Patrick Nguyen, Zhifeng Chen , Anjuli Kannan , Ron J. Weiss , Kanishka Rao , Katya Gonina, Navdeep Jaitly, Bo Li , Jan Chorowski, Michiel Bacchiani
A Cascade Architecture for Keyword Spotting on Mobile Devices
Alexander Gruenstein , Raziel Alvarez , Chris Thornton, Mohammadali Ghodrat
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
A Comparison of Sequence-to-Sequence Models for Speech Recognition
Rohit Prabhavalkar , Kanishka Rao , Tara Sainath , Bo Li , Leif Johnson , Navdeep Jaitly
Interspeech 2017, ISCA (2017)
A Segmental Framework for Fully-Unsupervised Large-Vocabulary Speech Recognition
Herman Kamper, Aren Jansen , Sharon Goldwater
Computer Speech and Language (2017) (to appear)
A more general method for pronunciation learning
Antoine Bruguier , Dan Gnanapragasam , Francoise Beaufays , Kanishka Rao , Leif Johnson
Interspeech 2017 (2017)
Acoustic Modeling for Google Home
Bo Li , Tara Sainath , Arun Narayanan , Joe Caroselli, Michiel Bacchiani , Ananya Misra , Izhak Shafran , Hasim Sak , Golan Pundak , Kean Chin, Khe Chai Sim, Ron J. Weiss , Kevin Wilson , Ehsan Variani , Chanwoo Kim , Olivier Siohan , Mitchel Weintraub, Erik McDermott , Rick Rose , Matt Shannon
INTERSPEECH 2017 (2017)
An Analysis of "Attention" in Sequence-to-Sequence Models
Rohit Prabhavalkar , Tara Sainath , Bo Li , Kanishka Rao , Navdeep Jaitly
Approaches for Neural-Network Language Model Adaptation
Fadi Biadsy , Michael Alexander Nirschl , Min Ma, Shankar Kumar
Interspeech 2017, Stockholm, Sweden (2017)
Areal and Phylogenetic Features for Multilingual Speech Synthesis
Alexander Gutkin , Richard Sproat
Proc. of Interspeech 2017, ISCA, August 20–24, 2017, Stockholm, Sweden, pp. 2078-2082
Attention-Based Models for Text-Dependent Speaker Verification
F A Rezaur Rahman Chowdhury, Quan Wang , Ignacio Lopez Moreno , Li Wan
Binaural processing for robust speech recognition of degraded speech
Anjali Menon, Chanwoo Kim , Umpei Kurokawa, Richard M. Stern
IEEE Automatic Speech Recognition and Understanding Workshop (2017)
Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals
Fadi Biadsy , Mohammadreza Ghodsi , Diamantino Caseiro
Interspeech 2017 (2017)
Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models
Chanwoo Kim , Ehsan Variani , Arun Narayanan , Michiel Bacchiani
arXiv (2017)
End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow
Ehsan Variani , Tom Bagby, Erik McDermott , Michiel Bacchiani
Endpoint detection using grid long short-term memory networks for streaming speech recognition
Bo Li , Carolina Parada , Gabor Simko , Shuo-yiin Chang , Tara Sainath
In Proc. Interspeech 2017 (to appear)
Generalized End-to-End Loss for Speaker Verification
Li Wan , Quan Wang , Alan Papir , Ignacio Lopez Moreno
Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home
Chanwoo Kim , Ananya Misra , Kean Chin, Thad Hughes , Arun Narayanan , Tara Sainath , Michiel Bacchiani
Interspeech 2017 (2017), pp. 379-383
Generative Model-Based Text-to-Speech Synthesis
Google's next-generation real-time unit-selection synthesizer using sequence-to-sequence LSTM-based autoencoders
Vincent Wan , Yannis Agiomyrgiannakis , Hanna Silen, Jakub Vit
Interspeech (2017)
Highway-LSTM and Recurrent Highway Networks for Speech Recognition
Golan Pundak , Tara Sainath
Proc. Interspeech 2017, ISCA
Human and Machine Hearing: Extracting Meaning from Sound
Richard F. Lyon
Cambridge University Press (2017)
Improved end-of-query detection for streaming speech recognition
Carolina Parada , Gabor Simko , Matt Shannon, Shuo-yiin Chang
Proc. Interspeech 2017 (2017) (to appear)
Incoherent idempotent ambisonics rendering
W. Bastiaan Kleijn, Andrew Allen , Jan Skoglund , Felicia Lim
2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (2017)
Joint Wideband Source Localization and Acquisition Based on a Grid-Shift Approach
Christos Tzagkarakis, Bastiaan Kleijn, Jan Skoglund
Keyword Spotting for Google Assistant Using Contextual Speech Recognition
Assaf Michaely , Carolina Parada , Frank Zhang, Gabor Simko , Petar Aleksic
ASRU 2017, IEEE
Language Modeling in the Era of Abundant Data
Ciprian Chelba
AI With the Best online conference (2017)
Latent Sequence Decompositions
William Chan , Yu Zhang , Quoc Le , Navdeep Jaitly
ICLR (2017)
Multi-Accent Speech Recognition with Hierarchical Grapheme Based Models
Hasim Sak , Kanishka Rao
ICASSP 2017 (to appear)
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
Tara Sainath , Ron J. Weiss , Kevin Wilson , Bo Li , Arun Narayanan , Ehsan Variani , Michiel Bacchiani , Izhak Shafran , Andrew Senior , Kean Chin, Ananya Misra , Chanwoo Kim
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25 (2017), pp. 965-979
On Lattice Generation for Large Vocabulary Speech Recognition
David Rybach , Johan Schalkwyk, Michael Riley
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan (2017)
Optimizing expected word error rate via sampling for speech recognition
Matt Shannon
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals , Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Carlos Cobo Rus, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen , Nal Kalchbrenner, Heiga Zen , Alexander Graves, Helen King, Thomas Walters , Dan Belov, Demis Hassabis
Google DeepMind (2017)
Practically Efficient Nonlinear Acoustic Echo Cancellers Using Cascaded Block RLS and FLMS Adaptive Filters
Yiteng (Arden) Huang, Jan Skoglund , Alejandro Luebs
ICASSP (2017)
Raw Multichannel Processing Using Deep Neural Networks
Tara N. Sainath , Ron J. Weiss , Kevin W. Wilson , Arun Narayanan , Michiel Bacchiani , Bo Li , Ehsan Variani , Izhak Shafran , Andrew Senior , Kean Chin, Ananya Misra , Chanwoo Kim
New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer (2017)
Robust Speech Recognition Based on Binaural Auditory Processing
Anjali Menon, Chanwoo Kim , Richard M. Stern
INTERSPEECH 2017 (2017), pp. 3872-3876
Robust and low-complexity blind source separation for meeting rooms
W. Bastiaan Kleijn, Felicia Lim
Proceedings Fifth Joint Workshop on Hands-free Speech Communication and Microphone Arrays (2017)
Sparse Non-negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap
Ciprian Chelba , Diamantino Caseiro, Fadi Biadsy
The 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 2725-2729 (to appear)
Speaker Diarization with LSTM
Quan Wang , Carlton Downey, Li Wan , Philip Andrew Mansfield, Ignacio Lopez Moreno
Streaming Small-Footprint Keyword Spotting Using Sequence-to-Sequence Models
Yanzhang (Ryan) He, Rohit Prabhavalkar , Kanishka Rao , Wei Li, Anton Bakhtin , Ian McGraw
Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on
Syllable-Based Acoustic Modeling with CTC-SMBR-LSTM
Zhongdi Qu, Parisa Haghani, Eugene Weinstein , Pedro Moreno
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang , RJ Skerry-Ryan , Daisy Stanton , Yonghui Wu , Ron J. Weiss , Navdeep Jaitly, Zongheng Yang, Ying Xiao , Zhifeng Chen , Samy Bengio , Quoc Le , Yannis Agiomyrgiannakis , Rob Clark , Rif A. Saurous
Trainable Frontend For Robust and Far-Field Keyword Spotting
Yuxuan Wang , Pascal Getreuer , Thad Hughes , Richard F. Lyon , Rif A. Saurous
Proc. IEEE ICASSP 2017, New Orleans, LA
Uncovering Latent Style Factors for Expressive Speech Synthesis
Yuxuan Wang , RJ Skerry-Ryan , Ying Xiao , Daisy Stanton , Joel Shor , Eric Battenberg , Rob Clark , Rif A. Saurous
NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio) (2017) (to appear)
Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages
Alexander Gutkin
Proc. of Interspeech 2017, ISCA, August 20–24, Stockholm, Sweden, pp. 2183-2187
Very Deep Convolutional Networks for End-to-End Speech Recognition
Yu Zhang , William Chan , Navdeep Jaitly
Wavenet based low rate speech coding
W. Bastiaan Kleijn, Felicia S. C. Lim , Alejandro Luebs , Jan Skoglund , Florian Stimberg, Quan Wang , Thomas C. Walters
arXiv preprint arXiv:1712.01120 (2017)
A subband-based stationary-component suppression method using harmonics and power ratio for reverberant speech recognition
Byung Joon Cho, Haeyong Kwon, Ji-Won Cho, Chanwoo Kim , Richard M. Stern, Hyung-Min Park
IEEE Signal Processing Letters, vol. 23 (2016), pp. 780-784
An Acoustic Keystroke Transient Canceler for Speech Communication Terminals Using a Semi-Blind Adaptive Filter Model
Herbert Buchner, Simon Godsill, Jan Skoglund
ICASSP (2016)
AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
Brian Patton , Yannis Agiomyrgiannakis , Michael Terry, Kevin Wilson , Rif A. Saurous , D. Sculley
NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop (to appear)
Automatic Optimization of Data Perturbation Distributions for Multi-Style Training in Speech Recognition
Mortaza Doulaty, Richard Rose , Olivier Siohan
Proceedings of the IEEE 2016 Workshop on Spoken Language Technology (SLT2016)
Bi-Magnitude Processing Framework for Nonlinear Acoustic Echo Cancellation on Android Devices
Yiteng (Arden) Huang , Jan Skoglund , Alejandro Luebs
International Workshop on Acoustic Signal Enhancement 2016 (IWAENC2016)
Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
Alexander Gutkin , Linne Ha, Martin Jansche , Oddur Kjartansson, Knot Pipatsrisawat, Richard Sproat
SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced languages, 09-12 May 2016, Yogyakarta, Indonesia; Procedia Computer Science, Elsevier B.V., pp. 194-200
Complex Linear Projection (CLP): A Discriminative Approach to Joint Feature Extraction and Acoustic Modeling
Ehsan Variani , Tara N. Sainath , Izhak Shafran , Michiel Bacchiani
Interspeech 2016 (2016)
Contextual prediction models for speech recognition
Yoni Halpern, Keith Hall , Vlad Schogol, Michael Riley , Brian Roark , Gleb Skobeltsyn , Martin Baeuml
Proceedings of Interspeech 2016
Cross-lingual projection for class-based language models
Beat Gfeller, Vlad Schogol, Keith Hall
Directly Modeling Voiced and Unvoiced Components in Speech Waveforms by Neural Networks
Keiichi Tokuda, Heiga Zen
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2016), pp. 5640-5644
Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition
Austin Waters , Yevgen Chebotar
Interspeech (2016)
Distributed representation and estimation of WFST-based n-gram models
Cyril Allauzen , Michael Riley , Brian Roark
Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41
End-to-End Text-Dependent Speaker Verification
Georg Heigold , Ignacio Moreno , Samy Bengio , Noam M. Shazeer
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Factored Spatial and Spectral Multichannel Raw Waveform CLDNNs
Tara N. Sainath , Ron J. Weiss , Kevin W. Wilson , Arun Narayanan , Michiel Bacchiani
Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices
Heiga Zen , Yannis Agiomyrgiannakis , Niels Egberts, Fergus Henderson , Przemysław Szczepaniak
Proc. Interspeech, San Francisco, CA, USA (2016)
Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection
Ruben Zazo, Tara N. Sainath , Gabor Simko , Carolina Parada
Flatstart-CTC: a new acoustic model training procedure for speech recognition
Andrew Senior , Hasim Sak , Kanishka Rao
ICASSP 2016
Globally Optimized Least-Squares Post-Filtering for Microphone Array Speech Enhancement
Yiteng (Arden) Huang , Alejandro Luebs , Jan Skoglund , W. Bastiaan Kleijn
High quality agreement-based semi-supervised training data for acoustic modeling
Félix de Chaumont Quitry , Asa Oines, Pedro Moreno , Eugene Weinstein
2016 IEEE Workshop on Spoken Language Technology
Learning Compact Recurrent Neural Networks
Zhiyun Lu, Vikas Sindhwani , Tara Sainath
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016
Learning N-gram Language Models from Uncertain Data
Vitaly Kuznetsov , Hank Liao , Mehryar Mohri , Michael Riley , Brian Roark
Learning Personalized Pronunciations for Contact Names Recognition
Tony Bruguier , Fuchun Peng , Francoise Beaufays
Interspeech 2016 (to appear)
Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
William Chan , Navdeep Jaitly, Quoc V. Le , Oriol Vinyals
Lower Frame Rate Neural Network Acoustic Models
Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks
Tara N. Sainath , Bo Li
Proc. Interspeech, ISCA (2016) (to appear)
Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN based Statistical Parametric Speech Synthesis
Bo Li , Heiga Zen
Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition
Bo Li , Tara N. Sainath , Ron J. Weiss , Kevin W. Wilson , Michiel Bacchiani
Proc. Interspeech, ISCA (2016)
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
Hagen Soltau, Hank Liao , Hasim Sak
ArXiv e-prints (2016)
On Pre-Filtering Strategies for the GCC-PHAT Algorithm
Hong-Goo Kang, Michael Graczyk, Jan Skoglund
International Workshop on Acoustic Signal Enhancement 2016 (IWAENC 2016)
On The Compression Of Recurrent Neural Networks With An Application To LVCSR Acoustic Modeling For Embedded Speech Recognition
Rohit Prabhavalkar , Ouais Alsharif , Antoine Bruguier , Ian McGraw
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
On the Efficient Representation and Execution of Deep Acoustic Models
Raziel Alvarez , Rohit Prabhavalkar , Anton Bakhtin
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech) (2016)
Personalized Speech Recognition On Mobile Devices
Ian McGraw, Rohit Prabhavalkar , Raziel Alvarez , Montse Gonzalez Arenas, Kanishka Rao , David Rybach , Ouais Alsharif , Hasim Sak , Alexander Gruenstein , Françoise Beaufays , Carolina Parada
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
Chanwoo Kim , Richard M. Stern
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24 (2016), pp. 1315-1329
Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks
Daan van Esch, Kanishka Rao , Mason Chua
Proceedings of InterSpeech 2016 (to appear)
Pynini: A Python library for weighted finite-state grammar compilation
Kyle Gorman
Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (2016), pp. 75-80
Recent Advances in Google Real-time HMM-driven Unit Selection Synthesizer
Xavi Gonzalvo , Siamak Tazari, Chun-an Chan, Markus Becker, Alexander Gutkin , Hanna Silen
INTERSPEECH 2016, Sep 8-12, San Francisco, USA, pp. 2238-2242
Reducing the Computational Complexity of Multimicrophone Acoustic Models with Integrated Feature Extraction
Tara N. Sainath , Arun Narayanan , Ron J. Weiss , Ehsan Variani , Kevin W. Wilson , Michiel Bacchiani , Izhak Shafran
Robust Estimation of Reverberation Time Using Polynomial Roots
Ian Kelly , Francis Boland, Jan Skoglund
AES 60th Conference on Dereverberation and Reverberation of Audio, Music, and Speech, Google Ireland Ltd. (2016)
Selection and Combination of Hypotheses for Dialectal Speech Recognition
Victor Soto, Olivier Siohan , Mohamed Elfeky , Pedro J. Moreno
Semantic Model for Fast Tagging of Word Lattices
Leonid Velikovich
IEEE Spoken Language Technology (SLT) Workshop (2016) (to appear)
The Matching-Minimization Algorithm, the INCA Algorithm and a Mathematical Framework for Voice Conversion with Unaligned Corpora
Yannis Agiomyrgiannakis
ICASSP, IEEE (2016)
TTS for Low Resource Languages: A Bangla Synthesizer
Alexander Gutkin , Linne Ha, Martin Jansche , Knot Pipatsrisawat, Richard Sproat
10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, European Language Resources Association (ELRA), Portorož, Slovenia, pp. 2005-2010
Towards Acoustic Model Unification Across Dialects
Austin Waters , Meysam Bastani, Mohamed G. Elfeky , Pedro Moreno , Xavier Velez
Unsupervised Context Learning For Speech Recognition
Assaf Michaely , Justin Scheiner, Mohammadreza Ghodsi , Petar Aleksic , Zelin Wu
Spoken Language Technology (SLT) Workshop, IEEE (2016)
Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
Aren Jansen , Herman Kamper, Sharon Goldwater
IEEE Transactions on Audio, Speech, and Language Processing (2016)
Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis
Hideki Kawahara, Yannis Agiomyrgiannakis , Heiga Zen
Proc. ISCA SSW9 (2016), pp. 238-245
Voice Morphing That Improves TTS Quality Using an Optimal Dynamic Frequency Warping-and-Weighting Transform
Yannis Agiomyrgiannakis , Zoe Roupakia
A 6 µW per Channel Analog Biomimetic Cochlear Implant Processor Filterbank Architecture With Across Channels AGC
Guang Wang, Richard F. Lyon , Emmanuel M. Drakakis
IEEE Transactions on Biomedical Circuits and Systems, vol. 9 (2015), pp. 72-86
A Gaussian Mixture Model Layer Jointly Optimized with Discriminative Features within A Deep Neural Network Architecture
Ehsan Variani , Erik McDermott , Georg Heigold
ICASSP, IEEE (2015)
Acoustic Modeling for Speech Synthesis: from HMM to RNN
IEEE ASRU, Scottsdale, Arizona, U.S.A. (2015)
Acoustic Modeling in Statistical Parametric Speech Synthesis - From HMM to LSTM-RNN
Proc. MLSLP (2015)
Acoustic Modelling with CD-CTC-SMBR LSTM RNNS
Andrew Senior , Hasim Sak , Felix de Chaumont Quitry , Tara N. Sainath , Kanishka Rao
ASRU (2015)
Automatic Gain Control and Multi-style Training for Robust Small-Footprint Keyword Spotting with Deep Neural Networks
Rohit Prabhavalkar , Raziel Alvarez , Carolina Parada , Preetum Nakkiran, Tara Sainath
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015), pp. 4704-4708
Automatic Pronunciation Verification for Speech Recognition
Kanishka Rao , Fuchun Peng , Françoise Beaufays
ICASSP (2015)
Bringing Contextual Information to Google Speech Recognition
Petar Aleksic , Mohammadreza Ghodsi , Assaf Michaely , Cyril Allauzen , Keith Hall , Brian Roark , David Rybach , Pedro Moreno
Interspeech 2015, International Speech Communications Association
Composition-based on-the-fly rescoring for salient n-gram biasing
Keith Hall , Eunjoon Cho, Cyril Allauzen , Francoise Beaufays , Noah Coccaro, Kaisuke Nakajima, Michael Riley , Brian Roark , David Rybach , Linda Zhang
Compressing Deep Neural Networks using a Rank-Constrained Topology
Preetum Nakkiran, Raziel Alvarez , Rohit Prabhavalkar , Carolina Parada
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2015), pp. 1473-1477
Context dependent phone models for LSTM RNN acoustic modelling
Andrew W. Senior , Hasim Sak , Izhak Shafran
ICASSP (2015), pp. 4585-4589
Convolutional Neural Networks for Small-Footprint Keyword Spotting
Tara Sainath , Carolina Parada
Interspeech (2015)
Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks
Tara Sainath , Oriol Vinyals , Andrew Senior , Hasim Sak
Detection and Suppression of Keyboard Transient Noise in Audio Streams with Auxiliary Keybed Microphone
Simon Godsill, Herbert Buchner, Jan Skoglund
ICASSP 2015, IEEE
Direct-to-Reverberant Ratio Estimation Using a Null-Steered Beamformer
James Eaton, Alastair Moore, Patrick Naylor, Jan Skoglund
Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends
Zhen-Hua Ling, Shiyin Kang, Heiga Zen , Andrew Senior , Mike Schuster , Xiao-Jun Qian, Helen Meng, Li Deng
IEEE Signal Processing Magazine, vol. 32 (2015), pp. 35-52
Directly Modeling Speech Waveforms by Neural Networks for Statistical Parametric Speech Synthesis
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2015), pp. 4215-4219
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition
Hasim Sak , Andrew W. Senior , Kanishka Rao , Françoise Beaufays
CoRR, vol. abs/1507.06947 (2015)
Fix It Where It Fails: Pronunciation Learning by Mining Error Corrections from Speech Logs
Zhenzhen Kou, Daisy Stanton , Fuchun Peng , Françoise Beaufays , Trevor Strohman
Garbage Modeling for On-device Speech Recognition
Christophe Van Gysel, Leonid Velikovich , Ian McGraw, Françoise Beaufays
Interspeech 2015, International Speech Communications Association (to appear)
Geo-location for Voice Search Language Modeling
Ciprian Chelba , Xuedong Zhang, Keith Hall
Interspeech 2015, International Speech Communications Association, pp. 1438-1442
Grapheme-to-Phoneme Conversion Using Long Short-Term Memory Recurrent Neural Networks
Kanishka Rao , Fuchun Peng , Hasim Sak , Françoise Beaufays
Improved recognition of contact names in voice commands
Petar Aleksic , Cyril Allauzen , David Elson, Aleks Kracun, Diego Melendo Casado, Pedro J. Moreno
ICASSP 2015
Stanford Information Theory Forum (2015)
Large Vocabulary Automatic Speech Recognition for Children
Hank Liao , Golan Pundak , Olivier Siohan , Melissa Carroll, Noah Coccaro, Qi-Ming Jiang, Tara N. Sainath , Andrew Senior , Françoise Beaufays , Michiel Bacchiani
Large-scale, sequence-discriminative, joint adaptive training for masking-based robust ASR
Arun Narayanan , Ananya Misra , Kean Chin
INTERSPEECH-2015, ISCA, pp. 3571-3575
Learning acoustic frame labeling for speech recognition with recurrent neural networks
Hasim Sak , Andrew W. Senior , Kanishka Rao , Ozan Irsoy, Alex Graves, Françoise Beaufays, Johan Schalkwyk
ICASSP (2015), pp. 4280-4284
Learning the Speech Front-end with Raw Waveform CLDNNs
Tara Sainath , Ron J. Weiss , Kevin Wilson , Andrew W. Senior , Oriol Vinyals
Listen, Attend and Spell
CoRR, vol. abs/1508.01211 (2015)
Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition
Yu-hsin Chen, Ignacio Lopez Moreno , Tara Sainath , Mirkó Visontai, Raziel Alvarez , Carolina Parada
Long Short-Term Memory Language Models with Additive Morphological Features for Automatic Speech Recognition
Daniel Renshaw, Keith B. Hall
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2015)
Multi-Dialectical Languages Effect on Speech Recognition
Mohamed Elfeky , Pedro J. Moreno , Victor Soto
International Conference on Natural Language and Speech Processing (2015)
Multitask learning and system combination for automatic speech recognition
Olivier Siohan , David Rybach
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Pruning Sparse Non-negative Matrix N-gram Language Models
Joris Pelemans, Noam M. Shazeer, Ciprian Chelba
Proceedings of Interspeech 2015, ISCA, pp. 1433-1437
Query-by-Example Keyword Spotting Using Long Short-Term Memory Networks
Guoguo Chen, Carolina Parada , Tara N. Sainath
Rapid Vocabulary Addition to Context-Dependent Decoder Graphs
Cyril Allauzen , Michael Riley
Interspeech 2015
Sequence-based Class Tagging for Robust Transcription in ASR
Lucy Vasserman , Vlad Schogol, Keith Hall
Sound source separation algorithm using phase difference and angle distribution modeling near the target
Chanwoo Kim , Kean Chin
INTERSPEECH 2015, pp. 751-755
Sparse Non-negative Matrix Language Modeling for Geo-annotated Query Session Data
Ciprian Chelba , Noam M. Shazeer
Automatic Speech Recognition and Understanding Workshop (ASRU 2015) Proceedings, IEEE (to appear)
Speaker Location and Microphone Spacing Invariant Acoustic Modeling from Raw Multichannel Waveforms
Tara N. Sainath , Ron J. Weiss , Kevin Wilson , Arun Narayanan , Michiel Bacchiani , Andrew Senior
Speech Acoustic Modeling from Raw Multichannel Waveforms
Yedid Hoshen, Ron Weiss , Kevin W Wilson
International Conference on Acoustics, Speech, and Signal Processing, IEEE (2015)
Statistical parametric speech synthesis: from HMM to LSTM-RNN
RTTH Summer School on Speech Technology -- A Deep Learning Perspective, Barcelona, Spain (2015)
Telluride Decoding Toolbox
Sahar Akram, Alain de Cheveigné, Peter Udo Diehl, Emily Graber, Carina Graversen, Jens Hjortkjaer, Nima Mesgarani, Lucas Parra, Ulrich Pomper, Shihab Shamma, Jonathan Simon, Malcolm Slaney , Daniel Wong
Institute for Neuroinformatics (2015)
Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis
Heiga Zen , Hasim Sak
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2015), pp. 4470-4474
ViSQOL: an objective speech quality model
Andrew Hines, Jan Skoglund , Anil Kokaram , Naomi Harte
EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015 (13) (2015), pp. 1-18
Vocaine the Vocoder and Applications in Speech Synthesis
ICASSP, IEEE (2015) (to appear)
A big data approach to acoustic model training corpus selection
Olga Kapralova , John Alex, Eugene Weinstein , Pedro Moreno , Olivier Siohan
Conference of the International Speech Communication Association (Interspeech) (2014)
An Analysis of the Effect of Larynx-Synchronous Averaging on Dereverberation of Voiced Speech
Alastair H Moore, Patrick A Naylor, Jan Skoglund
Proceedings of European Signal Processing Conference (EUSIPCO) 2014
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
Georg Heigold , Erik McDermott , Vincent Vanhoucke , Andrew Senior , Michiel Bacchiani
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks: Towards Big Data
Erik McDermott , Georg Heigold , Pedro Moreno , Andrew Senior, Michiel Bacchiani
Interspeech, ISCA (2014)
Asynchronous, Online, GMM-free Training of a Context Dependent Acoustic Model for Speech Recognition
M. Bacchiani , A. Senior , G. Heigold
Proceedings of the European Conference on Speech Communication and Technology (2014) (to appear)
Automatic Language Identification Using Deep Neural Networks
Ignacio Lopez-Moreno , Javier Gonzalez-Dominguez, Oldrich Plchot
Proc. ICASSP, IEEE (2014)
Automatic Language Identification using Long Short-Term Memory Recurrent Neural Networks
Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno , Hasim Sak
Interspeech (2014)
Autoregressive Product of Multi-frame Predictions Can Improve the Accuracy of Hybrid Models
Navdeep Jaitly, Vincent Vanhoucke , Geoffrey Hinton
Proceedings of Interspeech 2014
Backoff Inspired Features for Maximum Entropy Language Models
Fadi Biadsy , Keith Hall , Pedro Moreno , Brian Roark
Proceedings of Interspeech, ISCA (2014)
Computer-aided quality assurance of an Icelandic pronunciation dictionary
Martin Jansche
LREC 2014, Reykjavik
Context Dependent State Tying for Speech Recognition using Deep Neural Network Acoustic Models
M. Bacchiani, D. Rybach
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2014)
Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis
Heiga Zen, Andrew Senior
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 3872-3876
Deep Neural Networks for Small Footprint Text-dependent Speaker Verification
Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, Javier Gonzalez-Dominguez
Direct construction of compact context-dependency transducers from data
David Rybach, Michael Riley, Chris Alberti
Computer Speech & Language, vol. 28 (2014), pp. 177-191
Discriminative pronunciation modeling for dialectal speech recognition
Maider Lehr, Kyle Gorman, Izhak Shafran
Proc. Interspeech (2014) (to appear)
Encoding Linear Models As Weighted Finite-State Transducers
Ke Wu, Cyril Allauzen, Keith Hall, Michael Riley, Brian Roark
Interspeech 2014, ISCA, pp. 1258-1262
Fine Context, Low-rank, Softplus Deep Neural Networks for Mobile Speech Recognition
Andrew Senior, Xin Lei
Proc. ICASSP (2014) (to appear)
Frame by Frame Language Identification in Short Utterances using Deep Neural Networks
Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, Joaquin Gonzalez-Rodriguez
Neural Networks Special Issue: Neural Network Learning in Big Data (2014)
GMM-Free DNN Training
A. Senior, G. Heigold, M. Bacchiani, H. Liao
Improving DNN Speaker Independence with I-vector Inputs
Andrew Senior, Ignacio Lopez-Moreno
JustSpeak: Enabling Universal Voice Control on Android
Yu Zhong, T. V. Raman, Casey Burkhardt, Fadi Biadsy, Jeffrey P. Bigham
Large-Scale Speaker Identification
Ludwig Schmidt, Matthew Sharifi, Ignacio Lopez-Moreno
Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition
Hasim Sak, Andrew W. Senior, Françoise Beaufays
CoRR, vol. abs/1402.1128 (2014)
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
INTERSPEECH (2014), pp. 338-342
Pronunciation Learning for Named-Entities through Crowd-Sourcing
Attapol Rutherford, Fuchun Peng, Françoise Beaufays
Proceedings of Interspeech (2014)
Robust speech recognition in reverberant environments using subband-based steady-state monaural and binaural suppression
Hyung-Min Park, Matthew Maciejewski, Chanwoo Kim, Richard M. Stern
INTERSPEECH (2014), pp. 2715-2718
Robust speech recognition using temporal masking and thresholding algorithm
Chanwoo Kim, Kean Chin, Michiel Bacchiani, R. M. Stern
INTERSPEECH-2014, pp. 2734-2738
Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks
Hasim Sak, Oriol Vinyals, Georg Heigold, Andrew Senior, Erik McDermott, Rajat Monga, Mark Mao
Sinusoidal Interpolation Across Missing Data
W. Bastiaan Kleijn, Turaj Zakizadeh Shabestary, Jan Skoglund
International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), pp. 71-75
Small-Footprint Keyword Spotting using Deep Neural Networks
Guoguo Chen, Carolina Parada, Georg Heigold
ICASSP, IEEE (2014)
Statistical Parametric Speech Synthesis
UKSpeech Conference, Edinburgh, UK (2014)
Text-To-Speech with cross-lingual Neural Network-based grapheme-to-phoneme models
Xavi Gonzalvo, Monika Podsiadlo
Training Data Selection Based On Context-Dependent State Matching
Olivier Siohan
Proceedings of ICASSP 2014
Word Embeddings for Speech Recognition
Samy Bengio, Georg Heigold
Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech (2014)
A Frequency-Weighted Post-Filtering Transform for Compensation of the Over-Smoothing Effect in HMM-Based Speech Synthesis
Yannis Agiomyrgiannakis, Florian Eyben
ICASSP, IEEE (2013)
Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices
Xin Lei, Andrew Senior, Alexander Gruenstein, Jeffrey Sorensen
Interspeech (2013)
An Empirical study of learning rates in deep neural networks for speech recognition
Andrew Senior, Georg Heigold, Marc'aurelio Ranzato, Ke Yang
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013) (to appear)
Deep Learning in Speech Synthesis
8th ISCA Speech Synthesis Workshop, Barcelona, Spain (2013)
Deep Neural Networks with Auxiliary Gaussian Mixture Models for Real-Time Speech Recognition
Xin Lei, Hui Lin, Georg Heigold
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search
Ciprian Chelba, Johan Schalkwyk
Mobile Speech and Advanced Natural Language Solutions, Springer Science+Business Media, New York (2013), pp. 197-229
Language Model Verbalization for Automatic Speech Recognition
Hasim Sak, Françoise Beaufays, Kaisuke Nakajima, Cyril Allauzen
Proc ICASSP, IEEE (2013)
Language Modeling Capitalization
Françoise Beaufays, Brian Strope
Proc ICASSP, IEEE (2013) (to appear)
Large Scale Distributed Acoustic Modeling With Back-off N-grams
Ciprian Chelba, Peng Xu, Fernando Pereira, Thomas Richardson
IEEE Transactions on Audio, Speech and Language Processing, vol. 21 (2013), pp. 1158-1169
ICSI, Berkeley, California (2013)
Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
Hank Liao, Erik McDermott, Andrew Senior
ASRU (2013)
Mixture of mixture n-gram language models
Hasim Sak, Cyril Allauzen, Kaisuke Nakajima, Françoise Beaufays
ASRU (2013), pp. 31-36
Monitoring the Effects of Temporal Clipping on VoIP Speech Quality
Interspeech 2013, pp. 1188-1192
Multiframe Deep Neural Networks for Acoustic Modeling
Vincent Vanhoucke, Matthieu Devin, Georg Heigold
Multilingual acoustic models using distributed deep neural networks
Georg Heigold, Vincent Vanhoucke, Andrew Senior, Patrick Nguyen, Marc'aurelio Ranzato, Matthieu Devin, Jeff Dean
On Rectified Linear Units For Speech Processing
M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, G.E. Hinton
38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver (2013)
Pre-Initialized Composition for Large-Vocabulary Speech Recognition
Interspeech 2013, pp. 666-670
Rapid Adaptation for Mobile Speech Applications
M. Bacchiani
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2013)
Rate-Distortion Optimization for Multichannel Audio Compression
Minyue Li, Jan Skoglund, W. Bastiaan Kleijn
2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Recurrent Neural Networks for Voice Activity Detection
Thad Hughes, Keir Mierle
ICASSP, IEEE (2013), pp. 7378-7382
Robustness of Speech Quality Metrics to Background Noise and Network Degradations: Comparing VISQOL, PESQ and POLQA
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 3697-3701
Search Results Based N-Best Hypothesis Rescoring With Maximum Entropy Classification
Fuchun Peng, Scott Roy, Ben Shahshahani, Françoise Beaufays
Proceedings of ASRU (2013)
Smoothed marginal distribution constraints for language modeling
Brian Roark, Cyril Allauzen, Michael Riley
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL) (2013), pp. 43-52
Speaker Adaptation of Context Dependent Deep Neural Networks
International Conference of Acoustics, Speech, and Signal Processing. (2013)
Speech and Natural Language: Where Are We Now And Where Are We Headed?
Mobile Voice Conference, San Francisco (2013)
Statistical Parametric Speech Synthesis Using Deep Neural Networks
Heiga Zen, Andrew Senior, Mike Schuster
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7962-7966
Written-Domain Language Modeling for Automatic Speech Recognition
Hasim Sak, Yun-hsuan Sung, Françoise Beaufays, Cyril Allauzen
iVector-based Acoustic Data Selection
Olivier Siohan, Michiel Bacchiani
Proceedings of Interspeech (2013)
Application Of Pretrained Deep Neural Networks To Large Vocabulary Speech Recognition
Navdeep Jaitly, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke
Proceedings of Interspeech 2012
Building adaptive dialogue systems via Bayes-adaptive POMDP
Shaowei Png, Joelle Pineau, B. Chaib-draa
IEEE Journal of Selected Topics in Signal Processing, vol. 6(8) (2012), pp. 917-927
Chapter 17: Uncertainty Decoding, In Virtanen, Singh, & Raj (Eds.) Techniques for Noise Robustness in Automatic Speech Recognition.
Wiley (2012), pp. 463-485
Continuous Space Discriminative Language Modeling
Puyang Xu, Sanjeev Khudanpur, Maider Lehr, Emily Prud’hommeaux, Nathan Glenn, Damianos Karakos, Brian Roark, Kenji Sagae, Murat Saraclar, Izhak Shafran, Dan Bikel, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hasler, Philipp Koehn, Adam Lopez, Matt Post, Darcey Riley
ICASSP 2012
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury
Signal Processing Magazine (2012)
Distributed Acoustic Modeling with Back-off N-grams
Proceedings of ICASSP 2012, IEEE, pp. 4129-4132
Distributed Discriminative Language Models for Google Voice Search
Preethi Jyothi, Leif Johnson, Ciprian Chelba, Brian Strope
Proceedings of ICASSP 2012, IEEE, pp. 5017-5021
Estimating Word-Stability During Incremental Speech Recognition
Ian McGraw, Alexander Gruenstein
Interspeech (2012)
Exemplar-Based Processing for Speech Recognition: An Overview
Tara N. Sainath, Bhuvana Ramabhadran, David Nahamoo, Dimitri Kanevsky, Dirk Van Compernolle, Kris Demuynck, Jort F. Gemmeke, Jerome R. Bellegarda, Shiva Sundaram
IEEE Signal Process. Mag., vol. 29 (2012), pp. 98-113
Google's Cross-Dialect Arabic Voice Search
Fadi Biadsy, Pedro J. Moreno, Martin Jansche
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), pp. 4441-4444
Hallucinated N-Best Lists for Discriminative Language Modeling
Kenji Sagae, Maider Lehr, Emily Tucker Prud’hommeaux, Puyang Xu, Nathan Glenn, Damianos Karakos, Sanjeev Khudanpur, Brian Roark, Murat Saraçlar, Izhak Shafran, Daniel M. Bikel, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hassler, Philipp Koehn, Adam Lopez, Matt Post, Darcey Riley
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
Haptic Voice Recognition Grand Challenge
K. Sim, S. Zhao, K. Yu, H. Liao
14th ACM International Conference on Multimodal Interaction. (2012)
Improved Prediction of Nearly-Periodic Signals
Bastiaan Kleijn, Jan Skoglund
International Workshop on Acoustic Signal Enhancement 2012 (IWAENC2012)
Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data
Georg Heigold, Patrick Nguyen, Mitchel Weintraub, Vincent Vanhoucke
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan (2012), pp. 4437-4440
Japanese and Korean Voice Search
Mike Schuster, Kaisuke Nakajima
International Conference on Acoustics, Speech and Signal Processing, IEEE (2012), pp. 5149-5152
Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
Ciprian Chelba, Johan Schalkwyk, Boulos Harb, Carolina Parada, Cyril Allauzen, Leif Johnson, Michael Riley, Peng Xu, Preethi Jyothi, Thorsten Brants, Vida Ha, Will Neveitt
University of Toronto (2012)
Large Scale Language Modeling in Automatic Speech Recognition
Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, Shankar Kumar
Google (2012)
Large-scale Discriminative Language Model Reranking for Voice Search
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, Association for Computational Linguistics, pp. 41-49
Learning improved linear transforms for speech recognition
Andrew Senior, Youngmin Cho, Jason Weston
ICASSP, IEEE (2012)
Music Models for Music-Speech Separation
Thad Hughes, Trausti Kristjansson
ICASSP, IEEE (2012), pp. 4917-4920
Optimal Size, Freshness and Time-frame for Voice Search Vocabulary
Maryam Kamvar, Ciprian Chelba
Recognition of Multilingual Speech in Mobile Applications
Hui Lin, Jui-Ting Huang, Francoise Beaufays, Brian Strope, Yun-hsuan Sung
ICASSP (2012)
Recurrent Neural Networks for Noise Reduction in Robust ASR
Andrew Maas, Quoc V. Le, Tyler M. O’Neil, Oriol Vinyals, Patrick Nguyen, Andrew Y. Ng
INTERSPEECH (2012)
Semi-supervised Discriminative Language Modeling for Turkish ASR
Murat Saraçlar, Daniel M. Bikel, Keith Hall, Kenji Sagae
2012 IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, IEEE, Kyoto, Japan
Spectral Intersections for Non-Stationary Signal Separation
Trausti Kristjansson, Thad Hughes
Proceedings of InterSpeech 2012, Portland, OR
Speech/Nonspeech Segmentation in Web Videos
Ananya Misra
Proceedings of InterSpeech 2012
ViSQOL: The Virtual Speech Quality Objective Listener
Voice Query Refinement
Cyril Allauzen, Edward Benson, Ciprian Chelba, Michael Riley, Johan Schalkwyk
A Web-Based Tool for Developing Multilingual Pronunciation Lexicons
Samantha Ainsley, Linne Ha, Martin Jansche, Ara Kim, Masayuki Nanzawa
12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 3331-3332
Bayesian Language Model Interpolation for Mobile Speech Input
Interspeech 2011, pp. 1429-1432
Deploying Google Search by Voice in Cantonese
Yun-hsuan Sung, Martin Jansche, Pedro Moreno
12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 2865-2868
Discriminative Features for Language Identification
C. Alberti, M. Bacchiani
INTERSPEECH (2011)
Improving the speed of neural networks on CPUs
Vincent Vanhoucke, Andrew Senior, Mark Z. Mao
Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011
Ciprian Chelba, Johan Schalkwyk, Boulos Harb, Carolina Parada, Cyril Allauzen, Michael Riley, Peng Xu, Thorsten Brants, Vida Ha, Will Neveitt
OGI/OHSU Seminar Series, Portland, Oregon, USA (2011)
Recognizing English Queries in Mandarin Voice Search
Hung-An Chang, Yun-hsuan Sung, Brian Strope, Francoise Beaufays
ICASSP (2011)
Speech Retrieval
Ciprian Chelba, Timothy J. Hazen, Bhuvana Ramabhadran, Murat Saraçlar
Spoken Language Understanding, John Wiley and Sons, Ltd (2011), pp. 417-446
Summary of Opus listening test results
Christian Hoene, Jean-Marc Valin, Koen Vos, Jan Skoglund
IETF, IETF (2011)
TechWare: Mobile Media Search Resources [Best of the Web]
Z. Liu, M. Bacchiani
IEEE Signal Processing Magazine, vol. 28 (2011), pp. 142-145
Unsupervised Testing Strategies for ASR
Brian Strope, Doug Beeferman, Alexander Gruenstein, Xin Lei
Interspeech 2011, pp. 1685-1688
Challenges in Automatic Speech Recognition
Ciprian Chelba, Johan Schalkwyk, Michiel Bacchiani
Interspeech 2010
Decision Tree State Clustering with Word and Syllable Features
Hank Liao, Chris Alberti, Michiel Bacchiani, Olivier Siohan
Interspeech, ISCA (2010), pp. 2958-2961
Discriminative Topic Segmentation of Text and Speech
Mehryar Mohri, Pedro Moreno, Eugene Weinstein
International Conference on Artificial Intelligence and Statistics (AISTATS) (2010)
Google Search by Voice: A Case Study
Johan Schalkwyk, Doug Beeferman, Francoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Garrett, Brian Strope
Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, Springer (2010), pp. 61-90
On-Demand Language Model Interpolation for Mobile Speech Input
Brandon Ballinger, Cyril Allauzen, Alexander Gruenstein, Johan Schalkwyk
Interspeech (2010), pp. 1812-1815
Search by Voice in Mandarin Chinese
Jiulong Shan, Genqing Wu, Zhihong Hu, Xiliu Tang, Martin Jansche, Pedro J. Moreno
Interspeech 2010, pp. 354-357
Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Francoise Beaufays, Vincent Vanhoucke, Brian Strope
Proc Interspeech (2010)
A new quality measure for topic segmentation of text and speech
Mehryar Mohri, Pedro J. Moreno, Eugene Weinstein
Conference of the International Speech Communication Association (Interspeech) (2009)
Restoring Punctuation and Capitalization in Transcribed Speech
Agustín Gravano, Martin Jansche, Michiel Bacchiani
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4741-4744
Revisiting Graphemes with Increasing Amounts of Data
Yun-Hsuan Sung, Thad Hughes, Francoise Beaufays, Brian Strope
ICASSP, IEEE (2009)
Web-derived Pronunciations
Arnab Ghoshal, Martin Jansche, Sanjeev Khudanpur, Michael Riley, Morgan Ulinski
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4289-4292
Confidence Scores for Acoustic Model Adaptation
C. Gollan, M. Bacchiani
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2008)
Deploying GOOG-411: Early Lessons in Data, Measurement, and Testing
Michiel Bacchiani, Francoise Beaufays, Johan Schalkwyk, Mike Schuster, Brian Strope
Proc. ICASSP (2008)
Retrieval and Browsing of Spoken Content
Ciprian Chelba, Timothy J. Hazen, Murat Saraçlar
Signal Processing Magazine, IEEE, vol. 25 (2008), pp. 39-49
Speech Recognition with Weighted Finite-State Transducers
Mehryar Mohri, Fernando C. N. Pereira, Michael Riley
Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, Springer-Verlag, Heidelberg, Germany (2008)
Sensors (Basel)
On the Security and Privacy Challenges of Virtual Assistants
1 School of Science, Environment and Engineering, The University of Salford, Salford M5 4WT, UK; [email protected] (T.B.); [email protected] (T.D.); [email protected] (S.B.)
Tooska Dargahi
Sana Belguith, Mabrook S. Al-Rakhami
2 Research Chair of Pervasive and Mobile Computing, Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Ali Hassan Sodhro
3 Department of Computer and System Science, Mid Sweden University, SE-831 25 Östersund, Sweden; [email protected]
4 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518000, China
5 Department of Electrical Engineering, Sukkur IBA University, Sukkur 65200, Pakistan
Since the purchase of Siri by Apple, and its release with the iPhone 4S in 2011, virtual assistants (VAs) have grown in number and popularity. The sophisticated natural language processing and speech recognition employed by VAs enables users to interact with them conversationally, almost as they would with another human. To service user voice requests, VAs transmit large amounts of data to their vendors; these data are processed and stored in the Cloud. The potential data security and privacy issues involved in this process provided the motivation to examine the current state of the art in VA research. In this study, we identify peer-reviewed literature that focuses on security and privacy concerns surrounding these assistants, including current trends in addressing how voice assistants are vulnerable to malicious attacks and worries that the VA is recording without the user’s knowledge or consent. The findings show that not only are these worries manifold, but there is a gap in the current state of the art, and no current literature reviews on the topic exist. This review sheds light on future research directions, such as providing solutions to perform voice authentication without an external device, and the compliance of VAs with privacy regulations.
1. Introduction
Within the last decade, there has been an increasing interest by governments and industry in developing smart homes. Houses are equipped with several internet-connected devices, such as smart meters, smart locks, and smart speakers to offer a range of services to improve quality of life. Virtual assistants (VAs)—often termed ‘smart speakers’—such as Amazon’s Alexa, Microsoft’s Cortana, and Apple’s Siri, simply described, are software applications that can interpret human speech as a question or instruction, perform tasks and respond using synthesised voices. These applications can run on personal computers, smartphones, tablets, and their dedicated hardware [ 1 ]. The user can interact with the VA in a natural and conversational manner: “Cortana, what is the weather forecast for Manchester tomorrow?”, “Alexa, set a reminder for the dentist”. The process requires no keyboards, microphones, or touchscreens [ 1 ]. This friction-free mode of operation is certainly gaining traction with users. In December 2017 there were 37 million smart speakers installed in the US alone; 12 months later this figure had risen to 66 million [ 2 ].
VAs and the companies behind them are not without their bad publicity. In 2018 the Guardian reported that an Alexa user from Portland, Oregon, asked Amazon to investigate when her device recorded a private conversation between her and her husband on the subject of hardwood floors and sent the audio to a contact in her address book—all without her knowing [ 3 ]. In 2019, the Daily Telegraph reported that Amazon employees were listening to Alexa users’ audio—including that which was recorded accidentally—at a rate of up to 1000 recordings per day [ 4 ]. As well as concerns about snooping by the VA, there are several privacy and security concerns around the information that VA companies store on their servers. The software application on the VA device is only a client—the bulk of the assistant’s work is done on a remote server, and every transaction and recording is kept by the VA company [ 5 ]. VAs have little in the way of voice authentication; they will respond to any voice that utters the wake word, meaning that one user could quite easily interrogate another’s VA to mine the stored personal information [ 1 ]. Additionally, Internet of Things (IoT) malware is becoming more common and more sophisticated [ 6 ]. There have been no reports yet of malware specifically targeting VAs ‘in the wild’ but it is surely a matter of time. A systematic review of research literature written on the security and privacy challenges of VAs and a critical analysis of these studies would give an insight into the current state of the art, and provide an understanding of any future directions new research might take.
1.1. Background
The most popular VAs on the market are Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, and Google’s Assistant [ 1 ]; these assistants, often found in portable devices such as smartphones or tablets, can each be considered a ‘speech-based natural user interface’ (NUI) [ 7 ]: a system that can be operated by a user via intuitive, natural behaviour, i.e., voice instructions. Detailed, accurate information about the exact system and software architecture of commercial VAs is hard to come by. Given the sales numbers involved, VA providers are perhaps keen to protect their intellectual property. Figure 1 shows a high-level overview of the system architecture of Amazon’s Alexa VA.
Architecture of a voice assistant (Alexa) (https://www.faststreamtech.com/blog/amazon-alexa-integrated-with-iot-ecosystem-service/, accessed on 10 February 2021) [ 8 ].
An example request might follow these steps:
- The VA client—the ‘Echo Device’ in the diagram—is always listening for a spoken ‘wake word’; only when this is heard does any recording take place.
- The recording of the user’s request is sent to Amazon’s service platform where the speech is turned into text by speech recognition, and natural language processing is used to translate that text into machine-readable instructions.
- The recording and its text translation are sent to cloud storage, where they are kept.
- The service platform generates a voice recording response which is played to the user via a loudspeaker in the VA client. The request might activate a ‘skill’—a software extension—to play music via streaming service Spotify, for example.
- Further skills offer integration with IoT devices around the home; these can be controlled by messages sent from the service platform, via the Cloud.
- A companion smartphone app can see responses sent by the service platform; some smartphones can also act as a fully featured client.
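The request flow above can be sketched as a minimal pipeline. This is purely an illustration of the described steps: every function name and the canned intent mapping are hypothetical placeholders, not Amazon's actual API, and the cloud services are simulated with plain Python functions.

```python
# Sketch of the VA request flow described above. All names and the
# canned intent table are invented for illustration; a real VA performs
# these steps with cloud-hosted speech recognition and NLP services.

WAKE_WORD = "alexa"

def detect_wake_word(utterance: str) -> bool:
    # Step 1: the client only starts recording after the wake word.
    return utterance.lower().startswith(WAKE_WORD)

def speech_to_text(utterance: str) -> str:
    # Step 2a: speech recognition (a no-op here, as we simulate audio
    # with text already).
    return utterance

def parse_intent(text: str) -> str:
    # Step 2b: NLP translates the text into a machine-readable instruction.
    if "weather" in text.lower():
        return "GetWeather"
    if "reminder" in text.lower():
        return "SetReminder"
    return "Unknown"

CLOUD_STORAGE = []  # Step 3: the recording and its transcript are retained.

def handle_request(utterance: str) -> str:
    if not detect_wake_word(utterance):
        return ""  # nothing recorded, nothing sent
    text = speech_to_text(utterance)
    CLOUD_STORAGE.append((utterance, text))
    # Step 4: the service platform generates a spoken response.
    return f"Handling intent: {parse_intent(text)}"

print(handle_request("Alexa, what is the weather forecast?"))
# → Handling intent: GetWeather
```

Note that, as in the real architecture, the privacy-relevant step is step 3: everything past the wake word is persisted, which is the behaviour the security and privacy literature discussed later scrutinises.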
As with any distributed computing system, there are several technologies used. The endpoint of the system with which the user interacts, shown here as the Echo device, commonly takes the form of a dedicated smart speaker: a computer driven by a powerful 32-bit ARM Cortex CPU. In addition, these speakers support WiFi and Bluetooth, and have internal memory and storage [ 9 ].
The speech recognition, natural language processing (NLP), and storage of interactions are based in the Cloud. Amazon’s speech recognition and NLP service, known collectively as Amazon Voice Services (AVS), is hosted on their platform-as-a-service provider, Amazon Web Services (AWS). As well as AVS, AWS also hosts the cloud storage in which data records of voice interactions, along with their audio, are kept [ 10 ]. Data are transferred between the user endpoint and AVS using JavaScript Object Notation (JSON)-encoded messages via, in Amazon’s case, an unofficial public REST API hosted at http://pitangui.amazon.com (accessed on 22 February 2021) [ 11 ].
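A JSON-encoded exchange of this kind might look like the following sketch. The field names are invented for illustration only; Amazon's actual message schema is, as noted above, not officially documented.

```python
import json

# Hypothetical shape of a JSON-encoded request/response pair between a
# VA client and the cloud speech service; all field names are invented.
request_msg = {
    "deviceId": "echo-1234",
    "audio": "<base64-encoded recording>",
    "timestamp": "2021-02-22T10:15:00Z",
}
response_msg = {
    "transcript": "what is the weather forecast",
    "intent": "GetWeather",
    "speechUrl": "<url of synthesised reply audio>",
}

# Serialisation to the wire format and back is lossless:
wire = json.dumps(request_msg)
assert json.loads(wire) == request_msg
```

The point of the sketch is simply that both the audio and its text interpretation travel as structured data, which is what makes the retained records discussed in the privacy literature so easy to store and query.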
1.2. Prior Research and Contribution
There is a very limited number of systematic literature reviews (SLRs) written on the subject of VAs. To the best of our knowledge, none appears to specifically address the security and privacy challenges associated with VAs. The nearest that could be found was an SLR written by de Barcelos Silva et al. [ 12 ], in which a review of all literature pertinent to VAs is studied, and a relatively broad set of questions is posited and answered. Topics include a review of the state of the art, VA usage and architectures, and a taxonomy of VA classification. From the perspective of VA users who are motor or visually impaired, Siebra et al. [ 8 ] provided a literature review in 2018 that analysed VAs as a resource of accessibility for mobile devices. The authors identified and analysed proposals for VAs that better enable smartphone interaction for blind, motor-impaired, dyslexic, and other users who might need assistance. The end goal of their research was to develop a VA with suitable functions to aid these users. The study concluded that the current state of the art did not provide such research and outlined a preliminary protocol as a springboard for future work.
The main aim of this paper is to answer a specific question: “Are there privacy, security, or usage challenges with virtual assistants?” through a systematic literature review. A methodology was established for selecting studies made on the broader subject of VAs, and categorising them into more specific subgroups, i.e., subject audience, security or privacy challenges, and research theme (including user behaviour, applications, exploits, snooping, authentication, and forensics). In total, 20 papers were selected as primary studies to answer the research questions posited in the following section.
1.3. Research Goals
The purpose of this research was to take suitable existing studies, analyse their findings, and summarise the research undertaken into the security and privacy bearings of popular virtual assistants. Considering the lack of existing literature reviews on this subject, we aimed, in this paper, to fill the gap in the current research by linking together those studies which have addressed the privacy and security aspects of VAs in isolation, whether they have been written with users or developers in mind. To that end, the research questions listed in Table 1 have been considered.
Research questions.
The rest of this paper is organised as follows: the research methodology used to select the studies is outlined in Section 2 , whereas Section 3 discusses the findings for the selection of studies, and categorises those papers. In Section 4 , the research questions are answered, followed by a discussion on the future research directions in Section 5 . Section 6 concludes the paper.
2. Research Methodology
In order to answer the research questions in Table 1 , the following stages were undertaken.
2.1. Selection of Primary Studies
A search for a set of primary studies was undertaken by searching the websites of particular publishers and using the Google Scholar search engine. The set of keywords used was designed to elicit results pertaining to security and privacy topics associated with popular digital assistants, such as Apple’s Siri, Google’s Assistant, and Amazon’s Alexa. To ensure that no papers were missed that might otherwise have been of interest, the search term was widened to use three further common terms for a virtual assistant. Boolean operators were limited to AND and OR. The searches were limited to the keywords, abstracts, and titles of the documents. The search term used was:
(“digital assistant” OR “virtual assistant” OR “virtual personal assistant” OR “siri” OR “google assistant” OR “alexa”) AND (“privacy” OR “security”)
Alongside Google Scholar, the following databases were searched:
- IEEE Xplore Library
- ScienceDirect
- ACM Digital Library
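The search term can also be assembled programmatically, which keeps the two keyword sets in one place when querying several databases. This snippet is our own illustration rather than part of the study's tooling:

```python
# Build the boolean search term used for study selection.
va_terms = ["digital assistant", "virtual assistant",
            "virtual personal assistant", "siri",
            "google assistant", "alexa"]
topic_terms = ["privacy", "security"]

def or_group(terms):
    # Quote each keyword and join with OR inside parentheses.
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = or_group(va_terms) + " AND " + or_group(topic_terms)
print(query)
# → ("digital assistant" OR "virtual assistant" OR "virtual personal assistant" OR "siri" OR "google assistant" OR "alexa") AND ("privacy" OR "security")
```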
2.2. Inclusion and Exclusion Criteria
For a study to be included in this SLR, it must present empirical findings; these could be technical research on security or more qualitative work on privacy. The study could apply to end-users, application developers, or the emerging work on VA forensics. The outcome of the study must contain data relating to tangible, technical privacy, and/or security aspects of VAs. General legal and ethical studies, although interesting, were excluded. For a paper to be selected, it had to be fully peer-reviewed research; therefore, results that were taken from blogs, industry magazines, or individual studies were excluded. Table 2 outlines the exact criteria chosen.
Inclusion and exclusion criteria for study selection.
2.3. Results Selection
Using the initial search criteria, 381 studies were singled out. These are broken down as follows:
- IEEE Xplore: 27
- ScienceDirect: 43
- ACM Digital Library: 117
- Google Scholar: 194
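As a quick consistency check, the per-database counts sum to the stated total of 381:

```python
# Per-database hit counts reported above.
hits = {"IEEE Xplore": 27, "ScienceDirect": 43,
        "ACM Digital Library": 117, "Google Scholar": 194}
total = sum(hits.values())
print(total)  # → 381, matching the number of studies initially singled out
```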
The inclusion and exclusion criteria ( Table 2 ) were applied, and a checklist was assembled to assess the quality of each study:
- Does the study clearly show the purpose of the research?
- Does the study adequately describe the background of the research and place it in context?
- Does the study present a research methodology?
- Does the study show results?
- Does the study describe a conclusion, placing the results in context?
- Does the study recommend improvements or further works?
EX2 (grey literature) removed 310 results, the bulk of the initial hits. Only one foreign-language paper was found amongst the results, which was also excluded. Throughout this process, eight duplicates were also found and excluded. The remaining 63 results were then read in full. A table was created in Excel and exclusion criterion EX1 (off-topic studies) was applied; following this, all three inclusion criteria were applied. Finally, 20 primary studies remained. Figure 2 shows how many studies remained after each stage of the process.
Attrition of papers at different processing stages.
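The attrition process above is, in essence, a sequence of filters. A minimal sketch follows, with hypothetical record fields standing in for the manually assessed criteria (the real screening was performed by hand in Excel):

```python
# Sketch of the study-selection pipeline. Field names are hypothetical;
# the real screening was performed manually.
def select_studies(records):
    # EX2: drop grey literature (blogs, magazines, non-peer-reviewed work)
    records = [r for r in records if r["peer_reviewed"]]
    # Drop foreign-language papers
    records = [r for r in records if r["language"] == "en"]
    # Drop duplicates by title
    seen, unique = set(), []
    for r in records:
        if r["title"] not in seen:
            seen.add(r["title"])
            unique.append(r)
    # EX1 (off-topic) plus the three inclusion criteria
    return [r for r in unique
            if r["on_topic"] and r["empirical"] and r["va_focus"]]
```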
2.4. Publications over Time
If we consider the first popular VA to be Apple’s Siri [ 13 ]—first made available with the release of the company’s iPhone model 4S in 2011—it is interesting to see that the remaining primary studies which reported concrete data only dated back to 2017, four years before this review. The potential reasons for this will be discussed in Section 4 . Figure 3 shows the number of publications by year.
Number of primary studies against time.
3. Findings
From the initial searches, a large number of studies were found, perhaps surprisingly, given that VA technology is relatively young. It is only ten years since the introduction of the first popular VA, Apple’s Siri [ 13 ]. However, the attrition process described in Figure 2 reduced this number to 20.
Instead of a single set of broad topics into which each of these studies could be categorised, we decided to approach each paper on three different levels, in line with the research questions posed in Section 1.3 . The papers were divided into three categories: Subject Audience, Security and Privacy, and Research Theme. Figure 4 shows a visual representation of the breakdown of the individual categories.
Visual representation of study classifications.
3.1. Category 1: Subject Audience
The first categorisation is based on whether the work of the study is focussed on end-users, developers, or both.
End-users and developers are defined as follows:
- End-user—a person who uses the VA in everyday life. This person may not have the technical knowledge and may be thought of as a ‘customer’ of the company whose VA they have adopted.
- Developer—one who writes software extensions, known as ‘skills’ (Amazon) and ‘apps’ (Google). These extensions are made available to the end-user via online marketplaces.
3.2. Category 2: Security or Privacy?
As this review covers both security (safeguarding data) and privacy (safeguarding user identity), each study was categorised as addressing one or the other. Only three papers covered both security and privacy [ 14 , 15 , 16 ].
3.3. Category 3: Research Theme
The third categorisation considers the research themes addressed in each paper as follows:
- Behaviour—the reviewed study looks at how users perceive selected aspects of VAs, and factors influencing the adoption of VAs. All except one of the behavioural studies were carried out on a control group of users [ 11 ].
- Apps—the paper focuses on the development of software extensions and associated security implications.
- Exploit—the reviewed paper looks at malicious security attacks (hacking, malware) where a VA is the target of the threat actor.
- Snooping—the study is concerned with unauthorised listening, where the uninvited listening is being carried out by the device itself, as opposed to ‘Exploit’, where said listening is performed by a malicious threat actor.
- Authentication—the study looks at ways in which a user might authenticate to the device to ensure the VA knows whom it is interacting with.
- Forensics—the study looks at ways in which digital forensic artefacts can be retrieved from the device and its associated cloud services, for the purposes of a criminal investigation.
A taxonomy tree showing these categories and how they relate to the studies to which they apply is shown in Figure 5 .
A taxonomy tree showing categories used to classify different reviewed papers.
It is worth noting that studies focusing on the theme of exploits—malware and hacking—were categorised as such if the VA was the target of the threat actor. Further classifying these studies’ audiences as end-users or developers also considers the nature of the exploit; both developers and end-users can be at risk from these attacks. When a malicious attack exploits a VA’s existing functionality, the study is categorised as ‘end-user’; it is the user who is affected by the exploit. Where the exploit requires new software to be written—for example, the creation of a malicious ‘Skill’—the study is categorised as both ‘developer’ and ‘end-user’ [ 10 , 17 , 18 ]. There was one study [ 19 ] that examined an exploit that required software to be written that exploited a vulnerability in other third-party software. Although the exploit may ultimately have affected the end-user, the focus there was on software development and so the paper was categorised as ‘developer’.
In terms of the subject audience, end-users were overwhelmingly the focus, at 79% of papers; a further 11% included end-users alongside developers as the main focus, and 10% of papers focussed only on developers. There was a fairly even split between security and privacy as the main thrust of a study: security was the subject of slightly more, at 47%, versus 42% for privacy, and only 11% of papers combined the study of both. In the research theme category, exploits were the focus of the majority of the studies, with behaviour and authentication joint second. The remaining themes of snooping, apps, and forensics were split equally, with one study dedicated to each. The primary studies are listed in Table 3 , along with their categorisations.
Key data reported by primary studies.
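The security/privacy split quoted above can be reproduced from raw counts. The counts below are an assumption: a 9/8/2 allocation over 19 categorised papers is one split consistent with the reported 47%/42%/11% figures.

```python
# Assumed raw counts behind the reported security/privacy split
# (a hypothetical allocation, not data taken from Table 3).
counts = {"security": 9, "privacy": 8, "both": 2}
total = sum(counts.values())
shares = {k: round(100 * v / total) for k, v in counts.items()}
print(shares)  # -> {'security': 47, 'privacy': 42, 'both': 11}
```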
4. Discussion
A recurring theme throughout this review has been the relative immaturity of VA technology and the short timeframe in which it has become widely adopted. There is, nevertheless, an interesting spread of subjects amongst the primary studies. Also notable was the choice of VA used as the subject of the research: of the papers that focused on a single VA, Amazon's Alexa was the most popular subject.
In order to answer the research questions, each paper was read and the results were analysed. Each question is restated below, with a summary of key findings and a more in-depth précis of the studies to add context.
4.1. RQ 1: What Are the Emerging Security and Privacy Concerns Surrounding the Use of VAs?
4.1.1. Key Findings
While reviewing the papers, the following main findings were deduced:
- Successful malicious attacks have been demonstrated with VAs as the target [ 15 , 18 , 19 , 20 , 24 ]. These attacks are becoming more sophisticated; some use remote vectors, and collectively they explore a variety of approaches rather than a single vector.
- Personally identifiable information can be extracted from an unsecured VA with ease.
- The GDPR appears to be of limited help in safeguarding users in its current form.
4.1.2. Discussion
From malicious attacks designed to impersonate a genuine skill to attacks designed to bypass VA device authentication, trends have emerged in both the security of VAs and the privacy of their users. Any attack that allows a malicious actor to impersonate the legitimate user risks that user's data falling into the wrong hands; attacks with a remote vector are of particular concern due to the comparative ease with which they could be launched without arousing the user's suspicion. The cloud service platforms which power VAs store a great deal of data and, should those data fall into the wrong hands, a serious privacy risk is exposed. The fact that two of the bigger VA vendors, Amazon and Google, run skill stores which have allowed the uploading of malicious applications deliberately designed to access a user's data means that the user cannot rely on a downloaded skill being safe, which is a serious security concern.
The dolphin attack, as demonstrated by Zhang et al. [ 24 ], shows how Alexa can be manipulated by voice commands that are modulated to frequencies beyond the upper range of human hearing—an attack that requires planning, sophisticated equipment, and physical proximity to the VA device and therefore realistically poses a limited threat to the user. Turner et al. [ 18 ] showed that phoneme morphing could use audio of a source voice and transform it into an audio utterance that could unlock a device that used voice authentication. The original recording need not be that of the device user, which presents a security risk, but one that still relies on physical access to the VA device.
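The modulation principle behind the dolphin attack can be illustrated in a few lines. This is a simplified sketch with invented parameters (carrier frequency, tone, sample rate), not the setup used by Zhang et al.: an inaudible ultrasound carrier conveys an amplitude-modulated command, and a non-linear (square-law) microphone response recovers it at baseband.

```python
import numpy as np

fs = 96_000                       # sample rate high enough for ultrasound
t = np.arange(0, 0.01, 1 / fs)    # 10 ms of signal
command = np.sin(2 * np.pi * 300 * t)      # stand-in for a voice command
carrier = np.sin(2 * np.pi * 25_000 * t)   # carrier above human hearing
modulated = (1 + 0.5 * command) * carrier  # classic amplitude modulation

# A square-law (non-linear) microphone response demodulates the signal:
demodulated = modulated ** 2
spectrum = np.abs(np.fft.rfft(demodulated))
freqs = np.fft.rfftfreq(len(demodulated), 1 / fs)
# The 300 Hz command tone reappears at baseband after the non-linearity.
```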
A man-in-the-middle attack called Lyexa was demonstrated by Mitev et al. in [ 19 ], in which a remote attacker uses a compromised IoT device in the user's home, capable of emitting ultrasound signals, to 'talk' to the user's VA. Developing the idea behind the dolphin attack [ 24 ], a malicious Alexa skill was used in tandem, both to provide plausible feedback to the user to avoid arousing suspicion and to make the attack remote, thus increasing its threat potential. Kumar et al. [ 15 ] demonstrated a skill attack predicated on Alexa misinterpreting speech. In testing, Alexa correctly interpreted 68.9% of 572,319 words; 24 of these words were misinterpreted consistently and, when used by a malicious skill, could be used to confuse genuine skills, thus providing a reliable, repeatable remote attack vector. In [ 27 ], Kennedy et al. demonstrated a particularly advanced exploit that uses machine learning to derive patterns, or 'fingerprints', from the encrypted traffic between the VA and the server. Certain voice commands could be inferred from the encrypted traffic alone. This is a remote attack and consequently poses a serious security concern.
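The traffic-fingerprinting idea studied by Kennedy et al. can be caricatured as nearest-centroid matching on packet-size features. Everything here (feature choice, data, function names) is an invented toy, not the authors' method:

```python
import numpy as np

# Represent each (encrypted) command's traffic as a sequence of packet
# sizes, summarise it as simple features, and match new traces to the
# nearest known command centroid.
def features(packet_sizes):
    a = np.asarray(packet_sizes, dtype=float)
    return np.array([a.sum(), a.mean(), a.size])

def fit_centroids(labelled_traces):
    return {label: np.mean([features(t) for t in traces], axis=0)
            for label, traces in labelled_traces.items()}

def classify(trace, centroids):
    f = features(trace)
    return min(centroids, key=lambda lbl: np.linalg.norm(f - centroids[lbl]))
```

Even this crude feature set separates commands whose traffic volumes differ, which is the core of why encrypted traffic still leaks information.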
In conclusion, it was found that the VA is becoming the target of malicious attacks, just as other connected computing devices have been in the past. These attacks show an interesting pattern: they are evolving. For a malicious attack to be effective and dangerous to the end user, it must be simple enough to be carried out by someone who has not made an extensive study of the VA's internal architecture. An attack is also made more dangerous if it does not require proximity to the device. Finally, an attack must be repeatable; if it works only once, for example under laboratory conditions, it poses little threat to the end user. A ready-coded malicious skill could be exploited remotely by a threat actor with limited knowledge of computer science, and it surely cannot be long before such attacks become more commonplace.
Furey et al. [ 22 ] first studied how much personally identifiable information could be extracted from an Alexa device that had no authentication set. The authors then examined this in the context of the GDPR, and how much leeway Amazon might have to offload its compliance responsibilities through carefully written user terms and conditions. Loideain et al. investigated how the female gendering of VAs might pose societal harm “insofar as they reproduce normative assumptions about the role of women as submissive and secondary to men” [ 26 ]. In both cases, the GDPR as it currently stands was found to be only partially successful in protecting VA users. The GDPR, designed expressly to protect end users and their data, has thus been shown by two studies in this group to be of limited utility. A study of the GDPR itself, or an analysis of the psychological repercussions of VA voice gendering, is beyond the scope of this document. However, any flaws in the GDPR are of particular concern, given the amount of data collected by VAs and the growing interest in exploiting vulnerabilities in VAs and their extensions to obtain these data by nefarious means.
4.2. RQ2: To What Degree Do Users’ Concerns Surrounding the Privacy and Security Aspects of VAs Affect Their Choice of VA and Their Behaviour around the Device?
4.2.1. Key Findings
The review of the selected papers led to the following main findings:
- Rationalising of security and privacy concerns is more prevalent among those who choose to use a VA; those who do not use one cite privacy and trust issues as factors affecting their decision.
- Conversely, amongst those who do choose to use a VA, privacy is the main factor in the acceptance of a particular model.
- ‘Unwanted’ recordings—those made by the VA without the user uttering the wake word—occur in significant numbers.
- Children see no difference between a connected toy and a VA designed for adult use.
4.2.2. Discussion
Lau et al. [ 17 ] found that concerns differ between people who do and do not use a VA. Those who do not use an assistant, seeing no purpose in such a device, are more likely to be those for whom privacy and trust are an issue. These users were “…deeply uncomfortable with the idea of a ‘microphone-based’ device that a speaker company, or an ‘other’ with malicious intent, could ostensibly use to listen in on their homes”. Amongst those who do adopt a VA, users rationalised their lack of concern regarding privacy with the belief that the VA company could be trusted with their data, or that there was no way another user could see their history. Burbach et al. considered the acceptance factors of different VAs amongst a control group of users; a choice-based conjoint analysis was used, with three attributes: natural language processing (NLP) performance, price, and privacy. Privacy was found to be the biggest concern of the three [ 14 ]. These findings appear to conflict with those presented by Lau et al. [ 21 ]; however, the surveys were constructed differently, with privacy an explicit attribute of the later study. Moreover, Burbach et al. [ 11 ] wrote their study a year later, a year in which several news stories broke in the media regarding the privacy of VAs, which may account for the apparent increase in concern over privacy.
Javed et al. [ 21 ] performed an in-depth study of what Alexa was recording. Although Amazon claims that ‘she’ only listens when the wake-word is uttered by the user, their research found that among the control group of users, 91% had experienced an unwanted recording. On investigation, it was found that benign sounds such as radio, TV, and background noise were recorded in the majority of these cases. Alarmingly, however, 29.2% of the study group reported that some of their unwanted recordings contained sensitive information, which constitutes a privacy breach. McReynolds et al. studied connected toys (Hello Barbie, Jibo) in conjunction with VAs to determine, amongst other questions, whether children relate to ‘traditional’ smart assistants in the same way they do their toys [ 29 ]. A key finding, from surveys of parents and their children, was that children interacted with VAs in the same way they might interact with a connected toy. VAs, however, are not designed for children and, at least in the US, are not examined for regulatory compliance in the same way connected toys are.
Although there has been an increase in user privacy concerns, there is still a group of users who have faith that the data companies are trustworthy; interestingly, a group of those users for whom privacy is a concern are still using a VA. The fact that privacy is a worry is evidently not sufficient to dissuade the user from having a VA in the house. It might be interesting to see if studies made over the coming years show the trend of privacy awareness continuing, especially in the light of the simple fact that users find VAs recording without their knowledge. Children relate to VAs as they would a toy with similar capabilities and, again, it would be of interest to see if this fact increased privacy concerns amongst parents who use an ‘adult’ VA.
4.3. RQ3: What Are the Security and Privacy Concerns Affecting First-Party and Third-Party Application Development for VA Software?
4.3.1. Key Findings
The study of the selected papers led us to deduce the following main findings:
- The processes that check third-party extensions submitted to the app stores of both Amazon and Google do a demonstrably poor job of ensuring that the apps properly authenticate from the third-party server to the Alexa/Google cloud.
- Several novel methods of user authentication to the VA device have been proposed, each using a different secondary device to offer a form of two-factor authentication [ 16 , 23 , 31 ].
- Each of the proposed user authentication methods goes some way towards mitigating the voice/replay attacks outlined in the findings of RQ1.
4.3.2. Discussion
Zhang et al. [ 14 ] presented the only study to examine the security vetting processes used by the VA manufacturers; these procedures are intended to ensure that developers of third-party VA extensions (‘skills’, ‘apps’) implement proper security in their code. As their research demonstrates, vulnerable extensions, voice-squatting attacks written by the authors specifically to target a genuine skill, have been approved by both Amazon and Google. Combined with the findings for RQ1, in which several VA attacks relying on malicious extensions were identified, this represents a significant security risk. The authors went so far as to inform both Amazon and Google of their findings and subsequently met with both companies to help them better understand these novel security risks.
Moving away from extension application development, three novel approaches have been proposed that suggest better ways in which VA companies might improve security for end-users. Feng et al. [ 23 ] presented ‘VAuth’, a method of ‘continuous’ authentication in which a wearable device collects unique body-surface vibrations from the user and matches them with the voice signal heard by the VA. Wang et al. [ 31 ] proposed another wearable, ‘WearID’, that might provide two-factor authentication; here, the wearable captures unique vibration patterns not from the user’s body but from the vibration domain of the user’s voice. These are then used in tandem with existing device authentication.
Cheng et al. [ 16 ] suggested ‘acoustic tagging’, whereby a secondary device emits a unique acoustic signal, or ‘watermark’, which is heard in tandem with the user’s voice. The VA, registered to the user, may then accept or reject voice instructions accordingly. All three of these authentication methods go some way towards mitigating malicious attacks such as the dolphin attack demonstrated by Zhang et al. [ 24 ]. They also provide an extra layer of security for users concerned about privacy by making it much harder for another person to access a VA without permission. However, each amounts to a form of two-factor authentication, since every one of the studies proposes a method that requires extra hardware. Two of the studies [ 23 , 31 ] involve wearables, which might not always be practical for multiple users and which add expense and complication for the user.
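The acoustic-tagging concept lends itself to a short sketch: a paired device mixes a known pseudo-random 'watermark' into the audio, and the VA accepts a command only if correlation with the registered tag exceeds a threshold. All parameters below are illustrative assumptions, not taken from Cheng et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
tag = rng.standard_normal(256)        # registered watermark sequence

def contains_tag(audio, tag, threshold=0.5):
    # Slide the tag across the audio and normalise the correlation
    # peak by the tag's energy before thresholding.
    corr = np.correlate(audio, tag, mode="valid")
    return corr.max() / (tag @ tag) > threshold

speech = rng.standard_normal(2048)    # stand-in for captured voice audio
tagged = speech.copy()
tagged[500:756] += tag                # watermark mixed into the audio
```

Untagged audio correlates only weakly with the registered sequence, so an injected command lacking the watermark would be rejected.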
To conclude, there are worrying security considerations around VAs. Methods of two-factor authentication using an external device, although sophisticated, are cumbersome for users. Interestingly, at the time of our study there were no works on authenticating a user based entirely on their voice fingerprint. Given the lack of vetting in the major vendors’ application stores, which is itself a vulnerability open to exploitation, securing the VA is essential.
5. Open Research Challenges and Future Directions
According to the results of this study, it can be seen that VAs, like any other computing device, are vulnerable to malicious attacks. A number of vulnerabilities have been studied, and several attacks have been crafted that take advantage of flaws in the design of the VA itself and its software extensions. It has also been shown that VAs can mishear their wake words and make recordings without the user’s knowledge and, even when the user is aware, the VA vendor is recording and storing a large amount of personal information. Therefore, the security and privacy of VAs are still challenging and require further investigation. Three main future research directions are identified and discussed in the following sections.
5.1. GDPR and the Extent of Its Protections
Although an increase in users’ privacy awareness can be seen, among significant numbers of users there is still an alarming—almost blind—reliance on vendors such as Amazon and Google to ‘do the right thing’ and treat the user’s data responsibly and fairly in accordance with GDPR or other local data regulations. Future work might examine whether or not the vendors are fully complying with data law or whether they are adhering to it as little as possible in order to make their businesses more profitable. The work might also study whether or not regulations, such as GDPR, are offering as much protection to the end-user as they should and, if not, where they are failing and need improvement.
5.2. Forensics
Although studies on the forensic aspects of VAs have to date concentrated on finding as much information as possible both from the device and the cloud service platform, little work appears to have been carried out on examining exactly what is stored. Future work could look at how VAs interact with their cloud service providers, and how open the interfaces between the device and server are. Furthermore, it is not clear how much the user is (or can be) aware of what is being stored. This presents an interesting imbalance; while it is possible for the user to see certain data that are stored, the vendors’ ‘privacy dashboards’ through which this information can be gleaned are not telling the whole story. Future work might study this imbalance and find ways in which the user might become more aware of the extent of the data that are being taken from them, stored, and manipulated for the vendors’ profit.
5.3. Voice Authentication without External Device
As discussed in this paper, VA user authentication is a concern, as with any other service that collects user data. A VA collects substantial amounts of personal data, as demonstrated in the forensics-focussed works studied in this paper. Several novel methods for authenticating a user to their device were presented in the primary studies. However, each used an external device to provide a form of two-factor authentication, which makes the resultant solution cumbersome and complicated for the user. An interesting future research direction could address this challenge by focusing on biometric voice analysis as a means of authenticating the user, rather than relying on an external device.
6. Conclusions
In this paper, based on a systematic literature review on the security and privacy challenges of virtual assistants, several gaps in the current research landscape were identified. Research has been carried out on the themes of user concerns, the threat of malicious attack, and improving authentication. However, these studies do not take an overarching view of how these themes may interact, leading to a potential disconnect between these areas. A number of studies concentrated on user behaviour, identifying privacy and security concerns; however, they did not mention how these concerns might be addressed, except [ 33 ], in which a few suggestions were provided for privacy and security design, including improvements to muting, privacy default settings, and audio log features, as well as adding security layers to voice recognition and providing offline capabilities. In addition, it was found that when one particular VA was the focus of the study, Amazon’s Alexa was the assistant that was chosen in the majority of these papers. Given Amazon’s sales dominance in the smart speaker sector, this is perhaps understandable. There are, however, many more VA systems that might be going uninvestigated as a consequence.
The results from answering research question 1 in this study showed that increasingly sophisticated malicious attacks on VAs are being demonstrated, and yet user awareness of this specific and worrying trend appears not to have been studied in any great detail. The three research questions posited were answered as follows. (1) There were several emerging security and privacy concerns, (2) security and privacy concerns do affect users’ adoption of VAs and adoption of a particular model of VA, and (3) there are worrying concerns and security lapses in the way third party software is vetted by manufacturers. It would be interesting to investigate further how these areas converge, as the current research, although it is of great use in its own subject area, can have a narrow focus. It would be fascinating if knock-on effects to other areas could be further researched by broadening the focus areas investigated.
Acknowledgments
The authors are grateful to the Deanship of Scientific Research, King Saud University for funding through Vice Deanship of Scientific Research Chairs, and grant of PIFI 2020 (2020VBC0002), China.
Author Contributions
T.B.; investigation, writing—original draft preparation, T.D. and S.B.; writing—review and supervision, M.S.A.-R. and A.H.S.; writing—editing. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Short Research on Voice Control System Based on Artificial Intelligence Assistant
AI-assisted writing is quietly booming in academic journals. Here’s why that’s OK
Lecturer in Bioethics, Monash University & Honorary fellow, Melbourne Law School, Monash University
Disclosure statement
Julian Koplin does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
If you search Google Scholar for the phrase “ as an AI language model ”, you’ll find plenty of AI research literature and also some rather suspicious results. For example, one paper on agricultural technology says:
As an AI language model, I don’t have direct access to current research articles or studies. However, I can provide you with an overview of some recent trends and advancements …
Obvious gaffes like this aren’t the only signs that researchers are increasingly turning to generative AI tools when writing up their research. A recent study examined the frequency of certain words in academic writing (such as “commendable”, “meticulously” and “intricate”), and found they became far more common after the launch of ChatGPT – so much so that 1% of all journal articles published in 2023 may have contained AI-generated text.
(Why do AI models overuse these words? There is speculation it’s because they are more common in English as spoken in Nigeria, where key elements of model training often occur.)
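The marker-word analysis described above can be sketched as a simple per-10,000-words frequency count. The marker list comes from the article; the function and any corpus fed to it are invented for illustration:

```python
from collections import Counter
import re

# Words the cited study found to spike after ChatGPT's launch.
MARKERS = {"commendable", "meticulously", "intricate"}

def marker_rate(texts):
    """Occurrences of marker words per 10,000 words across a corpus."""
    words = [w for t in texts for w in re.findall(r"[a-z]+", t.lower())]
    hits = sum(Counter(words)[m] for m in MARKERS)
    return 10_000 * hits / max(len(words), 1)
```

Comparing the rate for pre-2023 abstracts against post-launch abstracts is the essence of the comparison the study reports.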
The aforementioned study also looks at preliminary data from 2024, which indicates that AI writing assistance is only becoming more common. Is this a crisis for modern scholarship, or a boon for academic productivity?
Who should take credit for AI writing?
Many people are worried by the use of AI in academic papers. Indeed, the practice has been described as “ contaminating ” scholarly literature.
Some argue that using AI output amounts to plagiarism. If your ideas are copy-pasted from ChatGPT, it is questionable whether you really deserve credit for them.
But there are important differences between “plagiarising” text authored by humans and text authored by AI. Those who plagiarise humans’ work receive credit for ideas that ought to have gone to the original author.
By contrast, it is debatable whether AI systems like ChatGPT can have ideas, let alone deserve credit for them. An AI tool is more like your phone’s autocomplete function than a human researcher.
The question of bias
Another worry is that AI outputs might be biased in ways that could seep into the scholarly record. Infamously, older language models tended to portray people who are female, black and/or gay in distinctly unflattering ways, compared with people who are male, white and/or straight.
This kind of bias is less pronounced in the current version of ChatGPT.
However, other studies have found a different kind of bias in ChatGPT and other large language models : a tendency to reflect a left-liberal political ideology.
Any such bias could subtly distort scholarly writing produced using these tools.
The hallucination problem
The most serious worry relates to a well-known limitation of generative AI systems: that they often make serious mistakes.
For example, when I asked ChatGPT-4 to generate an ASCII image of a mushroom, it provided me with the following output.
It then confidently told me I could use this image of a “mushroom” for my own purposes.
These kinds of overconfident mistakes have been referred to as “ AI hallucinations ” and “ AI bullshit ”. While it is easy to spot that the above ASCII image looks nothing like a mushroom (and quite a bit like a snail), it may be much harder to identify any mistakes ChatGPT makes when surveying scientific literature or describing the state of a philosophical debate.
Unlike (most) humans, AI systems are fundamentally unconcerned with the truth of what they say. If used carelessly, their hallucinations could corrupt the scholarly record.
Should AI-produced text be banned?
One response to the rise of text generators has been to ban them outright. For example, Science – one of the world’s most influential academic journals – disallows any use of AI-generated text .
I see two problems with this approach.
The first problem is a practical one: current tools for detecting AI-generated text are highly unreliable. This includes the detector created by ChatGPT’s own developers, which was taken offline after it was found to have only a 26% accuracy rate (and a 9% false positive rate ). Humans also make mistakes when assessing whether something was written by AI.
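To make the 26%/9% figures concrete: with invented confusion-matrix counts (treating the reported "accuracy" as a detection rate over AI-written samples), both metrics fall out of simple ratios:

```python
# Hypothetical counts for 100 AI-written and 100 human-written texts;
# these are illustrative, not OpenAI's actual evaluation data.
tp, fn = 26, 74          # AI-written texts: correctly flagged vs missed
fp, tn = 9, 91           # human-written texts: wrongly flagged vs cleared

detection_rate = tp / (tp + fn)        # fraction of AI text caught: 0.26
false_positive_rate = fp / (fp + tn)   # humans wrongly accused: 0.09
```

A detector that misses three quarters of AI text while falsely accusing nearly one in ten human authors is worse than useless as an enforcement tool.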
It is also possible to circumvent AI text detectors. Online communities are actively exploring how to prompt ChatGPT in ways that allow the user to evade detection. Human users can also superficially rewrite AI outputs, effectively scrubbing away the traces of AI (like its overuse of the words “commendable”, “meticulously” and “intricate”).
The second problem is that banning generative AI outright prevents us from realising these technologies’ benefits. Used well, generative AI can boost academic productivity by streamlining the writing process. In this way, it could help further human knowledge. Ideally, we should try to reap these benefits while avoiding the problems.
The problem is poor quality control, not AI
The most serious problem with AI is the risk of introducing unnoticed errors, leading to sloppy scholarship. Instead of banning AI, we should try to ensure that mistaken, implausible or biased claims cannot make it onto the academic record.
After all, humans can also produce writing with serious errors, and mechanisms such as peer review often fail to prevent its publication.
We need to get better at ensuring academic papers are free from serious mistakes, regardless of whether these mistakes are caused by careless use of AI or sloppy human scholarship. Not only is this more achievable than policing AI usage, it will improve the standards of academic research as a whole.
This would be (as ChatGPT might say) a commendable and meticulously intricate solution.
Web publishers brace for carnage as Google adds AI answers
The tech giant is rolling out AI-generated answers that displace links to human-written websites, threatening millions of creators
Kimber Matherne’s thriving food blog draws millions of visitors each month searching for last-minute dinner ideas.
But the mother of three says decisions made at Google, more than 2,000 miles from her home in the Florida panhandle, are threatening her business. About 40 percent of visits to her blog, Easy Family Recipes, come through the search engine, which has for more than two decades served as the clearinghouse of the internet, sending users to hundreds of millions of websites each day.
As the tech giant gears up for Google I/O, its annual developer conference, this week, creators like Matherne are worried about the expanding reach of its new search tool that incorporates artificial intelligence. The product, dubbed “Search Generative Experience,” or SGE, directly answers queries with complex, multi-paragraph replies that push links to other websites further down the page, where they’re less likely to be seen.
The shift stands to shake the very foundations of the web.
The rollout threatens the survival of the millions of creators and publishers who rely on the service for traffic. Some experts argue the addition of AI will boost the tech giant’s already tight grip on the internet, ultimately ushering in a system where information is provided by just a handful of large companies.
“Their goal is to make it as easy as possible for people to find the information they want,” Matherne said. “But if you cut out the people who are the lifeblood of creating that information — that have the real human connection to it — then that’s a disservice to the world.”
Google calls its AI answers “overviews” but they often just paraphrase directly from websites. One search for how to fix a leaky toilet provided an AI answer with several tips, including tightening tank bolts. At the bottom of the answer, Google linked to The Spruce, a home improvement and gardening website owned by web publisher Dotdash Meredith, which also owns Investopedia and Travel + Leisure. Google’s AI tips lifted a phrase from The Spruce’s article word-for-word.
A spokesperson for Dotdash Meredith declined to comment.
The links Google provides are often half-covered, requiring a user to click to expand the box to see them all. It’s unclear which of the claims made by the AI come from which link.
Tech research firm Gartner predicts traffic to the web from search engines will fall 25 percent by 2026. Ross Hudgens, CEO of search engine optimization consultancy Siege Media, said he estimates at least a 10 to 20 percent hit, and more for some publishers. “Some people are going to just get bludgeoned,” he said.
Raptive, which provides digital media, audience and advertising services to about 5,000 websites, including Easy Family Recipes, estimates changes to search could result in about $2 billion in losses to creators — with some websites losing up to two-thirds of their traffic. Raptive arrived at these figures by analyzing thousands of keywords that feed into its network, and conducting a side-by-side comparison of traditional Google search and the pilot version of Google SGE.
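The kind of side-by-side comparison Raptive describes can be illustrated with back-of-the-envelope arithmetic: estimate clicks per keyword at a site's traditional rank, then again with every link pushed below an AI answer. Everything below is a hypothetical sketch; the click-through rates, the assumed two-position push-down, and the keyword volumes are illustrative assumptions, not Raptive's actual model.

```python
# Assumed click-through rates by effective position on a results page.
# These numbers are illustrative, not measured values.
CTR_BY_POSITION = {1: 0.28, 2: 0.15, 3: 0.09, 4: 0.06, 5: 0.04}

def estimated_clicks(keywords, ai_answer_shown):
    """Sum estimated clicks across keywords; an AI answer pushes links down."""
    total = 0.0
    for monthly_searches, rank in keywords:
        # Assume an AI overview displaces organic links by two positions.
        effective_rank = rank + 2 if ai_answer_shown else rank
        total += monthly_searches * CTR_BY_POSITION.get(effective_rank, 0.02)
    return total

# Hypothetical keyword portfolio: (monthly searches, organic rank).
keywords = [(50_000, 1), (20_000, 2), (10_000, 3)]
before = estimated_clicks(keywords, ai_answer_shown=False)
after = estimated_clicks(keywords, ai_answer_shown=True)
print(f"estimated traffic change: {(after - before) / before:.0%}")
# → around -66% under these assumed CTRs
```

Under these made-up numbers, the portfolio loses roughly two-thirds of its traffic, which is the same order of magnitude as the worst-case losses Raptive projects for some websites.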
Michael Sanchez, the co-founder and CEO of Raptive, says that the changes coming to Google could “deliver tremendous damage” to the internet as we know it. “What was already not a level playing field … could tip its way to where the open internet starts to become in danger of surviving for the long term,” he said.
When Google’s chief executive Sundar Pichai announced the broader rollout during an earnings call last month, he said the company is making the change in a “measured” way, while “also prioritizing traffic to websites and merchants.” Company executives have long argued that Google needs a healthy web to give people a reason to use its service, and doesn’t want to hurt publishers. A Google spokesperson declined to comment further.
“I think we got to see an incredible blossoming of the internet, we got to see something that was really open and freewheeling and wild and very exciting for the whole world,” said Selena Deckelmann, the chief product and technology officer for Wikimedia, the foundation that oversees Wikipedia.
“Now, we’re just in this moment where I think that the profits are driving people in a direction that I’m not sure makes a ton of sense,” Deckelmann said. “This is a moment to take stock of that and say, ‘What is the internet we actually want?’”
People who rely on the web to make a living are worried.
Jake Boly, a strength coach based in Austin, has spent three years building up his website of workout shoe reviews. But last year, his traffic from Google dropped 96 percent. Google still seems to find value in his work, citing his page in AI-generated answers about shoes. The problem is, people read Google's summary and don't visit his site anymore, Boly said.
“My content is good enough to scrape and summarize,” he said. “But it’s not good enough to show in your normal search results, which is how I make money and stay afloat.”
Google first said it would begin experimenting with generative AI in search last year, several months after OpenAI released ChatGPT. At the time, tech pundits speculated that AI chatbots could replace Google search as the place to find information. Satya Nadella, the CEO of Google’s biggest competitor, Microsoft, added an AI chatbot to his company’s search engine and in February 2023 goaded Google to “come out and show that they can dance.”
The search giant started dancing. Though it had invented much of the AI technology enabling chatbots and had used it to power tools like Google Translate, it started putting generative AI tech into its other products. Google Docs, YouTube’s video-editing tools and the company’s voice assistant all got AI upgrades.
But search is Google’s most important product, accounting for about 57 percent of its $80 billion in revenue in the first quarter of this year. Over the years, search ads have been the cash cow Google needed to build its other businesses, like YouTube and cloud storage, and to stay competitive by buying up other companies.
Google has largely avoided AI answers for the moneymaking searches that host ads, said Andy Taylor, vice president of research at internet marketing firm Tinuiti.
When it does show an AI answer on “commercial” searches, it shows up below the row of advertisements. That could force websites to buy ads just to maintain their position at the top of search results.
Google has been testing the AI answers publicly for the past year, showing them to a small percentage of its billions of users as it tries to improve the technology.
Still, it routinely makes mistakes. A review by The Washington Post published in April found that Google’s AI answers were long-winded, sometimes misunderstood the question and made up fake answers.
The bar for success is high. While OpenAI’s ChatGPT is a novel product, consumers have spent years with Google and expect search results to be fast and accurate. The rush into generative AI might also run up against legal problems. The underlying tech behind OpenAI, Google, Meta and Microsoft’s AI was trained on millions of news articles, blog posts, e-books, recipes, social media comments and Wikipedia pages that were scraped from the internet without paying or asking permission of their original authors.
OpenAI and Microsoft have faced a string of lawsuits over alleged theft of copyrighted works .
“If journalists did that to each other, we’d call that plagiarism,” said Frank Pine, the executive editor of MediaNews Group, which publishes dozens of newspapers around the United States, including the Denver Post, San Jose Mercury News and the Boston Herald. Several of the company’s papers sued OpenAI and Microsoft in April, alleging the companies used its news articles to train their AI.
If news organizations let tech companies, including Google, use their content to make AI summaries without payment or permission, it would be “calamitous” for the journalism industry, Pine said. The change could have an even bigger effect on newspapers than the loss of their classifieds businesses in the mid-2000s or Meta’s more recent pivot away from promoting news to its users, he said.
The move to AI answers, and the centralization of the web into a few portals, isn’t slowing down. OpenAI has signed deals with web publishers — including Dotdash Meredith — to show their content prominently in its chatbot.
Matherne, of Easy Family Recipes, says she’s bracing for the changes by investing in social media channels and email newsletters.
“The internet’s kind of a scary place right now,” Matherne said. “You don’t know what to expect.”
A previous version of this story said MediaNews Group sued OpenAI and Microsoft. In fact, it was several of the company's newspapers that sued the tech companies. This story has been corrected.
Survey on Virtual Assistant: Google Assistant, Siri, Cortana, Alexa
- Conference paper
- First Online: 05 January 2019
- Amrita S. Tulshan
- Sudhir Namdeorao Dhage
Part of the book series: Communications in Computer and Information Science (CCIS, volume 968)
Included in the following conference series:
- International Symposium on Signal Processing and Intelligent Recognition Systems
Virtual assistants are a boon of the 21st century. They have opened the way to a technology that lets people put questions to a machine and interact with intelligent virtual assistants (IVAs) much as they would with other humans. The technology has attracted users worldwide on smartphones, laptops, computers and other devices, with Siri, Google Assistant, Cortana and Alexa among the most prominent assistants. However, voice recognition, contextual understanding and natural human interaction remain unsolved problems for IVAs. To investigate these issues, 100 users participated in a survey for this research and shared their experiences. Each participant put the survey's questions to all of the personal assistants, and the results reported in this paper are drawn from those interactions. The results show that these assistants cover many services, but improvements are still required in voice recognition, contextual understanding and hands-free interaction. The main goal of this paper is to show that addressing these shortcomings will significantly increase the adoption of IVAs.
Author information
Authors and Affiliations
Department of Computer Engineering, Sardar Patel Institute of Technology, Mumbai, 400058, India
Amrita S. Tulshan & Sudhir Namdeorao Dhage
Corresponding author
Correspondence to Amrita S. Tulshan .
Editor information
Editors and Affiliations
Indian Institute of Information Technology and Management, Kerala, India
Sabu M. Thampi
Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
Oge Marques
Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada
Sri Krishnan
Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
Kuan-Ching Li
University of Naples Federico II, Naples, Italy
Domenico Ciuonzo
Electrical Engineering Department, Indian Institute of Technology Patna, Patna, India
Maheshkumar H. Kolekar
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper.
Tulshan, A.S., Dhage, S.N. (2019). Survey on Virtual Assistant: Google Assistant, Siri, Cortana, Alexa. In: Thampi, S., Marques, O., Krishnan, S., Li, KC., Ciuonzo, D., Kolekar, M. (eds) Advances in Signal Processing and Intelligent Recognition Systems. SIRS 2018. Communications in Computer and Information Science, vol 968. Springer, Singapore. https://doi.org/10.1007/978-981-13-5758-9_17
Download citation
DOI: https://doi.org/10.1007/978-981-13-5758-9_17
Published: 05 January 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-5757-2
Online ISBN: 978-981-13-5758-9
eBook Packages: Computer Science, Computer Science (R0)
May 13, 2024
This article has been reviewed according to Science X's editorial process and policies . Editors have highlighted the following attributes while ensuring the content's credibility:
AI-assisted writing is quietly booming in academic journals—here's why that's OK
by Julian Koplin, The Conversation
If you search Google Scholar for the phrase "as an AI language model", you'll find plenty of AI research literature and also some rather suspicious results. For example, one paper on agricultural technology says,
"As an AI language model, I don't have direct access to current research articles or studies. However, I can provide you with an overview of some recent trends and advancements …"
Obvious gaffes like this aren't the only signs that researchers are increasingly turning to generative AI tools when writing up their research. A recent study examined the frequency of certain words in academic writing (such as "commendable," "meticulously" and "intricate"), and found they became far more common after the launch of ChatGPT—so much so that 1% of all journal articles published in 2023 may have contained AI-generated text.
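The word-frequency analysis the study describes can be sketched in a few lines of Python. The marker words come from the article; the tiny corpora and the `marker_rate` helper are hypothetical illustrations of the approach, not the study's actual method.

```python
# Marker words the article says became far more common after ChatGPT's launch.
MARKERS = {"commendable", "meticulously", "intricate"}

def marker_rate(texts):
    """Return marker-word occurrences per 10,000 words across a corpus."""
    total_words = 0
    hits = 0
    for text in texts:
        words = [w.strip('.,;:"()').lower() for w in text.split()]
        total_words += len(words)
        hits += sum(1 for w in words if w in MARKERS)
    return 10_000 * hits / total_words if total_words else 0.0

# Hypothetical mini-corpora standing in for pre- and post-ChatGPT abstracts.
pre_2023 = ["The results were strong and the method was simple."]
post_2023 = ["The meticulously designed study yielded commendable, intricate results."]

print(marker_rate(pre_2023))   # baseline rate
print(marker_rate(post_2023))  # elevated rate
```

A real analysis would run this over millions of abstracts per year and compare the rates; a sharp jump in 2023 is the signal the study reports.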
(Why do AI models overuse these words? There is speculation it's because they are more common in English as spoken in Nigeria, where key elements of model training often occur.)
The aforementioned study also looks at preliminary data from 2024, which indicates that AI writing assistance is only becoming more common. Is this a crisis for modern scholarship, or a boon for academic productivity?
Who should take credit for AI writing?
Many people are worried by the use of AI in academic papers. Indeed, the practice has been described as "contaminating" scholarly literature.
Some argue that using AI output amounts to plagiarism. If your ideas are copy-pasted from ChatGPT, it is questionable whether you really deserve credit for them.
But there are important differences between "plagiarizing" text authored by humans and text authored by AI. Those who plagiarize humans' work receive credit for ideas that ought to have gone to the original author.
By contrast, it is debatable whether AI systems like ChatGPT can have ideas, let alone deserve credit for them. An AI tool is more like your phone's autocomplete function than a human researcher.
The question of bias
Another worry is that AI outputs might be biased in ways that could seep into the scholarly record. Infamously, older language models tended to portray people who are female, black and/or gay in distinctly unflattering ways, compared with people who are male, white and/or straight.
This kind of bias is less pronounced in the current version of ChatGPT.
However, other studies have found a different kind of bias in ChatGPT and other large language models: a tendency to reflect a left-liberal political ideology.
Any such bias could subtly distort scholarly writing produced using these tools.
The hallucination problem
The most serious worry relates to a well-known limitation of generative AI systems: that they often make serious mistakes.
For example, when I asked ChatGPT-4 to generate an ASCII image of a mushroom, it produced a drawing and then confidently told me I could use this image of a "mushroom" for my own purposes.
These kinds of overconfident mistakes have been referred to as "AI hallucinations" and "AI bullshit." While it is easy to spot that the ASCII drawing looks nothing like a mushroom (and quite a bit like a snail), it may be much harder to identify any mistakes ChatGPT makes when surveying scientific literature or describing the state of a philosophical debate.
Unlike (most) humans, AI systems are fundamentally unconcerned with the truth of what they say. If used carelessly, their hallucinations could corrupt the scholarly record.
Should AI-produced text be banned?
One response to the rise of text generators has been to ban them outright. For example, Science—one of the world's most influential academic journals—disallows any use of AI-generated text.
I see two problems with this approach.
The first problem is a practical one: current tools for detecting AI-generated text are highly unreliable. This includes the detector created by ChatGPT's own developers, which was taken offline after it was found to have only a 26% accuracy rate (and a 9% false positive rate). Humans also make mistakes when assessing whether something was written by AI.
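The figures above show just how unreliable such a detector is in practice. The sketch below applies Bayes' rule, assuming (per the study mentioned earlier) that about 1% of papers contain AI-generated text; treating that as the base rate for any particular journal is an assumption for illustration.

```python
# Why a 26% detection rate with a 9% false positive rate is unreliable.
sensitivity = 0.26       # chance an AI-written text is flagged
false_positive = 0.09    # chance a human-written text is flagged
base_rate = 0.01         # assumed share of papers containing AI text

# Bayes' rule: P(AI | flagged) = P(flag | AI) P(AI) / P(flag).
p_flag = sensitivity * base_rate + false_positive * (1 - base_rate)
p_ai_given_flag = sensitivity * base_rate / p_flag

print(f"{p_ai_given_flag:.1%} of flagged papers would actually be AI-written")
# → about 2.8%: the overwhelming majority of flags are false alarms
```

In other words, a journal relying on this detector would mostly be accusing human authors, which is a strong practical argument against enforcement by detection.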
It is also possible to circumvent AI text detectors. Online communities are actively exploring how to prompt ChatGPT in ways that allow the user to evade detection. Human users can also superficially rewrite AI outputs, effectively scrubbing away the traces of AI (like its overuse of the words "commendable," "meticulously" and "intricate").
The second problem is that banning generative AI outright prevents us from realizing these technologies' benefits. Used well, generative AI can boost academic productivity by streamlining the writing process. In this way, it could help further human knowledge. Ideally, we should try to reap these benefits while avoiding the problems.
The problem is poor quality control, not AI
The most serious problem with AI is the risk of introducing unnoticed errors, leading to sloppy scholarship. Instead of banning AI, we should try to ensure that mistaken, implausible or biased claims cannot make it onto the academic record.
After all, humans can also produce writing with serious errors, and mechanisms such as peer review often fail to prevent its publication.
We need to get better at ensuring academic papers are free from serious mistakes, regardless of whether these mistakes are caused by careless use of AI or sloppy human scholarship. Not only is this more achievable than policing AI usage, it will improve the standards of academic research as a whole.
This would be (as ChatGPT might say) a commendable and meticulously intricate solution.
Provided by The Conversation
Intelligent Virtual Assistant (IVA) is "an. software tha t e xploitation information, for example, the operator's voice… and ration al data to give he lp. by noticing inquiries in usual ...
Because of the increasing popularity of voice-controlled virtual assistants, such as Amazon's Alexa and Google Assistant, they should be considered a new medium for psychological and behavioral research. We developed Survey Mate, an extension of Google Assistant, and conducted two studies to analyze the reliability and validity of data collected through this medium. In the first study, we ...
An overview of Bard: an early experiment with generative AI James Manyika, SVP, Research, Technology and Society, and Sissie Hsiao, Vice President and General Manager, Google Assistant and Bard Editor's note: This is a living document and will be updated periodically as we continue to rapidly improve Bard's capabilities as well as
For instance, in many African countries, people do not own a computer but a smartphone (Pew Research Center, 2019), and surveys could be rolled out in multiple languages using the often preinstalled Google Assistant. Behavioral scientists routinely draw broad claims from Western, educated, industrialized, rich, and democratic (WEIRD) samples.
1. Introduction. The communication with devices using the voice is nowadays a common task for many people. Intelligent Personal Assistants (IPA), such as Amazon Alexa, 1 Microsoft Cortana, 2 Google Assistant, 3 or Apple Siri, 4 allow people to search for various subjects, schedule a meeting, or to make a call from their car or house hands-free, no longer needing to hold any mobile devices.
Natural user interfaces are becoming popular. One of the most common natural user interfaces nowadays are voice activated interfaces, particularly smart personal assistants such as Google Assistant, Alexa, Cortana, and Siri. This paper presents the results of an evaluation of these four smart personal assistants in two dimensions: the correctness of their answers and how natural the responses ...
In this paper, we discussed the forensic analysis of Google Assistant, a virtual assistant developed by Google and primarily available on mobile and smart home IoT devices. We showed client-centric forensic artifacts stored in the main opa_history SQLite database on Android smartphones which contain all local copies of voice conversations ...
Our goal in Speech Technology Research is twofold: to make speaking to devices around you (home, in car), devices you wear (watch), devices with you (phone, tablet) ubiquitous and seamless. Our research focuses on what makes Google unique: computing scale and data. Using large scale computing resources pushes us to rethink the architecture and ...
According to National Public Radio and Edison Research, 21% of Americans (53-million people) own smart speakers, growing quickly from the 14-million people who owned their first smart speakers in 2018. Huffman, Vice President of Google Assistant, announced that Google Assistant mobile application has been downloaded to 500-million devices.
1.1. Background. The most popular VAs on the market are Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's Assistant []; these assistants, often found in portable devices such as smartphones or tablets, can each be considered a 'speech-based natural user interface' (NUI) []; a system that can be operated by a user via intuitive, natural behaviour, i.e., voice instructions.
This article proposes IoT-enabled smart home using Google Assistant. Internet of things (IoT) is the emerging technology used to correlate the computing devices by trans-receiving data using Internet. Smart home system ensures our individual home appliances get controlled by voice commands without actually turning on or off.
The paper uses the voice-controlled home automation concept into practice with the help of Google Assistant for voice recognition and control. The purpose of Google Assistant-controlled Home Automation is to provide voice control of household appliances. The NodeMCU (FS P32) microcontroller is utilised, and Wi-Fi is used for communication ...
This paper proposes a voice control system based on an artificial intelligence (AI) assistant. The system was designed using Google Assistant, a representative open-API AI service, and IFTTT (If This Then That), a conditional auto-run service. It was implemented cost-effectively using a Raspberry Pi, a voice recognition module, and open software. ...
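The glue between the voice input and the device action in a setup like this is typically an IFTTT Webhooks trigger. The sketch below builds such a request with the standard library; the event name and key are placeholders, while the URL shape follows IFTTT's documented Webhooks endpoint:

```python
import json
import urllib.request

def build_trigger(event: str, key: str, value1: str) -> urllib.request.Request:
    """Build (but do not send) an IFTTT Webhooks trigger request."""
    url = f"https://maker.ifttt.com/trigger/{event}/with/key/{key}"
    payload = json.dumps({"value1": value1}).encode()
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})

# Placeholder event name and key, for illustration only.
req = build_trigger("voice_command", "YOUR_IFTTT_KEY", "lamp on")
print(req.full_url)
```

Calling `urllib.request.urlopen(req)` with a real key would fire the applet; the Raspberry Pi side then reacts to whatever action the applet is wired to.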
The elasticity and flexibility of cloud computing enable Google Assistant [18] to serve users anytime, anywhere, whether on mobile devices or smart home devices. By combining with cloud computing ...
In the twenty-first century, virtual assistants play a crucial role in humans' day-to-day activities. According to a 2019 Clutch survey report, 27% of people use an AI-powered virtual assistant such as Google Assistant, Amazon Alexa, Cortana, or Apple Siri for performing simple tasks; these virtual assistants are designed with natural language processing.
International Journal of Science and Research (IJSR), ISSN: 2319-7064, SJIF (2022): 7.942, Volume 11, Issue 5, May 2022 ... Keywords: Voice Assistant, NLP, Neural Network, Google Search. 1. Introduction. AI voice assistant, also known as a virtual or digital ... continuous in research papers since 2000, except the year 2010 (Figure ...
If you search Google Scholar for the phrase "as an AI language model", you'll find plenty of AI research literature and also some rather suspicious results. For example, one paper on ...
Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio.
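The three-model pipeline described above (speech-to-text, a text LLM, then text-to-speech) can be sketched with stub functions; only the staged structure is the point, and the stubs merely stand in for the real models:

```python
def transcribe(audio: bytes) -> str:
    # Stage 1: ASR model; here we pretend the audio bytes are their transcript.
    return audio.decode()

def generate(prompt: str) -> str:
    # Stage 2: the text-in/text-out LLM (GPT-3.5 or GPT-4 in the pipeline).
    return f"echo: {prompt}"

def synthesize(text: str) -> bytes:
    # Stage 3: TTS model converting the reply back to audio.
    return text.encode()

def voice_mode(audio: bytes) -> bytes:
    # Each stage blocks on the previous one, which is where the reported
    # 2.8 s / 5.4 s average latencies accumulate; an end-to-end audio model
    # like GPT-4o removes the hand-offs between stages.
    return synthesize(generate(transcribe(audio)))

print(voice_mode(b"hello"))  # b'echo: hello'
```

The staged design also loses information between hand-offs: the text-only middle model never sees tone, multiple speakers, or background sounds.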
The tech giant is rolling out AI-generated answers that displace links to human-written websites, threatening millions of creators. By Gerrit De Vynck and Cat Zakrzewski. Updated May 13, 2024 ...
This paper presents a usability study of four voice-based and contextual-text virtual assistants (Google Assistant, Cortana, Siri, Alexa). Cortana can likewise read your messages, track your location, watch your browsing history, check your contact list, keep an eye on your calendar, and put this information together to propose useful information, on the ...
Many people are worried about the use of AI in academic papers. Indeed, the practice has been described as "contaminating" scholarly literature. Some argue that using AI output amounts to plagiarism ...