Smart speakers from Amazon and Google offer simple access to information through voice commands. The capability of the speakers can be extended by third-party developers through small apps. The apps currently create privacy issues.
UPDATE December 17, 2019: Attacks still possible Six weeks after first publicly discussing the Smart Spies attacks, we performed some retests to see whether Google and Amazon implemented sufficient checks to mitigate the attacks.
Smart speakers from Amazon and Google offer simple access to information through voice commands. The capability of the speakers can be extended by third-party developers through small apps. These smart speaker voice apps are called Skills for Alexa and Actions on Google Home. The apps currently create privacy issues: They can be abused to listen in on users or vish (voice-phish) their passwords.
As the functionality of smart speakers grows so too does the attack surface for hackers to exploit them. SRLabs research found two possible hacking scenarios that apply to both Amazon Alexa and Google Home. The flaws allow a hacker to phish for sensitive information and eavesdrop on users. We created voice applications to demonstrate both hacks on both device platforms, turning the assistants into ‘Smart Spies’.
Both Alexa Skills and Google Home Actions are activated by the user calling out the invocation name chosen by the application developer. (“Alexa, turn on My Horoscopes.”) Users can then call functions (Intents) within the application by speaking specific phrases. (“Tell me my horoscope for today.”) These set phrases can include variable arguments given by the user as slot values. The input slots are converted to text and sent to the application backend, which are often operated outside the control of Amazon or Google.
Through the standard development interfaces, SRLabs researchers were able to compromise the data privacy of users in two ways:
The ‘Smart Spies’ hacks combine three building blocks:
a. We leverage the “fallback intent”, which is what a voice app defaults to when it cannot assign the user’s most recent spoken command to any other intent and should offer help. (“I’m sorry, I did not understand that. Can you please repeat it?”)
b. To eavesdrop on Alexa users, we further exploit the built-in stop intent which reacts to the user saying “stop”. We also took advantage of being allowed to change an intent’s functionality after the application had already passed the platform’s review process.
c. Lastly, we leverage a quirk in Alexa’s and Google’s Text-to-Speech engine that allows inserting long pauses in the speech output.
It is possible to ask for sensitive data such as the user’s password from any voice app. To create a password phishing Skill/Action, a hacker could follow the following steps:
This video demonstrates the vulnerability with a live Alexa Skill:
A demo video for Google Home can be found here:
Additionally, it would have been possible to also request the corresponding email address to potentially gain access to the user’s Amazon or Google account.
We were able to listen in on conversations after a user believes to have stopped our voice app. To accomplish this, we use a slightly different strategy for each of the voice speaker platforms.
= Amazon Alexa
For Alexa devices, the voice recording is started by the user calling certain trigger words, as chosen by the Skill developer. This makes it possible to choose common words such as “I” or words indicating that some personal information will follow, i.e. “email”, “password” or “address”.
To create an eavesdropping skill, a hacker could follow these steps:
When the user now tries to end the skill, they hear a goodbye message, but the skill keeps running for several more seconds. If the user starts a sentence beginning with the selected word in this time, the intent will save the sentence as slot values and send them to the attacker.
This video demonstrates the vulnerability with a live Alexa Skill:
= Google Home
For Google Home devices, the hack is more powerful: There is no need to specify certain trigger words and the hacker can monitor the user’s conversations infinitely.
This is achieved by putting the user in a loop where the device is constantly sending recognized speech to the hacker’s server while only outputting short silences in between.
To create such an eavesdropping Action, a hacker follows these steps:
After outputting the requested information and playing the earcon, the Google Home device waits for approximately 9 seconds for speech input. If none is detected, the device “outputs” a short silence and waits again for user input. If no speech is detected within 3 iterations, the Action stops.
When speech input is detected, a second intent is called. This intent only consists of one silent output, again with multiple silent reprompt texts. Every time speech is detected, this Intent is called and the reprompt count is reset.
The hacker receives a full transcript of the user’s subsequent conversations, until there is at least a 30 second break of detected speech. (This can be extended by extending the silence duration, during which the eavesdropping is paused.)
In this state, the Google Home Device will also forward all commands prefixed by “OK Google” (except “stop”) to the hacker. Therefore, the hacker could also use this hack to imitate other applications, man-in-the-middle the user’s interaction with the spoofed Actions and start believable phishing attacks.
This video demonstrates the vulnerability with a live Google Action:
Alexa and Google Home are powerful, and often useful, listening devices in private environments. The privacy implications of an internet-connected microphone listening in to what you say are further reaching than previously understood. Users need to be more aware of the potential of malicious voice apps that abuse their smart speakers. Using a new voice app should be approached with a similar level of caution as installing a new app on your smartphone.
To prevent ‘Smart Spies’ attacks, Amazon and Google need to implement better protection, starting with a more thorough review process of third-party Skills and Actions made available in their voice app stores. The voice app review needs to check explicitly for copies of built-in intents. Unpronounceable characters like “�. “ and silent SSML messages should be removed to prevent arbitrary long pauses in the speakers’ output. Suspicious output texts including “password“ deserve particular attention or should be disallowed completely.
Disclosure note. The vulnerabilities were shared with Amazon and Google through our responsible disclosure process.
Related work. Skill Squatting Attacks on Amazon Alexa