Microsoft Research Asia has unveiled VASA-1, an experimental AI tool that can turn a still image of a person, whether a photo or a piece of artwork, into a lifelike talking face in real time using an existing audio sample. It generates lip movements that match the speech or song, along with fitting facial expressions and head movements. The project page features numerous examples posted by the researchers, and the results are convincing enough to trick people into believing they are real.
It’s evident that the technology could be abused to quickly and easily produce deepfake videos of real people, even though the lip and head movements in the examples can still look a little artificial and out of sync on closer examination. Recognizing this possibility, the researchers have decided to hold off on releasing “an online demo, API, product, additional implementation details, or any related offerings” until they are certain that their technology “will be used responsibly and in accordance with proper regulations.” However, they did not say whether they plan to put any safeguards in place to stop bad actors from exploiting it for malicious purposes, such as making deepfake porn or running disinformation campaigns.
Despite its potential for misuse, the researchers believe their technology has many benefits. They said it can be used to enhance educational equity and to improve accessibility for people with communication challenges, perhaps by giving them access to an avatar that can communicate for them. It can also provide companionship and therapeutic support for those who need it, they said, suggesting that VASA-1 could be used in programs that offer access to AI characters people can talk to.
According to the paper published with the announcement, VASA-1 was trained on the VoxCeleb2 dataset, which contains “over 1 million utterances for 6,112 celebrities” extracted from YouTube videos. Even though the tool was trained on real faces, it also works on artistic images like the Mona Lisa, which the researchers amusingly combined with an audio file of Anne Hathaway’s viral rendition of Lil Wayne’s Paparazzi. It’s delightful enough to be worth a watch, even if you doubt what good a technology like this can do.