OpenAI and Google may have violated creators’ copyrights by training their AI models on text transcribed from YouTube videos, The New York Times has reported. The report cites numerous people with knowledge of the practices at OpenAI, Google, and Meta, and details the lengths to which these companies have gone to maximize the amount of data they can feed to their AI models. It comes just days after YouTube CEO Neal Mohan said in a Bloomberg Originals interview that OpenAI would violate the platform’s terms of service if it used YouTube videos to train its new text-to-video generator, Sora.
According to the New York Times, OpenAI used its Whisper speech recognition tool to transcribe more than a million hours of YouTube videos, and the transcripts were used to train GPT-4. The Information previously reported that OpenAI had used YouTube videos and podcasts to train the two AI systems. OpenAI president Greg Brockman was reportedly among the people involved in this effort. Matt Bryant, a Google spokesperson, said that “unauthorized scraping or downloading of YouTube content” is prohibited. He also said that Google was not aware of OpenAI using YouTube content in this way.
The report, however, claims there were people at Google who knew but did not take action against OpenAI because Google was using YouTube videos to train its own AI models. Google told NYT it only does so with videos from creators who have agreed to this. Engadget has reached out to Google and OpenAI for comment.
The NYT report also claims Google asked a team to tweak its privacy policy in June 2023 to more broadly cover its use of publicly available content, including Google Docs and Google Sheets, to train its AI models and products. The changes, which Google says were made for clarity’s sake, were published in July. Bryant told NYT that this type of data is only used with the permission of users who opt into Google’s experimental feature tests, and that the company “did not start training on additional types of data based on this language change.” The change added Bard as an example of what that data might be used for.