Tech companies scramble for data to train AI models

AI companies are scrambling for data to train their models. OpenAI, Google and Meta are using every available method to collect information, and not all of those methods are legal.

Apr 15, 2024


Artificial intelligence companies are actively seeking affordable ways to collect data to train their models, in some cases violating laws, copyrights, and platform policies.

OpenAI's approach

OpenAI developed its Whisper audio transcription model and used it to transcribe more than a million hours of YouTube videos to train GPT-4. OpenAI President Greg Brockman reportedly handled the selection of videos personally.

Spokeswoman Lindsay Held emphasized that the company creates specialized data sets for each of its AI models. This helps the models better understand the world and keeps the startup competitive in the global research community.

In addition, OpenAI uses a variety of data sources, including publicly available data and information obtained through partnerships. The company also plans to create its own synthetic data.

Google's approach

Google also used transcriptions of YouTube videos. The company says it trained its models on some of this content in accordance with its agreements with the videos' creators.

In 2023, Google changed its terms of service. The change allows the company to use public documents, Google Maps reviews, and other content from its online services to gather more information for its AI-based products.

Meta's approach

Meta also had difficulty finding quality training data. As the company worked to catch up with OpenAI, internal discussions arose about the possibility of using copyrighted materials.

Having already worked through most of the English-language books, essays, poems and news articles available on the Internet, the company considered options such as purchasing book licenses or even paying a large publisher directly for its materials.

According to the source, Meta employees were willing to collect data from the Internet despite the risk of litigation. They felt that licensing negotiations with publishers, artists, musicians and media representatives could take too long.

Possible solutions

The companies' actions illustrate how information on the Internet is becoming the fuel for the artificial intelligence industry's growth. There are two main approaches to solving the data shortage.

The first approach involves training models on synthetic data generated by the companies' own models, or using so-called “curriculum learning.” The latter presents models with high-quality data in an ordered sequence in the hope that they can make deeper connections between concepts using much less information. However, the effectiveness of these techniques has not yet been confirmed.

The second approach, which some companies take, is to use whatever information is available, regardless of whether they have permission to do so. However, as numerous lawsuits show, this approach can have serious consequences.