Is OpenAI’s Sora Trained on YouTube Videos? A Question of Ethics and Licensing

You probably didn’t miss last month’s announcement of OpenAI’s video generator Sora. It created quite a buzz, raising excitement, worry, and a lot of questions within the filmmaking community. One of the pressing matters that always comes up when talking about generative AI is what data developers use for model training. In a recent interview with The Wall Street Journal, OpenAI’s chief technology officer (CTO) Mira Murati didn’t want (or wasn’t able) to answer this question. She added that she wasn’t sure whether Sora was trained on YouTube videos or not. This raises an important question: what does this mean in terms of ethics and licensing? Let’s take a critical look together!

In case you did miss it: Sora is OpenAI’s text-to-video generator, allegedly capable of creating consistent, realistic-looking, and detailed video clips up to 60 seconds long, based on simple text descriptions. It hasn’t been released to the public yet, but the published showcases have already sparked heated discussion about the possible consequences. One assumption is that it might entirely replace stock footage. Another is that video creators will have a hard time getting camera gigs.

While I’m personally skeptical that AI can completely take over creative and cinematography jobs, there is another question that concerns me a lot more. If OpenAI used, say, YouTube videos for model training, how on earth would the company be legally allowed to roll out Sora for commercial purposes? What would this mean in terms of licensing?

Was Sora trained on YouTube videos?

Ahead of the interview, Joanna Stern from The Wall Street Journal provided OpenAI with a set of text prompts that were used to generate video clips. In the discussion with OpenAI’s CTO Mira Murati, they analyzed the results in terms of Sora’s strengths and current limitations. What also caught Joanna’s interest was how strongly some of the output resembled well-known cartoons or films.

Did the model see any clips of “Ferdinand” to know what a bull in a china shop should look like? Was it a fan of “SpongeBob”?

Joanna Stern, a quote from The Wall Street Journal interview with Mira Murati

However, when the interview touched on the dataset Sora learns from, Murati suddenly backed off and started beating around the bush. She didn’t want to dive into the details, was “not sure” whether YouTube, Facebook, or Instagram videos were used in Sora’s training, and leaned on the safe answer that “it was publicly available or licensed data” (which are two very different things to begin with!). You don’t need to be a body language expert to see that OpenAI’s CTO didn’t feel comfortable answering these questions. (You can watch her reaction in the original video interview below, starting at 04:05.)

Copyright challenges concerning generative AI

According to the WSJ, after the interview Mira Murati confirmed that Sora used content from Shutterstock, which OpenAI has a partnership with. However, that is surely not the only source of footage the developers fed into their deep-learning models.

If we take a closer look at Murati’s response, the copyright and attribution situation becomes even more critical. The wording “publicly available data” may indeed mean that OpenAI’s Sora scrapes the entire Internet, including YouTube publications and content on social media. The licensing terms for YouTube content, for instance, most certainly don’t allow this kind of use of all the content hosted there.

Maintaining copyright online is a challenging area in its own right. I’m not a lawyer, but some things are common sense. For instance, if Searchlight Pictures publishes a trailer for “Poor Things” on YouTube, that doesn’t mean I’m free to use clips from it in my commercial work (or even in my blog, without correct attribution). At the same time, OpenAI’s Sora gets access to it and can use it not only for learning purposes but also to profit from it, just like that.

How some companies react

The copyright (and licensing) problem with generative AI is not new. Over the past year, we’ve heard about an increasing number of lawsuits that big media companies like The New York Times and Getty Images have filed against AI developers (particularly often against OpenAI).

If you have ever used text-to-image generators, you’ve surely seen how artificial intelligence adds weird-looking words to the created pictures. More often than not, they distinctly resemble a stock image watermark or a company name, which suggests these AI companies don’t have rights to all the datasets they use.

Was OpenAI's Sora trained on YouTube Videos? - how image generators sometimes include random texts
An “abstract background” image that unexpectedly includes random text. Image source: generated with Midjourney for CineD

Unfortunately, there are no strict regulations in place yet that would prevent AI developers from using materials found online, and finding out and proving that a particular piece of data was used to train a model is close to impossible. Apart from filing lawsuits, some companies have blocked OpenAI’s web crawler so that it can’t continue taking content from their websites, while others sign licensing agreements (one of the latest examples: Le Monde and Prisa Media, which will bring French and Spanish content to ChatGPT). But what do you do as an individual artist or video creator? That question remains open.
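For reference, blocking OpenAI’s crawler is usually done through a site’s robots.txt file. GPTBot is the user agent OpenAI publicly documents for its web crawler, and a minimal sketch of such a block looks like this (whether a crawler honors it is, of course, up to the crawler):

```
# Block OpenAI's documented crawler from the entire site
User-agent: GPTBot
Disallow: /
```

Site owners who want this protection would place the file at the root of their domain (e.g., example.com/robots.txt). Note that robots.txt is a voluntary convention, not an enforcement mechanism.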

Not revealing datasets is a common issue for generative AI

It’s not just OpenAI’s CTO who doesn’t want to talk about the datasets Sora learns from. The company generally hardly mentions the sources it uses. Even in Sora’s technical paper, you can only find a vague note that “training text-to-video generation systems requires a large amount of videos with corresponding text captions.”

The same issue applies to other AI developers, especially the ones that call themselves “small”, “independent”, and/or “research” companies. For example, if you look at the website of the famous image generator Midjourney and try to find information on the data they train their models on, you’re out of luck. A lack of transparency on this question can be the first sign that these companies are trying to avoid legal problems because they don’t have rights to the data they are using.

There are exceptions, of course. Adobe, for instance, directly addressed the ethical question when launching its generative model Firefly and published information about the datasets it used.

Was OpenAI's Sora trained on YouTube Videos? - a screenshot from Adobe's website about the dataset Firefly is trained on
Image source: Adobe Firefly’s webpage

However, their approach is still questionable. Were Adobe Stock contributors notified that their footage would become training ground for AI? Did they give their consent? Does this fact increase their earnings? I doubt it.

What it means if Sora was trained on YouTube videos

So, as you can see, we have landed in a very messy situation with no clear solutions in sight. During the same interview with The Wall Street Journal, Mira Murati mentioned that Sora would be released to the public later this year. According to her, OpenAI aims to make the tool available at costs similar to their image generator DALL-E 3 (currently around $0.080 per image). However, if they don’t find a way to clarify their training data or compensate filmmakers and video creators, things might get very tense for them. We predict that at least the big studios, production companies, and successful YouTube channels will bury OpenAI in copyright lawsuits if the company doesn’t solve this on its own, which might be hard to do.

And what do you think? How would you react if OpenAI directly confirmed that they used YouTube videos and all published content, regardless of whom it belongs to? Is there any way they can make things right before they roll out Sora?

Feature image source: a screenshot from a video clip generated by OpenAI’s Sora.
