When The Internet Is Not Enough

There have been a lot of stories recently about how the large language models behind the biggest artificial intelligence products from OpenAI, Google and Meta have consumed so much data in training that they have exhausted (or will imminently exhaust) all of the data available on the Internet.

In case you missed that, I’ll repeat it. They have consumed virtually ALL of the data available on the Internet. All of it.  That’s a lot of data.

I’ve previously written about the copyright issues at play with this behavior, and some of these recent stories from credible sources suggest that OpenAI, Google and Meta knew they were in potentially questionable territory in using copyrighted works to train their models, but did so anyway.  They were so eager/desperate to find new, untapped pools of data to use for AI training that OpenAI and Google figured out how to transcribe the audio portions of more than one million hours of YouTube videos, likely in violation of YouTube’s own terms of use.  And let’s not forget that YouTube is owned by Google!  Meta even considered purchasing Simon & Schuster, the book publisher, to mine its catalog of books.  While it does not excuse their violation of copyright holders’ rights, hearing about the need and competition for such vast amounts of data gives some insight into why they proceeded without heeding the warning signs.

When Everything Is Not Enough

Ingesting all of the data on the Internet is a remarkable accomplishment.  Ignoring the fights over copyrights for the moment, it is impressive that the AI companies accomplished this herculean task in a relatively short amount of time.  Good for them.  Their models are as trained as they can be, right?

Not so fast, say these companies.  Once they have used all of the available data, they will need to find new data to continue training their models.

They are saying that “everything” is not enough.  That’s a tough one for me to wrap my head around.  When you have consumed “all”, how can that not be sufficient?

Finding New Data

They claim that new data is required to continue improving on the products they have created.  So what are their options?

First of all, there is new data being created every day.  By the truckloads.  Text, photos, videos, music.  Since 2015, it is estimated that we have taken and shared more photographs each year than in all of the preceding years combined, going back to the beginning of photography.  And that’s just photographs.  That’s got to be meaningful.  If the AI companies set up their systems to ingest all of the new data created each day, certainly that must be good enough.

The AI companies don’t think so.

Second, there are now millions of people using their models every day.  Studies estimate that 49% of the population is now using generative AI, with nearly 34% using it every day.  Unlike a search engine, which responds to a query with a list of thousands of results that it thinks may be relevant to your question, AI attempts to provide you with THE ANSWER.  With that being the goal, there must be value in all of the feedback data being collected from daily users.  They are inputting prompts and refining those queries to make whatever results the models produce better and closer to what they are looking for.  There are vast numbers of data points to be collected and ingested from this usage.  However, the AI companies still don’t believe this is enough.

Introducing Synthetic Data

Their current plan is to start creating synthetic data to feed their models.  Synthetic data is data that is created by generative AI systems.  Apparently, the goal is to have one system respond to a query with multiple answers and then have a second system judge those answers and put forth the better answer as learning for the AI system.  Almost enough to make your head spin, right?  It reminds me of that Escher drawing of a hand drawing a hand drawing.  Feels very circular to me.
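
For readers who want to see the mechanics, here is a minimal sketch of that generate-and-judge loop.  It is an illustration only, not how any of these companies actually build their systems.  Every name in it (generate_candidates, judge, make_synthetic_example) is hypothetical, and both "systems" are toy stand-ins: the first guesses at the sum of two numbers with some random error, and the second is assumed to be able to check the arithmetic exactly.

```python
import random

def generate_candidates(prompt, n, rng):
    """Hypothetical stand-in for the first AI system.

    The "prompt" is a pair of numbers to add; each candidate answer is the
    true sum plus a small random error, mimicking answers of varying quality.
    """
    a, b = prompt
    return [a + b + rng.choice([-2, -1, 0, 1, 2]) for _ in range(n)]

def judge(prompt, answer):
    """Hypothetical stand-in for the second AI system: higher score is better.

    Assumption: this toy judge can check the arithmetic exactly. A real judge
    is itself a model and can be wrong or biased.
    """
    a, b = prompt
    return -abs((a + b) - answer)

def make_synthetic_example(prompt, n=4, seed=0):
    """Generate n candidates and keep the one the judge scores highest."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    best = max(candidates, key=lambda answer: judge(prompt, answer))
    return {"prompt": prompt, "candidates": candidates, "answer": best}

# The selected answer would become one new piece of synthetic training data.
example = make_synthetic_example((2, 3))
print(example)
```

Notice that the quality of the result depends entirely on the judge.  In this toy the judge is never wrong; in practice it is another model with its own blind spots.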

And given all that we’ve learned about generative AI’s propensity to hallucinate (to make up answers because it is trying to provide the user with an answer, whether that answer exists or not), this feels like it could be a mistake.  Plus there have been plenty of articles about how AI picks up the biases contained in the data it was trained on.  If AI is creating synthetic data from that already flawed data, it could magnify its own imperfections.  With each generation of synthetic data, the data set could degrade, moving further and further away from the original accurate data, however flawed.
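
A toy simulation makes the generational worry concrete.  This is a deliberately crude sketch under a strong assumption: each "generation" can do nothing but re-sample, with replacement, from the output of the generation before it.  Real models are far more sophisticated, and the function name simulate_generations is made up for this illustration.  Still, it shows how variety can leak away when a system learns only from its own output.

```python
import random

def simulate_generations(original, generations, seed=0):
    """Return the number of distinct items remaining after each generation.

    Assumption: each new generation is built only by re-sampling (with
    replacement) from the previous generation's data, a crude stand-in for
    training a model on the prior model's synthetic output.
    """
    rng = random.Random(seed)
    data = list(original)
    distinct_counts = [len(set(data))]
    for _ in range(generations):
        # The next generation only ever sees what the last one produced.
        data = [rng.choice(data) for _ in range(len(data))]
        distinct_counts.append(len(set(data)))
    return distinct_counts

# Start with 100 distinct "facts" and run 20 generations.
counts = simulate_generations(range(100), generations=20)
print(counts)
```

Because an item that drops out of one generation can never reappear in a later one, the count of distinct items can only hold steady or fall.  Nothing new is ever added back in.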

Call Me Skeptical

The AI companies claim to have a plan to remove or reduce bias in synthetic data.  Call me skeptical given that attempts to do this thus far have produced less than stellar results.  Also, I haven’t yet read a theory of how synthetic data systems won’t degrade the data they were originally trained on.  

I believe AI has tremendous potential to improve our lives.  However, like most things AI, this push to create or discover new data is moving fast – probably faster than it should.  Couple this with the fact that it is being driven by competition among the AI companies to create the “best” model, and it probably doesn’t bode well for the exercise of caution.  Let’s just hope that measures are being implemented to prevent this train from running off the rails.  Buckle in.

NOTES

  1. https://www.nytimes.com/2024/04/06/technology/ai-data-tech-companies.html

  2. https://www.cocreations.ai/news/would-we-have-taylor-swift-without-the-beatlesnbsp-the-battle-over-ai-training

  3. https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html

  4. https://www.ben-evans.com/benedictevans/2015/8/19/how-many-pictures

  5. https://www.forbes.com/sites/forbestechcouncil/2023/11/20/the-pros-and-cons-of-using-synthetic-data-for-training-ai/

Robert Rosenberg

Robert Rosenberg is an independent legal consultant and principal of Telluride Legal Strategies.  He spent 22 years at Showtime Networks in various legal and business roles, most recently as Executive Vice President, General Counsel and Assistant Secretary.  He now consults with companies of all sizes on legal and business strategies. Rob is a thought leader, an expert witness, and a problem solver working at the intersection of media, communication and technology with a strong interest in solving issues introduced by artificial intelligence in business.  Rob can be reached at rob@telluridelegalstrategies.com.

https://www.linkedin.com/in/robertrosenberg/