The AI Wars Have Begun, Part 2
How does Google train their AI models? We look at some "interesting" data points. I'm not saying it was aliens...
Did You Know? When it comes to understanding “AI terminology and capabilities,” 50% of surveyed marketers identified as beginners and 37% as intermediate. (Source)
Yesterday, we discussed the beef slow-cooking between Google and OpenAI over OpenAI’s alleged use of millions of hours of YouTube video transcriptions to train their new Sora AI text-to-video model.
I used that subject as a jumping-off point to pose two questions about the legal and ethical gray areas of AI, particularly as they relate to organic search in light of Google’s new AI Overviews rollout.
1. Does Google’s AI model powering their AI Overviews actually create the content, or does it rewrite content that other people have created?
2. How does Google train their model to publish accurate AI Overviews tailored to individual search queries?
We tackled Question 1 in Part 1. To quote myself:
It’s easy to say, “That’s our content and Google is stealing or plagiarizing it!” which may be true.
But how is that different than the process used by a content creator tasked with publishing a new article or blog post? Doesn’t that content creator type queries into Google, read the top results, compile information, and synthesize a new piece of content aggregating all of that information? An article that they own? That they created?
If so, how is Google’s AI model—or any generative AI model, for that matter—really any different, aside from the scale at which it works?
Today, we tackle Question 2.
How does Google train their model to publish accurate AI Overviews?
I really like SEO. I love trying to crack the algorithm², mostly because I just love algorithms. I’m something of a data scientist myself, having trained several machine learning and AI models for my own uses (and sometimes just because it’s fun! 🤓).
(I walked right into this meme, didn’t I?)
Let’s answer this question by taking things step-by-step. Even if you have little-to-no previous understanding of how AI models work, this still should make sense to you.
1. Google is unveiling a new AI Overview section at the top of page 1 that provides AI-generated responses to user queries.
2. These AI Overviews are powered by Google’s proprietary AI models, which use content from top-ranking search results to generate new content.
3. All AI models work by taking input data and processing it to yield output data. Here, the input data is the content from top-ranking search results for a user’s query and the output data is the AI Overview published for that query.
4. The smaller the difference between the input data used in training and the input data used in production, the better the model performs.
5. Google is paying Reddit $60 million per year to train their AI models on Reddit posts, so we know Google is building models where the training inputs are forum-style, user-generated content.
6. That means, once deployed for real-world use, these models will give the best outputs to searchers if those same models’ inputs also are forum-style, user-generated content.
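If the point about training inputs versus production inputs feels abstract, here is a toy Python sketch of it. To be clear, this is my own made-up illustration and has nothing to do with how Google actually builds or serves its models. The "model" is a straight line fitted to a handful of numbers, and `truth`, `fit_line`, and `mean_abs_error` are names I invented for the example. The only idea it shows is that a model tends to make smaller mistakes on inputs that look like what it was trained on.

```python
# Toy illustration only. This is NOT Google's method; every name here is hypothetical.

def truth(x):
    """Made-up 'real world' relationship the model tries to learn."""
    return x ** 2

def fit_line(xs, ys):
    """Ordinary least squares with one feature: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs):
    """Average size of the model's mistakes on a set of inputs."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - truth(x)) for x in xs) / len(xs)

# Training inputs: values between 0.0 and 1.0.
train_inputs = [i / 10 for i in range(11)]
model = fit_line(train_inputs, [truth(x) for x in train_inputs])

# "Production" inputs that resemble the training data (still between 0 and 1)...
similar_inputs = [0.05 + i / 10 for i in range(10)]
# ...and "production" inputs that do not (between 2.0 and 3.0).
shifted_inputs = [2 + i / 10 for i in range(11)]

error_similar = mean_abs_error(model, similar_inputs)
error_shifted = mean_abs_error(model, shifted_inputs)

print(f"Error on inputs like the training data:   {error_similar:.3f}")
print(f"Error on inputs unlike the training data: {error_shifted:.3f}")
```

Run it and the second number comes out far larger than the first. Swap "numbers between 0 and 1" for "forum-style, user-generated content" and you have the intuition behind the last two steps, with the big caveat that real language models are vastly more complicated than a straight line.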
Before you get your pitchfork: While we know Google is using Reddit posts to train AI models, we can’t say which models or whether those models are used to create the new AI Overviews.
However, we also know Google’s March 2024 update gave large bumps to websites powered by forum-style, user-generated content—specifically, Reddit and Quora as shown in the following screenshots of SEMrush’s estimated organic search traffic to each domain.
[Screenshot: SEMrush estimated organic search traffic for Quora]
Adding it all up:
If AI models perform better when the input data they use in production matches the structure and format of the input data on which they were trained…
And if we know Google just paid Reddit a ton of money to train models on their forum-style, user-generated content…
And if we also know forum-style, user-generated content just got a massive bump in search rankings from the March 2024 update…
Gif by giffffr on Giphy
I don’t care to pore over Google’s official Terms of Use or any other corporate, CYA bullshit to decide whether Google is technically or legally in the right because of how they wrote their policies that 99.93% of us never read. There certainly will be other people who willingly take on that task.
(Recently, Google was sued for $5 billion in damages related to how they collect, store, and use customer-tracking data. Maybe those same folks want to take a crack at this?)
I’m approaching this question more from a moral/ethical perspective. Putting legalities aside, isn’t it a little hypocritical of Google to get all pissy about OpenAI training models on publicly available YouTube videos while simultaneously using content they didn’t create to spin up on-demand overviews with prime placement at the top of page 1?
I think so.
But I also haven’t decided how I feel or what I think should be done about all of this yet, because it’s complex, it’s all still unfolding, and there’s a lot we just don’t know about how these private companies operate.
One thing I do know.
It’s time to grab a $7 large movie theater popcorn and enjoy the show.
The AI wars have begun.
Footnotes
CGO stands for Chief GIF-Selection Officer and it’s my favorite role here at Data-Driven Marketing.
By “crack the algorithm” I don’t mean any of those shady, short-sighted black and gray hat tactics you often hear about, like PBNs or trying to write content optimized solely for Google instead of for people. Rather, I enjoy trying to figure out how the algorithm works and what that means for building a sustainable future-proofed SEO strategy and website.
Everyone say, “Hi!” to Alex M 👋
Question: What’s the most random fact you know?
Alex M’s Answer: “There is a tick called the lonestar tick that can make people allergic to red meat. It’s rare but it can happen.”
Editor’s Note (from me, Bryan): Alex also provided a link to this article from the American College of Allergy, Asthma, and Immunology. Looks like it’s true! The irony of this tick potentially being found across the Southeast and into Texas, with all of that region’s delicious BBQ, is just too much.
ChatGPT-Generated Stolen Joke of the Day 🤣
Why did the chicken join a band?
Because it had the drumsticks!
Suggest a topic for a future edition 🤔
Got an idea for a topic I can cover? Or maybe you’re struggling with a specific marketing-related problem that you’d like me to address?
Just reply to this email and describe the topic.
There's no guarantee I'll use your suggestion, but I read and reply to everyone, so have at it!