Where Do Generative AI Models Source Their Data & Information?


When you hear about ChatGPT and other AI tools, you often hear about the vast amounts of data they are trained on and draw from. But where does that data come from? There are a few different sources, depending on the type of AI tool you’re using and what kind of information is relevant. Understanding this can help you see all that you can gain from AI when you use it correctly.

First, though, just to make sure everyone is on the same page, let’s go over what generative AI models are and how they work. 

What is generative AI?

Generative AI is a form of artificial intelligence that creates new content, such as text, images, audio, and code, in response to an input. ChatGPT and Bard are two good examples. They respond to prompts entered by the user, generating answers from patterns learned from the large volumes of data they were trained on.

Some common use cases include:

  • Translation 
  • Coding 
  • Grammar analysis and writing support
  • Dictation and transcription
  • Voice/speech recognition
  • Sound editing
  • Infographics and illustrations 
  • 3D modeling

This technology is used across several industries and applications today, although concerns and debates continue regarding things like ethics, legality, and privacy. Part of the concern comes from those wondering where this data comes from and how it’s being used. 

Where does the data come from?

Generative AI tools typically have access to data from a few different places. They can utilize text, images, audio, and video to learn patterns, generate new content, and deliver answers (or output) that are almost human. Some data collection methods are automated while others require manual effort. 

Web scraping and crawling 

This is the method most people think of when it comes to generative AI training. Developers point a crawler at a specific topic or type of information, and it scours the Internet to collect the pages that match.

There are a growing number of tools and applications to help with web scraping. It’s important to choose reputable ones and set the parameters accordingly. Otherwise, your AI models might start sourcing low-quality or inaccurate data, and that can create a huge ripple effect on your entire AI approach. 

Public data sets 

You already have access to a wealth of data that your generative AI models can train on. A lot of data is publicly available, and some of it belongs to the public domain, which means anyone can access and use it for any purpose. Other public data sets are released under licenses that attach conditions to reuse, so check the terms before you train on them. These data sets include text as well as images, audio, video, and other materials.

Books, scientific journals, Wikipedia pages, free image libraries, and news articles are just a few examples of publicly accessible sources, though many of them are covered by copyright or a license rather than being in the public domain.


Crowdsourcing

Crowdsourcing is a lesser-known technique for collecting data for AI. Companies that run it themselves need to build their own online platform where they can hire and manage a “crowd” of contributors to gather data. This method offers diverse, high-quality data from a firsthand source, but it’s not always easy to do.

Some choose to work with a crowdsourcing service provider, which allows them to enjoy all the perks of getting this useful data without the barriers or big investments. When training AI models, the more direct the information, the more accurate the outcomes.

Synthetic data generation 

This one is newer and a bit confusing to some, but it’s really revolutionary and points to just how powerful AI can be. Synthetic data generation means that you use one generative AI model to create synthetic data, and then use that data to train another generative AI model. 

If you are developing a customer service AI model, for example, you could use another generative AI model to create fictional customer situations and interactions. This method is gaining popularity because it can provide diverse, high-quality data that raises fewer privacy concerns, since it isn’t tied to real individuals.
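To make the idea concrete, here is a minimal sketch of generating fictional customer situations. It is an illustration only: in a real setup the text would come from a generative AI model, whereas this example uses simple templates as a stand-in so it runs on its own. The names ISSUES, TONES, and make_synthetic_tickets are hypothetical.

```python
import random

# Hypothetical building blocks; a real pipeline would ask a generative
# model to write these scenarios instead of filling in templates.
ISSUES = ["a late delivery", "a billing error", "a password reset"]
TONES = ["frustrated", "polite", "confused"]


def make_synthetic_tickets(n: int, seed: int = 0) -> list[dict]:
    """Create n fictional customer service scenarios (no real customers)."""
    rng = random.Random(seed)  # seeded so the same data can be regenerated
    tickets = []
    for i in range(n):
        issue = rng.choice(ISSUES)
        tone = rng.choice(TONES)
        tickets.append({
            "id": f"synthetic-{i}",
            "issue": issue,
            "tone": tone,
            "text": f"A {tone} customer contacts support about {issue}.",
        })
    return tickets


if __name__ == "__main__":
    for ticket in make_synthetic_tickets(3):
        print(ticket["text"])
```

Because every record is invented, none of it traces back to a real person, which is the main privacy appeal of this approach.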

Customer data

One of the best sources of training data for generative AI is something that you already have, probably in droves: customer data. You can use information like customer call logs, purchase history, customer service history, and other details to help with customer service and other aspects of your business. 

This will require that you follow all privacy laws and get consent for the use of the data. There’s also the concern of bias in call logs because of call times, types, and so forth. You may have to clean data to remove background noise, irrelevant information, or errors. However, the result is plenty of useful data that can help you train AI models for a variety of needs.
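As a rough sketch of what that cleanup can look like for text records, the example below redacts email addresses and phone numbers, tidies whitespace, and drops empty or duplicate entries. The patterns are deliberately simple and will miss some cases, and the function name clean_records is hypothetical. It is not a substitute for a proper privacy review.

```python
import re

# Simplified patterns for illustration; real-world formats vary widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean_records(records: list[str]) -> list[str]:
    """Redact contact details, normalize whitespace, drop empties and duplicates."""
    seen = set()
    cleaned = []
    for text in records:
        text = EMAIL_RE.sub("[EMAIL]", text)
        text = PHONE_RE.sub("[PHONE]", text)
        text = " ".join(text.split())  # collapse repeated spaces and newlines
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = ["Call me at 555-123-4567  please", "", "Email jane@example.com"]
    print(clean_records(sample))
```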

User-generated content (UGC)

There are privacy and usage concerns to keep in mind, but user-generated content is a great source of valuable training data for generative AI models. Social media sites, forums, and personal blogs are a few examples of sites filled with insights, data, and other valuable assets that you can use in training your AI and your team alike. 

Make sure that you know which sites allow this and where you’ll find the highest-quality data. Reddit, for example, no longer offers its data for free for AI training. Don’t assume an AI tool will work out on its own what it’s allowed to collect: check each site’s terms of service and robots.txt file, where sites that object to scraping usually say so.
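One common way a site signals what crawlers may collect is its robots.txt file. The sketch below uses Python’s standard urllib.robotparser module to filter a list of URLs against such a file. The sample rules, the crawler name, and the example.com URLs are made up for illustration, and passing this check does not replace reading a site’s terms of service.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the given robots.txt lets this crawler fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    candidates = [
        "https://example.com/blog/post",
        "https://example.com/private/data",
    ]
    print(allowed_urls(SAMPLE_ROBOTS, "example-crawler", candidates))
```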

The moral of the story: Choose your partners wisely 

At Smith.ai, we know that quality matters in all that you do. As you begin searching for ways to embrace AI in your business, we’re here to help. From our AI voice assistant to our team of virtual receptionists who can handle just about anything you need, we have solutions of all kinds. 

To learn more, schedule a consultation or reach out to hello@smith.ai.


Business Education
Written by Samir Sampat

Samir Sampat is a Marketing Manager with Smith.ai. He has experience working with businesses of all sizes focusing on marketing, communications, and business development.

Take the faster path to growth.
Get Smith.ai today.

Affordable plans for every budget.
