Synthetic Data: Innovation or Illusion?
Too many artificial intelligence correspondents are singing the praises of synthetic data without diving into the details. Synthetic data is a subset of one of the three driving factors that will decide the pace of advancement of AI models. These factors are:
- Model Architecture
- Compute (or energy)
- Data (synthetic and real)
So far, more compute applied to more data within the same transformer architecture has correlated with more powerful models. If current models have already been trained on the entire internet, then data might be the first limiting factor impeding the advancement of AI models. This is where synthetic data enters the equation.
Synthetic data is artificially generated data that mimics real-life data. It is dummy data. It is NOT real. If synthetic data is not real, then why do we care about it? Even before the proliferation of large language models (LLMs), generating synthetic data was far easier and cheaper than collecting real data. It requires less labor. You do not need any special instruments. And it is much faster. If you want to build a model that predicts future sales growth based on past sales growth for a global company, collecting and integrating data across dozens of divisions is much harder than generating synthetic data in a few minutes. Since LLMs became mainstream, generating synthetic data has become easier than ever, especially for unstructured data (e.g., text, voice, video).
Because synthetic data is cheaper and faster to create than real data, it is tempting to use it in as many situations as possible. Continuing the previous use case, let’s assume you know that ABC Company has 5 divisions and that annual sales fall between 1 and 20 million dollars per division. If you generated fake data, you might assume an approximately normal distribution and get the middle column of results in the table below (a sketch of how that column could be generated follows the table). The right column contains the actual sales data.
| Division | Synthetic Data | Real Data |
| --- | --- | --- |
| 1 | $14.9 million | $12.1 million |
| 2 | $10.0 million | $18.0 million |
| 3 | $9.6 million | $10.1 million |
| 4 | $13.7 million | $2.6 million |
| 5 | $7.4 million | $16.7 million |
| TOTAL | $55.6 million | $59.5 million |
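To make this concrete, here is a minimal sketch of how the synthetic column above might have been generated. This is not the author's actual method; the mean, standard deviation, and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Assumption: sales are known to fall between $1M and $20M per division,
# so we (naively) model division sales as roughly normal around the midpoint.
mean_sales, std_sales = 10.5, 4.0  # in millions of dollars; illustrative values
n_divisions = 5

# Draw one synthetic annual sales figure per division, clipped to the known range.
synthetic_sales = rng.normal(mean_sales, std_sales, size=n_divisions)
synthetic_sales = np.clip(synthetic_sales, 1.0, 20.0)

real_sales = np.array([12.1, 18.0, 10.1, 2.6, 16.7])  # from the table above

print(f"Synthetic total: ${synthetic_sales.sum():.1f}M")
print(f"Real total:      ${real_sales.sum():.1f}M")
```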
The synthetic data in the table above yielded a total sales value about 4 million dollars below reality. It assumed a roughly normal distribution of division sales, while the real distribution was skewed: one division earned far less than the rest. Predictions of total company sales based on the synthetic data would underestimate reality, and predictions at the division level were often completely wrong (Division 4's synthetic figure was $13.7 million against a real $2.6 million). Without data grounded in reality, executives would make the wrong decisions.
While synthetic data is not useful for this specific sales forecasting use case, it is useful for a myriad of others. Synthetic data might be useful for improving foundational models that focus on unstructured data. Foundational models are massive, versatile AI systems pre-trained on diverse data for wide applications, while traditional models are smaller, task-specific, and trained for narrow, predefined purposes. The sales forecasting example above would have required a traditional model, while the engine that powers ChatGPT is a foundational model that was mostly trained on text from the internet. In theory, ChatGPT could generate millions of synthetic sentences that are fed back into the model to train it and improve its performance (a rough sketch of that generation step appears below). While this is possible, AI professionals are far from certain that a high volume of synthetic data will significantly improve a model's performance.
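As a rough illustration of that generation step, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt, and helper function are hypothetical choices for this example, and whether such output actually improves a model depends far more on its quality and diversity than on its volume.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_synthetic_sentences(topic: str, n: int) -> list[str]:
    """Ask an LLM for n synthetic training sentences about a topic (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You generate diverse, factual training sentences."},
            {"role": "user", "content": f"Write {n} distinct sentences about {topic}, one per line."},
        ],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

sentences = generate_synthetic_sentences("supply chain logistics", 20)
print(f"Generated {len(sentences)} synthetic sentences.")
```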
How do you know when synthetic data will be useful for your use case? Synthetic data might be useful when your use case satisfies most (or all) of the following conditions:
- The goal is to train or improve a foundational model
- The synthetic data follows the same distribution as the real data (a quick statistical check is sketched after this list)
- The real distribution of data is relatively stable over time
- Accuracy is not paramount
- Accuracy is subjective
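On the distribution condition above: one simple way to check it is a two-sample Kolmogorov-Smirnov test, sketched below with SciPy on made-up samples. In practice you would compare your generated data against a held-out sample of real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Illustrative samples: "real" sales are skewed, synthetic sales are normal.
real = rng.lognormal(mean=2.3, sigma=0.6, size=500)
synthetic = rng.normal(loc=11.0, scale=4.0, size=500)

# Two-sample KS test: a small p-value suggests the distributions differ.
statistic, p_value = stats.ks_2samp(real, synthetic)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Distributions likely differ; the synthetic data may mislead your model.")
else:
    print("No evidence of a distribution mismatch at this sample size.")
```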
When the facts a model needs exist outside of its training data, you need to bridge the gap by grounding the model in real-world facts. This requires the inclusion of real data to ensure the outputs of the model are accurate, relevant, and tailored to the specific task. Public and proprietary real-world data will always have a place in AI; synthetic data, however, still has much to prove. Do not be fooled by the illusion that synthetic data replaces real data.
~ The Data Generalist
Data Science Career Advisor