How to Generate Synthetic Data for Software Testing (with AI)

Maria Homann

May 29, 2024

Testing software without proper test data is a bit like baking a cake without the right ingredients. Test data plays an essential role in ensuring that your software works under various conditions. And generating the right data at speed is critical for productivity and success. In this article, we’ll dive into the topic of synthetic data generation and show you how AI is effectively changing the way we approach test data generation.

What is test data?

Test data is the information we use to make sure our software works correctly, performs well, and is reliable during testing phases.

Test data can come in two forms:

Synthetic (generated): Created to simulate real data without using actual data. Think of it as creating a virtual twin of your data - realistic but completely safe.
Real (extracted from actual databases): Pulled from existing databases to provide authentic testing scenarios.

Having the right test data is crucial because it ensures that the system is tested under realistic conditions, which helps in identifying hidden bugs and performance issues that might not surface with unrealistic or incomplete data. This realism ensures that the system behaves as expected in production, providing accurate, reliable, and secure outcomes for actual users. In short, test data helps us ensure the software we develop is robust and reliable.

But let’s face it. Generating test data for software testing is no walk in the park. Often you can’t use real data as it contains sensitive information or Personally Identifiable Information (PII), which means you need to generate realistic synthetic data that mimics real-world scenarios.

However, generating synthetic data comes with its own set of challenges. Ensuring that synthetic data complies with privacy regulations can be a real headache. Keeping the data updated as applications change feels like a never-ending task, and generating large volumes of data at high speed can seriously strain resources. Plus, integrating this data with automated testing frameworks - essential for continuous testing - is technically challenging. And let's not forget about legacy systems; making old data formats work with new testing environments can be a nightmare.

So how do you go about this daunting task, when timelines are tight and resources are scarce?

How to generate synthetic data

Generating test data is a critical part of the software testing process, and there are several ways to do it, each evolving as technology advances.

Here are some common methods:

Manual data generation: This involves creating data by hand based on specific requirements. While straightforward, it can be time-consuming and prone to human error. Plus, let’s face it - it can be incredibly tedious.
Automated data generation tools: These tools automatically generate test data based on predefined parameters and rules. They can quickly and consistently produce large volumes of data, for example in a spreadsheet that you can then use for data-driven testing.
Data masking: Anonymizing or obfuscating real production data helps protect sensitive information while retaining its structure and properties. Rather than generating new data from scratch, this is about altering or transforming data. In some cases, only some pieces of the real data are removed or altered to protect sensitive information and the rest is left intact.
Using AI and Machine Learning: Employing AI and machine learning models to generate realistic and varied test data that mimics real-world scenarios is the cutting edge of test data generation. These models can learn from existing data and produce new data that follows the same patterns.

Each method has its advantages and is suited to different testing needs, helping ensure comprehensive and effective software testing.
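To make the rule-based approach concrete, here is a minimal sketch in Python of what an automated generator might do. The function name, the field names, and the small name lists are all assumptions made for illustration, not part of any specific tool.

```python
import random


def generate_customers(count, seed=0):
    """Return `count` synthetic customer records built from simple rules.

    Hypothetical example: the fields and value ranges are illustrative only.
    """
    rng = random.Random(seed)  # seeded so a failing test run can be reproduced
    first_names = ["Ana", "Ben", "Chloe", "Dev", "Emma"]
    last_names = ["Ito", "Khan", "Lopez", "Meyer", "Novak"]

    records = []
    for i in range(count):
        first = rng.choice(first_names)
        last = rng.choice(last_names)
        records.append({
            "id": i + 1,
            "first_name": first,
            "last_name": last,
            # example.com is reserved for documentation, so no real inbox is used;
            # the id suffix keeps every address unique
            "email": f"{first}.{last}.{i + 1}@example.com".lower(),
            "age": rng.randint(18, 90),  # rule: adults only
        })
    return records


if __name__ == "__main__":
    for row in generate_customers(3):
        print(row)
```

Seeding the generator is a deliberate choice here: the data is still varied, but the same seed gives the same records, which helps when you need to reproduce a bug.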

Let’s dive deeper into using AI and machine learning for generating test data, as this area is rapidly evolving and offers exciting new possibilities.

Generating test data with AI

With the emergence of generative AI tools like ChatGPT, generating vast amounts of data quickly has become much easier. You can prompt ChatGPT to create the exact data you need in precise quantities.

However, there is one drawback to using ChatGPT: Once you produce the data, you’re left with the task of inputting it into your test setup. This task, while not difficult, can be time-consuming and tedious.
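As a rough illustration of that extra step, the sketch below takes CSV text copied out of a chat tool and turns it into rows a test can consume. The sample data, the `load_rows` helper, and the field names are assumptions for this example; your own test setup will likely need a different format.

```python
import csv
import io

# Hypothetical output pasted from a generative AI tool
SAMPLE = """first_name,last_name,email
Ana,Ito,ana.ito@example.com
Ben,Khan,ben.khan@example.com
"""


def load_rows(csv_text, required_fields):
    """Parse CSV text and keep only the required fields, rejecting incomplete rows."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    rows = []
    for row_no, row in enumerate(reader, start=1):
        # Generated data is not guaranteed to be complete, so check before use
        missing = [f for f in required_fields if not (row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"data row {row_no}: missing {missing}")
        rows.append({f: row[f].strip() for f in required_fields})
    return rows


if __name__ == "__main__":
    print(load_rows(SAMPLE, ["first_name", "email"]))
```

Even a small script like this has to be written, run, and maintained separately from the tests themselves, which is where the time goes.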

The good news is that you can generate test data without first producing it in an external tool and then inputting it into your testing setup. After all, the reason you introduced test automation in the first place was likely to boost productivity, so why slow it down with cumbersome processes when it comes to test data generation?

Leapwork’s AI blocks for test data generation

Leapwork has introduced an entirely new way to manage test data generation. Within Leapwork's test automation platform, you can use the Generate AI capabilities to create data directly in your test automation flows. There’s no need to leave your automated testing setup to generate the data. You can specify exactly what data you need, and the block generates unique data that follows those specifications each time you run the test.

But there’s more. If you need to transform data, such as masking production data for testing purposes, you can use the Transform AI capabilities. This feature addresses privacy concerns and allows you to alter data on the fly.
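For readers who want a feel for what masking involves in general, here is a minimal Python sketch. It is not how Leapwork's Transform capability works internally; the `mask_record` helper, the salt value, and the field names are assumptions for illustration. Note also that hashing values like this is pseudonymization, which may not be enough on its own to satisfy every privacy regulation.

```python
import hashlib


def mask_record(record, sensitive_fields, salt="test-salt"):
    """Return a copy of `record` with sensitive values replaced by stable tokens.

    Hypothetical helper: the same input always maps to the same token, so
    relationships between records survive masking.
    """
    masked = dict(record)  # copy, so the original record is left untouched
    for field in sensitive_fields:
        if field in masked:
            raw = (salt + str(masked[field])).encode("utf-8")
            digest = hashlib.sha256(raw).hexdigest()
            masked[field] = f"{field}_{digest[:10]}"
    return masked


if __name__ == "__main__":
    customer = {"first_name": "Ana", "email": "ana@example.com", "plan": "pro"}
    print(mask_record(customer, ["first_name", "email"]))
```

The non-sensitive `plan` field passes through unchanged, which mirrors the idea of altering only the parts of the data that need protecting.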

Lastly, you can use the Extract AI capabilities if you want to select specific data within a dataset. For example, if your dataset includes First Name, Last Name, Phone Number, and Email, but you only need the First and Last Names, you can simply specify this, and it will automatically extract just that data.
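Outside of any particular tool, the same idea can be sketched in a few lines of Python. This is only an illustration of field extraction in general, with a hypothetical `extract_fields` helper and made-up records; it says nothing about how the Extract capability is implemented.

```python
def extract_fields(records, fields):
    """Keep only the named fields from each record (hypothetical helper)."""
    return [{f: r[f] for f in fields if f in r} for r in records]


if __name__ == "__main__":
    dataset = [
        {
            "First Name": "Ana",
            "Last Name": "Ito",
            "Phone Number": "555-0100",
            "Email": "ana.ito@example.com",
        },
    ]
    print(extract_fields(dataset, ["First Name", "Last Name"]))
```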

About the authors

Maria Homann has 5 years of experience in creating helpful content for people in the software development and quality assurance space. She brings together insights from Leapwork’s in-house experts and conducts thorough research to provide comprehensive and informative articles. These AI articles are written in collaboration with Claus Topholt, a seasoned software development professional with over 30 years of experience, and a key developer of Leapwork's AI solutions.