
Responsible AI development demands an ongoing commitment to mitigate bias throughout the system’s life cycle. And synthetic data is an effective way to do that.
I started working with synthetic data early, back in 2014, when there were no tools for generating it efficiently at scale the way there are now. I was working for a startup building an app that blocked messaging apps while a driver was traveling above a certain speed. We had no reference data to train the app, so the developers would literally take their cars out to capture sensor data for various scenarios.
Our use case was clear. There was no data available, and we had to generate the entire dataset. Currently, the engineering world is facing a different problem. Data is available, but it’s not enough to build a robust system.
Elon Musk said in a recent interview: “Now we have exhausted almost the entire body of human knowledge in the field of AI training. This happened last year. The only way to complement [real-world information] is with synthetic data, when AI creates [training information]. With such materials, [AI] will sort of evaluate itself and go through this self-learning process.”
In fact, not having enough data to train systems is also one of the top reasons for AI project failures. Engineering teams now need to develop synthetic data to fill gaps in existing datasets and reduce biases in their models. In this post, I have covered the most common types of biases in AI systems and how synthetic data can help mitigate them.
See also: NIST: AI Bias Goes Way Beyond Data
Top biases and how synthetic data can handle them
Bias is complicated. You have a dataset, but it has issues. Developers may end up creating models in line with this problematic dataset rather than in line with reality. That’s why you need synthetic data to supplement your existing data. The first step to achieve this is to identify the type of bias in the data and then generate synthetic data to close the gaps.
But how do you pinpoint which bias exists and how to solve it? I have covered the most common categories of biases below, how to identify them, and how to solve them with synthetic data.
1. Selection bias
Selection bias is one of the most common types of bias, where the data is incomplete and doesn’t represent the entire target audience. For example, you are building a grocery delivery app for metropolitan and developing cities, but your current dataset has only data from metropolitan cities.
How to detect this bias?
When you perform a deeper analysis of the input datasets, you will notice you are missing data for certain demographics, such as location, age, gender, or ethnicity.
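As a minimal sketch of such an analysis, you can compare the share of each segment in your dataset against the mix you expect in the target audience. The records, tier labels, expected shares, and the 0.2 tolerance below are all hypothetical:

```python
from collections import Counter

# Hypothetical sample of user records from the grocery-delivery dataset.
orders = [
    {"city": "Mumbai", "tier": "metro"},
    {"city": "Delhi", "tier": "metro"},
    {"city": "Bengaluru", "tier": "metro"},
    {"city": "Nagpur", "tier": "developing"},
]

counts = Counter(r["tier"] for r in orders)
total = sum(counts.values())

# Flag any segment whose share deviates too far from the expected mix
# (here: roughly half metro, half developing, with a 0.2 tolerance).
flagged = []
for tier, expected in {"metro": 0.5, "developing": 0.5}.items():
    actual = counts.get(tier, 0) / total
    if abs(actual - expected) > 0.2:
        flagged.append(tier)

print("Segments outside the expected mix:", flagged)
```

In a real project the expected shares would come from market research or business input, not from a hard-coded dictionary.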
How to solve this bias with synthetic data?
To overcome selection bias, work with a data scientist and business stakeholders to understand what the missing data should look like: its peculiarities and features. Take the same grocery-delivery example. Data scientists can share insights into what data from developing cities will look like; for instance, you might expect roughly a 35% youth and 30% adult population. They usually draw on external reports, internal surveys, or awareness campaigns for such information. Once you have sufficient features, you can generate synthetic data from them and combine it with the original data into a comprehensive dataset, which in turn enables models that yield more accurate results.
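A minimal sketch of that generation step, assuming the 35%/30% age mix above came from a survey; the remaining 35% share, the field names, and the `synth_developing_city_rows` helper are all hypothetical:

```python
import random

random.seed(42)  # reproducible draws for the example

# Assumed target mix for a developing city (survey-based estimates):
# 35% youth, 30% adult, 35% senior (the last figure is made up to sum to 1).
age_groups = ["youth", "adult", "senior"]
weights = [0.35, 0.30, 0.35]

def synth_developing_city_rows(n):
    """Draw n synthetic user rows whose age mix matches the survey estimates."""
    return [
        {"city_tier": "developing", "age_group": random.choices(age_groups, weights)[0]}
        for _ in range(n)
    ]

synthetic = synth_developing_city_rows(1000)
# Concatenate `synthetic` with the original metro-only data before training.
```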
2. Survivorship bias
Survivorship bias is another type of bias where you have more data for success scenarios (survivors) and less for failed cases (non-survivors). For example, in a typical eCommerce workflow, customers can drop out at any stage. Companies have more data for successful purchases than for churns. As a result, the model isn’t optimized properly for negative workflows.
How to detect this bias?
If your dataset consists almost entirely of happy-path records, there is a high chance it suffers from survivorship bias.
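One quick way to quantify this is a class-imbalance check on the outcome label. The labels and the 10% threshold below are hypothetical:

```python
from collections import Counter

# Hypothetical order-funnel labels: 1 = completed purchase, 0 = churned.
labels = [1] * 950 + [0] * 50

counts = Counter(labels)
churn_share = counts[0] / len(labels)

# If failures make up only a sliver of the records, suspect survivorship bias.
suspect = churn_share < 0.10
print(f"churn share = {churn_share:.0%}, survivorship bias suspected: {suspect}")
```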
How to solve this bias with synthetic data?
You can run surveys to understand failed cases (non-survivors). Once you identify the possible non-survivor workflows for a limited sample set, you can extrapolate them to create a bigger volume of synthetic data. This synthetic data, along with real data, will give you a complete dataset for model training.
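The extrapolation step above can be sketched as oversampling the surveyed churn cases with a small amount of jitter on numeric fields. The survey fields, the jitter range, and the `extrapolate` helper are all hypothetical:

```python
import random

random.seed(0)

# A small surveyed sample of churn (non-survivor) sessions: the stage where
# the user dropped out and minutes spent before leaving.
surveyed_churns = [
    {"drop_stage": "payment", "minutes": 4.0},
    {"drop_stage": "cart", "minutes": 2.5},
    {"drop_stage": "address", "minutes": 3.2},
]

def extrapolate(samples, n, jitter=0.5):
    """Oversample the surveyed churn cases, jittering numeric fields slightly
    so the synthetic rows are not exact duplicates."""
    out = []
    for _ in range(n):
        base = random.choice(samples)
        out.append({
            "drop_stage": base["drop_stage"],
            "minutes": round(base["minutes"] + random.uniform(-jitter, jitter), 2),
            "converted": 0,  # all of these are non-survivor records
        })
    return out

synthetic_churns = extrapolate(surveyed_churns, 500)
# Merge `synthetic_churns` with the real purchase data for training.
```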
3. Historical/Racial/Association bias
Historical/racial/association bias is another type of bias where systems favor or disadvantage a specific gender or race due to past prejudices. For example, Amazon developed an AI tool for reviewing job applications and trained it on historical hiring data. Because more men had been hired for certain technical roles in the past, the model started favoring male candidates due to this historical bias.
How to detect this bias?
Check whether there is a pattern in failed and successful scenarios that stems from past stereotypes: anything favorable or unfavorable toward certain ethnic groups, races, or genders.
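A simple pattern check is to compare outcome rates across groups. The hiring records and the size of the gap below are hypothetical:

```python
from collections import defaultdict

# Hypothetical historical hiring records: (gender, hired?)
records = [("M", 1)] * 80 + [("M", 0)] * 20 + [("F", 1)] * 30 + [("F", 0)] * 70

totals = defaultdict(lambda: [0, 0])  # group -> [hires, applicants]
for group, hired in records:
    totals[group][0] += hired
    totals[group][1] += 1

rates = {g: h / n for g, (h, n) in totals.items()}
gap = abs(rates["M"] - rates["F"])
# A large gap between equally qualified groups hints at historical bias.
print(f"hire rates: {rates}, gap = {gap:.0%}")
```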
How to solve this bias with synthetic data?
Create synthetic data that negates the prejudices so your AI model gives a fair chance to everyone. For the Amazon hiring example, this would mean creating synthetic data of women hired for technical roles.
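One way to sketch this is counterfactual generation: mirror each hired record with the gender flipped while keeping qualifications and outcome identical. The row fields and the `counterfactuals` helper are hypothetical:

```python
import copy

# Hypothetical historical rows; skills should drive the hire, gender should not.
hired_rows = [
    {"gender": "M", "role": "engineer", "years_exp": 6, "hired": 1},
    {"gender": "M", "role": "engineer", "years_exp": 4, "hired": 1},
]

def counterfactuals(rows, flip={"M": "F", "F": "M"}):
    """Create mirror-image synthetic rows with gender swapped but the
    qualifications and the outcome kept identical."""
    out = []
    for row in rows:
        cf = copy.deepcopy(row)
        cf["gender"] = flip[cf["gender"]]
        cf["synthetic"] = True  # tag generated rows for traceability
        out.append(cf)
    return out

balanced = hired_rows + counterfactuals(hired_rows)
```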
4. Measurement, label, and reporting bias
Measurement, label, or reporting bias occurs when the data itself is incorrect, either due to systemic issues or bias in the person who collected it. For instance, a company uses document automation software to digitize its physical invoices, and the software has glitches that map the entire dataset incorrectly. This mis-mapped data is then passed on to developers to build an expense forecasting model.
How to detect this bias?
You can initially inspect the data for possible labeling mistakes. However, manual validation won’t always surface the issues, so it is worth building a base model; if the model gives wrong results, you can reverse engineer from them to identify mistakes in the data.
How to solve this bias with synthetic data?
You can replicate the entire dataset and fix only the problematic or incorrect columns. For instance, if only the tax amount was mapped incorrectly in the expenses, you can recreate the entire dataset by fixing just that column. That gives you a more accurate dataset for training the model.
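A minimal sketch of that column-level fix, assuming the broken tax amount can be recomputed from fields that were captured correctly; the invoice schema and the `fix_tax_column` helper are hypothetical:

```python
# Hypothetical digitized invoices where only tax_amount was mis-mapped by
# the OCR software (it captured the total instead of the tax).
invoices = [
    {"invoice_id": 1, "subtotal": 100.0, "total": 118.0, "tax_amount": 118.0},
    {"invoice_id": 2, "subtotal": 200.0, "total": 236.0, "tax_amount": 236.0},
]

def fix_tax_column(rows):
    """Recreate the dataset, recomputing only the broken column from
    columns known to be correct; everything else is copied unchanged."""
    fixed = []
    for row in rows:
        clean = dict(row)
        clean["tax_amount"] = round(clean["total"] - clean["subtotal"], 2)
        fixed.append(clean)
    return fixed

clean_invoices = fix_tax_column(invoices)
```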
5. Rare event bias
As the name suggests, rare event bias occurs when the model fails to handle those rare or once-in-a-million edge cases. For instance, you are building an AI-powered remediation system, and you do not have a sample dataset for those rare but possible failures.
How to detect this bias?
Work with data scientists and the business team to identify edge cases and verify if you have data for it.
How to solve this bias with synthetic data?
Once you have input from data scientists and the business team about all possible edge cases, generate synthetic data for them to train a robust model.
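That step can be sketched as expanding a catalog of identified edge cases into many synthetic log rows. The failure modes, fields, and sample counts below are hypothetical:

```python
import random

random.seed(7)

# Hypothetical edge cases listed by data scientists for an AI-powered
# remediation system: failure modes that rarely appear in production logs.
edge_cases = [
    {"failure": "disk_full", "severity": "high"},
    {"failure": "cert_expired", "severity": "high"},
    {"failure": "clock_skew", "severity": "medium"},
]

def synth_rare_events(cases, per_case=200):
    """Generate many synthetic log rows per rare failure mode so the model
    sees enough examples of each."""
    rows = []
    for case in cases:
        for _ in range(per_case):
            rows.append({**case, "latency_ms": random.randint(50, 5000)})
    return rows

rare_event_data = synth_rare_events(edge_cases)
```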
6. Confirmation bias
Confirmation bias is another bias where a model sticks too closely to existing data and fails to account for the fact that people or conditions change. We have all landed on a streaming app with content from all over the world, yet felt there was nothing exciting to watch. That’s because the recommendation engine is biased toward the kind of content you liked in the past, even though you are no longer in the mood for that genre today. There is no room for flexibility, and the model has completely overfitted to the existing dataset.
How to detect this bias?
The simplest way to detect this bias is to examine the model’s results. In the case of overfitting, the model will typically show high accuracy while the outcomes remain unfavorable. In the streaming example above, the model showed the “right” recommendation, but the user simply logged out of the app.
How to solve this bias with synthetic data?
Usually in such cases the model leaves no room for nuance because it has overfitted to the existing data. You can add that nuance with synthetic data. In the example above, you could generate synthetic profiles with a healthy mix of genres, so when the model looks for similar profiles to drive suggestions, it can recommend something different because other similar profiles have watched it.
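One simple way to sketch that idea is to blend a user's observed genre distribution with a uniform one, so every genre keeps some probability mass. The genre list, the 0.2 exploration weight, and the `blend_profile` helper are all hypothetical:

```python
def blend_profile(history, exploration=0.2):
    """Mix a user's observed genre distribution with a uniform distribution
    so the recommender never assigns zero weight to unseen genres."""
    genres = set(history) | {"drama", "comedy", "documentary", "thriller"}
    uniform = 1.0 / len(genres)
    total = sum(history.values())
    return {
        g: (1 - exploration) * (history.get(g, 0) / total) + exploration * uniform
        for g in genres
    }

# Hypothetical viewing history heavily skewed toward one genre.
profile = blend_profile({"thriller": 90, "comedy": 10})
# Every genre now has non-zero weight, so recommendations can diverge.
```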
7. Temporal bias
Temporal bias happens when the data is old and no longer reflects current conditions. For example, you are building an AI-powered logistics platform that uses census data as a base to understand population density, but the census you have is 12 years old and no longer accurately represents the current population.
How to detect this bias?
Always understand the source of the data and when it was generated, and verify if it remains valid in the current circumstances.
How to solve this bias with synthetic data?
Work with data scientists and business teams to project current conditions and create a synthetic dataset based on that.
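A minimal sketch of such a projection, compounding the stale census figures forward with assumed annual growth rates; the districts, populations, and growth rates below are all hypothetical:

```python
# Hypothetical 12-year-old census figures and assumed annual growth rates
# supplied by data scientists and business teams.
census_2013 = {"district_a": 1_000_000, "district_b": 250_000}
annual_growth = {"district_a": 0.012, "district_b": 0.030}
years_elapsed = 12

# Compound each district's population forward to estimate today's figures.
projected = {
    d: round(pop * (1 + annual_growth[d]) ** years_elapsed)
    for d, pop in census_2013.items()
}
# Use `projected` as the synthetic, present-day population baseline.
```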
Solving bias is a continuous process
Bias isn’t something you tackle once while building a system. As data is continuously changing, bias can propagate over time. It’s essential to periodically review the data and model for any biases that may affect performance. Ultimately, responsible AI development demands an ongoing commitment to mitigate bias throughout the system’s life cycle. And synthetic data is an effective way to do that.