How Brands Can Spot Bad Data Before It Ruins Their AI Project

According to a June 2020 IDC survey of IT and business decision-makers, AI adoption and spending are rising globally. Yet the same survey highlighted key implementation challenges: roughly 28% of AI and machine learning initiatives fail in some way, and one of the primary reasons cited for project failure was a “lack of production-ready data.” With many global businesses under unprecedented financial pressure from the COVID-19 recession, that pressure is felt on AI projects as well. Speed and cost efficiency are desirable standards for any AI project, but both can be undermined by incomplete or problematic data: in other words, “bad data.”

What does bad data look like?

The data sets we define as “bad” tend to warrant this label for at least one of the following reasons.

1.     Problem with data size

As a general rule, the larger the data set, the better an AI project is likely to perform. A smaller data set limits a project from the start: even if an AI platform can perform the necessary functions, the accuracy and statistical significance of its results will be reduced. A notably large data set, on the other hand, tends to improve both. The potential issue with larger data sets lies less in the data itself than in the management it demands from human analysts and consultants. This is where the frequently used industry expression, “big data, big mess,” comes into play.
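The statistical effect of data size can be illustrated with a minimal sketch (not from the article; the numbers and setup are illustrative): the uncertainty of an estimate learned from data shrinks roughly as 1/sqrt(n), so a model trained on 100 examples carries about ten times the estimation error of one trained on 10,000.

```python
import math
import random

random.seed(0)

def estimate_error(n, trials=500):
    """Empirical standard error of a simple estimate (the mean of a fair
    coin flip) computed from n samples, averaged over many trials."""
    means = []
    for _ in range(trials):
        sample = sum(random.random() < 0.5 for _ in range(n))
        means.append(sample / n)
    mu = sum(means) / trials
    return math.sqrt(sum((m - mu) ** 2 for m in means) / trials)

small = estimate_error(100)    # small data set
large = estimate_error(10000)  # 100x more data
# Error shrinks roughly as 1/sqrt(n): ~10x smaller with 100x the data
print(f"n=100: +/-{small:.3f}   n=10000: +/-{large:.3f}")
```

The same principle applies to any learned quantity, which is why a small data set caps a project's achievable accuracy before modeling even begins.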

2.     Problem with the data source

The source(s) of the data are another important consideration in AI models. A source needs to be valid for the specific business problem the company is trying to solve, meaning the source and the data it produces should be as accurate and unbiased as possible. The source also needs to be active and available during critical periods of AI production and deployment. If a data source is unavailable at the time the model makes predictions (a common problem), the project is doomed to fail. To ensure a successful AI solution, project teams should therefore carefully assess not only the data sources needed to train the model, but also the data sources the model will need to make predictions once fully operational.
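One cheap safeguard against the prediction-time availability problem is to validate every incoming request against the feature list the model was trained on. A minimal sketch, with hypothetical feature names (nothing here comes from the article):

```python
# Features the model was trained on -- hypothetical names for illustration.
TRAINED_FEATURES = ["customer_age", "last_purchase_days", "region"]

def validate_prediction_input(payload: dict) -> list:
    """Return the features that are missing or null in the incoming
    prediction payload; an empty list means it is safe to predict."""
    return [f for f in TRAINED_FEATURES if payload.get(f) is None]

# The source feeding 'last_purchase_days' has gone offline:
missing = validate_prediction_input({"customer_age": 41, "region": "EU"})
print(missing)  # -> ['last_purchase_days']
```

Rejecting or flagging such requests up front is far cheaper than letting a model silently predict on incomplete inputs.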

3.     Problem with environmental context for the data

Environmental context always serves an important role in the creation, deployment, and ongoing management of AI solutions. However, during periods of major change, when historical patterns no longer hold in the present, environmental context can move directly to the forefront of an AI project, threatening to dismantle the data findings entirely. Gary Marcus, cognitive scientist and professor at NYU, recently commented on this phenomenon. “Top algorithms are left flat-footed when data they’ve trained on no longer represents the world we live in,” he said. Furthermore, continuous change and volatility can create even more confusion for both the AI solution and the humans monitoring the data.
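Teams can catch this kind of shift by comparing live inputs against the training distribution. A minimal drift-check sketch, using only the standard library and made-up numbers (the threshold of 3 sigma is an illustrative assumption, not a standard):

```python
import statistics

def drift_score(train, live):
    """How far the live mean has shifted, in units of the
    training data's standard deviation."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    return abs(statistics.mean(live) - mu) / sigma

train = [10, 12, 11, 13, 12, 11, 10, 12]  # pattern the model learned
live = [25, 27, 24, 26, 28, 25, 27, 26]   # behavior after a major shock

score = drift_score(train, live)
if score > 3:  # hypothetical alert threshold
    print(f"ALERT: input distribution shifted by {score:.1f} sigma")
```

Real monitoring systems use richer statistical tests, but even a crude check like this can flag the moment the world stops resembling the training data.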

Why are certain AI projects more vulnerable? 

Data issues like the three types identified above tend to arise, at least in part, from problems with project teams and corporate management. At the project team level, this often means not having enough staff with the appropriate technical expertise, but it can also involve poor collaboration or conflict between team members. At the management level, insufficient interest or financial backing from senior leadership, or conversely its micromanaging tendencies, can undermine how the data is formulated.

Projects can also become vulnerable because of temporary inattentiveness to the data. Machine learning models are meant to respond to changes in their operating environment, but they also need to be monitored thoroughly. As enticing as AI and automation may seem on paper as a replacement for certain human activity, in practice humans still need to stay consistently involved in their AI projects, even long after those projects have been deployed to the public.

Maintaining a workflow balance through automation

Automation technology can be used not only in the specific AI/machine learning solution going to market, but also to manage the project itself and to monitor the productivity and contributions of team members. When evaluating potential automation solutions, IT decision-makers should consider how well a product integrates into team members’ existing or planned workflows, and what it can do to help launch or monitor the AI project. Platforms that reduce the time, cost, and friction on a project (including the data issues discussed above) are desirable not only to IT teams, but also to business leaders and investors looking for a clear Return on AI Investment (RoAI).


Pedro Alves is the founder and CEO of Ople.AI, a software startup that provides an Automated Machine Learning platform to empower business users with predictive analytics.