Proven Shortcuts to Enterprise-Scale Machine Learning

Although applications of natural language processing are increasing, machine learning is still the dominant manifestation of artificial intelligence in professional use today. Traditionally, several requirements of this technology inhibited its wider deployment, especially for small and medium-sized organizations. These primarily involved:

Data Scientists: At one point, it was widely believed machine learning required a slew of well-paid, difficult-to-find data scientists handcrafting machine learning models to leverage its benefits throughout the enterprise.

Training Data: Traditionally, machine learning necessitated a great deal of labeled training data for commonplace applications of supervised learning. Depending on the particular use case, such data might be scarce or non-existent.

Data Preparation: Worst of all, even if organizations somehow could find enough mythical data scientists to build machine learning models, these professionals were forced to spend most of their time slogging through various aspects of data preparation before performing analytics of any type.

By now, however, the majority of these perceptions no longer apply. There are a number of shortcuts organizations can leverage to overcome the demands for training data, data science, and data preparation to democratize the use of machine learning. By leveraging advances in transfer learning, ensemble modeling, and self-service data preparation, even the smallest of companies can readily access machine learning. And the business value derived from its deployment is expanding throughout the enterprise, clearly justifying its expenditures.

Transfer learning

There are several ways to overcome the shortage of annotated training data that once prevented organizations from capitalizing on supervised learning. The most proven of these is transfer learning, which reduces the amount of labeled input data necessary to accurately train machine learning models.

The effects of transfer learning on enterprise applications of machine learning are considerable. Organizations can access transfer learning models through the cloud or on-premises, and apply them to almost any machine learning use case. According to Indico CEO Tom Wilde, common unstructured data applications — such as text analytics — could require several million examples of the desired output of a machine learning model. Using transfer learning, organizations can reduce this amount to a couple hundred examples for effective models that Wilde said “are built in hours, not years.”

Another prized dimension of transfer learning is its versatility. Organizations can use the same transfer learning model — typically accessed through a vendor — to create machine learning models for any task they see fit. It not only enables them to accomplish this task with much less annotated training data than they’d otherwise need, but also allows them to use more advanced forms of machine learning, such as deep neural networks.

When working with unstructured data pertaining to text, images, or video, organizations encounter a significant amount of variability in the tasks that machine learning needs to accomplish, such as identifying all the different forms of PII in documents. For these jobs, they’re often “better off with transfer learning” and applying neural networks, Wilde said. While transfer learning reduces the amount of labeled training data (and time spent training models), neural networks shorten the time needed to solve business problems once organizations provide examples of what the networks need to do.
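To make the idea concrete, here is a minimal sketch of transfer learning in Python, using only the standard library. It is an illustration under stated assumptions, not any vendor's method: the PRETRAINED_VECTORS table is invented and stands in for a large model trained elsewhere, and the names embed, train_head, and predict are hypothetical. The pretrained part stays frozen, and only a small classifier on top of it is fitted, which is why a handful of labeled examples can be enough.

```python
import math

# Stand-in for a large pretrained model: a fixed word-vector table.
# These values are invented for illustration; in practice they would
# come from a model trained elsewhere on a large corpus.
PRETRAINED_VECTORS = {
    "refund": (0.9, 0.1),
    "broken": (0.8, 0.2),
    "late": (0.7, 0.1),
    "thanks": (0.1, 0.9),
    "great": (0.2, 0.8),
    "love": (0.1, 0.7),
}


def embed(text):
    """Frozen feature extractor: average the vectors of known words."""
    vectors = [
        PRETRAINED_VECTORS[word]
        for word in text.lower().split()
        if word in PRETRAINED_VECTORS
    ]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train_head(examples, epochs=500, lr=0.5):
    """Fit a small logistic-regression head on the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = embed(text)
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            # Only the head's parameters change; the vectors never do.
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(model, text):
    """Return 1 (complaint) or 0 (praise) for a piece of text."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, embed(text)))
    return 1 if z > 0 else 0


# Only four labeled examples: 1 = complaint, 0 = praise.
LABELED = [
    ("refund please", 1),
    ("item arrived broken", 1),
    ("thanks so much", 0),
    ("great service", 0),
]
model = train_head(LABELED)
```

Because the word vectors already encode which words are similar, the head can classify text containing words such as "late" or "love" that never appeared in the four labeled examples.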

Ensemble modeling

Data science is a lengthy, iterative process. Once data scientists devise machine learning models and implement them in production, they frequently go back and adjust them to improve their accuracy. Then the cycle repeats: models are adjusted, redeployed, and adjusted again. The objective is to create the most accurate models possible. Ensemble modeling is an alternative to this time-consuming process that yields greater business value more quickly. Instead of attempting to create a perfect model, or continually refining models in production to get as close to perfect as possible, ensemble modeling combines the predictions of several models to get more accurate results than any individual model provides.

According to Smartest Systems Principal Consultant Julian Loren, many people became aware of the efficacy of ensemble modeling during the Netflix recommendation engine challenge, in which the media provider sought to improve the results of its recommendations. Despite the profound subject matter expertise of Netflix’s personnel, teams of data scientists handily won the challenge by creating extensive ensemble models that bested Netflix’s approach “by double digits,” Loren recalled. The challenge demonstrated that simply combining different machine learning models frequently delivers more accurate results than perfecting a single model does, without having to finance costly subject matter experts or spend months in iterations.

Ensemble modeling is particularly effective on business problems with a range of variables or machine learning factors, such as recommendation engines. It enables organizations to incorporate what Loren termed “all different sorts of math” into a single, defined objective. Users can improve the results of ensemble models with ensemble management, which usually involves voting or weighting the results of some models in the ensemble more heavily than others. However, Loren noted that even if organizations give each model an equal vote, the ensemble is likely to be more accurate than individual models are. He observed that ensembling is ideal for “recommendation engines, complex analytics, or the kind of basic problem where there’s not one answer … [For] these fuzzy problems the ensemble wins, and the bigger the ensemble, the better.”
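The voting and weighting described above can be sketched in a few lines of Python. The function names majority_vote, weighted_vote, and majority_accuracy are hypothetical, chosen for this illustration. The last function rests on a strong simplifying assumption that real ensembles rarely meet in full: a binary task where every model is right with the same probability, independently of the others.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the label predicted by the most models (equal votes)."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def majority_accuracy(n_models, p):
    """Probability that a majority of n_models is correct.

    Assumes a binary task, an odd n_models, and that each model is
    correct with probability p independently of the others.
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )
```

Under those assumptions, three models that are each right 70 percent of the time reach about 78.4 percent with an equal vote, and the figure keeps rising as models are added. The gain shrinks when the models tend to make the same mistakes, which is one reason ensembles mix different kinds of models.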

Self-service data preparation

The ensemble model approach is much more cost-effective and efficient than continually financing dedicated subject matter experts to inform machine learning. However, it doesn’t overcome what many consider the biggest barrier to using this technology. The continual process of preparing data for analytics (including cleansing, implementing data quality, integrating, and discovering data) is by far the most time-consuming aspect of leveraging advanced analytics. Recent advancements in self-service data preparation platforms, which expedite the process of transforming raw data into datasets useful for machine learning, have all but eliminated this issue. These platforms employ various techniques to ready data for machine learning models, including:

Machine learning: This sounds like a paradox, but credible options rely on many aspects of AI, from basic algorithmic AI to more advanced cognitive computing algorithms, to learn by example. For instance, if users specify a certain way to join datasets, when attempting to do so again the system will offer them the previous join to approve or decline. The same recommendations apply to downstream processes such as transformation. In other applications, these tools “utilize machine learning for identifying the structure of data,” according to Piet Loubser, Alation VP strategic marketing. Doing so is essential for working with semi-structured and unstructured data alongside structured data.

Business-friendly user experiences: While traditional data science tools for preparing machine learning models are not meant for the layman, contemporary data preparation tools are designed for business end users. “It doesn’t help us to create yet another IT tool and program it in SQL or Python or Spark or whatever,” Loubser said. “We just don’t have enough of those skills to go around.” Instead, using approaches centered on drag-and-drop and point-and-click methodologies, self-service data prep solutions are “visual, interactive, and designed for the business,” Loubser said, enabling these users to manipulate their own data for advanced analytics.

Collaborative features: Without the ability to share the results of previous work with other users and effectively reuse that foundation to spur further data preparation, end users would have to constantly redo the work that others had already done to ready data for machine learning models. With collaborative features, users can further reduce time spent preparing data and devote more time to actually analyzing and acting on that data. “In most of these organizations, it’s not just you,” Loubser said. “You work on something; you share a result with another person who builds that into the next process. Then six weeks, 10 weeks, [or] 12 weeks down the line we’ve got this whole ecosystem of self-service data access that’s been created.” The data lineage and governance capabilities of such platforms are critical for leveraging these collaborative approaches across organizations, so users can see what’s been done to datasets and how.
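The learn-by-example behavior described under "Machine learning" above, where a platform offers a previously approved join for the user to accept or decline, might look like the following Python sketch. This is a deliberately simple stand-in: it uses a frequency count rather than a trained model, and the JoinSuggester class and its methods are hypothetical names, not the API of any real product.

```python
from collections import Counter, defaultdict


class JoinSuggester:
    """Remembers approved join keys and offers the most common one."""

    def __init__(self):
        # Maps an unordered pair of dataset names to approval counts.
        self._history = defaultdict(Counter)

    def record_approval(self, left, right, key):
        """Store that a user approved joining left and right on key."""
        self._history[frozenset((left, right))][key] += 1

    def suggest(self, left, right):
        """Return the most-approved key for this pair, or None."""
        counts = self._history.get(frozenset((left, right)))
        if not counts:
            return None
        return counts.most_common(1)[0][0]


suggester = JoinSuggester()
suggester.record_approval("orders", "customers", "customer_id")
suggester.record_approval("orders", "customers", "customer_id")
suggester.record_approval("orders", "customers", "email")
```

Keeping the history in one shared store also reflects the collaborative point above: a join that one user approved can be suggested to the next person who works with the same pair of datasets.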

Skipping the line

These approaches to simplifying machine learning requirements allow users to access this technology much more quickly and painlessly than they could without them. Self-service data preparation platforms reduce time spent wrangling data for machine learning models. Transfer learning decreases the time spent training those models, while ensemble modeling diminishes the time and subject matter expertise required to maximize machine learning accuracy. Each of these benefits also directly relates to reduced costs for using machine learning, making it much more viable to businesses than it once was.

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.