Not Your Father’s Next Best Action! Why Untimed Single KPI NBAs Don’t Cut It

Reinforcement Learning-based Next Best Actions overcome limitations of traditional NBAs and offer better ways to personalize customer engagements.

The goalposts for personalized marketing have moved, and with them the requirements for Next Best Actions (NBAs) that engage consumers. NBAs built on legacy business rules or heuristics fail to adapt to customers, changing product lines, and market conditions. State-of-the-art reinforcement learning, a branch of artificial intelligence (AI), can produce multiple, sequenced sets of timed NBAs that adapt to changes in offerings and markets while scaling to millions of consumers.

The NBA approach to marketing and sales is based on mapping a semi-continuous campaign to the customer journey and proposing the optimal action to take for each consumer, one decision at a time. The moment when this action is applied can be set by consumer activity (for example, a purchase or the online payment of a bill) or by brand activity (for example, noticing the expiration of an option).

The NBA paradigm has been touted as the ultimate one-on-one context-based marketing. Who can resist action? Not marketers. Marketers need potent actions that fit the rhythm of the business. Gone are the days of spamming, when customers were just happy to get an offer from a vendor. Tailored offers, derived from highly sophisticated analysis, are the order of the day. Alas, NBAs must often conform to the dogma of top-down segments and rely on simple, easy-to-compute business rules. This world soon becomes a world of {income above $25k} + {cookie on the web site more than 4 months old} -> {send email}: reactive, aseptic, and doctrinaire.

So, what is wrong with rule-based NBAs? Many things, it turns out. Let us examine their limitations and point to a better way, namely what Reinforcement Learning (RL) based NBAs can do.

Rule-based NBAs

“Learn the rules like a pro, so you can break them like an artist.” ~ Pablo Picasso

Rule-based NBAs cannot support maximum personalization. This is because of the inherent complexity of writing code or rules that adapt to the millions of patterns customers can exhibit. Not only are rules challenging to encode, but they also require comprehensive optimization over a vast search space. How large? Assume a customer journey of just 60 interactions (think credit card uses over, say, two months) and ignore the amount spent. This customer journey generates more combinations than there are atoms in the universe (and that is before taking the amounts into account). Machine learning is the only way to create personalization that accounts for life events and interactions at the granular level.
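To get a feel for the size of that search space, here is a back-of-the-envelope sketch. It rests on assumptions that are not in the example above: each interaction falls into one of a fixed number of categories (merchant types, say), and the universe holds roughly 10^80 atoms, a commonly cited order of magnitude.

```python
import math

ATOMS_IN_UNIVERSE_LOG10 = 80  # rough order of magnitude (assumption)
JOURNEY_LENGTH = 60           # interactions, as in the example above


def journeys_log10(categories: int, length: int = JOURNEY_LENGTH) -> float:
    """log10 of the number of distinct ordered journeys: categories ** length."""
    return length * math.log10(categories)


def min_categories_to_exceed(limit_log10: float, length: int = JOURNEY_LENGTH) -> int:
    """Smallest number of categories whose journey count exceeds 10 ** limit_log10."""
    k = 2
    while journeys_log10(k, length) <= limit_log10:
        k += 1
    return k


if __name__ == "__main__":
    for k in (2, 10, 22, 50):
        print(f"{k:>3} categories -> about 10^{journeys_log10(k):.1f} journeys")
    print("categories needed:", min_categories_to_exceed(ATOMS_IN_UNIVERSE_LOG10))
```

Under these assumptions, about 22 categories per interaction are enough to pass the 10^80 mark. Even with only two possible outcomes per interaction, the count is around 10^18, far more than hand-written rules can cover.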

To get around the search space curse, some designers build rule-based NBAs on demographics and a few aggregated measures from the customer journey, such as RFM (recency, frequency, monetary) value. This simplification is a significant compromise. Analysis of time-series or time-based machine learning shows that 70% to 80% of the decision drivers for consumer actions are the timing and attributes of individual events.
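A small sketch shows what this kind of aggregation throws away. The two purchase histories below are invented for illustration, and the `rfm` helper is a hypothetical, simplified version of the measure. One customer buys steadily once a month, the other makes an escalating burst of purchases in three days, yet their RFM summaries are identical.

```python
from datetime import date


def rfm(events, as_of):
    """Summarize one customer's purchases, given as (date, amount) pairs."""
    if not events:
        return None
    last_purchase = max(day for day, _ in events)
    return {
        "recency_days": (as_of - last_purchase).days,
        "frequency": len(events),
        "monetary": sum(amount for _, amount in events),
    }


# Hypothetical journeys: same totals, very different timing and event attributes.
steady = [(date(2024, 1, 5), 50.0), (date(2024, 2, 5), 50.0), (date(2024, 3, 5), 50.0)]
bursty = [(date(2024, 3, 3), 20.0), (date(2024, 3, 4), 30.0), (date(2024, 3, 5), 100.0)]
as_of = date(2024, 3, 31)

if __name__ == "__main__":
    print(rfm(steady, as_of))
    print(rfm(bursty, as_of))
    print("same RFM summary:", rfm(steady, as_of) == rfm(bursty, as_of))
```

A rule keyed on RFM alone would treat these two customers the same, even though the timing of their individual events tells very different stories.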

Rule-based NBAs cannot adequately leverage the time dimension in decision making. Making recommendations is time-dependent. Our propensity to buy a product changes over time. Our willingness to let a contract lapse changes with the calendar. The most crucial impact of taking time into account is that, at times, the best action is no action.

Rule-based NBAs cannot adjudicate across multiple KPIs. This limitation is a problem, as no company operates on a single performance metric. Selling X and Y, cross-selling, raising the percentage of revenues from products introduced in the last year, upgrading, and decreasing churn all increase some KPIs while depressing others. For instance, if we want to reduce wireless churn, we could offer free iPhones to everyone. But this may dramatically increase risk, and it may increase future service plan revenues while decreasing future hardware revenues. Who picks the winner across multiple KPIs? The VP with the loudest voice? The largest budget? The best way to approach NBA is to rely on patterns that transcend politics.
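One simple way to make such trade-offs explicit is to fold several KPIs into a single score using agreed weights (a weighted sum). The sketch below is only an illustration of that idea: the KPI names, weights, candidate actions, and effect estimates are all hypothetical.

```python
# Hypothetical weights: the value the business places on one unit of change in each KPI.
WEIGHTS = {"churn_reduction": 3.0, "service_revenue": 1.0, "hardware_revenue": 1.0, "risk": -2.0}

# Hypothetical estimated effect of each candidate action on each KPI (arbitrary units).
ACTIONS = {
    "free_phone": {"churn_reduction": 0.8, "service_revenue": 0.5, "hardware_revenue": -1.5, "risk": 0.9},
    "loyalty_credit": {"churn_reduction": 0.5, "service_revenue": -0.1, "hardware_revenue": 0.0, "risk": 0.1},
    "no_action": {"churn_reduction": 0.0, "service_revenue": 0.0, "hardware_revenue": 0.0, "risk": 0.0},
}


def score(effects, weights=WEIGHTS):
    """Weighted sum of an action's estimated KPI effects."""
    return sum(weights[kpi] * value for kpi, value in effects.items())


def best_action(actions=ACTIONS, weights=WEIGHTS):
    """Name of the action with the highest combined score."""
    return max(actions, key=lambda name: score(actions[name], weights))


if __name__ == "__main__":
    for name, effects in ACTIONS.items():
        print(f"{name:>15}: {score(effects):+.2f}")
    print("best across KPIs:", best_action())
    churn_only = {"churn_reduction": 1.0, "service_revenue": 0.0, "hardware_revenue": 0.0, "risk": 0.0}
    print("best on churn alone:", best_action(weights=churn_only))
```

With these made-up numbers, the free phone wins when churn is the only KPI but loses once risk and hardware revenue are counted, which is the kind of conflict a single-KPI rule never sees.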

Rule-based NBAs have an atomic view of actions. That is, each best action is taken one at a time. Naive policies can be obtained from heuristics or from algorithms that attempt to maximize the immediate reward from the customer. While this simplifies the mathematics of the design, it does not reflect real life. As consumers, we make decisions based on a complete experience, not the last email we received. Quite often, this simplification stems from modeling the decision as a Markov Decision Process (MDP), where only the most recent state (and thus action) matters. Un-sequenced rule-based NBAs cannot leverage the fact that baby steps are, at times, the best way to reach a bigger goal.
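The difference between chasing the immediate reward and planning a sequence can be seen in a toy example. Everything here is hypothetical: the two-state journey, the action names, the rewards, and the discount factor of 0.9. The exhaustive search stands in for whatever planning method a real system would use.

```python
# Hypothetical toy journey: state -> {action: (immediate_reward, next_state or None)}
JOURNEY = {
    "new": {"hard_sell": (5.0, None), "helpful_tip": (0.0, "engaged")},
    "engaged": {"upgrade_offer": (20.0, None), "hard_sell": (5.0, None)},
}


def greedy_plan(state):
    """Pick the action with the highest immediate reward at each step."""
    plan, total = [], 0.0
    while state is not None:
        action, (reward, state) = max(JOURNEY[state].items(), key=lambda kv: kv[1][0])
        plan.append(action)
        total += reward
    return plan, total


def best_plan(state, discount=0.9):
    """Pick the sequence with the highest discounted total reward (exhaustive search)."""
    if state is None:
        return [], 0.0
    best = None
    for action, (reward, next_state) in JOURNEY[state].items():
        tail, value = best_plan(next_state, discount)
        total = reward + discount * value
        if best is None or total > best[1]:
            best = ([action] + tail, total)
    return best


if __name__ == "__main__":
    print("greedy:   ", greedy_plan("new"))
    print("sequenced:", best_plan("new"))
```

The greedy policy takes the quick sale and ends the journey, while the sequenced plan accepts a zero-reward first step (the baby step) to reach a larger reward later.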

As market conditions change and competitors introduce new products or new variations on existing products, the next best action should change. Because they do not test new market conditions, at either the individual or the aggregate level, rule-based NBAs are mostly blind. The introduction of a new product will necessarily change the rules. Still, marketers need to be able to see the impact of their actions for new products ahead of launch, not six months after market introduction. That is too late.

Bring in the reinforcement. Reinforcement learning, that is!

“Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning.” ~ Albert Einstein

Reinforcement Learning (RL) is at the cutting edge of Machine Learning. It is used in robot design, self-driving automobiles, video games, and board games. In RL, a Reinforcement Learning Agent (RLA) interacts with and takes actions in an environment (a customer, a set of customers, a car with its immediate surroundings, a video game status, stones on a Go board) in discrete steps. The environment and the RLA both have a state (think of the state as a collection of variables). A policy sets the RLA's actions. Actions trigger feedback, aka rewards, from the environment. Changes in the environment and rewards are passed back to the RLA, which learns from these interactions. Because those interactions can be negative (a car crashing, a customer leaving for a competitor), RL requires sophisticated mechanisms to decide whether the RLA should explore (gather more information that might lead to better decisions) or exploit (take the best decision given current information).
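The loop described above (act, receive a reward, update, and balance exploration against exploitation) can be sketched in a few lines. This is a deliberately minimal illustration, not a production design: it uses an epsilon-greedy multi-armed bandit, the simplest RL setting, with no customer state at all. The action names and response probabilities are hypothetical, and the "environment" is a simulated customer base.

```python
import random


def run_agent(response_prob, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent acting against a simulated customer base.

    response_prob: hypothetical probability that each action earns a reward of 1.
    Returns the agent's learned value estimates and how often it took each action.
    """
    rng = random.Random(seed)
    actions = list(response_prob)
    estimates = {a: 0.0 for a in actions}  # learned value of each action
    counts = {a: 0 for a in actions}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)              # explore: try something at random
        else:
            action = max(actions, key=estimates.get)  # exploit: best action known so far
        # Environment feedback: reward of 1 if the simulated customer responds.
        reward = 1.0 if rng.random() < response_prob[action] else 0.0
        counts[action] += 1
        # Incremental mean: move the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


if __name__ == "__main__":
    est, cnt = run_agent({"email": 0.02, "sms": 0.05, "call": 0.10})
    print("estimates:", est)
    print("counts:   ", cnt)
```

Setting `epsilon` to zero removes exploration entirely, so the agent can lock onto whichever action it happened to try first. That failure mode is the reason the explore-or-exploit decision needs care.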

In a Customer Experience application, the game is not one player against another. The player is the company, facing a customer base that can number in the millions. RL must be adapted to this framework. While daunting, this is achieved by noting the following:

Corporations have a plethora of historical data that, if organized smartly along customer journeys, provides a rich basis for learning. Historical data inherently contains both exploration and exploitation information.
Causality and counterfactual analysis can be used to reduce the action space more effectively than function approximation.
Time intervals are essential signals. So, when playing, the moment of play matters. One needs to encode time carefully.
RL is inherently a divisible and integrable framework. The mapping of RLAs to a situation is a flexible optimization mechanism.
The customer journey is a very rich state to work with, but behavioral tokens can simplify design significantly. Commitment to a brand is a perfect behavioral token.
Often, the set of actions available to the RLAs is restricted (you cannot lend more than a loan limit, a zero balance cannot be reduced, privacy rules such as GDPR and contact policies limit the emails that can be sent within a time period). Corporate policies act like any other action rule.
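The last point, restricting the set of available actions, is often handled by filtering candidates before the agent chooses among them (sometimes called action masking). A minimal sketch follows. The field names (`loan_limit`, `balance`, `emails_sent_this_period`, `email_cap`) and action kinds are hypothetical, and the cap stands in for whatever contact policy applies.

```python
def allowed_actions(customer, candidates):
    """Keep only the candidate actions that satisfy hypothetical business constraints."""
    allowed = []
    for action in candidates:
        kind = action["kind"]
        if kind == "loan_offer" and action["amount"] > customer["loan_limit"]:
            continue  # cannot lend more than the loan limit
        if kind == "balance_reduction" and customer["balance"] <= 0:
            continue  # a zero balance cannot be reduced
        if kind == "email" and customer["emails_sent_this_period"] >= customer["email_cap"]:
            continue  # contact cap reached for this period
        allowed.append(action)
    return allowed


# Hypothetical customer and candidate actions.
customer = {"loan_limit": 5000, "balance": 0, "emails_sent_this_period": 3, "email_cap": 3}
candidates = [
    {"kind": "loan_offer", "amount": 8000},
    {"kind": "loan_offer", "amount": 4000},
    {"kind": "balance_reduction"},
    {"kind": "email"},
    {"kind": "no_action"},
]

if __name__ == "__main__":
    for action in allowed_actions(customer, candidates):
        print(action)
```

Keeping "no action" in the candidate list means the agent always has a valid choice, and it matches the earlier observation that, at times, the best action is no action.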

In an enterprise environment, RLAs can both intervene in their own environment (one or more customers) by choosing actions, receiving feedback as rewards, and modifying future actions, and observe other RLAs interacting with different customers. This division of labor between RLAs can be based on customer behavioral tokens, on product or service features, on KPIs, or on a combination. For these reasons, Reinforcement Learning is the most appropriate way to tackle next best actions.

Using RL, we can produce compound actions: timed and sequenced, margins included, with many more customer KPIs than ever before. We address each customer one at a time, not in super segments that lump thousands of customers together based on typical demographics. We can get to know individual customers, who are treated and valued as individuals. The resulting NBAs can be incredibly helpful and can move at the speed of data, giving customers a more targeted offer, deal, or action.

Using RL, we can take all this data and develop sequences of multiple actions. We can derive a set of compound NBAs. This marks a breakthrough in dealing with customers on a prospective basis: customer by customer while at scale, product by product, and across product lines. For these reasons, we expect that more marketers will insist on using RL-based NBAs.


About Dr. Alain Briancon
