Survival Data Mining for Customer Insight
Data mining techniques that have proven their worth in smaller applications are now crossing over into mainstream business computing. A practical approach will help you better understand customer behavior and reduce churn.
When I'm trying to understand a company's customers by using data collected in its databases, my first inclination is to apply survival data mining. Over the years, I've found that this approach provides rapid feedback about the customers and their behaviors, while at the same time providing a solid basis for quantifying customer value and measuring customer loyalty. This is customer insight in practice.
What is survival data mining? It's the application of survival analysis, a traditional statistical technique, to data mining problems concerning customers. The application to the business world changes the flavor of such statistical techniques, which were honed on the analysis of small numbers of patients in medical studies. Extracting the last iota of information from a handful of customers is no longer the primary concern. The key issue is how to make sense of millions or tens of millions of database records describing current and past customers and their business interactions.
This article presents survival data mining in practice. It starts with a methodology for subscription-based businesses and introduces hazards and survival curves for understanding churn. It then explains how to quantify results and how to apply the same techniques to general time-to-event problems in business. A technical sidebar ("Calculating Hazards in a Database," at end of article) shows how to do some of the calculations in a relational database.
Hazard Probability
In the medical world, doctors often want to understand which treatments help patients survive longer and which have no effect at all (or worse). In the business world, the equivalent concern is when customers stop being customers. This is particularly true of businesses that have a well-defined beginning and end to the customer relationship. A good example is a subscription-based relationship, which may be found in a wide range of industries including insurance, communication, cable television, newspaper and magazine publishing, banking, and newly competitive utility markets.
The basis of survival data mining is hazard probability: that is, the chance that someone who has survived for a certain length of time (called customer "tenure") is going to stop, cancel, or expire before the next unit of time. This definition assumes that time is discrete, and such discrete time intervals (days, weeks, or months) fit business needs. By contrast, traditional survival analysis in statistics usually assumes that time is continuous.
Given the right data, calculating the hazard probability for a given tenure t is simple. The probability is the number who succumbed to the risk divided by the population at risk during that tenure. That is, the numerator is the number of customers who stopped with exactly tenure t and the denominator is everyone who had tenures greater than or equal to t. Customers with shorter tenures aren't part of the risk group. The sidebar explains how to calculate hazards directly using a relational database.
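For readers who prefer to see the arithmetic spelled out, here is a minimal sketch of the calculation in Python rather than in a database. It assumes each customer is reduced to a pair of values: a tenure in whole time units and a flag saying whether the customer stopped at that tenure or is still active. The function name hazard_probabilities and the sample data are hypothetical, made up only for illustration.

```python
from collections import Counter


def hazard_probabilities(customers):
    """Return {tenure: hazard probability} for discrete tenures.

    customers: iterable of (tenure, stopped) pairs (an assumed layout).
      tenure  - whole number of time units (days, weeks, or months)
      stopped - True if the customer stopped at exactly that tenure,
                False if the customer is still active
    """
    customers = list(customers)
    if not customers:
        return {}

    # Numerator: customers who stopped with exactly tenure t.
    stops = Counter(t for t, stopped in customers if stopped)
    # How many customers (stopped or active) have exactly tenure t.
    at_tenure = Counter(t for t, _ in customers)

    hazards = {}
    at_risk = len(customers)  # everyone has tenure >= 0
    for t in range(max(at_tenure) + 1):
        if at_risk == 0:
            break
        # Denominator: everyone whose tenure is greater than or equal to t.
        hazards[t] = stops[t] / at_risk
        # Customers with tenure exactly t leave the risk group for t + 1.
        at_risk -= at_tenure[t]
    return hazards


# Made-up sample: five customers, three of whom have stopped.
sample = [(0, True), (1, False), (2, True), (2, False), (3, True)]
for tenure, hazard in hazard_probabilities(sample).items():
    print(tenure, round(hazard, 3))
```

Note that active customers still count in the denominator for every tenure up to their current one; leaving them out would overstate the hazards.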
A picture paints a thousand words. Figure 1 charts hazard probabilities for customers in a typical subscription business. The horizontal axis is the tenure of customers measured in days; the vertical axis is the probability that customers stop at a particular tenure point.

FIGURE 1 Hazard probabilities for customers in a typical subscription business.
The hazard chart in Figure 1 is an X-ray into the customer life cycle because it highlights different important events. The first hazard probability at time zero is about 4 percent; this bump is due to customers not starting and is often caused by poor customer information being gathered at the point of sale or perhaps by buyer's remorse. At around 60 days, there's a very strong peak in the hazard probability. This peak corresponds to those customers who start but never pay. The company moves customers through various dunning levels to inspire payment. However, at some point, the company must force churn because of nonpayment. Changes in this policy, such as a reduction in the period of time for cutting off nonpaying customers, would be apparent in the hazard probabilities.
At around 90 days, we see another significant spike in the hazards. This spike actually has nothing to do with nonpayment. It's due to the end of the initial promotion. Customers who signed up for this service because the initial offer was cheap often stop when they have to start paying full price. Happily, the customers who stop at this point have at least been paying their bills.
After these two initial peaks, the hazard probability gradually declines, but with a jagged characteristic. The jaggedness is actually due to the one-month billing cycle that most customers are on. Customers are more likely to stop at the end of a billing cycle. One reason is that when customers call in to stop, the stop date is set to the end of the billing cycle unless the customer requests a specific date.