Is your company having trouble seeing through the "data smog"? That's what some have come to call the massive amount of highly relevant but disconnected and disorganized data confronting them as they try to perform Web analytics. An enterprise may have in its possession all the data it needs to make the right offer to the right customer at the right time. But the company stumbles as it tries to pull data together and mine it to discover how to make targeted offers to Web site visitors--all in the few milliseconds it has to remain competitive.
Decision trees can help you see through the data haze and improve the relevance and accuracy of Web analytics. With a long history in data mining and machine learning, decision trees aren't new. But once you see how they've been developed and deployed, you can understand their use for Web analytics.
Businesses commonly look at a variety of data types to execute real-time customer acquisition and retention on their Web sites. Sources and types include Internet click streams, marketing campaigns, affiliate information, demographic sources, transaction data, call-center data and other information about customers and their lifestyles. An array of departments and internal and external organizations often own the data, making it hard to combine the different types. Working with each source leads to its own management and access overload; integrating the data types can be flat out overwhelming, especially in terms of expense. And you can't be sure that in the end you will have solved the problem of real-time targeting.
But the obstacles can't stop companies, which know that by combining diverse data sources, they'll arrive at a critical understanding of "who" and "what" drives online and offline sales and revenue. A Web site is a looking glass through which an organization can see the desires and needs of its customers and partners, especially when online behavioral data can be integrated with lifestyle, demographic and transactional data.
Web analytics should be the means by which a company can see through the "smog" to understand its core customers. In so doing, the company can develop consumer profiles that help align sales and marketing with specific products and services.
A decision tree is, to quote Wikipedia, "an idea-generation tool that generally refers to a graph or model of decisions and their possible consequences." Firms in financial services, insurance, telecommunications and other industries have employed decision trees to segment large databases so they can determine customer lifetime value, detect fraud and engage in other forms of business intelligence and analysis. Often associated with machine learning, decision trees are typically built for segmentation and classification. Decision trees are excellent for modeling consumer behavior because they let you see what combinations of data attributes can best predict who will buy what.
Decision trees let you measure the "information gain" in the classification of a dependent variable, such as "will buy" versus "will not buy," through comparison of hundreds of independent variables, such as age, income, number of Web site visits and total sales. You can automate data segmentation by using machine-learning algorithms, which work with decision trees to split and test for independent variables that increase the information gain to be found between "will buy" and "will not buy" customers.
Automated decision trees perform classification through recursive partitioning. That is, they split and measure the amount of information a single data variable provides for determining a dependent variable. Let's say you are trying to classify fruit, and you have three data attributes: weight, shape and color. A decision tree quickly eliminates weight, which provides little information because the three fruits weigh about the same. Shape offers a higher information gain; "round" segregates bananas from apples and oranges. However, the decision tree would rank color as the variable offering the most information gain because with this attribute you can discriminate easily between oranges, apples and bananas.
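The fruit example can be worked through in a few lines of Python. This is a minimal sketch, assuming one sample row per fruit with made-up attribute values; the function names `entropy` and `information_gain` are illustrative and not taken from any particular product. It measures information gain as the drop in Shannon entropy after splitting on an attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy left after the split."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical toy data: the attribute values are assumptions for illustration.
fruits = [
    {"weight": "medium", "shape": "round", "color": "red", "fruit": "apple"},
    {"weight": "medium", "shape": "round", "color": "orange", "fruit": "orange"},
    {"weight": "medium", "shape": "long", "color": "yellow", "fruit": "banana"},
]

gains = {a: information_gain(fruits, a, "fruit") for a in ("weight", "shape", "color")}
ranking = sorted(gains, key=gains.get, reverse=True)
for attribute in ranking:
    print(f"{attribute}: {gains[attribute]:.3f} bits")
```

With this toy data, weight yields no gain because every fruit has the same value, shape separates only the banana, and color separates all three, which matches the ranking described above.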
The same process works for determining which variables are most valuable in forecasting who will buy or not buy. Decision trees also are helpful in segmenting one-time buyers from repeat buyers, multiple-product buyers from single-item buyers, and so on. In addition, decision trees can be constructed to predict the success of cross sales, as well as the lifetime value of online visitors and shoppers.
Classification and modeling through decision trees is a two-step process. The first step is the "learning," or training of the tree. The second step is to create and deploy conditional, if/then rules based on the decision tree. In the learning phase, the decision tree is exposed to samples of historical data that represent buyers and non-buyers, single-product and multiple-product buyers, and so on. Once trained and tested, the decision tree can elicit if/then rules for classifying future shoppers into the classes that the tree was created to detect and classify. The system can then employ the rules for segmenting real-time data streams coming from the enterprise Web site.
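The two steps can be sketched in plain Python. The code below is an ID3-style illustration under stated assumptions, not a description of any vendor's implementation: the attributes `repeat_visitor` and `income`, the six history rows, and the helper names `train`, `extract_rules` and `classify` are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split(rows, attribute):
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def train(rows, attributes, target):
    """Step 1: learn a tree by recursive partitioning (ID3-style)."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    def remainder(attribute):
        return sum(len(g) / len(rows) * entropy([r[target] for r in g])
                   for g in split(rows, attribute).values())
    best = min(attributes, key=remainder)  # least leftover entropy = most gain
    rest = [a for a in attributes if a != best]
    return {"attribute": best,
            "branches": {value: train(group, rest, target)
                         for value, group in split(rows, best).items()}}

def extract_rules(tree, conditions=()):
    """Step 2: turn every root-to-leaf path into one if/then rule."""
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    for value, subtree in tree["branches"].items():
        yield from extract_rules(subtree, conditions + ((tree["attribute"], value),))

def classify(rule_list, visitor, default="will not buy"):
    """Apply the first rule whose conditions all hold for this visitor."""
    for conditions, outcome in rule_list:
        if all(visitor.get(a) == v for a, v in conditions):
            return outcome
    return default

# Hypothetical historical sample of buyers and non-buyers.
history = [
    {"repeat_visitor": "yes", "income": "high", "label": "will buy"},
    {"repeat_visitor": "yes", "income": "low", "label": "will buy"},
    {"repeat_visitor": "yes", "income": "low", "label": "will buy"},
    {"repeat_visitor": "no", "income": "high", "label": "will buy"},
    {"repeat_visitor": "no", "income": "low", "label": "will not buy"},
    {"repeat_visitor": "no", "income": "low", "label": "will not buy"},
]

tree = train(history, ["repeat_visitor", "income"], "label")
rule_list = list(extract_rules(tree))
for conditions, outcome in rule_list:
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conditions) + f" THEN {outcome}")
```

Once the rules are extracted, `classify(rule_list, visitor)` can be called on each new visitor record without touching the training data again, which is what makes the rules suitable for a real-time stream.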
Decision trees let Web analysts prioritize the most important data types and attributes for classifying profitable visitors. You also can use them to find subtle ranges hidden in the data: in a continuous value, for example, such as ages between 18 and 23 or incomes between $75,000 and $100,000. Decision trees arrive at these valuable nuggets of knowledge through a process of splitting and testing for information gain.
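To show how splitting and testing can surface a range in a continuous value, here is a small sketch. It assumes a common approach in which candidate thresholds are the midpoints between neighboring sorted values, and it uses invented ages and outcomes. Two successive splits bracket the age band where buyers concentrate; `best_threshold` is a hypothetical helper name.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with the most gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best

# Hypothetical visitor ages and purchase outcomes.
ages = [16, 17, 19, 21, 22, 23, 30, 41, 55]
bought = ["no", "no", "yes", "yes", "yes", "yes", "no", "no", "no"]

# First split finds the upper edge of the buying range.
upper, _ = best_threshold(ages, bought)

# Second split, on the younger side only, finds the lower edge.
younger = [(a, b) for a, b in zip(ages, bought) if a <= upper]
lower, _ = best_threshold([a for a, _ in younger], [b for _, b in younger])

print(f"Buyers concentrate where {lower} < age <= {upper}")
```

The thresholds fall halfway between observed values, so the exact edges of a discovered range depend on how densely the sample covers the variable.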
Through trees, you can prioritize which data attributes are most important in identifying profitable visitors, or which online shoppers are most likely to make a purchase. Here is an example of the logic:
IF number of website visits > 2
AND online purchases made = 0
AND ZIP 94502 = (89%-96%) Affluent Income Consumer Group 1
AND Customer Life Value > (16 months - 31 months)
THEN Will Buy 78%
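A rule like this can be expressed as a small data structure and tested against each visitor record. The sketch below makes several assumptions: the field names (`visits`, `online_purchases`, `zip_code`, `lifetime_months`) and the `ZIP_SEGMENTS` lookup are hypothetical, and the customer-life-value condition is read as "more than 16 months," which is only one possible interpretation of the range in the rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup from ZIP code to lifestyle segment; in practice this
# would come from a lifestyle or demographic data source.
ZIP_SEGMENTS: Dict[str, str] = {"94502": "Affluent Income Consumer Group 1"}

@dataclass
class Rule:
    conditions: List[Callable[[dict], bool]]
    outcome: str
    confidence: float

    def matches(self, visitor: dict) -> bool:
        # Every condition must hold, mirroring the AND chain in the rule.
        return all(condition(visitor) for condition in self.conditions)

will_buy = Rule(
    conditions=[
        lambda v: v["visits"] > 2,
        lambda v: v["online_purchases"] == 0,
        lambda v: ZIP_SEGMENTS.get(v["zip_code"]) == "Affluent Income Consumer Group 1",
        lambda v: v["lifetime_months"] > 16,  # assumed reading of the range
    ],
    outcome="will buy",
    confidence=0.78,
)

def score(visitor: dict, rules: List[Rule]) -> Optional[Tuple[str, float]]:
    """Return (outcome, confidence) for the first matching rule, else None."""
    for rule in rules:
        if rule.matches(visitor):
            return rule.outcome, rule.confidence
    return None

visitor = {"visits": 3, "online_purchases": 0, "zip_code": "94502", "lifetime_months": 20}
print(score(visitor, [will_buy]))
```

Keeping rules as data rather than hard-coded branches means a retrained tree can replace the rule set without changes to the scoring code.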
An enterprise might want to construct a decision tree based on multiple factors from different data types, including demographics, click stream, transactions and lifestyle attributes. The objective would be to develop robust if/then rules for segmenting online visitors in real time, thereby enabling the enterprise to better predict online behavior and buying patterns. However, this is where we confront the problem of different data types coming from different sources. Integrating them is necessary to develop predictive rules for classifying online visitors. Typical attributes and sources match up like this:
Attribute                Source
Number of site visits    Internet click stream
Online purchases         Transactions
ZIP code                 Lifestyle data
Life value               Customer data
Most modern enterprises possess extensive data assets generated by customer Web site activity. Unfortunately, the assets are often scattered across server farms owned by different departments in different locations, each of which has its own agenda and objectives. It takes a uniform Web analytics strategy to properly leverage data assets and types.
Rather than pull and combine the click stream, transactional and other data types into a single data warehouse, some organizations use software agents managed across the network to segment the data types more cost-effectively. Personalization and online cross- and up-selling then take place across a distributed streaming architecture.
Web analytics can then happen without the need to share and replicate data through a centralized data warehouse. In other words, the distributed approach helps solve the problem of finding needles in an assortment of moving and changing haystacks. A distributed architecture also can ensure privacy and security at a reduced IT cost.
Where can we find an example of a working distributed data mining network? The U.S. Department of Defense. As "Global Mining," left, depicts, the DoD's challenge was to analyze multiple data sensor silos in real time to guide a missile while it was in flight.
To solve this problem, DoD needed to mine sensor data distributed across the globe within seconds to recognize the difference between, for example, a tank and a school bus, which have similar density. This meant identifying and discriminating between targets through analysis of sensor data coming from ships, satellites and the ground. The DoD addressed the problem with a combination of machine-learning algorithms for creating multiple decision trees and networking them together to develop a global model for targeting in real time.
With a distributed modeling architecture, the DoD was able to analyze and share data for making decisions rapidly. The best way to visualize how it works is to imagine the networking of a group of decision trees into a distributed data "forest" (see "Segmented Decision Trees," right). The segmentation process happens when moving assets are targeted. Taking away the parts of the tree in the middle of the figure for a moment, you can see two decision trees from different locations: one is analyzing sensor data from a battleship, while the other is examining sensor data transmitted from a satellite.
The result of the two decision trees is that the data is segmented along two separate splits; the two resulting tree branches can be seen in the middle of "Segmented Decision Trees." A mediating software agent would then take these results and, over the network, consolidate the findings at a central location per the global model. The two separate decision trees are networked, enabling the segmentation of integrated sensor data coming from the battleship and the satellite.
By networking decision trees in this fashion, organizations can better develop global models and gain insight from different data types from different locations and owners. Software agents are sent to do the segmentation and then, through the network, combine the results of if/then rules at a central location.
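The agent-and-mediator idea can be illustrated with a toy model. This is a simplified sketch and not the DoD's or any vendor's actual design. It assumes each agent owns one data source keyed by a shared visitor ID, that each agent applies a single local split, that agents return only sets of record IDs per branch rather than raw values, and that the mediator holds the historical outcomes. The names `Agent` and `mediate` and all of the sample data are hypothetical.

```python
from collections import Counter
from itertools import product

class Agent:
    """Site-local agent: keeps its raw records and shares only record IDs."""

    def __init__(self, records, test_name, test):
        self._records = records   # raw values stay with the data owner
        self.test_name = test_name
        self._test = test

    def split(self):
        """Return {branch description: set of record IDs} for the local split."""
        branches = {self.test_name: set(), f"NOT {self.test_name}": set()}
        for record_id, value in self._records.items():
            key = self.test_name if self._test(value) else f"NOT {self.test_name}"
            branches[key].add(record_id)
        return branches

def mediate(agent_splits, outcomes):
    """Intersect one branch from each agent and tally outcomes per combined segment."""
    segments = {}
    for combo in product(*(s.items() for s in agent_splits)):
        ids = set.intersection(*(branch_ids for _, branch_ids in combo))
        if ids:
            description = " AND ".join(name for name, _ in combo)
            segments[description] = Counter(outcomes[i] for i in ids)
    return segments

# Hypothetical sources owned by different departments.
clickstream = Agent({"v1": 5, "v2": 1, "v3": 4, "v4": 3, "v5": 1},
                    "visits > 2", lambda visits: visits > 2)
transactions = Agent({"v1": 0, "v2": 0, "v3": 2, "v4": 0, "v5": 1},
                     "purchases = 0", lambda purchases: purchases == 0)

# Assumed to be known to the mediator from historical data.
outcomes = {"v1": "will buy", "v2": "will not buy", "v3": "will not buy",
            "v4": "will buy", "v5": "will not buy"}

segments = mediate([clickstream.split(), transactions.split()], outcomes)
for description, counts in segments.items():
    print(f"IF {description} THEN {dict(counts)}")
```

In this sketch only IDs and branch descriptions cross the network; the visit counts and purchase counts never leave the agents that hold them.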
The method used by the DoD to analyze sensor data from around the world for missile defense could also apply to business requirements for targeting and segmenting online visitors and customers. Similar network forests may be created using software agents to communicate and coordinate data splits over networks. A central mediator can manage the matching and evaluation of decision tree branches drawing from multiple, disparate data sources. Weak branches are displaced as the evolving segmentation process works on the stream of incoming data. The mediating agent can assemble the prevailing branches to create a global network.
The DoD's networked decision tree design can work for Web analytics and the segmentation of online visitors in real time based on data streams. Time-critical if/then rules can be developed as streams of interactions are taking place at a company's Web site. And decision trees could draw from all the other data sources the enterprise has at its disposal.
The military's need for distributed knowledge discovery for command and control is not unlike the requirements that companies have with dynamic Web sites. To take the lead, business enterprises must be almost as responsive to customers with personalized offers as the military must be to target missiles moving at supersonic speeds.
Consider a Web site selling air purifiers and air conditioners. This company found that its online shoppers generated remarkable lifestyle information for each specific product line. First-time buyers, identified by their IP addresses combined with number of visits (click stream data) and ZIP code (lifestyle), purchased mostly window air conditioners. Other customers with totally different data features purchased different types of air conditioners; one class went for units with remote controls, while another went for air purifiers.
Armed with this knowledge--based on multiple data sources and expressed in hundreds of if/then rules--the company can make segmented offers of all its products based on prearranged profiles that match visitors to specific features. For example, it might offer window units to new shoppers with profiles that match the click stream, demographics, transactional behavior and other information known historically about buyers of those products. Other visitors might see different products matching other attributes and behavior.
As many others do today, such a Web site could prompt visitors for their ZIP code, which would be used not only to direct them to a local store or dealer, but to segment the visitor through lifestyle demographics. A targeted offer could come his way within milliseconds. The company performs real-time customer and product segmentation at its Web site; Web pages are customized on the fly for cross-selling and up-selling.
A knowledge discovery process of this sort often reveals previously unknown key attributes. Customer features extracted from Web site interactions can change future marketing efforts and allocation of budget. The air conditioning firm, for example, might discover that the bulk of its consumers were high-rise renters who did not own cars, and instead took public transportation. Thus, if it were spending a high percentage of its marketing budget on radio ads during commute time, knowledge discovery should lead to a change in strategy.
Finally, even as the changes occur at the point of customer contact, the data in a networked decision tree design does not have to move; instead, only encrypted pointers to the data need to be sent over the networks in the form of binary streams. Data can stay with its owners, where they can manage customer privacy and data security. Centralized data warehouses have become an object of great concern due to the potential for data misuse. In this Web analytics paradigm, only the knowledge moves.
Decision trees are changing what organizations can do--and must do--with Web analytics. Drawing from distributed and diverse data, they can grow in sophistication as the company learns more about its customers and their behavior.
Jesus Mena is chief strategy officer at InferX, and is the author of several books on data mining and web analytics.
Jerzy Bala, CTO of InferX, holds several patents in machine learning and has authored many works on the subject.