Complex network topologies have received attention from a wide variety of fields in recent years [Newman (2003), Albert et al. (2002), Dorogovtsev & Mendes (2002)]. For example, the cell is now well described as a network of chemicals connected by chemical reactions; the Internet is a network of routers and computers linked by many physical or wireless links; culture and ideas spread on social networks, whose nodes are human beings and whose edges represent various social relationships; the World Wide Web is an enormous network of Web pages connected by hyperlinks.
Many new concepts and measures have been recently proposed and investigated to characterize such systems. We define and briefly discuss several of the most important concepts:
Degree distribution. Nodes in a network typically do not all have the same number of links, or degree. This variation can be characterized by a distribution function , which gives the probability that a randomly selected node has exactly links. Since in a random graph the links are placed randomly, the majority of nodes have approximately the same degree, close to the average degree of the network. The degree distribution of a random graph is a Poisson distribution with a peak at . However, recent empirical results show that the degree distributions of most large networks are quite different from a Poisson distribution. In particular, for a large number of networks, including the World Wide Web [Albert et al. (1999)], the internet [Faloutsos et al. (1999)], and metabolic networks [Jeong et al. (2000)], the degree distribution has a power law tail. Investigation of degree distribution can help characterize network topology. The call log data is more complex than has been analyzed before, as the academic literature assumes a static network. Degree distribution remains a useful concept to consider, however.
Small Worlds. The small world concept describes the fact that, in most networks, there is a relatively short path between any two nodes, even if the number of nodes is large. The distance between two nodes is defined as the number of edges along the shortest path connecting them. The best known example of small worlds is the "six degrees of separation" found by the social psychologist Stanley Milgram, who showed that there is an average number of six acquaintances between most pairs of people in the United States [Milgram (1967)]. The small world property can be observed in most complex networks: the actors in Hollywood are on average within three costars from each other, or the chemicals in a cell are typically separated by three reactions. The small world concept, however, is not an indication of any organizing principle. Erdos and Renyi demonstrated that the typical distance between any two nodes in a random graph scales as the logarithm of the number of nodes . Thus, even random graphs are "small worlds." Relevant information on customer topology will most likely be limited to their immediate connections or to the structure and behavior of "friends of friends."
Clustering. A common property of social networks are cliques, circles of friends or acquaintances in which every member knows every other member. This inherent tendency to cluster is quantified by the clustering coefficient [Watts and Strogatz (1998)]. Consider a selected node i in a network, having edges connected to other nodes. If the first neighbors of the original node were all connected, there would be ( - 1)/2 edges between them. The ratio between the number of edges that actually exist between these nodes, , and the maximum number, ( -1)/2, gives the value C of the clustering coefficient of node i: . A network's clustering coefficient is the average clustering coefficient of its nodes. In a random graph, since the edges are distributed randomly, the clustering coefficient is , where is the probability of a link existing between any pair of nodes. However, Watts and Strogatz pointed out that in most real networks the clustering coefficient is typically much larger than it is in a random network of equal number of nodes and edges [Watts & Strogatz (1998)]. Using concepts like the clustering coefficient will be key to differentiating local topologies found within the call log.
Computational epidemiology is the study of modeling disease propagation over a human host population. However, there are many other contagions that can diffuse through a human host population beyond disease: information, rumor, opinion... In order to understand and control the spread of contagions, it is essential to establish some of the key parameters associated with transmission. Determining the basic reproductive ratio (R0) of a disease, for example, is the primary objective for many epidemiological studies. R0 defines the number of secondary cases produced by an infected individual in an entirely susceptible population. Ideally, health policies (or marketing mangers!) can attempt to change the parameters involved in its formulation in order to control a pathogen's spread. Unfortunately, R0 is notoriously difficult to measure, and must be derived indirectly. Mathematical models have played an important role in assessing and understanding the dynamics of disease transmission in human populations. For example, the proportion of individuals requiring vaccination for the eradication of a disease may be formulated using R0.
The majority of epidemiological models are based on a compartmental framework; the host population is partitioned into those that are susceptible, infected, or immune/recovered (SIR) to a particular pathogen. These deterministic models assume that the rate at which new infections are acquired is proportional to the number of encounters between susceptible and infected individuals, and leads to an effective reproductive ratio that is dependent on a threshold density of susceptibles. Thus, is dependent not only on parameters intrinsic to the disease such as latent and infectious periods, but also on contacts between infectious and susceptible hosts. Compartmental models of this kind implicitly assume that the host population is well mixed, such that the probability of infection is equal for all.
However, as our call logs already indicate, social network structures are clearly not always well mixed, and the complexities of host interactions may have profound implications for the interpretation of epidemiological models. Standard mean-field models do not account for heterogeneities of risk between individuals due to the finite number, variability, and clustering of social contacts. Studies have shown that network structure can significantly affect the processes occurring on social networks, including the dynamics and evolution of infectious diseases but also the diffusion of information and opinion.
These models have used hypothetical, extreme network structures as caricatures of real host contact networks. During the recent outbreak of SARS in Singapore, contact tracing of infectious individuals showed a 'superspreader' pattern of disease transmission. Furthermore, many directly transmitted diseases are distributed globally, in different types of society, with different characteristic interactions between people. We will be repurposing these existing stochastic models of pathogens dissemination over a dynamic human network to model product adoption and churn over social networks.
The infection network of SARS (left) and call log network structure (right). Human networks always include some nodes with a disproportionate number of links to other people. These hubs, or ‘super-spreaders' in the field of epidemiology, play a critical role at dissemination contagions, information, or opinions across the network.
During the past 20 years, and over the past 10 years especially, the Statistics and Computer Science communities have developed an array of generally effective classification and prediction methodologies (see, for example, Hastie et. al.). Classification can be thought of as a two-step process: the first is to identify the classes we wish to assign nodes to, and the second is to infer a class for each node. Our strategy to identify the different "latent classes" or "profiles" of telecommunication customers will start with applying clustering algorithms using demographics data and descriptions of local network topology. We will then work with the marketing department of our telecommunications partner to refine the results into profiles they believe will be useful in their marketing efforts. We will also cluster for the sake of prediction: in other words, for those quantities and results we wish to predict, we will create classes which best explain the variance in outcome. With these classes thus defined, each node can be assigned either to the nearest/best class, or – in an effort to express how well a node fits into these classes – a mixture model can be created, and individual node classes defined as a mixture of the defined classes.
With classes formally defined and assigned, we can use these results, along with raw observations of use, to predict future behavior. The effectiveness of these predictions can be evaluated using a random subsample of the customer event database. "Off-the-shelf" prediction algorithms shown to be generally effective include nearest neighbor inference, boosted decision trees, kernel regression methods and support vector machines. By using statistical descriptions of local network topology as the inputs to these algorithms, we believe the ability of telecommunications companies to actually predict their customers' behavior will be greatly enhanced.