Towards an Automated Framework of Root Cause Analysis in the Canadian Telecom Industry

The telecommunications industry is essential to our everyday life, and providing consistent, stable internet service to the population is the top priority of telecom companies. At the same time, companies must keep up with the latest technologies to ensure customers are offered the best services, whether in terms of speed, reliability, or other dimensions. As telecom companies introduce new services and technologies, they must ensure seamless delivery with no impact to customers. When an impact does occur, telecom companies launch investigative processes to examine and analyze the root causes. This process is lengthy and time consuming. In this paper, I propose automating the analysis by using association rule mining to fast-track the investigative process of identifying root causes.


Introduction and Related Work
The telecommunication network architecture consists of a network of complex interconnected components, starting with the core and travelling up to the hardware devices used to distribute network service to the customer premises: the CSP (Communication Service Provider) modem for internet, the set-top box for IPTV, and others. The introduction of a new technology, service, or product may have a direct impact on the customer's network experience. As telecommunication organizations become more customer centric, i.e. adopt a customer-first approach, such network changes must be closely monitored to ensure zero customer impact. It is standard practice for a network provider to carefully test any new hardware or software in the lab before deploying it to the network. However, on rare occasions, unexpected issues or bugs appear that went undetected in the lab. These usually translate into customers calling in to report a service disturbance and seek resolution. Figure 1 displays the possible call rate scenarios after the introduction of a new product or service.

Figure 1. Example of Current vs. New Technology Product Call Rate Performance Evaluation
It is therefore very important for the network provider to quickly identify the root causes of such disturbances and take the necessary measures to ensure consistent end-to-end network performance. To accomplish this, it is important to have full visibility into both network and customer experience metrics, as well as the capacity to correlate these metrics in a consistent and systematic manner in order to derive actionable insights.
In this paper, I measure customer experience as call rates, i.e. the probability that a customer calls about a service disturbance. It usually takes a team of engineers, data scientists, and others to come up with theories about the root causes of a network issue and investigate them. The root cause investigative process is typically lengthy and requires many iterations of manual data analysis. It is therefore costly to the organization, not only in manual labor time, but also in the higher number of calls reaching the call centers. The goal is to identify, as quickly as possible, the characteristics or subset of customers contributing to the call rate delta between two products (Figure 1). This is challenging for two reasons: 1) it involves a lot of time-consuming manual work; 2) the highest delta could stem from a combination of multiple characteristics. Computing and plotting all possible call rates by hand is not feasible due to the huge number of possibilities.
To overcome these challenges, I propose a machine learning approach using association rule mining to automate the analysis. The tool is currently under development, and an early version is in use at one of Canada's largest CSPs to fast-track investigative processes. In summary, the user chooses the 'new' and 'current' product, and the tool computes all possible call rate permutations. The tool is intended to assist engineers and analysts in their delta investigations, so they can narrow them down to the subset of customers and call issues contributing most to the call rate delta.
At the time of writing, and to the best of the author's knowledge, there is no previous machine learning and automation work that correlates new network product launches with customer experience measured in call rates. The advantage of using calls is that they measure direct impact, and specifically an unknown or previously undetected impact. Many papers present methods to automatically detect and resolve issues as they occur in the network. However, the focus here is the quick resolution of a previously undetected issue that is directly affecting the customer. The work closest to this paper is [1] [2]. The authors identify the current challenges of machine learning applications as: 1) too many alarms: alarms raised by the network flood the operators, making it challenging to distinguish the ones with direct customer impact; 2) data assumptions: one reason many machine learning methods fail is that each model has to satisfy different data assumptions, and the data, particularly time series data, are prone to fluctuations and shifting distributions from one period to the next, so an algorithm trained on data at time t may not apply to data at time t+1, t+2, etc.; 3) data changes: the data format is sensitive to change as a result of internal data source logic changes, vendors adjusting their log measurements, incorporation of new metrics, etc. The solution of Deepak [1] [2] to the above is a multi-tier ensemble machine learning approach that dynamically adapts to a changing network data feature set.
My proposed framework addresses these challenges differently. Regarding the first point, the framework is specific to calls, meaning any insight derived from it directly measures customer impact. As for the second and third points, the proposed machine learning approach is unsupervised Apriori association rule mining; the model is run daily and is dynamic, adjustable, and scalable to any input data. Should any input parameter change from one day to the next, the algorithm's output from the time of the change onward adjusts automatically to the input data provided, with no need to retrain a model on the new feature set format.
Additionally, the root cause detection in Deepak [2] is pre-set. Essentially, the authors propose training several algorithms (RF, LR, etc.) and extracting the features with the highest importance to the model's prediction as the root cause. My proposed model does not define pre-set root causes, as it is concerned with unknown potential root causes. The framework reads from a pool of different metrics and generates a list of metrics, or combinations of metrics, that correlate most with the observed high call rates, in order to open new paths of investigation for unknown root causes. I explain the methodology in section 2, and discuss results and conclusions in section 3.

Methodology
The data is stored in a relational database, with each record representing a customer and the columns representing the characteristics of the customer. Table 1 provides examples of such metrics. Currently, around 30 metrics are collected daily per customer. After reading the data, the first step is to convert the records into a list of transactions. Each row, i.e. each customer's set of characteristics, is considered a transaction. In this transaction, the items are the customer's associated information (call status, network components, network telemetry). The next step is to split the data into NewProduct and CurrentProduct and feed each product's list of transactions to the association rule mining algorithm; in the proposed methodology, I use the Apriori algorithm [3] [4] [5]. An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are two mutually exclusive sets of observations. For an association rule to be of interest, it must satisfy two interest measures: support and confidence. Support indicates how often an observation or a set of observations appears in the dataset and equals P(X, Y). Confidence measures the strength of the rule and equals P(Y|X). A rule of the form {} => {Y} means that the observations in Y will appear with the probability given by the rule support (which equals the confidence) [5]. In this paper, we are interested in the confidence values for one particular Y: we want to know which X sets lead to Y: Call Status = Yes. Because confidence is P(Y|X), this choice of Y yields exactly the call rate: P(Call = Yes | X).
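The conversion of customer records into transactions and the computation of call rates as rule confidences can be sketched as follows. This is a minimal illustration with hypothetical attribute names and toy data; a brute-force itemset enumeration stands in for a full Apriori implementation (which would additionally prune infrequent itemsets):

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer records: each row becomes one transaction whose
# items encode attribute=value pairs (call status, network components, ...).
records = [
    {"call": "Yes", "modem": "M1", "region": "East"},
    {"call": "No",  "modem": "M1", "region": "East"},
    {"call": "Yes", "modem": "M2", "region": "West"},
    {"call": "No",  "modem": "M2", "region": "East"},
]
transactions = [frozenset(f"{k}={v}" for k, v in r.items()) for r in records]

def itemset_supports(transactions, max_len=2):
    """Support P(itemset) for every itemset up to max_len items,
    by exhaustive enumeration (an Apriori stand-in for small data)."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_len + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items()}

sup = itemset_supports(transactions)

def call_rate(x_items):
    """Confidence of X => {call=Yes}, i.e. P(call=Yes | X) = the call rate of X."""
    x = frozenset(x_items)
    joint = x | {"call=Yes"}
    return sup[joint] / sup[x] if joint in sup else 0.0

print(call_rate({"modem=M1"}))  # 1 of the 2 M1 customers called -> 0.5
```

In practice the roughly 30 daily metrics per customer would be encoded the same way, and the rule mining would be run separately on the NewProduct and CurrentProduct transaction lists.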
In summary, the algorithm computes all the probabilities and conditional probabilities of all itemsets in the given list of transactions. One of the items is 'Call = Yes' (an item present in transactions where the customer called in). We filter those rules from the output table and get the full list of call rates based on different customer and network characteristics, for itemsets ranging from 1 to n items. In this way, we obtain all call rates for NewProduct and for CurrentProduct. By concatenating both outputs, I calculate the delta: Delta = CurrentProduct call rate − NewProduct call rate. I sort the deltas in increasing order and highlight the top deltas. These correspond to the characteristics or subsets of customers that most increase the propensity of NewProduct customers to call compared with CurrentProduct customers.
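The delta computation and ranking step above can be sketched as follows, using hypothetical per-itemset call rates for the two products (the itemset names and numbers are illustrative only):

```python
# Hypothetical call rates per itemset, as produced by the rule mining step.
current_rates = {"modem=M1": 0.020, "modem=M2": 0.018, "region=East": 0.021}
new_rates     = {"modem=M1": 0.021, "modem=M2": 0.035, "region=East": 0.022}

# Delta = CurrentProduct call rate - NewProduct call rate; the most negative
# deltas flag itemsets where the new product is driving more calls.
deltas = {k: current_rates[k] - new_rates[k]
          for k in current_rates.keys() & new_rates.keys()}

# Sort in increasing order so the top of the list is the biggest contributor.
ranked = sorted(deltas.items(), key=lambda kv: kv[1])
print(ranked[0][0])  # the itemset contributing most to the call rate gap
```

Only itemsets present in both products' outputs are compared, mirroring the concatenation of the two rule tables described in the text.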
That being said, the average delta alone might not be the best indicator of a consistent issue with the new product. Sometimes the average is skewed by a peak in calls for the new product on only one or two days; since the behavior is not consistent, the high delta cannot be attributed to the new product's performance. To overcome this, I present only 'consistent' deltas. In other words, suppose we measure the delta of a specific itemset over 10 days, giving 10 delta data points. I check that all 10 points lie within 2 to 3 standard deviations of the mean of the 10 points. If so, the trend is consistent, and we have a daily consistent delta. If not, I mark the itemset as 'inconsistent' and it is not displayed to the end user.
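The consistency check can be sketched as below. The threshold k = 2.5 is a hypothetical midpoint of the "2 to 3 standard deviations" range stated above, and the two delta series are illustrative:

```python
from statistics import mean, pstdev

def is_consistent(daily_deltas, k=2.5):
    """Flag a delta series as consistent when every daily point lies within
    k standard deviations of the series mean (k between 2 and 3 per the text;
    2.5 here is an assumed default)."""
    mu = mean(daily_deltas)
    sigma = pstdev(daily_deltas)
    if sigma == 0:
        return True  # a perfectly flat series is trivially consistent
    return all(abs(d - mu) <= k * sigma for d in daily_deltas)

# A steady 10-day delta series vs. one distorted by a single-day call spike.
steady = [0.015, 0.016, 0.014, 0.015, 0.017, 0.015, 0.016, 0.014, 0.015, 0.016]
spiky  = [0.001, 0.000, 0.002, 0.001, 0.000, 0.001, 0.090, 0.001, 0.002, 0.001]

print(is_consistent(steady))  # True: shown to the end user
print(is_consistent(spiky))   # False: marked 'inconsistent' and hidden
```

Note that with only 10 points a single outlier can never exceed 3 population standard deviations, so a threshold strictly below 3 is needed for a one-day spike to be caught.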

Results and Discussion
The goal of this framework is to assess the performance of a new product to ensure it is on par with or better than the current product. The KPI that indicates a direct service disturbance to the customer is the call rate. If call rates are higher on the new product, an investigation is launched. The challenge is that the investigation is broad because of the interconnected nature of the network components: there are hundreds of possible components and combinations of metrics that could point towards the potential issues. The challenge is understanding the types and subsets of customers that are impacted, and pointing towards the metrics that differ most between customers calling about the new product and those calling about the current one. We want to understand what unknown problem was introduced, if any, and where the organization should focus its resources and investigations to reach the fastest and most effective resolution possible.
The proposed framework leverages the Apriori association rule algorithm, which automatically provides the call rates for all possible metrics and metric combinations for both products (new and current). I then propose ranking all these rules from the ones contributing most to the call rate delta between the two products to the least. This framework is currently in use at one of Canada's largest CSPs and is being developed into a production product. It has helped reduce investigative time by almost half. The proposed methodology faces some challenges: 1) the conclusive insights are highly dependent on the base table data. A contributor to the call rate delta may not be fed to the algorithm, resulting in missing important insights that could determine the actual root cause(s) of the call rate discrepancy. To overcome this, it is very important to keep enriching the base table with as much data and as many network characteristics as possible, so as to provide a holistic view of the network and the customer experience. On the other hand, an advantage of this framework is that it is dynamic and scalable; no part of the framework needs to be changed manually when more data is added to the base table, as this is handled automatically by the algorithm. 2) Relying solely on the mean deltas over the analyzed period can be unreliable, as peaks in the deltas can skew the average. For that, I incorporate a consistency metric to ensure the deltas between the two products' call rates are consistent over time. In the future, additional statistical methods can be incorporated to enhance the validity of the output displayed to the end user.
Finally, although the current paper is specific to the call rate KPI, the framework extends to any other KPI that measures the rate difference between two products. Currently, the framework takes in daily data, concatenates it, and outputs the results. In the future, the process can be made even more efficient if data is collected more frequently, say hourly, as this allows characteristics and patterns of potential issues with a new product to be detected in near real time. That allows quicker proactive measures to be taken on the CSP side, and also aligns with an agile way of working.

Table 1. Input Data Example and Description