Building An Analytical Data Cube
Another important point when building analytical data cubes is that the more indexed fields you utilize in your cube, the faster you can drill into your cube. When I am building a cube for analysis I will normal draw the data from a view I have created in SQL and apply indexing to those fields which might not be indexed in the hosting database.
The following fields are contained in my database that have statistical value:
- Purchase Type: a repeating field indicating what type of purchase was made
- Platform: a repeating field indicating which eCommerce platform a purchase was associated with
- Card Type: A repeating field indicating what type of card was used (AMEX,MC,VISA)
- Transaction Amount: While this field is going to contain lots of unique amounts, for statistical purposes I can group those values in my cube such as between $50-$100
- Transaction Date
- Bank Identifier: The code on credit cards and routing numbers for direct debit, which identify the issuing financial institution
- Fraud Score: The score assigned to the transaction by my risk model
- IP City, State and Country: For determining the source of eCommerce transactions
- Transaction State: A field indicating the state of the transaction such as "chargeback" or "fraud" or "dispute"
Once I have selected the fields I am going to utilize I build my cube utilizing analytical services in SQL Server. We can verify that the cube was built successfully by opening Analytical Services on the server containing your database. From the example below you can see that I have built several cubes around my eCommerce data.
Another important note about building analytical cubes is that the fields you select must have relationships to one an another or you will loose data in your cube. For example if I build a cube with "transaction date" and "IP country", every field in "IP Country" has to be populated, if I have any nulls in that column I will loose the entire line in my data and it will not be included in the aggregate set. For that reason it is important to plan which columns to use in your data model based on the statical analysis you are going to perform. If the IP Country is important, loosing a thousand transactions with no IP Country captured might not be a big deal, however if it is important in your statical analysis that all transactions be captured, you can only build your cube with columns that every field in each column has a relation to the other with no nulls in your data.
Drilling Into Your DataLet's being drilling down into my transactions to find specific fraud issues to visualize through link analysis. I am starting with all of my transactions in my cube:
From here I am going to drill down into transaction date to determine which months had the highest amount of chargebacks. When you build a data cube in any program, SQL analytical services takes date columns and breaks them out so you can drill down by Year, Month, Day or Day of the Week, Week Day or Week Number automatically:
From this first drill down I can see that I had a huge spike in chargeback activity in April that I probably want to take a look at. For my next drill down I am going to look at all chargebacks for April 2009 by Country:
From this visualization you can see that there are allot of countries in my data. As I am looking for the main countries responsible for my chargebacks I am going to filter the results by look at the top 10 countries:
Now that I have filtered my results, its easy to see which countries I need to focus on. I am going to take the largest chargeback origination country, U.S., and drill down to get specific details on where these chargebacks are coming from. My next drill down is going to be by State:
From here I am going to sort the States from highest to lowest to determine which state had the highest amount of chargeback activity.
I can see that California had the highest number of chargebacks in April 2009. It is important to compare the total number of chargebacks by state to the total number of overall transactions. One of the reasons that California might have the highest number of chargebacks is because they also had the highest number of overall transactions so I want to establish a ratio of chargebacks to transactions to confirm any issues that might exist.
For this example we are going to assume that the ratio of chargebacks to transactions in California is out of wack and verifies that there is a problem. Now I am going to back up to my original cube with all transactions and begin drilling down into the data with the goal of isolating California's problem. In Data Miner this is very easy as the program keeps a history of each drill down, so all I have to do is click on the original cube in the history to return to it. For those using Visual Studio, this will involve beginning a new query altogether.
I return to my originial cube and drill down by Country and then by State selecting on California. Now I am going to drill down on California by Month. This was not possible from the point I was at in my last series of drill downs as I was only visualizing California for April 2009. If I was to view chargeback activity for California across all months I need to follow the route I am at now.
Now I can see that Los Angeles had the highest concentration of chargebacks for April 2009. The next few steps drill down to isolate the issues that exist in Los Angeles that may be leading to my fraud problem there:
Visualizing The Statistical Analysis Set:
Now that I have pulled all of the records into Analyst Notebook for visualization, I can expand on these records to include all fields in my database, including those I did not use in my data cube. This will allow me to perform tactical analysis on the fraud chargebacks to determine the source of my fraud problem:
Now that I have expanded on all of my fraud chargeback entities, I have brought in the card information, the card holder information, the IP addresses and the credit card transaction history for each of the individuals.
I am going to focus in on the largest cluster of interrelated activity in my visualization for analysis. By placing the largest cluster into a new chart for analysis and visualization and looking at the transactions in a hirachy view I can see that numerous chargebacks are all associated with a store where the cards were last used. Additionally all of the cards were swipped by the same associate at the store within a three day period of my chargebacks occurring.
There is no pattern in the IP addresses or in the individual card holders. The only central association is from the charge history of the individuals which points to an issue with a specific merchant and a specific operator at the store which is most likely the source of my fraud chargebacks.
In Conclusion
In this example we have sucessfully leveraged statistical analysis to identify a specific set of issues within a large data source for visual analysis. Performing link analysis on the entire group of 220,000 records would have been impossible but by drilling down into this mass of data to discover anomolies in activity, we have sucessfully identified a fraud issue across years of chargeback data.
Link and association analyis in a visual form is a powerful tool, however, by integrating visual analysis with other forms of data mining we are able to perform analysis across a much larger set of data then would be possible by query and visualization alone.