Welcome


Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by examining information holistically. Unlike traditional data mining, visualizing information allows patterns of activity that run contrary to the norm to surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns, and threats that would take hours or days to uncover using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since we first learned to read as children. As our instincts develop over time, so does our ability to process complex concepts through visual identification. That is the power of visual analysis I focus on throughout this site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Integrating Statistical Analysis With Link Analysis

As discussed in previous posts, visual link analysis is a powerful tool for determining and confirming relationships between entities across a large span of data. However, its limitation is that the machine and the user can only take in so many entities and links in one workspace before the quality of the analysis begins to suffer.

The ability to drill down into data and extract subsets for visual analysis is extremely important. One way we discussed doing this was through data queries and sets; the second method, which I will discuss in this article, is through an integration of statistical analysis and visual analysis.

For the purpose of this article I am going to be using i2's Analyst Workstation, which integrates Data Miner, a statistical data analysis tool, with Analyst Notebook for visualization. This can also be accomplished through a variety of desktop tools, depending on the size of the data, such as Access, Excel, Crystal Reports, and Visual Studio. The benefit of using the i2 tool is the ability to switch from statistical analysis to visual analysis without any additional import steps.

Building An Analytical Data Cube

This step is specific to tools that leverage Analysis Services in SQL Server; if you are using Excel or Access, you can skip it. A data cube is a subset of all the data contained in a database, built from the fields that have statistical value.

In statistical analysis you use only the fields in your database that have statistical relevance: those that are repeating rather than unique. For example, in the database I am going to analyze in Data Miner, I have fields such as "Credit Card Holder's Name," which has no statistical value, and "Chargeback Date," which does. My first step is to review my data and determine which fields I want to use in building my data cube. The better I define these fields, the faster I will be able to drill down into the data, as my cube will hold only relevant statistical data. When I import the data into my visualization tool, I can always integrate all of the other attributes in my data set to form relationships.

Another important point when building analytical data cubes is that the more indexed fields you utilize in your cube, the faster you can drill into it. When I am building a cube for analysis, I will normally draw the data from a view I have created in SQL and apply indexing to those fields that might not be indexed in the hosting database.
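
As a rough sketch, the supporting indexes might look like this in T-SQL; the table and column names here are hypothetical:

    -- Hypothetical example: index the base-table columns the cube will
    -- slice on so that drill-downs against the view stay fast.
    CREATE NONCLUSTERED INDEX IX_Transactions_TransactionDate
        ON dbo.Transactions (TransactionDate);

    CREATE NONCLUSTERED INDEX IX_Transactions_IPCountry
        ON dbo.Transactions (IPCountry);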

For the purpose of this example I am going to build an analytical cube around eCommerce transaction payments to detect patterns in fraudulent electronic payments.

The following fields are contained in my database that have statistical value:

  • Purchase Type: a repeating field indicating what type of purchase was made

  • Platform: a repeating field indicating which eCommerce platform a purchase was associated with

  • Card Type: a repeating field indicating what type of card was used (AMEX, MC, VISA)

  • Transaction Amount: while this field is going to contain many unique amounts, for statistical purposes I can group those values in my cube into ranges such as $50-$100 (see the view sketch following this list)

  • Transaction Date

  • Bank Identifier: The code on credit cards and routing numbers for direct debit, which identify the issuing financial institution

  • Fraud Score: The score assigned to the transaction by my risk model

  • IP City, State and Country: For determining the source of eCommerce transactions

  • Transaction State: A field indicating the state of the transaction such as "chargeback" or "fraud" or "dispute"

The fields I am leaving out are those with unique identifiers and no statistical relevance, such as name, street address, or phone number. It is also important to note that your analytical cube can use any combination of fields with statistical relevance, based on the type of analysis you are going to perform. For example, if I were only going to look at fraud charges by country, I could leave out fields such as "Fraud Score" or "Transaction Date" because they do not apply to the scenario I am examining and would slow down drilling into the data.
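
To make this concrete, here is a minimal sketch of the kind of view I might build as the cube's source. The table and column names are hypothetical, and the CASE expression shows the amount banding described in the field list:

    -- Hypothetical view exposing only the statistically relevant fields.
    -- Raw transaction amounts are banded into ranges for the cube.
    CREATE VIEW dbo.vChargebackCube AS
    SELECT
        PurchaseType,
        Platform,
        CardType,                 -- AMEX, MC, VISA
        CASE
            WHEN TransactionAmount < 50   THEN 'Under $50'
            WHEN TransactionAmount <= 100 THEN '$50-$100'
            WHEN TransactionAmount <= 500 THEN '$100-$500'
            ELSE 'Over $500'
        END AS AmountBand,
        TransactionDate,
        BankIdentifier,
        FraudScore,
        IPCity,
        IPState,
        IPCountry,
        TransactionState          -- 'chargeback', 'fraud', 'dispute', etc.
    FROM dbo.Transactions;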

One of the important things to remember about building analytical cubes is that, as opposed to databases, which capture fields for every possible scenario in your environment, data cubes are customized for the exact type of analysis you are going to perform, and their performance and accuracy depend on utilizing only those fields in your data that are relevant to your analysis.

Once I have selected the fields I am going to utilize, I build my cube using Analysis Services in SQL Server. We can verify that the cube was built successfully by opening Analysis Services on the server containing the database. From the example below, you can see that I have built several cubes around my eCommerce data.

[Image: Analysis Services showing the cubes built around the eCommerce data]

Another important note about building analytical cubes is that the fields you select must have relationships to one another, or you will lose data in your cube. For example, if I build a cube with "Transaction Date" and "IP Country," every field in "IP Country" has to be populated; if I have any nulls in that column, I will lose the entire line of data, and it will not be included in the aggregate set. For that reason it is important to plan which columns to use in your data model based on the statistical analysis you are going to perform. If IP Country is merely of interest, losing a thousand transactions with no IP Country captured might not be a big deal; however, if it is important to your statistical analysis that all transactions be captured, you can only build your cube from columns in which every field relates to the others, with no nulls in your data.
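
Before committing to a column, it is worth measuring what nulls would cost you. A quick sketch against the hypothetical view above: count the rows that would drop out, or substitute a placeholder value so they survive.

    -- How many rows would silently fall out of the cube?
    SELECT COUNT(*) AS RowsLost
    FROM dbo.vChargebackCube
    WHERE IPCountry IS NULL;

    -- Alternatively, keep those rows by substituting a placeholder value.
    SELECT ISNULL(IPCountry, 'Unknown') AS IPCountry,
           COUNT(*) AS Transactions
    FROM dbo.vChargebackCube
    GROUP BY ISNULL(IPCountry, 'Unknown');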

Drilling Into Your Data

Now that I have built my data cube, I am going to begin my statistical analysis. Because I am using Data Miner, this analysis can also be visual, making it much faster to understand than looking at aggregate text fields in a spreadsheet format.

Let's begin drilling down into my transactions to find specific fraud issues to visualize through link analysis. I am starting with all of my transactions in my cube:

[Image: the full set of transactions in the cube]

From here I am going to drill down into transaction date to determine which months had the highest number of chargebacks. When you build a data cube, SQL Server Analysis Services automatically breaks date columns out so you can drill down by Year, Month, Day, Day of the Week, or Week Number:

[Image: drill-down on transaction date by month]

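In plain SQL terms, this drill-down is equivalent to grouping on the parts of the date column; a sketch against the hypothetical view from earlier:

    -- Chargebacks by year and month, mirroring the cube's date hierarchy.
    SELECT
        YEAR(TransactionDate)  AS Yr,
        MONTH(TransactionDate) AS Mo,
        COUNT(*)               AS Chargebacks
    FROM dbo.vChargebackCube
    WHERE TransactionState = 'chargeback'
    GROUP BY YEAR(TransactionDate), MONTH(TransactionDate)
    ORDER BY Yr, Mo;
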
From this first drill down I can see that I had a huge spike in chargeback activity in April that I probably want to take a look at. For my next drill down I am going to look at all chargebacks for April 2009 by Country:

[Image: April 2009 chargebacks by country]

From this visualization you can see that there are a lot of countries in my data. As I am looking for the main countries responsible for my chargebacks, I am going to filter the results by looking at the top 10 countries:

[Image: top 10 countries by chargeback volume]

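For those following along in SQL rather than Data Miner, the same filter is a TOP query, again using the hypothetical view:

    -- Top 10 countries by chargeback count for April 2009.
    SELECT TOP 10
        IPCountry,
        COUNT(*) AS Chargebacks
    FROM dbo.vChargebackCube
    WHERE TransactionState = 'chargeback'
      AND TransactionDate >= '2009-04-01'
      AND TransactionDate <  '2009-05-01'
    GROUP BY IPCountry
    ORDER BY Chargebacks DESC;
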
Now that I have filtered my results, it's easy to see which countries I need to focus on. I am going to take the largest chargeback origination country, the U.S., and drill down to get specific details on where these chargebacks are coming from. My next drill down is going to be by State:

[Image: U.S. chargebacks drilled down by state]

From here I am going to sort the States from highest to lowest to determine which state had the highest amount of chargeback activity.


I can see that California had the highest number of chargebacks in April 2009. It is important to compare the total number of chargebacks by state to the total number of overall transactions: one reason California might have the highest number of chargebacks is that it also had the highest number of overall transactions, so I want to establish a ratio of chargebacks to transactions to confirm any issue that might exist.
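
A sketch of that ratio in SQL, using the same hypothetical view; the state with the highest raw count is not necessarily the state with the worst rate:

    -- Chargeback-to-transaction ratio by state, normalizing for volume.
    -- 'United States' is an assumed country label for this example.
    SELECT
        IPState,
        COUNT(*) AS Transactions,
        SUM(CASE WHEN TransactionState = 'chargeback' THEN 1 ELSE 0 END) AS Chargebacks,
        CAST(SUM(CASE WHEN TransactionState = 'chargeback' THEN 1 ELSE 0 END) AS FLOAT)
            / COUNT(*) AS ChargebackRate
    FROM dbo.vChargebackCube
    WHERE IPCountry = 'United States'
      AND TransactionDate >= '2009-04-01'
      AND TransactionDate <  '2009-05-01'
    GROUP BY IPState
    ORDER BY ChargebackRate DESC;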


For this example we are going to assume that the ratio of chargebacks to transactions in California is out of whack and confirms that there is a problem. Now I am going to back up to my original cube with all transactions and begin drilling down into the data with the goal of isolating California's problem. In Data Miner this is very easy, as the program keeps a history of each drill down; all I have to do is click on the original cube in the history to return to it. For those using Visual Studio, this will involve beginning a new query altogether.

[Image: drill-down history in Data Miner, returning to the original cube]

I return to my original cube and drill down by Country and then by State, selecting California. Now I am going to drill down on California by Month. This was not possible from where I was in my last series of drill downs, as I was only visualizing California for April 2009; to view chargeback activity for California across all months, I need to follow the route I am on now.

From this visualization, I can see that California had a spike in chargeback activity in April 2009, confirming what I found in my last cube series. For my analysis I have sorted in Data Miner in descending order to determine which month had the highest amount of activity. My next step is to isolate the region where the chargebacks are occurring by drilling down into City:

One of the first things I am going to encounter is that every California city where I have ever had a transaction will be included in my drill down, even cities with no chargebacks in April 2009. To remove these fields I need to filter on null values so that only cities with chargebacks in April 2009 are displayed. In Data Miner this is a two-click process, but it can also be done through filtering in Access, Excel, or Visual Studio.
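
In SQL, the same two-click filter is just an IS NOT NULL predicate alongside the chargeback conditions, so only cities that actually had April 2009 chargebacks appear (hypothetical names again):

    -- Only cities with chargebacks in April 2009 survive the filter.
    SELECT IPCity, COUNT(*) AS Chargebacks
    FROM dbo.vChargebackCube
    WHERE IPState = 'California'
      AND IPCity IS NOT NULL
      AND TransactionState = 'chargeback'
      AND TransactionDate >= '2009-04-01'
      AND TransactionDate <  '2009-05-01'
    GROUP BY IPCity
    ORDER BY Chargebacks DESC;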

[Image: California cities with chargebacks in April 2009]

Now I can see that Los Angeles had the highest concentration of chargebacks for April 2009. The next few steps drill down to isolate the issues that exist in Los Angeles that may be leading to my fraud problem there:


Visualizing The Statistical Analysis Set


I have drilled down by payment type and by day of the week. Now that I have established which card type and which day of the week are the most problematic, I am going to switch to visual analysis to see if I can find out who is committing the fraud. As Data Miner is integrated with iBase, my database of chargebacks, and Analyst Notebook, I can send all the records from Saturday to my link analysis with one right click. If I were using Access or Excel to data mine, I would simply write a new query against my database for all transactions with a transaction date of April 2009 and a payment instrument of "Visa" that occurred in Los Angeles, California, and import that data into my visualization program. Let's see what we find:
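
A sketch of that extraction query against the same hypothetical table; it pulls every field so the visualization tool has the full records to work with:

    -- All April 2009 Visa chargebacks originating in Los Angeles, CA,
    -- ready for import into the visualization tool.
    SELECT *
    FROM dbo.Transactions
    WHERE TransactionState = 'chargeback'
      AND CardType = 'VISA'
      AND IPCity = 'Los Angeles'
      AND IPState = 'California'
      AND TransactionDate >= '2009-04-01'
      AND TransactionDate <  '2009-05-01';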

[Image: the Los Angeles chargeback records visualized in Analyst Notebook]

Now that I have pulled all of the records into Analyst Notebook for visualization, I can expand on these records to include all fields in my database, including those I did not use in my data cube. This will allow me to perform tactical analysis on the fraud chargebacks to determine the source of my fraud problem:

[Image: the expanded chart showing cards, card holders, IP addresses, and transaction history]

Now that I have expanded on all of my fraud chargeback entities, I have brought in the card information, the card holder information, the IP addresses and the credit card transaction history for each of the individuals.


I am going to focus on the largest cluster of interrelated activity in my visualization. By placing the largest cluster into a new chart and looking at the transactions in a hierarchy view, I can see that numerous chargebacks are all associated with a store where the cards were last used. Additionally, all of the cards were swiped by the same associate at that store within a three-day period of my chargebacks occurring.

[Image: hierarchy view of the largest cluster, centered on a single store and operator]

There is no pattern in the IP addresses or among the individual card holders. The only central association comes from the charge history of the individuals, which points to a specific merchant and a specific operator at that store, who is most likely the source of my fraud chargebacks.


In Conclusion


In this example we have successfully leveraged statistical analysis to identify a specific set of issues within a large data source for visual analysis. Performing link analysis on the entire group of 220,000 records would have been impossible, but by drilling down into this mass of data to discover anomalies in activity, we have successfully identified a fraud issue across years of chargeback data.


Link and association analysis in a visual form is a powerful tool; however, by integrating visual analysis with other forms of data mining, we are able to perform analysis across a much larger set of data than would be possible through query and visualization alone.