Welcome


Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information allows patterns of activity that run contrary to normal behavior to surface after only a few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days to uncover using conventional data mining.

The eye processes information far more rapidly when it is presented as images; this has been true since we first learned to read as children. As our instincts develop over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on on this site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Integrating Statistical Analysis With Link Analysis

As discussed in previous posts, visual link analysis is a powerful tool for determining and confirming relationships between entities across a large span of data. Its limitation, however, is that the machine and the user can only take in so many entities and links in one workspace before the quality of the analysis begins to suffer.

The ability to drill down into data and extract subsets for visual analysis is extremely important. One way of doing this, discussed previously, is through data queries and sets; the second method, which I will discuss in this article, is through an integration of statistical analysis and visual analysis.

For the purpose of this article I am going to be using i2's Analyst Workstation, which integrates Data Miner, a statistical data analysis tool, with Analyst Notebook for visualization. The same workflow can also be accomplished with a variety of desktop tools, depending on the size of the data, such as Access, Excel, Crystal Reports and Visual Studio. The benefit of using the i2 tool is the ability to switch from statistical analysis to visual analysis without any additional import steps.



Building An Analytical Data Cube

This step is specific to tools that leverage Analysis Services in SQL Server; if you are using Excel or Access you can skip it. A data cube is a subset of the data contained in a database, built from the fields that have statistical value.

In statistical analysis you only use the fields in your database that have statistical relevance: those which are repeating rather than unique. For example, in the database I am going to analyze in Data Miner, I have fields such as "Credit Card Holder's Name" which have no statistical value, and "Chargeback Date" which do. My first step is to review my data and determine which fields I want to use in building my data cube. The better I define these fields, the faster I will be able to drill down into the data, because my cube will only hold relevant statistical data. When I import the data into my visualization tool, I can always bring in all of the other attributes in my data set to form relationships.

Another important point when building analytical data cubes is that the more indexed fields you utilize in your cube, the faster you can drill into it. When I am building a cube for analysis I will normally draw the data from a view I have created in SQL and apply indexing to those fields which might not be indexed in the hosting database.
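As a rough illustration, here is what that source view and indexing might look like in T-SQL. The table and column names (dbo.Transactions, TransactionDate, IPCountry and so on) are hypothetical stand-ins for whatever your hosting database actually contains:

    -- View that exposes only the statistically relevant fields to the cube
    -- (hypothetical names throughout).
    CREATE VIEW dbo.vw_CubeSource
    AS
    SELECT  TransactionID,
            PurchaseType,
            Platform,
            CardType,
            TransactionAmount,
            TransactionDate,
            BankIdentifier,
            FraudScore,
            IPCity, IPState, IPCountry,
            TransactionState
    FROM    dbo.Transactions;
    GO

    -- Index the columns the cube will slice on if the hosting database
    -- does not already index them.
    CREATE NONCLUSTERED INDEX IX_Transactions_TransactionDate
        ON dbo.Transactions (TransactionDate);
    CREATE NONCLUSTERED INDEX IX_Transactions_IPCountry
        ON dbo.Transactions (IPCountry, IPState, IPCity);
    GO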

For the purpose of this example I am going to build an analytical cube around eCommerce transaction payments to detect patterns in fraudulent electronic payments.

The following fields are contained in my database that have statistical value:

  • Purchase Type: a repeating field indicating what type of purchase was made

  • Platform: a repeating field indicating which eCommerce platform a purchase was associated with

  • Card Type: A repeating field indicating what type of card was used (AMEX, MC, VISA)

  • Transaction Amount: While this field contains many unique amounts, for statistical purposes I can group those values into bands in my cube, such as $50-$100 (a bucketing sketch appears after this list)

  • Transaction Date

  • Bank Identifier: The code on credit cards and routing numbers for direct debit, which identify the issuing financial institution

  • Fraud Score: The score assigned to the transaction by my risk model

  • IP City, State and Country: For determining the source of eCommerce transactions

  • Transaction State: A field indicating the status of the transaction, such as "chargeback", "fraud" or "dispute"

The fields I am leaving out are those with unique identifiers and no statistical relevance, such as name, street address or phone number. It is also important to note that your analytical cube can use any combination of fields with statistical relevance, based on the type of analysis you are going to perform. For example, if I were only going to look at fraud charges by country, I could leave out fields such as "Fraud Score" or "Transaction Date" because they do not apply to the scenario I am examining and would slow down drilling into the data.
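As a sketch of the amount banding mentioned in the field list above, the grouping can be done with a simple CASE expression in the view that feeds the cube. Again, dbo.Transactions, its columns and the band boundaries are hypothetical:

    -- Banding "Transaction Amount" into repeating ranges so it becomes
    -- statistically useful inside the cube (hypothetical names and bands).
    SELECT  TransactionID,
            CASE
                WHEN TransactionAmount <  50    THEN 'Under $50'
                WHEN TransactionAmount <= 100   THEN '$50-$100'
                WHEN TransactionAmount <= 500   THEN '$101-$500'
                WHEN TransactionAmount <= 1000  THEN '$501-$1,000'
                ELSE 'Over $1,000'
            END AS AmountBand
    FROM    dbo.Transactions;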

One of the important things to remember about building analytical cubes is that, as opposed to databases, which capture fields for every possible scenario in your environment, data cubes are customized for the exact type of analysis you are going to perform. Their performance and accuracy depend on utilizing only those fields in your data that are relevant to your analysis.

Once I have selected the fields I am going to utilize, I build my cube using Analysis Services in SQL Server. We can verify that the cube was built successfully by opening Analysis Services on the server containing the database. From the example below you can see that I have built several cubes around my eCommerce data.










Another important note about building analytical cubes is that the fields you select must have relationships to one another or you will lose data in your cube. For example, if I build a cube with "Transaction Date" and "IP Country", every field in "IP Country" has to be populated; if I have any nulls in that column I will lose the entire row and it will not be included in the aggregate set. For that reason it is important to plan which columns to use in your data model based on the statistical analysis you are going to perform. If IP Country matters but completeness does not, losing a thousand transactions with no IP Country captured might not be a big deal; however, if it is important to your statistical analysis that all transactions be captured, you can only build your cube from columns in which every field relates to the others, with no nulls in your data.
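A quick way to plan for this, sketched below with the same hypothetical names, is to count how many rows would fall out of the cube and, if every transaction must be kept, substitute a placeholder member for the nulls before the cube is built:

    -- How many rows would drop out of a cube built on "IP Country"
    -- because the column is null? (hypothetical names)
    SELECT COUNT(*) AS RowsWithoutCountry
    FROM   dbo.Transactions
    WHERE  IPCountry IS NULL;

    -- If every transaction must be represented in the aggregate,
    -- substitute a placeholder member rather than letting null rows fall out.
    SELECT  TransactionID,
            TransactionDate,
            COALESCE(IPCountry, 'Unknown') AS IPCountry
    FROM    dbo.Transactions;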

Drilling Into Your Data

Now that I have built my data cube I am going to begin my statistical analysis. Because I am using Data Miner, this can also be done visually, making my statistical analysis much faster to understand than looking at aggregate text fields in a spreadsheet format.

Let's begin drilling down into my transactions to find specific fraud issues to visualize through link analysis. I am starting with all of the transactions in my cube:










From here I am going to drill down into the transaction date to determine which months had the highest number of chargebacks. When you build a data cube, SQL Server Analysis Services automatically breaks date columns out so that you can drill down by year, month, day, day of the week or week number:
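For readers following along in Access, Excel or Visual Studio rather than Data Miner, the same month-level drill down can be expressed as an ordinary aggregate query against the source database (T-SQL shown; the names are hypothetical):

    -- Chargebacks per year and month (hypothetical names).
    SELECT  YEAR(TransactionDate)  AS ChargebackYear,
            MONTH(TransactionDate) AS ChargebackMonth,
            COUNT(*)               AS Chargebacks
    FROM    dbo.Transactions
    WHERE   TransactionState = 'chargeback'
    GROUP BY YEAR(TransactionDate), MONTH(TransactionDate)
    ORDER BY ChargebackYear, ChargebackMonth;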












From this first drill down I can see that I had a huge spike in chargeback activity in April that I probably want to take a look at. For my next drill down I am going to look at all chargebacks for April 2009 by Country:













From this visualization you can see that there are a lot of countries in my data. As I am looking for the main countries responsible for my chargebacks, I am going to filter the results by looking at the top 10 countries:
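The equivalent of this top 10 filter as a plain query, using the same hypothetical names (the 'US' country value is also an assumption about how the data is coded), would be:

    -- Top 10 chargeback countries for April 2009 (hypothetical names).
    SELECT TOP (10)
            IPCountry,
            COUNT(*) AS Chargebacks
    FROM    dbo.Transactions
    WHERE   TransactionState = 'chargeback'
      AND   TransactionDate >= '2009-04-01'
      AND   TransactionDate <  '2009-05-01'
    GROUP BY IPCountry
    ORDER BY Chargebacks DESC;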




































Now that I have filtered my results, it's easy to see which countries I need to focus on. I am going to take the largest chargeback origination country, the U.S., and drill down to get specific details on where these chargebacks are coming from. My next drill down is going to be by state:




















From here I am going to sort the states from highest to lowest to determine which state had the most chargeback activity.


I can see that California had the highest number of chargebacks in April 2009. It is important to compare the total number of chargebacks by state to the total number of overall transactions. One of the reasons California might have the highest number of chargebacks is that it also had the highest number of overall transactions, so I want to establish a ratio of chargebacks to transactions to confirm any issues that might exist.
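A sketch of that ratio check as a query, again with hypothetical names, might look like this:

    -- Chargeback-to-transaction ratio by U.S. state for April 2009,
    -- to test whether California is a genuine outlier or simply the
    -- largest market (hypothetical names).
    SELECT  IPState,
            COUNT(*) AS TotalTransactions,
            SUM(CASE WHEN TransactionState = 'chargeback' THEN 1 ELSE 0 END) AS Chargebacks,
            CAST(SUM(CASE WHEN TransactionState = 'chargeback' THEN 1 ELSE 0 END) AS FLOAT)
                / COUNT(*) AS ChargebackRatio
    FROM    dbo.Transactions
    WHERE   IPCountry = 'US'
      AND   TransactionDate >= '2009-04-01'
      AND   TransactionDate <  '2009-05-01'
    GROUP BY IPState
    ORDER BY ChargebackRatio DESC;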


For this example we are going to assume that the ratio of chargebacks to transactions in California is out of whack and confirms that there is a problem. Now I am going to back up to my original cube with all transactions and begin drilling down into the data with the goal of isolating California's problem. In Data Miner this is very easy, as the program keeps a history of each drill down, so all I have to do is click on the original cube in the history to return to it. For those using Visual Studio, this will involve beginning a new query altogether.
















I return to my original cube and drill down by country and then by state, selecting California. Now I am going to drill down on California by month. This was not possible from the point I was at in my last series of drill downs, as I was only visualizing California for April 2009; to view chargeback activity for California across all months I need to follow the route I am on now.

From this visualization, I can see that California had a spike in chargeback activity in April 2009, confirming what I found in my last cube series. For my analysis I have sorted in Data Miner in descending order to determine which month had the highest amount of activity. My next step is to isolate the region where the chargebacks are occurring by drilling down into city:

One of the first things I am going to encounter is that every California city where I have ever had a transaction is going to be included in my drill down, even cities with no chargebacks in April 2009. To remove these fields I need to filter on null values so that only cities with chargebacks in April 2009 are displayed. In Data Miner this is a two-click process, but it can also be done through filtering in Access, Excel or Visual Studio.
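Outside of Data Miner, the simplest way to get the same result is to aggregate only the chargeback rows for April 2009; cities that had no chargebacks in that month then never appear in the output. A sketch with the same hypothetical names:

    -- Only California cities that actually had chargebacks in April 2009.
    -- Because the rows are filtered before grouping, cities with no
    -- chargebacks in that month drop out automatically (hypothetical names).
    SELECT  IPCity,
            COUNT(*) AS Chargebacks
    FROM    dbo.Transactions
    WHERE   TransactionState = 'chargeback'
      AND   IPCountry = 'US'
      AND   IPState   = 'California'
      AND   TransactionDate >= '2009-04-01'
      AND   TransactionDate <  '2009-05-01'
    GROUP BY IPCity
    ORDER BY Chargebacks DESC;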




















Now I can see that Los Angeles had the highest concentration of chargebacks for April 2009. The next few steps drill down to isolate the issues that exist in Los Angeles that may be leading to my fraud problem there:


Visualizing The Statistical Analysis Set


I have drilled down by payment type and by day of the week. Now that I have established which card and which day of the week are the most problematic, I am going to switch to visual analysis to see if I can find out who is committing the fraud. As Data Miner is integrated with iBase, my database of chargebacks, and with Analyst Notebook, I can send all the records from Saturday to my link analysis with one right click. If I were using Access or Excel to data mine, I would simply write a new query against my database for all transactions with a transaction date of April 2009 and a payment instrument of "Visa" that occurred in Los Angeles, California, and import that data into my visualization program. Let's see what we find:
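For the Access, Excel or Visual Studio route, that extraction query might look like the following sketch (hypothetical names; the day-of-week condition is optional):

    -- April 2009 Visa transactions from Los Angeles, California,
    -- pulled for import into the visualization tool (hypothetical names).
    SELECT  *
    FROM    dbo.Transactions
    WHERE   CardType  = 'VISA'
      AND   IPCountry = 'US'
      AND   IPState   = 'California'
      AND   IPCity    = 'Los Angeles'
      AND   TransactionDate >= '2009-04-01'
      AND   TransactionDate <  '2009-05-01';
    -- Add: AND DATENAME(WEEKDAY, TransactionDate) = 'Saturday'
    -- to mirror the day-of-week drill down described above.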



























Now that I have pulled all of the records into Analyst Notebook for visualization, I can expand on these records to include all fields in my database, including those I did not use in my data cube. This will allow me to perform tactical analysis on the fraud chargebacks to determine the source of my fraud problem:










Now that I have expanded on all of my fraud chargeback entities, I have brought in the card information, the card holder information, the IP addresses and the credit card transaction history for each of the individuals.


I am going to focus on the largest cluster of interrelated activity in my visualization for analysis. By placing the largest cluster into a new chart and looking at the transactions in a hierarchy view, I can see that numerous chargebacks are all associated with a store where the cards were last used. Additionally, all of the cards were swiped by the same associate at the store within a three-day period of my chargebacks occurring.
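The same question can also be asked of the underlying data directly. The sketch below is one way to surface a common merchant and operator across the purchase history of cards that later charged back; dbo.CardPurchaseHistory, its columns and the three-day window are hypothetical and assume the charge history is stored alongside the transactions:

    -- Which merchant and store operator appear in the recent purchase
    -- history of more than one card that later charged back? (hypothetical names)
    SELECT  h.MerchantName,
            h.OperatorID,
            COUNT(DISTINCT h.CardNumber) AS CardsLaterChargedBack
    FROM    dbo.CardPurchaseHistory h
    JOIN    dbo.Transactions t
              ON  t.CardNumber = h.CardNumber
              AND t.TransactionState = 'chargeback'
              AND h.PurchaseDate BETWEEN DATEADD(DAY, -3, t.TransactionDate)
                                     AND t.TransactionDate
    GROUP BY h.MerchantName, h.OperatorID
    HAVING  COUNT(DISTINCT h.CardNumber) > 1
    ORDER BY CardsLaterChargedBack DESC;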








There is no pattern in the IP addresses or in the individual card holders. The only central association comes from the charge history of the individuals, which points to an issue with a specific merchant and a specific operator at the store, who is most likely the source of my fraud chargebacks.


In Conclusion


In this example we have successfully leveraged statistical analysis to identify a specific set of issues within a large data source for visual analysis. Performing link analysis on the entire group of 220,000 records would have been impossible, but by drilling down into this mass of data to discover anomalies in activity, we have successfully identified a fraud issue across years of chargeback data.


Link and association analysis in visual form is a powerful tool; however, by integrating visual analysis with other forms of data mining we are able to perform analysis across a much larger set of data than would be possible through query and visualization alone.

Integrating Visual Analysis and Fraud/Risk Modeling

For most of us, visualizing everything contained in a database for analysis is simply not an option. Even if you have the most powerful analytical software in the world, the human brain can only wrap itself around so much data. Additionally, under most circumstances, the data that you are most interested in analyzing consists of 10% of the entities and transactions contained in your data.

To illustrate this, let's assume that you have a database of credit card transactions from your medium-sized eCommerce business. You receive 100,000 credit card transactions daily and you are tasked with proactively identifying the fraudulent ones. In order to establish patterns in fraudulent transactions you need a history of transactions across your data, so we are not talking about visualizing one day's worth of transactions, we are talking about several months' worth. Visualizing just three months of credit card transaction data would mean 9 million records. Unless you are Rain Man and can count the number of toothpicks that fall out of a box in two seconds, don't try visualizing 9 million records. Even if the software could handle that many entities and links, it would create a staggering ball of twine that would be very pretty to look at but impossible to dissect for the patterns contained in it.

This is where the partnering of statistical modeling and visual analysis comes into play. There are several parts to statistical analysis and modeling that we will cover later; what we are going to focus on in this article is creating an analytical subset, or fraud model, to leverage in pulling out the small percentage of transactions with the highest potential for fraud.

The partnership between risk or fraud modeling and visual analysis is symbiotic. Fraud models are important tools for scoring and extracting from vast transaction pools, but they do not learn new threats, only the old ones they are designed to protect against. Visual analysis is important for detecting relationships and patterns across data, but the software and the operator can only process so many entities at a time.

To optimize both, we partner visual analysis with risk and fraud modeling to offset the deficiencies of each. As an analyst, I rely on my fraud model to pull out of the big block of 9 million the transactions that have a higher probability of fraud. Then, leveraging visual analysis, I examine these records for patterns or clusters of fraud activity by drilling into the results from the fraud model and expanding on those entities, pulling in related records from throughout the entire data set. Through visual analysis of the data returned by the risk model, and by incorporating related data, I can not only verify fraud detected by the model but also identify new patterns of fraud in the related data which the model may miss. That information is then fed back into the modeling rules to identify that activity in the future; hence my fraud model learns from my visual analysis.

Hopefully I haven't lost everyone at this point, so let me describe how I develop a "learning fraud model" by incorporating risk modeling and visual analysis. First, a quick analogy for this scenario:

A customs officer at the airport is on the lookout for people who might be smuggling drugs. His observations are based on profiling the behavior of people at the airport, using a series of attributes that drug smugglers display. He is standing at the security checkpoint and notices two people: first, a man who is visibly nervous and sweating; next, a well-dressed elderly woman laughing and joking with a small child standing behind her.

The customs officer pulls the man out of line for further screening allowing the elderly woman through the checkpoint. The man, as it turns out, is nervous about flying because he was on a plane that crashed in the Hudson a couple of months ago and also has the flu. The elderly woman is carrying enough cocaine to kill a team of elephants.

Risk modeling works the same way as the customs officer. The model has a series of attributes that it looks for in transactions and pulls those transactions out of line for further examination. Just like the customs officer, the risk model does not look at transactions three-dimensionally and cannot tell that the grandmother's address is linked to a person who was arrested for smuggling drugs at another airport three months ago; if it could, the risk model and the customs officer would have pulled granny aside along with the sweating guy. Just like the analyst tasked with looking at 300,000 transactions to find fraud, the customs officer can't be expected to look at 300,000 people passing through his line every day and know which one is the smuggler.

Ultimately, the activity that poses the greatest threat is the activity which hasn't been seen before. Count on the fact that if you know what attributes your fraud model looks at in your transaction flow, the bad people know it too, and some of the transactions your fraud model is stopping today are in fact probes to learn what will get through your model and what won't.

The weakness of fraud schemes and transactions is that, no matter how hard you try, you can't make fraudulent transactions look exactly like good ones. Eventually the activity you are analyzing will share attributes and behaviors that, through visual analysis, make the well-dressed old lady look more and more like the nervous sweating guy, and the same rule that got the sweating guy pulled aside will be incorporated for the elderly lady.
















Building a "Leaning" Fraud Model For Visual Analysis


For the purpose of this example we are going to assume that our eCommerce operation has been in business for a couple of months and, through chargebacks and complaints, we have identified transactions which are fraudulent and incorporated those into our standard fraud model.

I am going to import those records into my visual analysis software and start the process of expanding on those entities, leveraging every transaction data point in my database to identify undiscovered fraudulent transactions from identified ones.

Each expansion takes me one more level into all of my transactional data. What I am looking for are clustered transactions that both share attributes with fraudulent transactions and, from a visual analysis standpoint, do not share the same behavior as legitimate transactions.

Here I have located a cluster of interrelated transactions which my fraud model has scored very low, but which is linked through two or more levels of relationships to a chargeback transaction. The cluster tells me two things: first, if these transactions are all from unrelated individuals they shouldn't be clustered; and second, at some point all of these transactions link back to a fraud chargeback.

I am going to focus on this cluster of activity to determine what joins these transactions together and, through analysis, whether they are indeed fraudulent. Once I have determined that these transactions are fraudulent, there are two things I need to do:







1. Add these transactions to my analytical data set of all identified fraudulent transactions. I refer to this data set as my "scum pond"; it consists of every identified fraudulent transaction and all of their associated attributes, and it is what I use to determine the visual analysis footprint of the associated fraud scenarios to compare across transactional data. It differs from the fraud model in that it has no stored procedures and is not used for scoring transactions; rather, it is an analytical model my visualization software can refer to, allowing for visual comparisons.

2. I need to determine if these newly discovered transactions share any common attributes which can be incorporated into my "learning" fraud model. Through my visualization I can determine that all of the newly discovered fraud transactions have a mismatch between the IP state and the account state, are over $1,000 and all use the @hotmail.com email domain. I am going to incorporate all of these attributes into my fraud scoring to trigger a review of transactions which share this pattern (a sketch follows below).
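A rule like this can be expressed as a simple query against the transaction table. In the sketch below, the table name, the AccountState and EmailAddress columns, and the score adjustment of 25 points are all hypothetical:

    -- Flag transactions that match the newly discovered pattern:
    -- IP state differs from account state, amount over $1,000,
    -- and an @hotmail.com email address (hypothetical names and weight).
    SELECT  TransactionID,
            FraudScore,
            FraudScore + 25 AS AdjustedFraudScore   -- the weight is an assumption
    FROM    dbo.Transactions
    WHERE   IPState <> AccountState
      AND   TransactionAmount > 1000
      AND   EmailAddress LIKE '%@hotmail.com';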

This process gets repeated daily; the fraud model depends on the visual analysis as much as the analyst depends on the fraud model. We have stopped this activity for now, but like all fraud and flu viruses, it will mutate over time, the fraud model will stop detecting the activity, and the analyst will have to discover the change and incorporate it back into the model.