Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by examining information holistically. Unlike traditional data mining, visualizing information lets patterns of activity that run contrary to normal activity surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children started learning to read. As our instincts develop over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on in my site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Visualizing Data For Analysis

For this example I am going to use i2's Analyst Notebook as my visual analysis software. For the most part, however, all visualization software follows the same rules of relationships, so you should be able to apply these principles to all platforms.

In the previous article we discussed the method of querying and extracting data for visual analysis. Now we are going to take that data and import it into our visual analysis software to determine relationships between the entities in the data.

The scenario for this project is going to be the same as the example for data mining: the discovery of organized insurance fraud. In later articles I will cover other scenarios for analysis, but for this example I want to keep the same scenario as my data extract to make it easier to conceptualize.

Formatting Data For Import:

Depending on the type of visualization software you use, you will need to format the data you extracted for analysis into a file compatible with your software. In this example, Analyst Notebook can accept data in .txt or Excel formats for import, so I am going to take my data extract and place it in Excel for import.
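If your extract comes out of a script rather than a spreadsheet, a delimited text file can be produced along these lines. This is only a minimal sketch in Python: it assumes a tab-delimited layout with a header row, and the column names are hypothetical. Check the layout your own software expects against its import documentation.

```python
import csv
import io

# Hypothetical column names for the extract; adjust to match your own data.
FIELDS = ["claim_number", "loss_date", "first_name", "last_name", "vin"]

def to_delimited_text(rows, fields=FIELDS, delimiter="\t"):
    """Render rows (dicts) as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter=delimiter,
        extrasaction="ignore",  # silently skip columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Randomly invented sample row, not real data.
rows = [
    {
        "claim_number": "CLM-0001",
        "loss_date": "2010-01-15",
        "first_name": "John",
        "last_name": "Smith",
        "vin": "1FAKEVIN000000001",
        "notes": "not exported",
    }
]
text = to_delimited_text(rows)
```

Writing the returned text to a .txt file gives you one row per record with the fields in a fixed order, which keeps the later field-to-entity assignments predictable.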

I know that I am going to need to establish relationships between claims, people, vehicles and the associated locations for each in order to determine if related people are involved in multiple claims, which would be an indicator of insurance fraud.

Looking at my data, I have extracted the fields that I need to create unique identifiers for each of my entities so that I can establish the visual links between them:

Information about the people involved including where they live and their phone numbers

Information about the property involved

Information about the claim

Importing Your Data For Analysis:

I am ready to import my data for analysis. I begin by selecting my import file in Analyst Notebook and then assign identities and attributes to each of the entities that I am going to import.

I always start with the entity that is going to tie the most objects together, something I refer to as my pivot entity. In this case it is going to be the claim, as people and property are linked together through the claim. As the goal of my analysis is to find the most interrelated claims, this makes the claim entity my central point of focus.

Establishing the import for the claim is fairly simple as it already has a unique identifier, the claim number. However, there is additional information about the claim that I want to include as attributes, such as the date of loss or the type of claim.

As illustrated in the picture, I am assigning the claim number field from my data as the identity of my claim entity and utilizing the claim loss date as my date field in the visualization. This serves two purposes: it provides the loss date information in my chart, and it allows me to produce a time line from this data later if I choose.
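Outside of the import dialog, the same assignment can be sketched in Python. This is an illustration of the idea only, not how Analyst Notebook stores entities; the field names and the month/day/year date format are assumptions.

```python
from datetime import datetime

def claim_entity(row, date_format="%m/%d/%Y"):
    """Build a claim entity keyed by claim number with a parsed loss date.

    The field names and the date format are assumptions for this sketch.
    """
    return {
        "identity": row["claim_number"],  # unique key for the claim
        "loss_date": datetime.strptime(row["loss_date"], date_format).date(),
        "claim_type": row["claim_type"],  # carried along as an attribute
    }

# Randomly invented sample rows, not real data.
rows = [
    {"claim_number": "CLM-0002", "loss_date": "03/02/2010", "claim_type": "Auto"},
    {"claim_number": "CLM-0001", "loss_date": "01/15/2010", "claim_type": "Auto"},
]
claims = [claim_entity(r) for r in rows]

# Because the loss date is a real date, a time line is just a sort.
timeline = sorted(claims, key=lambda c: c["loss_date"])
```

Parsing the date at import time, rather than leaving it as text, is what makes the later time line possible.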

Linked to the claim are vehicles and people. Since I am looking for interrelated claims as the basis for establishing organized insurance fraud, I am linking people to claims and vehicles to claims as opposed to linking people to vehicles.
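As a rough sketch of that linking choice in Python, each row of the extract produces a person-to-claim link and a vehicle-to-claim link, and never a person-to-vehicle link. The row layout and field names here are hypothetical.

```python
def build_links(rows):
    """Return the set of (entity, claim) links implied by the rows.

    Each entity is a (type, identity) pair. The claim is always one end of
    the link, so people and vehicles are only ever connected through a claim.
    """
    links = set()
    for row in rows:
        claim = ("claim", row["claim_number"])
        links.add((("person", row["person_id"]), claim))
        links.add((("vehicle", row["vin"]), claim))
    return links

# Randomly invented sample rows: two people in the same vehicle on one claim.
rows = [
    {"claim_number": "CLM-0001", "person_id": "P1", "vin": "1FAKEVIN000000001"},
    {"claim_number": "CLM-0001", "person_id": "P2", "vin": "1FAKEVIN000000001"},
]
links = build_links(rows)
```

Using a set means the vehicle-to-claim link is stored once even though the vehicle appears on two rows.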

I have four fields for vehicles: the Vehicle Identification Number or VIN, and the year, make and model of the vehicle. From these four fields I need to create a unique identifier for my vehicle entity. A big mistake would be to make the year, make and model of the vehicle the identity of the vehicle entity. This would in essence create one 1990 Ford Taurus entity and link every claim involving this type of vehicle together. I am pretty sure that there is more than one 1990 Ford Taurus driving around in the country, so I need to make sure the identity is distinct. For this reason I am going to select the VIN as the unique identifier or identity for this entity.

Now I have another issue: if I only use the VIN as the description for this entity, it is not going to make sense to the people who are going to view my chart. No problem. I am going to use the VIN as the identity of this entity but use the year, make and model of the vehicle as the label. By doing this I still create a unique identifier for the vehicle, but readers will have a simple label to tell them what the entity is.
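The split between identity and label can be sketched in a few lines of Python. This is an illustration under assumed field names, not a description of how any particular charting package stores its entities.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    identity: str  # unique key used to decide whether two rows are the same thing
    label: str     # human-readable text shown to the chart reader

def vehicle_entity(row):
    """Key the vehicle by VIN and label it by year, make and model."""
    vin = row["vin"].strip().upper()
    label = f'{row["year"]} {row["make"]} {row["model"]}'
    return Entity(identity=vin, label=label)

# Randomly invented sample rows: two different cars of the same type.
rows = [
    {"vin": "1FAKEVIN000000001", "year": 1990, "make": "Ford", "model": "Taurus"},
    {"vin": "1fakevin000000002", "year": 1990, "make": "Ford", "model": "Taurus"},
]
vehicles = {}
for row in rows:
    entity = vehicle_entity(row)
    vehicles[entity.identity] = entity
```

Both cars carry the same label, but because they are stored by VIN they remain two separate entities on the chart.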

I am going to have the same issue with people in the off chance that there is more than one person named John Smith in the state. Just like with vehicles, I am going to establish a unique identity for people by using a combination of fields from my data such as First Name, Last Name and Date of Birth. For the label in my chart I am only going to use first and last name.
(All of the identifiers I used are random data, not real.)
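A composite identity of this kind might look like the following in Python. The separator character and the trimming and upper-casing of the fields are my own assumptions for the sketch; the point is only that the identity combines all three fields while the label uses two.

```python
def person_identity(first, last, dob):
    """Combine name and date of birth into one identity string.

    Fields are trimmed and upper-cased so that stray spaces or casing
    differences do not split one person into two entities.
    """
    return "|".join(part.strip().upper() for part in (first, last, dob))

def person_label(first, last):
    """Short label for the chart: first and last name only."""
    return f"{first.strip()} {last.strip()}"

# Randomly invented sample people, not real data.
id_a = person_identity("John", "Smith", "1970-01-01")
id_b = person_identity("John", "Smith", "1982-06-30")
id_a_again = person_identity(" john ", "SMITH", "1970-01-01")
label = person_label("John", "Smith")
```

The two John Smiths share a label but get different identities, while the same John Smith typed with different casing still resolves to one entity.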

I do want to be able to link people together by their locations and other identifiers. The most important part of importing data for visual analysis is deciding what to make an entity and what to make an attribute. In this case I could make social security number an attribute of the person, but then I would not be able to link people together who are using the same social security number. For that reason I want to make social security number its own entity, because in fraud scenarios people often use fake SSNs and I may be able to link multiple people who are using the same made-up SSN. The same issue exists for telephones: if I want to link entities together by a field, I need to establish it as its own entity, not an attribute.
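To see why the entity choice matters, here is a small Python sketch that treats the SSN as something people link to and then looks for values shared by more than one person. The record layout and field names are hypothetical.

```python
from collections import defaultdict

def shared_values(records, key_field, value_field):
    """Return each value held by more than one record, with its holders."""
    holders = defaultdict(set)
    for record in records:
        holders[record[value_field]].add(record[key_field])
    return {value: ids for value, ids in holders.items() if len(ids) > 1}

# Randomly invented sample people, not real data.
people = [
    {"person_id": "P1", "name": "John Smith", "ssn": "000-00-0001"},
    {"person_id": "P2", "name": "Jane Doe", "ssn": "000-00-0001"},
    {"person_id": "P3", "name": "Sam Jones", "ssn": "000-00-0002"},
]
shared_ssns = shared_values(people, "person_id", "ssn")
```

If the SSN were only an attribute on each person, nothing would connect P1 and P2. Once it is its own entity, the shared value becomes a visible link between them. The same function works for telephone numbers by passing a different value field.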

The final step is to qualify blank fields in my data so that I can remove them later. For example, maybe not everyone in my data entered a phone number and that field is null. If I do not assign null entities a value, my import is going to create a bunch of blank telephones which have no analytical value. For that reason I am going to assign blank fields the value of "delete", allowing me to search out those entities and delete them from my chart.
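If you prepare the extract in a script, the blank-field step might be sketched like this in Python. The field names are hypothetical, and the sketch treats both missing values and whitespace-only strings as blank, which is an assumption on my part.

```python
PLACEHOLDER = "delete"

def qualify_blanks(row, fields, placeholder=PLACEHOLDER):
    """Return a copy of row with blank values in fields set to the placeholder."""
    cleaned = dict(row)  # copy so the original row is left untouched
    for field in fields:
        value = cleaned.get(field)
        if value is None or str(value).strip() == "":
            cleaned[field] = placeholder
    return cleaned

# Randomly invented sample row with an empty phone and a missing SSN.
row = {"person_id": "P1", "phone": "", "ssn": None, "address": "12 Example St"}
cleaned = qualify_blanks(row, ["phone", "ssn", "address"])
```

Every blank now carries the same searchable value, so the resulting entities can be found and removed in one pass after import.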

Now I am ready to import and analyze my data, so I pull the trigger on my visualization software. About 90% of the time, the initial import leaves you with what I refer to as the "dreaded ball of twine". The reason is that most real-world data contains nulls and "false positives" that need to be cleaned out before we analyze the results.

The first thing I am going to do is search out my "delete" or null entities and remove them from my chart. If you are fortunate enough to have clean data, that should take care of your twine problem and you are ready to begin your analysis.
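The effect of this cleanup can be shown with a short Python sketch over a list of links. In the chart itself you would do this with the search and delete tools; the code is only an illustration, and the (type, identity) link layout is an assumption.

```python
PLACEHOLDER = "delete"

def remove_placeholder_entities(links, placeholder=PLACEHOLDER):
    """Drop every link that touches an entity whose identity is the placeholder."""
    return [
        (a, b)
        for a, b in links
        if a[1] != placeholder and b[1] != placeholder
    ]

# Randomly invented sample links. P1 and P2 are unrelated, but both left the
# phone field blank, so both are tied to the same "delete" telephone.
links = [
    (("person", "P1"), ("claim", "CLM-0001")),
    (("person", "P1"), ("phone", "delete")),
    (("person", "P2"), ("phone", "delete")),
    (("person", "P2"), ("claim", "CLM-0002")),
]
clean_links = remove_placeholder_entities(links)
```

Before the cleanup, the shared "delete" telephone ties two unrelated claims together, which is exactly how the ball of twine forms. Afterwards only the genuine links remain.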

After cleaning the data, clusters of interrelated claims are going to appear in Analyst Notebook. The clusters will be organized from largest to smallest, left to right in your chart. At this point in my analysis I want to begin to break out the individual clusters for analysis, as looking at them in a circular layout does not tell me how they are related.
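In graph terms, each cluster is a connected group of entities. The following Python sketch finds those groups with a breadth-first search and orders them from largest to smallest. It is my own illustration of the concept, not the layout algorithm Analyst Notebook uses, and the entity names are invented.

```python
from collections import defaultdict, deque

def find_clusters(links):
    """Group linked entities into clusters, largest cluster first."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in list(neighbors):
        if start in seen:
            continue
        # Breadth-first walk collecting everything reachable from start.
        seen.add(start)
        queue = deque([start])
        cluster = set()
        while queue:
            node = queue.popleft()
            cluster.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        clusters.append(cluster)
    return sorted(clusters, key=len, reverse=True)

# Randomly invented sample links between people and claims.
links = [
    ("P1", "CLM-0001"),
    ("P1", "CLM-0002"),
    ("P2", "CLM-0002"),
    ("P3", "CLM-0003"),
]
clusters = find_clusters(links)
```

The first cluster in the result is the natural candidate to move to its own chart for a closer look.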

I take the largest cluster and move it to a new chart then utilize a "peacock" or a "minimize cross links" layout to examine the relationships between the entities to draw my conclusion.

What I can see in this chart is that we have several people who are associated with numerous multi occupant/multi injury claims on new policies, which were the query conditions I used when I searched for data. The chances of one person being involved in numerous high risk claims in a short period of time and it not being fraud are very small.
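The same observation can be checked numerically by counting distinct claims per person. This Python sketch assumes a list of (person, claim) pairs, and the threshold of three claims is an arbitrary value chosen for illustration, not a rule from the analysis above.

```python
from collections import defaultdict

def people_in_many_claims(person_claim_links, min_claims=3):
    """Return each person linked to at least min_claims distinct claims."""
    claims_by_person = defaultdict(set)
    for person, claim in person_claim_links:
        claims_by_person[person].add(claim)
    return {
        person: len(claims)
        for person, claims in claims_by_person.items()
        if len(claims) >= min_claims
    }

# Randomly invented sample links; the last pair is a deliberate duplicate.
person_claim_links = [
    ("P1", "CLM-0001"),
    ("P1", "CLM-0002"),
    ("P1", "CLM-0003"),
    ("P2", "CLM-0001"),
    ("P1", "CLM-0001"),
]
flagged = people_in_many_claims(person_claim_links, min_claims=3)
```

Counting distinct claims rather than rows keeps a duplicated record from inflating a person's total.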

I am going to want to sample some of the claims to eliminate the chance of "false positives" in my data, such as a commercial policy with 1000 vehicles on it or people in these claims being involved in a large loss such as a bus accident. Failing those scenarios, I have located a cluster of very high risk claims which are indicative of fraud.