Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information allows patterns of activity that run contrary to normal activity to surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children first learn to read. As our instinct develops over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on in my site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Visualizing Data For Analysis

For this example I am going to utilize i2's Analyst Notebook as my visual analysis software; however, for the most part all visualization software follows the same rules of relationships, so you should be able to apply these principles to other platforms.

In the previous article we discussed the method of querying and extracting data for visual analysis. Now we are going to take that data and import it into our visual analysis software to determine relationships between the entities in the data.

The scenario for this project is going to be the same as the example for data mining: the discovery of organized insurance fraud. In later articles I will cover other scenarios for analysis, but for this example I want to keep the same scenario as my data extract to make it easier to conceptualize.

Formatting Data For Import:

Based on the type of visualization software you utilize, you will need to format the data you extracted into a file compatible with your software. In this example, Analyst Notebook can accept data in .txt or Excel formats for import, so I am going to take my data extract and place it in Excel.

I know that I am going to need to establish relationships between claims, people, vehicles and the associated locations for each in order to determine if related people are involved in multiple claims which would be an indicator of insurance fraud.

Looking at my data, I have extracted the fields I need to create unique identifiers for each of my entities and establish the visual links between them:

Information about the people involved including where they live and their phone numbers

Information about the property involved

Information about the claim

Importing Your Data For Analysis:

I am ready to import my data for analysis. I begin by selecting my import file in Analyst Notebook and begin the process of assigning identities and attributes to each of the entities that I am going to import.

I always start with the entity that is going to tie the most objects together, something I refer to as my pivot entity. In this case it is going to be the claim, as people and property are linked together through the claim. Since the goal of my analysis is to find the most interrelated claims, this makes the claim entity my central point of focus.

Establishing the import for the claim is fairly simple as it already has a unique identifier, the claim number. However there is additional information about the claim that I want to include as attributes such as the date of loss or the type of claim.

As illustrated in the picture, I am assigning the claim number field from my data as the identity of my claim entity and utilizing the claim loss date as my date field in the visualization. This serves two purposes: it provides the information in my chart relating to the loss date, and it will allow me to produce a timeline from this data later if I choose.

Linked to the claim are vehicles and people. Since I am looking for interrelated claims as the basis for establishing organized insurance fraud, I am linking people to claims and vehicles to claims, as opposed to linking people to vehicles.

I have four fields for vehicles: the Vehicle Identification Number, or VIN, and the year, make and model of the vehicle. From these four fields I need to create a unique identifier for my vehicle entity. A big mistake would be to make the year, make and model of the vehicle the identity of the vehicle entity. This would in essence create one 1990 Ford Taurus entity and link every claim involving this type of vehicle together. I am pretty sure that there is more than one 1990 Ford Taurus driving around in the country, so I need to make sure the identity is distinct. For this reason I am going to select the VIN as the unique identifier, or identity, for this entity.

Now I have another issue: if I only use the VIN as the description for this entity, it is not going to make sense to the people who view my chart. No problem. I am going to use the VIN as the identity of this entity but use the year, make and model of the vehicle as the label. By doing this I still create a unique identifier for the vehicle, but readers will have a simple label telling them what the entity is.

I am going to have the same issue with people in the off chance that there is more than one person named John Smith in the state. Just like with vehicles, I am going to establish a unique identity for people by using a combination of fields from my data, such as First Name, Last Name and Date of Birth. For the label in my chart I am only going to use first and last name.
(All of the identifiers I used are random data, not real.)
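To make the identity-versus-label idea concrete, here is a minimal Python sketch of how the import columns could be prepared. All field names and sample values are hypothetical, not from a real claims system:

```python
# Sketch: build an identity (the unique key the software links on) and a
# label (what the chart reader sees) for each person and vehicle.
# All field names and values here are hypothetical sample data.
rows = [
    {"first": "John", "last": "Smith", "dob": "1970-03-12",
     "vin": "1FABP52U4LA123456", "year": "1990", "make": "Ford", "model": "Taurus"},
    {"first": "John", "last": "Smith", "dob": "1982-07-04",
     "vin": "1FABP52U4LA654321", "year": "1990", "make": "Ford", "model": "Taurus"},
]

for row in rows:
    # Identity: distinct even for two people with the same name.
    row["person_identity"] = "|".join((row["first"], row["last"], row["dob"]))
    row["vehicle_identity"] = row["vin"]
    # Label: readable, but not required to be unique.
    row["person_label"] = row["first"] + " " + row["last"]
    row["vehicle_label"] = " ".join((row["year"], row["make"], row["model"]))
```

The two John Smiths end up with different identities (and so stay separate entities), while their labels, and the labels of the two Tauruses, read the same on the chart.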

I do want to be able to link people together by their locations and other identifiers. The most important part of importing data for visual analysis is deciding what to make an entity and what to make an attribute. In this case I could make Social Security number an attribute of the person, but then I would not be able to link together people who are using the same Social Security number. For that reason I want to make Social Security number its own entity: in fraud scenarios people often use fake SSNs, and I may be able to link multiple people who are using the same made-up SSN. The same issue exists for telephones; if I want to link entities together by a field, I need to establish it as its own entity, not an attribute.

The final step is to qualify blank fields in my data so that I can remove them later. For example, maybe not everyone in my data entered a phone number and that field is null. If I do not assign null entities a value, when I import my data it is going to create a bunch of blank telephones which have no analytical value. For that reason I am going to assign blank fields the value of "delete", allowing me to search out those entities and delete them from my chart.
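A minimal sketch of that null-qualifying step, assuming a simple list-of-rows extract with hypothetical field names:

```python
# Sketch: mark blank phone fields with a sentinel value so they can be
# searched out and purged from the chart after import. Field names and
# values are hypothetical.
rows = [
    {"claim": "CLM-1001", "phone": "504-555-0142"},
    {"claim": "CLM-1002", "phone": ""},
    {"claim": "CLM-1003", "phone": None},
]

for row in rows:
    if not row["phone"]:          # catches both None and empty strings
        row["phone"] = "delete"   # sentinel to search for and remove later

flagged = [r["claim"] for r in rows if r["phone"] == "delete"]
```

After import, a search for the "delete" value finds exactly the entities that would otherwise have been meaningless blank telephones.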

Now I am ready to import and analyze my data. I pull the trigger on my visualization software and import my data. 90% of the time, after your initial import you are going to end up with what I refer to as the "dreaded ball of twine". The reason is that 90% of the data in the world contains nulls and "false positives" that need to be cleaned out before we analyze the results.

The first thing I am going to do is search out my "delete" or null entities and remove them from my chart. If you are fortunate enough to have clean data, that should take care of your twine problem and you are ready to begin your analysis.

After cleaning the data, clusters of interrelated claims are going to appear in Analyst Notebook, organized from the largest cluster to the smallest, left to right in your chart. At this point in my analysis I want to begin to break out the individual clusters, as looking at them in a circular layout does not tell me how they are related.

I take the largest cluster and move it to a new chart then utilize a "peacock" or a "minimize cross links" layout to examine the relationships between the entities to draw my conclusion.

What I can see in this chart is that we have several people associated with numerous multi-occupant/multi-injury claims on new policies, which were the query conditions I used when I searched for data. The chances of one person being involved in numerous high-risk claims in a short period of time without it being fraud are very small.

I am going to want to sample some of the claims to eliminate the chance of "false positives" in my data, such as a commercial policy with 1,000 vehicles on it, or people in these claims being involved in a large loss such as a bus accident. Ruling out those scenarios, I have located a cluster of very high-risk claims which are indicative of fraud.

Setting Up Data For Link Analysis

There are a large number of software packages available on the market today to perform link analysis across data sets. Without arguing the strengths or weaknesses of different programs, there is one key element to getting the most from data set analysis that is often overlooked.

Even the best analytical software can be flummoxed by non-relational data or poorly conceived imports. Analysis of data sets requires relational data from which unique identifiers can be constructed, along with thoughtful planning on how each entity will relate to another in the chart. Many good analysis projects find the wheels coming off the wagon when the analyst tries to cram data into a visual analysis program without planning.

Let's start with an example of importing insurance claims information to discover relationships between vehicles, claimants, addresses and phone numbers to detect organized insurance fraud. The scenario will be:

I am an investigative analyst for Andrew's insurance company. I am tasked with discovering patterns in claims activity which are indicative of organized fraud. At my disposal I have the ability to download data from my claims and policy system, and I am working with the i2 program to visualize this data. My objective is to look at a state's claim activity to find new organized fraud rings for my SIU to investigate.

Starting with my claims database, I examine what fields are being captured by my company to decide which to use in my analysis. The first thing to consider is what I am trying to discover; in this case it's organized claims fraud. I only want to extract those fields from my database which would:

* Be used to indicate insurance fraud (People, Vehicles, Claims)

* Be used to create unique identifiers between these entities and nothing more

* Come only from claims that have a high probability of being involved in insurance fraud to start with

Unless you have a Cray supercomputer in your office, you want to understand how much data you can realistically examine, and then extract only those fields that you absolutely need to discover your objective.

Step One: Planning Your Data Query

Let's assume Andrew's insurance company is a major insurance carrier in the Southeast U.S., writing policies in Alabama, Louisiana, Florida and Georgia. As I am trying to find patterns in insurance claims which are indicative of insurance fraud I want to limit my query to those States with similar lines of business.

For example, Florida has Personal Injury Protection (PIP), which pays for claimants' medical coverage as mandated by the state's law. Louisiana, another state I am responsible for, does not have PIP coverage, so I cannot download and analyze claims data from Florida and Louisiana together.

Louisiana has more investigations than Florida, so I am going to choose to download claims from Louisiana for my analysis. My next step is to determine the claims volume in Louisiana so I know how many months of claims I can realistically examine.

The next mistake analysts make when visualizing data is trying to import and visualize too much data. Maybe you have the world's best computer at your disposal and can import 65,000 entities into your analytical software. That is wonderful, except most people, aside from Mensa members, can't accurately process 65,000 entities on a chart.

I need to determine how many claims my company has in Louisiana in a month so I can decide how many months of claims to download and import. Knowing my machine's and brain's limitations, I determine that Andrew's insurance clients generate 50,000 claims a month. Obviously, I don't want to limit my analysis to just one month; that simply is not enough time to identify an organized fraud ring. If I want to analyze more than one month, then I need to write my query to pull only those claims which are the most likely to be involved in organized fraud.

By examining claims from the past that have been investigated and found fraudulent in Louisiana I establish that:

* Multi-occupant injury claims
* On policies that were bought less than 6 months ago
* Occurring late at night
* Where the vehicles were six years or older

had the highest likelihood of being fraudulent, and those are the claims I want to limit my download to. On average, 2,000 claims a month in Louisiana from my company meet these criteria. I can accurately examine and analyze 10,000 claims, so I can download 5 months of "high fraud risk" claims from my claim system for analysis.
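The four criteria above could be sketched as a filter like the one below. The field names and exact thresholds (183 days for "less than 6 months", 10 PM to 5 AM for "late at night") are illustrative assumptions, not actual claim-system fields:

```python
from datetime import date, time

# Sketch of the "high fraud risk" filter described above; field names
# and thresholds are hypothetical stand-ins for the real claim system.
def is_high_risk(claim):
    multi_occupant_injury = claim["injured_occupants"] >= 2
    new_policy = (claim["loss_date"] - claim["policy_start"]).days < 183
    late_night = claim["loss_time"] >= time(22, 0) or claim["loss_time"] < time(5, 0)
    older_vehicle = claim["loss_date"].year - claim["vehicle_year"] >= 6
    return multi_occupant_injury and new_policy and late_night and older_vehicle

claim = {
    "injured_occupants": 3,
    "policy_start": date(2008, 1, 15),
    "loss_date": date(2008, 4, 2),
    "loss_time": time(23, 30),
    "vehicle_year": 1999,
}
```

In practice these conditions would live in the WHERE clause of the extract query rather than in a script, but the logic is the same: every condition must hold before a claim makes it into the download.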

Now that I have determined from what State and how many months of claims I can download for analysis, my next step is to determine what fields within the claims database I need to select for my download in order to establish relationships and unique identifiers for analysis.

I mentioned two things in the paragraph above, "relationships" and "unique identifiers", two of the most important items required for visual analysis.

Relationships: Visual analysis is based on joining two entities together by a relationship. Suppose a person driving a car with one passenger is involved in an accident with another car carrying two occupants. How many relationships are involved here?

1. People to Claims: The claim is the center of every relationship in my analysis. Everyone and everything involved with my company is associated through the claim

2. Claims to Vehicles: Unless they are on horseback, every accident claim I am going to analyze involves at least one vehicle, if not more than one.

3. People to Addresses: The claimants live or get mail somewhere

4. People to Phone Numbers: People have to give a phone number in order to find out how their claim is coming along.

In one accident we have four different relationships multiplied by the number of people and cars involved in the accident. Four people involved, 16 relationships.

Unique Identifiers: We know we have 4 relationships and 8 entities involved. Importing data into an analytical program requires unique identifiers to be established in the data before you can successfully analyze it. For example, if you designate the identity of your vehicles as year, make and model, you will end up with a single vehicle entity representing every 1990 Ford Taurus in your data. As there are five different 1990 Ford Tauruses in my data, I don't want this to occur, so I need to set up a unique identifier for my vehicles so the program will see them as separate entities. In the case of cars, I know my claims system captures license plate numbers and vehicle identification numbers. Both of these fields are unique to a car and thus establish a unique identity for each car I am going to import.

Let's look at the entities I am going to import for visualization and determine what fields exist to create unique identifiers for them:

1. People to Claims: Claims are the easy part; unless you work for a really wacky insurance company, each claim number should be unique, so that is going to be my unique identifier for claims. People are a little more complicated. I cannot simply use first and last name, on the off chance there is more than one John Smith in my data who has had an accident. On the other hand, I also know that people who commit fraud tend to lie about things, so I need a field to add to the person which will not only establish each person as unique but will also link one person to multiple claims reliably. I know that on every accident the person's driver's license is recorded and photocopied, and on the license is a date of birth, so my unique identifier is going to be "person's first name, person's last name and DOB".

2. Claims to Vehicles: We already covered claims and vehicles, so we are going to choose the Claim Number field and the Vehicle Identification Number field as our unique identifiers.

3. People to Addresses: Addresses would seem like an easy one, and they are, as long as you make sure the entire address is used to create the unique identifier. You don't want to end up linking together every resident of a high-rise at one address, so to create a unique identifier for an address, use every field available in your data.

4. People to Phone Numbers: Phone numbers are a pretty simple one as well; in theory no two people should have the same phone number at the same time, although I wish someone would tell that to the dental office which calls me at 9 PM each month.
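The address advice in item 3 can be sketched as follows; the field names and sample values are hypothetical:

```python
# Sketch: use every available address field in the identity so two units
# in the same building remain separate entities. Fields are hypothetical.
def address_identity(addr):
    return "|".join(addr.get(k, "") for k in ("street", "unit", "city", "state", "zip"))

apt_4b = {"street": "100 Main St", "unit": "Apt 4B",
          "city": "New Orleans", "state": "LA", "zip": "70112"}
apt_9c = dict(apt_4b, unit="Apt 9C")
```

Had the identity been built from the street field alone, both apartments would have collapsed into one entity and every resident of the building would appear falsely linked.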

We are almost done with establishing a good data extract for visual analysis. We have established the volume of claims we can look at, which entities we are going to import and the unique identifiers for each entity. The last step is cleaning the data.

Data Cleaning:

Cleaning data is an important step prior to importing for analysis. Every database on the planet contains "dirty" data. There are nulls in your data, and there are twenty different variations of "unknown name", all of which need to be normalized before importing; otherwise your visual analysis software is going to treat each "unknown male" and null as an entity, and they have no analytical value.

There are three different ways to clean data prior to import:

1. Filtering data through your query: Inserting statements into your query such as "is not null" or <> statements such as <>"unknown male" will eliminate those values from your extract.

2. Cleaning after extract: By sorting and filtering your data after extraction, you can delete null or non-valid data from your data set prior to import

3. Cleaning after import into visualization software: This is a process I use quite a bit. i2 has great visual analysis ability where after import you can easily search out non-valid entities and eliminate them from your visualization. If the software you leverage has this ability, it is a very good way of cleaning data, particularly when used in combination with filtering data in your query.
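Option 2, cleaning after extract, might look something like this quick sketch; the placeholder-name patterns are examples only, not an exhaustive list:

```python
# Sketch: drop rows whose name is null or one of the known "unknown"
# placeholder variations before importing. Fields are hypothetical.
JUNK_NAMES = {"", "unknown", "unknown male", "unknown female", "n/a"}

rows = [
    {"claim": "CLM-1001", "name": "John Smith"},
    {"claim": "CLM-1002", "name": "UNKNOWN MALE"},
    {"claim": "CLM-1003", "name": None},
]

clean = [
    r for r in rows
    if r["name"] and r["name"].strip().lower() not in JUNK_NAMES
]
```

Normalizing case and whitespace before the comparison is what catches the twenty different variations of "unknown name" with a single small set of patterns.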

Coming up next: Visualizing Your Data