Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information lets patterns of activity that run contrary to the norm surface after very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days to uncover using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children first learn to read. As our instincts develop over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on throughout this site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Setting Up Data For Link Analysis

There are a large number of software packages on the market today for performing link analysis across data sets. Without arguing the strengths or weaknesses of the different programs, there is one key element to getting the most from data set analysis that is often overlooked.

Even the best analytical software can be flummoxed by non-relational data or poorly conceived imports. Analyzing data sets requires relational data in which unique identifiers can be constructed, along with thoughtful planning about how each entity will relate to the others on the chart. Many good analysis projects find the wheels coming off the wagon when the analyst tries to cram data into a visual analysis program without planning.

Let's start with an example of importing insurance claims information to discover relationships between vehicles, claimants, addresses and phone numbers to detect organized insurance fraud. The scenario will be:

I am an investigative analyst for Andrew's insurance company, tasked with discovering patterns in claims activity which are indicative of organized fraud. At my disposal I have the ability to download data from my claims and policy systems, and I am working with the i2 program to visualize this data. My objective is to examine claim activity state by state to find new organized fraud rings for my SIU unit to investigate.

Starting with my claims database, I examine what fields are being captured by my company to decide which to use in my analysis. The first thing to consider is what I am trying to discover; in this case it's organized claims fraud. I only want to extract those fields from my database which would:

* Be used to indicate insurance fraud (people, vehicles, claims)

* Be used to create unique identifiers between these entities, and nothing more

* Limit the extract to claims that have a high probability of being involved in insurance fraud in the first place

Unless you have a Cray supercomputer in your office, you want to understand how much data you can realistically examine, and then extract only those fields that you absolutely need to achieve your objective.
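The selective extract described above can be sketched as a query that pulls only identifying fields. This is a minimal illustration using an in-memory SQLite table as a stand-in for a claims system; every table and column name here is hypothetical, so substitute whatever fields your own database actually captures.

```python
import sqlite3

# Hypothetical claims table standing in for the real claims database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE claims (
    claim_number TEXT, first_name TEXT, last_name TEXT, dob TEXT,
    vin TEXT, loss_state TEXT, premium REAL, adjuster_notes TEXT)""")
conn.executemany(
    "INSERT INTO claims VALUES (?,?,?,?,?,?,?,?)",
    [("LA-001", "John", "Smith", "1970-01-01", "1FABP52U6LA000001", "LA", 900.0, "..."),
     ("FL-002", "Ann", "Jones", "1982-05-12", "2HGEJ6679XH000002", "FL", 750.0, "...")])

# Pull only the fields that identify people, vehicles and claims --
# not every column the claims system happens to store.
rows = conn.execute(
    """SELECT claim_number, first_name, last_name, dob, vin
       FROM claims WHERE loss_state = 'LA'""").fetchall()
print(rows)
```

Leaving columns like premiums and adjuster notes out of the extract keeps the import small and focused on fields that can become entities or identifiers.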

Step One: Planning Your Data Query

Let's assume Andrew's insurance company is a major insurance carrier in the Southeast U.S., writing policies in Alabama, Louisiana, Florida and Georgia. As I am trying to find patterns in insurance claims which are indicative of insurance fraud I want to limit my query to those States with similar lines of business.

For example, Florida has Personal Injury Protection (PIP), which pays for claimants' medical treatment and is mandated by state law. Louisiana, another state I am responsible for, does not have PIP coverage, so I cannot download and analyze claims data from Florida and Louisiana together.

Louisiana has more investigations than Florida, so I am going to download claims from Louisiana for my analysis. My next step is to determine the claims volume in Louisiana so I know how many months of claims I can realistically examine.

The next mistake analysts make when visualizing data is trying to import and visualize too much of it. Maybe you have the world's best computer at your disposal and can import 65,000 entities into your analytical software. That is wonderful, except that most people, Mensa members aside, can't accurately process 65,000 entities on a chart.

I need to know how many claims my company has in Louisiana in a month, so I can determine how many months of claims I can download and import. Knowing my machine's and my brain's limitations, I find that Andrew's insurance clients generate 50,000 claims a month. Obviously, I don't want to limit my analysis to just one month; that simply is not enough time to identify an organized fraud ring. If I want to analyze more than one month, then I need to write my query to pull only those claims which are the most likely to be involved in organized fraud.

By examining claims from the past that have been investigated and found fraudulent in Louisiana I establish that:

* Multi-occupant injury claims
* On policies bought less than six months ago
* Occurring late at night
* Where the vehicles were six years old or older

had the highest likelihood of being fraudulent, and those are the claims I want to limit my download to. On average, 2,000 claims a month in Louisiana from my company meet these criteria. I can accurately examine and analyze 10,000 claims, so I can download five months of "high fraud risk" claims from my claims system for analysis.
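The volume arithmetic and the four risk criteria above can be sketched as follows. The claim record's field names and the cutoff values (180 days for "less than six months", 10 PM for "late at night") are assumptions for illustration only.

```python
from datetime import date

# Volume planning with the article's numbers.
monthly_high_risk_claims = 2_000   # LA claims per month meeting the criteria
analysis_capacity = 10_000         # claims I can realistically examine
months_of_data = analysis_capacity // monthly_high_risk_claims
print(months_of_data)  # -> 5

def is_high_fraud_risk(claim, as_of=date(2009, 6, 1)):
    """Apply the four 'high fraud risk' criteria from past investigations."""
    policy_age_days = (claim["loss_date"] - claim["policy_start"]).days
    vehicle_age = as_of.year - claim["vehicle_year"]
    return (claim["injured_occupants"] >= 2      # multi-occupant injuries
            and policy_age_days < 180            # policy under six months old
            and claim["loss_hour"] >= 22         # accident late at night
            and vehicle_age >= 6)                # vehicle six years or older

claim = {"injured_occupants": 3, "loss_date": date(2009, 5, 20),
         "policy_start": date(2009, 2, 1), "loss_hour": 23, "vehicle_year": 2001}
print(is_high_fraud_risk(claim))  # -> True
```

In practice this predicate would live in the extract query itself, so only qualifying claims ever leave the claims system.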

Now that I have determined from what State and how many months of claims I can download for analysis, my next step is to determine what fields within the claims database I need to select for my download in order to establish relationships and unique identifiers for analysis.

I mentioned two things in the paragraph above, "relationships" and "unique identifiers", two of the most important items required for visual analysis.

Relationships: Visual analysis is based on joining two entities together by a relationship. Suppose a person driving a car with one passenger is involved in an accident with another car carrying two occupants. How many relationships are involved here?

1. People to Claims: The claim is the center of every relationship in my analysis. Everyone and everything involved with my company is associated through the claim.

2. Claims to Vehicles: Unless they were on horseback, every accident claim I am going to analyze has at least one vehicle.

3. People to Addresses: The claimants live or receive mail somewhere.

4. People to Phone Numbers: People have to give a phone number so the company can tell them how their claim is coming along.

In one accident we have four different relationship types, multiplied by the number of people and cars involved. With four people involved, that is 16 relationships.
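The relationship count above can be checked with a small sketch that builds the link list for the two-car, four-person accident. All names, VINs, addresses and phone numbers here are made up.

```python
# One claim, four people, two cars: build the four relationship types.
claim_id = "CLM-1001"
person_details = {
    "driver A":    {"vehicle": "VIN-AAA", "address": "12 Oak St",  "phone": "555-0101"},
    "passenger A": {"vehicle": "VIN-AAA", "address": "34 Elm St",  "phone": "555-0102"},
    "driver B":    {"vehicle": "VIN-BBB", "address": "56 Pine St", "phone": "555-0103"},
    "passenger B": {"vehicle": "VIN-BBB", "address": "78 Main St", "phone": "555-0104"},
}

links = []
for person, d in person_details.items():
    links.append((person, claim_id))        # people to claims
    links.append((claim_id, d["vehicle"]))  # claims to vehicles
    links.append((person, d["address"]))    # people to addresses
    links.append((person, d["phone"]))      # people to phone numbers

print(len(links))  # four relationship types x four people = 16
```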

Unique Identifiers: We know we have four relationship types and eight entities involved. Importing data into an analytical program requires unique identifiers to be established in the data before you can successfully analyze it. For example, if you designate the identity of your vehicles as year, make and model, you will end up with one vehicle entity for every 1990 Ford Taurus in your data. As there are five different 1990 Ford Tauruses in my data, I don't want this to occur, so I need to set up a unique identifier for my vehicles so the program will see them as separate entities. In the case of cars, I know my claims system captures license plate numbers and vehicle identification numbers. Both of these fields are unique to a car and thus establish a unique identity for each car I am going to import.
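The Taurus problem above is easy to demonstrate: identify vehicles by description and distinct cars collapse into one entity, identify them by VIN and they stay separate. The VINs below are fabricated.

```python
# Three distinct 1990 Ford Tauruses with (made-up) distinct VINs.
vehicles = [
    {"year": 1990, "make": "Ford", "model": "Taurus", "vin": "VIN-00001"},
    {"year": 1990, "make": "Ford", "model": "Taurus", "vin": "VIN-00002"},
    {"year": 1990, "make": "Ford", "model": "Taurus", "vin": "VIN-00003"},
]

by_description = {(v["year"], v["make"], v["model"]) for v in vehicles}
by_vin = {v["vin"] for v in vehicles}

print(len(by_description))  # -> 1: three distinct cars merged into one entity
print(len(by_vin))          # -> 3: each car stays a separate entity
```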

Let's look at the entities I am going to import for visualization and determine what fields exist to create unique identifiers for them:

1. People to Claims: Claims are the easy part: unless you work for a really wacky insurance company, each claim number should be unique, so that is going to be my unique identifier for claims. People are a little more complicated. I cannot simply use first and last name, on the off chance there is more than one John Smith in my data who has had an accident. On the other hand, I also know that people who commit fraud tend to lie, so I need a field to add to the person which will not only establish each person as unique but will also link one person to multiple claims reliably. I know that on every accident the person's driver's license is recorded and photocopied, and the license shows a date of birth, so my unique identifier is going to be "person's first name, person's last name and DOB".

2. Claims to Vehicles: We already covered claims and vehicles, so we are going to choose the Claim Number field and the Vehicle Identification Number field as our unique identifiers.

3. People to Addresses: Addresses would seem like an easy one, and they are, as long as you make sure the entire address field is used to create the unique identifier. You don't want to end up linking together every resident of a high-rise at one address, so to create a unique identifier for an address, use every address field available in your data.

4. People to Phone Numbers: Phone numbers are a pretty simple one too. In theory, no two people should have the same phone number at the same time, although I wish someone would tell that to the dental office which calls me at 9 PM each month.
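The composite identifiers above can be sketched as small key-building functions. The field layouts here are hypothetical; in practice an i2 import specification would map the equivalent columns to entity identities.

```python
def person_id(rec):
    # Name alone collides on common names; name + DOB is far safer.
    return f'{rec["first"]} {rec["last"]} {rec["dob"]}'

def address_id(rec):
    # Use every address field so two units in one high-rise stay distinct.
    return f'{rec["street"]} {rec["unit"]} {rec["city"]} {rec["state"]} {rec["zip"]}'

smith_1 = {"first": "John", "last": "Smith", "dob": "1970-01-01"}
smith_2 = {"first": "John", "last": "Smith", "dob": "1985-07-22"}
print(person_id(smith_1) == person_id(smith_2))  # -> False: two distinct people

apt_101 = {"street": "1 Tower Rd", "unit": "Apt 101", "city": "New Orleans",
           "state": "LA", "zip": "70112"}
apt_102 = dict(apt_101, unit="Apt 102")
print(address_id(apt_101) == address_id(apt_102))  # -> False: distinct units
```

The same pattern covers claims (claim number), vehicles (VIN) and phones (the full number): pick fields that are unique to the real-world thing, not merely descriptive of it.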

We are almost done with establishing a good data extract for visual analysis. We have established the volume of claims we can look at, which entities we are going to import and the unique identifiers for each entity. The last step is cleaning the data.

Data Cleaning:

Cleaning data is an important step prior to importing for analysis. Every database on the planet contains "dirty" data. There are nulls in your data, and there are twenty different variations of "unknown name", all of which need to be normalized before importing. Otherwise your visual analysis software is going to treat each "unknown male" and each null as an entity, and those entities have no analytical value.

There are three different ways to clean data prior to import:

1. Filtering data through your query: Inserting statements into your query such as "is not null", or <> comparisons such as <>"unknown male", will eliminate those values from your extract.

2. Cleaning after extract: By sorting and filtering your data after extract, you can delete null or invalid data from your data set prior to import.

3. Cleaning after import into visualization software: This is a process I use quite a bit. i2 has great visual analysis capabilities: after import you can easily search out invalid entities and eliminate them from your visualization. If the software you use has this ability, it is a very good way of cleaning data, particularly in combination with filtering data in your query.
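Method 2 above, cleaning after extract, can be sketched in a few lines. The list of "unknown" variants here is only an example; audit your own data for the forms it actually contains.

```python
# Drop nulls and the many variants of "unknown" before import,
# so they never become meaningless entities on the chart.
UNKNOWN_VARIANTS = {"unknown", "unknown male", "unknown female", "unk", "n/a", ""}

raw_names = ["John Smith", None, "UNKNOWN MALE", "Jane Doe", "unk", "N/A"]

clean_names = [
    name for name in raw_names
    if name is not None and name.strip().lower() not in UNKNOWN_VARIANTS
]
print(clean_names)  # -> ['John Smith', 'Jane Doe']
```

Normalizing case before comparing (as above) is what catches "UNKNOWN MALE", "Unknown Male" and "unknown male" as the same junk value.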

Coming up next: Visualizing Your Data