Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information lets patterns of activity that run contrary to normal activity surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when that information is presented as images. This has been true since children started learning to read. As our instinct develops over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on in my site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Creating A Learning Risk Model

Risk modeling is used extensively in all areas of commercial fraud analysis. Financial institutions were among the first to leverage risk modeling for financial transactions, led by credit reporting agencies which assigned scores to establish indicators of creditworthiness. Slowly, risk modeling was incorporated into insurance companies in an effort to identify claims which have a higher probability of fraud. As eCommerce began to grow, risk modeling was incorporated into online transactions to provide real-time rules and risk scores for pay-in and pay-out transactions.

Commercial companies have become reliant on risk modeling to be the gatekeeper for transactions flowing in and out of their workspace, driven by the rising rate of fraud and the increasing cost of manual review of transactions. The bottom line became: the better the model, the lower the cost per transaction.

The downside to heavy reliance on risk modeling is that the rules and methodology the risk models rely on are derived from known risk transactions, investigation and analysis. Because fraud is constantly evolving, so must the risk model, to the point that the cost of manual review has been transferred to risk model evolution.

Companies must fine tune the balancing act between the amount of resources allocated to fraud and risk analysis, investigation and statistical analysis which feeds the risk model and the amount of time dedicated to updating the model itself.

When we talk about a learning risk model, it's important to be clear that a risk model itself never learns on its own. The risk model simply incorporates the knowledge gained from analysis, investigation and lessons learned to prevent the same scenarios from reoccurring, but it cannot stop new scenarios.

What can make the risk model more responsive is the development of a pipeline from your fraud analytical, investigation and review process into your risk model in real time. This allows companies to navigate the balancing act between analytical and investigation resources and fraud model maintenance more effectively.

The Partnering of Fraud Analysis and Investigation with Risk Modeling

Fraud analysis should be an ongoing process in any commercial vector. Your fraud analysts and investigators are your best defense against new and emerging threats to your company and assets.

Resources are available to augment the ability of your fraud analysts and decrease the time it takes them to detect, deter and investigate new threats. These include visual analysis tools (link analysis, association analysis, time line analysis) and statistical analysis tools such as i2 Analyst Workstation, which can identify anomalies or velocities that exist in attributes used in your business and bring those to a visual analysis tool for identification and verification. SQL Server and SQL analytical services can be used to build transactional cubes around massive amounts of data to detect irregularities which might be indicators of fraud or threats. The better equipped your fraud analysts are, the more data they can examine in the shortest amount of time.

Time and resources are where risk modeling falls into the risk mitigation equation. Risk and fraud analysis is a resource-intensive endeavor. You want your analysts constantly focusing on new threats and trends, not fighting those trends they have already detected; that is the job of the risk model.

The problem is there has always been a lag time between when a threat or trend is identified by analysis, investigation or manual review, and when that trend or threat is mitigated through new rules within the risk model. This is the conflict in the balancing act because the faster rules are updated in the risk model, the more time your analysts, investigators or manual reviewers can spend on new threats.

Creating the Learning Model

The best way to illustrate the advantages and process of a learning model is to give you a real life scenario which happens to me all the time. I travel on a regular basis to the Philippines and like most travelers, I regularly use my credit card when I travel. Without fail, the first day I am in the Philippines and use my card, the authorization fails, usually when I am with a group of friends to maximize the embarrassment which is a whole separate issue. I must contact my credit card company in the U.S. and explain to them that I am traveling and provide validation information at which time they turn my card back on again.

This is an example of what happens if you are dependent on a non-learning risk model. Somewhere in time, an analyst at my credit card company determined, through geo-spatial analysis, a correlation between potentially fraudulent credit card transactions and the credit card holder's location. This was probably a really good idea, as a person who lives in Washington and uses their credit card on Tuesday in their home state, then on the same day makes a purchase in Manila, has a greater chance of their account being compromised.

What if, however, a person lives in Washington and uses their credit card in their home state on Tuesday, then travels to Manila and uses their credit card in that location on Wednesday? If that person had never been to the Philippines before then, of course, that transaction would carry enough risk to cause a decline pending verification. But what if that person travels on a regular basis to the Philippines? That person's account would have a history of purchases made in both their home state and the Philippines, something that a learning model could leverage to make a better risk assessment and thus a better customer experience.

The scenario given is an example of poor risk modeling, though with all the right intentions. There are situations where the rate of fraud for an area is so high that an across-the-board risk rule is to validate any transaction coming from that area or, in extreme circumstances, to ban transactions from a high-risk area completely. This is the exception to the rule, however, and illustrates that making global risk rules, with all good intention at the time and based on the best analysis, can turn against you if your risk model fails to learn the patterns of its subjects.

In this scenario, if it was my first time ever going to the Philippines, it would be crazy for the risk model not to flag my transaction. I had used my card less than 24 hours ago in Seattle, then used it again 24 hours later, half-way around the world. But let's throw in the concept of a learning model based on the assumption that this is my first time traveling to the Philippines. The risk model determines that I have never been to the Philippines before and declines my transaction, sending it to manual review. I call my credit card company, provide validation information and the representative approves the transaction. From that point on, every time I use my card in the Philippines until I leave, it's approved.

Two months go by and once again I fly to the Philippines for work and to see what the sun looks like again. I buy a latte at the Seattle airport with my credit card and wave goodbye to the rain as I board my flight. I arrive in Manila, and as a nod to the sun gods, I buy a pair of sunglasses at the Manila airport. This time the risk model picks up on the fact that I am geographically separated from my last purchase and begins whirring. The difference is that this time, the risk model leverages the validation information which was entered by the credit card representative on my last trip. The risk model takes a look at the transaction, which is for $20, takes into consideration that I have a history of previously validated transactions from the Philippines, and decides this time to keep an eye on my account for unusual purchasing behavior, but allows my transaction to go through based on my history. The risk model has just learned from previous human review and analysis that my transaction is not abnormal, and I don't get embarrassed at Ninoy's House of Sunglasses by a decline.
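To make the sunglasses example a little more concrete, here is a minimal Python sketch of how validated travel history might feed back into a decision. Everything in it is an assumption made for illustration: the Account class, the assess and record_validation functions, the country codes and the $100 "small purchase" cutoff do not come from any real card issuer's model.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    home_country: str
    # Countries where a human reviewer has already validated the cardholder.
    validated_countries: set = field(default_factory=set)


def assess(account, country, amount, small_amount=100.0):
    """Return a decision for a purchase made in `country` (hypothetical rules)."""
    if country == account.home_country:
        return "approve"
    # Learned history: a prior manual validation lowers the risk of small purchases.
    if country in account.validated_countries and amount <= small_amount:
        return "approve_and_monitor"
    return "decline_pending_review"


def record_validation(account, country):
    """Called when a representative validates the cardholder by phone."""
    account.validated_countries.add(country)


traveler = Account(home_country="US")
first_trip = assess(traveler, "PH", 20.0)    # no history yet
record_validation(traveler, "PH")            # the phone call from the first trip
second_trip = assess(traveler, "PH", 20.0)   # history is now available
```

On the first trip the sketch declines pending review; after the representative's validation is recorded, the same $20 purchase is approved and monitored instead.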

Now let's talk about a learning risk model based on the discovery of a new fraud trend through visual analysis. I am a fraud analyst with Andrew's Credit Card Company and I am performing a visual analysis on the last 24 hours of confirmed fraudulent transactions to see if I can establish a pattern to the most recent activity which has eluded my fraud model.

During the course of my analysis, I begin looking at common purchase points for the fraudulent transactions (CPP analysis). I discover through link analysis that I can connect 150 separate fraudulent transactions to a single CPP by visualizing the credit card holders' transaction histories. In this case, each card holder's history indicates a purchase made at a single CPP just prior to the fraudulent transaction being made. A further attribute I discover is that each of the fraudulent purchases was a Card Not Present transaction, yet the account number and security validation code were entered. All of the connected transactions were made at a wide variety of eCommerce sites for electronic equipment over $500. The common purchase point for all the accounts prior to the fraudulent transaction was Joe Bob's BBQ in Walla Walla, Washington, quite a ways off from Dallas TX.

Here are the common attributes I have been able to identify surrounding the fraudulent transaction:

1. All the accounts affected by the fraudulent transactions had a CPP history of being used at Joe Bob's BBQ in Walla Walla Washington.
2. The next transactions after the CPP were for electronic goods over $500 at a wide variety of eCommerce sites within 24 hours of the transactions at Joe Bob's.
3. Each of the fraud transactions was a "card not present" transaction but passed authentication through the security code.

As the analyst, I have identified a potential fraud breach at Joe Bob's. There is either an employee at Joe Bob's skimming credit card data from the customer's magnetic strip or Joe Bob's has a data leak that a hacker is using to steal credit card data from Joe Bob's network. I am going to forward my analysis to the investigators to follow up on, but most importantly I have discovered a new fraud trend which I need to quantify and feed into my fraud modeling as soon as possible to mitigate any further loss.

By utilizing the attributes from the fraud I discovered through visual analysis, I write a query against my transaction database which searches out all transactions made from my compromised CPP where there is a transaction for >$500 following it. I could narrow it down even more by restricting it to transactions for electronic goods, but I don't know that to be a positive indicator until I review all the transactions in my visualization.
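A query of this kind might look something like the sketch below, which uses Python's built-in sqlite3 module and an in-memory table. The table layout, column names and sample rows are all made up for illustration; a real transaction platform will differ. As described above, the query deliberately does not filter on electronic goods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE transactions (
           txn_id INTEGER PRIMARY KEY, account TEXT, merchant TEXT,
           amount REAL, ts TEXT, confirmed_fraud INTEGER DEFAULT 0)"""
)
# Randomly invented sample rows (ISO timestamps sort correctly as text).
conn.executemany(
    "INSERT INTO transactions (txn_id, account, merchant, amount, ts) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "Joe Bob's BBQ", 32.50, "2024-01-01T12:00"),
        (2, "A", "Online Electronics", 749.00, "2024-01-01T20:00"),
        (3, "B", "Joe Bob's BBQ", 18.00, "2024-01-01T13:00"),
        (4, "B", "Grocery", 60.00, "2024-01-02T09:00"),
        (5, "C", "Online Electronics", 900.00, "2024-01-01T15:00"),
    ],
)

# Find purchases over the amount threshold that FOLLOW a purchase at the
# compromised CPP on the same account.
SUSPECT_QUERY = """
SELECT DISTINCT later.txn_id
FROM transactions AS cpp
JOIN transactions AS later
  ON later.account = cpp.account AND later.ts > cpp.ts
WHERE cpp.merchant = ? AND later.amount > ?
ORDER BY later.txn_id
"""
suspects = [row[0] for row in conn.execute(SUSPECT_QUERY, ("Joe Bob's BBQ", 500))]
```

In the sample rows only account A qualifies: account B used the CPP but made no large purchase afterward, and account C made a large purchase without ever using the CPP.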

Based on my query, I find 255 more transactions which share the same attributes but have not been reported as fraud by the customer yet. I pull this data into my i2 visualization and am able to establish confirmed links between the other 255 suspect transactions and the 150 confirmed fraudulent transactions.

My risk model is established as a learning model which pulls in attributes from transactions discovered as fraud by the analysts or manual review team through an indicator in the database, let's call it a "confirmed fraud" field in the transaction table. As the analyst I compose a query to write a "confirmed fraud" indicator into the field which is picked up by the fraud model and dumped into the "scum pond".

The scum pond is a separate database which captures all the data attributes from confirmed fraud and allows the risk model to leverage the information to learn from past fraud trends. In this case, the analyst has marked 405 transactions discovered as fraudulent with a "confirmed fraud" indicator. The transactions which were marked included the actual fraudulent transactions as well as the transactions from the CPP which was discovered by the analyst. (Just a note: there is actually a lot more involved in creating and marking transactions for the scum pond and the learning model.)

The scum pond picks up these transactions through its own process, which looks for new transactions with the confirmed fraud indicator. The scum pond grabs all the attributes from the transaction and categorizes it based on the nature of the fraud. In this case the scum pond takes into account the relation between the CPP and a following transaction for >$500. These are attributes which the risk model can reference to make real-time transaction decisions in the future. The next time a transaction hits the risk model which is for electronic equipment over $500 where a past transaction had occurred at the compromised CPP, the risk model will elevate that transaction by referencing the attributes in the scum pond.
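As a rough illustration of the idea, the sketch below models the scum pond as a small in-memory store of fraud patterns that the risk model consults at decision time. The FraudPattern and ScumPond names, the fields and the scoring function are hypothetical simplifications; as noted above, a real implementation involves a lot more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FraudPattern:
    cpp_merchant: str   # compromised common purchase point
    min_amount: float   # follow-on purchases above this amount are suspicious
    category: str       # merchandise category seen in the confirmed fraud


class ScumPond:
    """Holds attributes captured from confirmed fraud (illustrative only)."""

    def __init__(self):
        self.patterns = set()

    def ingest(self, pattern):
        self.patterns.add(pattern)

    def matches(self, merchant_history, amount, category):
        return any(
            p.cpp_merchant in merchant_history
            and amount > p.min_amount
            and category == p.category
            for p in self.patterns
        )


def score(pond, merchant_history, amount, category):
    """Elevate a transaction only when it matches a learned pattern."""
    return "elevate" if pond.matches(merchant_history, amount, category) else "pass"


pond = ScumPond()
pond.ingest(FraudPattern("Joe Bob's BBQ", 500.0, "electronics"))

elevated = score(pond, {"Joe Bob's BBQ", "Grocery"}, 650.0, "electronics")
passed = score(pond, {"Joe Bob's BBQ"}, 40.0, "restaurant")
```

No rule was written into the scoring function itself; the behavior changed only because a new pattern was ingested, which is the point of the learning model.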

There was no need for new rules to be hard written into the risk model and there was very little lag time between the discovery of the new fraud trend by the analyst and the incorporation of the intelligence into the risk model. Most importantly, we are not escalating every transaction from accounts that had been used at the CPP, but only those which match the criteria discovered by the analyst, saving money and customer aggravation.

If the risk model elevates more transactions which have the same attributes contained in the scum pond, emanating from the same CPP, the risk model can alert the analyst of a high percentage of account compromise from the discovered scenario and a decision to re-issue new cards to all the account holders can be made based on the exposure of the compromise.
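One simple way to think about that alert is as a ratio: of the accounts that were used at the CPP, what share have since had a transaction elevated? The sketch below assumes a 25% alert threshold purely for illustration; the right threshold is a business decision, and the function names are hypothetical.

```python
def compromise_rate(accounts_at_cpp, elevated_accounts):
    """Share of accounts used at the CPP that later had an elevated transaction."""
    if not accounts_at_cpp:
        return 0.0
    return len(accounts_at_cpp & elevated_accounts) / len(accounts_at_cpp)


def should_alert(accounts_at_cpp, elevated_accounts, threshold=0.25):
    """True when the compromise rate reaches the (assumed) alert threshold."""
    return compromise_rate(accounts_at_cpp, elevated_accounts) >= threshold


at_cpp = {"A", "B", "C", "D"}
elevated = {"A", "B", "Z"}   # "Z" never used the CPP, so it is ignored
rate = compromise_rate(at_cpp, elevated)
alert = should_alert(at_cpp, elevated)
```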

Using Visual Analysis to Combat Call Center Fraud

Your company's call center processes thousands of transactions each week. They are the face of your company and in most cases, are empowered to access customer data, financial data and grant concessions to dissatisfied customers.

Preventing and detecting internal fraud and data leaks occurring in your call center can be a daunting task because of the sheer number of customers and agents that are involved. Adding complexity to proper oversight, a great number of these call centers are outsourced, often overseas, making access to complete audit trails of activity more difficult.

From the perspective of executives and managers of call center companies, protecting your clients' data and property is one of your top operational priorities. Leaks of client data or fraud involving your clients' merchandise could damage your reputation, put your current contracts at risk and leave you open to potential litigation.

Well thought out and enforced privacy and operational policies go a long way toward protecting your call center or your clients' data; however, it is unrealistic to think that all the employees in your call center are going to play by the rules. Just as in every business, there are going to be a few who are looking to profit off their position.

This is where a solid fraud analysis and investigation program will provide a further layer of protection that can proactively identify potential fraud and data loss at the earliest stages. Because of the volume of transactions, leveraging visual analysis is the best way to look for associations between call center agents and the customers they are interacting with.

The Threat Levels That Exist In Call Centers

There are three distinct threat levels which exist in most call centers. The level of investigative analysis and investigation should be based on the potential threat level which exists in the center.

  • Low - Agents who mistakenly leak information or allow concessions to customers by failing to follow proper procedures
  • Medium - Outsourced or temporary agents who have little or no company loyalty and no stake in the company's success, but have access to customer information or the ability to send concessions (free merchandise, repair replacements, etc.)
  • High - Fraudulent agent groups within a call center - call center agents who are representatives of criminal organizations, or friends of corrupt agents, present for the sole purpose of stealing customer information or converting concessions for personal use.

Visual analytics can effectively address medium and high threat risks within the call center organization by identifying clusters of interrelated activity, customers and customer attributes, and activity logs between the call center agents themselves and the individuals they are having contact with or sending merchandise to. Always remember, fraud follows the path of least resistance. By shoring up your fraud prevention defenses through visual analysis, those organizations wanting to penetrate and corrupt your call center's organization will search elsewhere.

Identifying Call Center Concession Fraud

Probably one of the most difficult analytical tasks will be the identification of call center agents who are converting merchandise or concessions for personal use. This type of activity costs companies thousands of dollars every week in misappropriated goods.

The difficulty in discovering this activity through data mining is that the activity itself is three dimensional, involving the call center agents themselves and the customers. To accurately analyze the activity we have to look at not only the activity of the agent but also the relationships of the customer to their attributes and even the relationship between the call center agents and the customers.

This often involves the extraction and mining of data from multiple data sources, and the layering of that data in a visual analysis. This is the scenario we are going to employ in this example because if you are able to leverage visual analysis to proactively identify this activity, analyzing any internal issues such as data or financial theft, which is a one-source visualization, will be much easier to accomplish.

So let's start with a scenario for this analysis. I am a fraud investigator for a call center and conduct a monthly analysis of concessions sent out by my agents to detect any fraud or theft which may be occurring.

Like all analysis, the first step is the planning, extraction and cleaning of the data for import into our visual analysis tool. Since I am analyzing concession fraud, I am going to need to extract data out of my agent activity database which will give me the service requests, type of concession and date of concession. Next, I will need to extract data from my customer and shipping database to find relationships between the customers and where the concessions were shipped.

In all fraud analysis, we are looking to leverage the weakest link of the scheme. In concession fraud, the weakest link in the scheme is the shipping address. If you are a call center agent and either converting your company's merchandise for personal use or sending concessions to friends, the one piece of information that will have to be accurate is the shipping address. This is the main entity in my visualization that I am going to focus on.

In my first step, I download all transactions from the call center which are coded as concession transactions including the service request number, the service request date, the customer ID, the agent name or number, the concession which was sent out and if possible, the tracking number of the concession package.

In my next step, I download the customer and shipping information from my database. I want to ensure that I capture all fields in the shipping database which will accurately and uniquely identify the location the concessions were sent to. If the agent is involved in fraud, they are going to alter the names and phones; however, the address will have to be correct for the scheme to work. Agents who are very good at committing this type of fraud will alter the addresses enough to avoid detection but still ensure delivery. We can counteract this tactic in visual analysis by conducting semantic matches between entities which will detect patterns of inverted numbers, names, slight misspellings or the addition of small pieces of information in the shipping address.

Now that I have downloaded the data I need for my visual analysis, since this data came from two different sources, I am going to need to join the two tables or files of data to make a flat file or view for import.

Once I have joined and cleaned my data of nulls and bad values, I am ready to import my data into my visualization. The schema that I will use for this analysis will be the call center agent linked to the service request, to the customer, to the customer's shipping address, phone and email.
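The join-and-clean step can be sketched in a few lines of Python. The field names and sample records below are invented for illustration, and in practice this step would more likely be done with a database view or an import wizard than with hand-written code.

```python
# Source 1: concession transactions from the agent activity database (made up).
concessions = [
    {"sr": "SR1", "agent": "AG7", "customer_id": "C1", "item": "headset"},
    {"sr": "SR2", "agent": "AG7", "customer_id": "C2", "item": "router"},
    {"sr": "SR3", "agent": "AG9", "customer_id": None, "item": "router"},
]

# Source 2: customer and shipping records keyed by customer ID (made up).
shipping = {
    "C1": {"name": "A. Smith", "address": "123 Main St"},
    "C2": {"name": "B. Jones", "address": ""},
}


def flatten(concessions, shipping):
    """Join the two sources and drop rows with nulls or blank addresses."""
    rows = []
    for concession in concessions:
        ship = shipping.get(concession["customer_id"])
        if ship is None or not ship["address"].strip():
            continue  # no matching customer, or no usable shipping address
        rows.append({**concession, **ship})
    return rows


flat = flatten(concessions, shipping)
```

Each surviving row carries the agent, service request, customer and shipping address together, which is the shape the schema above expects.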

After completion of the initial import of all my data, plan for the visualization to appear as the example below. The reason for such large clusters of data is the call center agent entity. Through normal transactions, multiple agents are going to link together through shared customers; this is not an indicator of fraud.

The good thing is that for my first cluster analysis of this data, I am not even going to use the call center agent at all in my visualization. By filtering out the call center agent and temporarily hiding that entity I can focus on customer clusters linked by address and service request. By looking for groups of customers which are linked together, I can identify possible destinations for fraudulent concessions, friends of agents which are being shipped merchandise or in the case of external fraud, individuals who are taking advantage of my call center reps to gain free merchandise.

I am moving on to my next step and hiding my call center agents to look for customer clusters. Don't worry, we will bring the agents back shortly. I am going to focus on the largest clusters of interrelated customers, which my visualization tool is going to sort for me left to right in my chart.

My largest cluster of interrelated customers is involved in ten different service requests in which the customer was shipped a concession by the call center agent. There are seven different names associated with this cluster, however they are all linked to the same address which is extremely suspicious.
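Outside of a visualization tool, the same "hide the agents and cluster by address" idea can be approximated by grouping service requests on the shipping address and sorting the groups by size. This is only a sketch with invented rows and field names, and it assumes the addresses have already been cleaned to a consistent form.

```python
from collections import defaultdict

# Invented flat-file rows; the agent column is deliberately ignored here.
rows = [
    {"sr": "SR1", "name": "A. Smith", "address": "123 main st"},
    {"sr": "SR2", "name": "B. Jones", "address": "123 main st"},
    {"sr": "SR3", "name": "C. Reyes", "address": "123 main st"},
    {"sr": "SR4", "name": "D. Cruz", "address": "9 oak rd"},
]


def address_clusters(rows):
    """Group service requests by address, largest cluster first."""
    clusters = defaultdict(lambda: {"names": set(), "requests": set()})
    for row in rows:
        cluster = clusters[row["address"]]
        cluster["names"].add(row["name"])
        cluster["requests"].add(row["sr"])
    return sorted(
        clusters.items(), key=lambda item: len(item[1]["requests"]), reverse=True
    )


ranked = address_clusters(rows)
top_address, top_cluster = ranked[0]
```

An address with many service requests spread across many different names is the kind of cluster worth breaking out first.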

My next step is to leverage visual semantic searching across all my customer entities and attributes to detect attempts to create the appearance of two separate addresses by changing small details in the data such as 123 main street and 123 main st. A strong visual analysis program, including the tool used in the example, i2, will incorporate smart matching to detect these entities such as in the example below.
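I can't show how i2's smart matching works internally, but the general idea of catching small address variations can be sketched with Python's standard library. The abbreviation table and the 0.9 similarity threshold below are assumptions chosen for illustration, not settings from any product.

```python
import re
from difflib import SequenceMatcher

# A tiny, hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "apartment": "apt"}


def normalize(address):
    """Lowercase, strip punctuation and collapse common street abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def likely_same(a, b, threshold=0.9):
    """True when two addresses are near-identical after normalization."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


same = likely_same("123 main street", "123 Main St.")
different = likely_same("123 Main St", "77 Ocean Avenue")
```

Matches found this way are candidates for merging entities in the chart, and each one still deserves a human look before it is treated as the same address.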

Once smart matching is completed and I have ensured that my clusters contain all linked entities and data, I am going to break out my largest cluster and incorporate the call center agent or agents which created the service requests linked to the customer cluster.

Now before we go further, there are two possible scenarios which may be occurring with large clusters of interrelated customers, both of which will be detectable when we bring back our call center agents. First, there is a group of customers who are taking advantage of my company and call center by acquiring concessions or merchandise through false pretenses. If this is the case, then the service requests will be linked to different agents. The second will be an agent sending out concessions or merchandise to individuals fraudulently, in which case there will be two indicators: all of the service requests will be linked to the same agent, and the customer profile will have been created by the call center agent because no service call ever existed.
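The two scenarios can be expressed as simple indicators computed over a cluster's service requests. The sketch below is only a rough aid for triage, with hypothetical field names and invented records; it summarizes the indicators described above and does not replace the analyst's judgment.

```python
def cluster_indicators(requests):
    """Summarize who handled a cluster's requests and who created the profiles."""
    agents = {r["agent"] for r in requests}
    agent_created = sum(1 for r in requests if r["profile_created_by"] == r["agent"])
    return {
        "distinct_agents": len(agents),
        "requests": len(requests),
        "agent_created_profiles": agent_created,
    }


def leans_internal(indicators):
    """Few agents handling many requests, all on profiles they created themselves."""
    return (
        indicators["agent_created_profiles"] == indicators["requests"]
        and indicators["distinct_agents"] < indicators["requests"]
    )


internal_cluster = [
    {"sr": "SR1", "agent": "AG7", "profile_created_by": "AG7"},
    {"sr": "SR2", "agent": "AG7", "profile_created_by": "AG7"},
    {"sr": "SR3", "agent": "AG7", "profile_created_by": "AG7"},
]
external_cluster = [
    {"sr": "SR4", "agent": "AG1", "profile_created_by": "customer"},
    {"sr": "SR5", "agent": "AG2", "profile_created_by": "customer"},
]

internal_summary = cluster_indicators(internal_cluster)
```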

Let's bring in the call center agent entity and see which scenario exists. From the visualization we see below, all of the interrelated customers and service requests are linked to two agents in which multiple concessions were sent to the same address.

The same call center agent linked to multiple service requests

Group of customers all linked to the same address

Call center agent linked to multiple service requests to customers linked to same address

To complete my analysis I am going to examine who created the customer profiles for each of the customers in my visualization and also incorporate the call logs for the agents during the dates and time these service requests were created to determine if an actual call was inbound to the agent when the service requests were created.
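The call log check boils down to asking whether the agent was on an inbound call when the service request was created. Here is a minimal sketch of that test; the five minute tolerance and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def had_inbound_call(sr_created, agent_calls, tolerance=timedelta(minutes=5)):
    """True if the request was created during, or just around, an inbound call.

    agent_calls is a list of (call_start, call_end) pairs for one agent.
    """
    return any(
        start - tolerance <= sr_created <= end + tolerance
        for start, end in agent_calls
    )


# Invented call log for a single agent.
calls = [(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 12))]

during_call = had_inbound_call(datetime(2024, 5, 1, 10, 8), calls)
no_call = had_inbound_call(datetime(2024, 5, 1, 14, 0), calls)
```

A service request with no matching inbound call is a strong candidate for one that was created by the agent without any customer on the line.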


From the visualization examples shown, we have identified two call center agents who are actively engaged in fraudulently shipping concessions (merchandise) to individuals for the purpose of converting it for their own use.

All of the service requests in this cluster occurred over a two week period and all were linked to different individuals living at the same address.

For a strong proactive deterrence to this type of fraud, a regular schedule for visualizing concessions should be performed based on the velocity of calls, numbers of agents and locations so that the analyst doesn't end up in information overload. For example if in my organization, I have ten call centers with 100 agents in five different countries handling 1000 transactions a month, I might want to schedule my analysis on a weekly basis to best identify the activity without being overwhelmed by the data.

By leveraging visual analysis in my fraud investigation and deterrence program, I can add another level of security for my call center, company and client, allowing for timely identification of fraudulent internal and external schemes.

Solving Crimes Through Multi-Source Data Visualization

There are very few cases where all the data you need to complete your analysis resides in one single database. In most link analysis examples, including the majority in my site, I give examples of visual analysis through the import of one data set. While this helps explain the theory behind the analysis being performed, in real time situations, the answers you are looking for rarely reside in one place.

The more complicated the crime or threat, the more disparate the data sources needed to arrive at a solution through visual analysis. For example, in eCommerce fraud, I rely on data from my transaction platform, order platform, account platform and log in records to provide a complete analysis of the threat being analyzed.

For this example I am going to use a scenario where the analyst is investigating a series of burglaries taking place at a hotel property. I am going to show that importing and layering data from multiple sources provides a complete picture of the activity that is occurring and a solution to the crimes.

Inventorying The Data and Data Needs

Approaching a case from an analyst standpoint is similar to the way an investigator approaches a new case. The analyst needs to understand the scenario of the threat and then conceptualize the possible sources of information that can be obtained to complete the analysis. The first step is to inventory all the data on hand and the data that you will need to begin a visualization of the case. Just like any investigation, there are going to be additional data needs to complete the analysis, but getting your arms around what you need to start will save you time and aggravation, especially if some of the data you are going to need requires a subpoena or access granted by other data administrators.

In this example, I am being asked to perform a visual analysis of hotel burglaries. I know that there have been multiple burglaries from rooms over the past week. I know the hotel has an electronic key system that logs all entries into rooms on the hotel's server, my first source of data.

From reviewing the incident reports, I see that cell phones are among the items stolen most frequently. There is a good chance that whoever is responsible for the burglaries has also made calls on the stolen phones, which can assist me with my analysis, my second source of data.

Another commonly stolen item from these burglaries is jewelry. Knowing that thieves often pawn stolen jewelry for cash, I can access my department's pawn ticket database and integrate that into my analysis, my third source of data.

I have access to my department's case management system so I can download all of the incident report data and integrate that into my analysis, my fourth data source.

Starting The Visualization

Now that I have inventoried and obtained the data I require to perform my analysis, my next step is to decide the best way to integrate all the data from the different sources I have into one visualization.

One of the issues that arises when using data from multiple sources is that the formatting and structure of each data source is going to be different, requiring cleaning and planning prior to importing it into your visualization. The type of analysis and threat is also going to dictate the type of visualization you are going to need in order to produce a result.

In this example I have two options. This is a series of hotel burglaries which may be committed by multiple individuals who are interrelated so an entity association chart might be an option. On the other hand, all of the data I have is time bound, hotel key logs, cell phone records and incident reports, so a time bound theme chart might be best.

Since all of my data is structured by date and time, I am going to begin with a theme chart layout that is time bound to a time line for my analysis. The first item I am going to import in is my RMS data (incident report data) to establish a base for my time line of events.

From this point, I can visualize the dates and times that the hotel burglaries occurred which will help me parse the other data I have by date. From the key log entry files I have obtained from the hotel, I can filter my import by limiting the access logs to the time span of the events.
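Limiting the key logs to the time span of the events is a simple filter. The sketch below uses invented records and hypothetical field names, and it pads the window by two hours on each side, which is an assumption rather than a rule.

```python
from datetime import datetime, timedelta


def incident_window(incidents, padding=timedelta(hours=2)):
    """Earliest and latest incident times, widened by the padding."""
    times = [incident["occurred_at"] for incident in incidents]
    return min(times) - padding, max(times) + padding


def filter_key_logs(key_logs, incidents, padding=timedelta(hours=2)):
    """Keep only key log entries that fall inside the incident time span."""
    start, end = incident_window(incidents, padding)
    return [entry for entry in key_logs if start <= entry["entered_at"] <= end]


# Invented incident and key log records.
incidents = [
    {"room": "204", "occurred_at": datetime(2024, 3, 1, 15, 0)},
    {"room": "310", "occurred_at": datetime(2024, 3, 3, 16, 0)},
]
key_logs = [
    {"room": "204", "person": "EMP-12", "entered_at": datetime(2024, 3, 1, 14, 30)},
    {"room": "118", "person": "EMP-40", "entered_at": datetime(2024, 2, 20, 9, 0)},
    {"room": "310", "person": "EMP-12", "entered_at": datetime(2024, 3, 3, 17, 30)},
]

in_span = filter_key_logs(key_logs, incidents)
```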

This data contains the rooms, the date and time of entry and, if available, the name of the person assigned to the card at the time. To best visualize this data I am going to incorporate the date and time stamp on the entry records to the theme line but link those event frames to the person gaining entry by creating a link association between the event frame and the entity who the card was assigned to.

Now that I have integrated the key logs and the RMS data into the theme line, I can examine and focus on those individuals who have the most associations between the data in the hotel key log and the RMS data.
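Counting those associations can be sketched as follows: for each incident, find everyone whose key card opened that room shortly before the report, then tally how many incidents each person is tied to. The four hour window, the field names and the records are all invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_links(key_logs, incidents, window=timedelta(hours=4)):
    """Number of incidents each card holder is linked to by room and time."""
    links = Counter()
    for incident in incidents:
        # Count each person once per incident, however many times they entered.
        people = {
            entry["person"]
            for entry in key_logs
            if entry["room"] == incident["room"]
            and timedelta(0) <= incident["reported_at"] - entry["entered_at"] <= window
        }
        links.update(people)
    return links


incidents = [
    {"room": "204", "reported_at": datetime(2024, 3, 1, 18, 0)},
    {"room": "310", "reported_at": datetime(2024, 3, 2, 19, 0)},
    {"room": "415", "reported_at": datetime(2024, 3, 3, 17, 30)},
]
key_logs = [
    {"room": "204", "person": "EMP-12", "entered_at": datetime(2024, 3, 1, 15, 0)},
    {"room": "204", "person": "GUEST-204", "entered_at": datetime(2024, 3, 1, 17, 50)},
    {"room": "310", "person": "EMP-12", "entered_at": datetime(2024, 3, 2, 16, 30)},
    {"room": "415", "person": "EMP-12", "entered_at": datetime(2024, 3, 3, 14, 0)},
    {"room": "415", "person": "EMP-30", "entered_at": datetime(2024, 3, 3, 9, 0)},
]

links = count_links(key_logs, incidents)
```

In this made-up data, one card holder is tied to all three rooms, while the guest and the other employee are tied to one or none.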

I begin grouping together the burglary events by the type of event and the items that were removed. This will help me when I integrate in the data I have from the pawn ticket database and the cell phone records from the stolen items.

Now I have an overview of the incidents and those individuals who have links to more than three of the incidents. My goal now is to integrate the rest of my data and try to draw a link between the incidents and the players involved.

At this point I am going to import the cell phone records as a directional link chart showing the originating and destination phone numbers, to see if I can link any of the numbers to anyone in my ring.

Once I have visualized the cell phone records of the stolen phones, I find another cell phone number that all three of the stolen phones have called. Using my reverse directory I am able to link the cell phone that the stolen phones called to one of the employees at the hotel.
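The same question the directional link chart answers visually, "which number did every stolen phone call?", can be sketched as a set intersection. This is only an illustration under assumed data: the phone numbers and the (origin, destination) record layout are made up.

```python
# Hypothetical call records: (originating number, destination number).
calls = [
    ("555-0101", "555-0199"),
    ("555-0101", "555-0142"),
    ("555-0102", "555-0199"),
    ("555-0102", "555-0177"),
    ("555-0103", "555-0199"),
]
stolen_phones = {"555-0101", "555-0102", "555-0103"}


def numbers_called_by_all(call_records, phones):
    """Return destination numbers that every phone in `phones` has called."""
    called = {phone: set() for phone in phones}
    for origin, destination in call_records:
        if origin in called:
            called[origin].add(destination)
    return set.intersection(*called.values()) if called else set()


common = numbers_called_by_all(calls, stolen_phones)
```

Any number left in `common` is a candidate to run through the reverse directory.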

The employee I identified as receiving calls from the stolen phones wasn't linked to the rooms where the phones were stolen; however, the employee was linked to another employee who entered each of the rooms where cell phones were stolen.

Next I import my pawn ticket data that matched the items stolen from the rooms where jewelry was removed. The pawn ticket was linked to an entity that I did not have on my chart from the room access log files, so my next step is to import the NCIC report for the individual to see if I can link the name on the pawn ticket to one of my suspects on the chart. If this were a commercial analysis as opposed to law enforcement, we could do the same with a LexisNexis report.

After importing the NCIC data into my chart I am able to link an alias from the pawn ticket to the individual being called on the stolen cell phones, an employee at the hotel.


By importing multi-source data from my record system, hotel room log server, NCIC, pawn ticket database and cell phone records, I am able to visualize and link the multiple hotel room burglaries to two individuals, both of whom work at the hotel.

Multi-source analysis is almost impossible without a visual analysis tool. Drawing relationships between disparate data sources that cannot be linked together by relational fields can only be accomplished through visual analysis.

In this scenario, if I had omitted even one of my data sources for my analysis, I would not have been able to link the incidents together or narrow the list of potential suspects to two. By carefully inventorying all of the data I had available for my analysis and then carefully planning how I was going to visualize it, I was able to produce a time bound theme line and link chart showing the entire investigation, the source data and the suspects for investigation.

Integrating eCommerce Analysis With Insurance Fraud Analysis

More and more, insurance companies are getting into the eCommerce business. Today you buy policies, file claims, get quotes and pay for everything with a credit card, all online. Many businesses, primarily retail, have been involved in eCommerce and the corresponding fraud analysis that comes with it.

As insurance companies move a great deal of their business online, fraud analysis, prevention and investigation will have to leverage online detail. SIUs are experts at the investigation of claims, and in the past all the information from those claims came from contact between the claimant and the claims representative. When a person wanted to take out an insurance policy on a vehicle they went to their agent or, at a minimum, talked to a salesperson on the phone. Somewhere in the transaction there was personal contact.

With more insurance business moving to the web, there is no personal interaction with anyone, from the person buying a policy from your company to the person filing injury claims from an accident. It is important to integrate the information captured during these online sessions into fraud detection, analysis and investigation. Because this is a fairly new venture for insurance companies, it's time to take a lesson from your retail corporate neighbors on how to integrate eCommerce data into your fraud program.

There are a great many similarities between the fraud analysis performed for eCommerce and that performed for insurance fraud. Links between attributes entered by the customer, such as addresses, phone numbers, email addresses and the like, are leveraged to find clusters of suspected fraudulent activity. The difference between the two industries is that eCommerce analysis places less weight on physical attributes and more weight on internet captured data to determine relationships between entities and purchases.

The reason more weight is placed on iData than on physical or entered data is that in eCommerce fraud analysis, it is assumed that anything entered by the user in a fraud scenario is false. It is still important data, since the way people create false data produces useful patterns for analysis, but when you are trying to tie multiple fraudulent transactions together, what is behind the information is more important.

In insurance fraud analysis, almost 100% of the weight is placed on physical or entered data and iData is almost non-existent. When I was involved in insurance fraud analysis, the rationale for not capturing or integrating iData into fraud analysis and investigation was that at the end of the day, an investigator or claims adjuster would be face to face with the claimant and the information would be verified. The more claims and insurance business is handled online, the more danger SIUs get into. Today, someone can take out a policy and pay with a credit card, file a claim and have the money directly deposited into an account without ever seeing anyone. Who is this person?

There is a quick test to determine if eCommerce data needs to be integrated into your fraud analysis:

1. What is your charge back rate for online policies purchased with credit cards?

2. For the policy numbers where charge backs have occurred, what is the claim rate for those policies and how much was paid?

3. What claims are paid out without any physical contact with a claims adjuster (broken glass claims, claims under $500)?

4. Does your company permit the underwriting of new policies online without a vehicle inspection?

If your company has a charge back rate above 1%, the rate of claims on the charge back policies exceeds 20% of all policies, and your company pays certain claims without requiring a meeting with a claims adjuster, you have an insurance eCommerce issue that requires integrating iData into your insurance fraud analysis to combat the following scenarios:

  • A person using stolen credit cards to take out policies and either filing a claim on the policy or requesting a refund for the policy before the cardholder discovers the charge (30 to 60 days depending on whether the card is a personal or business card)
  • A person acting as an agent, selling individuals policies from your website and charging them more than the policy value without a license to sell insurance
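The quick test above can be written down as a simple check. This is a sketch under one stated assumption: I am reading the 20% threshold as the share of charge back policies that also had a claim filed, which is only one possible interpretation of the wording. The function name and the sample counts are hypothetical.

```python
def needs_idata_integration(chargeback_rate, chargeback_claim_rate,
                            pays_without_adjuster):
    """Apply the three thresholds from the quick test.

    Rates are fractions (0.01 means 1%). The 20% figure is assumed to be
    the claim rate among policies that had a charge back.
    """
    return (chargeback_rate > 0.01
            and chargeback_claim_rate > 0.20
            and pays_without_adjuster)


# Hypothetical counts for one book of online business.
online_policies = 5000
chargeback_policies = 75
chargeback_policies_with_claims = 20

chargeback_rate = chargeback_policies / online_policies
claim_rate = chargeback_policies_with_claims / chargeback_policies
flagged = needs_idata_integration(chargeback_rate, claim_rate, True)
```

With these made-up numbers the charge back rate is 1.5% and roughly 27% of the charge back policies carry a claim, so the book would be flagged.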

Let's look at an example of leveraging internet data captured during eCommerce in fraud analysis. In this scenario I have a group of PIP claims, all of which are multi-occupant and multi-injury, treating at the same medical provider.

After importing the claims and policy information into my visualization I do not detect a link between the multiple claimants or the related entities to the claimants such as address or phone.

I discover that all of the policies were incepted in the last 60 days and all were applied for online at the company web site. I pull the IP address for each session when the policy applications were completed online, along with the corresponding device IDs, and import that data into my visualization.

I find out that all 12 claims were on policies that originated from three separate computers tied to one IP address. All eCommerce transactions capture the IP, device ID, operating system and browser type in order to present the information needed for the transaction to the end user. Device IDs are normally hashes from an algorithm established by your IT department based on information captured during the transaction. The level of information captured from the device differs from company to company, but regardless of the level of detail, the device hash creates your unique identifier for the computer transacting claims and policies on your system.
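To make the idea of a device hash concrete, here is a minimal sketch that assumes the device ID is a SHA-256 hash over a handful of captured attributes. The attribute list, the session rows and the IP addresses are hypothetical; every company's IT department defines its own algorithm.

```python
import hashlib
from collections import defaultdict


def device_hash(operating_system, browser, screen, timezone):
    """Hash a few captured attributes into one device identifier.

    The choice of attributes is an assumption made for illustration.
    """
    raw = "|".join([operating_system, browser, screen, timezone])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Hypothetical application sessions: (policy number, IP, device attributes).
sessions = [
    ("P001", "203.0.113.10", ("Windows XP", "IE 7", "1024x768", "UTC-5")),
    ("P002", "203.0.113.10", ("Windows XP", "IE 7", "1024x768", "UTC-5")),
    ("P003", "203.0.113.10", ("Windows Vista", "Firefox 3", "1280x800", "UTC-5")),
    ("P004", "198.51.100.7", ("Mac OS X", "Safari 4", "1440x900", "UTC-8")),
]

devices_by_ip = defaultdict(set)
policies_by_ip = defaultdict(list)
for policy, ip, attributes in sessions:
    devices_by_ip[ip].add(device_hash(*attributes))
    policies_by_ip[ip].append(policy)
```

Grouping by IP and counting distinct device hashes gives the same "several computers behind one address" picture that the chart shows.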

Now that I know all the policies originated from the same IP address, I can run a "whois" query on the IP address to determine the ISP or host for the IP address. In this case I discover that the IP address linked to my 12 claims is assigned to the medical provider where all the claimants are treating!

This may sound unrealistic, but while the personal information has been changed for this chart, this was an actual case investigated by SIU in Florida. Taking the analysis to the next step, it was discovered that the payment instrument used to pay for these policies online was a corporate Amex belonging to the medical director of the company.

This moves us along to the next necessity in visualizing iData for insurance fraud analysis. Just as in eCommerce fraud, there is a point of friction in each online transaction. If the goal is to obtain stolen merchandise online, the point of friction is the shipping address. In the case of insurance eCommerce fraud, the point of friction, just as in all insurance fraud claims, is where the money comes in and goes out.

Capturing the payment instrument hash for analysis will take your insurance fraud analysis to the next level as well. I can visualize instances where 10 or more policies were paid for with the same payment instrument and the corresponding claims to discover fraud.

I can visualize where multiple claims all with different claimants had their settlements deposited into the same ACH (direct deposit) account and discover the reason why.
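Both checks, many policies on one payment instrument and many claimants paid into one ACH account, come down to grouping distinct names by a hash. A rough sketch, with hypothetical hashes and names:

```python
from collections import defaultdict


def shared_instruments(records, min_distinct):
    """Group distinct party names by instrument hash and keep the busy ones.

    `records` is an iterable of (instrument_hash, party_name) pairs.
    """
    parties = defaultdict(set)
    for instrument, party in records:
        parties[instrument].add(party)
    return {key: names for key, names in parties.items()
            if len(names) >= min_distinct}


# Hypothetical settlements: (ACH account hash, claimant).
settlements = [
    ("ach_aa1", "Claimant 1"),
    ("ach_aa1", "Claimant 2"),
    ("ach_aa1", "Claimant 3"),
    ("ach_bb2", "Claimant 4"),
]
suspicious_accounts = shared_instruments(settlements, min_distinct=2)
```

The same function could be pointed at (payment hash, policy holder) pairs with `min_distinct=10` to mirror the "10 or more policies" example.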

To summarize, the integration of eCommerce or iData has been essential in the detection and analysis of online fraudulent transactions in the retail space for years. As insurance companies transition more business online, the importance of integrating iData into claims and policy fraud analysis increases. Information such as:

  • Session IP and cookie data

  • device ID

  • payment method hash

can expand your ability to proactively detect insurance fraud from eCommerce transactions.

Fraud Modeling Insurance eCommerce Transactions

Fraud modeling methodology has been used and refined in retail eCommerce transactions but has been absent in insurance eCommerce.

Once again, the more business that insurance companies transition to the internet, the more important fraud modeling based on eCommerce transactions becomes. There is important data captured in online sessions that can be critical to policy and claims fraud identification.

One example would be based on a scenario we discussed earlier in visualizing payment hashes used in policy inception. In that example we discussed how to identify multiple policies incepted by the same payment method and hash to identify potentially fraudulent claims.

There would be no legitimate reason for multiple policies belonging to different people to be incepted with the same payment instrument. More likely reasons would be someone incepting policies to commit claims fraud or someone acting as an unlicensed agent to sell policies utilizing the company's web site.

By leveraging in-line fraud modeling, a velocity check could be placed on the payment instrument hash, preventing these transactions from occurring at the point of origin. A basic velocity rule would preclude more than one or two policies with two different policy holders' names from being incepted with the same payment instrument hash.
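A basic velocity rule of this kind might look like the sketch below. The limit of two distinct policy holders comes from the text; the class, the method names and the in-memory storage are hypothetical, and a production model would keep its counts in a shared data store rather than in a Python dictionary.

```python
from collections import defaultdict


class PaymentVelocityRule:
    """Decline a new policy once a payment hash has been used by too many
    distinct policy holders."""

    def __init__(self, max_holders=2):
        self.max_holders = max_holders
        self.holders = defaultdict(set)

    def allow(self, payment_hash, holder_name):
        seen = self.holders[payment_hash]
        if holder_name not in seen and len(seen) >= self.max_holders:
            return False  # velocity exceeded: decline or route to review
        seen.add(holder_name)
        return True


rule = PaymentVelocityRule(max_holders=2)
results = [rule.allow("card_hash_1", name)
           for name in ["Ann Lee", "Bob Ray", "Ann Lee", "Cy Dunn"]]
```

The repeat purchase by an existing holder passes, while the third distinct name on the same card hash is stopped at the point of origin.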

Another modeling point would be velocities and "black listing" of suspect IP addresses. If multiple online transactions from the same IP address result in charge backs or are identified as fraudulent by investigations, a velocity and black list in-line check would prevent future policies from being established from that IP address.
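As a sketch of how such a black list could be derived from history, assuming each past transaction carries an IP address and an outcome, and assuming an illustrative threshold of three bad outcomes (the text does not give a number):

```python
from collections import Counter


def build_ip_blacklist(transactions, threshold=3):
    """Black list IPs tied to at least `threshold` charge backs or confirmed
    fraud cases. The threshold is an assumption made for illustration."""
    counts = Counter(ip for ip, outcome in transactions
                     if outcome in ("chargeback", "fraud"))
    return {ip for ip, n in counts.items() if n >= threshold}


# Hypothetical transaction history: (IP address, outcome).
history = [
    ("203.0.113.10", "chargeback"),
    ("203.0.113.10", "fraud"),
    ("203.0.113.10", "chargeback"),
    ("198.51.100.7", "chargeback"),
    ("198.51.100.7", "paid"),
]
blacklist = build_ip_blacklist(history)


def allow_policy(ip):
    """In-line check applied before a new policy is established."""
    return ip not in blacklist
```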

Other velocities which could be incorporated into an insurance in-line risk model could include attributes from previously discovered fraud activity such as:

1. Fictitious driver's license numbers or those belonging to staged accident participants
2. Addresses of medical providers or drop boxes being utilized to establish insurance policies
3. SSNs or Name DOB combinations of past insurance fraud participants
4. IP location data in relation to where the claim occurred or the policy was incepted. For example, the IP address is in New York, NY and the policy indicates that the vehicle and the policy owner are in Florida.
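A few of the attributes listed above can be sketched as in-line checks that each return a flag. The watch lists, field names and sample application are hypothetical, and a real risk model would weight the flags into a score rather than just list them.

```python
# Hypothetical watch lists built from previously discovered fraud.
KNOWN_BAD_LICENSES = {"D123-456-78-901-0"}
KNOWN_BAD_ADDRESSES = {"100 CLINIC WAY, MIAMI, FL"}


def inline_flags(application):
    """Return the names of the in-line checks an application trips."""
    flags = []
    if application["license"] in KNOWN_BAD_LICENSES:
        flags.append("license_on_watch_list")
    if application["address"].upper() in KNOWN_BAD_ADDRESSES:
        flags.append("address_on_watch_list")
    if application["ip_state"] != application["policy_state"]:
        flags.append("ip_location_mismatch")
    return flags


application = {
    "license": "X999-000-00-000-0",
    "address": "100 Clinic Way, Miami, FL",
    "ip_state": "NY",
    "policy_state": "FL",
}
flags = inline_flags(application)
```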

There have always been fraud check rules in place at the adjuster level for insurance fraud discovery. By moving some of these rules in-line and integrating eCommerce data, scoring of potentially fraudulent policies and claims can be accomplished in real time for more timely intervention by the SIU.

Visualizing Organized Retail Fraud

Organized fraud poses the highest risk to any commerce platform, from banking, eCommerce and insurance companies to brick and mortar retail chains and individual stores. In a retail environment, opportunistic shoplifting (the single shoplifter who steals for themselves) is the lowest level threat, followed by internal fraud, which represents a moderate risk.

Organized shoplifting rings can have hundreds of participants both internal and external to the retail organization and can steal over $1 million in merchandise and cash in a single month.

These rings do not limit themselves to just the shoplifting of merchandise from stores. They are highly complex criminal organizations with the resources to attack merchants from a number of fronts simultaneously, committing:

  • Credit card fraud
  • Refund and rebate fraud
  • Price switching and price altering through the production of bar codes
  • Burglary by "sleep in" thefts from stores after hours

These rings have market intelligence on which merchandise brings the most value and the fronts in place to sell the merchandise they steal within hours. They operate as cells that come together as a complete organization that can span the entire country or, in the case of hybrid retail theft rings that target designers, can be international.

Combating organized fraud requires the same type of intelligence gathering, case management and analysis that is present in law enforcement organizations and financial institutions. It requires the centralization of incident information across the entire franchise.

Records of shoplifters can no longer be housed store by store. They have to be centralized and databased across an entire franchise to be able to leverage visual analysis to find the relationships between the shoplifter caught in Atlanta, Georgia and the one caught in California who were both stealing massive amounts of drill bits.

Databasing Retail Theft Information

Like most organized fraud rings, participants come equipped with a wide variety of identities and cover stories; however, like most organized rings, this can end up being a vulnerability when employing databasing and visual analysis. Because organized ring suspects get caught individually and in mass, those identities end up getting re-used from time to time, thus establishing a link between the participants.

Because rings focus on certain merchandise, what they steal can be as important a link between suspects as their identities in visual analysis. Visually we can group together shoplifting suspects by the type of merchandise they specialize in stealing and moving. This is the same way police analyze how a bank robber commits his crime to link that individual to multiple bank robberies.

Refund information is essential to centralize and integrate with shoplifting data. If an organized ring can refund stolen merchandise for full price as opposed to selling it to a fence for a fraction of the retail price they will. They are experts in reproducing receipts and often partner with internal employees to assist them in refunding merchandise.

While we are talking about internal employees, ever wondered what the employee you caught last week stealing high end small electronics is doing with them? The average employee steals twenty to thirty times before being caught.

If you caught an employee handing out 20 USB flash drives to an unknown customer without charging for them, that means the employee actually handed out 400 of them before you found out. Unless you are a major league computer pack rat, I doubt any one person needs 400 USB flash drives. Those are being refunded to your sister store five states away.

When centralizing data for analysis in a retail investigation environment consider integrating the following information to maximize the potential for visual analysis:

  • External theft suspect information including detailed information regarding the items being stolen and the methods or tools used.
  • Internal theft suspects including employment information such as what department they worked in, who they worked for and with, and their responsibilities
  • Refund and rebate information, particularly refunds without receipt
  • Integrating gift card information database with your theft database
  • Integrating any store specific credit card database with your theft database

Visualizing Retail Theft Data

For this scenario I am an analyst for a large retail chain tasked with identifying organized fraud. I have the ability to query off my central database of store incidents and export the data from my database for import into my visual analysis program.

As organized fraud rings are nomadic and multi-state, I need the widest sample of data possible to establish links between identifying information which has an average accuracy rate of less than 30%.

I am going to review my data and determine what fields I need to extract in order to create unique identifiers for my subjects and the merchandise involved.

By reviewing the data I have determined that I am going to create person entities by utilizing the name and DOB of my incident suspects. I will be linking my person entities to addresses, phones, incidents and merchandise.

Merchandise is going to be an entity that is specific to retail and very tricky to use in visual analysis. For the type of merchandise to be a relevant entity in analysis, we have to make the type of merchandise stolen unique, but not so unique that at best we end up with a one to one relationship.

Ideally, utilizing the SKU or product number would be best; however, if that data isn't captured, using a combination of quantity and merchandise description could create a unique identity.

If I were dealing with a large amount of data, I might specifically query incidents with a merchandise total over $500 or a quantity over 5 so that I am not trying to link together every teenage kid who boosted the latest Dave Matthews CD across the country.
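A minimal sketch of both ideas, building a merchandise identity and filtering out small incidents before import, might look like this. The field names and sample incidents are hypothetical; the $500 and quantity 5 thresholds are the ones mentioned above.

```python
def merchandise_key(incident):
    """Prefer the product number; otherwise fall back to quantity plus a
    normalized description."""
    if incident.get("product_number"):
        return incident["product_number"]
    description = " ".join(incident["description"].lower().split())
    return f'{incident["quantity"]}x {description}'


def worth_linking(incident, min_total=500, min_quantity=5):
    """Filter out low value, low quantity incidents before import."""
    return incident["total"] > min_total or incident["quantity"] > min_quantity


incidents = [
    {"product_number": None, "description": "Cobalt  Drill Bits",
     "quantity": 40, "total": 620.0},
    {"product_number": None, "description": "Music CD",
     "quantity": 1, "total": 14.99},
    {"product_number": "USB-8GB-01", "description": "USB flash drive",
     "quantity": 20, "total": 300.0},
]
keys = [merchandise_key(i) for i in incidents if worth_linking(i)]
```

Normalizing case and spacing in the description keeps "Cobalt  Drill Bits" and "cobalt drill bits" from becoming two different entities on the chart.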

Let's take a look at my import specification:

I have linked persons to address, phone and SSN for identifiable information. For my person entities I have created a unique identifier by combining the name and the DOB; however, since that is not going to make any sense as a label on my chart, I am only utilizing the name as the label.
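The split between a unique identity and a readable label can be sketched as follows. The separator and the normalization rules are assumptions for illustration, not the import specification of any particular tool.

```python
from datetime import date


def person_identity(name, dob):
    """Return (identity, label): the identity combines the normalized name
    and the DOB, while the label shown on the chart is the name only."""
    normalized = " ".join(name.upper().split())
    return f"{normalized}|{dob.isoformat()}", normalized


identity, label = person_identity("john  smith", date(1975, 3, 14))
other_identity, other_label = person_identity("John Smith", date(1980, 3, 14))
```

Two suspects named John Smith with different DOBs stay separate entities, but both show up on the chart with the same readable label.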

I can also utilize the DOB field in the entity attributes so that I can compare multiple individuals with the same name to find any sequence in the DOBs that were used. DOB patterning is easy if you remember that when people are making up a DOB they will use either a day, month or year from their actual DOB but make up the other two.

If a person makes up a year, unless they are a true professional, they don't have the ability to do the math when questioned. They will usually add or subtract years in a systematic way from their real DOB.
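DOB patterning between two records with the same name can be sketched as a comparison of the individual date parts plus the year offset. The sample dates are made up.

```python
from datetime import date


def shared_dob_parts(dob_a, dob_b):
    """Return which of day, month and year two DOBs have in common."""
    parts = []
    if dob_a.day == dob_b.day:
        parts.append("day")
    if dob_a.month == dob_b.month:
        parts.append("month")
    if dob_a.year == dob_b.year:
        parts.append("year")
    return parts


def year_offset(dob_a, dob_b):
    """Signed number of years between the two DOB years."""
    return dob_b.year - dob_a.year


real = date(1975, 3, 14)
alias_1 = date(1980, 3, 22)   # kept the month, made up the rest
alias_2 = date(1985, 3, 14)   # kept day and month, shifted the year by ten
```

Round offsets such as five or ten years between otherwise matching dates are the kind of systematic shift worth a closer look.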

I am going to execute my import specification and see if any clusters of interrelated incidents appear.

With the visualization software I am using, the clusters are going to be organized with the largest to the left, decreasing to the right. I am going to focus on the largest cluster of interrelated entities and see if I can identify any organized rings.
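What the software presents as clusters corresponds, roughly, to connected groups in the graph of links. The sketch below finds those groups with a breadth-first search and sorts them largest first. It is only an illustration of the idea with hypothetical entities, not the layout algorithm of any specific product.

```python
from collections import defaultdict, deque


def clusters_by_size(links):
    """Return connected groups of entities, largest first."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, clusters = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        clusters.append(group)
    return sorted(clusters, key=len, reverse=True)


# Hypothetical links between persons and the attributes they share.
links = [
    ("Person A", "Address 1"),
    ("Person B", "Address 1"),
    ("Person B", "Phone 1"),
    ("Person C", "Phone 1"),
    ("Person D", "Address 2"),
]
clusters = clusters_by_size(links)
```

Persons A, B and C never share a single attribute among all three, yet they land in one cluster because the address and the phone chain them together.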

What we see in this cross section from my chart is a group of individuals who are associated by address, phone or incident. What is interesting is that the individuals in this small subset of my chart are all related but committed different retail theft schemes, from internal theft to credit card fraud to refund fraud. This is exactly how organized shoplifting rings operate, by attacking a retail institution using varying methods and resources.

From the information I have developed through my visual analysis, I have located an organized fraud ring impacting my retail organization, and through my visualization I can uncover the methods they are utilizing and the types of merchandise they are targeting, enabling me to alert the stores in the states being impacted by the ring.

I can brief my in store security regarding the ring's method of operation, the times they hit, the involvement of potential internal employees and the places in the store they need to be focusing on to catch the group in action.

Once my in store resources have been alerted and have apprehended additional suspects following the same pattern uncovered from the analysis, I can integrate the new information into my link analysis chart to compile a complete case visualization for my company's inside counsel and law enforcement.


While visual analysis can be effectively deployed to locate organized theft it can also be integrated into numerous platforms to combat threats such as gift card fraud, credit card fraud, register or cash shortages and personnel social networking environments.

By proactively identifying organized rings and other fraud threats, you strengthen your organization's deterrent to future events. Fraud always follows the path of least resistance. By leveraging centralized databasing of theft intelligence and visual analysis, you can drive organized rings to easier targets of opportunity.

Integrating Statistical Analysis With Link Analysis

As discussed in previous posts, visual link analysis is a powerful tool for determining and confirming relationships between entities across a large span of data; however, its limitation is that the machine and the user can only take in so many entities and links in one workspace before the quality of the analysis begins to suffer.

The ability to drill down into data to extract sets of data to visually analyze is extremely important. One way we discussed doing this was through data queries and sets. The second method, which I will discuss in this article, is through an integration of statistical analysis and visual analysis.

For the purpose of this article I am going to be using i2's Analyst Workstation which integrates Data Miner, a statistical data analysis tool, with Analyst Notebook for visualization. This can also be accomplished through a variety of desktop tools depending on the size of the data such as Access, Excel, Crystal Reports and Visual Studio. The benefit to using the i2 tool is the ability to switch from statistical analysis to visual analysis without any additional import steps.

Building An Analytical Data Cube

This step is specific to tools that leverage analytical services in SQL; if you are using Excel or Access you can skip this step. A data cube is a subset of all data contained in a database which leverages fields with statistical value.

In statistical analysis you are only using the fields in your database that have statistical relevance, those which are repeating and not unique. For example, in the database I am going to analyze in Data Miner, I have fields such as "Credit Card Holders Name" which have no statistical value, and "Chargeback Date" which do. My first step is to review my data to determine which fields I want to use in building my data cube. The better I define these fields, the faster I am going to be able to drill down into the data, as my cube is only going to hold relevant statistical data. When I import the data into my visualization tool, I can always integrate all the other attributes in my data set to form relationships.

Another important point when building analytical data cubes is that the more indexed fields you utilize in your cube, the faster you can drill into your cube. When I am building a cube for analysis I will normally draw the data from a view I have created in SQL and apply indexing to those fields which might not be indexed in the hosting database.

For the purpose of this example I am going to build an analytical cube around eCommerce transaction payments to detect patterns in fraudulent electronic payments.

The following fields are contained in my database that have statistical value:

  • Purchase Type: a repeating field indicating what type of purchase was made

  • Platform: a repeating field indicating which eCommerce platform a purchase was associated with

  • Card Type: A repeating field indicating what type of card was used (AMEX,MC,VISA)

  • Transaction Amount: While this field is going to contain lots of unique amounts, for statistical purposes I can group those values in my cube such as between $50-$100

  • Transaction Date

  • Bank Identifier: The code on credit cards and routing numbers for direct debit, which identify the issuing financial institution

  • Fraud Score: The score assigned to the transaction by my risk model

  • IP City, State and Country: For determining the source of eCommerce transactions

  • Transaction State: A field indicating the state of the transaction such as "chargeback" or "fraud" or "dispute"

The fields I am leaving out are those with unique identifiers and no statistical relevance, such as name, street address or phone number. It is also important to note that your analytical cube can use any combination of fields with statistical relevance based on the type of analysis you are going to perform. For example, if I were only going to look at fraud charges by Country I could leave out fields such as "fraud score" or "transaction date" because they do not apply to the scenario I am examining and will slow down drilling into the data.
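Outside of a cube tool, the same field selection and amount grouping can be approximated in a few lines. This sketch keeps only repeating fields, replaces the transaction amount with a fixed-width band and counts the combinations. The column names, the $50 band width and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical repeating columns chosen for this analysis.
STATISTICAL_FIELDS = ("card_type", "ip_country", "state")


def amount_band(amount, width=50):
    """Group a transaction amount into a fixed-width band such as $50-$100."""
    low = int(amount // width) * width
    return f"${low}-${low + width}"


def cube_row(transaction):
    """Keep only repeating fields and replace the amount with its band."""
    selected = tuple(transaction[field] for field in STATISTICAL_FIELDS)
    return selected + (amount_band(transaction["amount"]),)


transactions = [
    {"name": "A. Holder", "card_type": "VISA", "ip_country": "US",
     "state": "chargeback", "amount": 72.10},
    {"name": "B. Holder", "card_type": "VISA", "ip_country": "US",
     "state": "chargeback", "amount": 99.00},
    {"name": "C. Holder", "card_type": "AMEX", "ip_country": "CA",
     "state": "dispute", "amount": 12.50},
]
cube = Counter(cube_row(t) for t in transactions)
```

The unique "name" field never enters the aggregate, and two different amounts collapse into one band that can be counted.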

One of the important things to remember about building analytical cubes is that, as opposed to databases which capture fields for every possible scenario in your environment, data cubes are customized for the exact type of analysis you are going to perform, and their performance and accuracy depend on you only utilizing those fields in your data with relevance to your analysis.

Once I have selected the fields I am going to utilize I build my cube utilizing analytical services in SQL Server. We can verify that the cube was built successfully by opening Analytical Services on the server containing your database. From the example below you can see that I have built several cubes around my eCommerce data.

Another important note about building analytical cubes is that the fields you select must have relationships to one another or you will lose data in your cube. For example, if I build a cube with "transaction date" and "IP country", every field in "IP Country" has to be populated. If I have any nulls in that column I will lose the entire line in my data and it will not be included in the aggregate set. For that reason it is important to plan which columns to use in your data model based on the statistical analysis you are going to perform. If the IP Country is important, losing a thousand transactions with no IP Country captured might not be a big deal; however, if it is important in your statistical analysis that all transactions be captured, you can only build your cube with columns where every field is populated, with no nulls in your data.
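The effect of nulls described above can be imitated with a simple filter that drops any row missing a value in one of the selected columns. This mimics the behavior as described in the text rather than the internals of any cube engine; the column names and rows are hypothetical.

```python
def cube_rows(rows, columns):
    """Keep only rows where every selected column is populated."""
    return [row for row in rows
            if all(row.get(column) is not None for column in columns)]


rows = [
    {"transaction_date": "2009-04-02", "ip_country": "US"},
    {"transaction_date": "2009-04-03", "ip_country": None},
    {"transaction_date": "2009-04-05", "ip_country": "CA"},
]

with_country = cube_rows(rows, ("transaction_date", "ip_country"))
date_only = cube_rows(rows, ("transaction_date",))
```

Adding the country column costs one transaction in this sample, which is the trade-off to weigh before choosing the columns for a cube.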

Drilling Into Your Data

Now that I have built my data cube I am going to begin my statistical analysis. Because I am using Data Miner, this can also be visual, making my statistical analysis much faster to understand than looking at aggregate text fields in a spreadsheet format.

Let's begin drilling down into my transactions to find specific fraud issues to visualize through link analysis. I am starting with all of my transactions in my cube:

From here I am going to drill down into transaction date to determine which months had the highest amount of chargebacks. When you build a data cube in any program, SQL analytical services takes date columns and breaks them out so you can drill down by Year, Month, Day or Day of the Week, Week Day or Week Number automatically:
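For readers working outside a cube tool, the date breakout described above can be approximated by deriving the levels from each transaction date. This is a sketch that assumes ISO week numbering, which may differ from the week numbering your tool uses.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def date_levels(d):
    """Break a date into the levels used for drilling down."""
    _, iso_week, _ = d.isocalendar()
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": DAY_NAMES[d.weekday()],
        "week_number": iso_week,  # ISO week; an assumption, tools may differ
    }


levels = date_levels(date(2009, 4, 18))
```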

From this first drill down I can see that I had a huge spike in chargeback activity in April that I probably want to take a look at. For my next drill down I am going to look at all chargebacks for April 2009 by Country:

From this visualization you can see that there are a lot of countries in my data. As I am looking for the main countries responsible for my chargebacks I am going to filter the results by looking at the top 10 countries:

Now that I have filtered my results, it's easy to see which countries I need to focus on. I am going to take the largest chargeback origination country, U.S., and drill down to get specific details on where these chargebacks are coming from. My next drill down is going to be by State:

From here I am going to sort the States from highest to lowest to determine which state had the highest amount of chargeback activity.

I can see that California had the highest number of chargebacks in April 2009. It is important to compare the total number of chargebacks by state to the total number of overall transactions. One of the reasons that California might have the highest number of chargebacks is because they also had the highest number of overall transactions so I want to establish a ratio of chargebacks to transactions to confirm any issues that might exist.
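The ratio check is simple arithmetic, sketched here with made-up counts. The point of the example is that the ranking by raw chargeback count and the ranking by ratio do not have to agree.

```python
def chargeback_ratios(chargebacks, transactions):
    """Chargebacks divided by total transactions, per state."""
    return {state: chargebacks.get(state, 0) / total
            for state, total in transactions.items() if total}


# Hypothetical counts for one month.
chargebacks = {"CA": 900, "TX": 800, "NY": 300, "FL": 250}
transactions = {"CA": 15000, "TX": 80000, "NY": 20000, "FL": 5000}

ratios = chargeback_ratios(chargebacks, transactions)
worst_by_ratio = max(ratios, key=ratios.get)
```

In this sample Texas has the second highest count but the lowest ratio, while California leads on both measures, which is the kind of confirmation I am looking for before drilling further.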

For this example we are going to assume that the ratio of chargebacks to transactions in California is out of whack and verifies that there is a problem. Now I am going to back up to my original cube with all transactions and begin drilling down into the data with the goal of isolating California's problem. In Data Miner this is very easy as the program keeps a history of each drill down, so all I have to do is click on the original cube in the history to return to it. For those using Visual Studio, this will involve beginning a new query altogether.

I return to my original cube and drill down by Country and then by State, selecting California. Now I am going to drill down on California by Month. This was not possible from the point I was at in my last series of drill downs, as I was only visualizing California for April 2009. If I want to view chargeback activity for California across all months I need to follow the route I am on now.

From this visualization, I can see that California had a spike in chargeback activity for April 2009, confirming what I had found in my last cube series. For my analysis I have sorted in Data Miner by descending order to determine which month had the highest amount of activity. My next step is to isolate the region where chargebacks are occurring by drilling down into City:

One of the first things I am going to encounter is that all cities in California where I have ever had transactions are going to be included in my drill down, even if they didn't have chargebacks in April 2009. To remove these fields I need to filter on null values so that only cities with chargebacks in April 2009 are displayed. In Data Miner this is a two click process, but it can also be done through filtering in Access, Excel or Visual Studio.
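Outside of Data Miner, the null filter and the descending sort amount to something like the sketch below. The city counts are hypothetical, and empty cells are represented here as None or zero.

```python
# Hypothetical drill down result: chargebacks per city for one month.
chargebacks_by_city = {
    "Los Angeles": 412,
    "San Diego": 57,
    "Fresno": None,   # city has transactions on file but no chargebacks
    "Eureka": 0,
}


def with_activity(counts):
    """Drop cities whose count is null or zero."""
    return {city: n for city, n in counts.items() if n}


ranked = sorted(with_activity(chargebacks_by_city).items(),
                key=lambda item: item[1], reverse=True)
```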

Now I can see that Los Angeles had the highest concentration of chargebacks for April 2009. The next few steps drill down to isolate the issues that exist in Los Angeles that may be leading to my fraud problem there:

Visualizing The Statistical Analysis Set

I have drilled down by payment type and by day of the week. Now that I have established which card and which day of the week is the most problematic, I am going to switch to visual analysis to see if I can find out who is committing the fraud. As Data Miner is integrated with iBase, my database of chargebacks, and Analyst Notebook, I can send all the records from Saturday to my link analysis with one right click. If I were using Access or Excel to datamine, I would simply write a new query into my database around all transactions with a transaction date of April 2009, a payment instrument of "Visa" that occurred in Los Angeles, California and import that data into my visualization program. Let's see what we find:

Now that I have pulled all of the records into Analyst Notebook for visualization, I can expand on these records to include all fields in my database, including those I did not use in my data cube. This will allow me to perform tactical analysis on the fraud chargebacks to determine the source of my fraud problem:

Now that I have expanded on all of my fraud chargeback entities, I have brought in the card information, the card holder information, the IP addresses and the credit card transaction history for each of the individuals.

I am going to focus on the largest cluster of interrelated activity in my visualization. By placing that cluster into a new chart and looking at the transactions in a hierarchy view, I can see that numerous chargebacks are all associated with a store where the cards were last used. Additionally, all of the cards were swiped by the same associate at the store within a three day period of my chargebacks occurring.

There is no pattern in the IP addresses or in the individual card holders. The only central association comes from the charge history of the individuals, which points to a specific merchant and a specific operator at the store as the most likely source of my fraud chargebacks.
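The same "central association" can be checked outside the chart by counting how many chargeback cards share each store and associate in their charge history. This is only a sketch: the card names, stores, associates and the function `common_points_of_use` are hypothetical, and the three day window is not modeled.

```python
from collections import Counter

# Hypothetical charge history: card -> (store, associate) pairs seen before the chargeback.
history = {
    "card_A": [("Store 12", "assoc_7"), ("Store 3", "assoc_1")],
    "card_B": [("Store 12", "assoc_7")],
    "card_C": [("Store 12", "assoc_7"), ("Store 9", "assoc_4")],
    "card_D": [("Store 5", "assoc_2")],
}


def common_points_of_use(history):
    """Rank (store, associate) pairs by the number of distinct cards that used them."""
    counts = Counter()
    for swipes in history.values():
        counts.update(set(swipes))  # count each card once per pair
    return counts.most_common()


ranked = common_points_of_use(history)
top_pair, top_count = ranked[0]
print(top_pair, top_count)
```

A pair shared by most of the chargeback cards is the tabular equivalent of the hub I see in the hierarchy view.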

In Conclusion

In this example we have successfully leveraged statistical analysis to identify a specific set of issues within a large data source for visual analysis. Performing link analysis on the entire group of 220,000 records would have been impossible, but by drilling down into this mass of data to discover anomalies in activity, we have successfully identified a fraud issue across years of chargeback data.

Link and association analysis in a visual form is a powerful tool. However, by integrating visual analysis with other forms of data mining, we are able to perform analysis across a much larger set of data than would be possible by query and visualization alone.

Integrating Visual Analysis and Fraud/Risk Modeling

For most of us, visualizing everything contained in a database for analysis is simply not an option. Even if you have the most powerful analytical software in the world, the human brain can only wrap itself around so much data. Additionally, under most circumstances, the data that you are most interested in analyzing consists of 10% of the entities and transactions contained in your data.

To illustrate this, let's assume that you have a database of credit card transactions from your medium sized eCommerce business. You receive 100,000 credit card transactions daily and you are tasked with proactively identifying the fraudulent ones. In order to establish patterns in fraudulent transactions you need a history of transactions across your data, so we are not talking about visualizing one day's worth of transactions, we are talking about several months' worth. Visualizing just three months of credit card transaction data would mean 9 million records. Unless you are Rain Man and can count the number of toothpicks that fall out of a box in 2 seconds, don't try visualizing 9 million records. Even if the software could handle that many entities and links, it would create a staggering ball of twine that would be very pretty to look at but impossible to dissect for the patterns contained in it.

This is where the partnering of statistical modeling and visual analysis comes into play. There are several parts to statistical analysis and modeling that we will cover later. What we are going to focus on in this article is creating an analytical subset, or fraud model, to pull out the small percentage of transactions which have the highest potential of fraud.

The partnership between risk or fraud modeling and visual analysis is symbiotic. Fraud models are important tools for extracting or scoring vast transaction pools, but they do not learn new threats, only the old ones they are designed to protect against. Visual analysis is important for detecting relationships and patterns across data, but the software and the operator can only process so many entities at a time.

To optimize both, we partner visual analysis with risk and fraud modeling to offset the deficiencies in each. As an analyst I rely on my fraud model to pull out of the big block of 9 million the transactions that have a higher probability of fraud. Then, by leveraging visual analysis, I examine these records for patterns or clusters of fraud activity by drilling into the results from the fraud model and expanding on those entities by pulling in related records throughout the entire data set. Through visual analysis of the data returned by the risk model and the related data, I can not only verify fraud detected by the model but also identify new patterns of fraud in the related data which the model may miss. That information is then fed back into the modeling rules to identify that activity in the future; hence my fraud model learns from my visual analysis.

Hopefully I haven't lost everyone at this point so let me describe how I develop a "learning fraud model" by incorporating risk modeling and visual analysis. First a quick analogy of this scenario:

A customs officer at the airport is on the lookout for people who might be smuggling drugs. His observations are based on profiling the behavior of people at the airport against a series of attributes that drug smugglers display. He is standing at the security checkpoint and notices two people. The first is a man who is visibly nervous and sweating. The next is a well dressed elderly woman laughing and joking with a small child who is standing behind her.

The customs officer pulls the man out of line for further screening allowing the elderly woman through the checkpoint. The man, as it turns out, is nervous about flying because he was on a plane that crashed in the Hudson a couple of months ago and also has the flu. The elderly woman is carrying enough cocaine to kill a team of elephants.

Risk modeling works the same way as the customs officer. The model has a series of attributes that it looks for in transactions and pulls those transactions out of line for further examination. Just like the customs officer, the risk model does not look at transactions three dimensionally and cannot tell that the grandmother's address is linked to a person who was arrested for smuggling drugs at another airport three months ago. If it could, the risk model and the customs officer would have pulled granny aside along with the sweating guy. Just like the analyst tasked with looking at 300,000 transactions to find fraud, the customs officer can't be expected to look at 300,000 people passing through his line every day and know which person is the smuggler.

Ultimately the activity that poses the greatest threat is the activity which hasn't been seen before. Count on the fact that if you know what attributes your fraud model looks at in your transaction flow, the bad people know it too, and some of the transactions your fraud model is stopping today are actually probes to learn what will get through your model and what won't.

The weakness of fraud schemes is that no matter how hard you try, you can't make fraudulent transactions look exactly like good ones. Eventually the activity you are analyzing will share attributes and behaviors that, through visual analysis, make the well dressed old lady look more and more like the nervous sweating guy, and the same rule that got the sweating guy pulled aside will be incorporated for the elderly lady.

Building a "Learning" Fraud Model For Visual Analysis

For the purpose of this example we are going to assume that our eCommerce operation has been in business for a couple of months and, through chargebacks and complaints, we have identified transactions which are fraudulent and incorporated those into our standard fraud model.

I am going to import those records into my visual analysis software and start the process of expanding on those entities, leveraging every transaction data point in my database to identify undiscovered fraudulent transactions from identified ones.

Each expansion takes me one more level into all my transactional data. What I am looking for are clustered transactions that share attributes with fraudulent transactions and that, from a visual analysis standpoint, do not behave like legitimate transactions.

Here I have located a cluster of interrelated transactions which my fraud model has scored very low, but which is linked through two or more levels of relationships to a chargeback transaction. The cluster tells me two things: first, if these transactions were all from unrelated individuals they shouldn't be clustered, and second, at some point all of these transactions are linked to a fraud chargeback.
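The level-by-level expansion can be sketched as a breadth-first walk over a link chart. This is an illustration only, not how any particular visualization package works internally. The adjacency list `links`, the entity names and the function `expand` are hypothetical, and I am assuming that one "level" equals one link between entities.

```python
from collections import deque

# Hypothetical link chart: each entity maps to the entities it is directly linked to.
links = {
    "txn_chargeback": ["card_1"],
    "card_1": ["txn_chargeback", "addr_9"],
    "addr_9": ["card_1", "txn_101", "txn_102"],
    "txn_101": ["addr_9", "ip_5"],
    "txn_102": ["addr_9"],
    "ip_5": ["txn_101", "txn_103"],
    "txn_103": ["ip_5"],
    "txn_200": ["card_8"],
    "card_8": ["txn_200"],
}


def expand(links, seeds, max_levels):
    """Breadth-first expansion; returns {entity: levels from the nearest seed}."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_levels:
            continue  # do not expand past the requested depth
        for neighbor in links.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


reached = expand(links, ["txn_chargeback"], max_levels=4)
# Transactions linked to the known chargeback through two or more levels.
linked_txns = sorted(
    name for name, level in reached.items()
    if name.startswith("txn_") and level >= 2
)
print(linked_txns)
```

Transactions that never connect to a known chargeback, such as `txn_200` here, stay out of the result, which keeps the chart small enough to analyze.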

I am going to focus on this cluster of activity to determine what is occurring which joins them together and through analysis, determine if these transactions are indeed fraudulent. Once we have determined that these transactions are fraudulent there are two things I need to do:

1. Add these transactions to my analytical data set of all identified fraudulent transactions. I refer to this data set as my "scum pond". It consists of every identified fraudulent transaction and all their associated attributes, and is what I use to determine the visual analysis footprint of the associated fraud scenarios to compare across transactional data. It differs from the fraud model in that it has no stored procedures and is not used for the scoring of transactions; rather, it is an analytical model my visualization software can refer to, allowing for visual comparisons.

2. I need to determine if these newly discovered transactions share any common attributes which can be incorporated into my "learning" fraud model. Through my visualization I can determine that all of the newly discovered fraud transactions have a mismatch in the IP state and the account state, are over $1,000 and all use the @hotmail.com email domain. I am going to incorporate all of these attributes into my fraud scoring to trigger a review of transactions which share this pattern.
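As a rough sketch of what step 2 could look like in code, the three shared attributes can be combined into a single review rule. The `Transaction` fields and the function `matches_new_pattern` are hypothetical names chosen for illustration; a production fraud model would more likely weight each attribute into a score than apply a hard rule.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Field names are assumptions for this sketch, not a real schema.
    amount: float
    ip_state: str
    account_state: str
    email: str


def matches_new_pattern(txn):
    """True when a transaction shares all three attributes found in the cluster."""
    state_mismatch = txn.ip_state != txn.account_state
    large_amount = txn.amount > 1000
    hotmail_domain = txn.email.lower().endswith("@hotmail.com")
    return state_mismatch and large_amount and hotmail_domain


suspicious = Transaction(1250.00, "NY", "CA", "buyer@hotmail.com")
ordinary = Transaction(1250.00, "CA", "CA", "buyer@hotmail.com")
print(matches_new_pattern(suspicious), matches_new_pattern(ordinary))
```

Any transaction that matches is sent to review rather than declined outright, since the rule is built from a single cluster and will produce false positives.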

This process gets repeated daily. The fraud model depends on the visual analysis as much as the analyst depends on the fraud model. We have stopped this activity for now, but like all fraud and flu viruses, it will mutate over time. The fraud model will stop detecting the activity, and the analyst will have to discover the change and incorporate it back into the modeling.