For most of us, visualizing everything contained in a database for analysis is simply not an option. Even with the most powerful analytical software in the world, the human brain can only wrap itself around so much data. Additionally, under most circumstances, the data you are most interested in analyzing makes up only a small fraction, perhaps 10%, of the entities and transactions contained in your data.
To illustrate this, let's assume you have a database of credit card transactions from your medium-sized e-commerce business. You receive 100,000 credit card transactions daily, and you are tasked with proactively identifying the fraudulent ones. To establish patterns in fraudulent transactions you need a history of transactions across your data, so we are not talking about visualizing one day's worth of transactions; we are talking about several months' worth. Visualizing just three months of credit card transaction data means 9 million records. Unless you are Rain Man and can count the number of toothpicks that fall out of a box in two seconds, don't try visualizing 9 million records. Even if the software could handle that many entities and links, it would create a staggering ball of twine: very pretty to look at, but impossible to dissect for the patterns contained in it.
This is where the partnering of statistical modeling and visual analysis comes into play. There are several parts to statistical analysis and modeling that we will cover later; what we are going to focus on in this article is creating an analytical subset, or fraud model, to leverage in pulling out the small percentage of transactions with the highest potential for fraud.
The partnership between risk or fraud modeling and visual analysis is symbiotic. Fraud models are important tools for scoring and extracting records from vast transaction pools, but they do not learn new threats, only the old ones they were designed to protect against. Visual analysis is important for detecting relationships and patterns across data, but the software, and the operator, can only process so many entities at a time.
To optimize both, we partner visual analysis with risk and fraud modeling to offset the deficiencies in each. As an analyst, I rely on my fraud model to pull out of the big block of 9 million the transactions that have a higher probability of fraud. Then, leveraging visual analysis, I examine these records for patterns or clusters of fraud activity by drilling into the results from the fraud model and expanding on those entities, pulling in related records from across the entire data set. Through visual analysis of the data returned by the risk model, and by incorporating related data, I can not only verify fraud detected by the model but also identify new patterns of fraud in the related data which the model may miss. That information is then fed back into the modeling rules to identify that activity in the future; hence my fraud model learns from my visual analysis.
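The two-step workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the field names (`risk_score`, `card_id`, `ip_address`, `email`) and the threshold are assumptions invented for the example, not taken from any actual schema or product.

```python
# Hypothetical sketch of the model-then-expand workflow.
# Field names and the 0.8 threshold are illustrative assumptions.

def high_risk(transactions, threshold=0.8):
    """Step 1: let the fraud model pull out the high-scoring records."""
    return [t for t in transactions if t["risk_score"] >= threshold]

def expand(seeds, all_transactions, keys=("card_id", "ip_address", "email")):
    """Step 2: pull in every record sharing an attribute with a seed record."""
    seed_values = {(k, t[k]) for t in seeds for k in keys}
    return [t for t in all_transactions
            if any((k, t[k]) in seed_values for k in keys)]

transactions = [
    {"id": 1, "risk_score": 0.95, "card_id": "A", "ip_address": "1.1.1.1", "email": "x@hotmail.com"},
    {"id": 2, "risk_score": 0.10, "card_id": "A", "ip_address": "2.2.2.2", "email": "y@gmail.com"},
    {"id": 3, "risk_score": 0.05, "card_id": "B", "ip_address": "3.3.3.3", "email": "z@gmail.com"},
]

seeds = high_risk(transactions)        # the model flags transaction 1
related = expand(seeds, transactions)  # visual expansion also sweeps in 2 (shared card)
print([t["id"] for t in related])      # [1, 2]
```

Note that transaction 2 scored very low on its own; it only surfaces because it shares a card with a high-risk record, which is exactly the point of the expansion step.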
Hopefully I haven't lost everyone at this point, so let me describe how I develop a "learning fraud model" by incorporating risk modeling and visual analysis. First, a quick analogy:
A customs officer at the airport is on the lookout for people who might be smuggling drugs. His observations are based on profiling the behavior of people at the airport, using a series of attributes that drug smugglers display. He is standing at the security checkpoint and notices two people: the first is a man who is visibly nervous and sweating. The next is a well-dressed elderly woman laughing and joking with a small child standing behind her.
The customs officer pulls the man out of line for further screening, allowing the elderly woman through the checkpoint. The man, as it turns out, is nervous about flying because he was on a plane that crashed in the Hudson a couple of months ago, and he also has the flu. The elderly woman is carrying enough cocaine to kill a team of elephants.
Risk modeling works the same way as the customs officer. The model has a series of attributes it looks for in transactions, and it pulls those transactions out of line for further examination. Just like the customs officer, the risk model does not look at transactions three-dimensionally and cannot tell that the grandmother's address is linked to a person who was arrested for smuggling drugs at another airport three months ago; if it could, the risk model and the customs officer would have pulled granny aside along with the sweating guy. And just like the analyst tasked with looking at 300,000 transactions to find fraud, the customs officer can't be expected to look at 300,000 people passing through his line every day and know which one is the smuggler.
Ultimately, the activity that poses the greatest threat is the activity that hasn't been seen before. Count on the fact that if you know what attributes your fraud model looks for in your transaction flow, the bad guys know it too; in fact, some of the transactions your fraud model is stopping today are probes to learn what will get through your model and what won't.
The weakness of fraud schemes is that, no matter how hard you try, you can't make fraudulent transactions look exactly like good ones. Eventually the activity you are analyzing will share attributes and behaviors that, through visual analysis, make the well-dressed old lady look more and more like the nervous, sweating guy, and the same rule that got the sweating guy pulled aside will be incorporated for the elderly lady.
Building a "Learning" Fraud Model For Visual Analysis
For the purpose of this example, we are going to assume that our e-commerce operation has been in business for a couple of months and, through chargebacks and complaints, we have identified transactions that are fraudulent and incorporated those into our standard fraud model.
I am going to import those records into my visual analysis software and start the process of expanding on those entities, leveraging every transaction data point in my database to work outward from identified fraudulent transactions toward undiscovered ones.
Each expansion takes me one more level into all my transactional data. What I am looking for are clustered transactions that share attributes with fraudulent transactions and, from a visual analysis standpoint, do not share the same behavior as legitimate transactions.
Here I have located a cluster of interrelated transactions which my fraud model has scored very low, but which is linked through two or more levels of relationships to a chargeback transaction. The cluster tells me two things: first, if these transactions were all from unrelated individuals, they shouldn't be clustered; and second, at some point all of these transactions link back to a fraud chargeback.
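The "linked through two or more levels of relationships" idea can be sketched as a breadth-first walk over shared attributes: treat any shared attribute (same card, same address) as a link, and walk outward from known chargebacks up to a fixed number of hops. This is a toy sketch under assumed field names (`card_id`, `address`), not the actual mechanics of any visualization product.

```python
from collections import deque, defaultdict

# Hypothetical sketch: shared attributes act as edges, and we walk
# outward from known chargeback transactions up to max_hops levels.

def linked_within(transactions, chargeback_ids, max_hops=2,
                  keys=("card_id", "address")):
    # Index each attribute value to the set of transaction ids sharing it.
    index = defaultdict(set)
    by_id = {t["id"]: t for t in transactions}
    for t in transactions:
        for k in keys:
            index[(k, t[k])].add(t["id"])

    seen = set(chargeback_ids)
    frontier = deque((cid, 0) for cid in chargeback_ids)
    while frontier:
        tid, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for k in keys:
            for nid in index[(k, by_id[tid][k])]:
                if nid not in seen:
                    seen.add(nid)
                    frontier.append((nid, hops + 1))
    return seen - set(chargeback_ids)

txns = [
    {"id": 1, "card_id": "A", "address": "10 Elm St"},  # known chargeback
    {"id": 2, "card_id": "A", "address": "9 Oak Ave"},  # shares card with 1
    {"id": 3, "card_id": "B", "address": "9 Oak Ave"},  # shares address with 2
    {"id": 4, "card_id": "C", "address": "1 Pine Rd"},  # unrelated
]
print(sorted(linked_within(txns, {1})))   # [2, 3]
```

Transaction 3 never touches the chargeback directly; it surfaces only through the second-level link, which is the kind of cluster the prose above describes.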
I am going to focus on this cluster of activity to determine what joins these transactions together and, through analysis, determine whether they are indeed fraudulent. Once we have determined that these transactions are fraudulent, there are two things I need to do:
1. Add these transactions to my analytical data set of all identified fraudulent transactions. I refer to this data set as my "scum pond"; it consists of every identified fraudulent transaction and all their associated attributes, and it is what I use to determine the visual analysis footprint of the associated fraud scenarios to compare across transactional data. It differs from the fraud model in that it has no stored procedures and is not used for scoring transactions; rather, it is an analytical model my visualization software can refer to, allowing for visual comparisons.
2. Determine whether these newly discovered transactions share any common attributes that can be incorporated into my "learning" fraud model. Through my visualization I can determine that all of the newly discovered fraud transactions have a mismatch between the IP state and the account state, are over $1,000, and all use the @hotmail.com email domain. I am going to incorporate all of these attributes into my fraud scoring to trigger a review of transactions that share this pattern.
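The second step, turning the discovered pattern into a scoring rule, might look like the following sketch. The rule itself (IP state vs. account state mismatch, over $1,000, @hotmail.com) comes straight from the example above; the function shape and field names are illustrative assumptions, since real fraud scoring would live inside the model's rules engine, not a standalone function.

```python
# Hedged sketch of the newly learned review rule. Field names are
# illustrative; the three conditions are the ones found in the cluster.

def matches_learned_pattern(txn):
    """Return True when a transaction matches the newly discovered pattern."""
    ip_state_mismatch = txn["ip_state"] != txn["account_state"]
    large_amount = txn["amount"] > 1000
    hotmail = txn["email"].lower().endswith("@hotmail.com")
    return ip_state_mismatch and large_amount and hotmail

flagged = matches_learned_pattern(
    {"ip_state": "NV", "account_state": "FL",
     "amount": 1250.00, "email": "buyer@hotmail.com"})
print(flagged)   # True
```

In practice you would attach a score contribution to the rule rather than a hard boolean, so the transaction is routed for review instead of being auto-declined.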
This process gets repeated daily; the fraud model depends on the visual analysis as much as the analyst depends on the fraud model. We have stopped this activity for now, but like all fraud and flu viruses, it will mutate over time. The fraud model will stop detecting the activity, and the analyst will have to discover the change and incorporate it back into the model.