Welcome


Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information allows patterns of activity that run contrary to normal behavior to surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children first learn to read. As our instinct develops over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on throughout this site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Pushing Fraud Upstream is the Goal

The ultimate goal of any fraud program is to push detection of suspect activity as far upstream as possible.  In reality, what often happens is that companies become entrenched in reactive analytics at the transaction or loss level without figuring out how the threat made it through the door in the first place.  A large percentage of suspect or fraudulent activity can be detected at the time of entry or account creation, before a single transaction is made, and often much more easily than at the transaction level.



Transaction-level fraud detection is an essential component of any fraud prevention platform, and there is much at the transaction level that lends itself to robust protection, but for this layer to start working a fraudulent transaction has to take place.  If transaction-level detection is your company's first line of defense, it's the same as leaving your door unlocked for the burglar because you have a camera inside.  For the strongest protection from fraud, a layered approach that starts with the behavior of users the minute they enter the door gives you a first chance at profiling risk before any loss can occur.

Most fraud prevention is pattern detection: interrelationships between activities that should be random increase the likelihood of fraud.  The more interrelationships, the higher the risk of fraud, and the best way to target these interrelationships is to aim detection at the activities that generate the most velocity. Know your enemy as early as possible.

Locking the front door



With any luck, the vast majority of users on your site are legitimate.  They browse, set up an account and transact in very distinct patterns based on the user funnel you have established for them.  Since most companies build their site with a specific user behavior in mind, your usability engineers likely already have metrics on how users interact with the site as intended, and the data can be as granular as page-click and movement logging.

This is exactly where the first layer of a fraud platform's security should start: detecting when an account is created by someone with a high probability of committing fraud.  Organized fraud has several weaknesses that can be exploited.  First, fraudsters need a network that hides their identity and location.  Second, they need a financial network to execute transactions, pay for goods or extract funds (depending on your business) and to move that money to a safe harbor.  Third, because they are a business, they need to create many accounts in a short period of time to get a return on investment.

The reason for building a strong fraud prevention system at the account creation layer is to take advantage of that weakness in the organized fraud scheme.  Fraudsters may spread attributes and velocities effectively across transactions, but in order to commit the fraud in the first place they have to create multiple accounts on your site, and out of sheer necessity they must do it in a way that runs contrary to the behavior of legitimate users.

Start by examining velocity signatures on your site using attributes captured at the account creation stage.  You should be able to establish a baseline of legitimate account creation from attributes such as user agent, IP address, tracking cookie and device fingerprint. In my case, I was examining a series of organized fraud activities and started looking at user behavior on entry for indicators that could tie the activity to a signature of actions and velocities from attributes.  I found that even under the most extreme circumstances, a normal user would never create more than X accounts from the same user agent and IP in any given session, and any time I found a browser agent with an account creation velocity of more than X, tied to other behavioral red flags, there was a 95% chance that the account would engage in fraud.  In most cases the velocity was much higher than X; those were "low hanging fruit" detections, and working my way down to 3 still provided a reliable indication when coupled with several signature indicators and behaviors (unfortunately I can't tell everything; you never know who is reading this).
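The velocity check described above can be sketched as a simple counter over account-creation signatures. This is a minimal illustration, not the author's production system: the field names are assumptions, and the default threshold of 3 reflects the lower bound mentioned above.

```python
from collections import Counter

def flag_high_velocity_signups(signups, threshold=3):
    """Count account creations per (user agent, IP) signature and flag
    any signature whose velocity reaches the threshold.  Field names and
    the threshold are illustrative assumptions."""
    velocity = Counter((s["user_agent"], s["ip"]) for s in signups)
    return {sig: n for sig, n in velocity.items() if n >= threshold}

# Hypothetical session data: three creations from one signature.
signups = [
    {"user_agent": "UA-1", "ip": "10.0.0.5"},
    {"user_agent": "UA-1", "ip": "10.0.0.5"},
    {"user_agent": "UA-1", "ip": "10.0.0.5"},
    {"user_agent": "UA-2", "ip": "10.0.0.9"},
]
print(flag_high_velocity_signups(signups))  # {('UA-1', '10.0.0.5'): 3}
```

In practice the flagged signatures would feed into further scoring alongside the other behavioral red flags, rather than triggering action on their own.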

By looking at the signature of the entry and account-establishment behavior, I could link multiple organized fraud instances together to gain an understanding of the scope of the activity and the specific methods being used that differed from the way normal users interact with the site.

By looking at this activity through visual analytics, you can see the multiple layers of interrelated attributes that are created; this represents the activity of a single organized fraud ring creating accounts in a one-hour period.




Next, look at the sequence of activities when an account is created.  Normal users take a certain minimum mean time to establish a new account.  Normal users also don't create an account and disappear; if they take the time to create the account in the first place, they are going to browse and interact with your site for some amount of time.  When your intention is organized fraud, however, you have to create a hundred accounts in a short amount of time.  There are two things to look for: one, do you have users creating an account in a fraction of the time your legitimate users take, and two, do you have a number of accounts created within a short time frame where the page-click logs show the exact (key word) pattern of account creation over and over?  Even on a simple form, ten people will not fill it out in the same sequence, by the same method, without making a mistake.
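Both of those checks, creation speed and verbatim-repeated click sequences, can be sketched in a few lines. The field names and the 45-second baseline are illustrative assumptions, not values from the article:

```python
from collections import Counter

def suspicious_creation_patterns(sessions, min_seconds=45):
    """Flag sessions that complete account creation far faster than a
    baseline minimum, plus click sequences repeated verbatim across
    sessions.  min_seconds and field names are illustrative assumptions."""
    too_fast = [s["id"] for s in sessions if s["create_seconds"] < min_seconds]
    repeats = Counter(tuple(s["clicks"]) for s in sessions)
    identical = [list(seq) for seq, n in repeats.items() if n > 1]
    return too_fast, identical

# Hypothetical page-click logs: two sessions share an exact sequence.
sessions = [
    {"id": 1, "create_seconds": 12, "clicks": ["name", "email", "pw", "submit"]},
    {"id": 2, "create_seconds": 11, "clicks": ["name", "email", "pw", "submit"]},
    {"id": 3, "create_seconds": 90, "clicks": ["email", "name", "pw", "submit"]},
]
fast, identical = suspicious_creation_patterns(sessions)
print(fast)       # [1, 2]
print(identical)  # [['name', 'email', 'pw', 'submit']]
```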

When committing fraud at the transaction level, organized fraud can spread these transactions and attributes over a much wider set of data points, but at account creation these indicators are much more condensed, both from need and by design of the account creation process on most sites.

Another key indicator to look for at account creation is network attributes and the relationship between multiple accounts on certain domains and IPs.  Again, because of the nature of the account creation process, these network indicators become more condensed and dynamic when examined through visual analysis.  Do you have abnormal user agent attributes jumping across multiple high-risk domains and IPs within a short time span?  For example, a user agent creates 10 accounts on one IP, then jumps to another IP and creates another 10.  In examining my own data, I never found one legitimate case where a user would create legitimate accounts or transactions while jumping across network attributes.  The key is understanding what the clear majority of legitimate users do in order to detect the much smaller percentage that isn't.
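The IP-hopping pattern just described can be detected by grouping account creations per user agent and counting how many IPs carry a heavy burst. A rough sketch, with illustrative thresholds and field names:

```python
from collections import Counter, defaultdict

def agents_jumping_ips(signups, min_ips=2, min_accounts_per_ip=5):
    """Find user agents that create a burst of accounts on one IP, then
    hop to another IP and repeat.  Thresholds are illustrative."""
    per_agent = defaultdict(Counter)
    for s in signups:
        per_agent[s["user_agent"]][s["ip"]] += 1
    flagged = []
    for ua, ip_counts in per_agent.items():
        heavy_ips = [ip for ip, n in ip_counts.items() if n >= min_accounts_per_ip]
        if len(heavy_ips) >= min_ips:
            flagged.append(ua)
    return flagged

# Hypothetical data: one agent creates 10 accounts on each of two IPs.
signups = (
    [{"user_agent": "UA-odd", "ip": "203.0.113.1"}] * 10
    + [{"user_agent": "UA-odd", "ip": "198.51.100.7"}] * 10
    + [{"user_agent": "UA-normal", "ip": "192.0.2.44"}]
)
print(agents_jumping_ips(signups))  # ['UA-odd']
```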



If you find a case where the network attributes create a pattern like the one below, that is a high-risk cluster of interrelationships which shouldn't exist (left).




It is an impulse to always throw fraud detection at the threat "you can see", at the transactional level.  Fraud detection and risk scoring at the transaction level has a number of challenges: the activity is much more dispersed, so establishing a pattern is harder; this is not true at account creation or site interaction.  Transaction-level fraud behavior is also much closer to normal user behavior.  There is a difference, but detecting and measuring it is much more complex, because by the time a fraudster makes it to the transaction flow they are likely doing the same things normal users do, and their pattern of activity is much closer to normal.  Again, this is not true at account creation.

By coupling behavior, risk attributes, network forensics and pattern analysis to create a high-risk signature, and running it against account creation or even earlier in the entry phase, your fraud system can begin risk scoring and mitigation before the first transaction is ever attempted.  I have worked on fraud prevention with many different types of companies, and regardless of whether it's social media, ecommerce or FinTech fraud, the account velocity rule has held true in targeting organized fraud across all of these business types.


While watching the news as I wrote this article, I heard that Facebook is hiring 3,000 people to moderate video in an effort to remove offensive or dangerous content from the site faster, and I found myself thinking: could the people who create this type of content be cohorted and risk-assessed based on the way they create and interact with their accounts?

Solving The Fake News Problem Through Analysis


The problem of fake news is at the forefront of the news, yet some practical solutions for conquering the issue already exist in the intelligence and fraud analytics community.

I have been engaged for the past five years in addressing social media and rating fraud.  The challenge with such detection is the lack of information collected about the individuals submitting the content, information that other industries collect and leverage for risk assessment.  The primary solution we have utilized and adapted to social media depends heavily on two things: creating a "digital identification" of the user, and applying reputation scoring to that user to determine their reliability and trustworthiness.

In every industry one facet is always true: the greatest risk to fraud or trust and safety is the first-time user's first transaction.  This is the point where the least is known about the transacting parties and where creative, effective attribute, behavior and risk modeling is essential.  The more you know about that first-time user or that first transaction from a historical, behavioral standpoint, the more effective you will be at proactively detecting risk in these transactions.

Risk scoring and detection of fake news:

Just as in social media fraud, there are two items to be analyzed: the person submitting the news, and the news itself.  These are the two items we can examine and score to determine the reliability of the information being submitted, by leveraging fraud and intelligence analytic principles already in place.

Why do people watch a specific network and anchor for news?  Certainly persona and style play a part, but the main reason people watch news is to learn, and the most important attribute for learning is the reliability and reputation of the individual dispensing the news.  In other words: when this person speaks, can I trust that what he is saying is accurate?

Individuals who report the news gain reputation from a body of work that is reliable and accurate over time; this is behavior.  If I have never seen this reporter before, how do I know what he is saying is true?  Because I can research and aggregate past information sourced to this reporter, and determine whether others have successfully challenged his conclusions.

Regardless of how much I trust a reporter, if he came on the air and said that aliens from Neptune had arrived in Times Square and kidnapped four tourists, I would be suspicious.  Before I believe this story, I will want to know how many other people have this information, what the source of that information is, and how reliable the sources used to produce the story are.

This is the same thing an intelligence analyst does when he receives information from a source.  Within the intelligence community there is a methodology for rating the reliability of information (or risk, in this case) called the Source and Reliability Matrix.  It scores both the reliability of the source of the information and the information itself.  The source is given a grade between A and F, and the information a grade between 1 and 6.  The two scores are combined to determine how seriously the information should be taken.

Source reliability ratings:

A - Reliable: No doubt of authenticity, trustworthiness or competency; has a history of complete reliability
B - Usually reliable: Minor doubt about authenticity, trustworthiness or competency; has a history of valid information most of the time
C - Fairly reliable: Doubt of authenticity, trustworthiness or competency, but has provided valid information in the past
D - Not usually reliable: Significant doubt about authenticity, trustworthiness or competency, but has provided valid information in the past
E - Unreliable: Lacking in authenticity, trustworthiness and competency; history of invalid information
F - Cannot be judged: No basis exists

Information content ratings:

1 - Confirmed: Confirmed by other independent sources; logical in itself; consistent with other information on the subject
2 - Probably true: Not confirmed; logical in itself; consistent with other information on the subject
3 - Possibly true: Not confirmed; reasonably logical in itself; agrees with some other information on the subject
4 - Doubtfully true: Not confirmed; possible but not logical; no other information on the subject
5 - Improbable: Not confirmed; not logical in itself; contradicted by other information on the subject
6 - Cannot be judged: No basis exists
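One way to turn the two-part grade into a single number is to assign weights to each code and sum them. The matrix itself defines only the letter and number codes; the numeric weights and the sum below are my own illustrative assumptions, not part of the standard:

```python
# Illustrative numeric weights for the two grading scales (assumed, not
# part of the Source and Reliability Matrix itself).
SOURCE_WEIGHT = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 0}
INFO_WEIGHT = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1, 6: 0}

def reliability_score(source_code, info_code):
    """Combine a source grade (A-F) and an information grade (1-6)
    into a single 0-10 score; higher means more trustworthy."""
    return SOURCE_WEIGHT[source_code] + INFO_WEIGHT[info_code]

print(reliability_score("A", 1))  # 10: reliable source, confirmed info
print(reliability_score("F", 5))  # 1: unknown source, improbable info
```

A real system would likely treat the two dimensions separately rather than flattening them, since "F" means "cannot be judged" rather than "untrustworthy", but the sketch shows the aggregation idea.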


In social media fraud, we utilize a similar process for determining the trustworthiness of reviews submitted on properties, based on analysis of the user's behavior and the reputation and history of the property being reviewed.  These two scores are combined to determine a base reputation for the user and the location, and how much more data we want to take into consideration to create a risk score.  For a new user on a risky property, we would want to look at a wide range of attributes captured during the user's interaction with the site to make a better determination, while for a known user on a low-risk property we can likely already make the call based on the information we have on hand.

This is the same principle used today by individuals transacting on shared-economy sites like Airbnb, Thumbtack and Uber.  A person will have more trust in the individual they are "hiring" based on the reputation that individual has gained through past transactions, and the trust they have in the site itself to represent that information accurately. Apply that to news: as a user, I am going to have more trust in a reporter who has established a good history, through the number of stories that person has generated and the sentiment of a community I also trust that references the reporter's information.  This is not to be confused with the number of "people" who simply like what a reporter wrote; that is not an indication of the reporter's reliability but rather of how many people agreed with what the reporter said.

To determine a reliability or risk score for an individual news source, we have to be careful that we aren't simply ignoring a source because they are unknown; lack of popularity is not in itself an indication of untrustworthiness.  To accomplish this, we aggregate the reputation of the individual news source with the reputation of others who are associated with that source.

For example, you are watching a story on 60 Minutes by a correspondent you have never seen before.  Right now that correspondent is an F on the source scale, because we know nothing about the individual's history or reputation.  In other words, if that same person were not on 60 Minutes but instead submitting a news article on Facebook, I wouldn't trust it because I don't know the source at all.  However, there are elements associated with this correspondent that we do know, which can factor into his score.

We know 60 Minutes as a show has a lengthy history and reputation for accuracy and would likely score an A on the source scale.  We know the producer of the segment also has a verifiable history of credibility and would score an A, and the anchor introducing the piece would score an A as well.  While we know nothing about the individual correspondent, the network of individuals he is associated with speaks directly to his reputation.

He is reporting a story about the crisis in Syria.  The information he is providing is backed by attributes we can associate with fact: for example, there is video of the events that is clear and recognizable, he is reporting on an issue that other credible sources have reported on in the past, and we can confirm the information he is providing through other sources.

So while this is a new source reporting information, his risk is low, because I can assign risk through his social network and assign risk to the information he is providing through other sources.  While I have never seen this person before, he would score a C to B on the source scale and a 1 on the information scale.
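The inheritance of reputation from associated sources can be sketched as a simple blend of grades. The weights, the equal averaging, and the mapping back to letters are all my own illustrative assumptions:

```python
# Illustrative numeric weights for the A-F source scale (assumed).
SOURCE_WEIGHT = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 0}

def inherited_source_grade(own_grade, associated_grades):
    """Blend an unknown correspondent's own grade with the grades of
    the show, producer and anchor they appear with.  Equal weighting
    is an illustrative assumption."""
    scores = [SOURCE_WEIGHT[own_grade]] + [SOURCE_WEIGHT[g] for g in associated_grades]
    avg = sum(scores) / len(scores)
    # Map the blended score back to the nearest letter grade.
    for grade, w in SOURCE_WEIGHT.items():
        if round(avg) == w:
            return grade

# Unknown correspondent (F) surrounded by three A-rated associates.
print(inherited_source_grade("F", ["A", "A", "A"]))  # B
```

Consistent with the example above, an unknown correspondent surrounded by A-rated associates lands in the B-C range rather than staying at F.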

Now, using an example of fake news seen on social media recently, let's apply risk scoring to the source and the information: Pizzagate in Washington, D.C.  This was a widely covered fake news event.

Start with the source of Pizzagate.  Beginning with a benign mention in the Clinton WikiLeaks release, the story was created on an online forum with no association to any news organization, nor created for the purpose of reporting news (F on the source scale).  The information itself referenced the WikiLeaks material, yet no review of the WikiLeaks data could support the claim that the pizza parlor was being used for human trafficking (5 on the information scale).  The information spread widely on social media and alternative media based on likes, tweets and other voting from unspecified and unconnected sources.

Without a historical or social-network basis to apply to the source of the story, the rating of the source remained an F throughout the entire propagation of the story through social media.  The information itself had no basis in fact from the source, had not been cited by any other reputable or scoreable source, and the assertion was improbable at best.

Fake News Analysis At Scale:

How do we apply these principles at the scale a large social media site would need to?  Much the same way a large ecommerce site assigns risk to thousands of transactions per minute: by determining a risk scale and then analyzing the data en masse.

I feel social media in all its forms didn't understand the impact of fake news and its effect on its users, as well as on its brand reputation as a whole.  Millions of people a day go on their favorite media site and proclaim that they are actually Superman in disguise, and that is to be expected; their audience knows them and realizes it's not news.  Intertwine, though, people who for whatever motivation want to spread fake news by manipulating the strengths of social media, and you create a new dynamic where the primary risk is to the integrity of the site itself.

In fraud analysis, particularly transactional fraud, relationships between a user's attributes and the transaction are considered a bad sign.  Cluster analysis is essential in fraud: the more clusters of interrelationships between people, their devices, their network attributes and a financial transaction, the higher the risk of the transaction.

In a bit of a shift, for news-related risk we are very interested in the interrelationships between the reporter of news and those in his community in order to establish the risk score.  This is particularly true for new users who don't have an established history.

The first analysis is to generate a base risk score for a new source of news based on their connections to reliable and trustworthy sources of news in their network.  Just as in the 60 Minutes example: is the person reporting alien abductions on Facebook connected to any source of news we would find reliable, or did they appear from nowhere, with primary social connections who are likely as crazy as they are?  In this case, the larger the clusters, the more attributes there are for accurate analysis.
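A base risk score built from network connections could look at the share of a new source's links that point to already-trusted sources. This is a hypothetical sketch; the linear scale and the data shapes are assumptions:

```python
def base_news_risk(user, connections, trusted_sources):
    """Base risk for a new news source: the fewer of their connections
    that are themselves trusted, the higher the risk (0.0 to 1.0).
    The linear scale and data shapes are illustrative assumptions."""
    links = connections.get(user, set())
    if not links:
        return 1.0  # no network signal at all: treat as maximum risk
    trusted_links = len(links & trusted_sources)
    return round(1.0 - trusted_links / len(links), 2)

# Hypothetical network: one new reporter with three connections.
connections = {"new_reporter": {"cbs_news", "producer_x", "random_friend"}}
trusted = {"cbs_news", "producer_x"}
print(base_news_risk("new_reporter", connections, trusted))  # 0.33
print(base_news_risk("unknown_user", connections, trusted))  # 1.0
```

As the article notes, a larger cluster of connections gives the score more attributes to work with; a source with no network at all simply cannot be judged, much like an F on the source scale.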




This will likely create the need for a separate reputational or risk score for news-related information.  By establishing this risk score across the community, the ability to filter, sort and prioritize news based on the source's and the information's reliability will become as intuitive as sort orders in search engines: most reliable source, most reliable information on top, the same way intelligence agencies parse millions of pieces of information every year.


The same should be applied to sites that are sources of news: sites that are reliable and trustworthy sources of information ranked above those that are not.  Ideally the scoring should be made available to the community, allowing users to decide how much credibility they will lend to the news. If you want a good exercise for this, examine the tech sites that discuss the next model of iPhone day after day.