Welcome


Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualization lets patterns of activity that run contrary to normal activity surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children started learning to read. As our instinct develops over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on in my site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Solving The Fake News Problem Through Analysis


The problem of fake news is at the forefront of the news, yet some practical solutions for conquering the issue already exist in the intelligence and fraud analytics community.

I have been engaged for the past five years in addressing social media and rating fraud. The challenge in such detection is the lack of information collected from the individuals submitting the content, information that other industries collect and leverage for risk assessment. The primary solution we have utilized and adapted to social media depends heavily on two things: creating a “digital identification” of the user and applying reputation scoring to that user to determine their reliability and trustworthiness.

In every industry one thing is always true: the greatest risk to fraud or trust and safety is the first-time user and the first transaction. This is the point where the least is known about the transacting parties and where creative and effective attribute, behavior and risk modeling is essential. The more you know about that first-time user or that first transaction from a historical behavioral standpoint, the more effective you will be at proactive detection of risk in these transactions.

Risk scoring and detection of fake news:

Just as in social media fraud, there are two items to be analyzed: the person submitting the news and the news itself. These are the two items that we can examine and score to determine the reliability of the information being submitted, by leveraging fraud and intelligence analytic principles already in place.

Why do people watch a specific network and anchor for news? Certainly persona and style play a part, but the main reason people watch news is to learn, so the most important attribute is the reliability and reputation of the individual dispensing the news. In other words: when this person speaks, can I trust that what he is saying is accurate?

Individuals who report on news gain reputation from a body of work that is reliable and accurate over time; this is behavior. If I have never seen this reporter before, how do I know what he is saying is true? I can research and aggregate past information that is sourced to this reporter, as well as determine whether others have successfully challenged his conclusions.

Regardless of how much I trust a reporter, if he came on the air and said that aliens from Neptune arrived in Times Square and kidnapped four tourists, I would be suspicious. Before I believe this story I will want to know how many other people have this information, what the source of that information is and how reliable the sources used to produce that story are.

This is the same thing that an intelligence analyst does when he receives information from a source. Within the intelligence community there is a methodology for rating the reliability of information (or risk in this case) called the Source and Reliability Matrix. It scores the reliability of the source of the information and of the information itself. The source is given a grade between A and F and the information is given a grade between 1 and 6. The two scores are combined to determine how seriously the information should be taken.

Code | Source rating | Explanation
A | Reliable | No doubt of authenticity, trustworthiness or competency; has a history of complete reliability
B | Usually reliable | Minor doubt about authenticity, trustworthiness or competency; has a history of valid information most of the time
C | Fairly reliable | Doubt of authenticity, trustworthiness or competency, but has provided valid information in the past
D | Not usually reliable | Significant doubt about authenticity, trustworthiness or competency but has provided valid information in the past
E | Unreliable | Lacking in authenticity, trustworthiness and competency; history of invalid information
F | Cannot be judged | No basis exists

Information content ratings

Code | Rating | Explanation
1 | Confirmed | Confirmed by other independent sources; logical in itself; consistent with other information on the subject
2 | Probably true | Not confirmed; logical in itself; consistent with other information on the subject
3 | Possibly true | Not confirmed; reasonably logical in itself; agrees with some other information on the subject
4 | Doubtfully true | Not confirmed; possible but not logical; no other information on the subject
5 | Improbable | Not confirmed; not logical in itself; contradicted by other information on the subject
6 | Cannot be judged | No basis exists
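
To make the two scales concrete, here is a minimal Python sketch of how a source grade and an information grade might be combined. The grade labels come from the tables above; the numeric weights, the neutral value used for "cannot be judged" and the three priority labels are my own assumptions for illustration, not part of the matrix itself.

```python
# Illustrative sketch only: the weights, thresholds and priority labels
# are assumptions, not part of the rating matrix described above.

SOURCE_GRADES = {
    "A": "Reliable",
    "B": "Usually reliable",
    "C": "Fairly reliable",
    "D": "Not usually reliable",
    "E": "Unreliable",
    "F": "Cannot be judged",
}

INFO_GRADES = {
    1: "Confirmed",
    2: "Probably true",
    3: "Possibly true",
    4: "Doubtfully true",
    5: "Improbable",
    6: "Cannot be judged",
}

NEUTRAL = 2  # assumed stand-in when one of the two grades cannot be judged


def combined_rating(source: str, info: int) -> str:
    """Return the two grades side by side, e.g. 'B2'."""
    if source not in SOURCE_GRADES or info not in INFO_GRADES:
        raise ValueError(f"unknown grade: {source!r}, {info!r}")
    return f"{source}{info}"


def priority(source: str, info: int) -> str:
    """Map a rating to a hypothetical handling priority (lower score is better)."""
    combined_rating(source, info)  # validates both grades
    if source == "F" and info == 6:
        return "cannot be judged"
    source_score = NEUTRAL if source == "F" else "ABCDE".index(source)
    info_score = NEUTRAL if info == 6 else info - 1
    score = source_score + info_score
    if score <= 2:
        return "high confidence"
    if score <= 5:
        return "use with caution"
    return "low confidence"


print(combined_rating("B", 2), priority("B", 2))
print(combined_rating("F", 5), priority("F", 5))
```

Treating an unknown source as neutral rather than bad is a deliberate choice in this sketch: an unjudged source is then rated mostly on the information it provides.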


In social media fraud, we utilize a similar process for determining the trustworthiness of reviews submitted on properties, based on analysis of the behavior of the user and the reputation and history of the property that is being reviewed. These two scores are combined to determine a base reputation of the user and the location, and how much more data we want to take into consideration to create a risk score. For a new user on a risky property, we would want to look at a wide range of attributes captured during the user’s interaction with the site to make a better determination, while for a known user on a low-risk property we can likely already make the determination based on the information we have on hand.
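
As a rough illustration of that decision, the sketch below picks how much extra data to pull before scoring a review. The `Review` fields and the attribute names are hypothetical placeholders, not fields from any real system.

```python
from dataclasses import dataclass


@dataclass
class Review:
    user_is_known: bool      # user has an established history on the site
    property_is_risky: bool  # property has a history of suspect reviews


def attributes_to_inspect(review: Review) -> list[str]:
    """Pick how much extra data to pull before scoring a review.

    All attribute names here are hypothetical placeholders.
    """
    base = ["review_text"]
    # Known user, low-risk property: the data on hand is enough.
    if review.user_is_known and not review.property_is_risky:
        return base
    extra = ["account_age", "device", "network"]
    # New user on a risky property: cast the widest net.
    if not review.user_is_known and review.property_is_risky:
        extra += ["session_behavior", "related_accounts"]
    return base + extra


print(attributes_to_inspect(Review(user_is_known=True, property_is_risky=False)))
print(attributes_to_inspect(Review(user_is_known=False, property_is_risky=True)))
```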

This is the same principle used today by individuals transacting on sharing economy sites like Airbnb, Thumbtack and Uber. A person will have more trust in the individual he is “hiring” based on the reputation that individual has gained through past transactions and the trust he has in the site itself to represent that information accurately. Apply that to news: as a user, I am going to have more trust in a reporter who has established a good history through the number of stories that person has generated and the sentiment of a community I also trust that references that reporter’s information. This is not to be confused with the number of “people” who simply like what a reporter wrote; that is not an indication of the reliability of that reporter but rather an indication of how many people agreed with what the reporter said.

To determine a reliability or risk score for a news source (individual), we have to be careful that we aren’t simply ignoring a news source because they are unknown; lack of popularity is no more an indication of trustworthiness than popularity is. To accomplish this we aggregate the reputation of the individual news source with the reputation of others that are associated with the news source.

For example, you are watching a story on 60 Minutes by a correspondent you have never seen before. Right now that correspondent is an F on the source scale because we know nothing about the individual’s history or reputation as an individual. In other words, if that same person were not on 60 Minutes but rather submitting a news article on Facebook, I wouldn’t trust it because I don’t know the source at all. However, there are elements associated with this correspondent that we do know, which can factor into his score.

We know 60 Minutes as a show has a lengthy history and reputation for accuracy and would likely score an A on the source scale. We know the producer for the segment also has a verifiable history of credibility and would score an A on the source scale, and the anchor introducing the piece also would score an A on the source scale. While we know nothing about the individual correspondent, the network of individuals he is associated with speaks directly to the individual’s reputation.

He is reporting on a story about the crisis in Syria.  The information he is providing is backed by attributes we can associate as being factual because, for example, there is video of the events that is clear and recognizable, he is reporting on an issue that other credible sources have reported on in the past and we can confirm the information he is providing through other sources.

So while this is a new news source reporting information, his risk is low because I can assign risk to his social network and assign risk to the information he is providing through other sources. While I have never seen this person before, he would score between a C and a B on the source scale and a 1 on the information scale.
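
One way to picture this borrowing of reputation is the sketch below. It assumes that an unknown source takes the average grade of its known associates, lowered by a penalty of one or two steps because borrowed reputation is weaker than earned reputation. The averaging rule, the `penalty` parameter and the function name are my assumptions, chosen so that three A-rated associates land the new correspondent between a B and a C.

```python
GRADES = "ABCDE"  # F ("cannot be judged") is handled separately


def grade_by_association(own_grade: str, associate_grades: list[str],
                         penalty: int = 1) -> str:
    """Estimate a source grade for someone with no track record.

    A source that already has a grade keeps it. An unknown source (F)
    borrows the average grade of its known associates, lowered by
    `penalty` steps. Both the averaging and the penalty are assumptions.
    """
    if own_grade != "F":
        return own_grade
    known = [GRADES.index(g) for g in associate_grades if g in GRADES]
    if not known:
        return "F"  # nothing to borrow from
    average = round(sum(known) / len(known))
    return GRADES[min(average + penalty, len(GRADES) - 1)]


# Show, producer and anchor all rate an A; the correspondent is unknown.
print(grade_by_association("F", ["A", "A", "A"]))             # one-step penalty
print(grade_by_association("F", ["A", "A", "A"], penalty=2))  # two-step penalty
# A forum poster with no graded associates stays an F.
print(grade_by_association("F", []))
```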

Now, using an example of fake news seen on social media recently, let’s apply risk scoring to the source and the information: Pizzagate in Washington, D.C. This was a widely covered media event of fake news.

Start with the source of Pizzagate. Beginning with a benign mention in the Clinton WikiLeaks release, the story was created on an online forum with no association to any news organization, nor was the forum created for the purpose of reporting news (F on the source scale). The information itself referenced the WikiLeaks information; however, no review of the WikiLeaks data could support the claim that the pizza parlor was being used for human trafficking (5 on the information scale). The information spread widely in social media and alternative media based on likes, tweets and other voting from unspecified and unconnected sources.

Without a historical or social network basis to apply to the source of the story, the rating of the source remained an F throughout the entire propagation of the news story through social media. The information itself had no basis in fact from the source, had not been cited by any other reputable or scoreable source, and the assertion was improbable at best.

Fake News Analysis At Scale:

How do we apply these principles at scale, as a large social media site would need to do to be effective? Much in the same way a large ecommerce site assigns risk to thousands of transactions per minute: by determining a risk scale and then analyzing the data en masse.

I feel social media in all of its forms didn’t understand the impact of fake news and its effect on its users as well as on its brand reputation as a whole. Millions of people a day go on their favorite media site and proclaim that they are actually Superman in disguise, and that is to be expected; their audience knows them and realizes it’s not news. Mixing in people who, for whatever motivation, want to spread fake news by manipulating the strengths of social media creates a new dynamic where the primary risk is to the integrity of the site itself.

In fraud analysis, particularly in transactional fraud, relationships between the user’s attributes and the transaction are considered bad. Cluster analysis is essential in fraud: the more clusters of interrelationships between people, their devices, their network attributes and a financial transaction, the higher the risk of the transaction.
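
A minimal sketch of this kind of cluster analysis, under the assumption that two transactions belong to the same cluster whenever they share any attribute value, directly or through a chain of other transactions. The transaction ids and attribute values are made up, and using cluster size as the risk signal is a simplification.

```python
from collections import defaultdict


def cluster_sizes(transactions: dict[str, set[str]]) -> dict[str, int]:
    """Map each transaction id to the size of the cluster it belongs to.

    Two transactions fall in the same cluster when they share any
    attribute value (device, IP address, card), directly or through a
    chain of other transactions.
    """
    # Index: attribute value -> transactions that carry it.
    by_attribute = defaultdict(set)
    for tx, attrs in transactions.items():
        for attr in attrs:
            by_attribute[attr].add(tx)

    sizes: dict[str, int] = {}
    for start in transactions:
        if start in sizes:
            continue
        # Depth-first walk over transactions linked by shared attributes.
        cluster, stack = {start}, [start]
        while stack:
            tx = stack.pop()
            for attr in transactions[tx]:
                for neighbor in by_attribute[attr] - cluster:
                    cluster.add(neighbor)
                    stack.append(neighbor)
        for tx in cluster:
            sizes[tx] = len(cluster)
    return sizes


# Made-up data: t1 and t2 share a device, t2 and t3 share an IP address.
transactions = {
    "t1": {"device:d1", "ip:203.0.113.5"},
    "t2": {"device:d1", "ip:203.0.113.9"},
    "t3": {"ip:203.0.113.9", "card:c9"},
    "t4": {"device:d7"},
}
print(cluster_sizes(transactions))
```

Here t1, t2 and t3 end up in one cluster of three while t4 stands alone, so the first three would draw more scrutiny.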

In a bit of a shift, for news-related risk we are very interested in the interrelationships between the reporter of news and those in his community to establish the risk score. This is particularly true for new users that don’t have an established history.

The first analysis is to generate a base risk score for a new source of news based on their connection to reliable and trustworthy sources of news in their network. Just as in the 60 Minutes example, is the person reporting alien abductions on Facebook connected to any source of news that we would find reliable, or did they appear from nowhere with primary social connections that are friends likely as crazy as they are? In this case, the larger the clusters, the more attributes there are for accurate analysis.
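
A simple way to sketch that base score is to measure how many hops separate a new source from the nearest trusted source in the social graph. The graph, the account names, the `max_hops` cutoff and the mapping from distance to risk level below are all assumptions for illustration.

```python
from collections import deque


def hops_to_trusted(graph: dict[str, set[str]], start: str,
                    trusted: set[str], max_hops: int = 3):
    """Return the fewest hops from `start` to any trusted source.

    Returns None when no trusted source is reachable within `max_hops`.
    """
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in trusted:
            return hops
        if hops == max_hops:
            continue  # do not search beyond the cutoff
        for neighbor in graph.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


def base_risk(hops) -> str:
    """Hypothetical mapping from network distance to a starting risk level."""
    if hops is None:
        return "high"
    return "low" if hops <= 1 else "medium"


# Made-up network: a new correspondent tied to a trusted producer, and an
# anonymous poster whose only connections are two friends.
graph = {
    "new_correspondent": {"producer"},
    "producer": {"new_correspondent", "network"},
    "anonymous_poster": {"friend_1", "friend_2"},
    "friend_1": {"anonymous_poster"},
    "friend_2": {"anonymous_poster"},
}
trusted = {"producer", "network"}

for source in ("new_correspondent", "anonymous_poster"):
    print(source, base_risk(hops_to_trusted(graph, source, trusted)))
```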




This will likely create the need for a separate reputational or risk score for news related information.  By establishing this risk score over the community, the ability to filter, sort and prioritize news based on the source and the information’s reliability will become as intuitive as sort orders in search engines.  Most reliable source, most reliable information on top, the same way intelligence agencies parse millions of pieces of information every year.
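
Once each story carries a source grade and an information grade, the sort order itself is simple. The sketch below assumes the A to F and 1 to 6 scales described earlier; the `Story` type and the headlines are invented for the example.

```python
from typing import NamedTuple


class Story(NamedTuple):
    headline: str
    source_grade: str  # "A".."F"
    info_grade: int    # 1..6


def rank_stories(stories: list[Story]) -> list[Story]:
    """Order stories best-first: by source grade, then information grade.

    "A" < "B" < ... < "F" and 1 < 2 < ... < 6 already sort in the
    desired order, so unjudged items (F or 6) sink to the bottom.
    """
    return sorted(stories, key=lambda s: (s.source_grade, s.info_grade))


stories = [
    Story("Forum post alleges conspiracy", "F", 5),
    Story("Wire report confirmed by two outlets", "B", 1),
    Story("Network investigation, single source", "A", 2),
    Story("Local blog follow-up", "B", 3),
]
for story in rank_stories(stories):
    print(story.source_grade, story.info_grade, story.headline)
```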


The same should be applied to sites that are sources of news: sites that are reliable and trustworthy sources of information should be ranked over those that are not. Ideally the scoring should be made available to the community, allowing users to determine how much credibility they are going to lend to the news. If you want a good exercise for this, examine tech sites that discuss the next model of iPhone day after day.