Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualization lets patterns of activity that run contrary to normal activity surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days to find using conventional data mining.

The eye processes information much more rapidly when it is presented as images; this has been true since children started learning to read. As our instinct develops over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on throughout this site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Fraud in Social Media and User Generated Content (UGC) Sites

Discovering and detecting fraud in social media or user generated content is essential to establishing the credibility of any UGC site, particularly those engaged in collecting content which establishes the value and reputation of products or services. Sites which aggregate user supplied information, data and rankings face the same threats as those that engage in online financial transactions and ecommerce activity. The primary difference, and the challenge in the UGC space, is that the amount of information gathered and supplied in UGC is significantly less than in the ecommerce or financial space. When fraud is attempted in financial transactions, the user must enter a significant amount of information to complete the transaction, and even if this information is fake it creates patterns within the collected attributes. Fraud analysis needs all the data points and attributes it can get its hands on; these attributes create clusters, and clusters are definitive red flags that can be leveraged for prevention. In the UGC space, the information entered by the user and the attributes collected during the transaction are vastly different from those in finance, and normally far fewer, so fraud analysis of user generated content requires a different mindset and creative use of analysis. While UGC attributes are different, there are plenty of them; you just have to dig a little deeper. This was a challenge I faced when I switched from a lifetime of fraud analysis in the financial space to tackling fraud in social media.
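To make the idea of attributes forming clusters a little more concrete, here is a minimal Python sketch. It assumes each submission is a simple record and links any two submissions that share an attribute value, which is one common way to build the kind of cluster described above; it is not the method used by any particular site. The attribute names (ip, device) and the sample records are made up purely for illustration.

```python
from collections import defaultdict


def cluster_by_shared_attributes(records, keys):
    """Group record ids that are linked, directly or indirectly, by a shared attribute value."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (attribute name, value) -> first record id that used it
    for r in records:
        for k in keys:
            value = r.get(k)
            if value is None:
                continue
            token = (k, value)
            if token in first_seen:
                union(first_seen[token], r["id"])
            else:
                first_seen[token] = r["id"]

    groups = defaultdict(set)
    for r in records:
        groups[find(r["id"])].add(r["id"])
    # Only groups with more than one member count as clusters.
    return [g for g in groups.values() if len(g) > 1]


# Made-up submissions; the attribute names are hypothetical.
submissions = [
    {"id": "s1", "ip": "203.0.113.7", "device": "d-1"},
    {"id": "s2", "ip": "203.0.113.7", "device": "d-2"},
    {"id": "s3", "ip": "198.51.100.4", "device": "d-2"},
    {"id": "s4", "ip": "192.0.2.55", "device": "d-9"},
]
clusters = cluster_by_shared_attributes(submissions, keys=("ip", "device"))
print(clusters)
```

In this sample, s1 and s2 share an IP address and s2 and s3 share a device, so all three end up in one cluster even though s1 and s3 have nothing directly in common. That indirect linkage is the kind of relationship that tends to be easier to spot visually than in a flat table.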

  Behavior Analysis

 The key to discovering fraud in social media and UGC is understanding the behavior of users on the site. This is one area where UGC has the advantage over financial transaction sites, where user behavior is not as easily aggregated. Social media sites encourage longer interaction with the site than those in the financial space. On financial or ecommerce sites, users, even fraudulent users, go through a specific sequence of events to accomplish transactions, so the paths taken by the fraudulent user and the legitimate user look much more alike. On social media or UGC sites, many transaction threads exist, and that is the entire purpose: to provide users with expanded opportunities to explore, lengthening their engagement with the site. By leveraging the data generated as users interact with the site, and using visual analytics to examine it, you can effectively assign risk to users who leave content by grouping them into risk cohorts based on their behavior.

  Fraud Risk Assessment of UGC Users

 Let’s use an example of a social media site that aggregates reviews and ratings for hotels. The site, among other things, provides a list of hotels in geographies around the world along with pictures, reviews and ratings left by users. First we need to understand the behavior of the majority of users who engage with the site and leave content. It’s important to remember that the majority of users on the site are legitimate and that their behavior follows similar patterns that can be understood through visual analysis.

 A user who is planning a trip to Paris, France comes to the site. Everyone knows the best part of any vacation is planning it, so the user does what all of us would do: several weeks before the trip, he searches for hotels in Paris that he is considering. He looks at several hotels, views the pictures, reads some of the reviews and leaves the site. On several occasions prior to his trip, he returns to the site and looks at the one or two properties he has narrowed his choice down to. The user then takes the trip, returns home and leaves a review for the hotel, his first on the site. As with any business, the highest rate of fraud comes from new users on their first transaction.

 The challenge is establishing risk for a user that has no history, no attributes and no patterns to leverage. By using behavior analysis, we get a head start in understanding that user and can assign risk based on his cohort. The group of users that exhibit the activity described in the paragraph above most likely represents about 95% of the users who visit your UGC site. All UGC sites have a theme, some wider than others, but in the end the idea is to collect content surrounding that theme, be it travel, dining, what my friends are doing, what products I like and so on. The lowest risk cohort on your site is going to be those users who fall into the behavior I described above.

The second, third and fourth cohort groups are going to be comprised of users whose behavior is expected but differs from that of the 95%. These could be users who appear on the site and leave content due to marketing campaigns, advertising, surveys and so on, sources that the UGC company is actually targeting.

The last cohort group represents about 2% of the remaining users, but its activity is present in more than 90% of the users that have been confirmed as fraudulent. These are the users we see doing things on the site that no one else does. They appear out of nowhere, they go directly to the hotel listing review form, they complete the form immediately and they are never seen again. This is the group whose behavior attributes we can leverage for fraud prevention and increased scrutiny.
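As a rough sketch of how the cohorts described above might be expressed as rules, the Python below assigns a cohort label from a handful of behavior counts. Everything here is an assumption made for illustration: the field names, the thresholds and the cohort labels are hypothetical, and a real site would tune them against its own confirmed fraud cases. Note that "never seen again" cannot be known at submission time, so it is left out of the rule.

```python
from dataclasses import dataclass


@dataclass
class ReviewContext:
    """Behavior observed for one visitor up to the moment a review is submitted."""
    prior_visits: int         # earlier visits tied to the same visitor identifier
    pages_before_review: int  # listings, photos and reviews viewed, across all visits, before opening the form
    via_campaign: bool        # arrived through a campaign, advertisement or survey run by the site


def assign_cohort(ctx: ReviewContext) -> str:
    # Campaign-driven contributors differ from the majority, but in an expected way.
    if ctx.via_campaign:
        return "expected"
    # Appeared out of nowhere and went straight to the review form.
    if ctx.prior_visits == 0 and ctx.pages_before_review == 0:
        return "high_risk"
    # Researched hotels over earlier visits, then came back to leave a review.
    if ctx.prior_visits >= 1 and ctx.pages_before_review >= 1:
        return "low_risk"
    # Anything else is set aside for a closer look rather than guessed at.
    return "unclassified"


planner = ReviewContext(prior_visits=3, pages_before_review=12, via_campaign=False)
drive_by = ReviewContext(prior_visits=0, pages_before_review=0, via_campaign=False)
print(assign_cohort(planner), assign_cohort(drive_by))
```

Hard rules like these are only a starting point. Their main value is that each cohort can be plotted and compared visually before any of it is fed into a fraud model.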

Analysis for Cohorting Users

Even if a user is not logged in or known on the site, most UGC sites assign every user a unique identifier built from the browser and network data that the user’s device communicates to the site. This includes browser strings, cookies and IP addresses. Unless a user switches to a completely different device or network (which in itself can be considered a flag), that identifier will be fairly consistent over a reasonable amount of time. The identifier degrades over time, so establishing the degradation period is important in analysis; it will differ by site depending on supporting platforms and the time to conversion from user to contributor.
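A minimal Python sketch of such an identifier follows, assuming it is simply a hash of the browser string, a cookie value and the IP address. This is an illustration only: real sites typically combine more signals and tolerate partial changes, and the function names, the 16-character truncation and the 90-day default for the degradation period are all assumptions of mine rather than anything a specific platform does.

```python
import hashlib
from datetime import datetime, timedelta


def visitor_id(user_agent: str, cookie_id: str, ip: str) -> str:
    """Derive a stable identifier from browser and network data (hypothetical scheme)."""
    # Join with a separator that should not appear in the inputs, so that
    # ("ab", "c") and ("a", "bc") do not produce the same identifier.
    raw = "\x1f".join([user_agent, cookie_id, ip])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def is_linkable(earlier: datetime, later: datetime, degradation_days: int = 90) -> bool:
    """Trust the identifier to link two events only if they fall within the degradation period."""
    return timedelta(0) <= later - earlier <= timedelta(days=degradation_days)


# Made-up values for illustration.
first = visitor_id("Mozilla/5.0 (example)", "c-123", "203.0.113.7")
again = visitor_id("Mozilla/5.0 (example)", "c-123", "203.0.113.7")
moved = visitor_id("Mozilla/5.0 (example)", "c-123", "198.51.100.4")

print(first == again)  # same device and network give the same identifier
print(first == moved)  # a different network gives a different identifier
print(is_linkable(datetime(2024, 1, 1), datetime(2024, 2, 15)))  # 45 days apart
print(is_linkable(datetime(2024, 1, 1), datetime(2024, 6, 1)))   # 152 days apart
```

A strict hash like this changes completely when any one input changes, which is the simplest way to see why the identifier degrades and why the degradation period has to be measured for each site.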

Even if the period that can be accurately examined is only three months, that is a wealth of information which can be segmented into cohort groups for fraud analysis. Those cohort groups can then be incorporated into the UGC site’s fraud modeling to increase detection and reduce friction, a win/win for any UGC site because, remember, at the end of the day the value of a UGC site is the amount of content left by its users. It is a greater failure by the fraud department to reject a submission left by a legitimate user than to allow a contribution by a fraudulent one. Sounds counterintuitive, right? It is definitely different from financial or ecommerce sites, where the friction rates differ.

Think of it this way: if a user spends 15 minutes searching out a product they really want from an online ecommerce site and their transaction is declined, they are more likely to contact the site to resolve the issue. In UGC, if a new user comes to the site, spends 15 minutes submitting content and that content is rejected, there is an 85% chance that user will never leave another content submission.

Now that risk cohorts are established, the UGC fraud team can understand that a user who came to the site a month ago, searched for a hotel in Paris, looked at several, viewed pictures and read content, then returned a month later and submitted a review for that hotel, has a higher degree of reliability than the user in the high risk cohort group. That user is basically one who appears out of the blue, unsolicited, goes directly to a hotel listing, clicks immediately on the review form and leaves content.