Welcome to Understanding Link Analysis. The purpose of my site is to discuss the methods behind leveraging visual analytics to discover answers and patterns buried within data sets.

Visual analytics provides a proactive response to threats and risks by holistically examining information. Unlike traditional data mining, visualizing information lets patterns of activity that run contrary to normal activity surface within very few occurrences.

We can dive into thousands of insurance fraud claims to discover clusters of interrelated parties involved in a staged accident ring.

We can examine months of burglary reports to find a pattern leading back to a suspect.

With the new generation of visualization software our team is developing, we can dive into massive data sets and visually find new trends, patterns and threats that would take hours or days using conventional data mining.

The eye processes information much more rapidly when it is presented as images. This has been true since we were children learning to read. As our instincts develop over time, so does our ability to process complex concepts through visual identification. This is the power of visual analysis that I focus on in my site.

All information and data used in articles on this site is randomly generated with no relation to actual individuals or companies.

Visualizing Network Information For Analysis And Investigation

Those who are engaged in internal investigations will no doubt end up having to incorporate network analysis into their investigations. With so many companies, and criminals for that matter, going paperless, email logs, network and firewall logs, and server logs have increasingly been integrated into criminal and corporate investigations.

If you are lucky enough to have a strong network security department or a vendor who provides network trace or packet sniffing services, you are at least halfway ahead of the game. The next task is to organize the data that has been provided to you in hundreds of thousands of lines of network traffic logs.

There are two scenarios that I would like to cover, and while they are just the tip of the iceberg for network analysis, they are the most common types to be leveraged by investigators who are not specialists in network analysis.

The first scenario I would like to examine is an analysis of client network traffic. As a law enforcement or corporate investigator, you receive information that an individual is utilizing a network for the purpose of downloading child pornography. You ask your network lead or IT specialist to put a packet sniffer on the individual's network to capture the information being sent to and received from the suspect's computer.

For the purpose of this first scenario I am going to utilize a free, open-source program called Wireshark, a network analysis package that is similar to those used in a corporate environment. For demonstration, I ran a network trace on my own laptop for approximately 10 minutes, which produced 22,000 lines of IP and packet information (none of the IPs actually go to a child pornography site; this will be simulated). During the 10 minutes, I performed multiple tasks online, including repeatedly visiting one site, which is my simulated child pornography site.

Network Trace Analysis:

The first step I am going to need to take in order to visualize the packet information I have captured from Wireshark is to export the log file into a format that can be imported by my visualization software.

From the screen shot above you can see that Wireshark has logged all packet information and can export that data into a comma-delimited format, which is perfect for importing into my visualization software. Exported as a comma-delimited file and opened in Excel, the information is organized as shown below.
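For readers who want to inspect the export outside of a visualization package, here is a minimal Python sketch of reading a comma-delimited packet export. The column headers ("No.", "Time", "Source", "Destination", "Protocol", "Length", "Info") are an assumption based on Wireshark's usual default column layout; your export may differ depending on how the columns are configured. The sample rows are hypothetical and use reserved documentation addresses.

```python
import csv
import io

# Hypothetical sample shaped like a Wireshark CSV export; the addresses are
# reserved documentation addresses, not real capture data.
SAMPLE_EXPORT = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.10","198.51.100.7","TCP","66","50514 > 80 [SYN]"
"2","0.031422","198.51.100.7","192.0.2.10","TCP","66","80 > 50514 [SYN, ACK]"
"3","0.031500","192.0.2.10","198.51.100.7","HTTP","412","GET /index.html HTTP/1.1"
'''

def read_packets(text):
    """Return one dict per packet row, keyed by the export's column headers."""
    return list(csv.DictReader(io.StringIO(text)))

packets = read_packets(SAMPLE_EXPORT)
for p in packets:
    print(p["Source"], "->", p["Destination"], "|", p["Info"])
```

With a real export you would open the file instead of using the in-memory sample, for example with open(path, newline="") passed to csv.DictReader.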

My investigative objective is to determine what web sites this client is going to and the velocity at which they are going to them. This type of investigation resembles telephone toll analysis in that there is an origination and a destination entity as the primary links in the visualization. As in telephone toll analysis, I am going to visualize velocity between the originating and destination entities by link line width, making the links with the highest velocity thicker. If someone is crazy or sick enough to hit child porn links at work, the chances are they do it quite a bit, so those are the sites I want to focus on initially.

I begin setting up my import specification by establishing the originating IP as the originating computer in my visualization, assigning it the "source" column as the identity. The originating IP is linked to the destination IP. To contrast the two, I am giving the destination IP, or computer icon, a different color; in this case it's red. I am assigning the destination IP the "destination" column as the identity.

There is some additional information captured by Wireshark that I want to include in my origination and destination entities because it is important to my analysis. This includes the "info" column, which contains the packet information sent between the two IPs.

For the link line, I am going to assign the total number of occurrences in the data as the label. This is important when creating link lines whose width is linear based on occurrence. There is also a date and time column captured in the data, which I am going to map to the link line. This will be important in my second visualization, where I am creating a timeline.
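The idea of a link line whose label is the occurrence count and whose width grows with that count can be sketched in a few lines of Python. This is only an illustration of the concept, not how any particular visualization package implements it; the width range of 1 to 10 and the sample pairs are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical (source, destination) pairs, one per captured packet.
rows = [
    ("192.0.2.10", "198.51.100.7"),
    ("192.0.2.10", "198.51.100.7"),
    ("192.0.2.10", "198.51.100.7"),
    ("192.0.2.10", "203.0.113.5"),
]

def build_links(pairs):
    """Count occurrences of each directed source -> destination pair."""
    return Counter(pairs)

def link_width(count, max_count, min_width=1.0, max_width=10.0):
    """Scale a count linearly into a line width (the width range is an assumption)."""
    if max_count <= 0:
        return min_width
    return min_width + (max_width - min_width) * count / max_count

links = build_links(rows)
busiest = max(links.values())
for (src, dst), n in links.most_common():
    print(f"{src} -> {dst}  label={n}  width={link_width(n, busiest):.1f}")
```

Sorting with most_common() puts the highest velocity destinations first, which mirrors what the thickest link lines show at a glance.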

Now that the import specification has been created, let's pull the trigger on it and see what our network visualization looks like.

As you can see, I have managed to condense the 22,000 lines of network log into a visualization that quickly identifies the highest velocity destination IPs for my investigation. This was a fairly small sample from only one client host, but the same analysis works for multiple clients with hundreds of thousands of lines.

I can then take the same data and alter my import specification to create a timeline of the network traffic, indicating the date and time at which the client computer connected to specific web sites.

For my timeline, the import specification is going to differ a bit. I am going to assign the originating IP and destination IP as the theme line entities. For the link line I have two options. I can create a new link line every time the originating IP connects to the destination IP and bind it to the timeline, or I can make the link line linear based on occurrence, just like the link chart. With the second option, however, only the first occurrence will be bound to the timeline, as only one link line will be created for each unique connection between the two IP addresses.
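The difference between the two timeline options can be sketched in Python. This is a conceptual illustration under my own assumptions, not the import logic of any specific product; the timestamps and addresses are hypothetical sample values.

```python
from datetime import datetime

# Hypothetical connection events: (timestamp, source IP, destination IP).
events = [
    (datetime(2024, 1, 5, 9, 0, 12), "192.0.2.10", "198.51.100.7"),
    (datetime(2024, 1, 5, 9, 4, 40), "192.0.2.10", "203.0.113.5"),
    (datetime(2024, 1, 5, 9, 7, 3), "192.0.2.10", "198.51.100.7"),
]

def links_per_event(evts):
    """Option 1: one timeline link per connection, each bound to its own time."""
    return sorted(evts)

def links_by_occurrence(evts):
    """Option 2: one link per unique pair, bound to the first time it was seen
    and carrying the total occurrence count."""
    summary = {}
    for when, src, dst in sorted(evts):
        first, count = summary.get((src, dst), (when, 0))
        summary[(src, dst)] = (first, count + 1)
    return summary

print(len(links_per_event(events)), "links with option 1")
for (src, dst), (first, count) in links_by_occurrence(events).items():
    print(f"{src} -> {dst}  first seen {first}  occurrences={count}")
```

Option 1 preserves every connection time at the cost of a busier chart; option 2 keeps the chart compact but loses all but the first timestamp for each pair.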

For those who are engaged full time in network analysis and investigations, there are much more complex scenarios you will be tackling, such as server log connections, server file access, firewall logs and the like. All of these scenarios can be visualized by employing the same import methodology, just at a larger scale, such as in botnet, hacking or virus investigations. Hopefully the examples provided give some insight and ideas into creating much more complex network analysis visualizations.

Email / Exchange Server Communication Analysis:

The second scenario is common in a wide variety of investigations: the visualization of Outlook, Lotus or Exchange server communication.

In this scenario, just like before, I have completed my network analysis of an individual suspected of accessing child porn sites on a client computer. The next step in my investigation is to determine who this person communicates with the most.

I am going to leverage visual analysis to import Outlook or Exchange server logs and analyze the flow of this person's email communication. This type of analysis can be considered association analysis. You could also theoretically use it across multiple email accounts for social network analysis within an organization.

The first step is to export the Outlook logs into a comma-delimited format for import into my visual analysis software. The output from Outlook or an Exchange server is going to look similar to the illustration below.

From this data, there are several types of visualizations I can make. For determining strength of associations, I would import the data into a link chart using linear widths for link lines to determine who this person communicates with. For time-sensitive investigations, I would import this information into a timeline.

In email data there are four potential entities to create and link (the "from", "to", "CC" and "BCC" fields), and each has a different importance in an investigation. The first entity is established by the "from" field in the Outlook log. Since this field is going to change based on whether email is coming in or going out, it's important to create a link chart with directed links to establish the flow of communication.

For example, Andrew Marane could be in the "from" field if I were sending a message out, or I could appear in the "to" field if I were receiving the message. In my visualization there will be one "Andrew Marane" entity created, so by using directed links I can establish who "Andrew Marane" sends messages to and receives messages from.
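To make the idea of directed links concrete, here is a minimal Python sketch in which each person exists once and every message contributes a directed sender-to-recipient link. The message list and the "Person B" and "Person C" names are hypothetical, made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical messages as (sender, recipient) pairs.
messages = [
    ("Andrew Marane", "Person B"),
    ("Andrew Marane", "Person B"),
    ("Person C", "Andrew Marane"),
]

def directed_links(msgs):
    """Map each person to counts of messages sent to and received from others."""
    sent = defaultdict(lambda: defaultdict(int))
    received = defaultdict(lambda: defaultdict(int))
    for sender, recipient in msgs:
        sent[sender][recipient] += 1
        received[recipient][sender] += 1
    return sent, received

sent, received = directed_links(messages)
print("Sent by Andrew Marane:", dict(sent["Andrew Marane"]))
print("Received by Andrew Marane:", dict(received["Andrew Marane"]))
```

Because direction is kept, a single "Andrew Marane" entity can show both who he writes to and who writes to him, which an undirected link would blur together.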

I am going to begin setting up my import specification from the Outlook data. I am going to utilize a theme line layout, but the way I import the data differs from my other imports. I have four potential entities but I can only utilize two in a theme line. All of the entities relate to who sent the message, or the "from" entity.

In order to show the string of communication in my visualization I am going to have an import spec to import and visualize:

From linked to To
From linked to CC
From linked to BCC

Each import specification resembles the others; the only change is to the identity of the second entity for each of the different message roles. When I am ready to import my data for visualization, I am going to run the three import specifications in a row to bring all the data into one chart.
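The three-pass approach above can be sketched in Python as one small function run once per role, with the results merged into a single chart. The column names ("From", "To", "CC", "BCC"), the ";" separator between multiple recipients, and the sample rows are all assumptions for illustration; a real Outlook or Exchange export may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical export; column names and the ";" separator are assumptions.
SAMPLE_LOG = '''From,To,CC,BCC
Suspect,Person A,Person B,Brandy White
Suspect,Person A; Person C,,Brandy White
Person A,Suspect,,
'''

def run_import_spec(rows, role):
    """One 'import specification': link From to every name in the role column."""
    links = Counter()
    for row in rows:
        for name in (row.get(role) or "").split(";"):
            name = name.strip()
            if name:
                links[(row["From"], name, role)] += 1
    return links

rows = list(csv.DictReader(io.StringIO(SAMPLE_LOG)))

# Run the three specifications in a row and merge them into one chart.
chart = Counter()
for role in ("To", "CC", "BCC"):
    chart.update(run_import_spec(rows, role))

for (sender, recipient, role), n in chart.most_common():
    print(f"{sender} -> {recipient}  role={role}  count={n}")
```

Keeping the role as part of each link is what later allows the link labels to be colored by role.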

Let's run our import specification and see what it looks like:

From this visualization, we learn several different things. First, I changed the link label color to correspond to the role; in this case red is a BCC link. My suspect is BCC'ing Brandy White on almost every communication. Also, because width reflects velocity, the link line is thickest between the subjects who communicate the most.

This was a fairly small demonstration file for email. The majority of email logs I have looked at contain thousands of line entries. Within each line entry you can have anywhere between one and ten different people the message is sent to, copied to or blind copied on. By visualizing this information we can quickly establish a pattern of networking and communication between groups of individuals under investigation.