The-Data-Guy: March 2015

One page introduction to Graphs.

A Graph is a collection of Dots and lines, connected in a manner that represents some physical or logical phenomenon.

The Dots are also under some circumstances called either a Node or a Vertex (Vertices for more than one).

The line, or Edge is the action or thing that connects two Dots. If we were to represent the fact that on social media I am friends with you, but you are not friends with me, this is a directed edge.

If many people consider themselves friends with you, you have a high number of incoming lines, or in-degrees.

If you consider yourself to have lots of friends, you have a high out-degree.

The Degree or total-degree of a node is the number of combined in and out degree connections.

A path is either a finite or infinite sequence of edges which connect a sequence of vertices.

A walk is a sequence of vertices and edges where each edges endpoints are the preceding and following vertex in the sequence. A walk is closed if the first and last vertex is the same, and the walk is open if the first and last vertex is not the same. A walk that starts and ends at the same vertex, but has no other repeated vertices or edges is called a cycle.

A trail is a walk in which all edges are distinct.

These concept of trails and walks will become important on this blog as we discuss Data Structure Graphs, and the application of Graph theory to the concepts of data modeling, and data architecture.

By labeling edges consistently in certain large graphs, we are able to see some patterns that may be difficult otherwise. If we label the edges of my walk as edge_label := ‘doug’s walk’, and that walk is a->b->c->d->e in a graph full of vertices, then we can easily pull that particular walk out of the overall image.

To the right is a small graph with vertices named a-z. I created a number of walks throughout this graph, the source data for this image is walk_demo.csv at: https://github.com/dougneedham/Data-Structure-Graph-Book/blob/master/Data/Input

I uploaded the walk_demo.csv file to my Shiny Application at: DSG Shiny then selected dougs walk from the drop down.

The master graph is displayed above, and the graph of my walk through these nodes is :

This makes it easy to see the path I took through the crowd. If you download this file and experiment for yourself you can see some of the other paths described with the csv file.

Each of these terms or concepts are the foundation of much further discussion. I will be using these definitions as I explore more about Data Structure Graphs.

Graph on!

Graphs are everywhere.

This is a phrase that I have heard many times, especially coming from the Neo4J community. The more I study Graphs, the more I see ways to represent things as a Graph.

Some things are easy to see as a graph. Bridges and Landmasses, People following other people, bank loans to other banks, for example.

Things that you use every day being represented as a graph sometimes allows new insight to be given to the tools you are used to working with.

An Entity Relationship Diagram (ERD) is typically used by Data Architects to design the storage mechanism for an application in a relational database.

Entity relationship diagram, essential for the design of database tables, extracts, and metadata. (Photo credit: Wikipedia)

There are tools that allow a designer to create an ERD, and generally they are focused on creating the most efficient Relational data model for the Relational database engine where the application will be storing its data. ERWin is a really good tool for this.

These tools generally do not include the capability to do an analysis of the Graph itself. When I refer to the “Graph” itself (in this context), I am referring to the structural artifact that is generated when a relationship is established between two entities in an ERD. These concepts are based in Graph Theory

One thing that may make it difficult to fully make the transition from a proper ERD to a Graph is that an ERD is very granular with all of the attributes of every Entity documented to a great level of detail.

To translate this to a Graph, the low-level details need to be trimmed off. The entities themselves, and the relationship between them should be captured in a format that a Graph analysis tool can use.

Many Graph analysis tools are also referred to as Social Network Analysis tools. The Graph analysis tool I find most useful is Gephi.

By doing this, translating an ERD to a Graph and performing a Social Network Analysis on it, what can we learn?

If the ERD is translated properly to a Graph format, once you gather the statistics that are a part of Graph Theory on the Data Structure you are analyzing, interesting things are found.

This is how the PageRank works. (Photo credit: Wikipedia)

For example, running PageRank on the Data Structure Graph will show you the most interconnected Entity. This Entity, which is physically instantiated as a table is the table that is the most critical to your application.

This table should be given precedence for indexes and database utilities like gather statistics.

For a full blown application, if performance problems arise, SQL queries should be examined where this table is referenced. Perhaps even more caching should be done to keep this particular table, and those that have a similarly high PageRank score in memory.

What other insights could be gained by doing a Graph analysis on a Data Structure?

I have published an analysis tool written in R to start people on the journey of analyzing data Structures at Data Structure Graph Shiny Application.

The format that this application requires is easily imported into Gephi, and the analysis discussed above can be done with just a few clicks.

This particular type of Data Structure Graph, I refer to as a Data Structure Graph Level 1.

The-Data-Guy

Pages

2015-03-31

Brief-introduction-to-graphs

One page introduction to Graphs.

2015-03-23

Why-Graph-A-Data-Structure?

Graphs are everywhere.