One page introduction to Graphs.

A Graph is a collection of Dots and lines, connected in a manner that represents some physical or logical phenomenon. 

The Dots are also under some circumstances called either a Node or a Vertex (Vertices for more than one).

The line, or Edge is the action or thing that connects two Dots. If we were to represent the fact that on social media I am friends with you, but you are not friends with me, this is a directed edge.

If many people consider themselves friends with you, you have a high number of incoming lines, or in-degrees.

If you consider yourself to have lots of friends, you have a high out-degree. 

The Degree or total-degree of a node is the number of combined in and out degree connections. 

A path is either a finite or infinite sequence of edges which connect a sequence of vertices.

A walk is a sequence of vertices and edges where each edges endpoints are the preceding and following vertex in the sequence. A walk is closed if the first and last vertex is the same, and the walk is open if the first and last vertex is not the same. A walk that starts and ends at the same vertex, but has no other repeated vertices or edges is called a cycle.

A trail is a walk in which all edges are distinct. 

These concept of trails and walks will become important on this blog as we discuss Data Structure Graphs, and the application of Graph theory to the concepts of data modeling, and data architecture. 

By labeling edges consistently in certain large graphs, we are able to see some patterns that may be difficult otherwise. If we label the edges of my walk as edge_label := ‘doug’s walk’, and that walk is a->b->c->d->e in a graph full of vertices, then we can easily pull that particular walk out of the  overall image. 

To the right is a small graph with vertices named a-z. I created a number of walks throughout this graph, the source data for this image is walk_demo.csv at:

I uploaded the walk_demo.csv file to my Shiny Application at: DSG Shiny then selected dougs walk from the drop down. 

The master graph is displayed above, and the graph of my walk through these nodes is : 

This makes it easy to see the path I took through the crowd. If you download this file and experiment for yourself you can see some of the other paths described with the csv file.

Each of these terms or concepts are the foundation of much further discussion. I will be using these definitions as I explore more about Data Structure Graphs. 

Graph on! 



Graphs are everywhere. 

This is a phrase that I have heard many times, especially coming from the Neo4J community. The more I study Graphs, the more I see ways to represent things as a Graph.

Some things are easy to see as a graph. Bridges and Landmasses, People following other people, bank loans to other banks, for example. 

Things that you use every day being represented as a graph sometimes allows new insight to be given to the tools you are used to working with. 

An Entity Relationship Diagram (ERD) is typically used by Data Architects to design the storage mechanism for an application in a relational database. 

Entity relationship diagram, essential for the...
Entity relationship diagram, essential for the design of database tables, extracts, and metadata. (Photo credit: Wikipedia)

There are tools that allow a designer to create an ERD, and generally they are focused on creating the most efficient Relational data model for the Relational database engine where the application will be storing its data. ERWin is a really good tool for this.

These tools generally do not include the capability to do an analysis of the Graph itself. When I refer to the “Graph” itself (in this context), I am referring to the structural artifact that is generated when a relationship is established between two entities in an ERD. These concepts are based in Graph Theory

One thing that may make it difficult to fully make the transition from a proper ERD to a Graph is that an ERD is very granular with all of the attributes of every Entity documented to a great level of detail. 

To translate this to a Graph, the low-level details need to be trimmed off. The entities themselves, and the relationship between them should be captured in a format that a Graph analysis tool can use.

Many Graph analysis tools are also referred to as Social Network Analysis tools. The Graph analysis tool I find most useful is Gephi. 

By doing this, translating an ERD to a Graph and performing a Social Network Analysis on it, what can we learn? 

If the ERD is translated properly to a Graph format, once you gather the statistics that are a part of Graph Theory on the Data Structure you are analyzing, interesting things are found. 

This is how the PageRank works.
This is how the PageRank works. (Photo credit: Wikipedia)
For example, running PageRank  on the Data Structure Graph will show you the most interconnected Entity. This Entity, which is physically instantiated as a table is the table that is the most critical to your application. 

This table should be given precedence for indexes and database utilities like gather statistics.

For a full blown application, if performance problems arise, SQL queries should be examined where this table is referenced. Perhaps even more caching should be done to keep this particular table, and those that have a similarly high PageRank score in memory.

What other insights could be gained by doing a Graph analysis on a Data Structure? 

I have published an analysis tool written in R to start people on the journey of analyzing data Structures at Data Structure Graph Shiny Application.

The format that this application requires is easily imported into Gephi, and the analysis discussed above can be done with just a few clicks. 

This particular type of Data Structure Graph, I refer to as a Data Structure Graph Level 1.



Recently I mentioned to a few people that I was doing an analysis on a Data Structure Graph. The immediate response is – What is a Data Structure Graph?
The image shows a Data Structure Graph rendered for a fictional organization where the applications are the dots(Nodes), and the data transfers between the applications are the lines (Edges)


A Definition

A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph.

A group of atomic entities.

An atomic entity is an entity that cannot be broken down any further. This Entity (Vertex) could be an application, or a table in a database, or a document stored on a file system. These Entities are related to each other through some mechanism (Edge), the mechanism could be a simple foreign key, or transfer of a subset of data.

Related to each other.

For two entities to have a relationship means that one entity refers to another entity. In a relational database example this is what is called a foreign key. I have also seen cases where documents are related to each other by having some type of business key in the title of the document along with a prefix, or suffix indicating what type of document it represents.

Stored in a repository.

I use the term repository here because it is no longer the case that all data is stored in a relational database.  In the document case these could be simply documents stored in a file server. For the case of an application, any application that an enterprise relies on for its duties persists data in some type of repository.

Moved from one persistence layer to another.

This is the portion of the definition, in general,  where we make the transition from a Level 1 to a Level 2 Data Structure Graph. Seldom in my experience is one application sufficient to meet the needs of an entire organization. Data has to flow from some systems to other systems.  This is generally where a Data Architect should have the most influence in an organization. This portion of the definition is for data that has to move between applications, then be persisted in the new application for some period of time, for use-cases not originally designed as part of the application of origin.

Rendered as a Graph.

Why do I say it is rendered as a Graph? In a number of instances where I have worked as a data architect, more time was spent discussing the layout of a diagram than the content of the diagram. On a recent project I set out to change the discussion. Using some of the tools available for Graph analysis, such as Gephi, NetworkX, and iGraph, along with the tenets of graph theory along with a bit of data cleansing I was able to gather, consolidate, render, and present data about the overall structure and relationships of enterprise applications to my customer in a rapid fashion. This completely changed the conversation from the specifics of the diagram to the manner in which things needed to be changed to get away from the hairball they had created.

I will be writing more about Data Structure Graphs as time goes on.  This blog gives us a foundational definition of what a Data Structure Graph is, in future blogs I will write about the particular use-cases and interesting analysis done with Data Structure Graphs.