Recently I mentioned to a few people that I was doing an analysis of a Data Structure Graph. The immediate response was: what is a Data Structure Graph?
The image shows a Data Structure Graph rendered for a fictional organization, where the applications are the dots (nodes) and the data transfers between the applications are the lines (edges).


A Definition

A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, moved from one persistence layer to another, and rendered as a graph.

A group of atomic entities.

An atomic entity is an entity that cannot be broken down any further. This entity (a vertex) could be an application, a table in a database, or a document stored on a file system. These entities are related to each other through some mechanism (an edge); the mechanism could be a simple foreign key or the transfer of a subset of data.
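To make the definition concrete, here is a minimal sketch of how entities and their relating mechanisms could be captured with NetworkX. The application names and mechanism labels are illustrative only, not from a real system.

```python
import networkx as nx

# Each atomic entity (an application, a table, or a document) becomes a node,
# and each relating mechanism (foreign key, data transfer) becomes a directed
# edge carrying a label that records what the mechanism is.
g = nx.DiGraph()
g.add_node("CRM", kind="application")
g.add_node("Billing", kind="application")
g.add_node("Warehouse", kind="application")

g.add_edge("CRM", "Billing", mechanism="nightly customer extract")
g.add_edge("Billing", "Warehouse", mechanism="invoice feed")

print(g.number_of_nodes(), g.number_of_edges())
```

Keeping the mechanism on the edge means the same graph can later answer both "what talks to what" and "how".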

Related to each other.

For two entities to have a relationship means that one entity refers to another. In a relational database this is called a foreign key. I have also seen cases where documents are related to each other by a business key embedded in the document title, along with a prefix or suffix indicating what type of document it represents.

Stored in a repository.

I use the term repository here because it is no longer the case that all data is stored in a relational database. In the document case, these could simply be documents stored on a file server. In the application case, any application that an enterprise relies on for its duties persists data in some type of repository.

Moved from one persistence layer to another.

This portion of the definition, in general, is where we make the transition from a Level 1 to a Level 2 Data Structure Graph. Seldom in my experience is one application sufficient to meet the needs of an entire organization; data has to flow from some systems to others. This is generally where a Data Architect should have the most influence in an organization. This part of the definition covers data that has to move between applications and then be persisted in the new application for some period of time, for use cases not originally designed as part of the application of origin.

Rendered as a Graph.

Why do I say it is rendered as a graph? In a number of instances where I have worked as a data architect, more time was spent discussing the layout of a diagram than its content. On a recent project I set out to change that discussion. Using tools available for graph analysis, such as Gephi, NetworkX, and igraph, along with the tenets of graph theory and a bit of data cleansing, I was able to gather, consolidate, render, and present data about the overall structure and relationships of enterprise applications to my customer rapidly. This completely changed the conversation from the specifics of the diagram to how things needed to change to untangle the hairball they had created.
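The kind of analysis described above can start very simply. This sketch, using hypothetical application names, shows two basic graph measures that turn the "hairball" conversation from an aesthetic argument into a quantitative one.

```python
import networkx as nx

# Toy Data Structure Graph: applications as nodes, data transfers as edges.
# The application names are hypothetical.
edges = [("ERP", "DW"), ("CRM", "DW"), ("CRM", "ERP"),
         ("HR", "ERP"), ("DW", "Reporting"), ("ERP", "Reporting")]
g = nx.DiGraph(edges)

# Density (how tangled the graph is overall) and in-degree (which application
# receives the most feeds) make hub systems visible without arguing layout.
density = nx.density(g)
hub, fan_in = max(g.in_degree(), key=lambda pair: pair[1])
print(f"density={density:.2f}, busiest target={hub} with {fan_in} inbound feeds")
```

Measures like these can also be computed in Gephi or igraph; the point is that the structure, not the drawing, drives the discussion.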

I will be writing more about Data Structure Graphs as time goes on. This post gives a foundational definition of what a Data Structure Graph is; in future posts I will write about particular use cases and interesting analyses done with Data Structure Graphs.



"Treat your servers like cattle, not puppies" is a mantra I heard repeatedly in a former engagement where I was actively involved in deploying a cloud-based solution with RightScale across Rackspace, AWS, and Azure.

The idea is that if you have a problem with a particular server, you simply kill it, relaunch it with the appropriate rebuild scripts (using tools like Chef and other automation), have it reconnect to the load balancer, and you are off to the races.

I think there is one significant flaw in this philosophy.

Database servers, or data servers, are neither cattle nor puppies. They are, however, like pack mules.

Some data servers, like Cassandra, have built-in rebalancing capabilities: you can kill one particular instance in a Cassandra ring, bring up a new one, and the data loads will be redistributed. Traditional database engines like SQL Server, Oracle, and MySQL do not have this built-in capability.

Backups still need to be done, restores still need to be tested, and bringing up a new relational database server requires a bit of expertise to get right. There is still plenty of room for automation, scripting, and other capabilities.

That being said, database infrastructure needs a trained, competent database administrator overseeing its care and maintenance.

We have all seen movies where pack mules carry the adventurers' supplies.

If the pack mule you take on your adventures breaks its leg, you can get a new pack mule easily enough. However, you have to move everything the old mule was carrying onto the new one. If you don't have a new pack mule, or can't get one quickly, the adventurers themselves have to carry the supplies: the load is redistributed, priorities are set for what is truly needed to make it through the foreseeable next steps, and plans are made to find the nearest town to get a new pack mule.

Back in the present, database infrastructure is truly the lifeblood of our organizations. Trying to "limp" through, or simply "blowing away" our servers and rebuilding them, is an extreme philosophy. There are laws, regulations, and customer agreements regarding the treatment and protection of data that must be adhered to.

Who is taking care of your pack mules? Will your current pack mule make it over the next hill you have to climb?



Steps to successful adoption of a new data warehouse

What is taking so long to get the data warehouse ready?

In a new deployment of a data warehouse, there are many infrastructure components that have to be put in place: modeling tools, ETL servers, ETL processes, BI servers, BI interfaces, and finally reports and dashboards. Not to mention sessions for user interviews, business process review, and metadata capture.

I say server(s) because there should be dev/test and prod platforms for each of these.

A recent article talks about data modeling taking too much time when done correctly.

Add all of these things together and you have a significant period of time to wait before seeing a benefit to a Data Warehouse/Business Intelligence project.

Here are some suggestions to reassure the stakeholders early on during the project lifecycle.

Give them data early and often.

Put together a small and simple data model for the first pass. Load the small star schema with a subset of the data relevant to a group of business users, then create some reports or give some power users access to create their own.

This demonstrates the concept of continuity. A continuity test in electronics checks an electrical circuit to see whether current flows, that is, whether the circuit is complete.
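A first-pass star schema really can be this small. The sketch below, using SQLite via Python's standard library and purely illustrative table names, builds one dimension and one fact table, loads a handful of rows, and runs the simple report that proves data flows end to end, the "continuity test" described above.

```python
import sqlite3

# First-pass star schema: one fact table, one dimension, loaded with a small
# subset of data so business users can run reports immediately.
# Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_date   TEXT,
        amount      REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, "2024-01-05", 9.99), (1, "2024-01-06", 19.98),
                  (2, "2024-01-05", 5.00)])

# The continuity test: a simple report proving data flows from load to query.
rows = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)
```

If this query returns sensible totals, the circuit is complete: sources, model, and presentation layer are all connected, and power users can start asking their own questions.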

Show the data quality issues

"A problem well stated is a problem half solved." Without seeing data quality issues, the people who enter data into the system of record cannot fix them.

Get and give feedback often

As soon as people start using the "prototype", you will get feedback. Use this as an opportunity to explain why the full process takes longer. It also identifies gaps in understanding within the team. Once people have a hands-on view of the presentation layer, they will try a number of things.

They will use it to answer questions they already have answers to, thus validating the transformation processes.
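That validation step can even be automated as a reconciliation check. This is a minimal sketch with hypothetical figures: compare a total the business already knows from the system of record against the same measure from the new warehouse.

```python
# Reconciliation sketch: a known total from the system of record versus the
# same total computed from the warehouse prototype. Figures are hypothetical.
source_total = 1_204_567.89     # monthly revenue per the system of record
warehouse_total = 1_204_567.89  # same measure from the new star schema

def reconciles(source: float, warehouse: float, tolerance: float = 0.01) -> bool:
    """True when the two totals agree to within the stated tolerance."""
    return abs(source - warehouse) <= tolerance

print(reconciles(source_total, warehouse_total))
```

Running a handful of such checks on every load turns "the numbers look right" into something the whole team can verify.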

They will also start trying to answer questions they have not asked before. This is the best opportunity to learn more about how the data is being used.

These steps lay the foundation for making data work for you and your business.



3 Great Reasons to Build a Data Warehouse

Why should you build a Data Warehouse?

What problems do a Data Warehouse and Business Intelligence platform solve?

There are strong debates about the methods chosen for building a data warehouse, or for choosing a business intelligence tool.

Here are three great reasons for building a data warehouse.

Make more money

The initial cost of building a data warehouse can appear large. However, consider the cost in time for the people analyzing the data without one. Each department, analyst, or business unit goes through a similar process of getting data, putting it in a usable format, and storing it for reporting purposes (in effect, ETL). After going through this process, they have to create reports, prepare presentations, and perform analysis. The immediate time savings go to these folks, who no longer have to worry about finding the data once the data warehouse platform is built.

The following two points also allow you to make more money.

Make better decisions

In order to know your customers better, you must first understand what they want from you. Once the people who spend most of their time analyzing data no longer have to spend so much of it finding the data, and can focus on reviewing it and making recommendations, the speed of decision making will increase. As better decisions are made, more decisions can be made faster. This increases agility, improves response time to the customer or environment, and strengthens decision-making processes.

Once a decision-making platform is built, you can better see which type of customer is purchasing which type of product. This allows the marketing department to advertise to those customers. The merchandising department can ensure products are available when they are wanted. Purchasing can better anticipate raw material needs so products are available. Inventory can best be managed when you can anticipate orders, shortages, and re-orders.

Make lasting impressions.

Customer service improves when you understand your customer better. When you can recommend other products your customers may like, you become a partner to them. Amazon does an amazing job of this: their recommendation engine is closely tied to their historical data and to pattern matching of which products are similar. Likewise, you may want to tell a customer they may not want something they are about to purchase because a better solution is available. This makes a lasting impression that you are the one to help them in their decision-making process.

Make data work

Building a data warehouse platform is one of the best ways to make data work for you, rather than you working for your data.



Datagraphy or Datalogy?

What is the study of data management best practices?

Do data management professionals study Datagraphy, or Datalogy?

A few of the things that a data management professional studies and applies are:
  • Tools
    • Data Modeling tools
    • ETL tools
    • Database Management tools
  • Procedures 
    • Bus Matrix development
    • User session facilitation
    • Project feedback and tracking
  • Methodologies 
    • Data Normalization
    • Dimensional Modeling
    • Data Architecture approaches

These, among many others, are applied to the needs of the business. Our application of these best practices make our enterprises more successful.

What should be the suffix of the word that sums up our body of knowledge?

Both "-graphy" and "-logy" make sense, but let's look at these suffixes and their meanings.


The wiki page for "-graphy" says: -graphy is the study, art, practice, or occupation of...

The dictionary entry for "-graphy" says: "a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"

The wiki page for "-logy" says: -logy is the study of (a subject or body of knowledge).

The dictionary entry for "-logy" says: a combining form used in the names of sciences or bodies of knowledge.


The key word that we all focus on is data. 

In a previous blog entry, I wrote a review of the DAMA-DMBOK, the Data Management Association's Data Management Body of Knowledge.

Data management professionals study and contribute to this body of knowledge. As a data guy, I am inclined to study the works of those who have gone before. I want to both learn from their successes and avoid solutions that have been unsuccessful.

Some of the writings I study are by people like Dan Linstedt, Len Silverston, Bill Inmon, Ralph Kimball, Karen Lopez, William McKnight, and many others.

I have seen firsthand what happens to a project when expertise from the body of knowledge produced by these professionals is discarded. It is not pretty.

Why do I study these particular authors? They share their experiences. When I face an intricate problem, I research their writings to see what they have done. Some tidbit of expertise they have written about has shed light on many a problem I have faced, helping me find the solution that much sooner.

When I follow their expertise my solutions may still be unique, but the solutions fit into patterns that have already been faced. I am standing on the shoulders of giants when I heed their advice. 

When I am forced to ignore their advice, I struggle, fight, and do battle with problems that either should not be solved at all, or certainly should not be solved in the manner in which I am forced to solve them.

Should the study of and contribution to the body of knowledge of data management be called data-graphy or data-logy? 


The term Datagraphy sums up the study of the data management body of knowledge succinctly.

I refer back to the dictionary definition of the suffix "-graphy": "a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"

Data is recorded, described, written down, written about, represented (in many ways), and used as a source for many drawings and graphical representations.

What do you think? I will certainly be using Datagraphy.