The-Data-Guy: April 2015

2015-04-24

#DataOps

#DataOps: Oh, that's what they call what I do?

I just read a few amazing articles:

Agile DataOps
DataOps a New Discipline

Trying to explain to people what I do becomes tedious at times.

Yes, I can build a Cassandra Cluster, or an Oracle Cluster, or a Cloudera Cluster. Each of those clusters has its own challenges and rewards. But once the Cluster is built and we start putting data into it, what do you want to do with that data? How do you want the data organized? Is your data model correct? Do you have a Data model? Do you know how big the cluster or database is going to get? Have you done a Volumetric calculation? Is your budget big enough to allow for fail-over? Downtime? Maintenance? Disaster recovery practice?

Data Scientist, Business Analyst, Executive Analyst, Business User, Project Manager.

Each of these specialties has its own unique challenges. Being The Data Guy in an organization requires the ability to at least communicate effectively with all of these specialists.

Here are some of the things we do:

Oracle Certifications, Microsoft Certifications, Red Hat Certifications. Statistics, ETL, Informatica, DataStage, Pentaho, Data mining, R, Python, Scala, Spark, Data modeling, SQL, CQL, Hive, Hadoop, Impala, MapReduce, Spring, Data Movement, Data Plumbing.

Backups, Restores, Performance checks, SQL tuning, Code tuning to match the data platform. These are all part of the day-in, day-out life of Data Operations. Data modeling, ERWin, ERStudio, JSON, XML, Column stores, Document stores, Text mining, Text processing and storage. RAID, SAN, NAS, local storage, spindles, SATA drives. Jobs, batches, Schedulers, 3:00 a.m. wake-up calls, alerts, on-call troubleshooting.

Now it is all summed up in a new hashtag: #DataOps

What is it that you do?

Just as DevOps is important in making our organizations Agile and responsive to the needs of the business users, so too does DataOps have its own unique and peculiar take on impacting the business.

Data Scientist on the left, Business Analyst on the right, Developers behind us, and Project management ahead of us. Standing on the infrastructure that we work together with DevOps to create, implement, and manage.

This is #DataOps.

Are you up for it?

2015-04-21

Centrality-and-Architecture

How does centrality affect your Architecture?

Some time ago, I was responsible for a data architecture I had mostly inherited. There were a number of tweaks I worked on to refine the monolithic nature of the main database. It was a time of upheaval in this organization. They had outgrown their legacy Computer Telephony Interface application. It was time to create something new.

A large new application development team was brought in to develop some new software.

There was a large division of labor and processing: some things were handled by the new application, and a separate system was developed to handle the data. Reporting, cleansing, analysis, ingress feeds, egress feeds, all of these went through the “less important” system.

This was the system I was responsible for.

In thinking about how best to explain a Data Structure Graph, I spent some time revisiting this architecture and brought it into a format that could be analyzed with the tools of Network Analysis.

After anonymizing the data a bit, and limiting the data flows to only the principal data flows, I constructed a CSV file to load into Gephi for analysis.

Source	Target	Edge_Label
Spider	ODS	Application
ODS	Spider	Prospect
Vendor1	ODS	Prospect
Vendor2	ODS	Prospect
Vendor3	ODS	Prospect
ODS	Servicing	Application
Legacy	ODS	Application
ODS	Legacy	Prospect
ODS	Dialer1	Prospect
ODS	Dialer2	Prospect
Gov	ODS	DNC
ODS	Spider	LegacyData1
ODS	Spider	LegacyData2
ODS	Spider	LegacyData3
Spider	ODS	LegacyData1
Spider	ODS	LegacyData2
Spider	ODS	LegacyData3
ODS	ThirdParty	Prospect
ThirdParty	ODS	Application
Legacy	ODS	Application
Legacy	ODS	DialerStats
Dialer1	ODS	DialerStats
Dialer2	ODS	DialerStats

I ran a few simple statistics on the graph, then partitioned and colored it so that the degree of each node is apparent. This is the first output of Gephi:

The actual statistics Gephi calculated are in this table:

Id	Label	PageRank	Eigenvector Centrality	In-Degree	Out-Degree	Degree
Vendor1	Vendor1	0.01991719	0.00000000	0	1	1
Vendor2	Vendor2	0.01991719	0.00000000	0	1	1
Vendor3	Vendor3	0.01991719	0.00000000	0	1	1
Gov	Gov	0.01991719	0.00000000	0	1	1
Spider	Spider	0.08121259	0.44698155	1	1	2
Servicing	Servicing	0.08121259	0.44698155	1	0	1
Legacy	Legacy	0.08121259	0.44698155	1	1	2
Dialer1	Dialer1	0.08121259	0.44698155	1	1	2
Dialer2	Dialer2	0.08121259	0.44698155	1	1	2
ThirdParty	ThirdParty	0.08121259	0.44698155	1	1	2
ODS	ODS	0.43305573	1.00000000	9	6	15
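
The three degree columns can be reproduced without Gephi. The following is a minimal Python sketch, not the method Gephi uses internally. It assumes that each source-target pair is counted once, however many labeled flows run between the two systems; that assumption is chosen because it reproduces the numbers in the table above. The names EDGES and degree_table are illustrative only.

```python
from collections import defaultdict

# Edge list copied from the CSV in this post: (source, target, label).
EDGES = [
    ("Spider", "ODS", "Application"), ("ODS", "Spider", "Prospect"),
    ("Vendor1", "ODS", "Prospect"), ("Vendor2", "ODS", "Prospect"),
    ("Vendor3", "ODS", "Prospect"), ("ODS", "Servicing", "Application"),
    ("Legacy", "ODS", "Application"), ("ODS", "Legacy", "Prospect"),
    ("ODS", "Dialer1", "Prospect"), ("ODS", "Dialer2", "Prospect"),
    ("Gov", "ODS", "DNC"), ("ODS", "Spider", "LegacyData1"),
    ("ODS", "Spider", "LegacyData2"), ("ODS", "Spider", "LegacyData3"),
    ("Spider", "ODS", "LegacyData1"), ("Spider", "ODS", "LegacyData2"),
    ("Spider", "ODS", "LegacyData3"), ("ODS", "ThirdParty", "Prospect"),
    ("ThirdParty", "ODS", "Application"), ("Legacy", "ODS", "Application"),
    ("Legacy", "ODS", "DialerStats"), ("Dialer1", "ODS", "DialerStats"),
    ("Dialer2", "ODS", "DialerStats"),
]


def degree_table(edges):
    """Return {node: (in_degree, out_degree, degree)}.

    Assumption: parallel flows between the same two systems count once.
    """
    pairs = {(source, target) for source, target, _label in edges}
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for source, target in pairs:
        out_deg[source] += 1
        in_deg[target] += 1
        nodes.update((source, target))
    return {n: (in_deg[n], out_deg[n], in_deg[n] + out_deg[n]) for n in nodes}


table = degree_table(EDGES)
# Print the most connected systems first.
for node, (i, o, d) in sorted(table.items(), key=lambda kv: -kv[1][2]):
    print(f"{node:<10} in={i} out={o} degree={d}")
```

Counting every labeled flow separately would give a weighted degree instead, which is a different but equally reasonable measure of how busy a system is.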

From the Data Architecture perspective, which “application” would have the greatest impact on the organization if there were a failure?

Which “application” should have the greatest degree of protection, redundancy, and expertise associated with it?

Let's look at the two metrics in the middle of the last table: PageRank and Eigenvector Centrality.

I will have to create individual blog entries for both PageRank and Eigenvector Centrality to discuss the actual mechanism for how these are calculated. The math for these can be a bit cumbersome, and each algorithm should be given due attention on its own.
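
Until those entries exist, here is a rough Python sketch of the textbook power-iteration form of PageRank, run against the same data flows. It is not a description of Gephi's implementation. The damping factor of 0.85, the convergence tolerance, and the choice to spread the rank of a node with no outgoing flows (Servicing) evenly across all nodes are assumptions, and the names OUT_LINKS and pagerank are illustrative. Under these assumptions the results should land close to the PageRank column above, though not digit for digit.

```python
# Unique source -> targets pairs taken from the edge list in this post.
OUT_LINKS = {
    "Spider": ["ODS"], "Legacy": ["ODS"], "ThirdParty": ["ODS"],
    "Dialer1": ["ODS"], "Dialer2": ["ODS"],
    "Vendor1": ["ODS"], "Vendor2": ["ODS"], "Vendor3": ["ODS"],
    "Gov": ["ODS"],
    "ODS": ["Spider", "Servicing", "Legacy", "Dialer1", "Dialer2", "ThirdParty"],
    "Servicing": [],  # receives data but sends none (a "dangling" node)
}


def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank over a dict of node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by dangling nodes is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                # Each node splits its rank evenly among its targets.
                new_rank[target] += damping * rank[node] / len(targets)
        if sum(abs(new_rank[x] - rank[x]) for x in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank


ranks = pagerank(OUT_LINKS)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:<10} {score:.6f}")
```

The intuition is that a system is important when important systems send data to it. Every feed in this architecture sends data to the ODS, so the ODS accumulates most of the rank.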

The point of this analysis is to determine which component of the architecture should have additional resources devoted to it. Any customer-facing application should be given due attention and infrastructure. However, one question I have seen many of my clients struggle with is: what is the priority of the back-end infrastructure? Should one component of the architecture be given more attention than another? I have 90 databases throughout the organization; which one is the most important?

These centrality calculations show unequivocally which component of the architecture has the most impact in the event of an outage, or where the most value can be provided for an upgrade.

This type of analysis can begin to shed light on the answers to these questions. A methodical approach to an architecture, based on data rather than on the division that screams the loudest, can give insight into how an architecture is truly implemented.

I call these artifacts a Data Structure Graph.
