Comparing Data Science and Business Intelligence

Over the past few years as I have been supporting more and more "non-traditional" (i.e. not a Data Warehouse or Data Mart) analytical platforms, I have noticed a number of differences between Data Science approaches and Business Intelligence approaches.

This image sums up many of my observations and gives a touch point for comparing the differences as well as similarities between the two approaches.

Reproducible versus Repeatable

One of the goals of #DataOps is to keep data moving to the right location in a repeatable, automatized manner. Most of the data warehouse environments I have worked on, the person doing the analysis does not run the ETL jobs. Todays data flows into existing data marts, dashboards, dimensional models, and queries that drive it all. These are repeatable processes.

Performing a Reproducible process on the other hand shows the entire process soup to nuts. The analyst pulled this data from that system, used this transformation on these data elements, combined this data with that data, ran this regression, and produced this result. Therefore if we raise the price of this widget by $.05 we will have this lift in profit (Ceteris paribus).

Predictive versus Descriptive

As described above the Data Scientist attempts to make a prediction about something, whereas the Business Intelligence analyst is usually reporting on a condition that is considered a Key Performance Indicator of the company.

Explorative versus Comparative

In most Business Intelligence environments I have worked with, the questions are usually along these lines: "Is this product selling more than that product?"

The Data Scientist would want to look at what product has the highest margin, or the product that has the largest impact on the bottom line. If someone buys product X, do they also purchase product Y?
What else is impacting this particular store? Does weather have an impact on purchase patterns? What about twitter hashtags?  What in our product line is most similar to a product that has a high purchase volume in the overall consumer community. 

Attentive versus Advocating

Data Scientist: The data shows us that consumers that purchased X also purchased Y. I suggest we relocate Y for the stores in this geographic area by 1 meter away from X, and in this geographic area they should be 2 meters away. Then we will analyze the same visit purchases for those two items to determine if this should be done in all stores.

Business Intelligence: The latest data from our campaign is shown here. The response rate among 18-24 year old males is less than what we wanted but we expect to see more lift in the coming weeks. 

Accepting versus Prescriptive

Data Scientist: Give me all data, I will analyze it as is, and determine what needs to be cleaned and what represents further opportunities. If there is a quality issue I will document it as part of the assumptions in my analysis.

Business Intelligence: The data has to be cleaned and high quality before it can be analyzed. No one should see the data before all of the quality checks, verifications, and cleansing processes are done.

Both of these approaches have business value.

I think Data Science will continue to get some press for quite a while, there will always be some amazing break through that someone used the algorithm of the day to solve a business problem. Then the performance of that algorithm will become a metric on a dashboard that is put into a data mart.

The guys and gals in #DataOps will make sure the data is current.

The Data is protected.

The Data is available.

The Data is shown on the right report to the people that are authorized to see it.

Your Data is safe. 



Data Strategy, but no Spark Strategy, how cute.

I see blogs, and blogs, and articles, and presentations, and Slideshares on creating a Big Data Strategy.  How does "it" (usually Hadoop) fit into your organization, who should get access, who should "it", so many questions, comments and opinions.

Guess what?

Your data is growing.

And growing.

If you use a Graph to analyze the data flows in your organization, chances are you will see ways to cut cost, and consolidate architectural components.

Bring things together and store them in one location.


Now what?

What tools do you use to analyze it?

You want to do the same thing you have always done?


Send two people in your organization to the upcoming Spark Summit. Let them show you there is a better way.

Spark is not just a new tool.

It is a new Path.




Was Charles Eppes the first fictional Data Scientist


Sherlock Holmes is the most well known fictional detective, although many students of literature will tell you Holmes was not the actual first consulting detective.

A recent conversation about detectives, and data scientists led me to wonder who could be considered the first fictional Data Scientist.

To answer this question we should consider what a data scientist is and what they do.

They work with data gathered in the real world, do some analysis, derive a model that can represent the data, explain the model and data to others, add new data to the model, perform some type of prediction, refine results, then test in the real world.

While their are many definitions, and I feel like the definition of the term Data Scientist changes on an ongoing basis, for the purpose of this article I think these general thoughts are sufficient.

On the show Numb3rs, Charles Eppes, played by David Krumholtz was a mathematician who had a brother in the F.B.I. in California. Through a number of seasons Charlie, as he was affectionately known, worked through "doing the math" of these hard problems helping his brother and his team both solve crimes, as well as understand the math behind his explanation. By taking things this next step, actually explaining through analogy that non-mathematicians could understand he was able to provide them insight into the crime and the criminal.

For those of you that have attended any of my presentations on data science, you know that I consider Johannes Kepler to be one of the first true data scientists. Like Charlie, Johannes gathered data from the real world, (painstakingly collected over years by Tycho Brahe), did some analysis, derived a model to represent the data, then began to explain the model and the data to others. As new data came in, Kepler refined his model until all the data points fit with his model. From there Kepler was able to make predictions, refine his results and show others how the real world worked.

There are many other shows that apply the principles of Forensics to solving crime, some of them are quite interesting, although I am not sure  of the veracity of the capabilities of crime solvers to do some of the things that their Television counterparts do on a weekly basis.

Numb3rs, to me will always be about using Data Science to solve real world problems.If you haven't seen an episode, the whole series is on Netflix.

After all, isn't that what the data we work with on a daily basis represents? Something in the "real world" ?


PloyGlot Management

PloyGlot Data Management


What is the data stored in? 

A traditional database administrator is familiar with a set of utilities, SQL, and a stack of tools specific to the RDBMS that they are supporting.

SQL Server, SQL Server Analysis Service, SQL Server Integration Services, SQL Server Reporting Services. SQL Server Management Studio.

Not to mention the whole disk partitioning, SAN layout, RAID level and such that they need to know.

Oracle has it's own set of peculiarities as does DB2, MySQL, and others.

There is a new kid on the block.


Martin Fowler does a phenomenal job introducing the topic of NoSQL in this talk 

While most people do not get into too many of the details about how their data is stored and structured, the world of #DataOps the nature of how and where the data is stored becomes incredibly important.

In any given day managing a PloyGlot environment the command line tools used could be sqlplus, bcp, sqlplus,hdfs,spark-shell, xd-shell.

Each of these data storage engines have different requirements. The manner in which databases are clustered when moving between RDBMS, and the NoSQL platforms are quite different.

Putting Hadoop or Cassandra on a SAN should require due thought before doing so.

Likewise creating an environment with isolated disks for an Oracle Cluster may not be the correct solution.

Managing a PolyGlot environment, is by itself a challenge. Sometimes this requires a team,
sometimes this requires a lightly shaded purple squirrel that is equally at home at the command line, the SQL prompt,a REPL, a console, a white board, an AWS management browser, or a management console like Grid control, Ops Center, or Cloudera Manager.

Working with this variety of data, and the variety of the types of teams, and people that need access to this data, or even a subset of the data requires it's own level of understanding of the data, data management, and how to make the data itself work to contribute to the bottom line of the organization.

Are your purple squirrels only on the data science team? Probably not.



SPARCL - Simple Python/Awk Reduce Command Line

I need to analyze some data.
It is not overly large, so it isn't really a Big Data problem. However, working with the data in Excel is unwieldy.

I really need to analyze the structure of the data, and the unique values of either a single column, or combinations of columns.

There are actually a number of ways to do this.
Load the data into a MySQL database run SQL Group-by statements on the table.
Load it in to R, use the sqldf package to do the same thing.
Write a Python script to read the whole file and spit out combinations of distinct values for various columns.

All of these approaches have their advantages and disadvantages.

Each of them take a bit of time, and some of them just sit in memory.

Then I remembered a class I took from Cloudera some time ago. In the class the instructor showed us how to do a simple Map Reduce program without invoking Hadoop.

The command is simple:  
cat filename | ./mapper.py | sort | ./reducer.py

He suggested we run our code through this pipeline before submitting a full on hadoop job, just to make sure there were no syntax errors, and all of the packages were available to the python command interpreter.

This is exactly what I need to do.

The first attempt to do this I wrote a simple awk script for the "mapper" portion.
#!/usr/bin/awk -f
BEGIN { FS ="," }
{print $1;}

This I ran with:
cat file | ./parser.awk  |sort  |./reducer.py

However, as time went on I needed to either look at different columns from my CSV file, or combinations of columns from my CSV file.

Rather than do a lot of AWK coding I wrote this Python:
import sys
line = []
indexes = []
list_arg = 0
if len(sys.argv) == 2:
        index = int(sys.argv[1])
        list_arg = 1
        indexes = sys.argv[1:]
for data in sys.stdin:
        line = data.split(',')
        if list_arg == 0:
                print "{0},1".format(line[index])
                string = ""
                for index in indexes:
                        string = string+line[int(index)]
                        if (index == indexes[-1]):
                                string = string+"\t1"
                                string = string+","
                print string

Now I can do:
cat file | ./parser.py 2 4 6 99 3 | sort | ./reducer.py

The reducer function I will keep on my github to keep it easy to read.

Feel free to give this a try. I find it works best from within a VM like the DataScienceAtTheCommandLine

Good luck, and comment if you find this useful.



Do you need a #DataTherapist? 

A few weeks ago, I was  enjoying a lunch with some of my coworkers. We were discussing some of the use cases of Data Science and Business analysis that we are building for our various clients.

Somebody made the comment, that some use-cases don't need a data scientist they need a data therapist.

Many of us laughed, and I even made a Twitter comment about this.  A number of other people on Twitter not only retweeted the comment, but began to make comments about the application of a #DataTherapist to particular use-cases.

Here are a few recent definitions that have evolved in the Twitterverse related to the data hashtag: #Data.

My definition of Data Science: The application of Statistical and Mathematical rigor to Business Data. There should be the proper application of confidence intervals, and p-values to data when making decisions. When doing some type of predictive analysis this becomes even more important.  Data Scientists also do research with all of your business data to include even adding third-party data as a source of enrichment in order to best understand what has happened historically, and what is likely to happen given a set of conditions in the future.

When doing historical reporting, the analyst is reporting on facts that have occurred that the data represents. This is usually your Data Warehouse, Business Intelligence type data product. These things are repeatable, predefined business questions.

A Data Architect designs how an infrastructure should be built from the servers to the data model, and works with other Architects to ensure things are streamlined to meet the goals and objectives of the organization.

#DataOps, are the people that make the Data Architects vision happen. They not only keep the lights on, but also make it easy for the Data Scientist, Business Analysts, Application Developers, and all other manner of people that have a need to work with an enterprises Data Products to do so.

But what is a Data Therapist?

What started out as a "joke" may truly be something important.

So here goes. An initial definition of a #DataTherapist.

A Data Therapist is the person, or group of people that shows the way for what not only Could be done with an enterprises data, but what Should be done with the Enterprises data. They scope out data projects, and work with the executive team to clear the runways for the Data Architect, Data Scientist, Business Analyst and the DataOps team to get their job done.

Not all data problems require a Data Scientist.

Not all Data Science problems are Big Data problems.

Not all Big Data problems are Data Science problems.

Data, regardless of structure, use-case, application, or data product supported will continue to grow. The leaders in this space continue to push the limits on what can be done with Python, R, Spark, SQL, and other tools.

The Data Therapist shines a light on what can be done, and whether your organization should undertake a large scale data project. They also can tell you when you need to stop doing something and simply start anew.

Are you ready to listen to the #DataTherapist?

You may not always like what they have to say.

Please comment below about how you think a Data Therapist can help an organization.