edX Intro to Big Data with Apache Spark

BerkeleyX - CS100.1x

The title of the course is Intro to Big Data with Apache Spark. 

This course is a collaboration between UC Berkeley and Databricks, intended to formally introduce many continuous learners to Apache Spark.

I have a minor (very minor) advantage, in that I have actually done a few small research projects around Apache Spark, including a writeup on setting up Spark 1.4 to use SparkR on Windows 7.

Week 1 - Data Science Background and Course Setup
Week 2 - Introduction to Apache Spark
Week 3 - Data Management
Week 4 - Data Quality, Exploratory Data Analysis and Machine Learning

The setup portion, to me, is very valuable. Creating a locally running Spark environment can be a little tedious. Working with Databricks Cloud is a dream. I haven't tried running things on EC2, but it does take a couple of steps to get things going.

The labs are straightforward, with a few challenges to make you think about what you are doing. Regular expressions are incredibly useful, not only for this class but in a variety of other settings. I would strongly recommend reviewing regular expressions and Python before getting started.
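
As a small illustration of the kind of regular expression practice that pays off, here is a Python sketch that pulls fields out of a web-server-style log line. The sample line, the pattern, and the function name parse_line are all made up for this example; they are assumptions for illustration, not material taken from the labs.

```python
import re

# Hypothetical pattern for a line shaped like:
#   127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" is treated as zero bytes in this sketch.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

# Invented sample line, used only to exercise the pattern.
sample = '127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839'
print(parse_line(sample))
```

Named groups keep the pattern readable, and returning None for lines that do not match makes it easy to count and inspect the failures separately.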

The instructor is very active on the discussion forums and has done a phenomenal job of working with some very frustrated students. 

I think the frustration of many of the students is a result of being new to functional programming. There have been discussions around this, and I think the next iteration of the class may spend more time emphasizing it.
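
For anyone new to the functional style, a plain Python sketch (no Spark required) may help. Spark's API is built around passing functions to operations such as map, filter, and reduce, and the same ideas exist in standard Python. The word list and the helper count_by_key are invented for this example; count_by_key is only similar in spirit to a keyed reduction, not Spark's implementation.

```python
from functools import reduce

# Invented input, purely for illustration.
words = ["spark", "is", "fun", "spark", "is", "fast"]

# map: build a new sequence by applying a function to each element.
pairs = list(map(lambda w: (w, 1), words))

# filter: keep only the elements that satisfy a predicate.
long_words = list(filter(lambda w: len(w) > 2, words))

# reduce: combine elements pairwise into a single value.
total_chars = reduce(lambda a, b: a + b, map(len, words))

def count_by_key(key_value_pairs):
    """Fold (key, count) pairs into a dict without mutating the accumulator."""
    def step(acc, pair):
        key, n = pair
        return {**acc, key: acc.get(key, 0) + n}
    return reduce(step, key_value_pairs, {})

counts = count_by_key(pairs)
print(counts)
```

The point to notice is that no step modifies its input. Each one takes data in and hands new data back, which is the habit that takes the most getting used to.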

Overall, each lab has demonstrated some new capability. The later labs do make reference to the earlier ones. 

The class is still going on, and I believe students can still enroll. 

This class can be taken by itself or in conjunction with the follow-up course: Scalable Machine Learning.

If the follow-on course is as exciting as this course is, it will be well worth the time invested in learning Spark Machine Learning. 

Now if they would come up with a full edX course on GraphX, that would be awesome!


SparkR on Windows 7

Apache SparkR on Windows 7

Like many people, I have been looking forward to working with SparkR. It was just released yesterday; you can find the full Apache Spark download here: Apache Spark Download

I have set up SparkR on a couple machines to run locally for testing. I did have a few hiccups getting it working, so I will document what I did here.

1. JDK 1.8.0_45
2. R 3.1.3
3. Apache Spark prebuilt with Hadoop-2.6
4. hadoop-common-2.2.0-bin-master
5. Fix logging. 
6. Paths
7. Run as Administrator.

1. Install the JDK 1.8.0_45 available here: jdk 1.8
2. Install R 3.1.3 available here: R 3.1.3

The first two items are standard Windows-based installs, so they don't have to be put in any particular location. For the following downloads, I recommend creating a directory called Spark under your My Documents folder (C:\Users\<yourname>\Documents\Spark).

3. Download Spark 1.4 prebuilt with Hadoop 2.6, available here: Spark 1.4, and unzip it to the Spark directory just referenced.

4. This link describes the winutils problem pretty well. Unzip hadoop-common-2.2.0-bin-master and set up a HADOOP_HOME environment variable pointing to it. I created this under the Spark directory mentioned above.

5. Copy spark-1.4.0-bin-hadoop2.6\conf\ to spark-1.4.0-bin-hadoop2.6\conf\
Edit the file and change the line
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console

6. Make sure your path includes all the tools you just set up. A portion of my path is:
C:\Program Files\Java\jdk1.8.0_45\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\sbin;C:\Users\<yourname>\Documents\R\R-3.1.3\bin\x64

7. Run a command prompt as Administrator.

Whenever I skipped one of these steps, I ran into a number of issues getting things to work properly.

Now you can run sparkR.
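
Since a missed step is easy to overlook, a quick check of the environment can save some time. The following Python sketch is my own assumption of what such a check might look like, not part of the original setup: the function name check_setup is hypothetical, and the executable names it looks for (java, R, sparkR) are assumptions about what the PATH entries above provide.

```python
import os
import shutil

def check_setup(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Step 4: HADOOP_HOME must be set so the winutils binaries can be found.
    if not env.get("HADOOP_HOME"):
        problems.append("HADOOP_HOME is not set")

    # Step 6: the tools should be reachable through PATH.
    # These executable names are assumptions, not taken from the writeup.
    for tool in ("java", "R", "sparkR"):
        if which(tool, path=env.get("PATH", "")) is None:
            problems.append(f"{tool} was not found on PATH")

    return problems

if __name__ == "__main__":
    for problem in check_setup() or ["No problems found"]:
        print(problem)
```

Passing the environment and the lookup function in as arguments keeps the check easy to try out without touching the real system settings.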

Have fun with Apache SparkR. I know I have much to try!

Good luck.



Comparing Data Science and Business Intelligence

Over the past few years as I have been supporting more and more "non-traditional" (i.e. not a Data Warehouse or Data Mart) analytical platforms, I have noticed a number of differences between Data Science approaches and Business Intelligence approaches.

This image sums up many of my observations and gives a touch point for comparing the differences as well as similarities between the two approaches.

Reproducible versus Repeatable

One of the goals of #DataOps is to keep data moving to the right location in a repeatable, automated manner. In most of the data warehouse environments I have worked on, the person doing the analysis does not run the ETL jobs. Today's data flows into the existing data marts, dashboards, dimensional models, and the queries that drive it all. These are repeatable processes.

Performing a Reproducible process, on the other hand, shows the entire process soup to nuts. The analyst pulled this data from that system, used this transformation on these data elements, combined this data with that data, ran this regression, and produced this result. Therefore, if we raise the price of this widget by $0.05, we will have this lift in profit (ceteris paribus).
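
To make the idea concrete, here is a minimal Python sketch of a reproducible analysis in which every step from input to conclusion is written down as code. The numbers are invented purely for illustration and the function name fit_line is hypothetical; a real analysis would record where the data was pulled from and would use far more than five points.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up inputs, not real data: widget price and profit, both in dollars.
prices = [1.00, 1.05, 1.10, 1.15, 1.20]
profits = [200.0, 210.0, 220.0, 230.0, 240.0]

# Step 1: the transformation is recorded in code rather than done by hand.
prices_in_cents = [round(p * 100) for p in prices]

# Step 2: the regression is a pure function of the recorded inputs.
intercept, slope = fit_line(prices_in_cents, profits)

# Step 3: the conclusion is derived from the fitted model.
lift_for_five_cents = slope * 5
print(f"Estimated profit lift for a 5 cent increase: {lift_for_five_cents:.2f}")
```

Anyone who reruns this script on the same inputs gets the same answer, and anyone who disagrees with a step can see exactly which one to challenge.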

Predictive versus Descriptive

As described above, the Data Scientist attempts to make a prediction about something, whereas the Business Intelligence analyst is usually reporting on a condition that is considered a Key Performance Indicator of the company.

Explorative versus Comparative

In most Business Intelligence environments I have worked with, the questions are usually along these lines: "Is this product selling more than that product?"

The Data Scientist would want to look at which product has the highest margin, or which product has the largest impact on the bottom line. If someone buys product X, do they also purchase product Y? What else is impacting this particular store? Does weather have an impact on purchase patterns? What about Twitter hashtags? What in our product line is most similar to a product that has a high purchase volume in the overall consumer community?
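
The "if someone buys X, do they also buy Y" question can be sketched in a few lines of Python. The baskets below are invented for illustration, and the function names pair_confidence and most_common_pairs are hypothetical; this is a simple co-occurrence count, not a full market basket algorithm.

```python
from collections import Counter
from itertools import combinations

def pair_confidence(baskets, x, y):
    """Share of baskets containing x that also contain y (None if x never appears)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return None
    return sum(1 for b in with_x if y in b) / len(with_x)

def most_common_pairs(baskets, top=3):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts.most_common(top)

# Made-up purchase baskets, one set of products per store visit.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X"},
    {"Y", "Z"},
]

print(pair_confidence(baskets, "X", "Y"))
print(most_common_pairs(baskets))
```

The first function answers the question for one chosen pair, while the second explores the data for pairs nobody thought to ask about, which is closer to the explorative mindset described here.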

Attentive versus Advocating

Data Scientist: The data shows us that consumers who purchased X also purchased Y. I suggest we relocate Y so that in the stores in this geographic area it is 1 meter away from X, and in that geographic area it is 2 meters away. Then we will analyze the same-visit purchases for those two items to determine if this should be done in all stores.

Business Intelligence: The latest data from our campaign is shown here. The response rate among 18-to-24-year-old males is less than what we wanted, but we expect to see more lift in the coming weeks.

Accepting versus Prescriptive

Data Scientist: Give me all the data. I will analyze it as is and determine what needs to be cleaned and what represents further opportunities. If there is a quality issue, I will document it as part of the assumptions in my analysis.

Business Intelligence: The data has to be cleaned and high quality before it can be analyzed. No one should see the data before all of the quality checks, verifications, and cleansing processes are done.

Both of these approaches have business value.

I think Data Science will continue to get some press for quite a while. There will always be some amazing breakthrough where someone used the algorithm of the day to solve a business problem. Then the performance of that algorithm will become a metric on a dashboard that is put into a data mart.

The guys and gals in #DataOps will make sure the data is current.

The Data is protected.

The Data is available.

The Data is shown on the right report to the people that are authorized to see it.

Your Data is safe. 



Data Strategy, but no Spark Strategy, how cute.

I see blogs, and blogs, and articles, and presentations, and Slideshares on creating a Big Data Strategy. How does "it" (usually Hadoop) fit into your organization? Who should get access? Who should "it"? So many questions, comments, and opinions.

Guess what?

Your data is growing.

And growing.

If you use a graph to analyze the data flows in your organization, chances are you will see ways to cut costs and consolidate architectural components.

Bring things together and store them in one location.


Now what?

What tools do you use to analyze it?

You want to do the same thing you have always done?


Send two people in your organization to the upcoming Spark Summit. Let them show you there is a better way.

Spark is not just a new tool.

It is a new Path.




Was Charles Eppes the first fictional Data Scientist?


Sherlock Holmes is the best-known fictional detective, although many students of literature will tell you Holmes was not actually the first consulting detective.

A recent conversation about detectives and data scientists led me to wonder who could be considered the first fictional Data Scientist.

To answer this question we should consider what a data scientist is and what they do.

They work with data gathered in the real world, do some analysis, derive a model that can represent the data, explain the model and data to others, add new data to the model, perform some type of prediction, refine results, then test in the real world.

While there are many definitions, and I feel like the definition of the term Data Scientist changes on an ongoing basis, for the purpose of this article I think these general thoughts are sufficient.

On the show Numb3rs, Charles Eppes, played by David Krumholtz, was a mathematician who had a brother in the F.B.I. in California. Over a number of seasons Charlie, as he was affectionately known, worked through "doing the math" of hard problems, helping his brother and his team both solve crimes and understand the math behind his explanations. By taking this next step, explaining the math through analogies that non-mathematicians could understand, he was able to give them insight into the crime and the criminal.

For those of you who have attended any of my presentations on data science, you know that I consider Johannes Kepler to be one of the first true data scientists. Like Charlie, Johannes gathered data from the real world (painstakingly collected over years by Tycho Brahe), did some analysis, derived a model to represent the data, then began to explain the model and the data to others. As new data came in, Kepler refined his model until all the data points fit. From there Kepler was able to make predictions, refine his results, and show others how the real world worked.

There are many other shows that apply the principles of forensics to solving crime. Some of them are quite interesting, although I am not sure real crime solvers can do some of the things that their television counterparts do on a weekly basis.

Numb3rs, to me, will always be about using Data Science to solve real-world problems. If you haven't seen an episode, the whole series is on Netflix.

After all, isn't that what the data we work with on a daily basis represents? Something in the "real world"?