2015-06-25

edX Intro to Big Data with Apache Spark

BerkeleyX - CS100.1x


The title of the course is Intro to Big Data with Apache Spark. 

This course is a collaboration between UC Berkeley and Databricks to formally introduce many continuous learners to Apache Spark. 

I have a minor (very minor) advantage, in that I have actually done a few small research projects around Apache Spark, including a writeup on setting up Spark 1.4 to use SparkR on Windows 7.

Week 1 - Data Science Background and Course Setup
Week 2 - Introduction to Apache Spark
Week 3 - Data Management
Week 4 - Data Quality, Exploratory Data Analysis and Machine Learning


The setup portion is, to me, very valuable. Creating a locally running Spark environment can be a little tedious. Working with Databricks Cloud is a dream. I haven't tried running things on EC2, but it does take a couple of steps to get things going.

The labs are straightforward, with a few challenges to make you think about what you are doing. Regular expressions are incredibly useful, not only for this class but in a variety of other settings. I would strongly recommend reviewing regular expressions and Python before getting started.
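As a quick refresher on the kind of regular expression work worth practicing, here is a minimal sketch in plain Python. The sample log line, the pattern, and the function name parse_log_line are my own illustration, not material taken from the labs.

```python
import re

# Hypothetical web server log line, made up for illustration.
LINE = ('127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /images/launch-logo.gif HTTP/1.0" 200 1839')

# One capture group per field: host, two identity fields, timestamp,
# request method, path, protocol, status code, and response size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)$'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, _, _, timestamp, method, path, protocol, status, size = match.groups()
    return {
        "host": host,
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(status),
        # A size of "-" means no response body was sent.
        "size": 0 if size == "-" else int(size),
    }


if __name__ == "__main__":
    print(parse_log_line(LINE))
```

Returning None for lines that do not match makes it easy to count or inspect the bad lines separately instead of crashing on them.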

The instructor is very active on the discussion forums and has done a phenomenal job of working with some very frustrated students. 

I think the frustration of many of the students is a result of being new to functional programming. There have been discussions around this, and I think the next iteration of the class may spend more time emphasizing it.
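For anyone who has not seen the functional style before, here is a minimal sketch in plain Python, with no Spark required. The word list and the helper name add_count are made up for illustration; the point is only the map, filter, reduce pattern of passing small functions around instead of writing loops that modify variables.

```python
from functools import reduce

words = ["spark", "rdd", "spark", "lambda", "rdd", "spark"]

# map: turn every word into a (word, 1) pair.
pairs = map(lambda w: (w, 1), words)

# filter: keep only the pairs whose word is longer than three characters.
long_pairs = filter(lambda kv: len(kv[0]) > 3, pairs)


def add_count(counts, kv):
    """Fold one (word, n) pair into a new dict without mutating the old one."""
    word, n = kv
    return {**counts, word: counts.get(word, 0) + n}


# reduce: combine all remaining pairs into a single dict of counts.
word_counts = reduce(add_count, long_pairs, {})

if __name__ == "__main__":
    print(word_counts)
```

Nothing is computed by map or filter until reduce pulls the values through, which is a small taste of the lazy evaluation that tends to surprise people new to this style.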

Overall, each lab has demonstrated some new capability. The later labs do make reference to the earlier ones. 

The class is still going on, and I believe students can still enroll. 

This class can be taken by itself or in conjunction with the follow-up course: Scalable Machine Learning.

If the follow-on course is as exciting as this course is, it will be well worth the time invested in learning Spark Machine Learning. 

Now if they would come up with a full edX course on GraphX, that would be awesome!




2015-06-12

SparkR on Windows 7

Apache SparkR on Windows 7

Like many people, I have been looking forward to working with SparkR. It was released just yesterday, and you can find the full Apache Spark download here: Apache Spark Download

I have set up SparkR on a couple of machines to run locally for testing. I did have a few hiccups getting it working, so I will document what I did here.

1. JDK 1.8.0_45
2. R 3.1.3
3. Apache Spark prebuilt with Hadoop-2.6
4. hadoop-common-2.2.0-bin-master
5. Fix logging. 
6. Paths
7. Run as Administrator.

1. Install JDK 1.8.0_45, available here: jdk 1.8
2. Install R 3.1.3, available here: R 3.1.3

The first two items are standard Windows-based installs, so they don't have to be put in any particular location. For the following downloads, I recommend creating a directory called Spark under your My Documents folder (C:\Users\<yourname>\Documents\Spark).

3. Download Spark 1.4 prebuilt with Hadoop 2.6, available here: Spark 1.4, and unzip it into the Spark directory just referenced.


4. This link describes the winutils problem pretty well. Unzip hadoop-common-2.2.0-bin-master and set up a HADOOP_HOME environment variable that points to the unzipped folder. I unzipped it under the Spark directory mentioned above.


5. Copy spark-1.4.0-bin-hadoop2.6\conf\log4j.properties.template to spark-1.4.0-bin-hadoop2.6\conf\log4j.properties.
Edit the new log4j.properties file.
Change:
log4j.rootCategory=INFO, console
to:
log4j.rootCategory=ERROR, console
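
If you would rather script the copy and edit than do it by hand, here is a minimal sketch in Python. The function names quiet_log4j and write_quiet_config are my own, and the script assumes the template contains a log4j.rootCategory line like the one shown above.

```python
import re
from pathlib import Path


def quiet_log4j(template_text, level="ERROR"):
    """Return the properties text with the root logger level replaced."""
    return re.sub(
        r"^(log4j\.rootCategory=)\w+",
        r"\g<1>" + level,
        template_text,
        flags=re.MULTILINE,
    )


def write_quiet_config(conf_dir, level="ERROR"):
    """Copy log4j.properties.template to log4j.properties with a new level.

    conf_dir is the Spark conf folder, for example
    spark-1.4.0-bin-hadoop2.6\\conf under your Spark directory.
    """
    conf_dir = Path(conf_dir)
    template = conf_dir / "log4j.properties.template"
    target = conf_dir / "log4j.properties"
    target.write_text(quiet_log4j(template.read_text(), level))
    return target
```

Call write_quiet_config with the path to your own conf folder. Only the rootCategory line changes; every other line of the template is copied as-is.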

6. Make sure your path includes all the tools you just set up. A portion of my path is:
C:\Program Files\Java\jdk1.8.0_45\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\sbin;C:\Users\<yourname>\Documents\R\R-3.1.3\bin\x64

7. Run a command prompt as Administrator.

Whenever I missed any of these steps, I ran into a number of issues getting things to work properly.
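
Since a missed step is easy to overlook, here is a rough preflight check in Python that covers steps 4 and 6. It is a sketch under a few assumptions of mine: that winutils.exe sits in the bin folder under HADOOP_HOME, and that the tools on the path are named java.exe, R.exe, and sparkR.cmd. Adjust the names if your install differs.

```python
import os

# Executables the steps above should put on PATH (assumed file names).
EXPECTED_TOOLS = ["java.exe", "R.exe", "sparkR.cmd"]


def find_on_path(filename, path_value, exists=os.path.isfile):
    """Return the first PATH directory that contains filename, or None."""
    for directory in path_value.split(os.pathsep):
        directory = directory.strip().strip('"')
        if directory and exists(os.path.join(directory, filename)):
            return directory
    return None


def preflight(env, exists=os.path.isfile):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    hadoop_home = env.get("HADOOP_HOME", "")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not exists(os.path.join(hadoop_home, "bin", "winutils.exe")):
        problems.append("winutils.exe not found under HADOOP_HOME\\bin")
    for tool in EXPECTED_TOOLS:
        if find_on_path(tool, env.get("PATH", ""), exists) is None:
            problems.append(tool + " not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight(os.environ)
    for issue in issues:
        print("PROBLEM:", issue)
    if not issues:
        print("All checks passed.")
```

The exists parameter is there only so the checks can be exercised without touching the real file system. It does not check step 7, so you still need to open the command prompt as Administrator yourself.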

Now you can run sparkR from that command prompt.


Have fun with Apache SparkR. I know I have much to try!

Good luck.