2015-12-29

Learning new things for Dummies

A floating dummy used for man over board training, commonly referred to as Oscar. (Photo credit: Wikipedia)

Learning new things for Dummies


I read incessantly.

Seldom do I read anything that is fiction any longer.

I find I simply have too little time that I can dedicate to leisure reading.

However, the topics I do read about vary widely.

Algebra, Geometry, Calculus, Statistics, Probability, Genetics, Econometrics, Weather, Day-Trading.

I have learned about all of these topics through Wiley's "...for Dummies" series of books.

Each book that I have read runs a little over 300 pages. That usually takes me slightly less than a week, depending on my interest and on whether I am completely unfamiliar with the topic. They all come with extensive reference information. If you read through one of the books and need more detail, it will point you in the right direction.

Web links, other books in the series, and other books generally used in the field are all referenced.

The main thing that drew me to the series, and keeps me buying new books in it again and again, is in the title itself.

If you are new to a subject you are a dummy. 

English: Histogram of sepal widths for Iris versicolor from Fisher's Iris flower data set. SVG redraw of original image. (Photo credit: Wikipedia)
 Guess what? We all are dummies about one subject or another.

These books seldom assume that the reader has in-depth knowledge of the topic or even of related topics. They do not even assume that the reader is familiar with basic terms.

In the twenty or so books in this series that I have read so far, none has used a term that was new to me without properly introducing it.

I get some of the books as an overview of a topic I have a mild interest in, so that I can discuss it more intelligently with people who are more fluent in it than I am.

For other topics, I have purchased whole collections of interrelated books in the series (Algebra, Geometry, Trigonometry, Calculus, Statistics, and Probability). They serve as a foundation refresher, since I have not studied some of these topics for some time.

I do not limit my reading to this series, but it is usually my starting point when I set out to learn a new subject. So far, the foundation this series lays has let me press onward to even advanced topics quickly and efficiently, with little time lost looking for a glossary of terms.

Now, go learn a new subject, you Dummy!








2015-06-25

edX Intro to Big Data with Apache Spark

BerkeleyX - CS100.1x


The title of the course is Intro to Big Data with Apache Spark. 

This course is a collaboration between UC Berkeley and Databricks to formally introduce many continuous learners to Apache Spark.

I have a minor (very minor) advantage, in that I have actually done a few small research projects around Apache Spark, including a writeup on setting up Spark 1.4 to use SparkR on Windows 7.

Week 1 - Data Science Background and Course Setup
Week 2 - Introduction to Apache Spark
Week 3 - Data Management
Week 4 - Data Quality, Exploratory Data Analysis, and Machine Learning


To me, the setup portion is very valuable. Creating a locally running Spark environment can be a little tedious. Working with Databricks Cloud is a dream. I haven't tried running things on EC2, but it does take a couple of steps to get things going.

The labs are straightforward, with a few challenges to make you think about what you are doing. Regular expressions are incredibly useful, not only for this class but in a variety of other settings. I would strongly recommend reviewing regular expressions and Python before getting started.

The instructor is very active on the discussion forums and has done a phenomenal job of working with some very frustrated students. 

I think the frustration of many of the students is a result of being new to functional programming. There have been discussions around this, and I think the next iteration of the class may spend more time emphasizing it.

Overall, each lab has demonstrated some new capability. The later labs do make reference to the earlier ones. 

The class is still going on, and I believe students can still enroll. 

This class can be taken by itself or in conjunction with the follow-up course: Scalable Machine Learning.

If the follow-on course is as exciting as this course is, it will be well worth the time invested in learning Spark Machine Learning. 

Now if they would come up with a full edX course on GraphX, that would be awesome!




2015-06-12

SparkR on Windows 7

Apache SparkR on Windows 7

Like many people, I have been looking forward to working with SparkR. It was released just yesterday; you can find the full Apache Spark download here: Apache Spark Download

I have set up SparkR on a couple machines to run locally for testing. I did have a few hiccups getting it working, so I will document what I did here.

1. JDK 1.8.0_45
2. R 3.1.3
3. Apache Spark prebuilt with Hadoop-2.6
4. hadoop-common-2.2.0-bin-master
5. Fix logging. 
6. Paths
7. Run as Administrator.

1. Install the JDK 1.8.0_45 available here: jdk 1.8
2. Install R 3.1.3 available here: R 3.1.3

The first two items are standard Windows-based installs, so they don't have to be put in any particular location. For the following downloads, I recommend creating a directory called Spark under your My Documents folder (C:\Users\<yourname>\Documents\Spark).

3. Download Spark 1.4 prebuilt with Hadoop 2.6, available here: Spark 1.4, and unzip it to the Spark directory just referenced.


4. This link describes the winutils problem pretty well. Unzip hadoop-common-2.2.0-bin-master and set up a HADOOP_HOME environment variable that points to it. I created this under the Spark directory mentioned above.


5. Copy spark-1.4.0-bin-hadoop2.6\conf\log4j.properties.template to spark-1.4.0-bin-hadoop2.6\conf\log4j.properties.
Edit log4j.properties.
Change:
log4j.rootCategory=INFO, console
to:
log4j.rootCategory=ERROR, console

6. Make sure your path includes all the tools you just set up. A portion of my path is:
C:\Program Files\Java\jdk1.8.0_45\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\sbin;C:\Users\<yourname>\Documents\R\R-3.1.3\bin\x64

7. Run a command prompt as Administrator.

Whenever I missed any of these steps, I ran into a number of issues getting things to work properly.
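A small pre-flight check can catch a skipped step before sparkR fails. The following is a minimal Python 3 sketch, not part of the original setup: the function name check_sparkr_setup is hypothetical, and it assumes that winutils.exe sits in the bin folder under HADOOP_HOME and that java, Rscript, and sparkR are the commands that should resolve from the PATH.

```python
import os
import shutil


def check_sparkr_setup(environ=None, which=shutil.which, isfile=os.path.isfile):
    """Return a list of problems found; an empty list means every check passed.

    The lookups are passed in as parameters so the function can be exercised
    without touching the real machine.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # Step 4: HADOOP_HOME must be set and (assumption) contain bin\winutils.exe.
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    else:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)

    # Step 6: the JDK, R, and Spark bin directories must be on the PATH.
    for tool in ("java", "Rscript", "sparkR"):
        if which(tool) is None:
            problems.append(tool + " is not on the PATH")

    return problems


if __name__ == "__main__":
    for problem in check_sparkr_setup():
        print("PROBLEM: " + problem)
```

Running it from the same Administrator command prompt you plan to use for sparkR checks the environment that sparkR will actually see.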

Now you can run sparkR.


Have fun with Apache SparkR. I know I have much to try!

Good luck.

2015-05-27

Data-Science-or-Business-Intelligence?

Comparing Data Science and Business Intelligence


Over the past few years as I have been supporting more and more "non-traditional" (i.e. not a Data Warehouse or Data Mart) analytical platforms, I have noticed a number of differences between Data Science approaches and Business Intelligence approaches.

This image sums up many of my observations and gives a touch point for comparing the differences as well as similarities between the two approaches.



Reproducible versus Repeatable


One of the goals of #DataOps is to keep data moving to the right location in a repeatable, automated manner. In most of the data warehouse environments I have worked on, the person doing the analysis does not run the ETL jobs. Today's data flows into existing data marts, dashboards, dimensional models, and the queries that drive it all. These are repeatable processes.

A reproducible process, on the other hand, shows the entire process, soup to nuts. The analyst pulled this data from that system, used this transformation on these data elements, combined this data with that data, ran this regression, and produced this result. Therefore, if we raise the price of this widget by $0.05, we will have this lift in profit (ceteris paribus).

Predictive versus Descriptive


As described above, the Data Scientist attempts to make a prediction about something, whereas the Business Intelligence analyst is usually reporting on a condition that is considered a Key Performance Indicator of the company.

Explorative versus Comparative


In most Business Intelligence environments I have worked with, the questions are usually along these lines: "Is this product selling more than that product?"

The Data Scientist would want to look at which product has the highest margin, or which product has the largest impact on the bottom line. If someone buys product X, do they also purchase product Y?
What else is impacting this particular store? Does weather have an impact on purchase patterns? What about Twitter hashtags? What in our product line is most similar to a product that has a high purchase volume in the overall consumer community?

Attentive versus Advocating


Data Scientist: The data shows us that consumers who purchased X also purchased Y. I suggest we relocate Y for the stores in this geographic area to 1 meter away from X, and in this other geographic area to 2 meters away. Then we will analyze the same-visit purchases for those two items to determine if this should be done in all stores.

Business Intelligence: The latest data from our campaign is shown here. The response rate among 18-24-year-old males is less than what we wanted, but we expect to see more lift in the coming weeks.

Accepting versus Prescriptive



Data Scientist: Give me all data, I will analyze it as is, and determine what needs to be cleaned and what represents further opportunities. If there is a quality issue I will document it as part of the assumptions in my analysis.









Business Intelligence: The data has to be cleaned and high quality before it can be analyzed. No one should see the data before all of the quality checks, verifications, and cleansing processes are done.







Both of these approaches have business value.

I think Data Science will continue to get some press for quite a while. There will always be some amazing breakthrough in which someone used the algorithm of the day to solve a business problem. Then the performance of that algorithm will become a metric on a dashboard that is put into a data mart.

The guys and gals in #DataOps will make sure the data is current.

The Data is protected.

The Data is available.

The Data is shown on the right report to the people that are authorized to see it.

Your Data is safe. 



2015-05-26

Spark-Strategy

Data Strategy, but no Spark Strategy, how cute.


I see blogs, and blogs, and articles, and presentations, and Slideshares on creating a Big Data Strategy.  How does "it" (usually Hadoop) fit into your organization, who should get access, who should "it", so many questions, comments and opinions.

Guess what?

Your data is growing.

And growing.

If you use a Graph to analyze the data flows in your organization, chances are you will see ways to cut costs and consolidate architectural components.

Bring things together and store them in one location.

Done!

Now what?

What tools do you use to analyze it?

You want to do the same thing you have always done?

Really?

Send two people in your organization to the upcoming Spark Summit. Let them show you there is a better way.





Spark is not just a new tool.

It is a new Path.

 




2015-05-11

The-First-Data-Scientist?

Was Charles Eppes the first fictional Data Scientist?

 

Sherlock Holmes is the best-known fictional detective, although many students of literature will tell you Holmes was not actually the first consulting detective.

A recent conversation about detectives, and data scientists led me to wonder who could be considered the first fictional Data Scientist.

To answer this question we should consider what a data scientist is and what they do.

They work with data gathered in the real world, do some analysis, derive a model that can represent the data, explain the model and data to others, add new data to the model, perform some type of prediction, refine results, then test in the real world.

While there are many definitions, and I feel like the definition of the term Data Scientist changes on an ongoing basis, for the purpose of this article I think these general thoughts are sufficient.

On the show Numb3rs, Charles Eppes, played by David Krumholtz, was a mathematician who had a brother in the F.B.I. in California. Over a number of seasons, Charlie, as he was affectionately known, worked through "doing the math" of hard problems, helping his brother and his team both solve crimes and understand the math behind his explanations. By taking this next step, explaining the math through analogies that non-mathematicians could understand, he was able to give them insight into the crime and the criminal.

For those of you who have attended any of my presentations on data science, you know that I consider Johannes Kepler to be one of the first true data scientists. Like Charlie, Johannes gathered data from the real world (painstakingly collected over years by Tycho Brahe), did some analysis, derived a model to represent the data, then began to explain the model and the data to others. As new data came in, Kepler refined his model until all the data points fit it. From there Kepler was able to make predictions, refine his results, and show others how the real world worked.

There are many other shows that apply the principles of forensics to solving crime. Some of them are quite interesting, although I am not sure real crime solvers can do some of the things their television counterparts do on a weekly basis.

Numb3rs, to me, will always be about using Data Science to solve real world problems. If you haven't seen an episode, the whole series is on Netflix.


After all, isn't that what the data we work with on a daily basis represents? Something in the "real world"?






2015-05-07

PolyGlot Management

PolyGlot Data Management

 

What is the data stored in? 


A traditional database administrator is familiar with a set of utilities, SQL, and a stack of tools specific to the RDBMS that they are supporting.


SQL Server, SQL Server Analysis Services, SQL Server Integration Services, SQL Server Reporting Services, SQL Server Management Studio.

Not to mention the whole disk partitioning, SAN layout, RAID level and such that they need to know.

Oracle has its own set of peculiarities, as do DB2, MySQL, and others.

There is a new kid on the block.

NoSQL.

Martin Fowler does a phenomenal job introducing the topic of NoSQL in this talk 

While most people do not get into too many of the details about how their data is stored and structured, in the world of #DataOps how and where the data is stored becomes incredibly important.

On any given day managing a PolyGlot environment, the command line tools used could be sqlplus, bcp, hdfs, spark-shell, or xd-shell.

Each of these data storage engines has different requirements. The ways in which databases are clustered on RDBMS platforms and on NoSQL platforms are quite different.

Putting Hadoop or Cassandra on a SAN deserves due thought before you do so.

Likewise creating an environment with isolated disks for an Oracle Cluster may not be the correct solution.


Managing a PolyGlot environment is, by itself, a challenge. Sometimes this requires a team; sometimes it requires a lightly shaded purple squirrel that is equally at home at the command line, the SQL prompt, a REPL, a console, a white board, an AWS management browser, or a management console like Grid Control, OpsCenter, or Cloudera Manager.

Working with this variety of data, and with the variety of teams and people that need access to this data or even a subset of it, requires its own level of understanding of the data, of data management, and of how to make the data itself contribute to the bottom line of the organization.

Are your purple squirrels only on the data science team? Probably not.



2015-05-04

SPARCL-Simple-Python/AWK-Reduce-Command-Line

SPARCL - Simple Python/Awk Reduce Command Line

I need to analyze some data.
It is not overly large, so it isn't really a Big Data problem. However, working with the data in Excel is unwieldy.

I really need to analyze the structure of the data, and the unique values of either a single column, or combinations of columns.

There are actually a number of ways to do this.
Load the data into a MySQL database and run SQL GROUP BY statements on the table.
Load it into R and use the sqldf package to do the same thing.
Write a Python script to read the whole file and spit out combinations of distinct values for various columns.

All of these approaches have their advantages and disadvantages.

Each of them takes a bit of time, and some of them just sit in memory.

Then I remembered a class I took from Cloudera some time ago. In the class the instructor showed us how to do a simple Map Reduce program without invoking Hadoop.

The command is simple:  
cat filename | ./mapper.py | sort | ./reducer.py

He suggested we run our code through this pipeline before submitting a full-on Hadoop job, just to make sure there were no syntax errors and that all of the packages were available to the Python interpreter.

This is exactly what I need to do.

For my first attempt, I wrote a simple awk script for the "mapper" portion.
parser.awk:
#!/usr/bin/awk -f
BEGIN { FS ="," }
{print $1;}


This I ran with:
cat file | ./parser.awk | sort | ./reducer.py

However, as time went on I needed to either look at different columns from my CSV file, or combinations of columns from my CSV file.

Rather than do a lot of AWK coding I wrote this Python:
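#!/usr/bin/env python
"""Mapper: emit the selected CSV column(s) of each stdin line with a count of 1."""
from __future__ import print_function  # the same script runs on Python 2 and 3
import sys

# Zero-based column indexes from the command line, e.g. ./parser.py 2 4 6
indexes = [int(arg) for arg in sys.argv[1:]]

for data in sys.stdin:
    # Strip the newline so the last column stays clean, then split naively.
    # This assumes no quoted fields that themselves contain commas.
    line = data.rstrip('\n').split(',')
    if len(indexes) == 1:
        # One column: the key and the count are separated by a comma.
        print("{0},1".format(line[indexes[0]]))
    else:
        # Several columns: a comma-joined key, then a tab and the count.
        print("{0}\t1".format(','.join(line[i] for i in indexes)))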


Now I can do:
cat file | ./parser.py 2 4 6 99 3 | sort | ./reducer.py

I keep the reducer on my GitHub so that this post stays easy to read.
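For readers who want to try the pipeline without that file, here is a minimal Python 3 sketch of what a counting reducer could look like. It is an illustration and not the reducer from my GitHub: the names parse and reduce_counts are hypothetical, and it assumes the input is the sorted output of parser.py, where the count follows the last tab (several columns) or the last comma (one column).

```python
from itertools import groupby


def parse(line):
    """Split one mapper line into (key, count).

    Assumption: the count follows the last tab when the line has one,
    otherwise it follows the last comma.
    """
    line = line.rstrip("\n")
    separator = "\t" if "\t" in line else ","
    key, _, count = line.rpartition(separator)
    return key, int(count)


def reduce_counts(lines):
    """Sum the counts of consecutive identical keys.

    The input must already be sorted (the `sort` step of the pipeline),
    because groupby only merges keys that are adjacent.
    """
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


# Small in-memory demonstration with sorted, tab-separated mapper output.
sample = ["a,x\t1\n", "a,x\t1\n", "b,y\t1\n"]
for key, total in reduce_counts(sample):
    print("{0}\t{1}".format(key, total))
```

To use it as the last stage of the pipeline, replace the sample list with sys.stdin (after importing sys) and save the file as an executable reducer.py.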

Feel free to give this a try. I find it works best from within a VM like DataScienceAtTheCommandLine.

Good luck, and comment if you find this useful.


2015-05-03

#DataTherapist

Do you need a #DataTherapist? 

A few weeks ago, I was enjoying lunch with some of my coworkers. We were discussing some of the use cases of Data Science and Business Analysis that we are building for our various clients.

Somebody made the comment that some use cases don't need a data scientist; they need a data therapist.

Many of us laughed, and I even made a Twitter comment about this.  A number of other people on Twitter not only retweeted the comment, but began to make comments about the application of a #DataTherapist to particular use-cases.

Here are a few recent definitions that have evolved in the Twitterverse related to the data hashtag: #Data.

My definition of Data Science: the application of statistical and mathematical rigor to business data. Confidence intervals and p-values should be properly applied to the data when making decisions. When doing some type of predictive analysis, this becomes even more important. Data Scientists also do research with all of your business data, even adding third-party data as a source of enrichment, in order to best understand what has happened historically and what is likely to happen in the future given a set of conditions.


When doing historical reporting, the analyst is reporting on facts that have occurred that the data represents. This is usually your Data Warehouse, Business Intelligence type data product. These things are repeatable, predefined business questions.

A Data Architect designs how an infrastructure should be built from the servers to the data model, and works with other Architects to ensure things are streamlined to meet the goals and objectives of the organization.

#DataOps are the people who make the Data Architect's vision happen. They not only keep the lights on, but also make it easy for Data Scientists, Business Analysts, Application Developers, and all other manner of people who need to work with an enterprise's Data Products to do so.


But what is a Data Therapist?

What started out as a "joke" may truly be something important.

So here goes. An initial definition of a #DataTherapist.

A Data Therapist is the person, or group of people, that shows the way not only for what could be done with an enterprise's data, but for what should be done with it. They scope out data projects, and work with the executive team to clear the runways for the Data Architect, Data Scientist, Business Analyst, and the DataOps team to get their jobs done.


Not all data problems require a Data Scientist.

Not all Data Science problems are Big Data problems.

Not all Big Data problems are Data Science problems.

Data, regardless of structure, use-case, application, or data product supported will continue to grow. The leaders in this space continue to push the limits on what can be done with Python, R, Spark, SQL, and other tools.

The Data Therapist shines a light on what can be done, and whether your organization should undertake a large scale data project. They also can tell you when you need to stop doing something and simply start anew.


Are you ready to listen to the #DataTherapist?

You may not always like what they have to say.

Please comment below about how you think a Data Therapist can help an organization.

2015-04-24

#DataOps

#DataOps: Oh, that's what they call what I do?

I just read a few amazing articles:

Agile DataOps
DataOps a New Discipline

Trying to explain to people what I do becomes tedious at times.


Yes, I can build a Cassandra Cluster, or an Oracle Cluster, or a Cloudera Cluster. Each of those clusters has its own challenges and rewards. But once the Cluster is built and we start putting data into it, what do you want to do with that data? How do you want the data organized? Is your data model correct? Do you have a data model? Do you know how big the cluster or database is going to get? Have you done a volumetric calculation? Is your budget big enough to allow for fail-over? Downtime? Maintenance? Disaster recovery practice?

Data Scientist, Business Analyst, Executive Analyst, Business User, Project Manager.


Each of these specialties has its own unique challenges. Being The Data Guy in an organization requires the ability to at least communicate effectively with all of these specialists.

Here are some of the things we do:

Oracle Certifications, Microsoft Certifications, Red Hat Certifications. Statistics, ETL, Informatica, DataStage, Pentaho, Data mining, R, Python, Scala, Spark, Data modeling, SQL, CQL, Hive, Hadoop, Impala, Map Reduce, Spring, Data Movement, Data Plumbing.


Backups, Restores, Performance checks, SQL tuning, Code tuning to match the data platform. These are all the day in and day out life of Data Operations. Data modeling, ERWin, ERStudio, JSON, XML, Column stores, Document stores, Text mining, Text processing and storage. RAID, SAN, NAS, local storage, spindles, SATA drives. Jobs, batches, Schedulers, 3:00 a.m. wake-up calls, alerts, on-call troubleshooting.


Now it is all summed up in a new hashtag: #DataOps

What is it that you do?

Just as DevOps is important to make our organizations Agile and responsive to the needs of the business users, so too does DataOps have its own unique and peculiar take on impacting the business.

Data Scientist on the left, Business Analyst on the right, Developers behind us, and Project management ahead of us. Standing on the infrastructure that we work together with DevOps to create, implement, and manage.

This is #DataOps.

Are you up for it?


2015-04-21

Centrality-and-Architecture



How does centrality affect your Architecture?


Some time ago, I was responsible for a data architecture I had mostly inherited. I worked on a number of tweaks to refine the monolithic nature of the main database. It was a time of upheaval in this organization. They had outgrown their legacy Computer Telephony Interface application. It was time to create something new.

A large new application development team was brought in to develop some new software.
There was a large division of labor and processing where some things were handled by the new application, and another thing was developed to handle the data. Reporting, cleansing, analysis, ingress feeds, egress feeds, all of these went through the “less important” system. 

This was the system I was responsible for. 

In thinking about how best to explain a Data Structure Graph, I spent some time revisiting this architecture and brought it into a format that could be analyzed with the tools of Network Analysis. 

After anonymizing the data a bit, and limiting the data flows to only the principal data flows, I constructed a CSV file to load into Gephi for analysis.

Source,Target,Edge_Label
Spider,ODS,Application
ODS,Spider,Prospect
Vendor1,ODS,Prospect
Vendor2,ODS,Prospect
Vendor3,ODS,Prospect
ODS,Servicing,Application
Legacy,ODS,Application
ODS,Legacy,Prospect
ODS,Dialer1,Prospect
ODS,Dialer2,Prospect
Gov,ODS,DNC
ODS,Spider,LegacyData1
ODS,Spider,LegacyData2
ODS,Spider,LegacyData3
Spider,ODS,LegacyData1
Spider,ODS,LegacyData2
Spider,ODS,LegacyData3
ODS,ThirdParty,Prospect
ThirdParty,ODS,Application
Legacy,ODS,Application
Legacy,ODS,DialerStats
Dialer1,ODS,DialerStats
Dialer2,ODS,DialerStats

I ran a few simple statistics on the graph, then did some partitioning to color the graph so that the degree of each node is apparent. This is the first output of Gephi:


The actual statistics Gephi calculated are in this table:
Id          Label       PageRank    Eigenvector Centrality  In-Degree  Out-Degree  Degree
Vendor1     Vendor1     0.01991719  0.00000000              0          1           1
Vendor2     Vendor2     0.01991719  0.00000000              0          1           1
Vendor3     Vendor3     0.01991719  0.00000000              0          1           1
Gov         Gov         0.01991719  0.00000000              0          1           1
Spider      Spider      0.08121259  0.44698155              1          1           2
Servicing   Servicing   0.08121259  0.44698155              1          0           1
Legacy      Legacy      0.08121259  0.44698155              1          1           2
Dialer1     Dialer1     0.08121259  0.44698155              1          1           2
Dialer2     Dialer2     0.08121259  0.44698155              1          1           2
ThirdParty  ThirdParty  0.08121259  0.44698155              1          1           2
ODS         ODS         0.43305573  1.00000000              9          6           15
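The three degree columns can be recomputed from the edge list without Gephi. The following is a minimal Python 3 sketch under one assumption that the table itself suggests: parallel edges between the same source and target (for example the three Spider to ODS flows) are counted once. The function name degree_table is hypothetical.

```python
import csv
import io
from collections import Counter

# The edge list from this post, as it was loaded into Gephi.
EDGE_CSV = """Source,Target,Edge_Label
Spider,ODS,Application
ODS,Spider,Prospect
Vendor1,ODS,Prospect
Vendor2,ODS,Prospect
Vendor3,ODS,Prospect
ODS,Servicing,Application
Legacy,ODS,Application
ODS,Legacy,Prospect
ODS,Dialer1,Prospect
ODS,Dialer2,Prospect
Gov,ODS,DNC
ODS,Spider,LegacyData1
ODS,Spider,LegacyData2
ODS,Spider,LegacyData3
Spider,ODS,LegacyData1
Spider,ODS,LegacyData2
Spider,ODS,LegacyData3
ODS,ThirdParty,Prospect
ThirdParty,ODS,Application
Legacy,ODS,Application
Legacy,ODS,DialerStats
Dialer1,ODS,DialerStats
Dialer2,ODS,DialerStats
"""


def degree_table(csv_text):
    """Map each node to (in_degree, out_degree, degree).

    Assumption: each distinct (Source, Target) pair counts as one edge,
    no matter how many labelled data flows share it.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = {(row["Source"], row["Target"]) for row in reader}
    in_degree = Counter(target for _, target in pairs)
    out_degree = Counter(source for source, _ in pairs)
    nodes = sorted(set(in_degree) | set(out_degree))
    return {
        node: (in_degree[node], out_degree[node], in_degree[node] + out_degree[node])
        for node in nodes
    }


for node, (indeg, outdeg, total) in degree_table(EDGE_CSV).items():
    print("{0:<11}{1:>3}{2:>3}{3:>4}".format(node, indeg, outdeg, total))
```

Counting every labelled row separately would give ODS a much higher degree than 15, which is why the sketch collapses parallel edges first.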

From the Data Architecture perspective, which “application” has the greatest impact to the organization if there were a failure?

Which “application” should have the greatest degree of protection, redundancy, and expertise associated with it?

Let's look at the two metrics in the middle of the last table: PageRank and Eigenvector Centrality.

I will have to create individual blog entries for both PageRank and Eigenvector Centrality to discuss the actual mechanism for how these are calculated. The math for these can be a bit cumbersome, and each algorithm should be given due attention on its own.
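Until those entries exist, a rough feel for PageRank can be had from a few lines of Python 3. This is a minimal power-iteration sketch and not Gephi's implementation: the damping factor of 0.85, the even redistribution of rank from nodes with no outgoing edges (Servicing here), and the collapsing of parallel edges are all assumptions, since the Gephi settings used are not recorded in this post. The function name pagerank is hypothetical. Under those assumptions the results land close to the PageRank column in the table, but not digit for digit.

```python
# Distinct (source, target) pairs from the edge list in this post.
EDGES = [
    ("Spider", "ODS"), ("ODS", "Spider"),
    ("Vendor1", "ODS"), ("Vendor2", "ODS"), ("Vendor3", "ODS"),
    ("ODS", "Servicing"),
    ("Legacy", "ODS"), ("ODS", "Legacy"),
    ("ODS", "Dialer1"), ("ODS", "Dialer2"),
    ("Gov", "ODS"),
    ("ODS", "ThirdParty"), ("ThirdParty", "ODS"),
    ("Dialer1", "ODS"), ("Dialer2", "ODS"),
]


def pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """Return a dict of node -> PageRank score; the scores sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: set() for node in nodes}
    for source, target in edges:
        out_links[source].add(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        # Every other node passes its rank evenly along its outgoing edges.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


for node, score in sorted(pagerank(EDGES).items(), key=lambda item: -item[1]):
    print("{0:<11}{1:.4f}".format(node, score))
```

Even in this simplified form, ODS ends up with by far the largest share of the rank, which is the point the table makes.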

The point of this analysis is to determine which component of the architecture should have additional resources devoted to it. Any customer-facing application should be given due attention and infrastructure. However, one question I have seen many of my clients struggle with is: what is the priority of the back-end infrastructure? Should one component of the architecture be given more attention than another? I have 90 databases throughout the organization; which one is the most important?

These centrality calculations show unequivocally which component of the architecture has the most impact in the event of an outage, or where the most value can be provided for an upgrade.
 
This type of analysis can begin to shed light on the answers to these questions. A methodical approach to an architecture based on data, rather than the division that screams the loudest can give insight into how an architecture is truly implemented.

I call these artifacts a Data Structure Graph.