
2015-05-11

The-First-Data-Scientist?

Was Charles Eppes the first fictional Data Scientist?


Sherlock Holmes is the best-known fictional detective, although many students of literature will tell you Holmes was not actually the first consulting detective.

A recent conversation about detectives and data scientists led me to wonder who could be considered the first fictional Data Scientist.

To answer this question, we should consider what a data scientist is and what they do.

They work with data gathered in the real world, do some analysis, derive a model that can represent the data, explain the model and the data to others, add new data to the model, make predictions, refine the results, and then test those results in the real world.

While there are many definitions, and the definition of the term Data Scientist seems to change on an ongoing basis, for the purposes of this article these general thoughts are sufficient.

On the show Numb3rs, Charles Eppes, played by David Krumholtz, was a mathematician whose brother worked for the F.B.I. in California. Over a number of seasons Charlie, as he was affectionately known, worked through "doing the math" on hard problems, helping his brother and his team both solve crimes and understand the math behind his explanations. By taking that next step, explaining through analogies that non-mathematicians could understand, he was able to give them insight into the crime and the criminal.

For those of you that have attended any of my presentations on data science, you know that I consider Johannes Kepler to be one of the first true data scientists. Like Charlie, Johannes gathered data from the real world (painstakingly collected over years by Tycho Brahe), did some analysis, derived a model to represent the data, then began to explain the model and the data to others. As new data came in, Kepler refined his model until all the data points fit. From there Kepler was able to make predictions, refine his results, and show others how the real world worked.
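That gather-model-refine-predict loop is easy to sketch in code. Here is a minimal illustration of the idea, assuming numpy is available; the observations are invented and a straight-line fit stands in for the real modeling work:

#!/usr/bin/python
# A toy version of the Kepler/Eppes loop: gather data, derive a model,
# add new observations, refine, and predict. All numbers are invented.
import numpy as np

# "Gathered" observations from the real world.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

# Derive a simple model (a straight line) that represents the data.
slope, intercept = np.polyfit(x, y, 1)

# New data comes in; refine the model by refitting on everything we have.
x = np.append(x, [5.0, 6.0])
y = np.append(y, [9.8, 12.2])
slope, intercept = np.polyfit(x, y, 1)

# Make a prediction, then go test it in the real world.
print "predicted y at x=7: {0:.2f}".format(slope * 7.0 + intercept)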

There are many other shows that apply the principles of forensics to solving crime. Some of them are quite interesting, although I am not sure real-world crime solvers can do some of the things their television counterparts do on a weekly basis.

Numb3rs, to me, will always be about using Data Science to solve real-world problems. If you haven't seen an episode, the whole series is on Netflix.


After all, isn't that what the data we work with on a daily basis represents? Something in the "real world"?






2015-05-07

Polyglot Management

Polyglot Data Management

What is the data stored in? 

A traditional database administrator is familiar with a set of utilities, SQL, and a stack of tools specific to the RDBMS that they are supporting.


SQL Server, SQL Server Analysis Services, SQL Server Integration Services, SQL Server Reporting Services, SQL Server Management Studio.

Not to mention all the disk partitioning, SAN layout, RAID levels, and such that they need to know.

Oracle has its own set of peculiarities, as do DB2, MySQL, and others.

There is a new kid on the block.

NoSQL.

Martin Fowler does a phenomenal job introducing the topic of NoSQL in this talk.

While most people do not get into too many of the details of how their data is stored and structured, in the world of #DataOps the nature of how and where the data is stored becomes incredibly important.

On any given day managing a Polyglot environment, the command-line tools used could be sqlplus, bcp, hdfs, spark-shell, or xd-shell.

Each of these data storage engines has different requirements. The manner in which databases are clustered differs considerably between RDBMS and NoSQL platforms.

Putting Hadoop or Cassandra on a SAN requires due thought before doing so.

Likewise, creating an environment with isolated disks for an Oracle cluster may not be the correct solution.


Managing a Polyglot environment is itself a challenge. Sometimes this requires a team;
sometimes it requires a lightly shaded purple squirrel who is equally at home at the command line, the SQL prompt, a REPL, a console, a whiteboard, an AWS management browser, or a management console like Grid Control, Ops Center, or Cloudera Manager.

Working with this variety of data, and the variety of teams and people that need access to this data, or even a subset of it, requires its own level of understanding of the data, of data management, and of how to make the data itself contribute to the bottom line of the organization.

Are your purple squirrels only on the data science team? Probably not.



2015-05-04

SPARCL-Simple-Python/AWK-Reduce-Command-Line

SPARCL - Simple Python/Awk Reduce Command Line

I need to analyze some data.
It is not overly large, so it isn't really a Big Data problem. However, working with the data in Excel is unwieldy.

I really need to analyze the structure of the data and the unique values of either a single column or combinations of columns.

There are actually a number of ways to do this.
Load the data into a MySQL database and run SQL GROUP BY statements on the table.
Load it into R and use the sqldf package to do the same thing.
Write a Python script to read the whole file and spit out combinations of distinct values for various columns.

All of these approaches have their advantages and disadvantages.

Each of them takes a bit of time, and some of them hold all of the data in memory.

Then I remembered a class I took from Cloudera some time ago. In the class, the instructor showed us how to run a simple MapReduce program without invoking Hadoop.

The command is simple:  
cat filename | ./mapper.py | sort | ./reducer.py

He suggested we run our code through this pipeline before submitting a full-on Hadoop job, just to make sure there were no syntax errors and all of the packages were available to the Python interpreter.

This is exactly what I need to do.

For my first attempt, I wrote a simple AWK script for the "mapper" portion.
parser.awk:
#!/usr/bin/awk -f
# Treat the input as comma-separated and emit only the first column.
BEGIN { FS = "," }
{ print $1; }


I ran this with:
cat file | ./parser.awk | sort | ./reducer.py

However, as time went on I needed to look at either different columns from my CSV file or combinations of columns.

Rather than do a lot of AWK coding, I wrote this Python:
#!/usr/bin/python
# Mapper: emit the requested CSV column(s) from each stdin record,
# followed by a count of 1 for the reducer to sum.
import sys

indexes = sys.argv[1:]  # column numbers, e.g. ./parser.py 2 4 6
if not indexes:
    sys.exit("usage: parser.py column [column ...]")

for data in sys.stdin:
    # Strip the newline so the last column doesn't carry it along.
    line = data.rstrip('\n').split(',')
    if len(indexes) == 1:
        # Single column: emit key,1
        print "{0},1".format(line[int(indexes[0])])
    else:
        # Multiple columns: comma-join the values, tab-separate the count.
        key = ",".join(line[int(i)] for i in indexes)
        print key + "\t1"


Now I can do:
cat file | ./parser.py 2 4 6 99 3 | sort | ./reducer.py

The reducer function I will keep on my GitHub to keep this post easy to read.
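If you want something to pipe into while you wait, here is a minimal stand-in reducer; this is my sketch of the usual streaming shape, not the script on GitHub. Because the mapper always appends a constant count of 1, counting runs of identical whole lines coming out of sort gives the same result as parsing the key out first:

#!/usr/bin/python
# Stand-in reducer sketch: count runs of identical lines from sort,
# the way uniq -c would.
import sys

current, count = None, 0
for data in sys.stdin:
    key = data.rstrip('\n')
    if key == current:
        count += 1
    else:
        if current is not None:
            print "{0}\t{1}".format(current, count)
        current, count = key, 1
if current is not None:
    print "{0}\t{1}".format(current, count)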

Feel free to give this a try. I find it works best from within a VM like the DataScienceAtTheCommandLine VM.

Good luck, and comment if you find this useful.


2015-05-03

#DataTherapist

Do you need a #DataTherapist? 

A few weeks ago, I was enjoying lunch with some of my coworkers. We were discussing some of the Data Science and Business Analysis use cases that we are building for our various clients.

Somebody made the comment that some use cases don't need a data scientist, they need a data therapist.

Many of us laughed, and I even made a Twitter comment about this. A number of other people on Twitter not only retweeted the comment but began to comment on applying a #DataTherapist to particular use cases.

Here are a few definitions that have recently evolved in the Twitterverse around the #Data hashtag.

My definition of Data Science: the application of statistical and mathematical rigor to business data. Confidence intervals and p-values should be properly applied to data when making decisions, and this becomes even more important when doing any type of predictive analysis. Data Scientists also do research with all of your business data, even adding third-party data as a source of enrichment, in order to best understand what has happened historically and what is likely to happen given a set of conditions in the future.
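To make "properly applied confidence intervals" concrete, here is a minimal sketch, with invented order values and scipy assumed available, of reporting an interval around a business metric instead of a bare average:

#!/usr/bin/python
# Minimal sketch: report a 95% confidence interval around a sample mean
# rather than the bare average. The order values are invented.
import numpy as np
from scipy import stats

orders = np.array([42.0, 55.5, 38.2, 61.0, 47.9, 52.3, 44.1, 58.7])

mean = orders.mean()
sem = stats.sem(orders)  # standard error of the mean
low, high = stats.t.interval(0.95, len(orders) - 1, loc=mean, scale=sem)

print "mean order value: {0:.2f} (95% CI {1:.2f} to {2:.2f})".format(mean, low, high)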


When doing historical reporting, the analyst is reporting on facts, represented by the data, that have already occurred. This is usually your Data Warehouse or Business Intelligence type of data product: repeatable, predefined business questions.

A Data Architect designs how an infrastructure should be built, from the servers to the data model, and works with other Architects to ensure things are streamlined to meet the goals and objectives of the organization.

#DataOps are the people that make the Data Architect's vision happen. They not only keep the lights on, but also make it easy for Data Scientists, Business Analysts, Application Developers, and all other manner of people that need to work with an enterprise's data products to do so.


But what is a Data Therapist?

What started out as a "joke" may truly be something important.

So here goes: an initial definition of a #DataTherapist.

A Data Therapist is the person, or group of people, that shows the way toward not only what Could be done with an enterprise's data, but what Should be done with it. They scope out data projects and work with the executive team to clear the runways for the Data Architect, Data Scientist, Business Analyst, and DataOps team to get their jobs done.


Not all data problems require a Data Scientist.

Not all Data Science problems are Big Data problems.

Not all Big Data problems are Data Science problems.

Data, regardless of structure, use case, application, or data product supported, will continue to grow. The leaders in this space continue to push the limits on what can be done with Python, R, Spark, SQL, and other tools.

The Data Therapist shines a light on what can be done, and on whether your organization should undertake a large-scale data project. They can also tell you when you need to stop doing something and simply start anew.


Are you ready to listen to the #DataTherapist?

You may not always like what they have to say.

Please comment below about how you think a Data Therapist can help an organization.

2015-04-24

#DataOps

#DataOps: Oh, that's what they call what I do?

I just read a few amazing articles:

Agile DataOps
DataOps a New Discipline

Trying to explain to people what I do becomes tedious at times.


Yes, I can build a Cassandra Cluster, or an Oracle Cluster, or a Cloudera Cluster. Each of those clusters has its own challenges and rewards. But once the Cluster is built and we start putting data into it, what do you want to do with that data? How do you want the data organized? Is your data model correct? Do you have a data model? Do you know how big the cluster or database is going to get? Have you done a volumetric calculation? Is your budget big enough to allow for failover? Downtime? Maintenance? Disaster recovery practice?
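That volumetric question deserves a worked example. Here is a back-of-the-envelope sketch; every number below is an assumption you would replace with your own:

#!/usr/bin/python
# Back-of-the-envelope volumetrics for a cluster-sizing conversation.
# Every input is an assumption; plug in your own numbers.

daily_ingest_gb = 50.0    # raw data arriving per day
replication = 3           # e.g. HDFS-style 3x replication
overhead = 1.25           # indexes, temp space, compaction headroom
retention_days = 365 * 2  # how long the data must stay online

raw_tb = daily_ingest_gb * retention_days / 1024.0
on_disk_tb = raw_tb * replication * overhead

print "raw data retained: {0:.1f} TB".format(raw_tb)
print "disk actually needed: {0:.1f} TB".format(on_disk_tb)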

Data Scientist, Business Analyst, Executive Analyst, Business User, Project Manager.


Each of these specialties has its own unique challenges. Being The Data Guy in an organization requires the ability to at least communicate effectively with all of these specialists.

Here are some of the things we do:

Oracle Certifications, Microsoft Certifications, Red Hat Certifications. Statistics, ETL, Informatica, Data Stage, Pentaho, Data mining, R, Python, Scala, Spark, Data modeling, SQL, CQL, Hive, Hadoop, Impala, Map Reduce, Spring, Data Movement, Data Plumbing.


Backups, restores, performance checks, SQL tuning, code tuning to match the data platform. These are the day-in and day-out life of Data Operations. Data modeling, ERWin, ERStudio, JSON, XML, column stores, document stores, text mining, text processing and storage. RAID, SAN, NAS, local storage, spindles, SATA drives. Jobs, batches, schedulers, 3:00 a.m. wake-up calls, alerts, on-call troubleshooting.


Now it is all summed up in a new hashtag: #DataOps

What is it that you do?

Just as DevOps is important to make our organizations Agile and responsive to the needs of the business users, so too does DataOps have its unique and peculiar take on impacting the business.

Data Scientist on the left, Business Analyst on the right, Developers behind us, and Project management ahead of us. Standing on the infrastructure that we work together with DevOps to create, implement, and manage.

This is #DataOps.

Are you up for it?