Becoming a Data Scientist

Everyone's path to becoming a Data Scientist is different.

This makes it difficult to recommend to someone outside the academic world how to re-invent themselves into the image of a Data Scientist.

Joel Grus gives a quite succinct overview of what a Data Scientist should be able to do. I first saw this image on Twitter, and searching around I found his presentation on SlideShare.

Joel makes some great points about learning Data Science.

Science is a tool for understanding the world around us.

"Science is a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe."

Bill Nye the Science Guy at The UP Experience

Some people may think of Data Science as part of the toolkit used by Scientists wearing white coats in pristine labs, but you can begin doing Science in your own kitchen. I am much more a follower of Bill Nye than of Brian Cox.

One of the earliest data predictions I made was to answer the following question: How big should this database be?

I created a Data Model in ERwin, then did a volumetric estimation based on daily table growth.

I had a model (the Entity Relationship Diagram). I had some assumptions (we are adding this many records every day). I was able to do forecasting (another word for prediction).

Using this information we knew how to size the database server for this application, and we used that information as part of the infrastructure pricing estimates given to management.

While this may not be considered "proper data science" by many people, it is a simple example of how to begin using Science in even the most mundane way.
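The volumetric estimation above can be sketched in a few lines of R. The table names, rows per day, and bytes per row here are made-up illustrative numbers, not the figures from the original project:

```r
# Assumed daily growth per table (hypothetical values)
rows_per_day  <- c(orders = 5000, order_items = 20000, audit_log = 50000)
bytes_per_row <- c(orders = 512,  order_items = 256,   audit_log = 1024)

# Total bytes added per day across all tables
daily_growth_bytes <- sum(rows_per_day * bytes_per_row)

# Projected size after three years, in GB
three_year_gb <- daily_growth_bytes * 365 * 3 / 1024^3
round(three_year_gb, 1)
```

Multiply out the model's assumptions and you have a defensible sizing number to hand to management, plus a clear record of which assumptions to revisit when reality diverges from the forecast.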

Sure, there are some foundational components like math, data management, programming skills, business acumen, and data visualization that need to be learned and understood. But start by trying to solve a problem in your own universe.

So the question before us is: How do I become a Data Scientist?

Here are some steps to follow:

  1. Pick a problem that interests you. (Ask an interesting question.)
  2. Learn all you can about the problem. (What is already known about the problem?)
  3. Collect as much data as you can about the problem. (From as many sources as you can.)
  4. Make a prediction about the problem. (Don't worry if others have already done this.)
  5. Be wrong! (This is the most important step. Fail first, fail fast, and fail often!)
  6. Figure out what you did wrong, and correct it. (Then iterate: go back to learn more (2), get more data (3), or update your prediction (4) based on new information.)
  7. Be right! (Finally! Now show why you were correct, and how you can apply this to other domains.)

If you are just starting on the journey to become a Data Scientist, do this a few times. The idea is to learn this process. The tools will change based on your environment, the specific problem you are trying to solve, and what tools your employer will allow you to use.

The point is, as Joel Grus, Bill Nye, and Nike say: Just do it!



RFI Analysis

Expanding on my prior post about Bootstrap Analysis, this post demonstrates one particular technique for learning more about how deep the rabbit hole goes when you are exploring a new data set.

RFM analysis is a way to summarize the interactions your customers have with your organization. Using the three concepts of Recency, Frequency, and Monetary value, and their various interactions and combinations, you can create a very simple statistical model of the value of a given set of customers.

Now, what if your customers are not spending money? Changing Monetary value to the Intensity of their interactions with your organization, store, or web site is one way to reuse the common tools of RFM analysis as an RFI analysis.

Recency - How recently, according to some definition, did an interaction occur? 

Frequency - How frequently does a particular customer or set of customers perform an interaction? 

Intensity - How intently did the customer interact? 

Each of these metrics needs to be interpreted in the context of your particular brand.

Here is a small example visualization using some random data I generated: 

I generated a list of customers and randomly assigned them to 5 different stores. After that, I set up a random number generator in Excel to populate the three metrics of Recency, Frequency, and Intensity.

In this particular case, Recency could be how recently within the month a customer visited a store. So if a customer visited on the last day of the month, the Recency would be high. Frequency is how many times the customer visited the store during the month. Intensity could be how much money, or how much time, they spent in the store.
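The same sort of random data can be generated directly in R rather than Excel. This is a sketch under assumptions: the column names match the rfm.csv file read by the plotting code below, and the customer count and value ranges are arbitrary choices, not figures from the original data set:

```r
set.seed(42)  # make the random draws reproducible

n <- 100  # number of customer-store observations (arbitrary)
rfm.df <- data.frame(
  Customer  = sample(1:20, n, replace = TRUE),  # 20 hypothetical customers
  Store     = sample(1:5,  n, replace = TRUE),  # 5 stores, as in the post
  Recency   = sample(1:31, n, replace = TRUE),  # day of month of last visit
  Frequency = sample(1:15, n, replace = TRUE),  # visits during the month
  Intensity = round(runif(n, 5, 200), 2)        # e.g. dollars or minutes spent
)

# Write the file the visualization code expects
write.csv(rfm.df, "rfm.csv", row.names = FALSE)
```

Seeding the generator means anyone re-running the analysis gets the same "random" customers, which makes the resulting plot reproducible.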

So long as the metric is applicable to your use-case, and the definitions are consistent throughout the analysis, each of these metrics will work.

This analysis tool can be applied in a variety of settings where there are recurring interactions between an organization and a set of customers.

With the proper visualization of data similar to the above image, many data points can be represented. In this case we represent the RFI metric itself as a point in the geometric cube. The shape of the point represents the store. The color represents the customer. 

So in this simple diagram there are 5 dimensions of data represented. 

The R code to do this is: 

#Load the packages for 3d plotting and reading csv files
library(scatterplot3d)
library(readr)

#read the data
rfm.df <- read_csv('rfm.csv')

#Define a plotting shape for each of the 5 stores
shapes <- c(16,17,18,19,20)
rfm.df$Store <- as.factor(rfm.df$Store)
shapes <- shapes[as.numeric(rfm.df$Store)]

#Color the points by customer
rfm.df$Customer <- as.factor(rfm.df$Customer)

#One point per row: Recency x Frequency x Intensity
s3d <- scatterplot3d(rfm.df$Recency, rfm.df$Frequency, rfm.df$Intensity,
                     xlab="Recency", ylab="Frequency", zlab="Intensity",
                     pch=shapes, color=as.integer(rfm.df$Customer))



Bootstrap Analysis

There are times in the life of a Data Scientist where you have a bit of data, and the broad directive of: "See what this says."

Recently I faced this, and came up with this concept: doing a Bootstrap analysis.

Spend a very small time period (small is relative to your initial understanding of the data) pulling the data together and munging it into your tool of choice.

Then come up with broad features that describe the data set at a high level. These features are an enrichment of the data set itself. In R terms, it would be adding a column to a data frame; in SQL terms, creating a new table with a ranking column based on some grouping criteria, for example.
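As a sketch of that kind of enrichment in R, here is a made-up transaction data frame (the names and values are purely illustrative) with a within-group ranking column added, analogous to a SQL RANK() OVER (PARTITION BY Store):

```r
# Hypothetical transaction-level data
sales.df <- data.frame(
  Store  = c("A", "A", "B", "B", "B"),
  Amount = c(120, 80, 200, 50, 150)
)

# Enrichment: rank each transaction within its store by amount,
# largest amount first
sales.df$RankInStore <- ave(sales.df$Amount, sales.df$Store,
                            FUN = function(x) rank(-x, ties.method = "first"))

sales.df
```

The new RankInStore column is exactly the kind of high-level feature you can then trend against an outcome variable in the original data.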

Once you have this enriched data set perform some simple trending analysis on your features compared to some outcome variable in the original data set. If you see something interesting, that is the thread you can pull to find out more details. 

Further contextualizing the data will give further insight, but this initial analysis will give you an idea of what further contextual information would be useful.

This "limited" analysis should enable you to answer the initial question of: How long will it take to get something useful out of this data?

This is a different concept from "bootstrapping", which others have explained in far more detail, and far better, than I ever could. This is just a way to communicate to stakeholders that there needs to be some time given to initial analysis of a data set to get an idea of how deep the rabbit hole goes...