The-Data-Guy: January 2016

2016-01-31

Therein Squats the Toad

One of my favorite phrases from the show Dharma and Greg is "Therein Squats the Toad".

When Dharma first says this in the episode it is in reference to a problem of their own making.

I so fell in love with the phrase that I began using it quite often.

So where does the Toad Squat?

Chuck Lorre is the writer of the episode, and he also used a modified form of the phrase in The Big Bang Theory.

While I won't presume to think that this explanation is the one he had in mind, I will say the following picture is what I think of when I say the phrase.

Imagine an old abandoned swimming pool.

When I was a kid, I worked some summer jobs cleaning up abandoned swimming pools. Not the glamorous swimming pools with bikini-clad figures aligned on either side.

Yucky, algae-filled swimming pools with broken pumps full of goo.

One thing we would do is to turn the pump on, and set up the drain to get rid of the mess.

Before we could turn the pump on we would have to clean out the inlet drains.

The inlet drains are just above the water line, but they are covered so the sun does not blare down on top of the water.

Inside this little covered inlet is cool water. (This was south Texas, by the way).

In every pool I worked with we would open it up and find....

You guessed it: toads, frogs, and turtles.

We would move them out, then clean out the particulates. After that we could use the drain, shock the water, scrub the pool and refill it with clean water.

So the first thing we had to work on, which was also one of the larger messes, was the place wherein squatted the toad.

2016-01-30

Meetup with some new people.

Meet new people.

Earlier I wrote about Learning New Things, and I hope that you have picked up a new book or two to learn something new.

Now that you have learned something, why not take the newfound knowledge and share it with others?

Meetup is an online social networking site with the goal of taking social networking offline.

This sounds counterintuitive, but ultimately it works quite well.

Humans have evolved to be social creatures. Even we introverts enjoy the occasional get-together with others.

I have attended Meetups in a few different cities, and even if I am not a regular attendee of the meetings I usually get the opportunity both to learn something and to share something I know with a new friend.

For example, I relocated a few years ago to the Cincinnati, OH area (technically I live in Kentucky, but it is very close to the city of Cincinnati). Meetup.com was a great way to meet new people.

I was able to get to know the technology landscape here in my new adopted home as well as meet some other new people outside the technology world I call home.

Meetup has a variety of categories, and there is sure to be something for everyone. If you can't find a local meetup with a topic you find interesting, create one!

Chances are if you are interested in something, then others around you may be interested as well. Create a meetup and get to know other people around you.

"Hi" is a dull word, but that's how some of the most interesting things get started.

2016-01-29

Can you Kaggle?

Kaggle?

Do you even Kaggle, Bro?

Kaggle is a platform for predictive modelling and analytic competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know at the outset which technique or analyst will be most effective. - Kaggle Site Definition

Basically, Kaggle is a site that periodically hosts competitions for machine learning and data mining practitioners.

Generally the format of the competition is this:

Here is a training dataset

Here is a test data set

Here is a sample submission file

The training and test data sets have the "key" features that the people designing the competition recognize as being important. There is one row per observation, with some particular "training" or "outcome" variable. The biggest difference between the two is that the test data set does not have the "training" or "outcome" variable.

This is what you need to accurately predict.

The sample submission file is generally a file with two columns: an ID variable and the response variable.

The ID in the submission is a unique identifier from the test data set, and the response is what you and your algorithm predict.
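As a rough illustration of that two-column layout, here is a minimal Python sketch. The column names `ID` and `target`, the helper `build_submission`, and the sample IDs and predictions are all made up for the example; each competition defines its own names in its sample submission file.

```python
import csv
import io


def build_submission(test_ids, predictions, id_col="ID", response_col="target"):
    """Return CSV text with one row per test ID and its predicted response."""
    if len(test_ids) != len(predictions):
        raise ValueError("need exactly one prediction per test ID")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_col, response_col])  # header row, as in a sample submission
    for row_id, prediction in zip(test_ids, predictions):
        writer.writerow([row_id, prediction])
    return buffer.getvalue()


# Hypothetical IDs from a test set and hypothetical model outputs.
submission_text = build_submission([101, 102, 103], [0.12, 0.87, 0.45])
print(submission_text)
```

In practice you would write this text to a file and upload it, after checking that every ID in the test set appears exactly once.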

There is an evaluation process behind the scenes. Once you submit your prediction, your submission is compared to a set of "known" predictions. The predictions behind the scenes are the ones generally considered to be accurate by the group hosting the competition.

An evaluation score is calculated from your predictions versus the accepted values. A metric such as the F-score, RMSE, or area under the ROC curve is used as the score.
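To make one of those metrics concrete, here is a small sketch of RMSE (root mean squared error) in plain Python. The function name `rmse` and the sample numbers are only for illustration; the actual scoring code on the competition side is not something I have seen.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up "accepted" values and made-up predictions.
score = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(round(score, 4))
```

For RMSE lower is better, while for area under the ROC curve higher is better, so always check which direction the leaderboard is sorted.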

These competitions span a wide variety of industries, and some of the competitions allow you to win money.

Since there is money on the line, there are some groups that take these things incredibly seriously.

They will dedicate resources and time to winning a competition. As such, the leaderboard may or may not be an accurate representation of how well someone knows an algorithm or methodology; rather, it may represent the amount of time and resources dedicated to winning the competition.

For example, in one recent competition, Springleaf, the top score on the leaderboard was .80427 and second place was .80394, a difference of .00033.

The difference between my score and the leader was .04759!

These competitions can be fun and exciting, and they provide opportunities to try new tools, methods, and algorithms. However, unless you dedicate a decent amount of time and energy to them, you may not come in first place.

2016-01-28

From Data Warehouse to Big Data

And now, Big Data...

Big Data has been defined as any data whose processing will not fit on a single machine.

While Hadoop and Big Data are generally assumed to be the same thing, many database systems I have worked on have been clustered.

Oracle, MySQL, and SQL Server can all be cluster-based.

While Hive is a great tool to leverage your existing SQL knowledge base, limiting your problem solving to SQL based access will impact the types of problems you can solve with your data.

Having been a DBA for many years, I clearly see the advantage of simply adding more machines to your cluster, and either doing a re-balance or suddenly having more space without the need for a data migration.

Early on in my career, if we were running out of space, we would have to go through a request process, then a migration downtime was undertaken to get the new space allocated to the right database.

Now, all you need to do (with Hadoop anyway) is to add another machine to the cluster.

There are some distinct similarities between a Big Data project and a data warehouse.

1) You must have a problem in mind to solve.

2) You must have executive buy-in and sponsorship. Even if this is a research project, you must have the resources to actually build your cluster, then move data onto the cluster.

3) The data must be organized. This will probably not be a normalization scheme like we saw previously from Codd. You should place similar data near each other. (Logs should be organized in folders for logs, RDBMS extracts should be organized in folders named in a similar manner to the original system schemas.)
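The folder convention in point 3 could be sketched along these lines. The root `/data/lake`, the `logs` and `rdbms` folder names, and both helper functions are assumptions invented for this example, not a standard layout.

```python
from pathlib import PurePosixPath

# Hypothetical root of the data lake; pick whatever fits your cluster.
LAKE_ROOT = PurePosixPath("/data/lake")


def log_path(application, day):
    """Keep logs together: one folder per application, one per day."""
    return LAKE_ROOT / "logs" / application / day


def extract_path(source_system, schema, table):
    """Mirror the original system's schema names for RDBMS extracts."""
    return LAKE_ROOT / "rdbms" / source_system / schema / table


print(log_path("webserver", "2016-01-28"))
print(extract_path("oracle_erp", "finance", "invoices"))
```

Building paths through one small module like this keeps every load job using the same layout, which is most of the battle.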

There is a lot of room for innovation, and for applying best practices from non-Hadoop-based structures.

But some guidelines should be followed. These guidelines should be consistent with the rest of your data architecture.

Otherwise things will fall into disarray quickly indeed.

One way to organize the movement of data into your data lake, as well as where the summary data goes later, is to use a Data Structure Graph. This will also help you with the rest of your data architecture.

I could go into lots of details about how to set up a Hadoop cluster, but things evolve quickly in that space, and I am sure anything I write would be outdated quickly.

I encourage you to look for a practice VM to work with, such as the Cloudera QuickStart VM.

Practice, try new things, then move them to production. If you need help, reach out; there are many sources of good advice for setting up your Hadoop environment.

2016-01-27

Job History Data

What is your Job History?

What is the biggest difference between a resume and a job application?

The timing and the details.

A resume is a marketing document: you are marketing yourself. A job application is an accurate representation of the dates identifying who provided your paycheck.

There is significant overlap between these documents, but your job application is generally a document with a standard organization for everyone working for the same employer.

Just as we munge data that we pull from one or more other applications to fit into our particular analysis, we must munge the data from our resume, portfolio, and reference information into a single format for Human Resources consumption.

I always try to keep job applications simple and straightforward.

Some of my previous employers are actually no longer in business, so that makes things a little difficult with some people that want full contact information for every employer for a long period of time.

Be honest in filling out your job history.

Never lie on a job history, application, resume, or LinkedIn profile.

It is quite acceptable to say you have done a particular thing in an academic setting.

Knowing academically how something works is a bit different than seeing something work a particular way in production, but the fact that you know how something works is still a step up from not even knowing such a thing even exists.

For example: I have taken a number of Machine learning classes, and participated in a number of Kaggle competitions.

I have a basic understanding of Machine Learning, and I am able to apply xgboost or RandomForest to a data set.

I have yet to do so in a production environment.

Does this mean I should not even mention it on my LinkedIn profile?

I suggest writing about what you have done, and let others decide the value of that knowledge in their environment.

Good luck with your job searches!

2016-01-26

Brand yourself

What is your brand?

When you introduce yourself to others, what do they remember?

Is it your eyes, your smile, the color of your shirt?

How do you introduce yourself?

If you are like me, you could go on for hours about your accomplishments or what you are good at.

No one will remember you droning on and on about the problems you have solved, or your years of expertise on a topic.

What they will remember is something distinctive.

Darwinian evolution shows us that those individuals that have some distinctive feature, and are able to pass that feature on to their offspring, will eventually dominate an environmental niche.

In the job market, or even when networking, some simple distinctive feature that differentiates you from your competition, or from others in your environment, will keep you on someone's mind longer.

This distinctive feature should be consistent. Every time you go to an event it should be with you.

Just as Colonel Sanders always wore his distinctive white suit and bow tie, and Alice Cooper always had his grubby looking jacket, you need to have something with you, whether it be an item of clothing, a phrase, or a mannerism.

These are the things that people will associate with you, for good or ill.

Find your distinction, and embrace it.

For me, I am the data guy.

Do I work with computers? Yes. Can I fix your printer? Sure. Can I build a website or other application? Of course.

However, my forte, the thing that I do best, is to work with data.

I am the data guy.

I can optimize, organize, backup, restore, safeguard, and analyze data in a variety of ways.

I was a DBA (database administrator), Data Warehouse Architect, Data Architect, and even Hadoop engineer.

All these things evolve and change over time. The data that is important to our organization will always need to be put into the right people's hands at the right time, and in the right format.

I am the data guy.

How can I help you?

2016-01-25

What to do about data overload

Data Overload

Some time ago I wrote a blog about data overload.

Here are some ways to help address data overload.

1) Don't confuse a backup system with a reporting system.

    A full backup of your production data is needed for recovery and safekeeping purposes; do not confuse this repository with a reporting platform. A reporting platform should limit the data that it keeps to the data that supports the key performance indicators for your enterprise.

2) All of the data for your key performance indicators should be mapped to the originating system.

    This is all part of your Meta-Data. For any metric you use to make significant decisions, it should be clear where the data came from.

3) Data should be optimized for reporting.

   A Data Mart using a proper dimensional model allows for rapid, isolated reporting and analysis.

4) Data should be centralized, yet distributed easily.

   This may seem like a contradiction, but if you have an analytics platform where your data is cleansed, transformed and aligned with key business entities, this data can and should be distributed throughout the organization to make it easy for business users to use with tools they are familiar with.
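Point 2 above, mapping each key performance indicator back to its originating system, can start out as something very small. In this sketch the KPI names, systems, tables, and columns are all invented for illustration; a real Meta-Data repository would hold far more than this.

```python
# Hypothetical lineage records: KPI name -> where the number comes from.
KPI_LINEAGE = {
    "monthly_revenue": {"system": "billing", "table": "invoices", "column": "amount"},
    "active_customers": {"system": "crm", "table": "accounts", "column": "status"},
}


def originating_system(kpi_name):
    """Return 'system.table.column' for a KPI, or raise if it is undocumented."""
    try:
        source = KPI_LINEAGE[kpi_name]
    except KeyError:
        raise KeyError(f"no documented source for KPI {kpi_name!r}") from None
    return f"{source['system']}.{source['table']}.{source['column']}"


print(originating_system("monthly_revenue"))
```

The useful part is the failure case: a KPI with no documented source is exactly the metric you should not be making big decisions with yet.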

2016-01-24

Are you just keeping the lights on?

Shine the light on your data!

Data Management can mostly be a "keep the lights on" activity.

However, with some imagination, intuition and innovation, you can do more than just keep the lights on. You can shine a light on new opportunities.

When you are creating an MVP, or even as you rush through your backlog to meet the needs of your customer requests, you may not actually have time to have someone spend some time with the data your application is generating.

There may be another way to look at this data.

Whether it be as a graph, a data mart, or some specialized visualization and storytelling, people with a focus on data products should have access to this data and be given the time to shine a new light on what could be done.

If you have a huge backlog of reports to create for your application, it may be time to look at reorganizing the data to make self-service reporting easy for some people.

If you have some type of data sequencing process, perhaps that should be loaded into a graph tool for graph analysis.

In the rush to meet the demands of our immediate customers, we should be looking downstream and asking ourselves: what do tomorrow's customers need? What should we be doing today to prepare for them?

What insights can I get from my data to prepare me for the next product my organization needs to build?

Where do you shine the light?

2016-01-23

Melvil Dewey - The first Data Architect

What is a data Architect?

Let us start with understanding a basic definition of an application architect.

Java architects know the ins and outs of Java: the versions, the capabilities, the tools, the memory requirements, etc.

A Data Architect on the other hand has a detailed understanding of the impact of how data is organized.

This includes all of the traditional normalization rules, which we owe to Ted Codd, not to mention all of the denormalized usage patterns that Ralph Kimball wrote about.

It also covers staging areas, ETL processes, the impact of the various RAID levels on performance, which data should be copied to which data center, and how to do sharding on a relational system.

Basically, Data Architects do not write the original software that produces the data, but they know how the data should be organized in that application, as well as what to do with the data once it leaves the original application repository for use-cases the application developers did not envision.

Melvil Dewey was not an author.

However, he did more to allow people to have access to books, and be able to find books that were related to topics they were interested in than any other person.

Today, with our Data Lakes, Data Warehouses, Spark Clusters, Hadoop Clusters, Data Marts, Data Scientists, and Data Analysts all trying to pull data together, organize it, channel it, and transform this raw data into business value, we should remember the simple approach that Dewey took.

Know where your data is located.

Organize the data to make it easy for others to both use and find.

Categorize your data in a manner that makes sense to the greatest number of people.

His method is not perfect, and some improvement has been made in the way data is organized, cataloged and searched for. But his methods stood the test of time for many years until later inventions were able to use newer methods to find and access data.

Will your architecture stand for more than a hundred years?

Will it survive the next CIO that takes over?

How much thought is given to the organization of the data within your organization, and how the various needs of different systems are met?

Do you use a Data Structure Graph to keep track of the importance of the various data feeds that are the life-blood of your organization?

How do you organize your data?

2016-01-22

Meta-Data and SEO

SEO Meta-Data

Meta-Data is the contextual information about the data stored and used in an enterprise.

SEO (Search Engine Optimization) is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results. The goal is to make yourself easily found for the particular keywords you are targeting.

In an SEO project a website administrator has to pick out appropriate keywords that describe the contextual information about a particular page of a website. Keywords are chosen and provided to search engines so that a particular site can be found easily by search engine users.

What Meta-Data is useful?

When you started using a particular search term.

When you stopped using it.

Synonyms, antonyms, homonyms, and alternative spellings.

Keeping track of all of these things will help with your analysis of how effective your SEO campaign is, and where you may need to change.

This will be valuable external data to enrich your SEO analysis.
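One possible shape for that keyword Meta-Data is sketched below in Python. The class `KeywordRecord`, its field names, and the sample terms and dates are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class KeywordRecord:
    term: str
    started: date                      # when you started using the term
    stopped: Optional[date] = None     # None means the term is still in use
    variants: List[str] = field(default_factory=list)  # synonyms, alternative spellings

    def active_on(self, day: date) -> bool:
        """True if the term was in use on the given day."""
        return self.started <= day and (self.stopped is None or day <= self.stopped)


# Hypothetical campaign history.
keywords = [
    KeywordRecord("data architect", date(2015, 6, 1), variants=["data architecture"]),
    KeywordRecord("big data", date(2015, 1, 1), stopped=date(2015, 12, 31)),
]

active_terms = [k.term for k in keywords if k.active_on(date(2016, 1, 22))]
print(active_terms)
```

With start and stop dates recorded, you can line up a change in search traffic against the terms that were actually active at the time.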

2016-01-21

Writing challenge update 2

20 days so far...

I had some interesting ideas that I had started to write about.

I should have completed those articles some time ago.

Most of these things I had actually forgotten about.

Writing every day is forcing me to hit the publish button and not wait for the article to be "perfect".

Adhering to that standard, I don't think anything would ever be finished.

I want to at least make my ideas clear, concise, and out for all the world to see.

Some of my ideas may be good, some may be silly.

But nevertheless I will put them out there.

Do you find anything I have written useful?

Have you written?

Did you write today?

2016-01-20

Data Will Work if it flows

How does data actually do work?

Data does work in a similar manner to water.

Water left alone can do very little.

Cleansed water can rejuvenate a tired body.

Channeled water can turn wheels to create energy for transforming raw materials into usable materials.

Transforming water into steam can turn mighty turbines to create energy for a myriad of purposes depending on the enterprise's goals.

Cleansing

Water must be cleansed in order to be consumed. Impurities from the environment can contaminate the water making it unsuitable for a particular use.

Just as water needs to be cleansed before it can be used for a particular purpose, so too must data be cleansed to ensure that its usefulness can be guaranteed.

Data Cleansing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.

For all data there is a standard of use. Is the data useful the way it was entered into the data capture system for the function that we are trying to perform? For example, do we need to look up any codes or convert timestamps from one timezone to another? If the data meets this standard, then it does not need cleansing. If it does not meet this standard, then it does need cleansing. The standard of use is either explicit, with rules around the processing of the data, or implicit, where the data is simply accepted as it was entered.
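The two examples above, looking up codes and converting timestamps, might look like this as an explicit standard of use. This is only a sketch: the lookup table, the field names, and the fixed UTC-6 capture offset (which ignores daylight saving) are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical code lookup table.
STATE_CODES = {"KY": "Kentucky", "OH": "Ohio", "TX": "Texas"}

# Assumed capture timezone: a fixed UTC-6 offset, ignoring daylight saving.
SOURCE_TZ = timezone(timedelta(hours=-6))


def cleanse(record):
    """Return a cleansed copy of the record, or None if it fails the standard."""
    code = str(record.get("state") or "").strip().upper()
    state = STATE_CODES.get(code)
    if state is None:
        return None  # unknown code: reject rather than guess
    try:
        local = datetime.strptime(record["captured_at"], "%Y-%m-%d %H:%M")
    except (KeyError, ValueError):
        return None  # missing or malformed timestamp
    as_utc = local.replace(tzinfo=SOURCE_TZ).astimezone(timezone.utc)
    return {"state": state, "captured_at_utc": as_utc.isoformat()}


print(cleanse({"state": " tx ", "captured_at": "2016-01-20 08:30"}))
print(cleanse({"state": "ZZ", "captured_at": "2016-01-20 08:30"}))
```

Whether a bad record is dropped, repaired, or sent to a review queue is a business decision; this sketch simply rejects it.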

Channeling

When a large body of water is channeled through a small channel the force of the volume of water moving through the channel moves obstacles out of the way.

When you take Big Data, and mine through the data to find a particular use case you are channeling the data to focus on a particular problem.

Transforming

When water is heated it becomes steam. When frozen it becomes ice.
Ice structures can be amazingly beautiful when created by the right artist.

Today's Enterprises, whether they be great or small, are all but ships on the great ocean of information. Each piece of data we encounter is as a drop in that ocean.

How do you make data work for you?

I originally wrote this article before the concept of a Data Lake became popular. As I am cleaning out my drafts as part of my #Writefor30days challenge, I found this draft.

The concept of a Data Lake, just like a man-made reservoir, is to store the data until such time as you know you will need to use it. Proper Data Management best practices should be applied to the Data Lake, and when you need to understand a particular business question, use some of the tools at your disposal, hopefully ones you have built, to filter the "water" in your data lake to answer your question.

By the way, cleansing, transforming, and channeling is just another way of saying: apply a Lambda function to all of your data to get the answers you need.
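Reading "Lambda function" here as a small anonymous function applied across every record, the three steps could be sketched in Python like this. The sample readings and the threshold of 5 are made up for the example.

```python
# Hypothetical raw values as they might arrive from a capture system.
raw_readings = [" 12.5", "n/a", "7", "  3.25 "]

# Cleansing: keep only values that look like plain non-negative numbers.
cleansed = filter(lambda s: s.strip().replace(".", "", 1).isdigit(), raw_readings)

# Transforming: turn the surviving text values into floats.
transformed = map(lambda s: float(s.strip()), cleansed)

# Channeling: focus on the readings that matter for one question.
channeled = list(filter(lambda v: v >= 5.0, transformed))

print(channeled)
```

The same filter and map pattern carries over to tools like Spark, where the functions run across the whole cluster instead of one list.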

And if you want to understand all of the "tributaries" coming into and leaving your Data Lake, might I suggest a Data Structure Graph? Keeping track of all of the data flows through an organization will become a tedious task if not done right from the beginning.
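As a very rough sketch of tracking those tributaries, the flows can be held as a directed graph and walked to see everything a given source feeds. The system names and the helper `downstream` are invented for this example, and this is only one simple way to represent such a graph.

```python
# Hypothetical data flows: each key feeds the systems in its list.
FLOWS = {
    "crm": ["data_lake"],
    "web_logs": ["data_lake"],
    "data_lake": ["sales_mart", "churn_model"],
    "sales_mart": ["executive_dashboard"],
}


def downstream(node, flows=FLOWS):
    """Return every system that directly or indirectly receives data from node."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for target in flows.get(current, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(sorted(downstream("crm")))
```

Answering "what breaks if this feed is late?" then becomes a lookup instead of a week of meetings.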

It's not magic.

Just follow best practices.

2016-01-19

A personal impact of data management

The personal impact of data management.

Some time ago I began feeling odd.

After a few Google searches to get a better idea of what my symptoms could be attributed to, we decided to go to the emergency room.

A number of tests later it was decided I needed to have my appendix removed.

Throughout this process, at each "station" I was sent to, I was asked simple things like "What brings you in tonight?" or "Why are we testing you?" Simple, straightforward ice-breaker questions. As the night wore on and the pain medication they gave me began to take more of a toll, I didn't really think about why they were asking the questions they were asking.

I have been to this particular hospital a few times for various tests, so they already had my insurance and contact information, yet when they sent someone from admissions they had to re-confirm all of my insurance, contact and emergency notification information.

I said to the lady that it was in the system, since nothing has really changed. The response I received was that they upgraded to a new patient care system in December. They have access to the historical data for records purposes but they have to re-enter the patient information into the new system whenever a previous patient comes in for a current visit.

I was already medicated, so little of the rest of the conversation do I remember.

I basically handed over my license and insurance information and did my best to answer her questions.

However, I do remember this part of the conversation, because this is what I do.

I manage data.

The mental image ran through my mind of some management team discussing the complications of migrating historical data from the legacy system into a new system. (A discussion I have been a part of on numerous occasions.) I could see the person at the head of the table shaking their head and saying:

"It will cost too much money to do a historical migration of everything. We will only migrate enough data from the old system to the new system to keep things going. Any historical patient data is not that important."

I had pretty much heard those exact words from a CIO before.

As I have written before, the historical data you have represents your customers. Ignoring their history is to ignore your own history. What would be the effect, if every time you went in to a bank you had to fill out all of the forms for opening an account?

Each, and every time?

Now I do more with understanding the nuances of what the data represents, and why customers, patients, prospects, or other things that data represents are doing the things they are doing.

This episode stuck with me for some time, and I hope that we who are responsible for data recognize that the data we work with represents real, living, breathing people.

Historical data is important.

Ignore it at your peril.

2016-01-18

When Startups fail: Moving forward with Lessons learned.

Startup Adventures

There are a few people that can call themselves serial entrepreneurs because they have started a few ventures.

I think there are many more people that could call themselves serial entrepreneurs even though the startups they participated in did not take off.

There are many reasons why a startup does not take off.

Very few startups I have worked with have failed because we did NOT have a product or technology behind the sales force. We had tools that worked, but market forces, timing, or just a lousy economy sometimes cause startups to fail.

How do you emphasize the positive work done in this kind of environment?

Demonstrate your understanding of the company's goals, value proposition, and the components of the strategy you were contributing to.

Chances are you had some ideas for ways in which something could be done better. During an interview, or when you talk about your failed startup, focus on how you were able to provide some solutions, and how those solutions may fit into a new organization.

Perhaps, one day I will write a post about all of the startups I have been involved in.

Some of them were great ideas that the market was not ready for, some were lackluster ideas, some were ideas that ran out of money before we could build the technology.

A few of these "startups" were really just con-men trying to get something for nothing.

Those are the ones of which I have the most negative memories.

The Startups I worked with that were innovative, challenging, and that set out to make a big difference with technology were fun to work with.

2016-01-17

Packaged reports versus integrated reporting.

Data Silo?

Software packages like SAP, PeopleSoft, or Kronos are dedicated to solving problems. They are full-featured packages that come with plenty of options to meet the needs of their customers.

However a lot of people I speak with about these packages act as if these packages are the only packages in the enterprise. More often than not, these "all encompassing" packages are not alone in the enterprise; there are other systems with which they need to interact.

A package is not alone

I was reminded of some of my discussions about this topic when I saw a toy commercial recently. They show children playing with just that particular toy set that the advertisement is showing. In reality this one toy set is part of an entire room full of toys. Cleaning up all of these toys can become a challenge.

If you have ever helped a child find a part of a toy you will understand.

For a toy set, if you want to take a car from one set and play with that car on a new race set, you simply pick up the car and start playing with it. If it doesn’t quite fit, the child will use their imagination to make the car either a giant's car, or a tiny robot car looking for other tiny cars in that world.

For an application it’s not quite that simple. If you have a large volume of data in an older or legacy application, and you want to use that data in your new application, you have to have a migration strategy requiring expertise in both applications as well as data manipulation. If these applications use different data repositories, say one Oracle and the other SQL Server, that can make it more complicated.

If you want to use both applications at the same time, during a soft launch, for example, data integration becomes even more vital.

Data needs to be integrated

Constant feedback on performance is a vital business function. Seeing the performance of only one application will limit your vision in other areas. Data Management professionals are the ones ultimately responsible for knowing where all of the “toys” are located and how to get them, as well as showing how they are represented.

Proactive data management planning ensures that some of these problems never emerge. If every week or month or day you have lots of IT people and Business Analysts looking for data, you need to re-think your strategy. Because then you are working for your data, rather than having your data work for you.
