
2016-01-25

What to do about data overload

Data Overload

Some time ago I wrote a blog about data overload.

Here are some ways to help address data overload.

1) Don't confuse a backup system with a reporting system.

    A full backup of your production data is needed for recovery and safekeeping purposes; do not confuse this repository with a reporting platform. A reporting platform should limit the data it keeps to the data that supports the key performance indicators for your enterprise.

2) All of the data for your key performance indicators should be mapped to the originating system.
 
    This is all part of your Meta-Data. For any metric you use to make significant decisions, it should be clear where the data came from.


3) Data should be optimized for reporting.

   A Data Mart using a proper dimensional model allows for rapid, isolated reporting and analysis.

4) Data should be centralized, yet distributed easily.

   This may seem like a contradiction, but if you have an analytics platform where your data is cleansed, transformed and aligned with key business entities, this data can and should be distributed throughout the organization to make it easy for business users to use with tools they are familiar with.

2016-01-20

Data Will Work if it flows

How does data actually do work?


Data does work in a similar manner to water.

Water left alone can do very little.

Cleansed water can rejuvenate a tired body.

Channeled water can turn wheels to create energy for transforming raw materials into usable materials.

Transforming water into steam can turn mighty turbines to create energy for a myriad of purposes depending on the enterprise's goals.

 

Cleansing


Water must be cleansed in order to be consumed. Impurities from the environment can contaminate the water making it unsuitable for a particular use.

Just as water needs to be cleansed before it can be used for a particular purpose, so too must data be cleansed to ensure that its usefulness can be guaranteed.

Data Cleansing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.

For all data there is a standard of use. Is the data useful the way it was entered into the data capture system for the function that we are trying to perform? For example, do we need to look up any codes or convert timestamps from one timezone to another? If the data meets this standard, then it does not need cleansing. If it does not meet this standard, then it does need cleansing. The standard of use is either explicit, with rules around the processing of the data, or implicit, where the data is simply accepted as it was entered.
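As a small sketch of an explicit standard of use, the code below looks up a status code and normalizes a timestamp to UTC. The lookup table, field names, and fixed Eastern offset are all invented for illustration (real code would use a proper timezone database such as zoneinfo).

```python
from datetime import datetime, timezone, timedelta

# Hypothetical code lookup table; in practice this comes from a reference system.
STATUS_CODES = {"01": "active", "02": "suspended", "03": "closed"}

# Simplified fixed EST offset, purely for illustration.
EASTERN = timezone(timedelta(hours=-5))

def cleanse_record(record):
    """Apply an explicit standard of use: look up codes, normalize timestamps."""
    cleaned = dict(record)
    # Replace the raw status code with its meaning; flag anything unrecognized.
    cleaned["status"] = STATUS_CODES.get(record["status"], "unknown")
    # Convert a local timestamp to UTC so all downstream systems agree on time.
    local = record["captured_at"].replace(tzinfo=EASTERN)
    cleaned["captured_at"] = local.astimezone(timezone.utc)
    return cleaned

raw = {"status": "02", "captured_at": datetime(2016, 1, 20, 9, 30)}
print(cleanse_record(raw)["status"])  # suspended
```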


Channeling

When a large body of water is channeled through a small channel the force of the volume of water moving through the channel moves obstacles out of the way.

When you take Big Data and mine through it to find a particular use case, you are channeling the data to focus on a particular problem.
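In code terms, channeling is simply filtering a broad stream of records down to one question. The records and field names below are invented:

```python
# A broad stream of events; in a real system this would be millions of rows.
events = [
    {"store": "A", "product": "widget", "amount": 19.99},
    {"store": "B", "product": "gadget", "amount": 5.49},
    {"store": "A", "product": "gadget", "amount": 5.49},
]

# Channel the whole stream into a single narrow question:
# gadget sales by store.
gadget_sales = {}
for e in (e for e in events if e["product"] == "gadget"):
    gadget_sales[e["store"]] = gadget_sales.get(e["store"], 0) + e["amount"]

print(gadget_sales)
```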

Transforming


When water is heated it becomes steam. When frozen it becomes ice.
Ice structures can be amazingly beautiful when created by the right artist.


Today's Enterprises, whether they be great or small, are all but ships on the great ocean of information. Each piece of data we encounter is as a drop in that ocean.

How do you make data work for you?


I originally wrote this article before the concept of a Data Lake became popular. As I am cleaning out my drafts as part of my challenge, I found this draft.

The concept of Data Lake, just like a man made reservoir, is to store the data until such time as you know you will need to use it. Proper Data Management best practices should be applied to the Data Lake, and when you need to understand a particular business question use some of the tools at your disposal, hopefully ones you have built, to filter the "water" in your data lake to answer your question.

By the way, cleansing, transforming, and channeling is just another way of saying: apply a lambda function to all of your data to get the answers you need.
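As a minimal sketch of that idea, the three steps can be written as small functions and applied over every record. The field names and the revenue threshold are invented:

```python
# Cleanse, transform, and channel expressed as functions over all records.
cleanse   = lambda r: {**r, "city": r["city"].strip().title()}
transform = lambda r: {**r, "revenue": r["units"] * r["price"]}
channel   = lambda rows: [r for r in rows if r["revenue"] > 100]

records = [
    {"city": " austin ", "units": 30, "price": 4.0},
    {"city": "BOSTON",   "units": 10, "price": 2.0},
]

# Apply the whole pipeline to every record in the "lake".
result = channel([transform(cleanse(r)) for r in records])
print(result[0]["city"], result[0]["revenue"])  # Austin 120.0
```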

And if you want to understand all of the "tributaries" coming into and leaving your Data Lake, might I suggest a Data Structure Graph? Keeping track of all of the data flows through an organization will become a tedious task if not done right from the beginning.
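A Data Structure Graph can begin as something as small as an adjacency list of data flows. The system names below are invented for illustration:

```python
# Nodes are systems, edges are data flows between them.
flows = {
    "pos_system": ["data_lake"],
    "crm":        ["data_lake"],
    "data_lake":  ["sales_mart", "ml_sandbox"],
    "sales_mart": ["dashboard"],
}

def downstream(node, graph, seen=None):
    """Everything fed, directly or indirectly, by this node."""
    seen = seen if seen is not None else set()
    for nxt in graph.get(node, []):
        if nxt not in seen:
            seen.add(nxt)
            downstream(nxt, graph, seen)
    return seen

# Which "tributaries" does the CRM ultimately feed?
print(sorted(downstream("crm", flows)))
```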

It's not magic.

Just follow best practices.



2016-01-13

What do DBA's do?

"Data! Data! Data!" he cried impatiently, "I cannot make bricks without clay!" — The Adventures of Sherlock Holmes, "The Adventure of the Copper Beeches", Sir Arthur Conan Doyle, 1892

Over 100 years ago, Sir Arthur Conan Doyle lamented through his character Sherlock Holmes that more data is always needed than is available.

Today we have amazing systems capable of capturing, recording, retrieving, disseminating and presenting vast quantities of data.

The backbone of any organization is its data, and the person most responsible for this is the Database Administrator or DBA.


A DBA is the administrator of a Database Management System.


However, there is more than a single type of DBA. For a number of years throughout my career, I have fulfilled at least one of these roles. Here is a highlight of what each type of DBA is responsible for.

We will cover the various responsibilities of all DBA's in another post, but separating them into groups helps clarify each role.

Production DBA

The IT professional role responsible for, and specializing in, the software and hardware associated with a particular DBMS. Often the DBA title is prefixed with the DBMS engine they specialize in: Oracle DBA, MySQL DBA, PostgreSQL DBA, SQL Server DBA.

Transaction Development DBA

When creating a new application that will store persistent data for a long time to come, this DBA manages the development data environment, schema, and assists with query optimization and performance. This DBA should be quite familiar with the development methodologies, languages, and tools that the development team is using. These applications in some areas are called OLTP systems.

Data Warehouse DBA

A data warehouse data environment is very different from an application. Data flows into a data warehouse from multiple transaction or application systems. The role requires all of the transaction development skills, plus further skills related to ETL tools and Business Intelligence tools. Data modeling is truly the paramount skill for the data warehouse DBA. Knowing how to translate transaction data structures into data structures that are optimized for reporting and repeatable analytics is the differentiating skill for the data warehouse DBA.


Application DBA

An Application DBA supports application systems; for example, an ERP administrator could be considered a DBA because they manage the overall application stack in support of business users of the application. This application stack includes components that use a DBMS as a back-end repository, but all of the components are inter-related. As more and more applications migrate to cloud-based solutions, the need for this type of expertise is being replaced by the need for DBA's that are more familiar with cloud-based solutions and data stores.

Operational DBA


The Operational DBA focus is on performance tuning, backup and recovery, high availability, job scheduling, data delivery, security access levels, and so on in production environments. The Operational DBA has a specific responsibility for change management in production environments. This is slightly different from a Production DBA in that the Production DBA may work more with the creation and implementation of new environments rather than the day-to-day operations of keeping the lights on.

For the Production, Operational, and Data Warehouse DBA a tool like a Data Structure Graph could be incredibly useful as they work with the movement of data throughout an organization, and all of the inter-dependencies that support the ongoing work of the DBA.

One person could fulfill all of these roles depending on the size of the organization.

How many DBA's do I need?

As with any technical question, the answer is always “It depends.” I propose a formula that you can use to determine how many DBA’s of each type you will need depending on your organization.

For every two large scale in-house development projects (be this development of software, or customization of an off-the-shelf product) you will need 1 Application DBA. 

For every five production database applications (in-house developed or off the shelf) you will need at least 1 Operational DBA and 1 Production DBA.

For a data warehouse project you will need 2 data warehouse DBA’s. 

For every 2 Operational DBA’s you have you need at least 1 Production DBA. 

These are also minimums, because depending on your training and support needs you may want to consider having a few more DBA’s than what is suggested to be able to rotate things like time off and on-call support.  Let me tell you from personal experience, being on call constantly and then having to support new development splits the attention of resources so much that both tasks ultimately suffer when it is done this way. 
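The rules of thumb above can be collected into a small function. The ratios are this article's suggested minimums, not industry standards, and the function names are my own:

```python
import math

def dba_staffing(dev_projects, prod_databases, has_data_warehouse):
    """Minimum DBA head count from the rules of thumb above."""
    application = math.ceil(dev_projects / 2)     # 1 per 2 dev projects
    operational = math.ceil(prod_databases / 5)   # 1 per 5 prod databases
    production  = math.ceil(prod_databases / 5)   # 1 per 5 prod databases
    warehouse   = 2 if has_data_warehouse else 0  # 2 per DW project
    # At least 1 Production DBA for every 2 Operational DBA's.
    production = max(production, math.ceil(operational / 2))
    return {"application": application, "operational": operational,
            "production": production, "warehouse": warehouse}

print(dba_staffing(dev_projects=4, prod_databases=10, has_data_warehouse=True))
# {'application': 2, 'operational': 2, 'production': 2, 'warehouse': 2}
```

Remember these are floors; add head count for on-call rotation and time off.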

2016-01-12

We Start with Data

#WeStartwithData


I was in a development meeting for an application that needs some operational reports. 

The application is still under development, so no data is being stored for any length of time in the various development and testing environments. 

The reports are designed to show activity throughout the life-cycle of using the application. 

No proper usage data exists. 

As we were pointing this out to the team, someone said: "Well, data is always the last thing we worry about when building an application." 

This is why I am the data guy.

A data guy (or girl) always starts with data. 

We can design your structures so they are optimized for writing, or reading, or storage, or updates. 

We can make it easier to report on (Dimensional modeling)

We can do longitudinal analysis looking for patterns. 

We can do query optimization to give you results quicker than you have seen in the past. 

We can do many things. 

However, we have to have Data to do it. 

If you have data, I can tell you something interesting about it. 

If you have no data, I can tell you exactly what the data says - 

Nothing. 


2015-05-27

Data Science or Business Intelligence?

Comparing Data Science and Business Intelligence


Over the past few years as I have been supporting more and more "non-traditional" (i.e. not a Data Warehouse or Data Mart) analytical platforms, I have noticed a number of differences between Data Science approaches and Business Intelligence approaches.

This image sums up many of my observations and gives a touch point for comparing the differences as well as similarities between the two approaches.



Reproducible versus Repeatable


One of the goals of #DataOps is to keep data moving to the right location in a repeatable, automated manner. In most of the data warehouse environments I have worked on, the person doing the analysis does not run the ETL jobs. Today's data flows into existing data marts, dashboards, dimensional models, and the queries that drive it all. These are repeatable processes.

A Reproducible process, on the other hand, shows the entire process soup to nuts. The analyst pulled this data from that system, used this transformation on these data elements, combined this data with that data, ran this regression, and produced this result. Therefore, if we raise the price of this widget by $0.05 we will have this lift in profit (ceteris paribus).
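A reproducible version of that narrative can be sketched in a short script. The prices and demand figures below are invented, and the random seed is fixed so anyone re-running the analysis gets identical results:

```python
import random

random.seed(42)  # fixed seed: the whole analysis replays identically

# Step 1: "pull" the data (simulated here with a made-up demand curve).
prices = [1.00 + 0.05 * i for i in range(10)]
units  = [100 - 40 * p + random.uniform(-2, 2) for p in prices]

# Step 2: fit a simple demand line, units = a + b * price (least squares).
n = len(prices)
mean_p, mean_u = sum(prices) / n, sum(units) / n
b = (sum((p - mean_p) * (u - mean_u) for p, u in zip(prices, units))
     / sum((p - mean_p) ** 2 for p in prices))
a = mean_u - b * mean_p

# Step 3: state the prediction, ceteris paribus.
print(f"slope = {b:.1f} units per dollar; a $0.05 change moves demand by {0.05 * b:.2f}")
```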

Predictive versus Descriptive


As described above the Data Scientist attempts to make a prediction about something, whereas the Business Intelligence analyst is usually reporting on a condition that is considered a Key Performance Indicator of the company.

Explorative versus Comparative


In most Business Intelligence environments I have worked with, the questions are usually along these lines: "Is this product selling more than that product?"

The Data Scientist would want to look at which product has the highest margin, or which product has the largest impact on the bottom line. If someone buys product X, do they also purchase product Y?
What else is impacting this particular store? Does weather have an impact on purchase patterns? What about Twitter hashtags? What in our product line is most similar to a product that has a high purchase volume in the overall consumer community?

Attentive versus Advocating


Data Scientist: The data shows us that consumers that purchased X also purchased Y. I suggest we relocate Y for the stores in this geographic area by 1 meter away from X, and in this geographic area they should be 2 meters away. Then we will analyze the same visit purchases for those two items to determine if this should be done in all stores.

Business Intelligence: The latest data from our campaign is shown here. The response rate among 18-24 year old males is less than what we wanted but we expect to see more lift in the coming weeks. 

Accepting versus Prescriptive



Data Scientist: Give me all data, I will analyze it as is, and determine what needs to be cleaned and what represents further opportunities. If there is a quality issue I will document it as part of the assumptions in my analysis.

Business Intelligence: The data has to be cleaned and high quality before it can be analyzed. No one should see the data before all of the quality checks, verifications, and cleansing processes are done.

Both of these approaches have business value.

I think Data Science will continue to get press for quite a while; there will always be some amazing breakthrough where someone used the algorithm of the day to solve a business problem. Then the performance of that algorithm will become a metric on a dashboard that is put into a data mart.

The guys and gals in #DataOps will make sure the data is current.

The Data is protected.

The Data is available.

The Data is shown on the right report to the people that are authorized to see it.

Your Data is safe. 



2015-05-26

Spark Strategy

Data Strategy, but no Spark Strategy, how cute.


I see blogs, and blogs, and articles, and presentations, and Slideshares on creating a Big Data Strategy. How does "it" (usually Hadoop) fit into your organization? Who should get access to "it"? Who should own "it"? So many questions, comments, and opinions.

Guess what?

Your data is growing.

And growing.

If you use a Graph to analyze the data flows in your organization, chances are you will see ways to cut cost, and consolidate architectural components.

Bring things together and store them in one location.

Done!

Now what?

What tools do you use to analyze it?

You want to do the same thing you have always done?

Really?

Send two people in your organization to the upcoming Spark Summit. Let them show you there is a better way.





Spark is not just a new tool.

It is a new Path.

2015-05-11

The First Data Scientist?

Was Charles Eppes the first fictional Data Scientist?

 

Sherlock Holmes is the most well known fictional detective, although many students of literature will tell you Holmes was not the actual first consulting detective.

A recent conversation about detectives, and data scientists led me to wonder who could be considered the first fictional Data Scientist.

To answer this question we should consider what a data scientist is and what they do.

They work with data gathered in the real world, do some analysis, derive a model that can represent the data, explain the model and data to others, add new data to the model, perform some type of prediction, refine results, then test in the real world.

While there are many definitions, and I feel the definition of the term Data Scientist changes on an ongoing basis, for the purpose of this article I think these general thoughts are sufficient.

On the show Numb3rs, Charles Eppes, played by David Krumholtz, was a mathematician whose brother was in the F.B.I. in California. Through a number of seasons Charlie, as he was affectionately known, worked through "doing the math" of hard problems, helping his brother and his team both solve crimes and understand the math behind his explanation. By taking things this next step, actually explaining through analogies that non-mathematicians could understand, he was able to provide them insight into the crime and the criminal.

For those of you that have attended any of my presentations on data science, you know that I consider Johannes Kepler to be one of the first true data scientists. Like Charlie, Johannes gathered data from the real world, (painstakingly collected over years by Tycho Brahe), did some analysis, derived a model to represent the data, then began to explain the model and the data to others. As new data came in, Kepler refined his model until all the data points fit with his model. From there Kepler was able to make predictions, refine his results and show others how the real world worked.
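Kepler's own result makes a compact worked example of that loop. Fitting the exponent k in T = a^k from real planetary data (semi-major axis in AU, period in years, values for five planets Kepler studied) recovers his third law, k = 3/2:

```python
import math

# Real orbital data: (semi-major axis in AU, period in years).
data = {"Mercury": (0.387, 0.241), "Venus": (0.723, 0.615),
        "Earth":   (1.000, 1.000), "Mars":  (1.524, 1.881),
        "Jupiter": (5.203, 11.862)}

# Fit T = a^k on a log-log scale (regression through the origin,
# since a = 1 AU gives T = 1 year). Kepler's third law predicts k = 1.5.
xs = [math.log(a) for a, _ in data.values()]
ys = [math.log(t) for _, t in data.values()]
k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(f"fitted exponent k = {k:.3f}")  # approximately 1.5
```

From a model like this, a prediction for any newly observed body can be tested against the real world, which is exactly the refine-and-verify cycle described above.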

There are many other shows that apply the principles of Forensics to solving crime, some of them quite interesting, although I am not sure crime solvers can really do some of the things that their television counterparts do on a weekly basis.

Numb3rs, to me, will always be about using Data Science to solve real world problems. If you haven't seen an episode, the whole series is on Netflix.


After all, isn't that what the data we work with on a daily basis represents? Something in the "real world" ?






2015-05-03

#DataTherapist

Do you need a #DataTherapist? 

A few weeks ago, I was enjoying lunch with some of my coworkers. We were discussing some of the use cases of Data Science and Business analysis that we are building for our various clients.

Somebody made the comment that some use-cases don't need a data scientist; they need a data therapist.

Many of us laughed, and I even made a Twitter comment about this. A number of other people on Twitter not only retweeted the comment, but began to make comments about the application of a #DataTherapist to particular use-cases.

Here are a few recent definitions that have evolved in the Twitterverse related to the data hashtag: #Data.

My definition of Data Science: the application of Statistical and Mathematical rigor to Business Data. There should be proper application of confidence intervals and p-values to data when making decisions. When doing some type of predictive analysis this becomes even more important. Data Scientists also do research with all of your business data, even adding third-party data as a source of enrichment, in order to best understand what has happened historically and what is likely to happen given a set of conditions in the future.


When doing historical reporting, the analyst is reporting on facts that have occurred that the data represents. This is usually your Data Warehouse, Business Intelligence type data product. These things are repeatable, predefined business questions.

A Data Architect designs how an infrastructure should be built from the servers to the data model, and works with other Architects to ensure things are streamlined to meet the goals and objectives of the organization.

#DataOps are the people that make the Data Architect's vision happen. They not only keep the lights on, but also make it easy for the Data Scientist, Business Analysts, Application Developers, and all other manner of people that need to work with an enterprise's Data Products to do so.


But what is a Data Therapist?

What started out as a "joke" may truly be something important.

So here goes. An initial definition of a #DataTherapist.

A Data Therapist is the person, or group of people, that shows the way for not only what Could be done with an enterprise's data, but what Should be done with the enterprise's data. They scope out data projects, and work with the executive team to clear the runways for the Data Architect, Data Scientist, Business Analyst, and the DataOps team to get their jobs done.


Not all data problems require a Data Scientist.

Not all Data Science problems are Big Data problems.

Not all Big Data problems are Data Science problems.

Data, regardless of structure, use-case, application, or data product supported will continue to grow. The leaders in this space continue to push the limits on what can be done with Python, R, Spark, SQL, and other tools.

The Data Therapist shines a light on what can be done, and whether your organization should undertake a large scale data project. They also can tell you when you need to stop doing something and simply start anew.


Are you ready to listen to the #DataTherapist?

You may not always like what they have to say.

Please comment below about how you think a Data Therapist can help an organization.

2011-08-15

Steps to successful adoption of a new data warehouse

What is taking so long to get the data warehouse ready?

In a new deployment of a data warehouse there are many infrastructure components that have to be put in place: modeling tools, ETL servers, ETL processes, BI servers and BI interfaces, and finally reports and dashboards. Not to mention sessions for user interviews, business process review, and metadata capture.

I say server(s) because there should be dev/test and prod platforms for each of these.

A recent article at Information-management.com talks about data modeling taking too much time if done correctly.

Add all of these things together and you have a significant period of time to wait before seeing a benefit to a Data Warehouse/Business Intelligence project.

Here are some suggestions to reassure the stakeholders early on during the project lifecycle.



Give them data early and often.


     Put together a small and simple data model for the first pass. Load the small star schema with a subset of the data relevant to a group of business users, then create some reports or give some power users access to create their own reports.

    This shows the concept of continuity. A continuity test in electronics checks an electrical circuit to see whether current flows, that is, whether it is a complete circuit.
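A first-pass star schema can be tiny and still demonstrate that continuity end to end. This sketch uses an in-memory SQLite database; the table and column names are invented for illustration:

```python
import sqlite3

# A minimal first-pass star schema: one fact table, two dimensions.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day  TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "widget"), (2, "gadget")])
db.executemany("INSERT INTO dim_date VALUES (?, ?)",
               [(1, "2011-08-01"), (2, "2011-08-02")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 1, 10.0), (1, 2, 15.0), (2, 1, 7.5)])

# The "continuity test": does a business user's report flow end to end?
rows = db.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 7.5), ('widget', 25.0)]
```

If this query returns sensible numbers, current is flowing through the circuit, and power users can start building their own reports against it.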

Show the data quality issues


  "A problem well stated is a problem half solved." Without seeing data quality issues, the people that enter data into the system of record cannot fix them.


Get and give feedback often


   As soon as people start using the "prototype", you will get feedback. Use this as an opportunity to explain why the process should take longer. It also identifies gaps in understanding among the team. Once people have a hands-on view of the presentation layer they will try a number of things.

They will use it to answer questions they already have answers to, thus validating the transformation processes.

They will also start to try to answer questions they may not have asked before. This is the best opportunity for learning more about how the data is being used.

These steps lay the foundation for making data work for you and your business.







2011-03-02

Data Management Industry?

How important is an industry classification for data management professionals?

I have been asked the question: What is your industry?

My reply, when given the option, is to say the Data Management Industry.

The particular vertical market my company classifies itself under, whether by the Dun and Bradstreet industry classification, the Standard Industrial Classification (SIC), or even the North American Industry Classification System (NAICS), has a limited impact on my day-to-day duties.

Some industries have a more stringent requirement for data quality or data availability than others, but overall the manner in which data is managed between industries is consistently similar.

In every industry I have worked the same process is generally followed.

Data is captured


In Telecommunication, Energy, and Supply Chain these systems usually capture data automatically via a field device such as a switch or a sensor; some are driven based on orders and some on consumer behavior.

In Retail and ECommerce the source data capture component is a customer facing system such as a web site or scanner for checking out at a grocery store.

Most companies have a human resources system that keeps track of time for customer facing employees, tracking contextual information such as: when did an employee arrive, what did they work on, when did they leave?

Data is integrated


Once we have the source data and as much contextual information about it as can be captured, that data is transferred to another system. This system could be a billing, payroll, time keeping, or analytical system, such as a data warehouse or a data mart. The methods used to create this integration system can vary depending on the number of source systems involved and the requirements for cross referencing the data from one system with the data in other systems.

At times certain data points are transferred outside the organization. This data could be going to suppliers, vendors, customers or even external reporting analysts.

Internally each department within an organization needs to see certain data points. Marketing, Merchandising, Finance, Accounting, Legal, Human Resources, Billing, to name a few do not necessarily need to see all of the data captured. However the data they do require does need to get to them in a timely manner in order for these departments to support the overall organization.

Data is protected

During all of these data interchanges the same types of functions need to be performed.

Data must be secured (the users that need access have it, those that do not need access cannot see it), backed up, restores tested and verified, performance must be optimized, problems need to be addressed when they arise, quality must be maintained, and delivery must be verified.

Data Management Professionals

The Data Management profession consists of people striving to create, implement, maintain and evolve best practices for managing the data that runs our enterprise.

The challenges of data management, the problems we solve and the solutions we provide are more similar than they are different.

Is the industry classification of the enterprise we support all that important?
