
Showing posts with label Business Intelligence. Show all posts

2020-06-12

Question The Answers.

Some time ago I wrote an entry on the difference between Data Science and Business Intelligence: https://bit.ly/DataSciencevsBusinessIntelligence

I recently came across this quote:

“Advances are made by answering questions. Discoveries are made by questioning answers.” —Bernard Haisch

I think there is a relationship between this quote and that previous post.

In essence, what I was attempting to say was that Business Intelligence is generally a process that your data flows through, one that enriches application data and prepares it so that it can be used to answer questions. These questions may be simple:

  1. How many widgets did this business unit produce last quarter? 
  2. How many did that business unit sell last quarter? 
  3. Which sales person sold what percentage last quarter? 
  4. What is the recurring cost of this Customer? 
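
Each of these maps to a simple aggregate query once the data is prepared. As an illustrative sketch of the first question (the table and column names here are hypothetical, not from any real system):

```python
import sqlite3

# Hypothetical production data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (business_unit TEXT, quarter TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO production VALUES (?, ?, ?)",
    [("East", "2020-Q1", 120), ("East", "2020-Q1", 80), ("West", "2020-Q1", 200)],
)

# "How many widgets did this business unit produce last quarter?"
rows = conn.execute(
    "SELECT business_unit, SUM(units) FROM production "
    "WHERE quarter = '2020-Q1' GROUP BY business_unit ORDER BY business_unit"
).fetchall()
print(rows)  # → [('East', 200), ('West', 200)]
```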

These are all important questions. However, this same data should also feed any predictive effort. If your Data Science and Business Intelligence efforts use the same data, then as you chart new territory through Data Science, your Business Intelligence platform will assist in showing the value of the Data Science effort.

These two sides of the same coin can and should be complementary.

Business Intelligence will drive your business forward; Data Science will show you the direction you should go.





2017-01-05

What is the performance relationship between a Database and a Business Intelligence Server?



This article will be a bit long; it covers a complicated topic that I have been studying for quite some time. 

I have run across the need to explain this topic on a number of occasions, and over time my explanations have hopefully become clearer and more succinct.

The concept that will be discussed here is the performance relationship between a database server and a business intelligence server in a simple data mart deployment. 

Rarely are data mart deployments simple, but the intention is for this to be a reference article for understanding the relationships between server needs and performance footprints under some of the various scenarios experienced during the lifecycle of a production deployment.

Here is a simple layout of an architecture for a DataMart:

A very basic image: the D is the database server, the F is the front-end, the U's are the users, and they are all connected via the network.


To be precise, this architecture represents ROLAP (Relational Online Analytical Processing) built on top of a dimensional model (star schema) implementation. The dimensional model is assumed to be populated and kept current by totally separate ETL processes that are not represented in this diagram.

The “D” represents any database server: Oracle, MySQL, SQL Server, DB2, whichever infrastructure you or your enterprise has chosen.

The “F” represents any front-end business intelligence server that is optimized for dimensional model querying and is supported by a server instance: Business Objects, Cognos, Pentaho Business Analytics, Tableau Server. The desktop-specific BI solutions do not fit in this reference model for reasons we shall see shortly.

In my early thoughts on the subject, I envisioned that the performance relationship in a properly done DataMart would be something like this: 







This is a good representation of what happens. 

On the left side of the chart we have the following scenario.

When the front-end server, responding to a user interaction, sends a request back to the database for aggregated data like "show me the number of units sold over the last few years," one could imagine the query being something like:

Select dd.Year, Sum(fo.total_sold) from Fct_orders fo inner join Dim_Date dd on fo.date_key = dd.date_key group by dd.Year

The dutiful database does the aggregation. Provided all of the statistics on the data are current, a short read takes place, and more CPU and Memory but less Disk I/O is used to do the calculation.

In the graph this is represented by the high red-line on the upper left.

The results returned to the front end are small: a single record per year of collected data.
The CPU and Memory load on the front-end server is tiny, shown in green on the lower left.


On the right side of the chart we have the following scenario.

When the front-end server, responding to a user interaction, sends a request back to the database for non-aggregated data like "show me all of the individual transactions taking place over the last few years," one could imagine the query in this case being something like:

Select fo.*, (dimensional data) from Fct_orders fo inner join (all connected dimension tables)

In this case the database server has little option but to do a full table scan of the raw data and return it.

In the graph this is represented by the lower red line on the right (more Disk I/O, less CPU and Memory); the data is then returned to the business intelligence server.

Our front-end server will have to do some disk caching, as well as lots of processing (CPU and Memory), to handle the load just given to it, not to mention things like pagination and possibly holding record counters to keep track of which rows the user has or has not seen, among other things.
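
The asymmetry between the two scenarios can be sketched with a toy example. Here an in-memory SQLite database stands in for "D" (the schema follows the star-schema naming above; the data is fabricated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER)")
conn.execute("CREATE TABLE fct_orders (date_key INTEGER, total_sold INTEGER)")
# Three years of dates and 10,000 fabricated order rows.
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(i, 2014 + i // 365) for i in range(3 * 365)])
conn.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                 [(i % (3 * 365), 1) for i in range(10_000)])

# Aggregated request: only one row per year travels over the wire.
agg = conn.execute(
    "SELECT dd.year, SUM(fo.total_sold) FROM fct_orders fo "
    "JOIN dim_date dd ON fo.date_key = dd.date_key GROUP BY dd.year"
).fetchall()

# Non-aggregated request: every transaction comes back.
detail = conn.execute(
    "SELECT fo.*, dd.year FROM fct_orders fo "
    "JOIN dim_date dd ON fo.date_key = dd.date_key"
).fetchall()

print(len(agg), len(detail))  # a handful of rows vs. the full fact table
```

The front-end server's burden scales with `len(detail)`, which is why the right side of the chart shifts the load from the database to the BI server and the wire.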

This graph seems to summarize the relationship between the two servers rather nicely. However, something is missing.

I had to dwell on this image for some time before I was able to think of a way to visualize the thing that is missing.

The network.

And even then there are at least two parts to the network.

The connection between the front-end server and the database, followed by the connection between the front-end server and all of the various users.

Each of these has a different performance footprint.

Representing the database performance, Front-End performance, and network performance for both the users and the system connections is something with which I continue to struggle.

Here is the image I have recently arrived at:






This chart needs a little context to understand the relationships between the 4 quadrants. 

Quadrant I is the server network bandwidth. In a typical linear relationship, as the size of the data moving from the database to the front end increases, the server network bandwidth increases.

Quadrant II is the database performance relationship between CPU/Memory and Disk I/O for a varying query workload. For highly aggregated queries, the CPU and Memory usage increases, and the Server Network Bandwidth is smaller because less data is being put on the wire. For less aggregated data and more full data transfers, the Disk I/O is higher, Memory usage is lower, and, back in Quadrant I, the Server Network Bandwidth is higher.

Quadrant III is the Front-End server performance, comparing CPU/Memory and Disk I/O when dealing with a varying volume of data. As the data coming from the database increases, more resources and caching are needed on this server.

Quadrant IV is the User Network Bandwidth; this is the result of the front-end server responding to the requests from the users. As the number of users increases, the volume of data increases and more of a load is put on the front-end server. Likewise, the bandwidth increases as more data is being provided to the various users.
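
As a back-of-envelope illustration of Quadrant I, the wire bandwidth scales roughly linearly with the volume of data returned. All numbers below are made up for illustration only:

```python
def server_bandwidth_mbps(rows_returned, bytes_per_row, seconds):
    """Approximate wire bandwidth between database and front end
    for a result set of a given size (Quadrant I: roughly linear
    in the volume of data returned)."""
    return rows_returned * bytes_per_row * 8 / seconds / 1_000_000

# An aggregated result (a few rows) vs. a full detail pull.
agg = server_bandwidth_mbps(rows_returned=5, bytes_per_row=64, seconds=1)
full = server_bandwidth_mbps(rows_returned=1_000_000, bytes_per_row=64, seconds=60)
print(round(agg, 4), round(full, 1))  # the detail pull dominates the wire
```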

This image is an attempt to show the interactions between these 4 components.

The things that make this image possible are a well-designed dimensional model, a rich semantic layer with appropriate business definitions, and common queries that tend to be repeated.

This architecture can support exploratory analysis; however, the data to be explored must be defined and loaded up front. Exploratory analysis to determine which data points need to be included in the data mart should be done in a separate environment.

I created all three of these images in R using igraph and ggplot2 with anecdotal data. The data shown in this chart is not sampled but is meant as a representation of how these four systems interact. Having monitored many platforms supporting this architecture, I know that no production system will show these rises and falls exactly the way this representative chart does.

However, understanding that at their core they should interact this way gives a pointer to where a performance issue may be hiding in your architecture if you are experiencing problems. The other use-case of this image is as an estimation tool for designing new solutions.

All that being said, much of this architecture may be called into question by new tools.

Some newer systems (Hadoop, Snowflake, Redshift) actually change the performance dynamics of the database component.

The Cloud concept has an impact on the System Bandwidth component. If you have everything in the cloud, then in theory the bandwidth between the database server and the front-end server should be managed by your cloud provider. There may need to be VPC peering if you set them up in separate regions.

If these are being run within a self-managed data center, should the connection between the database server and the front-end server be on a separate VLAN or switch? Perhaps.
Does the front-end server use separate connections for the database querying interface and the user-facing interface? Should it?

Do you need more than one front-end server sitting behind a load balancer? How many users can one of your front-end servers support? What are the recommended limits from the vendor? Should data partitioning and dedicated servers per business unit be used to optimize performance for smaller data?
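
Questions like the last few can at least be bounded with simple arithmetic before a detailed design. A minimal sketch, where the per-server user limit is a placeholder for whatever your vendor actually recommends:

```python
import math

def front_end_servers_needed(concurrent_users, users_per_server):
    """Minimum number of front-end servers behind a load balancer,
    given a per-server concurrency limit (use your vendor's figure)."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    return math.ceil(concurrent_users / users_per_server)

print(front_end_servers_needed(450, 200))  # → 3
```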

These are the types of questions that arise when looking at the bigger picture, specifically when you are doing data systems design and architecture. This requires a slightly different touch than application systems design and architecture.

Thinking about applying this diagram in your own enterprise will hopefully give insight into your own environment.

Can you think of a better way to diagram this relationship? Let me know.
The code and text are posted here. 






2015-05-03

#DataTherapist

Do you need a #DataTherapist? 

A few weeks ago, I was enjoying lunch with some of my coworkers. We were discussing some of the use cases of Data Science and Business Analysis that we are building for our various clients.

Somebody made the comment that some use-cases don't need a data scientist; they need a data therapist.

Many of us laughed, and I even made a Twitter comment about this.  A number of other people on Twitter not only retweeted the comment, but began to make comments about the application of a #DataTherapist to particular use-cases.

Here are a few recent definitions that have evolved in the Twitterverse around the #Data hashtag.

My definition of Data Science: the application of statistical and mathematical rigor to business data. There should be proper application of confidence intervals and p-values to data when making decisions; when doing some type of predictive analysis this becomes even more important. Data Scientists also do research with all of your business data, even adding third-party data as a source of enrichment, in order to best understand what has happened historically and what is likely to happen in the future given a set of conditions.


When doing historical reporting, the analyst is reporting on facts that have occurred that the data represents. This is usually your Data Warehouse, Business Intelligence type data product. These things are repeatable, predefined business questions.

A Data Architect designs how an infrastructure should be built from the servers to the data model, and works with other Architects to ensure things are streamlined to meet the goals and objectives of the organization.

#DataOps are the people that make the Data Architect's vision happen. They not only keep the lights on, but also make it easy for the Data Scientists, Business Analysts, Application Developers, and all other manner of people that need to work with an enterprise's data products to do so.


But what is a Data Therapist?

What started out as a "joke" may truly be something important.

So here goes. An initial definition of a #DataTherapist.

A Data Therapist is the person, or group of people, that shows the way for not only what Could be done with an enterprise's data, but what Should be done with the enterprise's data. They scope out data projects and work with the executive team to clear the runways for the Data Architect, Data Scientist, Business Analyst, and the DataOps team to get their jobs done.


Not all data problems require a Data Scientist.

Not all Data Science problems are Big Data problems.

Not all Big Data problems are Data Science problems.

Data, regardless of structure, use-case, application, or data product supported will continue to grow. The leaders in this space continue to push the limits on what can be done with Python, R, Spark, SQL, and other tools.

The Data Therapist shines a light on what can be done, and whether your organization should undertake a large scale data project. They also can tell you when you need to stop doing something and simply start anew.


Are you ready to listen to the #DataTherapist?

You may not always like what they have to say.

Please comment below about how you think a Data Therapist can help an organization.

2011-08-01

3 Great Reasons to Build a Data Warehouse

Why should you build a Data Warehouse?

What problems do a Data Warehouse and Business Intelligence platform solve?

There are strong debates about the methods chosen for building a data warehouse or choosing a business intelligence tool.

(Image: Data Warehouse Overview, via Wikipedia)


Here are three great reasons for building a data warehouse.

Make more money


The initial cost of building a data warehouse can appear large. However, what is the cost in time for the people that are analyzing the data without a data warehouse? Ultimately each department, analyst, or business unit is going through a similar process of getting data, putting it in a usable format, and storing it for reporting purposes (ETL). After going through this process they have to create reports, prepare presentations, and perform analysis. The immediate time-savings benefit comes to these folks, who no longer have to worry about finding the data once the data warehouse platform is built.

The following two points also allow you to make more money.


Make better decisions


In order to better know your customers, you must first better understand what they want from you. Once the people that spend most of their time analyzing the data no longer have to spend so much time finding the data, and can instead focus their time on reviewing the data and making recommendations, the speed of decision making will increase. As better decisions are made, more decisions can be made faster. This increases agility, improves response time to the customer or environment, and strengthens decision-making processes.

Once a decision making platform is built you can better see which type of customer is purchasing what type of product. This allows the marketing department to advertise to those types of customers. The merchandising department can ensure products are available when they are wanted. Purchasing can better anticipate getting raw materials so products are available. Inventory can best be managed when you are able to anticipate orders, shortages, and re-orders.

Make lasting impressions.



Customer service is improved when you better understand your customer. When you can recommend other products that your customers may like, you become a partner to your customer. Amazon does an amazing job of this: their recommendation engine is closely tied to their historical data and pattern matching of which products are similar. Likewise, you may want to tell a customer that they may not want something they are about to purchase because a better solution is available. This makes a lasting impression that you are the one to help them in their decision-making process.

Make data work


Building a data warehouse platform is one of the best ways to make data work for you, rather than you having to work for your data.


2011-07-25

Datagraphy or Datalogy?

What is the study of data management best practices?

Do data management professionals study Datagraphy, or Datalogy?


A few of the things that a data management professional studies and applies are:
  • Tools
    • Data Modeling tools
    • ETL tools
    • Database Management tools
  • Procedures 
    • Bus Matrix development
    • User session facilitation
    • Project feedback and tracking
  • Methodologies 
    • Data Normalization
    • Dimensional Modeling
    • Data Architecture approaches


These, among many others, are applied to the needs of the business. Our application of these best practices makes our enterprises more successful.


What should be the suffix of the word that sums up our body of knowledge?

Both "-graphy" and "-logy" make sense, but let's look at these suffixes and their meanings.


-graphy

The wiki page for "-graphy" says: -graphy is the study, art, practice or occupation of...

The dictionary entry for "-graphy" says: "a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"


-logy

The wiki page for "-logy" says: -logy is the study of (a subject or body of knowledge).

The dictionary entry for "-logy" says: a combining form used in the names of sciences or bodies of knowledge.


Data

The key word that we all focus on is data. 

In a previous blog entry, I wrote a review of the DAMA-DMBOK, which is the Data Management Association's Data Management Body Of Knowledge.


Data Management professionals study and contribute to this body of knowledge. As a data guy, I am inclined to study the works of those who have gone before. I want to both learn from their successes and avoid solutions that have been unsuccessful.


Some of the writings I study are by people like Dan Linstedt, Len Silverston, Bill Inmon, Ralph Kimball, Karen Lopez, William McKnight, and many others.

I have seen first hand what happens to a project when expertise from the body of knowledge produced by these professionals has been discarded. It is not pretty. 


Why do I study these particular authors? These folks share their experiences. When I face an intricate problem, I research some of their writings to see what they have done. Some tidbit of expertise they have written about has shed light on many a problem I have faced, helping me find the solution that much sooner.


When I follow their expertise my solutions may still be unique, but the solutions fit into patterns that have already been faced. I am standing on the shoulders of giants when I heed their advice. 


When I am forced to ignore their advice, I struggle, fight and do battle with problems that either should not be solved or certainly not be solved in the manner in which I am forced to solve them. 


Should the study of and contribution to the body of knowledge of data management be called data-graphy or data-logy? 


Datagraphy

The term Datagraphy sums up the study of the data management body of knowledge succinctly.

I refer back to the dictionary definition of the suffix "-graphy": "a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"

Data is recorded, described, written down, written about, represented (in many ways), and used as a source for many drawings and graphical representations.


What do you think? I will certainly be using Datagraphy.

2011-07-23

Data is killing us!

Are you drowning in Data?

You have a number of applications collecting various pieces of data in order to run your business. What do you have to do in order for an analyst to make an informed decision?

For the majority of your business operations, dashboards should show current activity. Thresholds can be established for when a particular event takes place and alerts sent automatically. Simulations can be run based on past performance to gauge or even predict the performance of what-if scenarios.

All of these things can be done; the question is: are they being done?
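
The threshold-and-alert idea is straightforward to express in code. A minimal sketch, with hypothetical metric names and a plain list standing in for a real alerting channel:

```python
def check_thresholds(metrics, limits):
    """Return alert messages for every metric exceeding its limit."""
    alerts = []
    for name, value in metrics.items():
        limit = limits.get(name)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {name}={value} exceeds threshold {limit}")
    return alerts

# Hypothetical example metrics and limits.
alerts = check_thresholds(
    {"daily_returns": 42, "open_tickets": 7},
    {"daily_returns": 25, "open_tickets": 50},
)
print(alerts)  # → ['ALERT: daily_returns=42 exceeds threshold 25']
```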

(Image: EMC Symmetrix DMX1000 Disk Array, via Wikipedia)

Are there so many copies of your application databases that the cost of servers, disk arrays, and storage is going through the roof?


Are multiple people required to keep track of which backups and restores are done on a nightly basis, driving personnel costs up?


Are business analysts spending more time collecting data than understanding, interpreting and making recommendations, reducing efficiency?


There is a better way.

A person who studies the practices of data management and the applicability of the various data management tools, procedures, or methodologies to the needs of the business can make a difference in the use of an organization's data.

This difference can be measured in many ways. It could be an increase in revenue because a relationship was found in the data that could not have been seen before a new business intelligence system was deployed. It could be cost savings of physical equipment.

More often it is the saving of personnel time associated with gathering data just to answer questions.

Some proponents of vendor solutions will suggest that they have all of the answers to your data needs. Perhaps some vendors do have solutions. However, bringing in a vendor solution will not relieve an organization of the responsibility of data management.

The best way to work with vendors is to get them to fully understand all of the pain points associated with your data. No single vendor can solve all problems. Smart people with a vested interest in making your company successful will help you manage your data.


Proliferation of data makes an organization stronger. If data is killing you, then you need someone to tame the beast and make data work for you.

Make your data work for you, rather than you work for your data.

Who are the people that will make your data work for you? A database administrator is a good start; many I have spoken to have plenty of ideas for how to make things better.

A data architect is the best start. Data Architects are the people that have studied data management best practices. A great Data Architect can quickly come to an understanding of your pain points and make recommendations that can be implemented quickly to make sure that data works for you.







2011-02-15

When is the Data Warehouse Done?

Is Data Warehouse development ever complete?

During the launch of a data warehouse project, there are schedules and milestones published for everyone to mark on their calendars. A good portion of these milestones are met, the data model is reviewed, development is done, data is loaded, dashboards are created, reports are generated, and the users are happy, right?

(Image: Data Warehouse Overview, via Wikipedia)
Well, one would hope.
Invariably, there is always one more question: how hard would it be to add this metric?

Sometimes it is just a matter of spinning out a new report or a new dashboard. Sometimes the question comes requiring data from an application that did not even exist when the data warehouse project was started. Now the architect has to go back and do integration work to incorporate the data source into the data warehouse; perhaps new modeling needs to be done, perhaps this requires some time for ETL development, and sometimes it is just some front-end business intelligence work that needs to be done.

Once that is deployed, does the data warehouse answer all questions for the enterprise? Can the project then be said to be complete, done, and over?

I think perhaps not.

Most data warehouse projects I have worked on have been released in phases. Once a phase is done and users are happy with it we move on to the next phase. Occasionally we have to go back and modify, for various reasons, things that we have already completed and put into production. Is it ever complete? Is it ever done?

I think a data warehouse requires no more modifications in only one case.

When the company no longer exists.

So long as the enterprise is vibrant and interacting with customers, suppliers, vendors, and the like, and so long as data comes in and goes out of the organization, development of the data warehouse will need to continue. It may not be as intense as at the beginning of the original project, but development will need to be done.

So long as the enterprise lives, the data warehouse lives and changes.
 



2011-02-11

Dynamic Situational Awareness and Multi Agent Systems

I just watched the exultation of the people of Egypt after the announcement that Hosni Mubarak relinquished the Presidency of Egypt to the Military. Egypt's revolution represents many things to many people, and analysts, historians, economists and others will be analyzing the cause, and the events that took place throughout the revolution for some time. The story is just beginning.

One thing that this reminded me of is a model I studied some time ago.

(Image: Multi agent, via Wikipedia)

A Multi-Agent System is a system composed of multiple interacting intelligent agents. Multi-agent systems can be used to solve problems which are difficult or impossible for an individual agent or monolithic system to solve. Borrowing from the definition:

Overview

The agents in a multi-agent system have several important characteristics:
  • Autonomy: the agents are at least partially autonomous
  • Local views: no agent has a full global view of the system, or the system is too complex for an agent to make practical use of such knowledge
  • Decentralization: there is no designated controlling agent (or the system is effectively reduced to a monolithic system)
Typically, multi-agent systems research refers to software agents. However, the agents in a multi-agent system could equally well be robots, humans, or human teams. A multi-agent system may contain combined human-agent teams.
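
A toy gossip simulation can illustrate the "Local views" characteristic. Each round, every informed agent contacts a fixed number of random peers; widening that fan-out (as social media effectively did) shortens the time to full awareness. All parameters here are illustrative only:

```python
import random

def rounds_to_full_awareness(n_agents, contacts_per_round, seed=42):
    """Rounds of gossip until every agent is informed, starting from one."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_agents:
        for _agent in list(informed):
            for _ in range(contacts_per_round):
                informed.add(rng.randrange(n_agents))
        rounds += 1
    return rounds

local = rounds_to_full_awareness(200, contacts_per_round=1)
broad = rounds_to_full_awareness(200, contacts_per_round=20)
print(local, broad)  # broader reach means faster full awareness
```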

What we as a community have seen is an example of a multi agent system that breaks one of these rules.

The rule of "Local views" did not apply to the crowd of people in Egypt. Thanks to Twitter, Facebook, YouTube, and cell phones, each person had access to the global view. This is an example of what is called situational awareness.

The new social media tools available to the communities of people in Egypt allowed them to have full situational awareness of their local environment as well as the environment in the country as a whole. Modeling this behavior will be something that I am sure many researchers will be working on for some time to come.

How will we see some of this translated into the business world? Dashboards, Mobile Business Intelligence based on clean, concise data systems that are fed from data warehouses or data marts can profoundly impact the capabilities of the people that use these systems. Quickly getting appropriate data to the appropriate person can completely change the ability of business analysts, financial planners, buyers, sellers, accountants and managers to do their daily jobs.

When faced with the question of how quickly delivering data to intelligence systems can impact the bottom line, the behavior of the people of Egypt shows how quick access to information can have an impact on an environment.

I hope that the people of Egypt build a stronger country and live in peace with their neighbors and the rest of the world. Peace loving people around the world congratulate you.







