
2017-01-05

What is the performance relationship between a Database and a Business Intelligence Server?



This article will be a bit long; it covers a complicated topic that I have been studying for quite some time. 

I have run across the need to explain this topic on a number of occasions, and over time my explanations have hopefully become clearer and more succinct.

The concept that will be discussed here is the performance relationship between a database server and a business intelligence server in a simple data mart deployment. 

Rarely are data mart deployments simple, but the intention is for this to be a reference article for understanding the relationship between server needs and performance footprint under some of the scenarios experienced during the lifecycle of a production deployment.

Here is a simple layout of an architecture for a data mart:

It is a very basic image: the D is the database server, the F is the front end, the U's are the users, and they are all connected via the network.


To be precise, this architecture represents ROLAP (Relational Online Analytical Processing) built on top of a dimensional model (star schema) implementation. The dimensional model is assumed to be populated and kept current by entirely separate ETL processes that are not represented in this diagram.
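For readers who want something concrete to picture, a minimal star schema of the kind assumed in this article might look like the following sketch. The Fct_orders and Dim_Date names appear later in this article; Dim_Product and the remaining columns are assumptions made purely for illustration.

-- One conformed date dimension (illustrative)
create table Dim_Date (
    date_key       integer primary key,   -- surrogate key
    calendar_date  date,
    year           integer
);

-- One product dimension (hypothetical, for illustration only)
create table Dim_Product (
    product_key    integer primary key,   -- surrogate key
    product_name   varchar(100)
);

-- Fact table with foreign keys to the dimensions
create table Fct_orders (
    order_key      integer primary key,
    date_key       integer references Dim_Date (date_key),
    product_key    integer references Dim_Product (product_key),
    total_sold     decimal(12,2)          -- additive measure
);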

The “D” represents any database server: Oracle, MySQL, SQL Server, DB2, whichever infrastructure you or your enterprise has chosen.

The “F” represents any front-end business intelligence server that is optimized for dimensional-model querying and is backed by a server instance: Business Objects, Cognos, Pentaho Business Analytics, Tableau Server. Desktop-specific BI solutions do not fit this reference model, for reasons we shall see shortly.

In my early thoughts on the subject, I envisioned that the performance relationship in a properly built data mart would be something like this:







This is a good representation of what happens. 

On the left side of the chart we have the following scenario.

Consider the case where the front-end server, responding to a user interaction, sends a request back to the database for aggregated data like “show me the number of units sold over the last few years.”

One could imagine the query being something like: Select dd.year, Sum(fo.total_sold) from Fct_orders fo inner join Dim_Date dd on fo.date_key = dd.date_key group by dd.year.
 
The dutiful database does the aggregation. Provided all of the statistics on the data are current, a short read takes place, and more CPU and memory but less disk I/O is used to do the calculation.
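As a side note, keeping those statistics current is usually a scheduled maintenance task, and the exact command depends on the engine. Two illustrative examples follow; the SALES_MART schema name is an assumption, not something from this article.

-- MySQL: refresh the optimizer statistics for the fact table
ANALYZE TABLE Fct_orders;

-- Oracle: gather statistics with DBMS_STATS (schema name is hypothetical)
EXEC DBMS_STATS.GATHER_TABLE_STATS('SALES_MART', 'FCT_ORDERS');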

In the graph this is represented by the high red line on the upper left.

The results returned to the front end are small: a single record per year of collected data.
The CPU and memory load on the front-end server is tiny, shown in green on the lower left.


On the right side of the chart we have the following scenario.

Now consider the case where the front-end server, responding to a user interaction, sends a request back to the database for non-aggregated data like “show me all of the individual transactions taking place over the last few years.”

One could imagine the query in this case being something like: Select fo.*, <dimension attributes> from Fct_orders fo inner join <each connected dimension table>.
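A fuller sketch of such a detail-level query might look like the following. The Dim_Product and Dim_Customer tables, their columns, and the date filter are assumptions for illustration; only Fct_orders and Dim_Date come from the example above.

Select fo.*,
       dd.calendar_date,
       dp.product_name,
       dc.customer_name
  from Fct_orders fo
  inner join Dim_Date     dd on fo.date_key     = dd.date_key
  inner join Dim_Product  dp on fo.product_key  = dp.product_key
  inner join Dim_Customer dc on fo.customer_key = dc.customer_key
 where dd.year >= 2014;   -- "the last few years"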

In this case the database server has little option but to do a full table scan of the raw data and return it.

In the graph this is represented by the lower red line on the right (more disk I/O, less CPU and memory); the data is then returned to the business intelligence server.

Our front-end server will have to do some disk caching, as well as plenty of processing (CPU and memory), to handle the load just handed to it, not to mention pagination and possibly holding record counters to keep track of which rows the user has or has not seen, among other things.
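For instance, a front end paging through its cached copy of that result set might issue something like the queries below. The cached_order_detail table, the page size of 50, and the LIMIT/OFFSET syntax are all assumptions (syntax varies by engine); this is only a sketch of the pagination idea.

-- Offset paging: fetch the third page of 50 rows
Select *
  from cached_order_detail
 order by order_key
 limit 50 offset 100;

-- Keyset paging: remember the last key the user has seen and continue from there
Select *
  from cached_order_detail
 where order_key > :last_seen_order_key
 order by order_key
 limit 50;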

This graph seems to summarize the relationship between the two servers rather nicely. However, something is missing.

I had to dwell on this image for some time before I was able to think of a way to visualize the thing that is missing.

The network.

And even then there are at least two parts to the network.

The connection between the front-end server and the database, followed by the connection between the front-end server and all of the various users.

Each of these has a different performance footprint.

Representing the database performance, front-end performance, and network performance for both the user and system connections is something with which I continue to struggle.

Here is the image I have recently arrived at:






This chart needs a little context to understand the relationships between the 4 quadrants. 

Quadrant I is the server network bandwidth. In a typically linear relationship, as the size of the data moving from the database to the front end increases, the server network bandwidth increases.

Quadrant II is the database performance relationship between CPU/memory and disk I/O for a varying query workload. For highly aggregated queries the CPU and memory usage increases, and the server network bandwidth is smaller because less data is being put on the wire. For less aggregated data and fuller data transfers, the disk I/O is higher, memory usage is lower, and back in Quadrant I the server network bandwidth is higher.

Quadrant III is the front-end server performance, comparing CPU/memory and disk I/O when dealing with a varying volume of data. As the volume of data arriving from the database increases, more resources and caching are needed on this server.

Quadrant IV is the user network bandwidth; this is the result of the front-end server responding to requests from the users. As the number of users increases, the volume of data increases and more load is put on the front-end server. Likewise, the bandwidth increases as more data is provided to the various users.

This image is an attempt to show the interactions between these 4 components.

The things that make this image possible are a well-designed dimensional model, a rich semantic layer with appropriate business definitions, and common queries that tend to be repeated.

This architecture can support exploratory analysis; however, the data to be explored must be defined and loaded up front. Exploratory analysis to determine which data points should be included in the data mart should be done in a separate environment.

I created all three of these images with R, using igraph and ggplot2, with anecdotal data. The data shown in this chart is not sampled; it is meant as a representation of how these four systems interact. Having monitored many platforms supporting this architecture, I know for a fact that no production system will actually show these rises and falls exactly the way this representative chart does.

However, understanding that at their core they should interact this way should give a pointer to where a performance issue may be hiding in your architecture if you are experiencing problems. The other use case of this image is as an estimation tool for designing new solutions.

All that being said, much of this architecture may be called into question by new tools.

Some newer systems (Hadoop, Snowflake, Redshift) actually change the performance dynamics of the database component.

The cloud concept has an impact on the system bandwidth component. If you have everything in the cloud, then in theory the bandwidth between the database server and the front-end server should be managed by your cloud provider. There may need to be VPC peering connections if you set them up in separate regions.

If these are being run within a self-managed data center, should the connection between the database server and the front-end server be on a separate VLAN or switch? Perhaps.
Does the front-end server use separate connections for the database querying interface and the user-facing interface? Should it?

Do you need more than one front-end server sitting behind a load balancer? How many users can one of your front-end servers support? What are the recommended limits from the vendor? Should data partitioning and dedicated servers per business unit be used to optimize performance for smaller data sets?

These are the types of questions that arise when looking at the bigger picture, specifically when you are doing data systems design and architecture. This requires a slightly different touch than application systems design and architecture.

Thinking about how to apply this diagram in your own enterprise will hopefully give you insight into your environment.

Can you think of a better way to diagram this relationship? Let me know.
The code and text are posted here. 






2016-01-17

Packaged reports versus integrated reporting.

Data Silo?

Software packages like SAP, PeopleSoft, or Kronos are dedicated to solving problems. They are full-featured packages that come with plenty of options to meet the needs of their customers.

However, a lot of people I speak with about these packages act as if they are the only packages in the enterprise. More often than not, these "all-encompassing" packages are not alone in the enterprise; there are other systems with which they need to interact.

A package is not alone

I was reminded of some of my discussions about this topic when I saw a toy commercial recently. It showed children playing with just the particular toy set being advertised. In reality, that one toy set is part of an entire room full of toys. Cleaning up all of these toys can become a challenge.

If you have ever helped a child find a part of a toy you will understand.

With a toy set, if you want to take a car from one set and play with that car on a new race set, you simply pick up the car and start playing with it. If it doesn’t quite fit, the child will use their imagination to make the car either a giant’s car or a tiny robot car looking for other tiny cars in that world.

For an application it’s not quite that simple. If you have a large volume of data in an older or legacy application and you want to use that data in your new application, you have to have a migration strategy requiring expertise in both applications as well as in data manipulation. If these applications use different data repositories, say one on Oracle and the other on SQL Server, that can make things more complicated.

If you want to use both applications at the same time, during a soft launch, for example, data integration becomes even more vital.

Data needs to be integrated

Constant feedback on performance is a vital business function. Seeing the performance of only one application will limit your vision in other areas. Data management professionals are the ones ultimately responsible for knowing where all of the “toys” are located, how to get to them, and how they are represented.


Proactive data management planning ensures that some of these problems never emerge. If every day, week, or month you have lots of IT people and business analysts looking for data, you need to rethink your strategy, because then you are working for your data rather than having your data work for you.



2015-05-07

PolyGlot Management

PolyGlot Data Management

 

What is the data stored in? 


A traditional database administrator is familiar with a set of utilities, SQL, and a stack of tools specific to the RDBMS that they are supporting.


SQL Server, SQL Server Analysis Services, SQL Server Integration Services, SQL Server Reporting Services, SQL Server Management Studio.

Not to mention the whole disk partitioning, SAN layout, RAID level and such that they need to know.

Oracle has its own set of peculiarities, as do DB2, MySQL, and others.

There is a new kid on the block.

NoSQL.

Martin Fowler does a phenomenal job introducing the topic of NoSQL in this talk.

While most people do not get into too many of the details about how their data is stored and structured, in the world of #DataOps the nature of how and where the data is stored becomes incredibly important.

On any given day managing a PolyGlot environment, the command-line tools used could include sqlplus, bcp, hdfs, spark-shell, or xd-shell.

Each of these data storage engines has different requirements. The manner in which databases are clustered differs greatly between RDBMS and NoSQL platforms.

Putting Hadoop or Cassandra on a SAN requires due thought before doing so.

Likewise, creating an environment with isolated disks for an Oracle cluster may not be the correct solution.


Managing a PolyGlot environment is by itself a challenge. Sometimes this requires a team; sometimes this requires a lightly shaded purple squirrel that is equally at home at the command line, the SQL prompt, a REPL, a console, a whiteboard, the AWS console in a browser, or a management console like Grid Control, Ops Center, or Cloudera Manager.

Working with this variety of data, and with the variety of teams and people that need access to this data or even a subset of it, requires its own level of understanding of the data, of data management, and of how to make the data itself work to contribute to the bottom line of the organization.

Are your purple squirrels only on the data science team? Probably not.



2014-03-28

Databases are pack mules: neither cattle nor puppies

"Treat your servers like Cattle, not puppies" is  mantra I have heard repeatedly in a former engagement where I was actively involved in deploying a cloud based solution with RightScale across Rackspace, AWS, and Azure.

The idea is that if you have a problem with a particular server, you simply kill it, relaunch it with the appropriate rebuild scripts (Chef or some other automation), have it reconnect to the load balancer, and you are off to the races.


I think there is one significant flaw in this philosophy.

Database servers, or data servers, are neither cattle nor puppies. They are, however, like pack mules.

If you have an issue with a data server, some systems like Cassandra have built-in rebalancing capabilities that allow you to kill one particular instance in a Cassandra ring, bring up a new one, and let it redistribute the data load. Traditional database engines like SQL Server, Oracle, MySQL, and others do not have this built-in capability.

Backups still need to be done, restores still need to be tested, and bringing up a new relational database server requires a bit of expertise to get right. There is still plenty of room for automation, scripting, and other capabilities.

That being said, database infrastructure needs to have a trained, competent database administrator overseeing its care and maintenance.

We have all seen movies where pack mules were carrying the adventurers' supplies.

If you take a pack mule on your adventures or explorations and it breaks its leg, you can get a new pack mule easily enough. However, you have to move everything the old pack mule was carrying over to the new one. If you don't have a new pack mule, or can't get one quickly, the adventurers themselves have to carry the supplies: the load is redistributed, priorities are set for what is truly needed to make it through the foreseeable next steps, and plans are made to find the local town to get a new pack mule.

Back in the present, database infrastructure is truly the lifeblood of our organizations. Trying to "limp" through, or simply "blowing away" our servers and rebuilding them, is an extreme philosophy. There are laws, regulations, and customer agreements regarding the treatment and protection of data that must be adhered to.

Who is taking care of your pack mules? Will your current pack mule make it over the next hill you have to climb?



2011-08-01

3 Great Reasons to Build a Data Warehouse

Why should you build a Data Warehouse?

What problems do a Data Warehouse and Business Intelligence platform solve?

There are strong debates about the methods chosen for building a data warehouse or choosing a business intelligence tool.
[Image: Data Warehouse Overview, via Wikipedia]


Here are three great reasons for building a data warehouse.

Make more money


The initial cost of building a data warehouse can appear to be large. However, what is the cost in time for the people who are analyzing the data without a data warehouse? Ultimately each department, analyst, or business unit goes through a similar process of getting data, putting it into a usable format, and storing it for reporting purposes (ETL). After going through this process they have to create reports, prepare presentations, and perform analysis. Once the data warehouse platform is built, the immediate time-savings benefit goes to these folks, who no longer have to worry about finding the data.

The following two points also allow you to make more money.


Make better decisions


In order to better know your customers, you must first better understand what they want from you. Once the people who spend most of their time analyzing the data no longer have to spend so much time finding it, and can focus their time on reviewing the data and making recommendations, the speed of decision making will increase. As better decisions are made, more decisions can be made faster. This increases agility, improves response time to the customer or environment, and strengthens decision-making processes.

Once a decision-making platform is built, you can better see which type of customer is purchasing which type of product. This allows the marketing department to advertise to those types of customers. The merchandising department can ensure products are available when they are wanted. Purchasing can better anticipate the raw materials needed so products are available. Inventory can best be managed when you are able to anticipate orders, shortages, and re-orders.

Make lasting impressions.



Customer service is improved when you better understand your customer. When you can recommend other products that your customers may like, you become a partner to them. Amazon does an amazing job of this: its recommendation engine is closely tied to its historical data and to pattern matching of which products are similar. Likewise, you may want to tell customers that they may not want something they are about to purchase because a better solution is available. This makes a lasting impression that you are the one to help them in their decision-making process.

Make data work


Building a data warehouse platform is one of the best ways to make data work for you, rather than you having to work for your data.


2011-07-25

Datagraphy or Datalogy?

What is the study of data management best practices?

Do data management professionals study Datagraphy, or Datalogy?


A few of the things that a data management professional studies and applies are:
  • Tools
    • Data Modeling tools
    • ETL tools
    • Database Management tools
  • Procedures 
    • Bus Matrix development
    • User session facilitation
    • Project feedback and tracking
  • Methodologies 
    • Data Normalization
    • Dimensional Modeling
    • Data Architecture approaches


These, among many others, are applied to the needs of the business. Our application of these best practices makes our enterprises more successful.


What should be the suffix of the word that sums up our body of knowledge?

Both "-graphy" and "logy" make sense, but let's look at these suffixes and their meaning.


-graphy

The wiki page for "-graphy"  says: -graphy is the study, art, practice or occupation of... 

The dictionary entry for "-graphy" says -"a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"


-logy

The wiki page for  "-logy"  says -logy is the study of ( a subject or body of knowledge).

The dictionary entry for  "-logy" says: a combining form used in the names of sciences or bodies of knowledge. 


Data

The key word that we all focus on is data. 

In a previous blog entry, I wrote a review of the DAMA-DMBOK, which is the Data Management Association's Data Management Body of Knowledge.


Data Management professionals study and contribute to this body of knowledge. As a data guy, I am inclined to study the works of those who have gone before. I want to both learn from their successes and avoid solutions that have been unsuccessful.


Some of the writings I study are by people like Dan Linstedt, Len Silverston, Bill Inmon, Ralph Kimball, Karen Lopez, William McKnight, and many others.

I have seen firsthand what happens to a project when expertise from the body of knowledge produced by these professionals has been discarded. It is not pretty.


Why do I study these particular authors? These folks share their experiences. When I face an intricate problem, I research some of their writings to see what they have done. Some tidbit of expertise they have written about has shed light on many a problem I have faced, helping me to find the solution that much sooner.


When I follow their expertise my solutions may still be unique, but the solutions fit into patterns that have already been faced. I am standing on the shoulders of giants when I heed their advice. 


When I am forced to ignore their advice, I struggle, fight and do battle with problems that either should not be solved or certainly not be solved in the manner in which I am forced to solve them. 


Should the study of and contribution to the body of knowledge of data management be called data-graphy or data-logy? 


Datagraphy

The term Datagraphy sums up the study of the data management body of knowledge succinctly.

I refer back to the dictionary definition of the suffix "-graphy": "a process or form of drawing, writing, representing, recording, describing, etc., or an art or science concerned with such a process"

Data is recorded, described, written down, written about, represented (in many ways), and used as a source for many drawings and graphical representations.


What do you think? I will certainly be using Datagraphy.

2011-03-02

Data Management Industry?

How important is an industry classification for data management professionals?

I have been asked  the question: What is your industry?

My reply, when given the option, is to say the Data Management Industry.

The particular vertical market that my company falls under according to the Dun & Bradstreet industry classification, the Standard Industrial Classification (SIC), or even the North American Industry Classification System (NAICS) has a limited impact on my day-to-day duties.

Some industries have a more stringent requirement for data quality or data availability than others, but overall the manner in which data is managed between industries is consistently similar.

In every industry I have worked the same process is generally followed.

Data is captured


In telecommunications, energy, and supply chain, these systems usually capture data automatically via a field device such as a switch or a sensor; some are driven by orders and some by consumer behavior.

In retail and e-commerce, the source data capture component is a customer-facing system such as a web site or the scanner at a grocery store checkout.

Most companies have a human resources system that keeps track of time for the customer-facing employees, tracking contextual information such as when an employee arrived, what they worked on, and when they left.

Data is integrated


Once we have captured the source data and as much contextual information about it as possible, that data is transferred to another system. This system could be a billing, payroll, timekeeping, or analytical system, such as a data warehouse or a data mart. The methods used to create this integration can vary depending on the number of source systems involved and the requirements for cross-referencing the data from one system with the data in other systems.

At times certain data points are transferred outside the organization. This data could be going to suppliers, vendors, customers or even external reporting analysts.

Internally, each department within an organization needs to see certain data points. Marketing, merchandising, finance, accounting, legal, human resources, and billing, to name a few, do not necessarily need to see all of the data captured. However, the data they do require needs to get to them in a timely manner in order for these departments to support the overall organization.

Data is protected

During all of these data interchanges the same types of functions need to be performed.

Data must be secured (the users that need access have it; those that do not need access cannot see it), backed up, and its restores tested and verified; performance must be optimized, problems must be addressed when they arise, quality must be maintained, and delivery must be verified.

Data Management Professionals

The Data Management profession consists of people striving to create, implement, maintain and evolve best practices for managing the data that runs our enterprise.

The challenges of data management, the problems we solve and the solutions we provide are more similar than they are different.

Is the industry classification of the enterprise we support all that important?


2011-02-17

Thought leadership

You have to be a thought leader in order to recognize one.

I hear the term thought leader bestowed upon people occasionally. I have even bestowed this term on some people that I consider to be extremely knowledgeable about building data warehouse systems.

The wealth of information for data management best practices continues to grow. Thought leaders can publish knowledge about solving a particular problem in a variety of forums now: blogs, books, articles, and even research papers. The sheer volume of information about the "best practices" is almost intimidating.

The ability to take in all of the information about best practices for a subject area, apply it to the situation at hand, consolidate the recommendations from multiple sources, and ignore those recommendations that are not applicable makes you a thought leader. Google provides a way of finding a site that answers a particular question. If a person does not ask the correct question, Google does not provide a good answer. Once Google finds a particular answer to a keyword query, you have to apply that answer to your particular situation.

Let us take a specific example.

The question should not be:

What is the best way to build A data warehouse?

The question should be:

What is the best way to build THIS data warehouse?

Even something as simple as applying a new SQL trick that you have learned to a specific problem you are working on shows the application of this knowledge. Best practices can be abstract, or even theoretical. When you can take recommendations from many sources and apply their expertise to your specific problem, you have taken a big step.

This can apply to many other professional areas: SEO, business analysis, business process re-engineering, ETL development, resume writing, financial analysis, online marketing, and so on.


If you can study multiple sources and apply their recommendations or findings to your own situation, you become a thought leader.

You become a recognized thought leader when you write about it. 





2011-02-15

When is the Data Warehouse Done?

Is  Data Warehouse development ever complete?

During the launch of a data warehouse project, schedules and milestones are published for everyone to mark on their calendars. A good portion of these milestones are met, the data model is reviewed, development is done, data is loaded, dashboards are created, reports are generated, and the users are happy, right?

[Image: Data Warehouse Overview, via Wikipedia]
Well, one would hope.
Invariably there is always one more question. How hard would it be to add this metric?

Sometimes it is just a matter of spinning out a new report or a new dashboard. Sometimes the question requires data from an application that did not even exist when the data warehouse project was started. Now the architect has to go back and do integration work to incorporate the new data source into the data warehouse; perhaps new modeling needs to be done, perhaps some time is required for ETL development, and sometimes it is just some front-end business intelligence work that needs to be done.

Once that is deployed, does the data warehouse answer all questions for the enterprise? Can the project then be said to be complete, done, and over?

I think perhaps not.

Most data warehouse projects I have worked on have been released in phases. Once a phase is done and users are happy with it, we move on to the next phase. Occasionally we have to go back and modify, for various reasons, things that we have already completed and put into production. Is it ever complete? Is it ever done?

I think a data warehouse requires no more modifications in only one case.

When the company no longer exists.

So long as the enterprise is vibrant and interacting with customers, suppliers, vendors, and the like, and so long as data comes into and goes out of the organization, development of the data warehouse will need to continue. It may not be as intense as at the beginning of the original project, but development will need to be done.

So long as the enterprise lives, the data warehouse lives and changes.
 



2010-12-27

DAMA-DMBOK book review

“What do you do?”

I am asked this question frequently. Family members, church friends, even recruiters and coworkers sometimes ask this question.

Depending on the audience, I will say something like “I work with computers,” or “I’m a DBA,” or “I’m a database developer.”

Dr. Richard Feynman once said: “If you can't explain something to a first year student, then you haven't really understood it.”

The DAMA – Data Management Body of Knowledge is a work that attempts to document and formalize the definition of the Data Management profession.

According to the book, a Data Management Professional is responsible for the planning, controlling and delivery of data and information assets.


The thing that impressed me the most is that it brings together formal definitions of so many of the various concepts that I work with on a daily basis. Whole books can be, and indeed have been, written on each component of data management touched on in the body of knowledge. One of the values of this book is the bibliography: if one were to acquire every book referenced in this work, one would have an impressive library of data management knowledge.

Another thing that was impressive to me is this book advocates the role of the Data Management Executive. The Data Management Executive is defined as: “The highest-level manager of Data Management Services organizations in an IT department. The DM Executive reports to the CIO and is the manager most directly responsible for data management, including coordinating data governance and data stewardship activities, overseeing data management projects and supervising data management professionals. May be a manager, director, AVP or VP.” I have worked with and in many organizations; very few actually had an “official” data management executive. As a result, data movement into and out of the organization has been something of a haphazard process. Each project that required movement of data was approached differently. If a single official point of contact for all data management activities existed, then these projects could have been more streamlined to fit into an overarching design for the enterprise as a whole.

Each chapter covers a different aspect of the overall Data Management profession. The first chapter gives an overview of why data is a corporate asset. The definition of data as a corporate asset is the foundation of all data management activities. With data established as an asset first, the follow-on activities discussed in the major component chapters can be seen as value-add activities.
[Image: Data Architecture diagram, via Wikipedia]
The major components of Data Management covered by the chapters and the definitions the DMBOK provides are:


Data Governance: The exercise of authority and control (planning, monitoring and enforcement) over the management of data assets. The chapter gives an overview of the data governance function and how it impacts all of the other functions. Data Governance is the foundation for the other functions.

Data Architecture: An integrated set of specification artifacts used to define data requirements, guide interaction and control of data assets, and align data investments with business strategy.

Data Development: The subset of project activities within the system development lifecycle (SDLC) focused on defining data requirements, designing the data solution components and implementing these components.

Data Operations Management: The development, maintenance and support of structured data to maximize the value of the data resources to the enterprise. Data operations management includes two sub-functions: database support and data technology management.

Data Security Management: The planning, development and execution of security policies and procedures to provide proper authentication, authorization, access and auditing of data and information assets.

Reference and Master Data Management: The ongoing reconciliation and maintenance of reference data and master data.

Data Warehouse and Business Intelligence Management: This is a combination of two primary components. The first is an integrated decision support database. The second is the related software programs used to collect, cleanse, transform, and store data from a variety of operational and external sources. Both of these parts combine to support historical, analytical, and business intelligence requirements.

Document and Content Management: The control over capture, storage, access, and use of data and information stored outside relational databases. Document and Content Management focuses on integrity and access. Therefore, it is roughly equivalent to data operations management for relational databases.

Meta-data Management: The set of processes that ensure the proper creation, storage, integration, and control needed to support the associated usage of meta-data.

Data Quality Management: A critical support process in organizational change management. Changing business focus, corporate business integration strategies, and mergers, acquisitions, and partnering can mandate that the IT function blend data sources, create gold data copies, retrospectively populate data or integrate data. The goals of interoperability with legacy or B2B systems need the support of a DQM program.


The last chapter covers professional development, ethics, and how DAMA (the Data Management Association) provides a professional society or guild for the communal support of information and data management professionals.

Overall this is an outstanding book for defining the roles associated with data management. While it is light on details for implementing the programs, processes and projects that it defines, it is nevertheless a great book for creating a common vocabulary amongst professionals who work day-to-day in the data management profession.

The more we, as data management professionals, communicate consistently with business users, executives, and the public about what we do, the better it will be for all of us when one of us is asked “what we do”.

My answer now is: I am a Data Management Professional. I can assist you with better understanding, delivery, analysis, security, and integrity of your data.




2010-12-01

Data as an Enterprise Asset

From Wikipedia, an asset is: "Anything tangible or intangible that is capable of being owned or controlled to produce value and that is held to have positive economic value is considered an asset."

Data is the most valuable enterprise asset in existence. The recent release of documents on the WikiLeaks site is a prime example of this. What will be the final cost associated with the release of these documents? How many man-hours will be devoted to changing procedures, implementing new security protocols, and recovering the loss of face suffered by many government agencies?


Your address book


What would happen if you were to lose your phone?

You would just replace it right?

What about your address book?

How many people keep their contacts in multiple locations for "safekeeping"?

You want to make sure that you keep your contacts regardless of what happens to your phone, right? Data management professionals feel the same way about the data that they safeguard.

The CEO view

It is 11:43 p.m. on a Friday night. The alcohol from the dinner meeting with investors won’t wear off for a few more hours. You should be fine for the 7:12 a.m. tee time with the next group of potential clients. When the phone rings you just curse and pick it up.

“What!” you yell into the phone.

“Hey boss “, you hear the head of your IT department.

“Listen; there is no easy way to say this. In the storm that we had earlier tonight, we took a handful of lightning strikes and had a tornado touch down on the building itself. The lightning strikes then caused a fire that wasn’t caught until it was too late. The building is pretty much destroyed.

We have already updated DNS to our DR site. Some of the DBA's and server admins are on the way there. Our main network guy is unavailable since he is out of town. We are supposed to have our backup tapes there in a few hours. The server guys will get our servers back up, the DBA's will restore the databases and validate where we are with the data."

What do you do?

If you trust the DR plan and your DBA's, then you can go back to sleep.

Would you sleep well?

How valuable is your data now that you don't know whether you have it or not?

Valuating your data

One way to determine the value of your data is to identify the direct and indirect business benefits derived from use of the data. Another way is to calculate the cost of its loss; what would be the impact of not having the current quality level of your data or the amount of data you have?

What if you only lost a years' worth of data?

What change to revenue would occur?

What would be the cost to recover it? Man-hours, and potentially consultant hours if you have to hire outside expertise, would factor into the costs.

Data Management Professionals protect your Assets


Data Management Professionals are the ones that protect your data assets. By protecting and safeguarding your data assets, they are protecting and safeguarding the enterprise itself.

2010-11-23

Data never dies

Do you ever wish you could forget something?

Certainly there are traumatic events in some of our lives that we may wish that we could forget; more often than not most people wish they could remember something.

A taste, a name, the name of a restaurant, these are all things that some of us try very hard to remember at times.

For computer data management it is a different story. I have been in many discussions where we question how far back in time we are required to retain data.

By formal definition this is called the data retention policy of an organization. There are laws in place that require certain records to be retained for 7 years. Some data must be kept indefinitely.

This simple term: “Data Retention Policy” can have an impact on software purchases, hardware purchases, man-hours and processing time for an analytics application.

Why is my application taking so long?


As the volume of data within an application grows, the performance footprint of the application will change. Things that previously ran fast will begin to run slowly. I once worked for an application vendor and handled more data management issues than software development issues. On one particular occasion, shortly after I started there, I received a call about the application not working. Upon review of the environment, where “nothing had changed,” I discovered the problem: the SQL Server database was severely underpowered. Simply executing the SQL manually through Query Analyzer showed dreadful performance.

We had to recommend an upgrade in hardware. Once the customer purchased new hardware I took a team to the customer site and we did a migration of the data from the old server to the new server. When I left the site, I heard numerous people talking about how the application had not worked that well since it had been freshly installed.

A simpler answer might have been to “archive” the data, cleaning it out so that performance returned to a fresh state, or even just to delete it. The reason we could not do that is that this particular application was a time-tracking system for recording the time-in and time-out of employees working at a refinery. Employee data is not something that should just be purged, especially data that directly impacts how contractors and employees are paid.

The data would be studied for some time to report on cost expenditures for the site where the work was recorded.

But simply upgrading the back-end database server was really only a short-term solution. This is a case where we kept all of the historical data within the application itself for reporting and study.

Reporting systems can help


As a data warehouse engineer, I would now suggest an alternative solution: that “warm” data be moved to a reporting application for reporting and study.

A threshold should be established for what is useful within the application itself: the data that is pertinent and needed on a daily and weekly basis. This is the “hot,” fresh data that is currently being processed. The data that is important for business continuity and for reporting to auditors, vendors, other business units, and executives does not necessarily need to be kept within the application itself. We should have spun off a reporting system that retained that data and allowed study and reporting without bogging down the application itself.

Building specific reporting systems is essential to maintaining optimal application performance. By offloading this data into an operational data store, data mart, or data warehouse, you keep your “hot” data hot and fresh, and your “warm” data remains available for use in an environment that does not interfere in any way with the day-to-day work of your business.
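A minimal sketch of that kind of offload follows, assuming a hypothetical time_entries application table, a separate reporting schema, and a 90-day "hot" window; none of these names or numbers come from the system described above, and date arithmetic syntax varies by engine.

-- Copy anything older than the hot window into the reporting store
insert into reporting.time_entries_history
select *
  from app.time_entries
 where entry_date < current_date - interval '90' day;

-- Then trim the live application table so day-to-day queries stay fast
delete from app.time_entries
 where entry_date < current_date - interval '90' day;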

How long do I keep data?


How long data is kept for each business unit within an organization is a question for each business owner. Laws need to be examined for requirements, analysts need to make it clear how much data they need to see for trending analysis, and data management personnel need to contribute to the discussion by showing alternatives for data storage, retrieval, and reporting.

Keep your corporate “memory” alive by having a current data retention policy that is reviewed every time a new application is brought online. Reviewing your data retention policy at least annually keeps this issue fresh in all of the stakeholders' minds. Disaster recovery planning and performance optimization are both influenced by the data retention policy.

Since your company's data is really all of the information about your customers, vendors, suppliers, employees, and holdings, data never dying is a good thing!
