The-Data-Guy: January 2011

2011-01-30

Organic SEO Data Management

Image via WikipediaHow do you manage data outside your control?

Search Engines are great catalogers of meta data about your web site. Recently I was asked to do a bit of SEO for a friend. I worked on putting the HTML together, and researched the topic of SEO extensively. There is a lot of almost conflicting guidance for how to do SEO for a web site. So in order to understand how SEO works, I did what I am sure many other software developers have done.

I wrote my own search engine.

My search engine is anything but production grade, but the research project gave me some insight into how the big guys work. It has limitations, but rather than completely reinvent the wheel, I decided to leverage the capabilities of the tools I had.

I am sure that there are better ways to build a search engine, and this is not meant to be an authoritative methodology, but it is the way that I build my little utility to give me some insight on the process.

I used Fedora to build this, with a MySQL back end. The page crawler is all PHP code. Some of the PHP code I leveraged from the open source community. After all, the goal of this project is to understand the concepts of a search engine, not commercialize it. The web pages were broken down to the raw text, that text was then analyzed and restructured into a small data mart.

The steps my crawler performed were:
1. Identify the site and the individual pages.
2. Get Links
3. Get Meta tags for document.
4. Get the raw-text
5. Get any emails from the page.
6. Clean stopwords from the raw text.
7. Do word frequency analysis on the clean raw text
8. Stem the text
9 Do word frequency analysis on the stemmed text.
10. Store readability scores.

For each of these steps, I stored the results into a database. The database would ultimately turn into a star schema data mart for reporting and analysis.

Just as an overview I will touch on each of these steps, and why I did them. Each step has plenty of documentation about it across the internet, so this is not meant to be an exhaustive explanation.

Step 1. Identify the Site and the individual pages.

    Not only would you want the individual page that a search pattern shows up on, but you also want the site associated with that page. Simple Reg-ex work here.

Step 2. Get Links.

   Parse through the HTML in order to get all of the <a> tags and record the links this page makes reference to. The goal of this step is to record the number of inbound links to a particular page. By capturing the pages this particular page links to, when we scan those pages we can do a count to see the inbound links. The biggest problem with this method is I would either have to know which sites and pages link to this page, or scan the whole internet. I chose to simply record it for internal site page link counts.

Step 3. Process Meta Tags

   Record the Meta tags that the page author stored in the document. These are things like keywords, description, etc... These keywords are the keywords that I want google to use as the keywords for this site. What I found through this research project is that these keywords do NOT necessarily match what google determines is a keyword for your site.

Step 4. Get the Raw Text.

    This step is perhaps one of the most important. Stripping out the raw text is the human readable text that people will actually read. Humans do not process HTML tags, at least normal humans anyway. HTML tags are not cataloged in my process.

Step 5. Get any emails

     Another simple reg-ex that is run against the raw text to store the emails that were embedded in the text.

Step 6. Clean stopwords.

    Stopwords are defined here: Stopwords. Basically these words are common short function words such as the, is, and, or. These words are not very important for text scanning.

Step 7. Word frequency analysis.

    Word frequency analysis is the process of counting the word frequency of a body of text. For example in the text "The quick brown fox jumped over the slow red fox." The word "fox" would have a WFA count of two. Every other word would have a WFA "the", which would be considered a stopword in this example)

Step 8. Stem the raw text.

Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form, from Stemming. Why is this important? Knowing the root word of a word gives you some flexibility when it comes to processing searches. If the text from the web page is: "Fishing for red worms", and the search is "Fished for worms", if you remove the stopwords and stem both phrases you will have "Fish red worm" and "Fish worm". By doing a Like search on the database table where the keyword phrases are stored the search "Fish worm" will find "Fish red worm" and you will have a hit.

Step 9. Word Frequency Analysis on Stemmed text.

   This is step 7, only using the stemmed text instead of the raw text.

Step 10. Store readability scores.

   A readability score like the Flesch Kincaid readability score give you the grade level of a body of text. Again, this is not an effort to explain the readability score, just a highlight that this particular score can be helpful in looking at a web site for analysis purposes.

Image via Wikipedia

Now that all of this was done, I produced some simple reports on the individual pages of the web site to see things like readability score, and the word frequency analysis of the pages. Funny enough, the high frequency words matched some of the keywords that google webmaster tools showed for the web-site I was working with.

One thing that google analytics and webmaster tools can do for an organization is to show which pages were clicked at what time. If you were using something like this in your data warehouse, it would be interesting to see the keywords used on a particular page that sold a product, or drove a customer to make contact with your organization.

After all, if the goal of the data warehouse is to provide a 360 degree view of the customer, then wouldn't incorporating key word analytics into the data warehouse provide better insight into how the consumer interacts with the web-site, and how that interaction directly is related to sales?

2011-01-23

Backup Strategies? How about a restore strategy?

LAS VEGAS - JANUARY 07: An ioSafe, Inc. Solo ...

Image by Getty Images via @daylife
Recently we had a catastrophic failure of the LUNS that were connected to our development database cluster. Upon starting a restore process, we discovered that there was an issue with our backup process.

Backups are just the first step in a backup strategy. How often are the backups tested? Some would say this is part of a "Disaster Recovery" process or program. I think there a degrees of disasters.

Most DR programs that I am familiar with are designed for a catastrophic failure of a site, or a data center. There are other failures that should be addressed in a restore strategy. Losing a development environment can still cost the company many man-hours of work. If you calculate the dollar value of each hour worked, or not worked as the case may be, you will see the financial impact of not periodically testing your restores.

There are different types of backups, full, incremental, snapshot, and full operating system backups. When you are restoring to a point in time, and you are doing incremental backups, you have to have all of the incremental backups available. You may need additional storage available to test out the restore of a database.

Restores generally take a bit longer to do than an actual backup. Especially if there are multiple steps to the backup. If you need to get the backup media from off-site storage, this can take even longer. How long does it take to get a specific tape or CD from your off-site vendor? How often is the time in this SLA tested?

A Dataflow Diagram of backup and recovery proc...

Image via Wikipedia

What is the business impact to losing a test or development environment? One may say, not much, but have you factored in the manpower cost to recovering that environment? How many people are involved in rebuilding the test or development environment that could be working on solving business problems with IT?

Does your backup or disaster recovery strategy include time and resources for periodically testing out restores of individual systems? If it does, what frequency is this done? If not, why not?

Related articles

2011-01-18

Data Management is a dirty job

Image via WikipediaOne of my family's favorite shows is Dirty Jobs with Mike Rowe.

The premise of the show is they profile jobs that are less than glamorous, and highlight the people that are not afraid to get dirty in their daily jobs.

As the show begins, Mike introduces the people he will be working with, then they describe the task at hand. Once Mike gets an overview of the job he dives in and tries his hand at the job. The purpose of this, of course, is entertainment. Mike is not a specialist in any of the jobs, yet he goes through a bit of training and then tries his hand. Mike has a great sense of humor and it shines through as he attempts to do the jobs in the show. Sometimes he is successful, sometimes his attempts are less than successful.

Either at the end of the segment, or throughout the segment, Mike attempts to understand and explain to the audience how doing such a job impacts the bottom-line of the business. For example, why is it important to this company to scrape out a refuse container? Well, if it gets too clogged it can cause damage to the machinery which will prevent the machines making the products the company sells.

As I watch these shows, I am reminded both of my time as a consultant and my roles as an architect in the various positions I have held. As a consultant one of the things that you have to do very quickly is to understand the business processes of what you are working on. Why is it important that the performance of this particular query be improved? If we don't get an answer to this question quickly it will delay our orders which ultimately impacts delivery of products to our customers.

The day to day work done as a data warehouse architect or business intelligence architect is quite a bit different from the activities that Mike Rowe does on his show, but in essence aren't they similar?

We look at the details of business processes, delve into the dirty details of how transactions are processed and data is accumulated, then present that data to as many people as we can through the front-end tools that we deploy. There are times when we spend many hours working on a particular problem of how to process and present a concept easily and in a repeatable manner.

Diving into the details of how a business runs can be a "dirty" job in that there are many details that must not be overlooked. In that sense, data management itself is a dirty job because so many people take for granted that when they look at a report, or a dashboard, that the data is clearly presented. We make it easy for data to be presented in a clean manner, but the details of how it got there is anything but clear to the majority of people within an organization.

Pages