How do you manage data outside your control?
Search engines are great catalogers of metadata about your web site. Recently I was asked to do a bit of SEO for a friend. I worked on putting the HTML together and researched the topic of SEO extensively. There is a lot of conflicting guidance on how to do SEO for a web site. So in order to understand how SEO works, I did what I am sure many other software developers have done.
I wrote my own search engine.
My search engine is anything but production grade, but the research project gave me some insight into how the big guys work. It has limitations, but rather than completely reinvent the wheel, I decided to leverage the capabilities of the tools I had.
I am sure that there are better ways to build a search engine, and this is not meant to be an authoritative methodology, but it is the way that I built my little utility to give me some insight into the process.
I used Fedora to build this, with a MySQL back end. The page crawler is all PHP code, some of which I leveraged from the open source community. After all, the goal of this project is to understand the concepts of a search engine, not commercialize it. The web pages were broken down to raw text, and that text was then analyzed and restructured into a small data mart.
The steps my crawler performed were:
1. Identify the site and the individual pages.
2. Get links.
3. Get meta tags for the document.
4. Get the raw text.
5. Get any emails from the page.
6. Clean stopwords from the raw text.
7. Do word frequency analysis on the clean raw text.
8. Stem the text.
9. Do word frequency analysis on the stemmed text.
10. Store readability scores.
For each of these steps, I stored the results into a database. The database would ultimately turn into a star schema data mart for reporting and analysis.
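To make that concrete, the mart ended up looking roughly like a fact table of word counts surrounded by page and word dimensions. The sketch below is illustrative only; the table and column names are hypothetical stand-ins rather than my exact schema.

```php
<?php
// Hypothetical sketch of the star schema the crawler data fed into.
// Table and column names are illustrative, not the exact ones I used.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

$pdo->exec("CREATE TABLE IF NOT EXISTS dim_page (
    page_id INT AUTO_INCREMENT PRIMARY KEY,
    site    VARCHAR(255),
    url     VARCHAR(1024),
    readability_score DECIMAL(5,2)
)");

$pdo->exec("CREATE TABLE IF NOT EXISTS dim_word (
    word_id INT AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(100),
    stem    VARCHAR(100)
)");

$pdo->exec("CREATE TABLE IF NOT EXISTS fact_word_frequency (
    page_id   INT,
    word_id   INT,
    frequency INT,
    PRIMARY KEY (page_id, word_id)
)");
```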
Just as an overview I will touch on each of these steps, and why I did them. Each step has plenty of documentation about it across the internet, so this is not meant to be an exhaustive explanation.
Step 1. Identify the Site and the individual pages.
Not only do you want the individual page that a search pattern shows up on, you also want the site associated with that page. This is simple regex work.
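Something along these lines is enough (the URL and pattern here are just an illustration, not my exact code):

```php
<?php
// Sketch: split a crawled URL into its site (host) and page path pieces.
// parse_url() does most of the work; a regex equivalent is shown for illustration.
$url = 'http://www.example.com/products/widgets.html?color=red';

$parts = parse_url($url);
$site  = $parts['host'];            // www.example.com
$page  = $parts['path'] ?? '/';     // /products/widgets.html

// Equivalent regex approach
if (preg_match('#^https?://([^/]+)(/[^?\#]*)?#i', $url, $m)) {
    $site = $m[1];
    $page = $m[2] ?? '/';
}

echo "Site: $site, Page: $page\n";
```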
Step 2. Get Links.
Parse through the HTML to get all of the <a> tags and record the links this page references. The goal of this step is to get the number of inbound links to a particular page: by capturing the pages each page links to, we can count a page's inbound links once the other pages have been scanned. The biggest problem with this method is that I would either have to know which sites and pages link to this page, or scan the whole internet, so I chose to record links only for internal site page link counts.
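A rough sketch of the link extraction using PHP's DOMDocument (my actual crawler leaned on open source parsing code, so treat this as illustrative):

```php
<?php
// Sketch: pull every <a href="..."> out of a page so outbound links can be stored.
$html = file_get_contents('http://www.example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely well formed
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = trim($anchor->getAttribute('href'));
    if ($href !== '') {
        $links[] = $href;
    }
}

// $links would then be written to the database so inbound-link counts
// can be derived once the rest of the site has been scanned.
print_r($links);
```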
Step 3. Process meta tags.
Record the meta tags that the page author stored in the document. These are things like keywords, description, and so on. The meta keywords are the keywords that the author wants Google to use for the site. What I found through this research project is that they do NOT necessarily match what Google determines the keywords for your site to be.
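PHP's built-in get_meta_tags() covers the simple case; a minimal sketch:

```php
<?php
// Sketch: capture the author-supplied meta keywords and description.
$tags = @get_meta_tags('http://www.example.com/') ?: [];

$keywords    = $tags['keywords']    ?? '';
$description = $tags['description'] ?? '';

echo "Keywords: $keywords\n";
echo "Description: $description\n";
```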
Step 4. Get the Raw Text.
This step is perhaps the most important. The raw text, with the markup stripped out, is the human-readable text that people will actually read. Humans do not process HTML tags (at least, normal humans don't), so HTML tags are not cataloged in my process.
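A quick-and-dirty sketch using strip_tags() (my crawler did more cleanup than this, but the idea is the same):

```php
<?php
// Sketch: reduce a page to the human-readable text only.
$html = file_get_contents('http://www.example.com/');

// Drop script and style blocks first, since their contents are not readable text.
$html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

// Strip the remaining tags and collapse whitespace.
$rawText = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

echo $rawText . "\n";
```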
Step 5. Get any emails.
Another simple regex is run against the raw text to store any email addresses embedded in the text.
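A sketch of that pass; the pattern is a simplified one, not a fully RFC-compliant address matcher:

```php
<?php
// Sketch: pull email addresses out of the raw text with a simple pattern.
$rawText = 'Contact sales@example.com or support@example.org for details.';

preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $rawText, $matches);

$emails = array_unique($matches[0]);
print_r($emails);   // sales@example.com, support@example.org
```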
Step 6. Clean stopwords.
Stopwords are common, short function words such as "the", "is", "and", and "or". These words carry little meaning for text scanning, so they are removed before analysis.
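A minimal sketch of the cleaning step; the stopword list shown is a tiny illustrative subset, not the full list I used:

```php
<?php
// Sketch: remove stopwords from the raw text.
$stopwords = ['the', 'is', 'and', 'or', 'a', 'an', 'of', 'to', 'in'];

$rawText = 'The quick brown fox jumped over the slow red fox';
$words   = preg_split('/\W+/', strtolower($rawText), -1, PREG_SPLIT_NO_EMPTY);

$cleanWords = array_values(array_diff($words, $stopwords));

echo implode(' ', $cleanWords) . "\n";   // quick brown fox jumped over slow red fox
```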
Step 7. Word frequency analysis.
Word frequency analysis (WFA) is the process of counting how often each word occurs in a body of text. For example, in the text "The quick brown fox jumped over the slow red fox.", the word "fox" would have a WFA count of two. Every other word would have a count of one (except "the", which would already have been removed as a stopword in this example).
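In PHP this is little more than array_count_values() over the cleaned word list; a sketch:

```php
<?php
// Sketch: count how often each remaining word occurs.
$cleanWords = ['quick', 'brown', 'fox', 'jumped', 'over', 'slow', 'red', 'fox'];

$frequencies = array_count_values($cleanWords);
arsort($frequencies);   // most frequent words first

print_r($frequencies);  // fox => 2, everything else => 1
```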
Step 8. Stem the raw text.
Stemming is "the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form" (from the Wikipedia article on stemming). Why is this important? Knowing the root of a word gives you some flexibility when it comes to processing searches. If the text from the web page is "Fishing for red worms" and the search is "Fished for worms", then removing the stopwords and stemming both phrases gives you "fish red worm" and "fish worm". By doing a LIKE search per stemmed term against the database table where the keyword phrases are stored, the search "fish worm" will find "fish red worm" and you will have a hit.
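Here is a sketch of the search side of that idea. The stem() helper and the table and column names are hypothetical stand-ins; in practice the stemming would come from a proper Porter stemmer implementation rather than the crude suffix chop shown here:

```php
<?php
// Sketch: stem the query terms and match them with one LIKE clause per term.
// stem() is a hypothetical helper; a real Porter stemmer is much smarter.
function stem(string $word): string
{
    return preg_replace('/(ing|ed|s)$/', '', $word);
}

$stopwords = ['for', 'the', 'a'];
$query     = 'Fished for worms';

$terms = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
$terms = array_map('stem', array_values(array_diff($terms, $stopwords)));   // ['fish', 'worm']

// One LIKE clause per stemmed term, so "fish worm" still matches "fish red worm".
$pdo   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
$where = implode(' AND ', array_fill(0, count($terms), 'stemmed_text LIKE ?'));
$stmt  = $pdo->prepare("SELECT page_id FROM page_text WHERE $where");
$stmt->execute(array_map(fn($t) => "%$t%", $terms));

print_r($stmt->fetchAll(PDO::FETCH_COLUMN));
```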
Step 9. Word Frequency Analysis on Stemmed text.
This is step 7, only using the stemmed text instead of the raw text.
Step 10. Store readability scores.
A readability score like the Flesch-Kincaid grade level gives you the approximate grade level of a body of text. Again, this is not an effort to explain the readability score, just a note that this particular score can be helpful when looking at a web site for analysis purposes.
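For reference, the Flesch-Kincaid grade level works out to 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. A rough sketch follows; the syllable count is a crude vowel-group heuristic, so treat the resulting numbers as approximate:

```php
<?php
// Sketch: Flesch-Kincaid grade level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
function countSyllables(string $word): int
{
    // Crude heuristic: count groups of consecutive vowels, minimum of one.
    preg_match_all('/[aeiouy]+/i', $word, $m);
    return max(1, count($m[0]));
}

$text      = 'The quick brown fox jumped over the slow red fox. It was a fine day.';
$sentences = max(1, preg_match_all('/[.!?]+/', $text));
$words     = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$syllables = array_sum(array_map('countSyllables', $words));

$grade = 0.39 * (count($words) / $sentences) + 11.8 * ($syllables / count($words)) - 15.59;

printf("Flesch-Kincaid grade level: %.1f\n", $grade);
```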
Now that all of this was done, I produced some simple reports on the individual pages of the web site to see things like the readability score and the word frequency analysis of each page. Funnily enough, the high-frequency words matched some of the keywords that Google Webmaster Tools showed for the web site I was working with.
One thing that Google Analytics and Webmaster Tools can do for an organization is show which pages were visited and when. If you were pulling something like this into your data warehouse, it would be interesting to see the keywords used on a particular page that sold a product, or drove a customer to make contact with your organization.
After all, if the goal of the data warehouse is to provide a 360-degree view of the customer, then wouldn't incorporating keyword analytics into the data warehouse provide better insight into how the consumer interacts with the web site, and how that interaction is directly related to sales?