Pages

2016-01-06

Data Management of Organic SEO

How do you manage data outside your control?


screenshot from a opensource search engine (op...
screenshot from a opensource search engine (openwebspider) (Photo credit: Wikipedia)
Search Engines are great catalogers of meta data about your web site. Recently I was asked to do a bit of SEO for a friend. I worked on putting the HTML together, and researched the topic extensively. There is a lot of almost conflicting guidance for how to do SEO for a web site. So in order to understand how SEO works, I did what I am sure many other software developers have done. I wrote my own search engine.

My search engine is anything but production grade, but the research project gave me some insight into how the big guys work. It has limitations, but rather than completely reinvent the wheel, I decided to leverage the capabilities of the tools I had.

I am sure that there are better ways to build a search engine, and this is not meant to be an authoritative methodology, but it is the way that I build my little utility to give me some insight on the process.

I used Fedora to build this, with a MySQL back end. The page crawler is all PHP code. Some of the PHP code I leveraged from the open source community. After all, the goal of this project is to understand the concepts of a search engine, not commercialize it.

The steps my crawler performed were:
1. Identify the site and the individual pages.
2. Get Links
3. Get Meta tags for document.
4. Get the raw-text
5. Get any emails from the page.
6. Clean stopwords from the raw text.
7. Do word frequency analysis on the clean raw text
8. Stem the text
9 Do word frequency analysis on the stemmed text.
10. Store readability scores.

For each of these steps, I stored the results into a database. The database would ultimately turn into a star schema for reporting purposes.

Just as an overview I will touch on each of these steps, and why I did them. Each step has plenty of documentation about it across the internet, so this is not meant to be an exhaustive explanation.

Step 1. Identify the Site and the individual pages.

    Not only would you want the individual page that a search pattern shows up on, but you also want the site associated with that page. Simple Reg-ex work here.

Step 2. Get Links.

   Parse through the HTML in order to get all of the <a> tags and record the links this page makes reference to.

Step 3. Process Meta Tags

   Record the Meta tags that the page author stored in the document. These are things like keywords, description, etc...


Step 4. Get the Raw Text.

    This step is perhaps one of the most important. Stripping out the raw text is the human readable text that people will actually read. Humans do not process HTML tags, at least normal humans anyway. HTML tags are not cataloged in my process.

Step 5. Get any emails

     Another simple reg-ex that is run against the raw text to store the emails that were embedded in the text.

Step 6. Clean stopwords.

    Stopwords are defined here: Stopwords. Basically these words are common  short function words such as the, is, and, or. These words are not very important for text scanning.

Step 7. Word frequency analysis.

    Word frequency analysis is the process of counting the word frequency of a body of text. For example in the text "The quick brown fox jumped over the slow red fox." The word "fox" would have a WFA count of two. Every other word would have a WFA "the", which would be considered a stopword in this example)

Step 8. Stem the raw text.

  Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form, from Stemming. Why is this important? Knowing the root word of a word gives you some flexibility when it comes to processing searches. If the text from the web page is: "Fishing for red worms", and the search is "Fished for worms", if you remove the stopwords and stem both phrases you will have "Fish red worm" and "Fish worm". By doing a Like search on the database table where the keyword phrases are stored the search "Fish worm" will find "Fish red worm" and you will have a hit.

Step 9. Word Frequency Analysis on Stemmed text.

   This is step 7, only using the stemmed text instead of the raw text. 

Step 10. Store readability scores.

   A readability score like the Flesch Kincaid readability score give you the grade level of a body of text. Again, this is not an effort to explain the readability score, just a highlight that this particular score can be helpful in looking at a web site for analysis purposes.


Now that all of this was done, I produced some simple reports on the individual pages of the web site to see things like readability score, and the word frequency analysis of the pages. Funny enough, the high frequency words matched some of the keywords that google webmaster tools showed for the web-site I was working with.




No comments:

Post a Comment