Job Descriptions Suck!
I saw a commercial for a job site. They claim to have over 60,000 new jobs added every day.

This sounds great.

Except for one thing.

Job descriptions suck!

Of these 60,000 new jobs, how many are actually unique? If you cannot differentiate one job description from another, how can an individual differentiate themselves from all of the other candidates who may be applying for a job?

Most job portals, like Indeed, do quite well at letting you search for a job title. If you are adept at keyword searches, you can usually refine your results to something close to a job for which you are qualified.

But what about the other side of the aisle?

Recruiters are searching your profiles. They do similar keyword searches against your indexed resume.

How they run a search, and the search technology they use, determines whether you are found or not.

It has been said that if your resume is longer than one page, most recruiters will overlook you.

How do you stand out when there are so many technical reasons why you will not get noticed?

I think there is a different way to look at things.

Very seldom in my career have I done only one job during my tenure at an employer. I think others have had similar experiences.

I broke out some R code, started capturing some job descriptions sent to me, and did some text similarity mining.

I published what I came up with on a website. It takes a collection of what I call "project write-ups," compares each write-up with the job descriptions in my little database, and shows the top 10 job descriptions that match that write-up. If you do enough write-ups, you can see a graph that shows your career path. For good measure, I also provide a word cloud of a person's write-ups, and a centrality representation from the graph that shows the job descriptions an individual has matched most frequently.
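To make the matching idea concrete, here is a minimal sketch in Python rather than the R I actually used. It assumes a plain bag-of-words cosine similarity, which is only one of many ways to do text similarity and is not necessarily what the site does. The job titles, descriptions, and write-up text are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(writeup, job_descriptions, n=10):
    """Return the n (title, score) pairs most similar to the write-up, best first."""
    writeup_vec = Counter(tokenize(writeup))
    scored = [
        (title, cosine_similarity(writeup_vec, Counter(tokenize(text))))
        for title, text in job_descriptions.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical data, invented for illustration only.
jobs = {
    "Data Engineer": "build etl pipelines and maintain the data warehouse",
    "Web Designer": "design web pages with html and css",
    "DBA": "tune database performance and maintain backups",
}
writeup = "I built ETL pipelines to load a data warehouse"
matches = top_matches(writeup, jobs, n=2)
print(matches)
```

A real version would likely weight rare words more heavily (TF-IDF, for example) and remove stop words, but the shape of the computation stays the same: one vector per document, one similarity score per pair, then sort.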

My profile is: Doug Needham

Would having this make you stand out from the crowd?

I don't know yet.
But one thing I have found: when I show people my profile page, it tells a very compelling story and starts a conversation about my career path and the job descriptions that match what I can do.

Wasn't there something in the news recently about "joining a conversation"?

Create a profile and get a recruiter to join your conversation.

There are instructions on the site that show you how to send a profile write-up.

Send me a write-up and join the site.

What will your story be?


The data guy learns German

Many years ago I worked for a company headquartered in Germany.

I was able to go to Darmstadt for a short visit. I did learn a few German phrases, but I did not get around to learning much of the language.

I have always wanted to get around to learning how to read and write in German, but I kept putting it off.

This year I have decided to focus my personal journey on linguistics and language processing. Learning another language may or may not help me directly as I get back into the text processing space. But from a meta perspective, understanding more about the process we as humans go through in order to understand other languages will hopefully give me more insight into providing value to text analysis. Not to mention, most of the linguists I have known speak more than just one language.

I think focusing on learning another language for human conversation may help me as I focus more on language processing.

So I began my journey earlier this year.

The following are the resources that I have been using.
And a few other YouTube folks.
The concepts taught in the book Fluent Forever.
I encourage anyone interested in language learning to read this book. The main concepts in the author's technique are the following:
  • Learn pronunciation using the International Phonetic Alphabet
  • Use flashcards that you make yourself.
  • Focus on word frequency lists for learning vocabulary. The word Ich is more common than the word Pilz, so you will be able to work out more of what you read and hear if you learn the more frequently used words first.
  • Do not translate words from German to English. Draw pictures of the artifacts or concepts instead. For example, if you make a flashcard for die Sonne, draw a picture like the one on the right.
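Since I am a data guy, the word frequency idea is easy to sketch in code. This is only an illustration of how a frequency list can be built from any text you have on hand; it is not a tool from the book, and the sample sentences are made up.

```python
import re
from collections import Counter


def frequency_list(text, top_n=5):
    """Return the top_n most common words in the text, most frequent first."""
    # \w+ matches Unicode word characters in Python 3, so umlauts and ß are kept.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical sample sentences, made up for illustration.
sample = "Ich lerne Deutsch. Ich lese ein Buch. Ich finde einen Pilz."
ranking = frequency_list(sample, top_n=3)
print(ranking)
```

Run against a larger body of German text, a list like this shows which words are worth a flashcard first.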

I do want to focus on this last point for a moment. So far I have not spent as much time as I would like putting together flashcards the way Gabriel Wyner suggests in the book above. What I do instead, when I take note of a new word I am learning, is write only the word in German, followed by a picture that represents the item, artifact, or concept in a way that means something to me. I do not know if this method has the same effect as what he suggests, but it is what I am attempting. Ultimately I will probably create the full flashcards as he suggests, with the word in German, the IPA pronunciation, and an image that represents the word.

I added one other thing to the list: read German books. In our first language (L1), we naturally increase our vocabulary as we read. I have started with some young adult readers and will work my way through more of them. I am already at the point where I can mostly understand the German text while looking up a few words. I expect this ability to keep growing this year.

I also intend to watch some television shows dubbed into German. Wikipedia also has some really good technical material that can be reviewed in both English and German. I plan to use that resource as much as possible to clarify the technical terms I work with daily.

Google Translate is also very helpful. I may think I have a translation worked out, and Google confirms the parts I had correct and shows me where I went wrong.

So here goes my first bilingual post.

Mein Verständnis von Deutsch ist erst am Anfang. Ich hoffe, bald mehr Menschen auf Deutsch zu sprechen. Wenn Sie Deutsch sprechen, schreiben Sie mir bitte eine Nachricht.




What is the performance relationship between a Database and a Business Intelligence Server?

This article will be a bit long; it covers a complicated topic that I have been studying for quite some time. 

I have run across the need to explain this topic on a number of occasions, and over time my explanations have hopefully become clearer and more succinct.

The concept that will be discussed here is the performance relationship between a database server and a business intelligence server in a simple data mart deployment. 

Rarely are data mart deployments simple, but the intention for this is to be a reference article to understand the relationships between the server needs and the performance footprint under some of the various scenarios to be experienced during the lifecycle of a production deployment. 

Here is a simple layout of an architecture for a data mart:

It is a very basic image: D is the database server, F is the front-end server, U are the users, and they are all connected via the network.

To be precise, this architecture represents a ROLAP (Relational Online Analytical Processing) implementation built on top of a dimensional model (star schema). The dimensional model is assumed to be kept populated and current by totally separate ETL processes that are not represented in this diagram.

The “D” represents any database server: Oracle, MySQL, SQL Server, DB2, whichever infrastructure you or your enterprise has chosen.

The “F” represents any front-end business intelligence server that is optimized for dimensional model querying and is supported by a server instance: Business Objects, Cognos, Pentaho Business Analytics, Tableau Server. The desktop-specific BI solutions do not fit in this reference model for reasons we shall see shortly.

In my early thoughts on the subject, I envisioned that the performance relationship in a properly done DataMart would be something like this: 

This is a good representation of what happens. 

On the left side of the chart we have the following scenario.

The front-end server, responding to a user interaction, sends a request back to the database for aggregated data, such as: “show me the number of units sold over the last few years”.

One could imagine the query being something like: Select dd.Year, Sum(fo.total_sold) from Fct_orders fo inner join Dim_Date dd on fo.date_key = dd.date_key group by dd.Year.
The dutiful database does the aggregation. Provided all of the statistics on the data are current, a short read takes place, and more CPU and Memory but less Disk I/O is used to do the calculation.

In the graph this is represented by the high red-line on the upper left.

The results returned to the front end are small: a single record per year of collected data.
The CPU and Memory load on the front-end server is tiny, shown in green on the lower left.

On the right side of the chart we have the following scenario.

The front-end server, responding to a user interaction, sends a request back to the database for non-aggregated data, such as: “show me all of the individual transactions taking place over the last few years”.

One could imagine the query in this case to be something like: Select fo.*, (dimensional data) from Fct_orders fo inner join (all connected dimension tables).

In this case the database server has little option but to do a full table scan of the raw data and return it.

In the graph this is represented by the lower red line on the right (more Disk I/O, less CPU and Memory). The data is then returned to the business intelligence server.

Our Front-End server will have to do some disk caching, as well as lots of processing (CPU and Memory) to handle the load just given to it, not to mention things like pagination, and possibly holding record counters to keep track of which rows the user has seen or not, among other things.
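The difference between the two scenarios is easy to see in a toy example. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names follow the queries above, but the schema details (order_id, one date row per year) and all of the data are assumptions made up for illustration, and SQLite stands in for whatever database server you actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dim_Date (date_key INTEGER PRIMARY KEY, Year INTEGER);
    CREATE TABLE Fct_orders (order_id INTEGER PRIMARY KEY,
                             date_key INTEGER REFERENCES Dim_Date(date_key),
                             total_sold INTEGER);
    """
)

# Made-up data: three years, one date row per year, 1,000 orders per year.
years = [2014, 2015, 2016]
conn.executemany(
    "INSERT INTO Dim_Date (date_key, Year) VALUES (?, ?)",
    [(i, year) for i, year in enumerate(years, start=1)],
)
conn.executemany(
    "INSERT INTO Fct_orders (date_key, total_sold) VALUES (?, ?)",
    [(date_key, 5) for date_key in (1, 2, 3) for _ in range(1000)],
)

# Left side of the chart: the database aggregates, the wire carries little.
aggregated = conn.execute(
    """
    SELECT dd.Year, SUM(fo.total_sold)
    FROM Fct_orders fo
    INNER JOIN Dim_Date dd ON fo.date_key = dd.date_key
    GROUP BY dd.Year
    ORDER BY dd.Year
    """
).fetchall()

# Right side of the chart: every transaction is shipped to the front end.
detail = conn.execute(
    """
    SELECT fo.*, dd.Year
    FROM Fct_orders fo
    INNER JOIN Dim_Date dd ON fo.date_key = dd.date_key
    """
).fetchall()

print(len(aggregated), "rows returned by the aggregated query")
print(len(detail), "rows returned by the detail query")
conn.close()
```

The aggregated query hands back one row per year no matter how many orders exist, while the detail result grows with the fact table. That growth is what the front-end server has to cache, paginate, and keep track of.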

This graph seems to summarize the relationship between the two servers rather nicely. However, something is missing.

I had to dwell on this image for some time before I was able to think of a way to visualize the thing that is missing.

The network.

And even then there are at least two parts to the network.

The connection between the front-end server and the database, followed by the connection between the front-end server and all of the various users.

Each of these has a different performance footprint.

Representing the database performance, Front-End performance, and network performance for both the users and the system connections is something with which I continue to struggle.

Here is the image I have recently arrived at:

This chart needs a little context to understand the relationships between the 4 quadrants. 

Quadrant I is the server network bandwidth. In a typical linear relationship, as the size of the data sent from the database to the front end increases, the server network bandwidth used increases.

Quadrant II is the database performance relationship between CPU/Memory and Disk I/O for a varying query workload. For highly aggregated queries, the CPU and Memory usage increases, and the server network bandwidth is smaller because less data is being put on the wire. For less aggregated data and more full data transfers, the Disk I/O is higher, the CPU and Memory usage is lower, and back in Quadrant I the server network bandwidth is higher.

Quadrant III is the front-end server performance, comparing CPU/Memory and Disk I/O when dealing with a varying volume of data. As the volume of data coming from the database increases, more resources and caching are needed on this server.

Quadrant IV is the user network bandwidth. This is the result of the front-end server responding to the requests from the users. As the number of users increases, the volume of data increases and more of a load is put on the front-end server. Likewise, the bandwidth increases as more data is being provided to the various users.

This image is an attempt to show the interactions between these 4 components.
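The linear relationship in Quadrant I lends itself to a back-of-the-envelope estimate. The sketch below is only that: the row counts, row size, and link speed are assumptions chosen for illustration, and it ignores protocol overhead, compression, and competing traffic.

```python
def transfer_seconds(rows, bytes_per_row, link_megabits_per_second):
    """Time to move a result set across a link, ignoring protocol overhead."""
    total_bits = rows * bytes_per_row * 8
    return total_bits / (link_megabits_per_second * 1_000_000)


# All numbers below are assumptions chosen for illustration.
BYTES_PER_ROW = 200
LINK_MBPS = 1000  # a 1 Gbit/s link between database and front end

aggregated_time = transfer_seconds(3, BYTES_PER_ROW, LINK_MBPS)
detail_time = transfer_seconds(50_000_000, BYTES_PER_ROW, LINK_MBPS)

print(f"aggregated result: {aggregated_time:.6f} s")
print(f"detail result:     {detail_time:.1f} s")
```

Plugging in your own row sizes and link speeds gives a rough idea of whether the server network is likely to be the bottleneck before any hardware is ordered.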

The things that make this image possible are a well-designed dimensional model, a rich semantic layer with appropriate business definitions, and common queries that tend to be repeated.

This architecture can support exploratory analysis; however, the data to be explored must be defined and loaded up front. Exploratory analysis to determine which data points need to be included in the data mart should be done in a separate environment.

I created all three of these images with R, using igraph and ggplot2, with anecdotal data. The data shown in this chart is not sampled; it is meant as a representation of how these four systems interact. Having experience monitoring many platforms supporting this architecture, I know for a fact that no production system will actually show these rises and falls the way this representative chart does.

However, if you are experiencing problems, understanding that at their core the components should interact this way should give a pointer to where a performance issue may be hiding in your architecture. The other use case for this image is as an estimation tool for designing new solutions.

All that being said, much of this architecture may be called into question by new tools.

Some newer systems, such as Hadoop, Snowflake, and Redshift, actually change the performance dynamics of the database component.

The cloud has an impact on the server network bandwidth component. If you have everything in the cloud, then in theory the bandwidth between the database server and the front-end server should be managed by your cloud provider. You may need VPC peering if you set the servers up in separate regions.

If these are being run within a self-managed data center, should the connection between the database server and the front-end server be on a separate VLAN or switch? Perhaps.
Does the front-end server use separate connections for the database querying interface and the user-facing interface? Should it?

Do you need more than one front-end server sitting behind a load balancer? How many users can one of your front-end servers support? What are the recommended limits from the vendor? Should data be partitioned, with dedicated servers per business unit, to optimize performance on smaller data sets?

These are the types of questions that arise when looking at the bigger picture, specifically when you are doing data systems design and architecture. This requires a slightly different touch than application systems design and architecture.

Thinking about applying this diagram in your own enterprise will hopefully give insight into your own environment.

Can you think of a better way to diagram this relationship? Let me know.
The code and text are posted here. 


How many times do you have to stand in the rain before lightning strikes?

Recently, I have seen this picture a number of times:

This is a cute little anecdote about taking risks and being open to opportunities.

Here is a counter thought:

How many people were invited to other rooms, and those people are not billionaires now?

How many people have spent nights and weekends coding or building someone else's idea only to never see a dime?

Now don't get me wrong. I absolutely love working with entrepreneurs!

The excitement of a new idea, the thrill of building things from scratch, the camaraderie of working on something that is new and rushing to get something to market before someone else builds something similar.

These are fun things to work on.

However, everyone should be committed to the goal with the same amount of buy-in.

As I wrote previously in Beware The Partnership, be wary of a partnership where the technology person or team is the only one working on the project. This is called contracting.

If you and a friend have an idea, and you are both working various angles on the idea, go for it.

If you are not a technology person, and you need a "partner" to do the actual building part, you have just hired a consultant. They may work with you for some sort of percentage of future ownership (sweat equity), but at some point sweat, motivational speeches, and possible future options do not put food on the table.

Never be afraid to take risks.

The risk for the technologist is spending time, effort, and expertise on a project that may never pay off. My caution for you if you are a technologist is this: don't expect to make your expected hourly rate. Be flexible and negotiate. Maybe even suggest that the idea person pay for equipment or some other tangible item if they are unwilling or unable to pay you directly.

The risk for the idea person is that you may be paying for something that does not quite fit in with your vision. My caution for you if you are an idea person is this: Either be willing to pay for expertise that you do not have, or simply do not talk about your idea with anyone. If you do not have the ability to pay for expertise, use Lean techniques to figure out the quickest path to make money with your idea. If you are currently working, leverage your savings or take a portion of your current income and save it till you can afford to pay for expertise, experience, equipment or some other tangible item to help you build your idea.

If you can't risk losing a bit of money (for the idea person) or time (for the technologist), then don't get involved in building out something.

If you are currently in either of these situations, and are uncomfortable talking about these things, share this link with your business partner. Have a conversation about the uncomfortable topic of money early on. If you are willing to commit your future to an idea with your partner, you should be willing to discuss money, and you should do it sooner rather than later.

You do have to take risks to be successful. Sometimes it simply rains.

On rare occasion lightning strikes and you are able to convert from working on a side project to doing something you love full time.

Either way, you will get wet.

The question is, how wet are you willing to get?

Will you take a bath, or be singing in the rain? 


Predictive Analytics World New York 2016: Supercharging with Ensemble Models

Dean Abbott taught this class on Ensemble models.

One of the in-class demonstrations was an example from The Wisdom of Crowds. He passed around a bottle with some cereal in it. Everyone guessed, and then he averaged the guesses.

Two people guessed closer to the true answer than the ensemble model did. The ensemble model here is an average of all of our individual guesses, each based on our own internal model of the bottle and the size of the cereal.
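The averaging idea behind this demonstration can be sketched in a few lines of Python. The numbers below are hypothetical; they are not the guesses or the true count from the class, and I picked them only to show how an average of many rough guesses can beat most of the individuals.

```python
from statistics import mean

# Hypothetical numbers: these are not the guesses from the class.
true_count = 500
guesses = [300, 350, 420, 480, 510, 560, 640, 700, 820]

# The "ensemble" is simply the average of every individual guess.
ensemble_guess = mean(guesses)
ensemble_error = abs(ensemble_guess - true_count)

# Collect the individual guesses that beat the averaged guess.
better_than_ensemble = [g for g in guesses if abs(g - true_count) < ensemble_error]

print(f"ensemble guess: {ensemble_guess:.1f} (error {ensemble_error:.1f})")
print(f"individuals closer than the ensemble: {better_than_ensemble}")
```

Nobody knows in advance which individuals will land closest, which is why the average is the safer bet. Ensemble models apply the same reasoning to predictions from several models instead of several people.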

There were also hands-on demonstrations with the Salford Systems predictive modeler. You can find out more about the tool at this link.

Dean is a thorough instructor, and clearly could educate all of us on the various ways of doing predictive modeling.

He talked about Logistic and Linear regression, decision trees, random forests, and how to combine these specific models with various options as an Ensemble model.

He touched just briefly on deep-learning.

I look forward to hearing from him again. I think that every time I hear from him, I will learn something new.

Dean recommended reading his book, Applied Predictive Analytics Principles. I look forward to starting it on the flight back.

It has been an exciting time in New York. Whenever I attend these conferences and workshops, I always feel like the more I learn, the less I know.

I did take a few pictures, and made a few tweets about one day speaking at a future event. I think I have a lot to learn to be on par with these speakers.

Continuous learning is the key to expertise.