
2018-06-12

Practical Text for the Data Professional.


Recently, I have been having conversations about text analysis. 



Before we get into the details, why would you want to do text analysis? Do you:
  • Collect survey data?
  • Collect customer feedback?
  • Process complaint forms?
  • Market content?
  • Solicit feedback through social platforms?
  • Perform SEO?
These are just the tip of the iceberg when it comes to analyzing the text you deal with every day.

Text analysis, by itself, can be a little intimidating. So I put together a small R notebook that uses some off-the-shelf CRAN packages to parse PDF files and create metrics that can be analyzed in Tableau and Gephi. The PDF files are a collection of books that I have downloaded from various sources over the years. Many of these are the PDF companions of hardback books I purchased to learn a given topic. Some are PDF conversions of PowerPoint slides from presentations I have attended.


The R notebook can be found on RPubs, and the Tableau workbook can be found on Tableau Public.

Each cell of the R notebook could be a topic in and of itself. The process I followed for this outline is to:
  • Simply (emphasis on simply) parse the document
  • Break the document into sections (not chapters)
  • Calculate the lexical score for each section
  • Calculate the sentiment for each section
  • Annotate the text
  • Pull out the most frequently used nouns, adjectives, verbs, and keyword phrases
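
To make a few of these steps concrete, here is a minimal sketch in Python. It is not the code from my R notebook, and it makes several assumptions that are mine for illustration: sections are fixed-size runs of words, the "lexical score" is approximated by a type-token ratio (unique words divided by total words), sentiment is a count against a tiny made-up word list, and "most frequent" ignores part of speech because tagging needs an annotation library. All function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny sentiment lexicon (illustration only).
POSITIVE = {"good", "great", "clear", "useful"}
NEGATIVE = {"bad", "poor", "confusing", "slow"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sections(tokens, size):
    """Break a token list into consecutive sections of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def lexical_diversity(tokens):
    """Type-token ratio: unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sentiment(tokens):
    """Count of positive words minus count of negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def section_metrics(text, size=100, top_n=3):
    """Return one dict of metrics per section of the text."""
    rows = []
    sections = split_sections(tokenize(text), size)
    for number, section in enumerate(sections, start=1):
        rows.append({
            "section": number,
            "words": len(section),
            "lexical_diversity": round(lexical_diversity(section), 3),
            "sentiment": sentiment(section),
            "top_terms": [w for w, _ in Counter(section).most_common(top_n)],
        })
    return rows
```

Calling `section_metrics(text, size=500)` on the parsed text of one document yields a list of rows, one per section, that can be written out for a visualization tool. A real pipeline would swap in a proper sentiment lexicon and a part-of-speech tagger.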

The notebook shows only a single PDF that I parsed, but I also created a “batch” process that produces CSV files for each of the PDFs. In addition to the CSV files, I prepped the data into files that can be loaded into Gephi for graph analysis.
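
As a rough idea of what such a batch step can look like, here is a small Python sketch (again, not my R code). The column names and the `metrics_by_document` input are assumptions for illustration. It builds one CSV per document in memory; in practice you would write each one to a file instead.

```python
import csv
import io

# Hypothetical column layout for the per-section metrics.
FIELDNAMES = ["document", "section", "words", "lexical_diversity"]


def write_metrics_csv(rows, handle):
    """Write metric rows (dicts keyed by FIELDNAMES) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def batch_to_csv(metrics_by_document):
    """Return a {document name: CSV text} mapping, one CSV per document."""
    outputs = {}
    for name, rows in metrics_by_document.items():
        buffer = io.StringIO(newline="")
        # Tag every row with its document so the files can be combined later.
        write_metrics_csv([{"document": name, **row} for row in rows], buffer)
        outputs[name] = buffer.getvalue()
    return outputs
```

Keeping the document name in every row means the individual files can be stacked into a single data source without losing track of where each section came from.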

I loaded the individual CSV files into Tableau for several different visualizations.

This is an example of the smaller automatic keyword extraction graph that was created.

This shows the relationship between documents and sections that share keywords. The nodes are the documents and sections; if two of them use the same keyword extracted from the raw text, there is a line (edge) between them.
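
One way to read that description is as a simple edge-building rule, sketched below in Python (not my R code). The node labels, the `keywords_by_node` input, and the choice to weight an edge by the number of shared keywords are all assumptions for illustration. The `Source` and `Target` column names are what Gephi's spreadsheet import looks for in an edge table.

```python
import csv
import io
from itertools import combinations


def shared_keyword_edges(keywords_by_node):
    """Return (source, target, weight) for every pair of nodes sharing keywords.

    `keywords_by_node` maps a node label (a document or section) to the set
    of keywords extracted from it. Weight is the number of shared keywords.
    """
    edges = []
    for left, right in combinations(sorted(keywords_by_node), 2):
        shared = keywords_by_node[left] & keywords_by_node[right]
        if shared:
            edges.append((left, right, len(shared)))
    return edges


def edges_to_csv(edges):
    """Render the edge list as CSV text with Source, Target, Weight columns."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()
```

Nodes with no keywords in common with any other node produce no edges, so they show up as isolated points (or not at all, if only the edge file is imported).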



The code I wrote is stored on my GitHub.

Depending on your use case for the text analysis, any of the metrics generated from the text could also be used as features in a machine learning application.

I will be writing and speaking in much more detail about this process in the coming months. I will update this page when I have a link to where you can get more information.

In the meantime, if you have questions, please comment below, and I will both answer and incorporate your questions into future work.

Enjoy!