
Showing posts with label Data analysis. Show all posts

2018-06-12

Practical Text for the Data Professional.


Recently, I have been having conversations about text analysis. 



Before we get into the details, why would you want to do text analysis? Do you:
  • Collect survey data?
  • Gather customer feedback?
  • Process complaint forms?
  • Market content?
  • Solicit feedback through social platforms?
  • Perform SEO?
These are just the tip of the iceberg when it comes to analyzing the text you deal with every day.

Text analysis, by itself, can be a little intimidating. So, I put together a small R notebook using some off-the-shelf CRAN packages to parse PDF files and create some metrics that can be analyzed in Tableau and Gephi. The PDF files are a collection of books that I have downloaded from various sources over the years. Many of these are the PDF companions of hardback books I have purchased for my own learning of a given topic. Some are PDF conversions of PowerPoint decks from presentations I have attended.


The R notebook can be found on RPubs, and the Tableau workbook can be found on Tableau Public.

Each cell of the R notebook could be a topic in and of itself. The process I followed for this outline is to:
  • Simply (emphasis on simply) parse the document
  • Break the document into sections (not chapters)
  • Calculate the lexical score for each section
  • Calculate the sentiment for each section
  • Annotate the text
  • Pull out the most frequently used nouns, adjectives, verbs, and keyword phrases
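As a rough illustration of the steps above, here is a minimal Python sketch. It is not the R notebook itself. It assumes that the "lexical score" is a type-token ratio (distinct words divided by total words), that sections are fixed-size runs of words, and that sentiment comes from a tiny made-up word list. All function names and the word lists are hypothetical. The annotation step (part-of-speech tagging) is left out, so the frequency count covers all words rather than nouns, adjectives, or verbs specifically.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons for illustration only; a real analysis
# would use a published sentiment lexicon.
POSITIVE = {"good", "great", "clear", "useful"}
NEGATIVE = {"bad", "poor", "confusing", "slow"}


def tokenize(text):
    """Lowercase the text and return its runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def split_sections(tokens, size):
    """Break the token list into consecutive sections of `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def lexical_score(tokens):
    """Type-token ratio: distinct words / total words (assumed metric)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sentiment(tokens):
    """Count of positive words minus count of negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def score_document(text, section_size=50, top_n=3):
    """Return one row of metrics per section, ready to write to CSV."""
    rows = []
    sections = split_sections(tokenize(text), section_size)
    for number, section in enumerate(sections, start=1):
        rows.append({
            "section": number,
            "lexical_score": lexical_score(section),
            "sentiment": sentiment(section),
            "top_terms": Counter(section).most_common(top_n),
        })
    return rows
```

Each returned row corresponds to one section, which matches the shape of a per-section CSV that a tool such as Tableau can read.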

The notebook shows only a single parsed PDF, but I also created a "batch" process that generates CSV files for each of the PDFs. In addition to the CSV files, I prepped the data into files that could be loaded into Gephi for graph analysis.

I loaded the individual CSV files into Tableau for several different visualizations.

This is an example of the smaller Automatic Keyword Extraction graph I created.

This shows the relationship between Documents and Sections that have the same keywords.

If two documents use the same keyword extracted from the raw text, a line (an edge) connects the nodes that represent those documents and sections.
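The edge rule described above can be sketched in Python as follows. This is an illustrative reconstruction, not the code behind the graph: the function names are hypothetical, each node is assumed to be a "document/section" label mapped to its set of extracted keywords, and the edge weight (number of shared keywords) is an added assumption. The CSV uses Source, Target, and Weight column headers, which Gephi's spreadsheet import conventionally reads as an edge list.

```python
import csv
import io
from itertools import combinations


def keyword_edges(keywords_by_node):
    """Return (source, target, weight) for every pair of nodes that
    share at least one keyword; weight = number of shared keywords."""
    edges = []
    for a, b in combinations(sorted(keywords_by_node), 2):
        shared = keywords_by_node[a] & keywords_by_node[b]
        if shared:
            edges.append((a, b, len(shared)))
    return edges


def edges_to_csv(edges):
    """Serialize the edge list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical node labels and keywords, invented for this example.
example = {
    "bookA/section1": {"data", "model"},
    "bookB/section3": {"model", "graph"},
    "bookC/section2": {"sql"},
}
print(edges_to_csv(keyword_edges(example)))
```

Nodes that share no keyword with any other node, such as the third one here, simply produce no edges.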



The code I wrote is stored on my GitHub.

Any of the metrics generated from the text could also serve as a feature in a Machine Learning application, depending on your use case for the text analysis.

I will be writing and speaking in much more detail about this process in the coming months, and I will update this page when I have a link to where you can get more information.

In the meantime, if you have questions, please comment below, and I will both answer and incorporate your questions into future work.

Enjoy!

2016-10-04

Is Analytics a Noun or a Verb?

English: The syntax tree of noun phrase "my neighbour's daughter-in-law" with layered determiner analysis. (Photo credit: Wikipedia)
Is Analytics the name of your department, or do you actually "do" Analytics?

Doing analytics requires you to look at your data, apply some logic, and make or support making a decision with the data.


For many years I have built and maintained analytical platforms. These platforms had the core of a Business Intelligence architecture with some one-offs for the occasional "sophisticated" analysis as needed. I was not specifically doing analytics during this time. I knew many of the tools and techniques that were being applied. At times, I was even the one writing the SQL queries to pull the data together to load into SAS for statistical modeling. However, I rarely took it so far as to actually do the Analytics myself. That was not my role.

Now I am in a position where I am the one doing the Analytics, and I see and recognize the impedance mismatch that occurs when I use the term analytics, versus when some people use the same term.


Data Analytics is a very overloaded term in today's environment. Yet as far as we may have evolved from our ancestors, simple things still make a big difference.


Using incredibly simple definitions:
A Noun is a person, place or thing.

A Verb is an action, or state of being.


Analytics can be a noun. "I am in charge of the Analytics department!"

Analytics can also be a verb. "I applied Analytics to the data until it gave me the answer!"

Analysis, or analytical thinking, is a way of learning from and understanding the data that we have available to us in order to solve a specific problem or answer a specific question.

I think this word evolved into a noun because there have been times when people with analytical skills (verb) were gathered together in one place. In order to have a question answered, you had to go to the Analytics department (now it is a noun: a place).

As this place evolved, the people doing the analysis needed support: programmers, managers, project managers, special coders, etc.

Now you can say you work in Analytics and mean the department. This carries some clout with it, because it sounds as if you have the skills and capabilities of those doing the analysis.

Not necessarily. You may learn some valuable things, and through the natural sequence of apprenticeship you may be able to be the one "doing analytics" at some point.

To me, Analytics is a Verb, and it should only be a verb. Using it in any other context is a disservice to the word.






2015-05-27

Data-Science-or-Business-Intelligence?

Comparing Data Science and Business Intelligence


Over the past few years as I have been supporting more and more "non-traditional" (i.e. not a Data Warehouse or Data Mart) analytical platforms, I have noticed a number of differences between Data Science approaches and Business Intelligence approaches.

This image sums up many of my observations and gives a touch point for comparing the differences as well as similarities between the two approaches.



Reproducible versus Repeatable


One of the goals of #DataOps is to keep data moving to the right location in a repeatable, automated manner. In most of the data warehouse environments I have worked on, the person doing the analysis does not run the ETL jobs. Today's data flows into existing data marts, dashboards, dimensional models, and the queries that drive it all. These are repeatable processes.

A Reproducible process, on the other hand, shows the entire process, soup to nuts. The analyst pulled this data from that system, used this transformation on these data elements, combined this data with that data, ran this regression, and produced this result. Therefore, if we raise the price of this widget by $0.05, we will have this lift in profit (ceteris paribus).
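To make the idea concrete, here is a minimal Python sketch of a reproducible analysis: the data, the transformation, the regression, and the conclusion all live in one script that anyone can rerun. The price and profit numbers are invented for illustration, the function names are hypothetical, and a one-variable least-squares fit stands in for whatever model a real analysis would use.

```python
# Toy data, invented for this example: (widget price in dollars, profit).
RAW_SALES = [(1.00, 300.0), (1.05, 310.0), (1.10, 320.0), (1.15, 330.0)]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def run_analysis(rows, price_change=0.05):
    """Run every step from raw rows to conclusion, recording each step."""
    steps = ["pulled %d rows of (price, profit)" % len(rows)]
    prices = [price for price, _ in rows]
    profits = [profit for _, profit in rows]
    steps.append("split rows into price and profit columns")
    slope, intercept = fit_line(prices, profits)
    steps.append("regressed profit on price")
    return {
        "steps": steps,
        "slope": slope,
        "intercept": intercept,
        # Predicted change in profit for the given price change,
        # holding everything else constant.
        "predicted_lift": slope * price_change,
    }


result = run_analysis(RAW_SALES)
print(result["steps"])
print(round(result["predicted_lift"], 2))
```

The point is less the regression than the record: every input and step that produced the number is visible and can be rerun by someone else.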

Predictive versus Descriptive


As described above the Data Scientist attempts to make a prediction about something, whereas the Business Intelligence analyst is usually reporting on a condition that is considered a Key Performance Indicator of the company.

Explorative versus Comparative


In most Business Intelligence environments I have worked with, the questions are usually along these lines: "Is this product selling more than that product?"

The Data Scientist would want to look at what product has the highest margin, or the product that has the largest impact on the bottom line. If someone buys product X, do they also purchase product Y?
What else is impacting this particular store? Does weather have an impact on purchase patterns? What about Twitter hashtags? What in our product line is most similar to a product that has a high purchase volume in the overall consumer community?
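The "if someone buys product X, do they also purchase product Y?" question is commonly answered with support, confidence, and lift. The Python sketch below is a minimal illustration under simple assumptions: each basket is a set of product names from one visit, the baskets shown are invented, and the function name is hypothetical.

```python
def association(baskets, x, y):
    """Support, confidence, and lift for the rule 'buys x -> buys y'."""
    n = len(baskets)
    if n == 0:
        raise ValueError("need at least one basket")
    has_x = sum(1 for basket in baskets if x in basket)
    has_y = sum(1 for basket in baskets if y in basket)
    has_both = sum(1 for basket in baskets if x in basket and y in basket)
    # Support: share of all baskets containing both products.
    support = has_both / n
    # Confidence: share of baskets with x that also contain y.
    confidence = has_both / has_x if has_x else 0.0
    # Lift: confidence relative to how often y is bought overall.
    lift = confidence / (has_y / n) if has_x and has_y else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Invented example baskets, one set of products per store visit.
baskets = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}, {"Z"}]
print(association(baskets, "X", "Y"))
```

A lift above 1 suggests that buyers of X purchase Y more often than shoppers in general, which is the kind of signal that would prompt a shelf-placement experiment.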

Attentive versus Advocating


Data Scientist: The data shows us that consumers who purchased X also purchased Y. I suggest we relocate Y so that it is 1 meter away from X in the stores in this geographic area, and 2 meters away in that geographic area. Then we will analyze the same-visit purchases of those two items to determine if this should be done in all stores.

Business Intelligence: The latest data from our campaign is shown here. The response rate among 18-to-24-year-old males is lower than we wanted, but we expect to see more lift in the coming weeks.

Accepting versus Prescriptive



Data Scientist: Give me all data, I will analyze it as is, and determine what needs to be cleaned and what represents further opportunities. If there is a quality issue I will document it as part of the assumptions in my analysis.









Business Intelligence: The data has to be cleaned and high quality before it can be analyzed. No one should see the data before all of the quality checks, verifications, and cleansing processes are done.







Both of these approaches have business value.

I think Data Science will continue to get some press for quite a while. There will always be some amazing breakthrough where someone used the algorithm of the day to solve a business problem. Then the performance of that algorithm will become a metric on a dashboard that is put into a data mart.

The guys and gals in #DataOps will make sure the data is current.

The Data is protected.

The Data is available.

The Data is shown on the right report to the people that are authorized to see it.

Your Data is safe. 



2015-05-26

Spark-Strategy

Data Strategy, but no Spark Strategy, how cute.


I see blogs, and blogs, and articles, and presentations, and SlideShares on creating a Big Data Strategy. How does "it" (usually Hadoop) fit into your organization, who should get access, who should "it", so many questions, comments, and opinions.

Guess what?

Your data is growing.

And growing.

If you use a graph to analyze the data flows in your organization, chances are you will see ways to cut costs and consolidate architectural components.

Bring things together and store them in one location.

Done!

Now what?

What tools do you use to analyze it?

You want to do the same thing you have always done?

Really?

Send two people in your organization to the upcoming Spark Summit. Let them show you there is a better way.





Spark is not just a new tool.

It is a new Path.