
2016-01-02

GraphStats Package



Today's entry is a short one. I recently created GraphStats, an R package for collecting the basic statistics igraph generates into data frames.
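To give a flavor of what that means, here is the kind of thing the package wraps, done by hand with plain igraph calls (a minimal sketch of my own, not the package internals):

library(igraph)

# A small example graph
g <- graph.ring(10)

# Gather several per-node igraph statistics into one data frame
node.stats <- data.frame(
  node        = 1:vcount(g),
  degree      = degree(g),
  betweenness = betweenness(g),
  closeness   = closeness(g)
)

head(node.stats)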


Right now it is pretty primitive, but I have some ideas for how to grow it and combine it with some other new packages I have in mind.

Please take a look, install it, and give it a try. Feedback is welcome.

Installation is easy (install_github comes from the devtools package):
library(devtools)
install_github("dougneedham/GraphStats")
Then, to see a sample, do:

library(igraph)        # for graph.data.frame
library(GraphStats)    # for analyze.graph and the Gephi.LesMiserables data
LesMis <- graph.data.frame(Gephi.LesMiserables)  # build the igraph object
lm <- analyze.graph(LesMis)                      # compute the statistics

lm$Graph will show the summary I have today.

Give it a try, and let me know what else could be useful.







2016-01-01

Eigenvector Centrality Oddity with iGraph, Gephi, and NetworkX

I found something odd. 


I like graphs and network analysis. I have taken a few classes, and I have even written about the application of social network analysis to Enterprise Architecture (Data Structure Graphs).

In my toolbox I use iGraph in R, NetworkX in Python, and Gephi for general analysis, visualization, and network study.

I stumbled onto something odd today. 

In R, I loaded a graph originally generated in Gephi, ran a few calculations on it, and converted the output to a data frame.

One of the metrics did not match between the two. 

Eigenvector Centrality. 

Thinking this only a little odd, I wrote some simple Python NetworkX code to help determine which of the two answers was correct.

Guess what? 

I got a third answer. 

I did some research on the mechanism each tool uses to calculate the Eigenvector Centrality measure.

All of them appear to be valid. 

So, I created a test case using our friendly example of the Königsberg bridges. The first image link that shows the bridges on today's Google search is here:

I created a simple input file to use in all three tools: 
Source Target
A C
A B
B A
B C
B D
D B
D C

Here are the results for Eigenvector Centrality using the tools mentioned above: 
Node Gephi iGraph NetworkX
A 0.414195271 0.7379944 0.4351621
B 0.585804729 1 0.5573454
C 1 0.6814476 0.5573454
D 0.414195271 0.7379944 0.4351621


I realize that all of these tools use slightly different algorithms, all written by different authors.

However, I would not expect the answers to be quite so divergent for such a reference problem as the Königsberg bridges.
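One thing worth checking, offered as a sketch rather than a definitive answer: the tools do not normalize the eigenvector the same way. igraph's evcent() scales the scores so the maximum is 1 by default, while NetworkX scales the vector to unit Euclidean length, and the tools may also differ on whether the duplicated rows count as two parallel bridges or collapse into a single edge. In R, the normalization difference at least is easy to factor out:

library(igraph)

# The seven bridge rows from the test file, treated as an undirected
# multigraph so the duplicated pairs (A-B, B-D) remain parallel bridges
bridges <- data.frame(
  Source = c("A", "A", "B", "B", "B", "D", "D"),
  Target = c("C", "B", "A", "C", "D", "B", "C")
)
g <- graph.data.frame(bridges, directed = FALSE)

# Raw, unscaled eigenvector centrality
ev <- evcent(g, scale = FALSE)$vector

ev / max(ev)          # scaled so the maximum is 1 (igraph's default)
ev / sqrt(sum(ev^2))  # scaled to unit Euclidean norm (NetworkX's convention)

If the rescaled vectors still disagree, the remaining gap is likely down to the multigraph-versus-simple-graph treatment rather than the eigenvector computation itself.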


I wrote this to ask a few questions:

  • Has anyone else noticed this?
  • Is there a set of options to pass to iGraph, Gephi, or NetworkX to make their calculations agree?
  • What should be considered the proper Eigenvector centralities for the seven bridges of Königsberg?

I put all of my test code, along with the test data, on my GitHub here: Eigenvector Questions

If anyone could point out a good way to get the same answer using multiple tools like this, I would appreciate it.


Thank you.


2015-12-29

Learning new things for Dummies

A floating dummy used for man-overboard training, commonly referred to as Oscar. (Photo credit: Wikipedia)



I read incessantly.

I seldom read fiction any longer.

I find I simply have too little time that I can dedicate to leisure reading.

However, the topics I do read about vary widely.

Algebra, Geometry, Calculus, Statistics, Probability, Genetics, Econometrics, Weather, Day-Trading.

I have learned about all of these topics through Wiley's "...for Dummies" series of books.

Each book that I have read is a little over 300 pages. One usually takes me slightly less than a week, depending on my interest and on whether I am completely unfamiliar with the topic. They all come with extensive reference information: if you read through one of the books and need more detail, it will point you in the right direction.

Web links, other books in the series, other books that are generally used in the topic discussed are all referenced.

The main thing that drew me to the series, and that keeps me buying new books in it again and again, is in the title itself.

If you are new to a subject, you are a dummy.

Guess what? We all are dummies about one subject or another.

These books seldom assume that the reader has in-depth knowledge of the topic, or even of related topics. They do not even assume that the reader is familiar with basic terms.

So far, of the twenty or so books in this series I have read, none has introduced a term that was new to me and failed to properly define it.

Some of the books I get serve as an overview of a topic I have a mild interest in, so that I can discuss it more intelligently with others who are more fluent in it than I am.

For other topics, I have purchased whole interrelated collections from the series (Algebra, Geometry, Trigonometry, Calculus, Statistics, and Probability). These serve as a foundation refresher, since I have not studied some of these subjects for some time.

I do not limit my reading to this series, but it is usually my starting point when I set out to learn a new subject. So far, the foundation this series lays has let me press onward to even advanced topics quickly, efficiently, and with little time lost hunting through glossaries.

Now, go learn a new subject, you Dummy!








2015-06-25

edX Intro to Big Data with Apache Spark

BerkeleyX - CS100.1x


The title of the course is Intro to Big Data with Apache Spark. 

This course is a collaboration between UC Berkeley and Databricks to formally introduce continuous learners to Apache Spark.

I have a minor (very minor) advantage, in that I have actually done a few small research projects around Apache Spark, including a writeup on setting up Spark 1.4 to use SparkR on Windows 7.

Week 1 - Data Science Background and Course Setup
Week 2 - Introduction to Apache Spark
Week 3 - Data Management
Week 4 - Data Quality, Exploratory Data Analysis, and Machine Learning


The setup portion, to me, is very valuable. Creating a locally running Spark environment can be a little tedious. Working with Databricks Cloud is a dream; I haven't tried running things on EC2, but it does take a couple of steps to get going.

The labs are straightforward, with a few challenges to make you think about what you are doing. Regular expressions are incredibly useful, not only for this class in particular but in a variety of settings. I would strongly recommend reviewing regular expressions and Python before getting started.

The instructor is very active on the discussion forums and has done a phenomenal job of working with some very frustrated students. 

I think the frustration of many of the students is a result of being new to functional programming. There have been discussions around this, and I think the next iteration of the class may spend more time emphasizing it.

Overall, each lab has demonstrated some new capability. The later labs do make reference to the earlier ones. 

The class is still going on, and I believe students can still enroll. 

This class can be taken by itself or in conjunction with the follow-up course: Scalable Machine Learning

If the follow-on course is as exciting as this course is, it will be well worth the time invested in learning Spark Machine Learning. 

Now if they would come up with a full edX course on GraphX, that would be awesome!




2015-06-12

SparkR on Windows 7

Apache SparkR on Windows 7

Like many people, I have been looking forward to working with SparkR. It was just released yesterday; you can find the full Apache Spark download here: Apache Spark Download

I have set up SparkR on a couple of machines to run locally for testing. I did have a few hiccups getting it working, so I will document what I did here.

1. JDK 1.8.0_45
2. R 3.1.3
3. Apache Spark prebuilt with Hadoop-2.6
4. hadoop-common-2.2.0-bin-master
5. Fix logging. 
6. Paths
7. Run as Administrator.

1. Install the JDK 1.8.0_45, available here: jdk 1.8
2. Install R 3.1.3 available here: R 3.1.3

The first two items are standard Windows-based installs, so they don't have to be put in any particular location. For the following downloads, I recommend creating a directory called Spark under your My Documents folder (C:\Users\<yourname>\Documents\Spark).

3. Download Spark 1.4 prebuilt with Hadoop 2.6, available here: Spark 1.4, and unzip it to the Spark directory just referenced.


4. This link describes the winutils problem pretty well. Unzip hadoop-common-2.2.0-bin-master and set up a HADOOP_HOME environment variable pointing at it. I created this under the Spark directory mentioned above.
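For example, with the layout above (an illustration only; adjust the path to wherever you actually unzipped the archive):

HADOOP_HOME=C:\Users\<yourname>\Documents\Spark\hadoop-common-2.2.0-bin-master

The winutils.exe binary should end up under %HADOOP_HOME%\bin, which is where Hadoop's Windows shims look for it.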


5. Copy spark-1.4.0-bin-hadoop2.6\conf\log4j.properties.template to spark-1.4.0-bin-hadoop2.6\conf\log4j.properties, then edit log4j.properties and change:
log4j.rootCategory=INFO, console
to:
log4j.rootCategory=ERROR, console
This quiets the flood of INFO messages that would otherwise scroll past at the sparkR prompt.

6. Make sure your path includes all the tools you just set up. A portion of my path is:
C:\Program Files\Java\jdk1.8.0_45\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\bin;C:\Users\<yourname>\Documents\Spark\spark-1.4.0-bin-hadoop2.6\sbin;C:\Users\<yourname>\Documents\R\R-3.1.3\bin\x64

7. Run a command prompt as Administrator.

Whenever I missed any of these steps, I ran into a number of issues getting things to work properly.

Now you can run sparkR from that command prompt.
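Once the shell comes up, a quick smoke test (a minimal sketch; in Spark 1.4 the sparkR shell pre-creates the sc and sqlContext contexts for you, and faithful is a dataset that ships with R):

# Convert a local R data frame into a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)

# Pull back the first few rows, computed through Spark
head(df)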


Have fun with Apache SparkR. I know I have much to try!

Good luck.