
2015-05-04


SPARCL - Simple Python/Awk Reduce Command Line

I need to analyze some data.
It is not overly large, so it isn't really a Big Data problem. However, working with the data in Excel is unwieldy.

I really need to analyze the structure of the data and the unique values of either a single column or a combination of columns.

There are actually a number of ways to do this.
Load the data into a MySQL database and run SQL GROUP BY statements on the table.
Load it into R and use the sqldf package to do the same thing.
Write a Python script to read the whole file and spit out combinations of distinct values for various columns.

All of these approaches have their advantages and disadvantages.

Each of them takes a bit of time to set up, and some of them hold everything in memory; a standalone Python script (sketched below), for example, keeps the whole tally of distinct values in memory until it finishes.
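
For the curious, that standalone approach would look roughly like this. It is only a sketch (the script name and argument handling are mine), counting one zero-based column of a CSV with collections.Counter:

#!/usr/bin/python
# count_distinct.py (hypothetical name) -- sketch of the "read the whole file"
# approach: tally the distinct values of one CSV column entirely in memory.
import sys
from collections import Counter

counts = Counter()
column = int(sys.argv[1])            # zero-based column to count
with open(sys.argv[2]) as f:         # the whole tally stays in memory
    for line in f:
        counts[line.strip().split(',')[column]] += 1

for value, n in counts.items():
    print("{0}\t{1}".format(value, n))

It works, but every new question about the data means another one-off script, and the tally still sits in memory.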

Then I remembered a class I took from Cloudera some time ago. In that class, the instructor showed us how to run a simple MapReduce program without invoking Hadoop.

The command is simple:  
cat filename | ./mapper.py | sort | ./reducer.py

He suggested we run our code through this pipeline before submitting a full-on Hadoop job, just to make sure there were no syntax errors and that all of the packages were available to the Python interpreter.
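
To make that concrete, the mapper.py in the command above can be something as simple as this word-count style script (my own sketch, not the code from the class). It reads lines from stdin and emits a tab-separated "key 1" pair per word, which is the format the sort and reduce steps expect:

#!/usr/bin/python
# mapper.py (sketch) -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("{0}\t1".format(word))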

This is exactly what I need to do.

For my first attempt, I wrote a simple AWK script for the "mapper" portion.
parser.awk:
#!/usr/bin/awk -f
BEGIN { FS = "," }   # the file is comma-separated
{ print $1 }         # emit the first column as the key


I ran this with:
cat file | ./parser.awk | sort | ./reducer.py

However, as time went on I needed to look at different columns from my CSV file, or at combinations of columns.

Rather than do a lot of AWK coding, I wrote this Python script:
#!/usr/bin/python
# parser.py -- the "mapper": print the requested column, or a comma-joined
# combination of columns, from a CSV on stdin, followed by a tab and a 1.
import sys

indexes = []
list_arg = 0
if len(sys.argv) == 2:
    # a single zero-based column index
    index = int(sys.argv[1])
else:
    # several column indexes: the combination becomes the key
    list_arg = 1
    indexes = sys.argv[1:]

for data in sys.stdin:
    line = data.strip().split(',')   # strip the newline before splitting
    if list_arg == 0:
        print("{0}\t1".format(line[index]))
    else:
        string = ""
        for index in indexes:
            string = string + line[int(index)]
            if index == indexes[-1]:
                string = string + "\t1"
            else:
                string = string + ","
        print(string)


Now I can do:
cat file | ./parser.py 2 4 6 99 3 | sort | ./reducer.py

The column indexes are zero-based, and each line out of the mapper is the comma-joined key, a tab, and a count of 1.

The reducer script I will keep on my GitHub, to keep it easy to read.
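
The general shape of such a reducer is short enough to sketch here, though. This is the usual streaming-reducer pattern (not necessarily the exact script on my GitHub): it relies on sort having grouped identical keys together and simply sums the counts for each key.

#!/usr/bin/python
# reducer.py (sketch) -- sum the counts for each key.
# Assumes sorted input where every line looks like "key<TAB>count".
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    key, count = line.rstrip('\n').split('\t', 1)
    if key == current_key:
        current_count += int(count)
    else:
        if current_key is not None:
            print("{0}\t{1}".format(current_key, current_count))
        current_key = key
        current_count = int(count)

if current_key is not None:
    print("{0}\t{1}".format(current_key, current_count))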

Feel free to give this a try. I find it works best from inside a VM like the DataScienceAtTheCommandLine one.

Good luck, and comment if you find this useful.

