SPARCL - Simple Python/Awk Reduce Command Line
I need to analyze some data. It is not overly large, so it isn't really a Big Data problem, but working with the data in Excel is unwieldy.
I really need to analyze the structure of the data and the unique values of a single column, or of combinations of columns.
There are actually a number of ways to do this.
Load the data into a MySQL database and run SQL GROUP BY statements on the table.
Load it into R and use the sqldf package to do the same thing.
Write a Python script to read the whole file and spit out combinations of distinct values for various columns.
All of these approaches have their advantages and disadvantages.
Each of them takes a bit of time, and some of them hold the entire dataset in memory.
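As a point of comparison, the pure-Python approach (the third option above) can be sketched in a few lines. This is just an illustration, not my actual script; the filename and column choices are placeholders:

```python
from collections import Counter

def distinct_counts(lines, columns):
    """Count distinct combinations of the given column indexes
    across an iterable of comma-separated lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split(',')
        counts[tuple(fields[i] for i in columns)] += 1
    return counts

# Hypothetical usage: combinations of columns 2 and 4 in data.csv
# with open('data.csv') as f:
#     for combo, n in distinct_counts(f, [2, 4]).most_common():
#         print(combo, n)
```

It works, but the whole file's distinct combinations live in memory, and every new question means editing the script.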
Then I remembered a class I took from Cloudera some time ago. In the class the instructor showed us how to do a simple Map Reduce program without invoking Hadoop.
The command is simple:
cat filename | ./mapper.py | sort | ./reducer.py
He suggested we run our code through this pipeline before submitting a full-on Hadoop job, just to make sure there were no syntax errors and all of the packages were available to the Python interpreter.
This is exactly what I need to do.
For my first attempt, I wrote a simple AWK script for the "mapper" portion.
parser.awk:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{print $1;}
This I ran with:
cat file | ./parser.awk | sort | ./reducer.py
However, as time went on I needed to either look at different columns from my CSV file, or combinations of columns from my CSV file.
Rather than do a lot of AWK coding I wrote this Python:
#!/usr/bin/python
import sys

# Column indexes to extract, taken from the command-line arguments
indexes = [int(arg) for arg in sys.argv[1:]]

for data in sys.stdin:
    # rstrip the newline so the last field comes out clean
    line = data.rstrip('\n').split(',')
    # Join the selected columns into a comma-separated key, then
    # emit "key<TAB>1" for the reducer to count
    key = ','.join(line[i] for i in indexes)
    print("{0}\t1".format(key))
Now I can do:
cat file | ./parser.py 2 4 6 99 3 | sort | ./reducer.py
I will keep the reducer script on my GitHub to keep this post easy to read.
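For reference, a minimal streaming reducer for this kind of pipeline follows the usual pattern: since `sort` groups identical keys together, it just sums the counts for each run of equal keys. This is a sketch of that pattern, not necessarily the script on GitHub:

```python
#!/usr/bin/python
import sys

def reduce_counts(lines):
    """Sum the counts in sorted 'key<TAB>count' lines,
    returning (key, total) pairs in order of first appearance."""
    totals = []
    current_key, current_count = None, 0
    for line in lines:
        # rpartition keeps keys that themselves contain tabs intact
        key, _, count = line.rstrip('\n').rpartition('\t')
        if key != current_key:
            if current_key is not None:
                totals.append((current_key, current_count))
            current_key, current_count = key, 0
        current_count += int(count)
    if current_key is not None:
        totals.append((current_key, current_count))
    return totals

if __name__ == '__main__':
    for key, total in reduce_counts(sys.stdin):
        print("{0}\t{1}".format(key, total))
```

Because the input is sorted, only one key's running total is held in memory at a time, which is exactly what makes the mapper/sort/reducer pipeline handle files larger than RAM.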
Feel free to give this a try. I find it works best from within a VM like the DataScienceAtTheCommandLine one.
Good luck, and comment if you find this useful.