The data guy deals with hardware issues.

Remember how to do this?

Since moving into our new house, I had a few other priorities to deal with before getting my personal computing equipment in order.

Before the days of "the cloud," if you wanted to do any coding or data analysis, you needed a place to put the data, and that machine had to have a decent amount of memory and processing power to analyze anything larger than a few megabytes.

I bought a tower some years ago and used it during some of my consulting projects, as well as for some of my personal research (research I should have pursued further, but which Apache Parquet now negates).

It may not be comparable to an EC2 machine, but it gets the job done for me: 1 TB of disk, 4 GB of memory, and a quad-core CPU. It has been in storage in a box since we moved away from Maryland (3 years ago).

I dusted it off, set it up, brought out its monitor and fired it up.

"It's alive!"

Yay. I have a tower again! (or so I thought).

I connected it to my router and decided that, since it was running something like Fedora 12 and the current version of Fedora is 24, it was time to upgrade. In order to do that, I needed to download the live disk.

After getting DVD-Rs to write the ISO image onto, I realized something.

I could use the live CD to test my laptop.

<rabbit hole>

My personal laptop died shortly after the move, and it has been sitting idly in my office for some time. I had been hoping to replace it, but there are always issues that come up that have a higher priority.

I built a VM on my current machine, and ran the disk checks on the Fedora DVD to make sure it had burned correctly. Then I put it in my personal laptop.

Fedora Lives!

Ok, so that means the hard drive in my laptop is gone. I found a replacement and ordered it.

</rabbit hole>

Now back to the tower. I wanted to get my data off the tower, so I started copying everything off.

Bam, machine died.


I mounted a local disk, copied off everything I needed, and then did the Fedora install.

Fedora, up and running!

Bam, machine died.

Ok, let's check some stuff.

I tested the memory, and during the test:

Bam, machine died.

Ok, so this machine needs new memory.

Off to Best Buy to get new memory.

While I was at Best Buy, I saw that they also had DDR3 laptop memory. Since I have to replace the hard drive in my laptop, why not do a memory upgrade?

I got the DDR3 for the tower and the laptop, then replaced the memory in the tower.


It's Alive!

Now I can do a yum update, since the machine will be up for a while.

Yum is being deprecated in favor of a tool called DNF.

So dnf update it is.

Once current, I need to set this up as a non-GUI machine. In prior versions of Fedora/Red Hat this was called runlevel 3.

Now there is a newer tool called systemctl. To switch over to more of a server configuration, the command is:

systemctl isolate multi-user.target

Ok, the tower is now in good shape. When my laptop hard drive arrived, I replaced the old drive, added the new 8 GB stick, and installed Fedora.

The laptop came up fine after the install, but it locks up periodically after a few minutes.

Nothing will launch, and it even loses connectivity.


I started to lose faith in Fedora, and tried Ubuntu. When Ubuntu live booted, it had memtest86+ on the initial screen. I ran the memory test, and the machine immediately rebooted.

Let's try that again.

Same result.

Let's assume the memory is bad, and put the old memory back in.

Memory test again?

Now I have Fedora running on my laptop and Tower.

I will be going through migrating some of my R code now onto these machines.

While I am familiar with troubleshooting hardware issues, I am glad this is not something that I do on a daily basis.

Hardware troubleshooting can be a constant rabbit hole: when you uncover one problem, it often exposes a new one.



Beware the "partnership"

"I want to make you my partner"


When someone says this to you, you should think of the phrase:
"Beware the ides of March."

From my experience this means:

"I have an idea, I want you to do all the work to build it. Once you have done all the work, I am going to get back involved and make a lot of money off all the work you did."

Don't get me wrong. If you and a friend come up with an idea, and you are the "technology" person while they are the "business" person, then becoming partners makes sense.

But both of you are in the same boat.

If you are considering a side project, and the "idea" person likes what you say about the technical details about how to implement their business idea, you just got a customer.

Think about that term. Customer. If you are a customer, do you tell the service provider that they should be your partner?

No, they perform a service for you, then you pay them.

Technology side projects are no different.

What about the scenario when there is limited capital to go around?
This is a Catch-22 situation.

The idea person needs something implemented, in order to get to the point where capital is available.

The technology person needs resources (compute resources, vendor resources, etc.).

If this is your first experience in consulting, here is the trick. Have you ever heard the phrase "Make a business case for that purchase"? Now is your opportunity to apply your technical knowledge to a business scenario.

Here are a couple of hypothetical exchanges.

Idea Person: "We need to be able to market my widgets to a broad range of customers."
Tech Person: "We can use X or Y to do that. X costs $200/month, Y costs $300/month, and my time to manage either of them is $50/hour."
Idea Person: "Why do I pay $50/hour?"
Tech Person: "So you don't have to deal with either X or Y, and you can focus on your value-add widget."
Idea Person: "I want to make you my partner. Then we can both benefit."
Tech Person: "OK, for a partnership my rate is $25/hr, and we still need to pay for either X or Y."

Idea Person: "If we can put together this widget, then we can solve a key problem in this space."
Tech Person: "We can use this tool that costs $200/kit to make some prototypes. My rate to build it is $50/hr."
Idea Person: "Why do I have to pay you $50/hr?"
Tech Person: "Because you didn't even know that tool was available, much less how to use it to build your idea."
Idea Person: "I want to make you my partner. Then we can both benefit."
Tech Person: "OK, for a partnership my rate is $25/hr, and we still need to pay for those kits."

There are risks involved in doing side projects.

Whatever you choose to do, make sure you have an agreement written down with clear expectations from the onset.

While the costs of many things have come down with tools like the cloud, or even open source tools, it still takes expertise to use these tools, and seldom are the tools themselves completely free (as in beer).

Be careful working with side projects. Some can be rewarding, and there are benefits to working with technologies that may be new to you.

However, the downside is that they can become a time drain: the goal line keeps getting moved, and you have to jump through more and more hoops to meet expectations before you make any money.


If you give a quant (or Data Scientist) some data...

With compliments to Laura Joffe Numeroff:

"If you give a quant some data, she's going to ask for storage.

When you give her the storage, she'll probably ask you for an admin tool. When she's finished, she'll ask you for an archive/restore test.

Then she'll want an audit to prove there has been no compromise of PHI/PII/PCI data. When she looks at the audit logs, she might notice the production database doesn't have encryption. So she'll probably ask for some encryption tools.

When she's finished encrypting the production data, she'll want to verify that the data warehouse has proper encryption protocols set up. She might get carried away and encrypt all the data in the enterprise.

When she's done, she'll finally get around to running a regression. You'll have to approve cloud expenditures to be able to run the Spark cluster. She'll crawl in, make herself comfortable and munge the data a few times. She'll probably ask you to give her more context around this data. So you'll tell her how the data was collected, and she will ask to see the initial graphs of user usage.

When she looks at the pictures, she'll get so excited she'll want to make some of her own visualizations. She'll ask for a larger monitor, and an industrial plotter. She'll make the visualization. When the visualization is finished, she'll want to share it with others. After sharing it with others, and getting feedback, she will create more visualizations to demonstrate the effect of regression on the data. Then she'll want to store all these visualizations on a web server, which means she'll need more storage. She'll post the visualizations for everyone to see.

After she has stored the visualizations she will check the admin tool to make sure the users have access to it. The admin tool won't have the access logs stored in a format she can use to analyze it. So... she'll ask for some more storage to extract the logs.

And chances are if she asks you for the storage, she's going to need more data to put in the storage."


Show Your Work

"Show your work!
Yes, you got the answer right.
I am still marking it wrong since I don't know how you did it."

I hated that phrase.

Yes, I heard it more than once.

Teachers never explained to me that the point of showing your work is so that, if you do in fact come up with the wrong answer, they can help you tweak your algorithm.

I don't think any teacher would have ever explained things this way.

Honestly, I am not convinced most of my teachers actually understood that this is what they were doing.
They really only knew one method of solving a given problem, and in order for them to help, the students had to follow their methods.

(The majority of my early education was in religion-based schools, so these teachers were not Scientists, Mathematicians, etc.)

Fast forward to today: I am a Data Scientist, which means I work with numbers, algorithms, data, business users, technical experts, Architects, Statisticians, and domain experts.

I spend much of my day munging data into data structures, or algorithms that provide insight into our data, our users, and our customers. One thing I see Data Scientists doing, and not necessarily talking about, is the whole "Show your Work" philosophy.

If you are going to make a claim about an insight you have about data, you should show how you got to that conclusion. In many papers it may be in a section like "Methods and Assumptions", but I think this is important in even internal presentations to business users. You should be able to show others how you got to whatever conclusion you have come to.

It may not be "Page 1" (actually, Page 1 should really be your final conclusion, but that is another story), but in any presentation, in addition to citing your sources, you should touch on the methodologies you followed.

For example, in my current research, I am evaluating Markov chain sequences of behavior patterns. I will write a separate blog post about Markov chains at some point, but part of my foundation work has been to show how to collect the data and then munge my observations into a probability transition matrix.
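
A minimal sketch of that munging step, assuming each observation is an ordered list of behavior labels. The sessions data and the transition_matrix function are hypothetical, invented for illustration, and are not the code from the research itself. The idea is simply to count how often each state is followed by each other state, then divide each row by its total so the row reads as probabilities.

```python
from collections import Counter


def transition_matrix(sequences):
    """Estimate a first-order Markov transition matrix from observed sequences."""
    # Collect the distinct states in a stable (sorted) order.
    states = sorted({state for seq in sequences for state in seq})

    # Count every observed (current state, next state) pair.
    counts = Counter()
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[(current, nxt)] += 1

    # Normalize each row so that it sums to 1.
    matrix = {}
    for current in states:
        row_total = sum(counts[(current, nxt)] for nxt in states)
        # A state that is never followed by anything keeps an all-zero row.
        matrix[current] = {
            nxt: (counts[(current, nxt)] / row_total if row_total else 0.0)
            for nxt in states
        }
    return states, matrix


# Hypothetical behavior sequences, one list per observed session.
sessions = [
    ["login", "search", "view", "logout"],
    ["login", "view", "view", "logout"],
    ["login", "search", "search", "view"],
]

states, P = transition_matrix(sessions)
for state in states:
    print(state, P[state])
```

Keeping the raw counts separate from the normalized rows is part of showing the work: anyone who doubts a probability can go back to the counts behind it.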

This has reiterated to me that we in the data science community have a responsibility to be able to show our work.

Not everyone will want to go into the same weeds, rabbit holes, and other detailed work that you have done. But after taking any number of wrong turns during your analysis, you should be able to show from first principles how you arrived at your final conclusion, even if that is saved for a smaller presentation for those willing to go through your process with you.

Ad astra!


I am not a programmer.

I write programs, but I am not a programmer

I am not a programmer (Professional Software Developer, or Software Engineer).

Sure, I studied programming in school. That was a few days ago at this point.

I have taught myself, or taken online courses for, R, Python, and other languages of the day, though I have probably written more code in SQL than in any other language.

I do not code in Java or .NET.

I strongly suspect I will never code in Java or .NET.

For many things, I use Databricks notebooks (Scala, Python, R, and SQL).

I am a data scientist. We start with data. The way I explain myself to many people: I take the data that is generated by an application, and answer questions that the original programmers, business analysts, users, and stakeholders probably never thought of.

This does require writing some code. I have to pull data from the source system (the original application), then transform, or munge, the data to fit the algorithm I am using (time series, Markov chains, Support Vector Machine, Random Forest, regression, etc.). Then I load it into either a series of files in my data lake, or some tables I create for repeatable analysis.

Much of my time is spent working on understanding the meaning of the data, and how the application of this understanding will impact our business.

Writing code may be an important part of ensuring the analysis is repeatable when creating a useful data product. However, if you focus too much on the code itself, rather than on how the code helps you transform your data into information and insight, you are missing the point.

Isn't the purpose of our data products to gain insight from our data?