Normally on this blog I try to focus on more 'pure' data science and statistics, since I spend the majority of my day on the IT-related side of data science. There's also quite a bit of information already out there on the technical aspects of data science (configuring Python, installing R or MySQL, general Linux how-tos, etc.), so I don't really feel the need to duplicate it. However, when I ran into this issue I realized it's probably pretty common, and it straddles the line between data science and IT, so I figured I'd put the post up in case it helps some people who run into the same scenario.
I had configured a standalone Spark cluster at home using VMs (I run ESXi and KVM/Proxmox on a few boxes) and had written quite a bit of distributed Python code to train thousands of Markov chains quickly. Everything was working great until I hit a bit of a wall when attempting to share the code with others: anyone outside of my environment couldn't point to my Spark cluster, or to the Redis DB I had configured for the project to hold pre-aggregated values. So naturally I figured I would hop over to Amazon Web Services, spin up the usual three-instance c3.4xlarge cluster, get the code running, and then share that with people.
That's when I ran into an issue I hadn't expected: Amazon's latest Elastic MapReduce version (4.0 at the time of writing) and the associated AMIs still ship with Python 2.6 as the default interpreter. I have no idea what that's all about, since 2.7 was released over five years ago. Maybe Jeff Bezos hates Python? In any case, it's really frustrating if your code depends on Python 2.7 or higher (collections.Counter(), anyone?!), so I wanted to walk through the steps to upgrade all the nodes of an Amazon EMR cluster to Python 2.7 in case others have the same issue. (It should be emphasized that you definitely want to try this in a non-production environment first to make sure there aren't any unintended consequences! If all you're running is PySpark code, you should be fine.)
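To make the pain concrete: collections.Counter didn't show up until 2.7, so on the stock 2.6 interpreter even a quick one-liner like this (the word being counted is obviously just an example) dies with an ImportError, while on 2.7 it happily prints the top counts.
python -c "from collections import Counter; print Counter('mississippi').most_common(2)"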
The first step is to add your .pem file to your local ssh agent if you haven’t already.
ssh-add ~/.ssh/SparkCluster.pem
Now let's connect to the name node / cluster manager node. The -A flag forwards your ssh agent, which is what will let us hop from this node to the worker nodes later without copying the .pem file around.
ssh -A hadoop@your.dns.info.amazonaws.com
And let’s get the list of IP addresses for our worker nodes / data nodes (save these off for reference).
hadoop dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '
You'll probably get a deprecation warning; just stick your tongue out at it like I normally do and proceed.
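Since we'll need those IP's again in a minute, it doesn't hurt to dump them straight to a file while we're at it (the filename here is just a placeholder I'll reuse later):
hadoop dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' ' > ~/worker_ips.txt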
Now that we've got the IP's saved off, let's go ahead and knock out the upgrade on the name node. (Note that these instructions are based off of a post by Sebastian on his blog; I'm just incorporating them into our EMR cluster-based workflow for the sake of simplicity.)
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python27-devel -y
sudo rm /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python
sudo cp /usr/bin/yum /usr/bin/_yum_before_27 ## back up yum first since it needs to stay on Python 2.6
sudo sed -i 's|^#!/usr/bin/python$|#!/usr/bin/python2.6|' /usr/bin/yum ## point yum's shebang at python2.6
python -V ## should be 2.7x+
sudo curl -o /tmp/ez_setup.py https://bootstrap.pypa.io/ez_setup.py
sudo /usr/bin/python27 /tmp/ez_setup.py
sudo /usr/bin/easy_install-2.7 pip
As the version check above shows, we're now in friendly Python 2.7 territory. pip gets installed for 2.7 along the way as well and can be referenced via /usr/bin/pip27.
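Depending on how you launch your jobs, it also doesn't hurt to point Spark explicitly at the new interpreter. PySpark honors the PYSPARK_PYTHON environment variable, so exporting something like this before you run spark-submit (or adding it to spark-env.sh) removes any ambiguity:
export PYSPARK_PYTHON=/usr/bin/python2.7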
Now all that’s left to do is to upgrade each worker node (hope you’re not on a huge cluster!).
In order to SSH to the other nodes from the name node, we'll just do:
ssh hadoop@your.datanode.ip
and then follow the Python upgrade steps above once again on each node. Don't forget to exit out of the data node's shell after you're done, or you can easily get lost in a crazy chain of nested SSH sessions.
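If you've got more than a handful of workers, here's a rough sketch of scripting the whole thing from the name node instead. It assumes you saved the worker IP's to ~/worker_ips.txt earlier, that the agent forwarding from ssh -A is still in place, and the script name is just a placeholder (the -t is there in case sudo insists on a tty):
cat > ~/upgrade_python27.sh <<'EOF'
#!/bin/bash
## same upgrade steps as above, run non-interactively
set -e
sudo yum install -y make automake gcc gcc-c++ kernel-devel git-core
sudo yum install -y python27-devel
sudo cp /usr/bin/yum /usr/bin/_yum_before_27
sudo sed -i 's|^#!/usr/bin/python$|#!/usr/bin/python2.6|' /usr/bin/yum
sudo rm /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python
sudo curl -o /tmp/ez_setup.py https://bootstrap.pypa.io/ez_setup.py
sudo /usr/bin/python27 /tmp/ez_setup.py
sudo /usr/bin/easy_install-2.7 pip
EOF
for ip in $(cat ~/worker_ips.txt); do
  echo "## upgrading $ip"
  scp -o StrictHostKeyChecking=no ~/upgrade_python27.sh hadoop@$ip:/tmp/
  ssh -t -o StrictHostKeyChecking=no hadoop@$ip 'bash /tmp/upgrade_python27.sh'
done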
That's all there is to it! It's rather simple, and for future clusters you can definitely write or find a bootstrap action to do this for you at creation time, but if you already had things up and running, the steps above should help you out a bit!
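(For what it's worth, a bootstrap action is really just a shell script sitting in S3 that EMR runs on every node as the cluster comes up, so something like the upgrade script above works as-is. The bucket and file name below are placeholders, and you'd tack the flag onto whatever create-cluster options you normally use.)
aws emr create-cluster --release-label emr-4.0.0 --bootstrap-actions Path=s3://your-bucket/upgrade_python27.sh ...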