high performance computing
Last week I was in a team that competed in the Centre for High Performance Computing’s student cluster competition. This is a national competition and the winning team will compete at the International Supercomputing Conference in Leipzig, Germany next June.
In the competition, each team was given a budget of R250000 ($24000) and a parts list from a vendor (in our case, Dell), which they used to design a computing cluster. Our team ordered 21 Dell PowerEdge R210v2s and a 24-port gigabit Ethernet switch. We were unique in having over four times as many units as any other team. We chose this configuration because, by our calculations, it would give us the highest theoretical performance. In hindsight we probably underestimated the performance cost of a gigabit network, so perhaps we should have either invested in a better network (very expensive to upgrade: new NICs, switch, cables) or opted for fewer, more powerful nodes. One advantage of many nodes that we didn’t fully expect was that we were able to process very large problem sizes that the other teams simply didn’t have enough memory for.
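For anyone wondering what a "theoretical performance" calculation looks like, the usual back-of-envelope formula is nodes × cores per node × clock speed × floating-point operations per cycle. Here is a minimal sketch in Python. The function name and every figure in it are placeholders for illustration, not the actual specs of our machines or anyone else's.

```python
def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Peak GFLOPS if every core retired its maximum FLOPs on every cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


# Hypothetical figures only: many small nodes versus a few large ones.
many_small = theoretical_peak_gflops(nodes=20, cores_per_node=4,
                                     clock_ghz=3.0, flops_per_cycle=8)
few_large = theoretical_peak_gflops(nodes=4, cores_per_node=16,
                                    clock_ghz=2.0, flops_per_cycle=8)

print(many_small, few_large)
```

Note that this number says nothing about the network: it assumes the nodes never wait on each other, which is exactly the assumption that hurts on gigabit Ethernet.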
None of us had any experience setting up enterprise-level hardware before so it was quite an exciting challenge for us. We had four days from receiving the equipment to submitting benchmarks for our cluster.
We were provided no instruction on how to set these up, so our first task was figuring out how to put these machines into a rack. We sat for a few minutes quizzically turning rails over in our hands like stereotypical IKEA furniture parts.
While showing these pictures I should probably reiterate that we had no experience in enterprise hardware – there are probably some things visible which would get us fired from a data centre, but we did the best we could. If you see something, please let me know!
We had worked for a week beforehand on scripts to install Arch Linux and other software onto the nodes. We put these onto four flash disks, which let us configure nodes pretty fast. We probably should have set up some kind of PXE boot solution, but this worked fine. Despite having so many machines, we still finished configuring nodes before the other teams. For live configuration changes, instead of learning a provisioning framework like Puppet as we should have, we used a quick and dirty ssh command in a for loop, which worked way better than I expected from such a kludge.
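To give an idea of the "ssh in a for loop" approach, here is a rough Python equivalent. This is a sketch rather than the script we actually used (ours was a shell one-liner): the function names, the hostnames and the example command are all made up, and it assumes key-based SSH login is already set up on every node.

```python
import subprocess


def build_ssh_command(host, remote_command):
    """Return the argv for running one command on one host over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_on_all(hosts, remote_command, runner=subprocess.run):
    """Run remote_command on each host in turn; return {host: exit code}."""
    results = {}
    for host in hosts:
        completed = runner(build_ssh_command(host, remote_command),
                           capture_output=True, text=True)
        results[host] = completed.returncode
    return results


# Example usage (hypothetical hostnames node01..node20):
#   hosts = [f"node{i:02d}" for i in range(1, 21)]
#   failed = [h for h, rc in run_on_all(hosts, "uptime").items() if rc != 0]
```

The nodes are visited one at a time, so a single unreachable node stalls the loop until ssh gives up. That is one of the things a proper provisioning tool handles for you.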
We had a rather confounding problem with the DHCP service we set up on the head node, which resulted in all sorts of unsettlingly weird problems which I still don’t understand after the fact. We had single machines responding to two IP addresses and our address pool was half as large as it needed to be. At midnight on the first day we assigned static addresses and went back to our accommodation to get some rest.
The next day we set up OpenMPI and tried to compile the HPC Challenge (HPCC) and AMG benchmarks using the Intel C Compiler, for which we had acquired a trial licence, but we had problems using that compiler. We spent all day trying to link the Math Kernel Library, four variants of the Basic Linear Algebra Subprograms (BLAS) and the benchmarks. I think we were just a little too inexperienced to tackle compiling a custom set of libraries, and so the next day we settled for GCC and ATLAS-LAPACK, and we finally got a benchmark running on all nodes.
All through this time we had the opportunity to talk to many industry experts. I had the immense pleasure of meeting Thomas Sterling himself, one of the inventors of the Beowulf cluster and not someone I ever expected to meet at the southern tip of Africa. I also had a longish chat with someone from Dell about their perspective on enterprise operations, which was surprisingly interesting. I really enjoyed talking to all these people, but I wish we had all been formally introduced, and that I had had more time to talk between bouts of convincing our cluster to operate.
While testing and optimising benchmarks on the next day, we discovered that running benchmarks on a certain node would cause the entire cluster to idle around and never finish. We ultimately couldn’t find the source of this problem and had to remove that node from the cluster. We also had another two nodes which seemed to throw MPI transport errors and inflate the residuals on our benchmarks, so we ended up with a cluster of 17 compute nodes. That was rather disappointing because I had really hoped to fix all the problems we had, but we had to focus on actually submitting benchmarks and didn’t have time to make it elegant. To be honest, because of our time limitations, our setup probably would have been a massive pain to administer long term. We only needed the nodes to work for a couple of days, so we didn’t spend much time making them convenient, although I did make a rather neat script that would run a batch of benchmarks while we slept at night (sort of a scheduler, I suppose).
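The overnight batch idea is simple enough to sketch in a few lines of Python. This is an illustration of the approach, not the script I wrote at the competition: the function name, the job names and the example mpirun lines are hypothetical. The timeout is an addition aimed at the kind of benchmark that never finishes.

```python
import subprocess
from pathlib import Path


def run_batch(jobs, log_dir, timeout_s=None):
    """Run (name, argv) jobs one after another, logging each to <name>.log.

    A failed or timed-out job is recorded and the batch carries on, so one
    bad benchmark does not waste the rest of the night.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    outcomes = {}
    for name, argv in jobs:
        with open(log_dir / f"{name}.log", "w") as log:
            try:
                done = subprocess.run(argv, stdout=log,
                                      stderr=subprocess.STDOUT,
                                      timeout=timeout_s)
                if done.returncode == 0:
                    outcomes[name] = "ok"
                else:
                    outcomes[name] = f"exit {done.returncode}"
            except subprocess.TimeoutExpired:
                outcomes[name] = "timeout"
    return outcomes


# Example usage (hypothetical commands and process counts):
#   jobs = [("hpl", ["mpirun", "-np", "64", "./xhpl"]),
#           ("amg", ["mpirun", "-np", "64", "./amg"])]
#   print(run_batch(jobs, "overnight-logs", timeout_s=4 * 3600))
```

One caveat: the timeout only kills the local process that was launched. Whether the MPI ranks on the other nodes get cleaned up as well depends on the launcher, so treat that part as an assumption.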
We spent a lot of time on OpenFOAM, for which we had been given a problem to compute. We found that OpenFOAM assumed too much about the environment it was running in, and so we (mostly me and another teammate, Douglas) rewrote some OpenFOAM scripts so that all of OpenFOAM was conveniently located in our NFS directory. We also had linking problems, which were a massive pain because OpenFOAM takes about two hours to compile. I credit a lot of the magic in our OpenFOAM setup to Douglas. He’ll have to explain to me what the hell he did to get it running at some stage.
In a nail-biting finish, we completed the OpenFOAM benchmark with two minutes to spare until the submission of results. We gathered quite a crowd while we counted down the iterations until the completion of our benchmark.
The night before benchmarks were due we had a bit of a disaster: all of our compute nodes refused to open login sessions. Physical ttys were still working, but trying to log in only yielded the last login time and then hung forever (the systems still responded to input; there was just nowhere for that input to go). SSH and SFTP were also affected. CPU utilization was low the entire time, but load was maxing out (this implies, strangely, that our ganglia reporting daemons were all still alive and communicating). We rebooted one of the compute nodes, and then, even more strangely, all of them came back online. I would be interested in any hypotheses you might have on this. The one we rebooted died again later when we tried to fix the routing table (which had become confused by the reboot) and NFS locked up the system trying to mount an unreachable location on user login. I learnt the hard way to make sure future NFS setups aren’t affected by this. I realise this paragraph is a smorgasbord of incompetence, but it was really a perfect storm of us still being slightly shaky on Linux systems and severe time constraints. I usually prefer to spend hours fully understanding problems before I fix them, but that simply wasn’t an option here.
We were mostly pleased by our benchmark results, with the notable exception of HPL, for which we received a pretty poor efficiency rating, even for a gigabit network. We all felt that given more time, we could have optimised it to above 40%. Our dead nodes also hurt us there. However, we got a great AMG benchmark compared to the other teams. Our OpenFOAM benchmark was competitive but wasn’t the best at the competition, because OpenFOAM doesn’t seem to scale very well in parallel, and we had many lightweight nodes while other teams had fewer but more powerful machines. Our cluster was also by far the quietest, with the Cray team leading in the jet engine category.
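For context, HPL efficiency is just the measured result divided by the theoretical peak. A tiny Python sketch, with a made-up function name and made-up numbers (these are not our results):

```python
def hpl_efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of theoretical peak (Rpeak) that HPL actually achieved (Rmax)."""
    if rpeak_gflops <= 0:
        raise ValueError("rpeak_gflops must be positive")
    return rmax_gflops / rpeak_gflops


# Hypothetical example: 400 GFLOPS measured against a 1000 GFLOPS peak.
print(f"{hpl_efficiency(400.0, 1000.0):.0%}")
```

One thing to watch is which peak you divide by: removing dead nodes lowers both the measured result and the peak of the cluster that is left.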
Throughout the competition, judges and people from industry asked us questions about our cluster and our rationale. I think we fielded them way better than I expected: I had expected not to understand a word these people were saying, but we gave them all solid responses. They were all pretty approachable and not too intimidating. Many of them were from other countries and their accents were difficult to understand, but you could tell they knew their stuff.
In the end, my team came second, and we are immensely proud of what we achieved, going from only rudimentary Linux skill, to configuring and managing our own expensive cluster. Before this year, I hadn’t even considered a career in system administration, but we all really enjoyed what we were doing in the past week, despite severe lack of sleep.
(from left) Me, Liam, Doug and Kevin