IQ Biology Blog: Compute as a commodity: from petascale to your web-browser
By: Daniel McDonald
A few weeks ago, and thanks to support from IQ Biology travel funds, I had an opportunity to visit the University of California, San Diego. Plans changed (un)fortunately for the trip, and it was cut short (more on this below), but it was an exciting visit nonetheless.
The original intention had been to:
- Make progress on the American Gut manuscript
- Coordinate a visit with Prof. Perez, who is the lead developer of IPython and its encompassing packages
- Participate in a code sprint to bring scikit-bio up to a beta release
- Meet with the Dorrestein lab, a world-leading group in metabolomics work
Perhaps ambition got ahead of me, or perhaps there are just too many cool things going on, but I wasn't able to meet all of the objectives. On the American Gut manuscript front, which is a critical component of my PhD thesis, I met with Justine Debelius, with whom I'm sharing first authorship, to resolve our outstanding figure order and decide which items we wanted to pull from the American Gut outline. The outline is the wild west: we opened the door to our network of collaborators to contribute analyses, and the list of authors has since grown to 56!
Part of the motivation for the timing of the trip was that Prof. Greg Caporaso (from Northern Arizona University) had organized a scikit-bio sprint at UCSD focused on bringing the software package up to beta in preparation for the SciPy conference this summer. We'd previously collaborated with Prof. Fernando Perez on code sprints and IPython work, and he's based at UC Berkeley, so I extended an invitation for him and one of his core developers, Min Ragan-Kelley, to join in on the sprint. I'd brought the details of the visit up with Prof. Larry Smarr during one of our weekly VROOM meetings (see "The Big Wall" here), as I figured the Calit2 group would be interested to meet Fernando: they had been looking at deploying JupyterHub on Gordon and Comet at SDSC and had already been doing experimental provisioning. This opened the door for Fernando and Min to participate in the VROOM meeting that happened during the visit, which was one of the most awesome things I've ever witnessed. Calit2 has an unreal amount of infrastructure, which includes 160 Gbps pipes around campus and up and down the west coast, petascale compute resources, and a development group driven to reduce the barrier to entry for getting up and running on large systems. The initial plan had been to discuss hurdles in getting JupyterHub up, which then expanded over the course of the meeting to the potential for it to encompass all of XSEDE. Imagine simply going to the XSEDE website and having arbitrary compute across geographically dispersed systems, with native support for over 40 programming languages. That is not a dream but a very realistic possibility, and the implications could have rippling effects across the sciences.
As part of the scikit-bio code sprint (full details of which, including what was accomplished, can be found on our Hackpad), we resolved outstanding API details (which need to be stable for a beta release) as well as specific plans for how functionality can be marked as experimental, deprecated, etc. The project management aspects are perhaps not as sexy as extreme compute, but they are vital for a healthy project that emphasizes maintainability. However, we did get down into the nitty-gritty with performance, including the development of a benchmarking platform for the sequence objects (I was not involved in this specifically). The benchmarking highlighted performance concerns relative to similar projects, like Biopython and PyCogent. Working with Evan Bolyen, and using the benchmarking results, we were able to squeeze out an order of magnitude or more decrease in the runtime of reverse complement by using index lookups and fancy indexing with NumPy. And perhaps even crazier, we were able to figure out how to get a two orders of magnitude decrease in runtime for RNA -> Protein translation by using stride tricks. The NumPy array is one of the more remarkable data structures that has ever been created.
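To give a flavor of what those two tricks look like, here is a minimal sketch. It is not the scikit-bio code from the sprint; the function names, the lookup tables, and the restriction to plain uppercase A/C/G/U (plus T for DNA) are all my own assumptions for illustration. The idea in both cases is to turn a per-character Python loop into a single array lookup.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical lookup table: position ord(base) holds ord(complement).
# Bytes that are not listed map to themselves.
_COMPLEMENT = np.arange(256, dtype=np.uint8)
for base, comp in zip(b"ACGTacgt", b"TGCAtgca"):
    _COMPLEMENT[base] = comp

# Hypothetical table mapping an RNA base to 0..3 in U, C, A, G order.
# Assumption: input is uppercase A/C/G/U only; nothing is validated here.
_BASE_INDEX = np.zeros(256, dtype=np.uint8)
for i, base in enumerate(b"UCAG"):
    _BASE_INDEX[base] = i

# Standard genetic code, ordered so that codon index = 16*b1 + 4*b2 + b3.
_AMINO = np.frombuffer(
    b"FFLLSSSSYY**CC*W"
    b"LLLLPPPPHHQQRRRR"
    b"IIIMTTTTNNKKSSRR"
    b"VVVVAAAADDEEGGGG",
    dtype=np.uint8,
)


def reverse_complement(seq: str) -> str:
    """Reverse complement via one fancy-indexing lookup (no Python loop)."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    # codes[::-1] reverses; indexing the table with it complements every base.
    return _COMPLEMENT[codes[::-1]].tobytes().decode("ascii")


def translate(rna: str) -> str:
    """Translate RNA in frame 0; trailing bases short of a codon are ignored."""
    codes = np.frombuffer(rna.encode("ascii"), dtype=np.uint8)
    idx = _BASE_INDEX[codes]  # each base becomes 0..3
    n_codons = len(idx) // 3
    step = idx.strides[0]
    # Stride trick: view the flat array as (n_codons, 3) without copying.
    codons = as_strided(
        idx, shape=(n_codons, 3), strides=(3 * step, step), writeable=False
    )
    # Collapse each codon to a single index into the 64-entry table.
    codon_index = codons.astype(np.intp) @ np.array([16, 4, 1])
    return _AMINO[codon_index].tobytes().decode("ascii")
```

For a single reading frame, a plain `reshape` would produce the same view; strides become more interesting once you want overlapping windows, such as several reading frames at once. As a quick check, `reverse_complement("AACGT")` gives `"ACGTT"` and `translate("AUGGCCUAA")` gives `"MA*"`.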
As I mentioned at the start, the trip was unfortunately cut short, which made the overlap I'd intended with the Dorrestein lab impossible. However, the reason for the change in schedule was truly excellent: I was invited for an interview at the Institute for Systems Biology in Seattle. The specific interest is to work on something called the Wellness 100k Project (it's too awesome to describe in a mere blog post, so click that link). It was well worth cutting the UCSD trip short, as the visit went well and included lunch with Prof. Lee Hood, which itself was a remarkable experience.
Daniel McDonald graduated from the IQ Biology PhD program in 2015 and recently joined the Institute for Systems Biology in Seattle.