1) blog introduction

7 minute read

Published: 2024-01-29

Welcome to the 0-to-Cloud blog!

First a message from the author:

What I’m trying to accomplish here is a short, multipart series where I take a small piece of code from local execution to cloud execution. In each step I will introduce a new element that enables effective cloud deployment and explain what that element does and how it contributes to our final cloud deployment. Note, the target audience here is me from roughly three years ago, which is when I transitioned from academia to industry and had to adopt a cloud environment for my high performance computing needs. Practically, that meant I could no longer rely on the abundant physical infrastructure of university computing clusters, with dedicated admins for custom installations and troubleshooting. It also meant that cost was no longer a fixed thing in the background and was very much a foreground consideration.

So when I say I am writing this for an audience of myself from three years ago, I mean I very much wish I had a roadmap for how to begin navigating that transition. I’m going to keep this as high level as possible, so I will try my hardest to avoid diversions and keep the code focused on the task at hand rather than showing off how fancy or shiny this stuff can get (and it can get very fancy and shiny!). On the other hand, I would like this to be usable right out of the box, so I will have a GitHub repo of the code for each step of the way in case you would like to follow along. (Note: as is, this probably won’t be completable wholly within the free tier of AWS. So far I have racked up 11 cents in charges, so if you (or your employer’s AWS account) are willing to fork over some dimes, I believe it should be possible to follow the steps end to end.)

I hope, in the end, this may be of help to someone out there. I know it will likely be of use to me in the future, as I frequently end up returning to the fundamentals of this sort of thing as new tasks and challenges emerge. That in and of itself is enough motivation for me, but if you would like to ride along, feel free, and don’t hesitate to reach out to me with questions or comments.

-Dan

Ok, let’s begin.

The code that we will be putting through the cloud paces does two tasks. The first task uses the datasets command from the NCBI Datasets program, a tool that handles a variety of handy tasks related to publicly available genetic datasets at NCBI. Specifically, we pass the accession number for the tomato genome to the datasets program, pull metadata for the assembly, and then store that metadata in a file. If you already had the datasets software installed and wanted to execute this task on the command line of your local computer, it would look something like this:

foo@bar:~$ datasets summary genome accession GCF_000188115.5  > tomato.metadata

This takes the tomato accession number (GCF_000188115.5) and produces a file containing metadata for the assembly (tomato.metadata). The second task simply takes the output of the previous task (tomato.metadata) and converts the entire file to upper case; it would look like this if you were to execute it on the command line of a Linux system:

foo@bar:~$ cat tomato.metadata | tr '[a-z]' '[A-Z]' > tomato.metadata.upper
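If you wanted to run both tasks back to back, a minimal shell script, call it run_tomato.sh (the name is just for illustration), could look something like this:

#!/bin/bash
# Step 1: pull assembly metadata for the tomato accession from NCBI
datasets summary genome accession GCF_000188115.5 > tomato.metadata
# Step 2: convert that metadata file to upper case
cat tomato.metadata | tr '[a-z]' '[A-Z]' > tomato.metadata.upper

That is really the whole pipeline we will be moving to the cloud: two commands, joined by one file.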

Why this code?

I’m not sure exactly why anyone would need to complete this specific task, but it does accomplish a few things that will be helpful in illustrating how pipelines in the cloud operate.

1. It contains multiple commands that depend upon one another

– i.e. a file produced by the first command becomes the input for the second command.

– This is a core feature of most bioinformatic pipelines and will help to illustrate the sort of utility that lets you develop more involved pipelines, which may have dozens or more commands with varying levels of interdependence.

– There are tasks that do not require multiple steps with input/output interdependence, but those single-step tasks can often be addressed more directly by requisitioning a cloud computer such as an EC2 instance, tunneling in, and simply running the task, without the extra work of abstracting your objectives through a pipeline framework.

2. It requires a piece of software (NCBI Datasets)

– That is, software that is not typically found preinstalled on a clean Linux system like the one you would get from spinning up an EC2 instance (an example install command follows below).
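For a sense of what that means in practice, installing the tool by hand on one machine might look something like this (I am assuming a conda-based setup here; ncbi-datasets-cli is the package name I have used from conda-forge, but check NCBI’s own install documentation if that does not match your environment):

foo@bar:~$ conda install -c conda-forge ncbi-datasets-cli

Doing that once is easy; doing it by hand on every machine your pipeline might land on is exactly what we want to avoid.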

Thus, to complete our objective we will have to figure out a way to ensure that when our important code gets run, it also knows where to access that critical piece of software. This becomes even more important when building bioinformatic pipelines with more steps that require compatibility with more bespoke software. Additionally, by our ending point the code will be scalable: it could handle hundreds of accession numbers, and, if you were so inclined to spend the coin, those hundreds of accession numbers could be processed simultaneously on hundreds of separate machines in complete parallel.
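To make that concrete, a rough local sketch of looping over many accessions might look something like the following (accessions.txt is a hypothetical file with one accession number per line; run this way the iterations happen one after another, whereas the point of the cloud deployment is to let each one run on its own machine):

#!/bin/bash
# accessions.txt is a hypothetical file listing one accession number per line
while read -r acc; do
    # Same two steps as before, parameterized by the accession number
    datasets summary genome accession "$acc" > "${acc}.metadata"
    cat "${acc}.metadata" | tr '[a-z]' '[A-Z]' > "${acc}.metadata.upper"
done < accessions.txt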

Hopefully you can see how building our pipeline in a manner that lets us explicitly automate the task of managing the software will save you a great deal of effort, because you will not have to log into 100+ different machines and install your favorite software by hand in order to reap the parallelization benefits of working on the cloud.

One final message from the author:

I also have a final note about the flavors of cloud, pipeline, and container I will be using, since I have not explicitly touched on this yet. In the coming vignettes I will use Docker to manage my containers, Nextflow to manage my pipeline, and AWS, specifically AWS Batch, for cloud computing resources. I recognize that there are many flavors of these three computing components and that these may not be familiar to you or may not be your preferred flavor. I have chosen them because they are the ones I have experience with, and because they offer features that I feel greatly enhance my ability to give you, the reader, a start-to-finish view without jeopardizing the entire project by building it on platforms that require an advanced background in pipeline deployments, walking you through hundreds of lines of dry, esoteric code, or asking you to learn systems that may not even be supported (or exist) a year from now. For example, in the case of Nextflow pipelines, setting up compatibility with AWS can be, to my eye, an extremely involved process requiring you to know many details about EC2, ECS, IAM, and Batch, but with the way I am presenting it here those details should be abstracted away and handled instead by Nextflow’s browser interface, Nextflow Tower.

So! If I have neglected your favorite pipeline element here, I truly am sorry; hopefully you can still take something away from this vignette, and, depending on how this whole process goes, hopefully I will get a chance to run similar blogs that let me incorporate other flavors of pipeline elements.

This concludes my overall thoughts on the blog.

Stay tuned for more additions.