Building a Modern Serverless Cloud for Bioinformatics

Eli Mernit

November 24, 2024 · 4 min read

The Cloud Has a Usability Problem

No one loves setting up infrastructure on AWS.

AWS is extremely powerful, modular, and complicated.

If you've worked in software for a few years, you're probably pretty good at AWS by now. But it still involves so many unnecessary steps.

You need to think about IAM roles.
You need to manage a Kubernetes cluster (wait, what's a pod?).
You need to think about security groups.

It's not impossible to learn it, of course. Before IDEs, we programmed on punch cards and it worked pretty well. But IDEs made us faster and more productive.

Today, the cloud feels a bit like programming with punch cards. And we think there's a better way.

What if the Cloud Was Mostly Invisible?

Beam is a new cloud platform where you can add simple Python decorators to your code to instantly run functions on the cloud.

Suppose you want to batch process data from GenBank.

You need a containerized cloud function.

Once you add the @function decorator, your code runs on the cloud instead of on your laptop.
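To give a feel for the pattern, here is a minimal local sketch of a decorator that attaches a `.remote()` method to a function. It is a stand-in, not Beam's SDK: the `function` factory, the `RemoteFunction` class, and the `cpu` parameter are hypothetical names for illustration, and `.remote()` here only runs the code on a worker thread rather than in a cloud container.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import update_wrapper


class RemoteFunction:
    """Local stand-in for a decorated cloud function (not Beam's implementation)."""

    def __init__(self, fn, **config):
        self.fn = fn
        self.config = config  # resource settings are only recorded in this sketch
        update_wrapper(self, fn)

    def __call__(self, *args, **kwargs):
        # A plain call runs in the current process, like any Python function.
        return self.fn(*args, **kwargs)

    def remote(self, *args, **kwargs):
        # A real platform would ship the code to a container. This stand-in
        # runs it on a worker thread and blocks until the result is ready.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(self.fn, *args, **kwargs).result()


def function(**config):
    """Hypothetical decorator factory mirroring the @function usage in the text."""
    def wrap(fn):
        return RemoteFunction(fn, **config)
    return wrap


@function(cpu=1)  # `cpu` is an assumed parameter name
def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


result = gc_content.remote("ATGC")
```

The calling code looks the same whether the function runs locally or elsewhere; only the `.remote()` suffix changes where the work happens.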

Behind the scenes, here's what Beam is doing:

Creating a runc container from your image
Scheduling your container on a server
Running your code and streaming the logs back to your shell

You don't need Docker installed and you don't need an AWS account. It's pretty cool!

Of course, you could also run this locally. But what if you wanted a GPU attached? And perhaps you'd like to add a custom base image too?

Let's add gpu and base_image parameters.

Now your function will run on a GPU.

Of course, your laptop might already have a GPU. But what if you wanted to batch process data on 100 containers in parallel?

Let's add a .map() which will process our data across 100 containers in the cloud.
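As a rough local analogy for this fan-out, the sketch below applies one function to 100 inputs concurrently. Threads stand in for containers here, and `map_in_parallel` is a hypothetical helper, not Beam's `.map()`; the return type and ordering of the real method may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_parallel(fn, inputs, max_workers=100):
    """Apply fn to every input concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the inputs in its results.
        return list(pool.map(fn, inputs))


def sequence_length(record: str) -> int:
    """Toy per-item task: the length of one sequence."""
    return len(record)


# 100 toy sequences of increasing length, one per simulated container.
records = ["ATGC" * n for n in range(1, 101)]
lengths = map_in_parallel(sequence_length, records)
```

Each input is independent of the others, which is what makes this kind of workload easy to spread across many workers.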

But wait, isn't this just a wrapper on AWS? Well, AWS sells servers, and this service runs your code on servers, so sort of.

But that's the extent of the analogy.

Behind the scenes, Beam leverages a network of servers across the world and is powered by a custom container runtime, scheduler, and image cache.

The code is fully open-source, so you can poke around at your leisure.

Plasmid Analysis: An End-To-End Bioinformatics Pipeline

In this example, we'll download plasmid data from GenBank and generate ML embeddings. To speed it up, we'll shard the ML embedding process across many containers in parallel.

Downloading Data & Filesystem I/O

The first step is downloading plasmid data from GenBank.

We'll run this like any ordinary Python file. But because the function is called with .remote(), the code runs on the cloud instead of on our laptop.
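For a sense of what the download step might involve, here is a sketch that fetches a GenBank record over HTTP and writes it to a directory. It assumes the NCBI E-utilities `efetch` endpoint and the standard library only; the pipeline itself may use a different client. The `out_dir` argument stands in for wherever your cloud filesystem is mounted, and `genbank_url` and `download_record` are hypothetical helper names.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching full records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession: str) -> str:
    """Build the efetch URL for one nucleotide record in GenBank flat-file format."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_record(accession: str, out_dir: str) -> Path:
    """Download one record and save it as <out_dir>/<accession>.gb."""
    target = Path(out_dir) / f"{accession}.gb"
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(genbank_url(accession), timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_record(accession, out_dir)` with an accession of your choice performs the network request; building the URL separately keeps that part easy to check offline.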

We can verify that the GenBank data was downloaded by using the CLI to view the Volumes.

Running Parallel Jobs

The next step is to send the plasmid sequence to an ML model to generate embeddings.

Sequences are embedded in batches of 3000 base pairs. We use the .map() method in Beam, which spawns an individual container for each batch of sequences.
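The batching step itself is simple to sketch. The helper below splits a sequence into consecutive 3000-base windows, with a shorter final window if the length isn't a multiple of 3000. The name `batch_sequence` is ours, and the real pipeline may handle overlaps or padding differently.

```python
from typing import List


def batch_sequence(sequence: str, batch_size: int = 3000) -> List[str]:
    """Split a sequence into consecutive, non-overlapping batches of bases."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        sequence[start:start + batch_size]
        for start in range(0, len(sequence), batch_size)
    ]


# A toy 8000-base sequence yields two full batches and one partial batch.
plasmid = "ACGT" * 2000
batches = batch_sequence(plasmid)
batch_lengths = [len(batch) for batch in batches]
```

Each element of `batches` is then one unit of work that can be handed to a separate container.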

We'll run this like an ordinary Python function.

Each output_url can be opened in a browser to view the generated embeddings.

Deploying Sharable Web Endpoints

You might want to deploy this code as a web API instead. You can do that with a single deploy command in your shell, which generates a versioned REST endpoint and prints the URL you can use to invoke the function.
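Once you have an endpoint URL, calling it is an ordinary HTTP request. The sketch below builds (but does not send) a JSON POST request using the standard library. The URL, the bearer-token header, and the payload shape are all placeholders and assumptions; use whatever your deployment actually prints and expects.

```python
import json
from urllib.request import Request


def build_invoke_request(endpoint_url: str, token: str, payload: dict) -> Request:
    """Build a JSON POST request for a deployed endpoint without sending it."""
    return Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute the URL printed at deploy time and your own token.
request = build_invoke_request(
    "https://example.com/plasmid-embeddings/v1",
    "YOUR_TOKEN",
    {"accession": "EXAMPLE_ACCESSION"},
)
# Sending it would be: urllib.request.urlopen(request)
```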

That was pretty quick!

We now have a sharable web endpoint, and we didn't have to set up API Gateway or our own SSL cert. Most importantly, we didn't need to wait for a DevOps engineer to deploy it for us.

Imagine if everything could be this fast.

Limitations for Bioinformatics

Of course, there are existing, powerful platforms that many know and love. While we believe in the power of Beam, there are several things that it cannot do.

There's no declarative DSL. Beam is just Python. Unlike Nextflow, there's no DSL to define a process. The Python code is the source of truth.
No file-based workflow execution. While you can read and write data to a cloud filesystem on Beam, it can't spawn tasks based on particular filenames the way Nextflow can.
No DAG viewer. There's no DAG visualization tool, so you can't easily see how your application is chained together without reading the Python code.

We're not bioinformatics experts, so there are probably other capabilities missing that we aren't aware of. If we missed something important, let us know! We are eager to learn.

Summary

In this guide, we demonstrated several cloud capabilities, created exclusively through Python decorators.

But most importantly, there are several things that weren't included:

Writing Dockerfiles or other DSLs
Configuring Kubernetes
Creating EC2 resources and VPCs
Setting permissions and access policies

That's because all of this stuff is hidden away.

Our guiding belief is that, as a scientist, you don't care about this DevOps nonsense. You're paid to discover cool and important things, not to fiddle with Kubernetes or AWS.

With Beam, you don't have to think about that stuff. Just write your Python, add some special cloud decorators, and run it on the cloud.

In our view, this is how the cloud is supposed to feel: powerful, accessible, and invisible.

If you're excited about this vision for the cloud, we'd love to chat -- feel free to reach out in Slack anytime.
