Celery on Docker: From the Ground Up

Published on Nov 15, 2018

Docker is hot. Docker is hotter than hot. Docker 1.0 was released in June 2014. Since then, it has been adopted at a remarkable rate. Over 37 billion images have been pulled from Docker Hub, the Docker image repository service. Docker is so popular because it makes it very easy to package and ship applications.

How do you dockerise an app? And how do you orchestrate your stack of dockerised components? This blog post answers both questions in a hands-on way. We are going to build a small Celery app that periodically downloads newspaper articles. We then break up the stack into pieces, dockerising the Celery app and its components. Finally, we put it all back together as a multi-container app.

What is Docker?

Docker lets developers package up and run applications via standardised interfaces. Such a package is called a Docker image. A Docker image is a portable, self-sufficient artefact, whichever programming language the application was written in. This makes it easy to create, deploy and run applications. In a way, a Docker image is a bit like a virtual machine image, but container images take up far less space than virtual machines.

When you run a Docker image to start an instance of your application, you get a Docker container. A Docker container is an isolated process that runs in user space and shares the OS kernel. Multiple containers can run on the same machine, each running as isolated processes. 

So far so good. What’s in it for you? Containers provide a packaging mechanism. Through this packaging mechanism, your application, its dependencies and libraries all become one artefact. If your application requires Debian 8.11 with Git 2.19.1, Mono 5.16.0, Python 3.6.6, a bunch of pip packages and the environment variable PYTHONUNBUFFERED=1, you define it all in your Dockerfile.
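
As a sketch (the real Dockerfile for our app comes later in this post), that list of requirements could be captured roughly like this:

# Dockerfile (illustrative sketch only)
FROM debian:8.11
ENV PYTHONUNBUFFERED=1
# ... install Git 2.19.1, Mono 5.16.0, Python 3.6.6 and your pip packages here ...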

The Dockerfile contains the build instructions for your Docker image. It also serves as excellent documentation. If you or other developers need to understand the requirements of your application, read the Dockerfile. The Dockerfile describes your application and its dependencies.

Docker executes the Dockerfile instructions to build the Docker image. This gives you repeatable builds, whatever the programming language. And it lets you deploy your application in a predictable, consistent way, whatever the target environment: a private data centre, the public cloud, virtual machines, bare metal or your laptop.

This gives you the ability to create predictable environments. Your development environment is exactly the same as your test and production environment. You as a developer can focus on writing code without worrying about the system that it will be running on.

For operations, Docker reduces the number of systems and custom deployment scripts. The focus shifts towards scheduling and orchestrating containers. Operations can focus on robustness and scalability. And they can stop worrying about individual applications and their peculiar environmental dependencies.

The newspaper3k Celery app

We are going to build a Celery app that periodically scans newspaper urls for new articles. We are going to save new articles to an Amazon S3-like storage service. This keeps things simple and we can focus on our Celery app and Docker. No database means no migrations. And S3-like storage means we get a REST API (and a web UI) for free. We need the following building blocks:

  • our newspaper3k Celery application
  • RabbitMQ as the message broker
  • Minio as the S3-like storage backend

Both RabbitMQ and Minio are open-source applications. Both binaries are readily available. This leaves us with building the newspaper3k Celery application. Let’s start with the pip packages we need (the full source code is available on GitHub):

# requirements.txt
celery==4.2.1
minio==4.0.6
newspaper3k==0.2.8

Next up is the Celery app itself. I prefer keeping things clear-cut. So we create one file for the Celery worker, and another file for the task. The application code goes into a dedicated app folder:

├── requirements.txt
└── app/
       ├── worker.py
       └── tasks.py

worker.py instantiates the Celery app and configures the periodic scheduler:

# worker.py
from celery import Celery

app = Celery(
  broker='amqp://user:password@localhost:5672',
  include=['tasks'])

app.conf.beat_schedule = {
  'refresh': {
    'task': 'refresh',
    'schedule': 300.0,
    'args': ([
      'https://www.theguardian.com',
      'https://www.nytimes.com'
    ],),
  },
}

The app task flow is as follows. Given a newspaper url, newspaper3k builds a list of article urls. For each article url, we need to fetch the page content and parse it. We calculate the article’s md5 hash. If the article does not exist in Minio, we save it. If the article does exist in Minio, we save it only if the md5 hashes differ.

Our aim is concurrency and scalability. To achieve this, our tasks need to be atomic and idempotent. An atomic operation is an indivisible and irreducible series of operations such that either all occur, or nothing occurs. A task is idempotent if it does not cause unintended effects when called more than once with the same arguments. The refresh task takes a list of newspaper urls. For each newspaper url, the task asynchronously calls fetch_source, passing the url. 

# tasks.py
from worker import app

@app.task(bind=True, name='refresh')
def refresh(self, urls):
  for url in urls:
    fetch_source.s(url).delay()

The fetch_source task takes a newspaper url as its argument. It generates a list of article urls. For each article url, it invokes fetch_article.

# tasks.py
import newspaper

@app.task(bind=True, name='fetch_source')
def fetch_source(self, url):
  source = newspaper.build(url)
  for article in source.articles:
    fetch_article.s(article.url).delay()

The fetch_article task expects the article url as its argument. It downloads and parses the article. It calls save_article, passing the newspaper’s domain name, the article’s title and its content.

# tasks.py
from urllib.parse import urlparse

@app.task(bind=True, name='fetch_article')
def fetch_article(self, url):
  article = newspaper.Article(url)
  article.download()
  article.parse()
  url = urlparse(article.source_url)
  save_article.s(url.netloc, article.title, article.text).delay()

The save_article task requires three arguments: the newspaper’s domain name, the article’s title and its content. The task takes care of saving the article to Minio. The bucket name is the newspaper domain name. The key name is the article’s title. Here, we use the queue argument in the task decorator. This sends the save_article task to a dedicated Celery queue named minio. This gives us extra control over how fast we write new articles to Minio. It helps us achieve a good, scalable design.

# tasks.py
import hashlib
from io import BytesIO

from minio import Minio
from minio.error import BucketAlreadyExists, BucketAlreadyOwnedByYou, NoSuchKey

@app.task(bind=True, name='save_article', queue='minio')
def save_article(self, bucket, key, text):
  minio_client = Minio('localhost:9000',
    access_key='AKIAIOSFODNN7EXAMPLE',
    secret_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    secure=False)
  try:
    minio_client.make_bucket(bucket, location="us-east-1")
  except (BucketAlreadyExists, BucketAlreadyOwnedByYou):
    pass

  hexdigest = hashlib.md5(text.encode()).hexdigest()

  try:
    st = minio_client.stat_object(bucket, key)
    update = st.etag != hexdigest
  except NoSuchKey:
    update = True

  if update:
    stream = BytesIO(text.encode())
    minio_client.put_object(bucket, key, stream, stream.getbuffer().nbytes)

When it comes to deploying and running our application, we need to take care of a couple of things. This is typically solved by writing scripts. Specifically, we need to:

  • ensure the correct Python version is available on the host machine and install or upgrade if necessary
  • ensure a virtual Python environment for our Celery app exists; create and run pip install -r requirements.txt if necessary
  • ensure the desired RabbitMQ version is running somewhere in our network
  • ensure the desired Minio version is running somewhere in our network
  • deploy the desired version of your Celery app
  • ensure the following processes are set up and configured in Supervisor or Upstart (see the sketch after this list):
    • Celery beat
    • default queue Celery worker
    • minio queue Celery worker
  • restart Supervisor or Upstart to start the Celery workers and beat after each deployment
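
As an illustration of the manual approach, a single Supervisor program entry for the default queue Celery worker might look roughly like this (paths and names are hypothetical):

# /etc/supervisor/conf.d/worker.conf (illustrative sketch only)
[program:celery-worker]
command=/opt/app/venv/bin/celery worker --app=worker.app --loglevel=INFO
directory=/opt/app
autostart=true
autorestart=true

You would need similar entries for the minio queue worker and for Celery beat, multiplied by every host you deploy to.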

Dockerise all the things 

Easy things first. Both RabbitMQ and Minio are readily available as Docker images on Docker Hub. Docker Hub is the largest public image library and the go-to place for open-source images. This leaves us with dockerising our Celery app. The first step is to create two new files: Dockerfile and .dockerignore.

├── Dockerfile
├── .dockerignore
├── requirements.txt
└── app/
       ├── worker.py
       └── tasks.py

.dockerignore serves a similar purpose to .gitignore. When we copy files into the Docker image during the Docker build process, any file that matches a pattern defined in .dockerignore is excluded.
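
A minimal .dockerignore for this project might look like this (exactly what you exclude is up to you):

# .dockerignore
.git
__pycache__
*.pyc
.dockerignore
Dockerfile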

Dockerfile contains the commands required to build the Docker image. Docker executes these commands sequentially. Each instruction creates a layer. Layers are cached and re-used by multiple images. This saves disk space and reduces the time it takes to build images.

# Dockerfile
FROM python:3.6.6  
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 PYTHONUNBUFFERED=1

WORKDIR /  
COPY requirements.txt ./  
RUN pip install --no-cache-dir -r requirements.txt  
RUN rm requirements.txt  

COPY . /  
WORKDIR /app

We use the python:3.6.6 Docker image as our base. The python:3.6.6 image is available on Docker Hub. Then, we set some environment variables. LANG and LC_ALL configure Python’s default locale setting. Setting PYTHONUNBUFFERED=1 avoids some stdout log anomalies.

Next, COPY requirements.txt ./ copies the requirements.txt file into the image’s root folder. We then run pip install and delete requirements.txt from the image, as we no longer need it. Finally, COPY . / copies the entire project into the image’s root folder, excluding anything that matches a pattern in the .dockerignore file. As the app is now in the image’s /app directory, we make this our working directory, meaning that any command executes inside this directory by default. Execute the Dockerfile build recipe to create the Docker image:

docker build . -t worker:latest

The -t option assigns a meaningful name (tag) to the image. The colon in the tag allows you to specify a version. If you do not provide a version (worker instead of worker:latest), Docker defaults to latest. Do specify a version for anything which is not local development. Otherwise, sooner or later, you will have a very hard time.
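
For example, to build and tag a specific (hypothetical) version:

docker build . -t worker:0.1.0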

Refactor the Celery app

Containerising an application has an impact on how you architect the application. If you want to dive deeper, I recommend you check out the twelve-factor app manifesto. To ensure portability and scalability, twelve-factor requires separation of config from code. An app’s config is everything that is likely to vary between environments.

The twelve-factor app stores config in environment variables. Environment variables are easy to change between environments. Environment variables are language-agnostic. Environment variables are deeply ingrained in Docker. Let’s refactor how we instantiate the Celery app.

# worker.py
import os
from celery import Celery

app = Celery(
  broker=os.environ['CELERY_BROKER_URL'],
  include=('tasks',))

app.conf.beat_schedule = {
  'refresh': {
    'task': 'refresh',
    'schedule': float(os.environ['NEWSPAPER_SCHEDULE']),
    'args': (os.environ['NEWSPAPER_URLS'].split(','),)
  },
}

We can simplify further. Any Celery setting (the full list is available here) can be set via an environment variable. The name of the environment variable is derived from the setting name: uppercase the setting name and prefix it with CELERY_. For example, to set broker_url, use the CELERY_BROKER_URL environment variable.

# worker.py
import os
from celery import Celery

app = Celery(include=('tasks',))

app.conf.beat_schedule = {
  'refresh': {
    'task': 'refresh',
    'schedule': float(os.environ['NEWSPAPER_SCHEDULE']),
    'args': (os.environ['NEWSPAPER_URLS'].split(','),)
  },
}

We also need to refactor how we instantiate the Minio client. 

# tasks.py
import os

@app.task(bind=True, name='save_article', queue='minio')
def save_article(self, bucket, key, text):
  minio_client = Minio(os.environ['MINIO_HOST'],
    access_key=os.environ['MINIO_ACCESS_KEY'],
    secret_key=os.environ['MINIO_SECRET_KEY'],
    secure=int(os.getenv('MINIO_SECURE', '0')))
  ...

Rebuild the image:

docker build . -t worker:latest

Configuration

Our Celery app is now configurable via environment variables. Let’s summarise the environment variables required for our entire stack:

Worker image:

  • CELERY_BROKER_URL
  • MINIO_HOST
  • MINIO_ACCESS_KEY
  • MINIO_SECRET_KEY
  • NEWSPAPER_SCHEDULE
  • NEWSPAPER_URLS

Minio image:

  • MINIO_ACCESS_KEY
  • MINIO_SECRET_KEY

You need to pass the correct set of environment variables when you start the containers with docker run. In reality you will most likely never use docker run. Instead, you will use an orchestration tool like Docker Compose, even when you run only a single container. I will skip the details for docker run (you can find the docs here) and jump straight to Docker Compose.
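
For a sense of what that would involve, starting just the worker container by hand would look roughly like this (hostnames such as rabbitmq and minio only resolve once the containers share a network, which is exactly what Docker Compose sets up for us below):

# illustrative sketch only
docker run -d \
  -e CELERY_BROKER_URL=amqp://guest:guest@rabbitmq:5672 \
  -e MINIO_HOST=minio:9000 \
  -e MINIO_ACCESS_KEY=token \
  -e MINIO_SECRET_KEY=secret \
  -e NEWSPAPER_URLS=https://www.theguardian.com,https://www.nytimes.com \
  -e NEWSPAPER_SCHEDULE=300 \
  worker:latest \
  celery worker --app=worker.app --loglevel=INFO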

Orchestrate the stack with docker-compose

Now that we have all our Docker images, we need to configure, run and make them work together. This is similar to arranging music for performance by an orchestra. We have individual lines of music. But we need to make them work together in harmony.

Container orchestration is about automating deployment, configuration, scaling, networking and availability of containers. Docker Compose is a simple tool for defining and running multi-container Docker applications. With Docker Compose, we can describe and configure our entire stack using a YAML file, the docker-compose.yml. With a single command, we can create, start and stop the entire stack.

Docker Compose creates a single network for our stack. Each container joins the network and becomes reachable by other containers. Docker Compose assigns each container a hostname identical to the container name. This makes each container discoverable within the network.

We define five services (worker, worker-minio, beat, rabbitmq and minio) and one volume in docker-compose.yml. Services are Docker Compose speak for containers in production. A service runs an image and codifies the way that image runs. Volumes provide persistent storage. For a complete reference, make sure to check out the Docker Compose file docs.

# docker-compose.yml
version: '3.4'
services: 
  worker:
    build: .
    image: &img worker 
    command: [celery, worker, --app=worker.app, --pool=gevent, --concurrency=20, --loglevel=INFO]
    environment: &env      
      - CELERY_BROKER_URL=amqp://guest:guest@rabbitmq:5672
      - MINIO_HOST=minio:9000
      - MINIO_ACCESS_KEY=token
      - MINIO_SECRET_KEY=secret
      - NEWSPAPER_URLS=https://www.theguardian.com,https://www.nytimes.com
      - NEWSPAPER_SCHEDULE=300
    depends_on:
      - beat
      - rabbitmq
    restart: 'no'
    volumes:
      - ./app:/app 

  worker-minio:
    build: .
    image: *img
    command: [celery, worker, --app=worker.app, --pool=gevent, --concurrency=20, --queues=minio, --loglevel=INFO]
    environment: *env
    depends_on:
      - beat
      - rabbitmq
    restart: 'no'
    volumes: 
      - ./app:/app

  beat:
    build: .
    image: *img
    command: [celery, beat, --app=worker.app, --loglevel=INFO]
    environment: *env
    depends_on:
      - rabbitmq
    restart: 'no'
    volumes:
      - ./app:/app

  rabbitmq:
    image: rabbitmq:3.7.8
    
  minio:
    image: minio/minio:RELEASE.2018-11-06T01-01-02Z
    command: [server, /data]
    environment: *env
    ports:
      - 80:9000
    volumes:
      - minio:/data
      
volumes:
  minio:

Let’s go through the service properties one-by-one. 

  • build: a string containing the path to the build context (the directory where the Dockerfile is located). Alternatively, it can be an object with the path specified under context and, optionally, Dockerfile and args. This is useful when using docker-compose build worker as an alternative to docker build. Or when you want Docker Compose to automatically build the image for you when it does not exist.
  • image: the image name
  • command: the command to execute inside the container
  • environment: environment variables
  • ports: expose container ports on your host machine. For example, minio runs on port 9000. We map it to port 80, meaning it becomes available on localhost:80.
  • restart: what to do when the container process terminates. Here, we do not want Docker Compose to restart it.
  • volumes: map a persistent storage volume (or a host path) to an internal container path. For local development, mapping to a host path allows you to develop inside the container. For anything that requires persistent storage, use a Docker volume. Here, we get minio to use a Docker volume. Otherwise, we lose all data when the container shuts down. And containers are very transient by design.
  • depends_on: determines the order in which Docker Compose starts the containers. This only determines the startup order; it does not guarantee that the container it depends on is up and running. RabbitMQ starts before the beat and worker containers, but by the time the beat and worker containers are up and running, RabbitMQ may still be starting. Check out the logs using docker-compose logs worker or docker-compose logs beat.

Persistent storage is defined in the volumes section. Here, we declare one volume named minio. This volume is mounted as /data inside the Minio container. And we start Minio so it stores its data to the /data path, which is the minio volume. Volumes are the preferred mechanism for persisting data generated by and used by Docker containers. You can find out more about how Docker volumes work here. And here is more about the volumes section in docker-compose.yml.
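
Once the stack is up (see the next section), you can inspect the volume with the plain Docker CLI. Note that Docker Compose prefixes the volume name with the project name, which defaults to the name of the folder containing docker-compose.yml:

# list volumes and inspect the minio volume (replace <project> with your project name)
docker volume ls
docker volume inspect <project>_minio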

In case you are wondering what the ampersand (&) and asterisk (*) are all about: they help you with repeated nodes. An ampersand defines an anchor on a node. You can then reference that node with an asterisk (an alias). This is very helpful for image names. If you use the same image in different services, you need to define the image only once. When you upgrade to a newer image version, you only need to change it in one place within your yaml.

The same applies to environment variables. You define them for your entire stack only once and then reference them in all your services. When you need to amend something, you need to do it only once. This also helps with sharing the same environment variables across your stack. For instance, the minio container requires MINIO_ACCESS_KEY and MINIO_SECRET_KEY for access control. We reuse the same variables on the client side in our Celery app.
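
Here is the anchor/alias pattern in isolation, stripped down from our actual stack:

# illustrative sketch only
services:
  worker:
    image: &img worker        # &img anchors the value "worker"
    environment: &env         # &env anchors the whole list below
      - MINIO_ACCESS_KEY=token
      - MINIO_SECRET_KEY=secret
  beat:
    image: *img               # *img re-uses the anchored value
    environment: *env         # *env re-uses the anchored list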

Start the Docker stack

With the docker-compose.yml in place, we are ready for show time. Go to the folder where docker-compose.yml is located and start the Docker stack with:

# start up
docker-compose up -d

Minio should become available on http://localhost. Use the key and secret defined in the environment variable section to log in. Follow the logs with docker-compose logs -f. Or docker-compose logs -f worker to follow the worker’s logs only.

Say you need to add another Celery worker container (bringing the total number of worker threads from 20 to 40).

# scale up number of workers
docker-compose up -d --scale worker=2

And back down again.

# scale down number of workers
docker-compose up -d --scale worker=1

Conclusion

This was pretty intense. But we have come a long way. We started discussing the benefits of running an application on Docker. We then took a deep dive into two important building blocks when moving to Docker:

  • containerise a Celery application
  • orchestrate a container stack with Docker Compose

If you want to dig deeper, there are plenty of resources covering important design aspects of building containerised apps, as well as orchestration with Docker Compose.

Docker Compose is a great starting point. It’s a great tool for local development and continuous integration. And it can make sense in small production environments. At the same time, Docker Compose is tied to a single host and limited in larger and dynamic environments.

This is where Kubernetes shines. Kubernetes is the de-facto standard for container orchestration and it excels at scale. In my next blog post, we will migrate our little Celery-newspaper3k-RabbitMQ-Minio stack from Docker Compose to Kubernetes.