
A Guide to Docker Image Optimization


by Nir Cohen

Docker is everything. Some have even claimed that it can prevent crime and famine. (I probably don’t need to ask you to note the sarcastic tone).

But seriously, for the disruption it has brought to the industry, and the true market gap it has filled for many, it deserves respect. However, as with all tools, Docker has its upsides and downsides. One pain point we constantly ran into was image size. To truly leverage Docker in our daily work, we needed a way to shrink the output images significantly, mostly because the majority of our clients require offline installations, which renders DockerHub unusable for us.

This post is a hands-on guide to the process we went through to cut our Docker images down to half their size (using a single example), ending with a real, working example complete with documentation.

So let’s start from the top…

Docker? Why?

Working on Cloudify eventually brought us to containers. Why, you ask? Well, containers provided us with the following:

  • Multi-distro support: We’re able to run our stack on different distros with very minimal adjustments.
  • Easier and more robust upgrades: We’re not there yet, but we’re trying to provide an environment where you could easily upgrade different components.

    Using containers will allow us to reduce the number of moving parts to a minimum so that we can have more confidence in our upgrade process. Fewer unknowns. Using containers will also allow us to generate a more unified upgrade flow.

  • Modularity and Composability: Users will be able to build a different Cloudify topology by replacing or adding services. We’re aiming to have our stack completely composable. That means you’ll be able to deploy our containers on multiple hosts and cluster them easily using well-defined deployment methods. Using containers will allow us to do just that.

New Problems

However, as we started journeying into using Docker, we stumbled onto several problems:

  • Most of our customers require offline installations. Thus, we can’t use DockerHub. In turn, this means that we must export or save the images and allow customers to download them and import or load them on Cloudify’s Management machine(s) (see the save/load sketch after this list).
  • Due to the above, the images should be as small as possible. It’s very easy to create very large Docker images (just install openjdk), and since our stack is an extremely large one (as of the time of writing, it comprises Nginx, InfluxDB, Java, Riemann, logstash, Elasticsearch, Erlang, RabbitMQ, Python and NodeJS), we can’t allow our images to grow substantially.
  • The stack has to be maintainable. Managing a stack of this complexity on every build is cumbersome. We have to make everything as organized as possible.
  • As Cloudify is open source, we would like to provide a way for users/customers to build their own Cloudify stack. This requires that our Dockerfiles are tidy and that the environment’s setup is simplified as much as possible.
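
For a rough idea of what that offline flow looks like, here’s a minimal save/load sketch (the image and file names are just examples):

# On the build machine: export the image to a file customers can download
docker save cloudify/logstash | gzip > cloudify-logstash.tar.gz

# On the offline Management machine: load it back into Docker
gunzip -c cloudify-logstash.tar.gz | docker load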

So what to do?

Naturally, we need to perform several steps to get to the holy grail of optimized Docker images and Dockerfiles. This is still a work in progress, but we’re getting there. Let’s review how we can optimize our Dockerfiles, taking all of the above considerations into account.

Orange you glad I didn’t say Dockerfile?

A lot of articles on the web suggest different methods for optimizing Docker images.

As Docker images are made of layers, and each instruction adds a new layer on top of the previous (now read-only) one rather than replacing it, it’s important to understand the bits and bytes of building consolidated Dockerfiles.

Let’s take an example Dockerfile and see how we can optimize it. We’ll use a logstash Dockerfile as an example.

We won’t be spending time on learning how to write Dockerfiles though. To learn about the Dockerfile syntax, see Docker’s documentation.

Our Dockerfile:

FROM ubuntu:trusty
MAINTAINER Gigaspaces, cosmo-admin@gigaspaces.com

ADD LOGSTASH_NOTICE.txt /root/LOGSTASH_NOTICE.txt
ENV LOGSTASH_SERVICE_DIR /etc/service/logstash
ENV LOGSTASH_CONF_FILE ${LOGSTASH_SERVICE_DIR}/logstash.conf

RUN mkdir -p ${LOGSTASH_SERVICE_DIR}
ADD logstash.conf ${LOGSTASH_SERVICE_DIR}/

RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y openjdk-7-jdk
RUN curl https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz \
   --create-dirs -o /opt/tmp/logstash.tar.gz
RUN tar -C ${LOGSTASH_SERVICE_DIR}/ -xzvf /opt/tmp/logstash.tar.gz \
   --strip-components=1
RUN mkdir -p ${LOGSTASH_SERVICE_DIR}/logs

VOLUME /etc/service/logstash/logs
EXPOSE 9999
EXPOSE 19219

CMD ${LOGSTASH_SERVICE_DIR}/bin/logstash -f ${LOGSTASH_CONF_FILE} \
   -l ${LOGSTASH_SERVICE_DIR}/logs/logstash.log --verbose

Now let’s build this image.

docker build -t cloudify/logstash .

will produce this:

REPOSITORY         TAG      IMAGE ID       CREATED          VIRTUAL SIZE
cloudify/logstash  latest   cc20af189412   52 seconds ago   802.7 MB

802.7 MB just for logstash? I don’t think so. Will this work? Yes. Will it scale? No.

Let’s summarize what we did there:

  • We declared the base image we’ll be using.
  • We copied a NOTICE file into the container.
  • We set some environment variables.
  • We created the logstash service directory.
  • We copied a logstash configuration file.
  • We downloaded and extracted logstash.
  • We declared a volume for logstash’s logs so that we can mount it into the host or a data container to keep the logs persistent.
  • We exposed ports to the underlying host.
  • We declared the command which will be executed when the container is started.

Let’s Optimize

We have several problems here.

Our image is bulky for no good reason, and our Dockerfile is disorganized and has no in-file documentation. Additionally, we would like the development process to be as short as possible, so that we waste little time waiting for images to be built.

Please note that I’m not declaring any of the following as best practices but rather potential practices.

These depend on the specific case you’re trying to solve. What is true for one, is not necessarily true for the other. Think for yourselves.

Unnecessary base image

Do we really need to use the ubuntu:trusty image? This image is 188MB, while debian:jessie is 122MB. debian:wheezy is even smaller at 85MB.

Use the smallest base image possible.

Go binaries are statically linked. This allows us to use images like scratch or BusyBox, which are only a few MB in size. If we’re running Go, we might use those images to end up with an image a few tens of MBs in size.
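
As a minimal sketch (the binary name is hypothetical), such a Dockerfile could be as small as:

FROM scratch
# a statically linked Go binary needs nothing else in the image
COPY myapp /myapp
CMD ["/myapp"]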

In our case, we require Java to run. We could try to use either scratch or BusyBox, install apt and the rest of the dependencies on top of them, and build the image that way to reduce its size substantially.

By all means, go ahead and use BusyBox with only the dependencies required for your application to run.

For this example, we’ll use debian:jessie instead.

So we would do something like this:

FROM debian:jessie

And get:

REPOSITORY          TAG      IMAGE ID       CREATED            VIRTUAL SIZE
cloudify/logstash   latest   ebfb98409931   About an hour ago  741.3 MB

While this isn’t necessarily a must, it just makes good sense. Why cram all kinds of junk you don’t need into your image?

Too many layers

If you look carefully at Docker’s documentation, you will notice that each command declared in the Dockerfile creates an additional writable layer on top of the previous (now read-only) layer.

Executing:

docker images --tree

(which was unfortunately deprecated; dockviz can compensate for it) will result in the following:

└─511136ea3c5a Virtual Size: 0 B
  └─27d47432a69b Virtual Size: 188.1 MB
    └─5f92234dcf1e Virtual Size: 188.3 MB
      └─51a9c7c1f8bb Virtual Size: 188.3 MB
        └─5ba9dab47459 Virtual Size: 188.3 MB Tags: ubuntu:trusty
          └─15dada7ba6a4 Virtual Size: 188.3 MB
            └─4b9f9322aad4 Virtual Size: 188.3 MB
              └─dff3a51a4d91 Virtual Size: 188.3 MB
                └─0fe747ee2b89 Virtual Size: 188.3 MB
                  └─76ec9af8b326 Virtual Size: 188.3 MB
                    └─272b3cc11949 Virtual Size: 188.3 MB
                      └─5b34e45eaa6d Virtual Size: 208.8 MB
                        └─a4b578204294 Virtual Size: 220.2 MB
                          └─a1357731b34a Virtual Size: 575.4 MB
                            └─64a2567849a6 Virtual Size: 661 MB
                              └─bd1155a6745e Virtual Size: 802.7 MB
                                └─138d6f278c8a Virtual Size: 802.7 MB
                                  └─4c0fbb9e92da Virtual Size: 802.7 MB
                                    └─8e9fbad9656c Virtual Size: 802.7 MB
                                      └─cc20af189412 Virtual Size: 802.7 MB

So adding a line like so:

ENV MY_NICE_VARIABLE I_LOVE_YOU

While nice (literally), it will add a layer, which potentially complicates the structure.

There ARE reasons not to consolidate all possible layers. For instance, when you’re developing, one of the things you would like to do is use Docker’s caching feature to build from the latest possible layer, so that you don’t have to rebuild everything from scratch.

Consolidating Commands

We can consolidate the ENV, RUN and EXPOSE commands. So, for instance, we would do this:

ENV LOGSTASH_SERVICE_DIR=/etc/service/logstash \
   LOGSTASH_CONF_FILE=${LOGSTASH_SERVICE_DIR}/logstash.conf
RUN apt-get update && \
   mkdir -p ${LOGSTASH_SERVICE_DIR} && \
   ...
EXPOSE 19219 9999

An important note about consolidating ENV declarations:

The above example will not work for us. The reason is that the 'LOGSTASH_CONF_FILE' env var references the 'LOGSTASH_SERVICE_DIR' env var, but they’re both declared in the same instruction (and therefore the same layer). When Docker substitutes environment variables, it only uses variables that were declared in previous instructions.

In this case, we can do one of two things (both sketched after the list).

  • Deconsolidate the env vars to their original state.
  • Not use the 'LOGSTASH_SERVICE_DIR' env var in the 'LOGSTASH_CONF_FILE' env var but rather hardcode its value.
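
A minimal sketch of both options:

# Option 1: deconsolidate, so the second ENV can reference the first
ENV LOGSTASH_SERVICE_DIR /etc/service/logstash
ENV LOGSTASH_CONF_FILE ${LOGSTASH_SERVICE_DIR}/logstash.conf

# Option 2: keep a single ENV instruction and hardcode the dependent value
ENV LOGSTASH_SERVICE_DIR=/etc/service/logstash \
   LOGSTASH_CONF_FILE=/etc/service/logstash/logstash.conf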

We’re not consolidating the two ADD commands as we want the files to reside in different folders (for the sake of this example, of course). Alternatives would be to have the files reside in the same folder or copy the entire directory and then add another RUN command to move the files around.

If we copied an entire folder, we would’ve moved the files inside the container like so:

ADD files/ ${LOGSTASH_SERVICE_DIR}/
ENV ...
RUN ...
   mv ${LOGSTASH_SERVICE_DIR}/files/logstash.conf
      ${LOGSTASH_SERVICE_DIR}
   mv ${LOGSTASH_SERVICE_DIR}/files/LOGSTASH_NOTICE.txt /root/
   ...

Let’s leave it at that.

What the…?

What we just did is a pretty dumb consolidation process. What if we wanted to use a shared java image for logstash, Elasticsearch and Kibana? Let’s adjust our consolidation process a tad:

ENV LOGSTASH_SERVICE_DIR=/etc/service/logstash \
   LOGSTASH_CONF_FILE=${LOGSTASH_SERVICE_DIR}/logstash.conf
RUN apt-get update && \
   apt-get install -y openjdk-7-jdk
RUN mkdir -p ${LOGSTASH_SERVICE_DIR} && \
   ...
EXPOSE 19219 9999

By using two RUN commands instead of one, we’re creating an intermediate image on which we can later base other images.

This is actually a rather stupid example. What you would probably do is create a completely separate Dockerfile with only the java installation and use 'FROM my_user/my_java_image' in all of your java-requiring images. You get the point, so for the sake of this post, we’ll assume there are no required intermediate images.
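
As a rough sketch (the image name and split here are hypothetical), the shared java base image and one of its consumers might look like this:

# Dockerfile for my_user/my_java_image
FROM debian:jessie
RUN apt-get update && \
   apt-get install -y openjdk-7-jdk && \
   rm -rf /var/lib/apt/lists

# Dockerfile for each java-requiring service
FROM my_user/my_java_image
...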

So now we have:

REPOSITORY           TAG        IMAGE ID        CREATED               VIRTUAL SIZE
cloudify/logstash    latest     7a8e9c9ff367    About a minute ago    739.2 MB

Wait.. such a small decrease in size? Let’s continue and see.

Caching

Now let’s say you want to allow yourself to replace logstash’s configuration on a daily basis.

Docker’s caching feature compares the instruction strings (for commands such as RUN) as well as the checksums of files copied into the container (for ADD and COPY) when building.

This means that when rebuilding the image after a change in the configuration file, it will use your previously created layers and add a new layer on top of them with the recent change.

The first invalidated command in the file will cause a new layer tree to be created. So, for instance, changing one of the ENV declarations will cause the RUN command and everything below it to run again.

ENV ...
ADD files/LOGSTASH_NOTICE.txt /root/
RUN ... \
   mkdir -p ${LOGSTASH_SERVICE_DIR} && \
   ...
VOLUME ...
ADD files/logstash.conf ${LOGSTASH_SERVICE_DIR}/

Let’s review.

  • The ENV part didn’t change, so no new layers will be added.
  • The RUN, VOLUME and EXPOSE commands didn’t change either, so they’re served from the cache as well.
  • The ADD of the config file did change, so only here will a new layer be added to the tree.

The ADD command is vague

The ADD command copies files, extracts tars, and downloads files from URLs. To make your Dockerfile more readable and to ensure that you’re performing the action you meant to perform, we’ll use COPY instead of ADD when we simply want to copy files.

So we would do something like this:

COPY files/LOGSTASH_NOTICE.txt /root/
...
COPY files/logstash.conf ${LOGSTASH_SERVICE_DIR}/

Take it however you want; you can decide that this means nothing to you. I think that it pretty much declares intentions and that’s about it.

Replacement Resources

Do we really need openjdk?

OpenJDK is heavy; it pulls in something like 72 dependencies we probably don’t need. So, why don’t we try to use Oracle’s JRE, which comes as a tarball and can easily be extracted?

Let’s replace:

apt-get install openjdk-7-jdk -y

With:

ENV JAVABASE_VERSION=7 \
   JAVA_HOME=/etc/java \
   ...
RUN ... \
   curl https://dl.dropboxusercontent.com/u/407576/jre-${JAVABASE_VERSION}-linux-x64.tar.gz \
      --create-dirs -o /tmp/java.tar.gz && \
   mkdir ${JAVA_HOME} && \
   tar -xzvf /tmp/java.tar.gz -C ${JAVA_HOME} --strip=1 && \
   ...

Um... the Dropbox link is temporary, so just use your own. Anyhow, we now have:

REPOSITORY          TAG      IMAGE ID       CREATED          VIRTUAL SIZE
cloudify/logstash   latest   94b0eb29e619   11 minutes ago   498.7 MB

That’s more like it. Wait, it gets even better…

Service Version

You might want to be able to query your container for the version of the service it’s running.

In our case, let’s add an environment variable called 'LOGSTASH_VERSION' and assign it the value '1.4.2'.

ENV LOGSTASH_VERSION=1.4.2 \
   ...
RUN ... \
   curl https://download.elasticsearch.org/logstash/logstash/logstash-${LOGSTASH_VERSION}.tar.gz \
      --create-dirs -o /opt/tmp/logstash.tar.gz && \
   ...

By doing this, we can later run:

docker inspect -f "{{ .Config.Env }}" docker_logstash_1

This will return the list of environment variables for our container, from which we can extract the 'LOGSTASH_VERSION' variable.
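
For instance, assuming the container is named docker_logstash_1, a rough way to pull out just that variable could be:

docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' docker_logstash_1 \
   | grep LOGSTASH_VERSION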

Why not call it “VERSION” instead?

You might be running multiple services in the same container, or basing your image on a premade image which already contains “VERSION”. For instance, we use a javabase image which contains a 'JAVA_VERSION' env var.

Note that we’re downloading logstash’s tar using the env var and so if the env var’s value isn’t valid, the build will fail, which is good.

Unused resources and cleanup

  • Do we need curl after the build process is done?
  • Do we need the downloaded tar files?
  • Maybe we can delete some of apt’s cached tar files?
  • Maybe there are some dependencies we can remove?

We can do this:

RUN apt-get update && \
   apt-get install -y curl && \
   curl https://dl.dropboxusercontent.com/u/407576/jre-${JAVABASE_VERSION}-linux-x64.tar.gz \
      --create-dirs -o /tmp/java.tar.gz && \
   mkdir ${JAVA_HOME} && \
   tar -xzvf /tmp/java.tar.gz -C ${JAVA_HOME} --strip=1 && \
   curl https://download.elasticsearch.org/logstash/logstash/logstash-${LOGSTASH_VERSION}.tar.gz \
      --create-dirs -o /tmp/logstash.tar.gz && \
   rm -rf /var/lib/apt/lists; rm /tmp/*; apt-get purge curl -y; apt-get autoremove -y

This last line will remove curl completely, remove unnecessary dependencies we might have, delete our previously downloaded tar files, and clean apt’s cache.

It’s important to note that due to the layered nature of the images, you must clean everything up in the same RUN command in which you downloaded/installed the resources, as the previous layers are not rewritable.
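
To illustrate with a deliberately broken sketch (the URL is just a placeholder), splitting the download and the cleanup into separate RUN commands would not shrink the image at all:

# The tarball is baked into this layer...
RUN curl https://example.com/some.tar.gz --create-dirs -o /tmp/some.tar.gz
# ...so this later layer only hides it; the image stays just as large.
RUN rm /tmp/some.tar.gz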

Now that we’ve cleaned everything up, we get:

REPOSITORY          TAG      IMAGE ID       CREATED          VIRTUAL SIZE
cloudify/logstash   latest   f92ae1ac7c73   15 seconds ago   363.7 MB

Nice! We cut down more than half the size.

Using wheezy would decrease it even further and using BusyBox with only the relevant dependencies would do even more justice.

In-file Documentation

Let’s add some documentation to the file.

...
# For the sake of log persistency
VOLUME ${LOGSTASH_SERVICE_DIR}/logs
# UDP 9999 for syslog.
# TCP 19219 for our custom logger.
EXPOSE 9999/udp 19219
...

We built a pretty solid and human-readable Dockerfile. It doesn’t require much documentation. More complex Dockerfiles should be documented so that anyone reading them can understand the exact flow and the reasoning behind their configuration.

Optimized Dockerfile Example

FROM debian:jessie
MAINTAINER Gigaspaces, cosmo-admin@gigaspaces.com

COPY LOGSTASH_NOTICE.txt /root/LOGSTASH_NOTICE.txt
ENV JAVABASE_VERSION=7 \
   JAVA_HOME=/etc/java \
   LOGSTASH_VERSION=1.4.2 \
   LOGSTASH_SERVICE_DIR=/etc/service/logstash
ENV LOGSTASH_CONF_FILE=${LOGSTASH_SERVICE_DIR}/logstash.conf

RUN apt-get update && \
   mkdir -p ${LOGSTASH_SERVICE_DIR} && \
   apt-get install -y curl && \
   curl https://dl.dropboxusercontent.com/u/407576/jre-${JAVABASE_VERSION}-linux-x64.tar.gz \
      --create-dirs -o /tmp/java.tar.gz && \
   mkdir ${JAVA_HOME} && \
   tar -xzvf /tmp/java.tar.gz -C ${JAVA_HOME} --strip=1 && \
   curl https://download.elasticsearch.org/logstash/logstash/logstash-${LOGSTASH_VERSION}.tar.gz \
      --create-dirs -o /tmp/logstash.tar.gz && \
   tar -C ${LOGSTASH_SERVICE_DIR}/ -xzvf /tmp/logstash.tar.gz --strip-components=1 && \
   mkdir -p ${LOGSTASH_SERVICE_DIR}/logs && \
   rm -rf /var/lib/apt/lists; rm /tmp/*; apt-get purge curl -y; apt-get autoremove -y

# For the sake of log persistency
VOLUME ${LOGSTASH_SERVICE_DIR}/logs
# UDP 9999 for syslog.
# TCP 19219 for our custom logger.
EXPOSE 9999/udp 19219

COPY logstash.conf ${LOGSTASH_SERVICE_DIR}/

CMD ${LOGSTASH_SERVICE_DIR}/bin/logstash -f ${LOGSTASH_CONF_FILE} -l ${LOGSTASH_SERVICE_DIR}/logs/logstash.log --verbose

Author Bio

Nir Cohen is an architect at GigaSpaces working on Cloudify and a co-organizer of DevOps Days Tel Aviv. Nir is a relatively short, brown eyed human being, loves animals and holds true to ethics as a life path. He likes to think, walk long distances, breathe and eat lettuce. He likes to think of himself as an innovative, think-tank type of guy who adores challenges and has an extremely strong affinity to automation and system architecture. He’s primarily driven by researching and deploying new systems and services. Find Nir on GitHub or follow him on Twitter.

This article was contributed to Developer.com. All rights reserved.
