So, you’ve put the finishing touches on your latest awesome application or feature (as much as code is ever really done) and you’re ready to release it to the wild. Implementation is the fun part, right? The last step in a long process, where you drop your code in a public place and do the big reveal to your customers, after which they cheer and shower you with accolades. Also, there is cake.
As much as I hope you’re able to achieve cakery with the lowest possible amount of pain, more often than not, implementation doesn’t go as smoothly as you would like. That genius code that passes all of its unit tests and runs flawlessly in the development and QA environments typically hits at least a few bumps when you deploy it to production.
We have been working on some pretty significant behind-the-scenes changes to Emma (still in early beta), designed to improve performance and scalability. Without giving away too much of the secret sauce, the new platform is a big change to our architecture, made up of multiple applications, each built following the principles of test-driven development (discussed in more detail by Kevin). Having a battery of automated unit tests to develop against has helped us ensure that each feature works as expected without breaking others in the process.
As we reached the end of the primary development phase, it came time to release the various applications to their new homes, a task undertaken by me and the infrastructure team. During this step, we experienced several challenges; even though we are very confident in the quality of our code and the new application architecture, there are some improvements we can still make to our implementation process.
Here are some of the issues we experienced during deployment of the code to production:
Moving from one machine to many
This was a big one. During development, the applications, their dependencies and the databases all lived on a single machine, typically the developer’s laptop. In the QA environment, we gave the databases their own machine, but all the applications and dependencies still shared a single server. In production, we break things down across even more machines, each specially built by the infrastructure team to handle a specific task. Even though we knew this was ultimately the goal, and had built in the ability to configure the servers to point to each other, there were still a few spots where an application went looking on localhost and got really angry when nobody was listening on the expected port.
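One way to avoid the localhost trap is to make every service location configurable from the start, with localhost only as a development fallback. Here’s a minimal sketch in Python, assuming the deploy process exports variables like DB_HOST and DB_PORT (the variable names and defaults are illustrative, not our actual configuration):

```python
import os

def service_endpoint(name, default_host="localhost", default_port=5432):
    """Resolve a service's host and port from the environment,
    falling back to development defaults when nothing is set."""
    host = os.environ.get(f"{name}_HOST", default_host)
    port = int(os.environ.get(f"{name}_PORT", default_port))
    return host, port

# In development nothing is exported, so everything resolves to localhost.
# In QA or production, the deploy process exports the real locations,
# and the code itself never has to change.
db_host, db_port = service_endpoint("DB", default_port=5432)
```

The nice side effect is that a misconfigured environment fails loudly at startup, in one obvious place, instead of an application quietly knocking on a port where nobody is listening.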
Troubleshooting, logging and monitoring
Troubleshooting in development is easy when you can redirect every single log and error message to stdout, but that’s not really an option in production. In addition, our infrastructure team likes to have log monitoring on all machines. This required us to revisit our logging strategy, to make sure we provided the infrastructure team with the information they needed in the format in which they needed it, while still giving the developers access to important troubleshooting information. Not that anything ever goes wrong in production.
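As a sketch of that kind of strategy, a small helper along these lines (the paths and log formats are illustrative, not our actual setup) can keep development output on stdout while writing a stable, parseable format to a file that a log monitor can watch:

```python
import logging
import os
import sys

def configure_logging(app_name, environment, log_dir="/var/log"):
    """In development, send human-friendly output to stdout; everywhere
    else, write a timestamped, parseable format to a predictable file
    that the infrastructure team's monitoring can tail."""
    logger = logging.getLogger(app_name)
    logger.setLevel(logging.DEBUG)
    if environment == "development":
        handler = logging.StreamHandler(sys.stdout)
        fmt = "%(levelname)s %(message)s"
    else:
        # One file per application, in a location ops already monitors.
        handler = logging.FileHandler(os.path.join(log_dir, f"{app_name}.log"))
        fmt = "%(asctime)s %(name)s %(levelname)s %(message)s"
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
    return logger
```

Centralizing the decision in one function means the answer to “where do the logs go in this environment?” lives in exactly one place, which is a question you only want to answer once.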
Minor environmental differences
Differences in assumptions about installation directories led to a few instances where code was deployed to the wrong directory or under the wrong user, or a directory was missing from the PATH. The resulting issues were easy to fix, yet frustrating and time-consuming to troubleshoot.
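A cheap guard against this whole class of problem is a preflight script that verifies every environmental assumption up front, before any code is copied over. A sketch in Python, with hypothetical directories and commands standing in for whatever a real deploy would read from its configuration:

```python
import os
import shutil

# Hypothetical expectations, purely for illustration; a real deploy
# script would load these from the project's deployment config.
EXPECTED_DIRS = ["/opt/myapp", "/opt/myapp/logs"]
EXPECTED_COMMANDS = ["python3"]

def preflight_check(dirs=EXPECTED_DIRS, commands=EXPECTED_COMMANDS):
    """Collect every problem in one pass, instead of failing
    halfway through a deploy on the first surprise."""
    problems = []
    for d in dirs:
        if not os.path.isdir(d):
            problems.append(f"missing directory: {d}")
    for cmd in commands:
        if shutil.which(cmd) is None:  # searches every entry on PATH
            problems.append(f"command not on PATH: {cmd}")
    return problems

# Example: verify assumptions before the deploy proper begins.
issues = preflight_check(dirs=["/tmp"], commands=["env"])
```

Reporting every problem at once, rather than exiting on the first one, turns several rounds of deploy-fail-fix into a single checklist you can hand to whoever owns the box.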
Missing code
The unit tests did such a great job setting up test environments that it wasn’t until we moved to production that I noticed a couple of pieces of code were missing entirely: a script we use to read logs, and a deploy script for an application that is a dependency of the primary application but also lives on its own. Fortunately, it was easy to build a production version of the log-reading script based on the test version, and to create the new deploy script. However, it added time to the implementation phase that we hadn’t planned for.
How to avoid them in the future
None of the issues we ran into during rollout were major flaws in the code or even our process. However, they led to delays in making the new code available to our early testers (in this case, my co-workers) and could be somewhat avoided in the future with a few enhancements to the process.
Budget time to handle the unexpected
More often than not, you will run into issues you just don’t expect. Even though you can’t create a detailed plan to handle those issues, you can still budget the time to get them resolved, helping to ensure the project meets its deadlines. We have recently tweaked our project management guidelines so that rollout starts near the beginning of a project, running concurrently with the rest of the project phases. The initial rollout efforts might be little more than deploying the basic shell of an application to a test environment. However, the earlier you start testing different deployment scenarios, the sooner you will be able to perfect the process and recognize any failure points or holes in your application architecture.
Get your code running in an environment that mirrors your production environment
As mentioned before, your application might score 100% with your automated test suite and work perfectly in a test environment, but until you put it in the environment where it will actually run when your customers access it, you haven’t really tested it. The best way to do this is to use your production environment itself; however, that often isn’t an option. Either production is busy serving the current version of your application to your customers, or the environment is still under construction, awaiting new hardware or the resources to build it. This is where the magic of virtual machines comes in. If the ultimate goal is to have the component parts of your application running on several machines, a single server divvied up into multiple virtual machines will give you a much more accurate test than running everything on one physical machine. Performance won’t be the best, but for the purposes of this exercise, the point is to make sure all of the separate components can talk to each other and function the way the architecture promised when it was first sketched out on a whiteboard. VMware is a popular tool around Emma. However, if you’re looking for something a bit more open source and automated, VirtualBox and Vagrant can be used to easily build and rebuild whole virtual environments in which to run your application.
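To give a flavor of the Vagrant approach, a Vagrantfile along these lines (the box name, hostnames and addresses are placeholders, not our actual setup) defines two VMs, one for an application and one for its database, on a private network so they have to find each other the same way production machines would:

```ruby
# Hypothetical two-machine topology; box name and IPs are placeholders.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/jammy64"

  config.vm.define "app" do |app|
    app.vm.hostname = "app01"
    app.vm.network "private_network", ip: "192.168.56.10"
  end

  config.vm.define "db" do |db|
    db.vm.hostname = "db01"
    db.vm.network "private_network", ip: "192.168.56.11"
  end
end
```

`vagrant up` builds both machines from scratch and `vagrant destroy` throws them away, which makes it cheap to rehearse a deploy as many times as it takes to get it right.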
Become friends with your infrastructure team
Our infrastructure team at Emma is made up of some great people, so I would want to be friends with them regardless of my selfish technical needs. However, on a purely professional level, we’re making sure to involve them as early as possible on all future projects, both to get their insight into the best practices for application architecture and to learn up front what they need in order to properly monitor the application or build out the hardware (virtual or otherwise). We are also coordinating schedules during critical points in the rollout process so that we can call on each other to resolve any issues that pop up without the lag of waiting for a response.
The issues I listed above are challenging, but not overwhelming, and I believe we’ll see some improvements in our process with the efforts we’ve undertaken to minimize their impact on our ability to build and deploy cool new features. A little patience and planning can go a long way to making our next round of feature releases even smoother than this one. Does anybody else have examples of unexpected issues they experienced while trying to deploy a new codebase for the first time?