Emma Tech

Wrap-up meetings provide feedback that teams need to improve

Everyone makes mistakes, but wise organizations learn from them.

Selena Deckelmann, 23 Feb 2011
Tags: maintenance windows, post-mortems, server maintenance

Nobody likes admitting mistakes. Except this guy.

Getting people to go to a post-mortem meeting is easy. Getting people to participate without a sense of impending doom is hard. Most people don’t want to be there. They show up ready to fight or make excuses, with a pit in their stomach as they wait to talk about what went wrong.

So, how do you fix that pit-in-the-stomach feeling? We’ve worked on this a bit at Emma, and here’s my formula:

  • Set high-level, achievable goals and have meetings even when things go right.
  • Focus on how everyone will work together to make things better in the future, not what went wrong.
  • Get everyone to participate.
  • Share with the whole company what the group learned.

Last week I wrote about some changes to our internal downtime process; today I’d like to follow up with details about our version of a post-mortem meeting.

Set high-level, achievable goals and meet about success

A maintenance window here is considered a success when we make our changes, recover from any failures without impacting production and end on time.
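
As a rough illustration, those three criteria could be written down as a small check. This is only a sketch: the `WindowOutcome` class and its field names are hypothetical and are not part of any tooling described in this post.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class WindowOutcome:
    """Hypothetical record of how one maintenance window went."""
    changes_completed: bool      # did we make the planned changes?
    production_impacted: bool    # did any failure leak into production?
    scheduled_end: datetime
    actual_end: datetime

    def is_success(self) -> bool:
        # All three criteria must hold for the window to count as a success.
        return (
            self.changes_completed
            and not self.production_impacted
            and self.actual_end <= self.scheduled_end
        )


if __name__ == "__main__":
    outcome = WindowOutcome(
        changes_completed=True,
        production_impacted=False,
        scheduled_end=datetime(2011, 1, 15, 6, 0),   # made-up times
        actual_end=datetime(2011, 1, 15, 5, 45),
    )
    print(outcome.is_success())
```

Writing the criteria down this way makes it obvious that finishing the changes is not enough on its own; a window that runs late or touches production still counts as a miss.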

As a group, we decided what was okay to include in the window and stripped out some riskier changes: tasks whose duration was hard to estimate, or ones that would cut into the time we had allocated for testing. Now, going into each window, we have a clear list of tasks, and we can assess the success or failure of each task after the change.

In that first window in January, we completed the following:

  • Upgraded our PostgreSQL databases
  • Recovered 5% of the disk space on our largest database cluster
  • Fixed a long-standing maintenance issue with parent tables on our largest database

We decided to have a meeting after the window, regardless of whether the change succeeded or failed.

Talk about what went well (aka Why I decided to call these meetings “wrap-ups”)

I always hated calling these discussions “post-mortems.” I get why tech people want to compare the process to a medical procedure, and I love a good zombie movie, but it sets the wrong tone. I decided to call them “wrap-ups” to make it clear that we’re there to reflect on the project, not to assign blame.

And here’s what we try to do in each wrap-up:

  • Spend time talking about how things went well, and why
  • Focus on how to improve future projects
  • Distill what we learned

Documenting how the team manages maintenance windows makes the great work people were already doing visible. We also open up the meetings so non-IT folks at Emma can contribute and make them better.

Conduct the discussion for 100% participation

After a maintenance window, we communicate the outcome to the rest of our colleagues. Then, I schedule a 30-minute meeting with a simple agenda. We go over what happened during the maintenance window to:

  • Discuss what went right
  • Discuss what went wrong
  • Determine what we could do to make things better next time

In our most recent wrap-up, seven people attended, and I requested at least one comment from each person on the agenda bullet points.

What we learned

In just 30 minutes, we came up with plenty of things that the group felt good about doing well and a set of clear changes to make in the future.

Here are some of the things people liked:

  • Creating a custom error message for the maintenance window
  • Having a phone bridge and using Campfire throughout the window to communicate
  • Using a wiki page to organize tasks and each task’s owner during the maintenance window
  • Using the change window to test out new Linux service scripts for the system administration team

This was our first maintenance window where we used both Campfire and a phone bridge at the same time for the whole team. We chose Campfire because anyone new who joined could easily see what conversation had already taken place. We used the phone bridge to make it simple to type commands and stay in touch at the same time.

In the past, we’d used email and RT tickets to document what was happening in the maintenance window. Everyone loved having a wiki page to reference and update instead. The wiki simply had a better UI than email or a ticket.

Finally, the systems administration team used the window to test out new service start/stop scripts for a series of custom applications. This is the type of thing that can go un-exercised when you rarely have downtimes or maintenance windows. The team was smart to seize the opportunity!

We also thought a few things didn’t go so well:

  • We didn’t give our customers enough of a heads-up.
  • Steps for the changes should have numbers, not just times associated with them.
  • Our testing took quite a while because the change affected all the databases at the same time, and tests only looked at one database at a time.

There may have been other things that people thought we could have done better, but we kept the list short and actionable. We’ll change the process slightly in the future to inform customers better, add numbers to all the steps and test databases concurrently.
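
To give a feel for what testing databases concurrently might look like, here is a minimal sketch using Python’s standard `concurrent.futures` module. It is not our actual test suite: the database names and the `check_database` function are placeholders, and a real check would connect to each database and run verification queries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical database names; the real ones are not listed in this post.
DATABASES = ["db_a", "db_b", "db_c"]


def check_database(name):
    """Placeholder smoke test for one database.

    A real check would connect and run verification queries; this stub
    only returns a (name, passed) pair so the sketch runs anywhere.
    """
    return name, True


def check_all(databases, max_workers=4):
    """Run check_database against every database at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(pool.map(check_database, databases))


if __name__ == "__main__":
    results = check_all(DATABASES)
    failed = [name for name, ok in results.items() if not ok]
    print("all passed" if not failed else "failed: " + ", ".join(failed))
```

Threads are a reasonable fit here because checks like these spend most of their time waiting on the database, so the total time is closer to the slowest single check than to the sum of all of them.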

Beyond this current window, I also asked everyone to imagine how we might do things differently or better during other downtimes.

A few ideas included:

  • Trying out video conferencing, such as Tokbox, during the maintenance to make communication even better
  • Pulling in more helpers for testing, both for training and to lighten the workload for the QA team
  • Using Salesforce to communicate upcoming changes internally

My favorite suggestion, though, was:

  • Playing “Point of no return” when we know everything worked

Feel free to comment below. I’d love to hear how you manage your meetings, and what you’ve learned.

1 comment

Benjamin N. Hare commented:

2012-04-20, 14:18

Thank you, thank you, thank you for taking the time to write such an informative post. As a mid-level IT technician I am finding it very difficult to figure out how larger shops go about streamlining their maintenance window processes. Each company I have worked at has done things wildly different. Sharing your experience in this blog post was extremely helpful for me.

Again, thank you for taking the time.
