Thursday, November 29, 2012

Scrum tales - Part 9 - The Five Whys

When the sprint ends, we expect all committed goals to succeed, which unfortunately is not always the case. Scrum teams must then meet to discuss what went wrong and to discover the flaws in the process vs. just tossing the blame around

Scrum teams meet to conduct a Sprint retrospective meeting, whose main purpose is to identify and analyze issues in the Sprint process and to figure out how to fix them in future sprints. A Sprint is an iterative process, and whenever goals fail there is always a reason why; it is not one team's or one person's fault but a flaw in the process. This underlying flaw is what teams almost always fail to identify, and so they skip finding a way to resolve it in future sprints

Let's reflect on what great companies already did before us and use it to our advantage - specifically, the Five Whys lean manufacturing technique, which will help us explore the cause-and-effect relationships underlying the specific issues we encounter and finally get to the root cause of a problem, or in our case, a failed sprint goal

How do the Five Whys work? The team meets to discuss an issue and iteratively asks "Why?" five times - or more specifically, "Why did the process fail?" - with each answer followed by determining a Proportional investment for how to solve the issue going forward. It's best to illustrate this with an example applicable to Scrum and the everyday goals we have:

The developer team ran its sprint, and the Product backlog item "Get new product maintenance version approved for production" failed. During the Scrum team retrospective meeting, the following questions should be asked and answered:
   1) First Why
Q: Why wasn't the product approved for production?
A: Not all High priority bugs were fixed in time to get the approval
Proportional investment: This is not a preferred solution, but if it comes down to a time crunch, you will roll up your sleeves for a weekend and catch up

   2) Second Why
Q: Why weren't all High priority bugs fixed?
A: New bugs were discovered during the unplanned third static testing round
Proportional investment: You will always assume the maximum number of static testing rounds when planning a new Sprint and never raise everyone's expectations by committing to releasing the product if the goal is too risky

   3) Third Why
Q: Why did the product go into three static testing rounds?
A: The product was of lower quality than expected even though it is a maintenance release
Proportional investment: You will always plan ahead for better internal self-testing before sending a new build to a static test round, to catch and fix the low-hanging fruit before it goes over to QA and results in unexpected new bugs

   4) Fourth Why
Q: Why was the product of lower quality than expected for a maintenance release?
A: Some of the engine code was rewritten in between test rounds in order to fix a specific bug that a customer requested to be patched
Proportional investment: Always raise a hand first when you see that large code changes are needed and review them with the Product owner vs. hacking away at previously tested code in the middle of product testing, no matter if the Almighty asked for the patch in person

   5) Fifth Why
Q: Why was the engine code rewritten during the final product testing phase?
A: We have to fix ASAP all bugs forwarded by the Support team that customers need fixed, no matter if we're in the final testing phase or not; we have a team performance goal to achieve
Proportional investment: No, you don't; this means our system is flawed and the product will be delayed, meaning all customers will have to wait or risk getting a lower quality build for the sake of one specific bug fix. Let's make a rule not to patch anything during the product testing phase and instead do it immediately after the release to production if deemed necessary
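
To keep the retrospective's findings from evaporating, it helps to record the chain in a structured form. Here's a minimal sketch of the example above as data - the Why record and its field names are made up for illustration, not taken from any Scrum tool:

from dataclasses import dataclass

# Minimal sketch: a Five Whys chain as structured data, so every answer
# carries its Proportional investment. Field names are illustrative only.
@dataclass
class Why:
    question: str
    answer: str
    proportional_investment: str

five_whys = [
    Why("Why wasn't the product approved for production?",
        "Not all High priority bugs were fixed in time",
        "Roll up your sleeves for a weekend if it comes to a time crunch"),
    Why("Why weren't all High priority bugs fixed?",
        "New bugs were found in an unplanned third static testing round",
        "Assume the maximum number of testing rounds when planning"),
    # ...the remaining three Whys from the example above...
]

def root_cause(chain):
    """The last answer in the chain is the process flaw to fix."""
    return chain[-1].answer

for i, why in enumerate(five_whys, 1):
    print(f"{i}) {why.question}")
print("Root cause:", root_cause(five_whys))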

To help out with asking and answering the Whys, I will volunteer to be the Why Master for all Scrum teams, meaning you need to invite me to all Scrum retrospective meetings where your sprint goals failed, so we can quickly discuss how to improve our processes and prevent future sprint goal failures

Monday, November 19, 2012

Responsible diversification

...or simply put: how to have a cross-functional team without risking Sprint goal failure

The issue

On one hand, Scrum is all for having cross-functional and self-managing teams vs. narrow specialists, while on the other hand each team member should be able to finish every goal and task within a Sprint no matter how complex it is. This appears to be the most common Scrum paradox, and I'm seeing numerous articles and even entire books on how to solve it

We want to:
   a) Make Scrum teams fully self-managing
   b) Have employees of all qualifications, both Jr. and Sr., focus on the same critical company goals

We don't want to:
   a) Isolate employees by making them run sprints on their own just because they cannot yet work on all tasks in the team
   b) Introduce valid excuses as to why Sprint goals weren't met
   c) Loosen up the Definition of Done and affect team deliverable quantity and/or quality

Solution

Our goal is to always KISS, so to fix the mentioned paradox we'll introduce some flexibility by allowing Scrum teams to be collectively qualified to achieve all assigned goals. What does this mean for Scrum teams compared to the practice so far (a small sketch follows the list):
   1) Each team member is no longer required to be able to perform every task in the sprint
   2) Each team must still be able to complete all team tasks with sufficient quality
   3) Tasks achievable by select team members only are to be treated as high-risk tasks, meaning they must be marked as such and prioritized within the sprint
   4) Each team must make an Internal redundancy plan for tasks achievable by select team members only and lay out the redundancy pairs in the Sprint planning meeting
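
As a sketch of point 4, here is one way a team could sanity-check its Internal redundancy plan during Sprint planning - the task/skill structures and the function are hypothetical, purely for illustration:

# Hypothetical sketch: flag sprint tasks achievable by select members only
# (high risk) and check whether each one has a redundancy pair yet.
def plan_redundancy(tasks, skills):
    """tasks: {task: required skill}; skills: {member: set of skills}."""
    report = {}
    for task, skill in tasks.items():
        qualified = [m for m, s in skills.items() if skill in s]
        report[task] = {
            "qualified": qualified,
            "high_risk": len(qualified) < len(skills),  # not everyone can do it
            "needs_pairing": len(qualified) < 2,        # no redundancy pair yet
        }
    return report

skills = {"Ann": {"engine", "specs"}, "Bo": {"specs"}, "Cy": {"ui", "specs"}}
tasks = {"Rewrite engine cache": "engine", "Update install specs": "specs"}

for task, info in plan_redundancy(tasks, skills).items():
    print(task, "->", info)
# "Rewrite engine cache" is high risk with no pair yet -> pair someone up in Sprint planning
# "Update install specs" is safe - everyone on the team can take it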

In practice - teams

   A) The Support team can have one part of the team (Writers/Analysts) focus on writing, analysis and other high-level tasks first, while the Product support guys focus on handling the actual customer support and the lower-level writing and posting tasks

   B) Core developer teams can directly integrate Jr. developers: have Sr. devs focus on new development design and complex bug fixes while Jr. devs work on low/medium complexity bugs and follow specs

In practice - individuals

   A) Working exclusively on simpler tasks, or taking ad-hoc tasks and insisting there is no more work left for you in the sprint once all low-level tasks are done, is an excellent way to not advance your career and to eventually have your regular review result in DNE

   B) If you can work on all team tasks, you must still roll up your sleeves when there is critical low-level / transactional work left; taking lower-priority but more complex work in such cases is even worse than not working at all

Friday, November 2, 2012

Tune in to correct bug frequency

Not all bugs are created equal, which means that some are "more equal" than others, especially those with a higher Probability / Frequency rating; but how do we determine the right frequency?

Here's what we found recently:

1) One of the main product features doesn't work at all when used on a specific database / a specific SQL script; this is High severity as the core feature is broken, and since it is always reproducible on this specific database / script, it will remain a High bug?
No, the frequency here is Sometimes or even Rarely, depending on the probability that customers will use that specific database or script

2) Product functionality worked perfectly during the first few repeated tests, but then it just stopped working, and after an OS reboot the same scenario repeated; this is a High bug as the feature is not working, but since it is not always reproducible it will be Medium?
In such cases you need to stick with the bug a bit longer and focus on isolating the exact individual steps that led to the feature no longer functioning correctly. If you isolate the bug's cause well enough to reproduce it more Often / Always, it may even be a High bug

3) After changing some default system settings / stopping the product background service / installing a 3rd party kernel mode driver, our product stops working correctly and throws errors = High severity; this is easily reproducible in 100% of cases when following the exact steps, which means the frequency is Always?
Ask yourself how many customers will do what you just did - hack or tweak a default system or software installation, or install that specific 3rd party software; the answer is very few, including those power users who like to play with everything. This is the actual probability of someone repeating those steps, and it matches a Rarely / Sometimes frequency, corresponding to a Low/Medium bug

4) There are several UI standard violations on the main Options dialog as standard buttons are missing - this is Low severity and obviously reproducible always, so this is a Medium bug?
Yes, but your reasoning is not right - although a Low severity issue, it will be encountered by the majority of customers as it is present on a high-profile product dialog, which makes the frequency Often / Always

5) Several icon inconsistencies on the main product window when compared to other products - Low severity, but this will be seen by all customers as it is on the main product window, so a Medium bug?
Recheck this - the icons are on the main window, but are you that certain that the majority of customers will actually notice slightly different icons and text in the main menu / ribbon bar compared to other products? I say this will be noticed Rarely, if at all

To summarize: frequency/probability should be interpreted with some thought vs. the literal assumption that if the icon is "always different" or the button is "always missing" it corresponds to a 100% or Always frequency
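
Put differently, a bug's final rating comes from combining severity with realistic frequency. Here's a minimal sketch of such a triage matrix - the exact mapping values below are my assumption for illustration, not an official guideline table:

# Minimal sketch of a severity x frequency triage matrix; the mapping
# values are assumed for illustration, not an official guideline.
PRIORITY = {
    "High":   {"Rarely": "Low", "Sometimes": "Medium", "Often": "High",   "Always": "High"},
    "Medium": {"Rarely": "Low", "Sometimes": "Low",    "Often": "Medium", "Always": "Medium"},
    "Low":    {"Rarely": "Low", "Sometimes": "Low",    "Often": "Medium", "Always": "Medium"},
}

def triage(severity, frequency):
    """Frequency means realistic customer probability, not literal reproducibility."""
    return PRIORITY[severity][frequency]

# Case 3: broken only after hacking system defaults -> High severity,
# but realistically encountered Rarely -> Low/Medium bug
print(triage("High", "Rarely"))   # Low

# Case 4: missing standard buttons (Low severity) on a high-profile
# dialog that most customers open -> frequency Always -> Medium bug
print(triage("Low", "Always"))    # Medium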

Thursday, November 1, 2012

Optimize for success

"With great power comes great responsibility" - Stan Lee

Self-management can be a double-edged sword without discipline and a plan. Many teams, including QA, now have SMART goals to guide you; however, you must guide yourself on a daily basis in order to meet the monthly SMART goal expectations

Let's focus on the latest real-life use case:
   1) The weekly testing plan was defined: use up to 2 hours per day to test patches and focus on testing our new enterprise product for the remainder of the day
   2) The weekly testing plan was refined and approved - by the end of this week we'll have the 2nd testing round of the enterprise product wrapped up
   3) The week is almost over; however, the enterprise product's 2nd testing round has just started and will be delayed by 3-4 days

What happened? Here's what I heard:

   A) "We had too many patches to test and this took a lot of time"
These patches were preplanned in the weekly test plan - they are not an excuse to violate it. Our goal isn't to fully regression test each patch and verify its readiness for production, but to focus on verifying those few (usually 1-2) bug fixes, spot-test core functionality and get the patch out to the customers who requested it

   Solution: dedicate a fixed chunk of time to testing each patch build, up to 1 hour: first verify the fixed bugs, then spot-test core functionality and finally send the patch testing summary. If the whole QA team found no core functionality issues during this time, it is highly unlikely that the single customer who requested the patch will find any; if there are new issues, we'll quickly re-patch without wasting much time
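
As a back-of-the-envelope check, the 2-hour daily patch budget with a 1-hour timebox per build means at most two patch builds per day before enterprise testing takes over. A tiny sketch (the function and the queue are made up for illustration):

# Tiny sketch: fit queued patch builds into the daily budget from the
# weekly plan (2 h/day, up to 1 h per patch). Names/queue are made up.
DAILY_PATCH_BUDGET_H = 2
PER_PATCH_TIMEBOX_H = 1

def plan_patch_day(patch_queue):
    """Take what fits today's budget; defer the rest to tomorrow."""
    slots = DAILY_PATCH_BUDGET_H // PER_PATCH_TIMEBOX_H   # 2 builds per day
    return patch_queue[:slots], patch_queue[slots:]

today, deferred = plan_patch_day(["patch-101", "patch-102", "patch-103"])
print(today)     # ['patch-101', 'patch-102'] -> 2 hours, budget respected
print(deferred)  # ['patch-103'] -> tomorrow; enterprise testing keeps its slot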

   B) "We had too many Support team forwarded cases that we stuck with for a long time in order to not forward them to developers"
What I'm seeing is ~82% of Support-forwarded cases handled within the QA team, which is way above the 50% SMART goal; although this will be measured more precisely soon and is also an important goal, you must optimize your time better - taking 4-5 hrs to stick with a single support issue is overkill and will inevitably cause your other SMART goal (Zig score) to suffer and our products to be late to production due to testing delays

   Solution: dedicate a fixed chunk of time to sticking with a support issue, especially if you are over the 50% SMART goal expectation for the month. Discover your own point of diminishing returns and use it to balance the SMART goals whenever you are making no progress with the support case at hand

   C) "We had many ad-hoc issues that piled up and took more time than actual testing, developers needed help with specific bugs, new team member needed guidance, there were team planning meetings, bugs needed priority corrected as we updated severity guidelines, Skyfall is premiering in the movies this week"

   Solutions:
   a) Dedicate a fixed chunk of time to the team planning meeting (learn from the Daily scrum) - 30 min max
   b) When creating bugs, explain them in more detail so devs don't ask you to clarify them - this will save time for both you and the devs in the long run
   c) Don't go see the new James Bond movie until you have achieved the daily SMART goal of at least 50 Zigs
   d) Before finishing a workday, stop for a minute and ask yourself: "Have I achieved all my daily SMART goal expectations and are we on track with the weekly plan?" If the answer is yes, go ahead and have a scary night watching Paranormal Activity 4
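
And since we like to automate: the end-of-day question from point d) as a tiny sketch - only the 50-Zig daily threshold comes from point c), the rest (names, the on-track flag) is made up for illustration:

# Tiny sketch of the end-of-day check; only the 50-Zig daily threshold
# comes from point c) above, everything else is illustrative.
DAILY_ZIG_GOAL = 50

def movie_night_allowed(zigs_today, weekly_plan_on_track):
    """Paranormal Activity 4 only once the daily SMART expectations are met."""
    return zigs_today >= DAILY_ZIG_GOAL and weekly_plan_on_track

print(movie_night_allowed(53, True))   # True  -> enjoy the scare
print(movie_night_allowed(41, True))   # False -> back to testing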