Fixing the flake on CI

Fixing the flake

A journey into what JavaScript means for integration tests.

The TodoMVC integration suite is a unique beast. It has several interesting demands that make it unlike any test suite I have worked on before. The suite must adapt the selectors it uses to the version of the spec the app was written against (90619f5), and it must handle slight variations per implementation (1eeb055).

Our suite, like most test suites in the real world, has been plagued by flaky and simply failing tests on CI for a long time. Now that we are running the tests on every PR (75f6a68) on Travis (980cf54), this flakiness has only become more obvious and bothersome.

I wanted to use the TodoMVC test suite as an example of how someone could write explicit tests without succumbing to the dreaded:

//without this tests randomly fail?
driver.sleep(2000);

The first step in fixing the flake was to identify and classify these seemingly random failures in our suite, and to form a hypothesis about what caused each one.

First, let's look at how these failures manifested themselves in the test runs.

The story of the stale element

What does stale element even mean?

Well, plenty of people are curious, judging by the 300+ questions on Stack Overflow.

Selenium provides a handy page explaining what this means and how to avoid it.

http://docs.seleniumhq.org/exceptions/stale_element_reference.jsp

A stale element reference exception is thrown in one of two cases, the first being more common than the second:
The element has been deleted entirely.
The element is no longer attached to the DOM.

Before we go into a solution, we should explore how this can happen, and why it happens with modern JavaScript applications.

Imagine you have a list of elements, and your fancy new framework replaces elements inline instead of mutating the existing ones. As far as Selenium knows, the element you were holding has been deleted or detached (depending on the framework's object pooling and other implementation details).

Well, dang, that stinks, because this is going to happen with pretty much every single new framework out there.

To list a few:
Angular angular/protractor#543
React http://stackoverflow.com/questions/30862414/working-around-staleelementreference-in-selenium-webdriver-when-testing-a-react
Backbone http://stackoverflow.com/questions/28856332/dealing-with-stale-elements-when-using-webdriver-with-backbone-js

So what is the solution to the stale element issue? Instead of hanging onto the DOM nodes that we get from Selenium, we should look them up again each time we interact with them. What does this even look like?

In this case, I ended up fetching a list of DOM nodes, holding onto each node's index position in that list, and then requerying the node by that index each time I needed to interact with it.
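The re-query-by-index idea can be sketched roughly as follows. This is a minimal illustration, not the suite's actual code: `itemAt` and `clickItemAt` are hypothetical helper names, `driver` is assumed to be a selenium-webdriver `WebDriver`, and `locator` is whatever you would normally pass to `driver.findElements` (for example a `By.css(...)` locator).

```javascript
// Hypothetical helper: run a fresh query on every call and pick the element
// by its position, instead of caching a WebElement that a re-render can
// turn stale.
function itemAt(driver, locator, index) {
  return driver.findElements(locator).then(function (elements) {
    if (index >= elements.length) {
      throw new Error(
        'No element at index ' + index + ' (found ' + elements.length + ')'
      );
    }
    return elements[index];
  });
}

// Hypothetical helper: every interaction goes through itemAt, so the element
// being clicked always comes from the current DOM.
function clickItemAt(driver, locator, index) {
  return itemAt(driver, locator, index).then(function (element) {
    return element.click();
  });
}
```

The only thing the test holds onto between steps is the index, which survives a re-render even when the underlying nodes do not.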

Problem solved

Element not found

On to the next error that we were seeing: element not found. This one requires a bit of a unique approach to solve. The DOM in a JavaScript application is constantly changing, and just because you are interacting with it via WebDriver does not mean the DOM is frozen. Because of that, you cannot be sure that an element has been rendered at the exact time you test for it.

How do we fix this? This is where Selenium's wait comes in handy: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp

We can wait for a condition before continuing, so in this case the solution was to wrap getting the DOM nodes in a wait to make sure the element was there and then interact with the node.

The trick here was the thenCatch, which allowed us to catch the error case and force a retry. Win!
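A rough sketch of that wait-and-retry shape is below. It is an illustration under assumptions rather than the suite's real helper: `waitForItem` is a hypothetical name, and `driver` is assumed to expose selenium-webdriver's `findElements` and `wait(condition, timeout)`. The sketch uses the standard promise `.catch`; `thenCatch` played the same role on the promise type that older selenium-webdriver releases shipped.

```javascript
// Hypothetical helper: poll until the element at `index` exists, then hand
// it back. A falsy result from the condition makes driver.wait poll again,
// so a failed lookup becomes a retry instead of a test failure.
function waitForItem(driver, locator, index, timeoutMs) {
  var found;
  return driver.wait(function () {
    return driver.findElements(locator).then(function (elements) {
      found = elements[index];
      return found !== undefined;
    }).catch(function () {
      // The lookup blew up mid-render; swallow the error and try again.
      return false;
    });
  }, timeoutMs).then(function () {
    return found;
  });
}
```

If the element never shows up, the wait itself rejects once the timeout passes, so the test still fails, just with a bounded and explicit delay rather than an arbitrary sleep.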

Clicking was not clicking?

The real zinger here was caused by our click actions not working. I would issue a click command and the click would not fail, but it would not produce the expected side effect either (e.g. toggling a state).

Once I realized that this was happening, it came to me that, since the DOM is async and the JavaScript on the page is evaluated outside of our test process, it was possible that we were issuing the click command before the JavaScript had bound its handlers to the elements to capture the click event.

The solution in this case was to bake in a state test case to ensure that the click actually took effect.
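One way to express "click, then verify the state" is sketched below. It is an assumption-laden illustration, not the code that landed in the suite: `clickUntil`, `getElement`, and `isInExpectedState` are hypothetical names, and `driver.wait` is assumed to behave like selenium-webdriver's. For a checkbox, `isInExpectedState` could be something like `function (el) { return el.isSelected(); }`.

```javascript
// Hypothetical helper: on each poll, check the state first. If the expected
// state is not there yet, assume the click handler was not bound when we
// last clicked, click again, and let the next poll re-check.
function clickUntil(driver, getElement, isInExpectedState, timeoutMs) {
  return driver.wait(function () {
    return getElement().then(function (element) {
      return isInExpectedState(element).then(function (done) {
        if (done) {
          return true;
        }
        return element.click().then(function () {
          return false; // not verified yet; the next poll checks the state
        });
      });
    }).catch(function () {
      return false; // element missing or stale; retry on the next poll
    });
  }, timeoutMs);
}
```

One caveat with this particular sketch: for a toggle whose state updates slowly, a second click could undo the first, so the state check has to reflect the click quickly for the retry to be safe.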

In review, there are three things to keep in mind when writing deterministic tests: always verify DOM interactions, always assume that the DOM is async, and always assume your CI server is the slowest computer in the world.

With these things in mind, you should be all set to write reliable tests.