Last Friday I received a bug report for one of the extensions I maintain. The reporter mentioned that a specific configuration led to the task corrupting the JSON files it patched. It did so in a very peculiar way: it added an extra } at the end of every file, making them invalid JSON.

It took a while to reproduce the case and then to figure out what was happening. Quite early in the debugging process I found that my assumption about the execution order didn't match reality and that I was fighting a race condition.

I'm not sure about you, but I hate fighting race conditions. They're often caused by interesting interactions between different units of code, and since they're less likely to occur in a debugger, they're notoriously hard to debug.

Basically the code was supposed to be doing the following:

  • Read a json file
  • Patch a few fields
  • Write the json file
  • Read the same json file
  • Patch a few different fields
  • Write the json file again
  • Zip everything up
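
The intended flow can be sketched as follows. This is a simplified illustration, not the actual task code: patchA and patchB are hypothetical stand-ins for the real field updates, and it assumes the Node 10 fs.promises API.

```typescript
import { promises as fs } from "fs"; // Node 10+

// Hypothetical stand-ins for the real field updates in the task.
const patchA = (obj: any) => { obj.version = "2.0.0"; };
const patchB = (obj: any) => { obj.patched = true; };

// Read → patch → write, twice in sequence, on the same file.
async function patchTwice(path: string): Promise<void> {
  const first = JSON.parse(await fs.readFile(path, "utf8"));
  patchA(first);                                   // patch a few fields
  await fs.writeFile(path, JSON.stringify(first));

  const second = JSON.parse(await fs.readFile(path, "utf8"));
  patchB(second);                                  // patch a few different fields
  await fs.writeFile(path, JSON.stringify(second));
}
```

Every step awaits the previous one, so the second read can only ever see the first write's result. The bug appeared precisely because the real code did not enforce this ordering.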

As you can imagine, there are a few interesting interactions that can happen when these steps run out of order. Most executions went according to my expectations, but I did find a few cases where the following happened:

  • Read a json file
  • Read the same json file
  • Patch a few different fields
  • Write the same json file
  • Patch a few...
  • Zip everything up
  • .. fields
  • Write the json file

I still can't fathom how this would result in every file having the same consistent issue, }} at the end instead of }. But while that corruption triggered this whole investigation, there was a much bigger issue.

How did the code end up in this state?

This project grew out of a collaboration between multiple people. It has been updated over the past 2 years to support new features, has been partially upgraded to newer language features and frameworks, and has received a few Pull Requests that added valuable functionality using a mix of all of the above.

My lessons from fixing this mess are the following.

It took more time to debug than to rewrite the code

You've probably experienced this before: you spend 6 hours trying to debug and refactor the existing code in order to weed out the issue, but end up rewriting it anyway. Worst of all, this was my first instinctive fear when I received the bug report: I'd end up rewriting this feature. I already knew this part of the code was a hodgepodge of patches, fixes and frameworks.

I do have to admit that rewriting was made a whole lot easier by having spent 6 hours re-familiarizing myself with the code, but I could and should have made this decision earlier, saving myself a few valuable hours.

LESSON: Don't throw good money after bad. Be prepared to cut your losses early.

TypeScript is less strict in its warnings than the C# compiler

My background isn't in TypeScript/JavaScript. I've learned quite a bit in the past years, but I'm still much stronger in C#, and I suffer from biases created by my years of working with the C# compiler. To be frank:

The C# compiler and its static analysis detect problems and warn the developer much earlier and with greater accuracy than TypeScript and TsLint.

In C# you'll have to be more explicit if you want to do bad stuff. In JavaScript/TypeScript, bad stuff is allowed to happen without being explicit about it.

LESSON: I must stay vigilant and not let the fact I have no warnings nor errors lull me into a false sense of security.

Node 5, 6, 8 and 10 are different

This may be blindingly obvious to you, but part of the reason the behavior was hard to reproduce was that my local development station was running Node 10. Node 10 has a number of very useful features, including promisified versions of fs. The VSTS build agent runs its tasks in Node 5 or 6, depending on your agent version, and while there are numerous libraries available to shim the new features into the old runtime, they're not 100% the same. Some shims on npm don't have all the latest fixes and may not be fully compatible.
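
One defensive pattern is to feature-detect the newer API instead of assuming it exists. A sketch, with the caveat that it assumes util.promisify is available (Node 8+); on Node 5/6 you would need a library shim instead:

```typescript
import * as fs from "fs";
import { promisify } from "util"; // Node 8+; Node 5/6 would need a library shim

// Don't assume the Node 10 fs.promises API exists: the agent may run an
// older runtime than your workstation. Detect it and fall back.
const readFile: (path: string, encoding: string) => Promise<string | Buffer> =
  (fs as any).promises
    ? (fs as any).promises.readFile
    : (promisify(fs.readFile) as any);
```

Even with a fallback in place, the two implementations are not guaranteed to behave identically, which is exactly why testing on the production runtime matters.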

Installing Node 10 on the build agent doesn't change the Node version used by the agent to execute its tasks. It changes the version of Node used to execute the npm tasks in your workflow.

Only after switching back to Node 6 on my development workstation did my problems become easy to reproduce. It turns out that the shims that add promise support for fs are not working exactly as documented.

I currently keep the 2.111.1 agent around to test against Node 5. The most recent agent (2.136.1) ships with Node 6, and I test on that as well.

LESSON: Be sure to test your code at least on the same runtime used "in production".

Don't mix Promise libraries, async/await and old-style delegates

Over time the code had become a mix of:

  • Callbacks
  • Q Promise library methods
  • Q wrapped callbacks
  • ES6 standard promise methods
  • async/await

The bugged part of the code interacted with little bits of everything from this list.

The first thing I did to clean it up was remove every mention of Q. That made the problem deterministic. I'm not exactly sure what Q's impact was, but removing it helped tremendously.

The second thing I did was replace the fs.writeFile and fs.readFile calls with fs-extra's promisified versions. This further reduced unexpected behaviors. I'm sure my partial knowledge of how callbacks, promises and await interact was causing problems here that could have been solved without introducing fs-extra, but aligning on one style helped me wrap my head around the problems without having to fully understand and juggle all of these different implementation details.
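
The shape of that cleanup looks roughly like this. To keep the sketch dependency-free it uses util.promisify from the standard library rather than fs-extra (which exposes promise-returning readFile/writeFile of the same shape); patchField is a hypothetical stand-in for the real patching step.

```typescript
import * as fs from "fs";
import { promisify } from "util";

// Wrap the callback API once, then use promises everywhere.
const readFile = promisify(fs.readFile);
const writeFile = promisify(fs.writeFile);

// One paradigm end to end: every step awaits the previous one.
async function patchField(path: string, key: string, value: any): Promise<void> {
  const obj = JSON.parse((await readFile(path, "utf8")).toString());
  obj[key] = value;
  await writeFile(path, JSON.stringify(obj));
}
```

Because every call returns a promise, there is no longer any place where a stray callback can fire out of order with the surrounding awaits.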

LESSON: Stick to a single paradigm to allow your brain to focus on what's important: what your code is trying to achieve.
LESSON: Don't mix and match all kinds of different async libraries. Each has its own implementation details, and they may not work together the way you'd expect.

Understand Promises when using async/await

Daniel Brain explains this really well in his blog post of the same title, so I'll stick to a few examples where I messed up:

// THIS IS BAD
declare const files: string[];
files.map(async () => { doStuff(); } );

zip();

Neither the TypeScript compiler nor TsLint will complain about this code. doStuff is an async method that returns a Promise, and we're neither awaiting it nor returning the promise, so it will happily run when it pleases and finish whenever it does.

So let's fix that:

// THIS IS BAD
declare const files: string[];
files.map(async () => { await doStuff(); } );

// OR THIS IS STILL BAD
files.map(async () => { return doStuff(); } );

zip();

Solved that problem. But wait... when you run this you'll find that it's still not actually waiting. As it turns out, files.map returns a Promise[], and again neither the TypeScript compiler nor TsLint (see update below) will complain that you're not waiting for that:

// THIS IS STILL BAD
declare const files: string[];
await files.map(async () => { return doStuff(); } );
await zip();

Done, right? Unfortunately not. When you await a Promise[], the code will immediately continue. I had expected this to wait for all promises to resolve, or at least to raise a warning (my C# bias fooling me again). Instead you'll need to add a little bit of Promise fairy dust to solve that one:

// THIS ACTUALLY WORKS
declare const files: string[];
await Promise.all(files.map(async () => { return doStuff(); } ));
await zip();

With all of these fixes combined, the code runs in the expected order.
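
The difference between awaiting a Promise[] and awaiting Promise.all can be shown in a small, self-contained demo (demo and its task names are illustrative, not from the original task):

```typescript
// Contrast: awaiting a Promise[] does NOT wait; Promise.all does.
async function demo(): Promise<string[]> {
  const order: string[] = [];
  const tasks = [1, 2].map(async (i) => {
    await new Promise((resolve) => setTimeout(resolve, 10 * i));
    order.push(`task${i}`);
  });

  await tasks;               // tasks is a Promise[]; this resolves immediately
  order.push("after await tasks");

  await Promise.all(tasks);  // this actually waits for every task to finish
  order.push("after Promise.all");
  return order;
}
```

Running demo shows "after await tasks" recorded before either task has finished, while "after Promise.all" only appears once both tasks are done.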

LESSON: Be aware of your personal biases. They may fool you into a false sense of security. C# is not TypeScript.
LESSON: Testing these kinds of async patterns is hard. You'll need to change the way you write your code if you want to validate execution order dependencies.


Update
It turns out TsLint has a rule to warn about incorrect promise handling:

{
    "rules": {
        "no-floating-promises": [
          true,
          "Promise",
          "Q.Promise" 
        ]
    }
}

That rule helped me find a whole new list of interesting issues to solve.

Unit testing is hard

I teach this all the time: in order to test your code, the best way to write the code is to write the test first. When faced with old, unfamiliar, untested code, the first step is to apply a software vise, as Michael Feathers calls it:

Clamp down known good behavior, detect any unintended changes to it – gives us control over the work
Source: Ink Stone

But it's so easy to ignore this sage advice and try to get a quick fix out. The same applies to spending just enough time to fix the naming and the other objective issues that can hide in your code base.

LESSON: Be professional.

Use sync code sparingly

In the past I asked a friend to review some of my build tasks. The comments included: "Why are you calling fs.existsSync?" My reply: "Because it's easy, because it works and because it makes my code a lot easier to read." His retort: "But that's not the Node way!"

And I'm still torn about this one. The VSTS agent runs a single job at a time. Calling things synchronously may make the code slightly slower, and it may cause the logs to stream to the server after, instead of during, execution... Are those strong enough reasons to make the code harder to maintain... for me? For a person who knows Node?

At the same time, most APIs in Node use the async/await pattern and Promises. One can't ignore these altogether, so is mixing both paradigms really easier to read?

Am I trying to kid myself? Am I procrastinating by postponing the required learning and working around it? I suppose I am. As a result I've decided to study, to ask more people for reviews and to try to remove this gap in my skills.

LESSON: Know your weaknesses, do not work around them. Dare to ask questions, involve others and be open to feedback. It's what Agile is built on top of.

Conclusion

None of these lessons were a surprise to me. Being Professional is hard and balancing open-source, work and personal time just makes it harder. I'm glad I rewrote most of the code, I've reached out to figure out how I can better test these types of problems and I learned a lot about JavaScript, Promises and asynchronous behavior. I know I still have lots to learn.

PS: I still don't understand why this issue presented itself by corrupting the JSON files. But I'll leave that for someone else to figure out.

PS: I'd love constructive feedback, pull requests or suggestions to further improve.

Photo credit: Frans Berkelaar.