Learning Distributed Systems with MIT and Go
Reflections on a month of hacking away at the Raft protocol with an open course from MIT
I have moved into a much more technical role at work, and I have been feeling a little out of my depth. I now work around various systems that I can describe at a very high level, but I know little or nothing about how they actually work.
On Hacker News, I came across a link to MIT's public graduate-level course 6.824 on Distributed Systems. This is the sort of thing I deal with at work now, so I thought it would be a good fit. A month in, I figured it was a good time to reflect on the progress I've made so far. My code is here: https://github.com/NickSavage/MIT-6.5840-Distributed-Systems
What I didn't realize going in was how applicable the coursework is to some of the things I see now at work, namely ClickHouse and Nomad. The second lab of the course involves implementing the Raft protocol in Go (a new language to me), which underlies both of those systems.
I don't think it's important that I know the ins and outs of the protocol itself like I do now (I'm not an engineer, I will never need to build ClickHouse), but I think it's valuable to have an in-depth understanding of things like this in order to make more competent decisions.
A common problem in tech is the layers of tooling you need to build before you can get at the meat of the problem you're trying to solve. To implement a distributed system, you need multiple programs running at once (multiple terminals, VMs, etc.), you need a way for the pieces to communicate with each other, and so on. These are non-trivial problems, and like many other coding problems you end up getting sucked into solving the pre-problems instead of the one you are actually interested in. Malcolm in the Middle has a great scene showing this problem. The course provides a great framework that simulates all of that and gets you right to the point of writing the relevant code.
I had two frustrating weeks where I made little progress towards Lab 3B. I had the basics working within a few days, but struggled to make progress on some of the more difficult parts. I fell into a routine of plugging away for 30 minutes or so: tweaking a line or two, running the code again, seeing that I had broken something else, tweaking something else, and so on. This is a really bad habit to be in, but it happens so naturally. It usually starts off by thinking "I am almost there, just with a few tweaks I should have it". In this case, since distributed systems are not trivial to work with, I ended up having one or two bugs that only came out under certain conditions and one part that I had fundamentally misunderstood. The difference in lines of code between now and two weeks ago is probably only 20-30, which is a really small amount, but I doubt I could have gotten there without stepping back and really considering the problem.
As an aside, I see people at work doing the same thing. It is really easy to fall into the trap of thinking you know more than you actually do. This is the bicycle problem: do you know how a bicycle works? Can you draw a bicycle properly? Apparently 40% of people cannot: https://www.scotthyoung.com/blog/2015/12/22/illusion-of-explanatory-depth/
Some key takeaways:
Having a good platform to jump off of to solve problems is a really strong selling point. Implementing a lot of this scaffolding yourself is really hard to get right, and it's really not a value-add in any way.
Even with a good platform, distributed systems are really hard to get right. I want to say I understand the Raft protocol pretty well now, but the concurrency needs were really killing me.
Go is a lot of fun to code in! I have another project that I've been working on with a Python backend and a React frontend. I'm thinking of rewriting the backend in Go to really get a handle on it.
I fell into the overconfidence trap of thinking I understood things that I didn't. This led me to make incremental changes when really it was a much more complex problem than I imagined.
My next steps are probably to take a ‘real’ system and get it working myself, taking something like Nomad and making it work with multiple nodes.
I’m proud of myself for sticking with this course for a month. Open courses take a lot of motivation to finish. Just looking at the views on the various videos, they roughly follow a power law distribution: lecture 1 has 500k views, lecture 2 has 175k, and so on down to lecture 20 with 15k.