Let’s (not) build an Event Sourcing app with DDD pt1.5 — in-memory projection issues
Tung V

To continue from part 1, Let’s build an Event Sourcing app with DDD pt1, this is part 1.5, where I address the in-memory projection issues.

The app workflow

This is an example based on a real-world ES system I have worked with, and it has many issues that I will address in this post.

When booting up, the app needs to load the events from the database, then apply each event’s type/payload to update the value deltas. After that, the event stream processor is started, polling the DB to pick up new changes. At this point, we don’t have a snapshot feature yet.

Initializing the app
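To make that concrete, here is a minimal sketch of the boot sequence, assuming a hypothetical db client and a Redux-style applyEvent reducer; none of these names come from the real app.

type DomainEvent = { id: number; type: string; payload: any };
type State = Record<string, any>;
declare const db: { query: (sql: string) => Promise<DomainEvent[]> };
declare function applyEvent(state: State, event: DomainEvent): State;

async function bootstrap(): Promise<State> {
  // 1. Load every event ever written, in insertion order.
  const events = await db.query('SELECT * FROM events ORDER BY id');
  // 2. Rebuild the in-memory projection by applying each event's type/payload in sequence.
  let state: State = {};
  for (const event of events) {
    state = applyEvent(state, event);
  }
  // 3. Only after this full replay can the app start serving and begin polling the DB for new events.
  return state;
}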

When handling user action

  • The action handler will do some business-logic validation and build the event payload.

  • Raise (insert) the event into the database.

  • The event processor will keep polling for the newly inserted events, save them, and apply them to the in-memory state via Redux (see the sketch below).

Main workflow
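Roughly, the action path looks like the sketch below; insertEvent, eventsAfter and the store shape are hypothetical stand-ins for illustration, not the real app’s API.

type DomainEvent = { id: number; type: string; payload: any };
declare const db: {
  insertEvent: (e: Omit<DomainEvent, 'id'>) => Promise<void>;
  eventsAfter: (id: number) => Promise<DomainEvent[]>;
};
declare const store: { dispatch: (e: DomainEvent) => void };

// Action handler: validate, build the payload, raise (insert) the event.
async function addFund(accountId: string, amount: number) {
  if (amount <= 0) throw new Error('amount must be positive'); // logic validation
  await db.insertEvent({ type: 'FUND_ADDED', payload: { accountId, amount } });
}

// Event processor: polls for newly inserted events and applies them to the in-memory state.
function startPolling(fromId: number) {
  let lastId = fromId;
  setInterval(async () => {
    for (const event of await db.eventsAfter(lastId)) {
      store.dispatch(event); // the reducer updates the in-memory projection
      lastId = event.id;
    }
  }, 1000);
}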

Why is it so different from my previous post?

Because I saw many problems with this, I decided to do a new learning project, which is a hybrid solution between the traditional and ES patterns. We can then go step by step to push it further toward the ES side.

The problems

Rebuilding the state instead of persisting it into DB tables

The idea is that the event stream is the single source of truth. The app then updates its state by applying changes from events sequentially. E.g. create account -> add fund 100 -> add fund 200 -> charge 50 => the account will have 250 at the end, instead of querying an accounts table to get the data.

It’s straightforward and easy to understand on paper, and blockchain also has a similar workflow.
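As a toy illustration of that fold (the event names here are invented, not the app’s actual ones):

type AccountEvent =
  | { type: 'ACCOUNT_CREATED' }
  | { type: 'FUND_ADDED'; amount: number }
  | { type: 'CHARGED'; amount: number };

function balance(events: AccountEvent[]): number {
  return events.reduce((total, event) => {
    switch (event.type) {
      case 'ACCOUNT_CREATED': return 0;
      case 'FUND_ADDED':      return total + event.amount;
      case 'CHARGED':         return total - event.amount;
    }
  }, 0);
}

// create account -> add fund 100 -> add fund 200 -> charge 50 => 250
balance([
  { type: 'ACCOUNT_CREATED' },
  { type: 'FUND_ADDED', amount: 100 },
  { type: 'FUND_ADDED', amount: 200 },
  { type: 'CHARGED', amount: 50 },
]);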

The example app uses Redux and in-memory state as projections. So you can imagine the “database” is one big JSON object in memory. You would think it would be fast, since it’s all in RAM with no network round trips, right? Let’s get to that later.

There are a lot of violations of the 12-factor app principles in this design.

It is fast until it scales

From a couple of thousand events up to the first million, things seem alright. The JS loop performance is still acceptable; the app may take less than a minute to spin up.

But when the business hits its scaling phase, the event count skyrockets.

1 million turns into 2, 4, then 7, reaching 20 million events in no time. And so does the start-up time.

You can imagine the situation: you deploy the app, and it needs to loop through an array of 20 million items and apply every change before it is fully up, which takes a long time. That’s the first problem. ❓

Scale horizontally? Forget it

The whole system is a stateful app, meaning that when you create multiple instances, you cannot keep them in the same consistent state. So when instance one processes an event belonging to the entity accountA, for example, and at the same time another request hits instance two and does something on accountA as well, things get messy.

Because there is no way we can apply any locking on the entity!

So we ended up disabling the horizontal auto-scaling feature on GCP 😅

Reinvent the wheel, sort of…

Let’s get to the part where we have a giant JSON object as the reducer state.

How do you query users with a balance > 500?

  • Looping through the whole thing: Object.values(state.user).filter(u => u.amount > 500) 🤡

How do you join two data entities, such as user + paid bill?

  • Looping through the whole thing: Object.values(state.bills).filter(b => b.userId === userId) 🤡

How do you respond to a request for an array of users created in the last 3 days? 🤡

How can you build a view, like you would with a normal DB table?
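Every read ends up as a hand-rolled scan over the big object. A rough sketch of what that looks like, with the state shape assumed purely for illustration:

type User = { id: string; amount: number; createdAt: number };
type Bill = { id: string; userId: string; paid: boolean };
declare const state: { user: Record<string, User>; bills: Record<string, Bill> };

// "balance > 500" — a full scan, no index.
const richUsers = Object.values(state.user).filter(u => u.amount > 500);

// "users created in the last 3 days" — another full scan, plus date math by hand.
const threeDaysAgo = Date.now() - 3 * 24 * 60 * 60 * 1000;
const recentUsers = Object.values(state.user).filter(u => u.createdAt > threeDaysAgo);

// "user + paid bills" — a hand-rolled nested-loop join, O(users × bills).
const usersWithPaidBills = recentUsers.map(user => ({
  user,
  paidBills: Object.values(state.bills).filter(b => b.userId === user.id && b.paid),
}));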

These are problems that people have spent decades solving and baking into database software and design patterns. Now, why do we have to do it all over again, minus all those improvements? I don’t know why, either.

It’s like we create our own problems…

Sustainability

The reinvented wheels above cause a lot of performance issues, but it does not stop there; this design also creates more critical issues like:

  • The system will be blocked while the loop runs, since it’s written in Node.js and a long synchronous loop blocks the single-threaded event loop (see the sketch after this list).

  • The app crashes when we do a lot of Object.values() calls and deep clones… on the big in-memory JSON object (max heap size exceeded) 💣

  • Race conditions, without proper transactions, lead to critical issues and generate bad events.

  • The worst part: when the system crashes, you’re unable to query the current state to debug!
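The blocking point deserves a tiny demonstration. Anything like the loop below, run at startup or during a heavy recalculation, starves every incoming request, because Node.js runs JavaScript on a single thread (the events array and applyEvent are placeholders):

import * as http from 'http';

declare const events: Array<{ type: string; payload: any }>; // imagine ~20 million of these
declare function applyEvent(state: any, event: { type: string; payload: any }): any;

http.createServer((_req, res) => res.end('ok')).listen(3000);

let state = {};
for (const event of events) {
  state = applyEvent(state, event); // while this synchronous loop runs, the server above answers nothing
}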

A chain reaction

For complex apps with a lot of stateful properties, the logic will get complicated.

Well, when we work on complicated things, we will likely introduce bugs.

So, what is the biggie here? Every software has bugs.

In event sourcing, one of the principles is not to mutate the event stream. Now, how would you fix the bad events that are introduced by bugs?

You may say: “just add a new event to patch the old one, ez 🤡”. But in reality, it’s not that simple. For complex entities like inventory, a bad event leads to a lot of mistrust in the data, since we’re only comparing deltas, especially when you have stateful logic and business rules that depend on the event payload.
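To make that concrete, a compensating “patch” event might look something like the sketch below; the event names and fields are made up for illustration:

type InventoryEvent =
  | { type: 'STOCK_ADDED'; sku: string; qty: number }
  | { type: 'STOCK_REMOVED'; sku: string; qty: number }
  // Appended later to cancel out the delta a buggy handler produced.
  | { type: 'STOCK_CORRECTED'; sku: string; qtyDelta: number; reason: string };

const correction: InventoryEvent = {
  type: 'STOCK_CORRECTED',
  sku: 'SKU-123',
  qtyDelta: -7, // undo the units that were double-counted
  reason: 'compensating a double-counted restock',
};

// The stream stays append-only, but every projection, report and downstream consumer
// now has to understand STOCK_CORRECTED — and anything that already acted on the bad
// event (emails, invoices, other services) is not undone by this record.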

“It will make data engineers’ lives much better.”

I have worked with some great data engineers before, and they all said it is a nightmare, because they have to rewrite the whole app’s business logic in SQL just to build the actual data tables they can use.

Then you have two single sources of truth: one from the app, and another from the data team. Well done!

Conclusion

TBH, there are a lot of other issues that I didn’t include here. However, you can see the general picture of the cons, while the pros are mentioned everywhere on the internet.

Now you can see why the approach in my pt1 may sound silly: I’m taking it step by step to make it more ES-like, learning along the way.