725% Faster Without Optimisation. Why Exploiting Hardware Beats “Clean Code”
You’ve been working on a game project for a couple of weeks now.
You know that C++ is the industry standard, so that’s what you are making your game in.
Your systems and types are created following the Object Oriented principles you were taught—nice and clean.
You finish implementing the latest system—a simple aggregate damage calculator that tracks total enemy damage taken over time.
You run the game, excited to test it out, and notice the frame rate is a bit low. You had a solid 60 for two weeks—now it’s 54.
“That last system isn’t doing much...” you think to yourself.
Confused, you include <chrono> and record the time taken.
time: 0.33ms
time: 0.21ms
time: 0.36ms
time: 0.27ms
A bit of jitter, but nothing serious. With 16.6ms in a frame, you don’t see how this can be the problem.
You profile another system: 1ms. A bit heavy but nothing serious, you only have a dozen systems.
You profile another: 2ms. And another: 1ms. And another: 0.75ms.
You profile them all. They add up to just over 16.6ms.
That’s when it hits you—you’ve got a dozen systems that all need to be refactored and sped up, somehow.
You don’t understand why they are slow. You followed the principles.
You wonder how people make games go fast. You haven’t even achieved a sliver of the full vision for your game.
It doesn’t make sense. You think you must be missing something.
You start to doubt yourself. You ask yourself if you are just too stupid to do this.
You think to yourself that perhaps you didn’t follow the rules properly. So, you check:
Everything does one job.
Virtual methods take care of different code paths.
The inheritance is predictable.
Classes only inherit what they use.
The dependencies are injected with clean interfaces.
The prospect of refactoring all those systems is daunting.
You don’t even know where to begin, seeing as the code is clean.
You shelve the project. Your dream of making a game is extinguished.
You go about your life.
Occasionally, you think back on the game you wanted to make with a deep sorrow in your heart.
You did follow the rules. It’s not your fault the systems are slow. You were taught a paradigm that doesn’t map to reality.
You were taught that by following rules about code structure, your code would be clean.
What that actually means is irrelevant. The problem is that computers don’t work that way. And you are programming computers.
Code Example
Here’s a small example of the same C++ code written a number of different ways. It demonstrates that the programming paradigm you were taught is wildly inefficient.
The imports and timing code are removed so you can focus on structure.
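For readers who want to reproduce the measurements, here is a minimal sketch of what such timing code might look like. It is not the harness used for the numbers in this post; TimingStats, summarise, and measure are hypothetical names, and the choice of std::chrono::steady_clock with microsecond resolution is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical summary type; the names here are illustrative only.
struct TimingStats {
    double average_us;
    std::int64_t median_us;
    std::int64_t shortest_us;
    std::int64_t longest_us;
};

// Reduce per-run durations (in microseconds) to summary statistics.
// Takes the samples by value because it sorts them.
TimingStats summarise(std::vector<std::int64_t> samples_us) {
    if (samples_us.empty()) return {0.0, 0, 0, 0};
    std::sort(samples_us.begin(), samples_us.end());
    std::int64_t total = 0;
    for (std::int64_t s : samples_us) total += s;
    return {
        static_cast<double>(total) / static_cast<double>(samples_us.size()),
        samples_us[samples_us.size() / 2],  // upper median for even counts
        samples_us.front(),
        samples_us.back(),
    };
}

// Run `work` `runs` times and record how long each run took.
template <typename Work>
TimingStats measure(Work work, int runs) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> samples_us;
    samples_us.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; i += 1) {
        auto start = clock::now();
        work();
        auto end = clock::now();
        samples_us.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    }
    return summarise(samples_us);
}
```

A call such as measure([&] { for (auto &e : entities) e.Update(); }, 100000) would time each pass over the entities. Keep in mind that an optimising compiler may remove work whose result is never read, so real measurements should use the results somewhere.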
Here are the basic data types used across the examples:
struct Vec3 {
    float x, y, z;
};
struct Mat4 {
    float m0, m1, m2, m3;
    float m4, m5, m6, m7;
    float m8, m9, m10, m11;
    float m12, m13, m14, m15;
};
The Beginning - Standard OOP (Virtual)
Imagine this is the base class for your other game types.
struct Entity {
    Entity(Vec3 position, Vec3 velocity) {
        this->position = position;
        this->velocity = velocity;
    }
    virtual void Update() {
        this->position.x += this->velocity.x;
        this->position.y += this->velocity.y;
        this->position.z += this->velocity.z;
    }
private:
    Vec3 position;
    Vec3 velocity;
    Mat4 rotation;
    std::string name;
};
Here’s a test. You have probably seen analogous code a thousand times.
Instantiate new entities, keep track of them somewhere so they can be iterated over and their update methods called.
int main () {
    std::vector<Entity *> entities;
    for (int i = 0; i < 10000; i += 1) {
        // Randomly generate x,y,z, vx,vy,vz
        auto e = new Entity({x, y, z}, {vx, vy, vz});
        entities.push_back(e);
    }
    for (int i = 0; i < 100000; i += 1) {
        for (const auto &e : entities) {
            e->Update();
        }
    }
    return 0;
}
The result:
Average: 29 microseconds
Median: 29 microseconds
Shortest: 28 microseconds
Longest: 553 microseconds
Assuming these numbers are too high for your budget, how might you optimise this?
You may have noticed two critical performance killers:
Update is virtual.
Entities are stored by pointer.
Changing those for the sake of performance may be worth the inflexibility.
Entity becomes:
struct Entity {
    Entity(Vec3 position, Vec3 velocity) {
        this->position = position;
        this->velocity = velocity;
    }
    void Update() {
        this->position.x += this->velocity.x;
        this->position.y += this->velocity.y;
        this->position.z += this->velocity.z;
    }
private:
    Vec3 position;
    Vec3 velocity;
    Mat4 rotation;
    std::string name;
};
Instead of adding by pointer, entities are copied into the vector:
std::vector<Entity> entities;
for (int i = 0; i < 10000; i += 1) {
    // randomly generate x,y,z vx,vy,vz
    auto e = Entity({x, y, z}, {vx, vy, vz});
    entities.push_back(e);
}
The result:
Average: 23 microseconds
Median: 23 microseconds
Shortest: 20 microseconds
Longest: 819 microseconds
A 26% speed up (29 / 23 ≈ 1.26), nothing to scoff at. But let’s assume it’s still too slow.
What can be done now?
The usual culprits have been removed: virtual methods and pointer chasing.
Look at the Data
Isolating the data that lives in an Entity:
Vec3 position;
Vec3 velocity;
Mat4 rotation;
std::string name;
The two fields needed for this computation are position and velocity.
So what happens when you bundle data together like this?
The CPU needs to move your data into registers to process it.
The way it does that is by fetching it from main memory.
However, fetching data from main memory is very slow. So hardware guys started adding little memory caches onto CPUs.
L1, L2, and now L3 and L4 caches, with L1 being the fastest to access and L4 the slowest short of main memory.
Due to physical constraints the caches are small. And the prefetching routines are dumb. When you access some memory, they basically pull in the linear block around it, one cache line at a time.
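[ bytes you asked for ][ more bytes pre-fetched ]
L1 data cache is 128KB on this CPU in total, which works out to 32KB for each of its four cores. The Entity type is 128 bytes in the OOP example.
That means at most 1024 of the 10000 entities fit in L1 at once, and only 256 in the 32KB a single core sees.
The data used for this operation is 24 bytes out of every 128 fetched, wasting 81.25% of the data throughput.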
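The arithmetic above can be checked in code. The sketch below is illustrative: objects_that_fit, wasted_fraction, and kUsedBytes are made-up names, and the cache size is passed in as a parameter so you can plug in whichever figure applies to your machine. The 128-byte Entity size is what a typical 64-bit g++ build gives for these fields, but that depends on the standard library’s std::string, so treat it as an assumption to confirm with sizeof.

```cpp
#include <cstddef>
#include <string>

struct Vec3 { float x, y, z; };
struct Mat4 {
    float m0, m1, m2, m3;
    float m4, m5, m6, m7;
    float m8, m9, m10, m11;
    float m12, m13, m14, m15;
};

// Same fields as the virtual Entity above; its sizeof is ABI-dependent.
struct Entity {
    virtual void Update() {}
    Vec3 position;
    Vec3 velocity;
    Mat4 rotation;
    std::string name;
};

// How many whole objects of `object_bytes` fit in `cache_bytes`.
constexpr std::size_t objects_that_fit(std::size_t cache_bytes, std::size_t object_bytes) {
    return cache_bytes / object_bytes;
}

// Fraction of each fetched object that the operation never reads.
constexpr double wasted_fraction(std::size_t used_bytes, std::size_t object_bytes) {
    return 1.0 - static_cast<double>(used_bytes) / static_cast<double>(object_bytes);
}

// Bytes actually touched by Update(): position plus velocity.
constexpr std::size_t kUsedBytes = sizeof(Vec3) * 2;

static_assert(kUsedBytes == 24, "two Vec3s of three floats each");
static_assert(objects_that_fit(128 * 1024, 128) == 1024, "128KB cache, 128-byte entity");
static_assert(wasted_fraction(24, 128) == 0.8125, "24 useful bytes out of 128");
```

Replacing the literal 128 with sizeof(Entity) shows the real figure for your compiler, and passing a smaller cache size shows how quickly the number of resident entities drops.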
The Bridge: Megastruct
Before jumping to extremes, there’s a popular middle ground: Megastruct.
You keep the same fields in your types and cop the performance hit. But, you understand why there is a performance hit.
Instead of being told your code is clean, and therefore good, you understand the trade-off you are making.
I teach it in my course for the first 2 game types: Metroidvania and RPG.
It’s almost the same except that you reach directly into entity state rather than calling an Update method.
for (std::size_t i = 0; i < entities.size(); i += 1) {
    // Assumes position and velocity are public fields on Entity.
    auto e = &entities[i];
    e->position.x += e->velocity.x;
    e->position.y += e->velocity.y;
    e->position.z += e->velocity.z;
}
There’s a marginal performance increase over the OOP-No-Indirection version.
The real difference is both structural and psychological. You separate data and code that operates on data.
The Pipeline: Data-Oriented
Understanding a little about memory cost and CPU caches, you can reason about another solution.
If taken as a given that all Entities must have Positions and Velocities, they can be stored as such:
std::vector<Vec3> positions;
std::vector<Vec3> velocities;
A simple function may be used in place of a method:
void update_entity_positions(Vec3 *positions, Vec3 *velocities, int count) {
    for (int i = 0; i < count; i += 1) {
        positions[i].x += velocities[i].x;
        positions[i].y += velocities[i].y;
        positions[i].z += velocities[i].z;
    }
}
The result:
Average: 4 microseconds
Median: 4 microseconds
Shortest: 4 microseconds
Longest: 304 microseconds
A 7.25x speed increase over the original 29 microseconds: 725% of the original speed.
No algorithm changes. Just consideration of what data is needed for the operation.
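To show how the pieces fit together, here is a minimal self-contained sketch of the data-oriented layout. EntityStore, add, and step are hypothetical names introduced for illustration; the post only specifies the two vectors and the update function. The one change to the function is marking velocities as const, since it is only read.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical container; the post only specifies the two vectors.
struct EntityStore {
    std::vector<Vec3> positions;
    std::vector<Vec3> velocities;  // velocities[i] belongs to positions[i]

    // Append to both arrays so the indices stay in step.
    // Returns the index that identifies the new entity.
    std::size_t add(Vec3 position, Vec3 velocity) {
        positions.push_back(position);
        velocities.push_back(velocity);
        return positions.size() - 1;
    }
};

// The update function from above, with the read-only input marked const.
void update_entity_positions(Vec3 *positions, const Vec3 *velocities, int count) {
    for (int i = 0; i < count; i += 1) {
        positions[i].x += velocities[i].x;
        positions[i].y += velocities[i].y;
        positions[i].z += velocities[i].z;
    }
}

// One simulation step over every entity in the store.
void step(EntityStore &store) {
    update_entity_positions(store.positions.data(), store.velocities.data(),
                            static_cast<int>(store.positions.size()));
}
```

An entity here is just an index into the arrays. The loop reads and writes only the 24 bytes per entity it needs, and both arrays are walked front to back, which is the access pattern the cache handles best.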
Understanding this small example is one thing. Architecting production-ready game engines around this data flow paradigm, without falling back onto OOP dependency graphs, is entirely different.
If you are tired of hitting the 2-Week Wall and want to learn how to build pipeline-driven engines from the ground up, join the Program Video Games Vertical Slice course today.
We use Odin and Raylib to build a Metroidvania, JRPG, RTS, and Roguelike.
Stay subscribed for future deep dives into Structured Game Programming.
Cheers,
—Dylan
Notes:
All tests were conducted on an Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
All tests were compiled with
g++ -O2