GameDev Log #2: "Rendering Engines"

Sep 26, 2020

For the past two months, I've been on a deep dive into the world of 3D graphics rendering. It's been an absolute whirlwind of learning, and it is about time I wrote a blog post to mark the milestones along the way.

It's been years, but in the past, I've used open source game engines like Godot, LÖVE, and Pygame to make some small (read: tiny) games, like a Pong clone, a Flappy Bird clone, even a failed Legend of Zelda clone! (and other failed projects...)

My Flappy Bird

My Pong Game

Memory lane. My first attempts at making games.

My work on Grimoire is very different. First, it's going to be a 3D game engine. I am no longer chained to an x-y coordinate plane: I will be free to move about z! Second, it's my game engine, and it's built, mostly, from scratch. That means all the conveniences an already-built game engine would give me for handling sound, user input, even the game loop, are now gone. I've got to design and implement them myself.

This includes one of the most important components to any game engine: the graphics renderer.

But first, some thanks

To self-teach my way into this, I've been reading the following:

These folks are absolutely amazing for having written these materials (and answering my dumb questions), and I've learned so much just going through them all. It is so unlikely these authors will ever see this blog post, but just in case: Thank you!

A Tour through Graphics Programming

Think of the most beautiful, groundbreaking videogame you might have played recently. If you haven't played one in a while, think about any 3D animated film, like anything from Pixar or Dreamworks. How on Earth do they achieve such beautiful graphics and images? For videogames, how do they render each image consistently at 60 frames per second, no matter what craziness is happening on the screen?

They use a computer's (or gaming console's) graphics processing unit (GPU).

The GPU has an incredibly important job: for every single pixel on your screen or window, determine what color it should be as fast as possible. If your computer screen or TV has a 1920x1080 resolution, that's over 2 million pixels that need to be calculated 60 times per second. Once the GPU determines the colors for every pixel, it loads the color data into a data structure called a Frame Buffer, which is an array in memory that stores the color/pixel data until it is rendered on screen (usually very soon after it is loaded).
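To make those numbers concrete, here is a minimal CPU-side sketch of a frame buffer as a flat array, plus the pixel arithmetic. The names (`FrameBuffer`, `pixels_per_second`) and the one-byte-per-channel RGBA layout are assumptions for illustration; real GPU frame buffers live in dedicated memory and come in many formats.

```rust
// Illustrative constants for a 1080p display refreshing 60 times per second.
const WIDTH: usize = 1920;
const HEIGHT: usize = 1080;
const FRAMES_PER_SECOND: usize = 60;

/// One RGBA color, one byte per channel (an assumed format).
type Pixel = [u8; 4];

/// A frame buffer modeled as a flat, row-major array of pixels.
struct FrameBuffer {
    width: usize,
    height: usize,
    pixels: Vec<Pixel>,
}

impl FrameBuffer {
    /// Allocate a buffer filled with opaque black.
    fn new(width: usize, height: usize) -> Self {
        FrameBuffer { width, height, pixels: vec![[0, 0, 0, 255]; width * height] }
    }

    /// Store a color at column `x`, row `y`.
    fn set(&mut self, x: usize, y: usize, color: Pixel) {
        assert!(x < self.width && y < self.height);
        self.pixels[y * self.width + x] = color;
    }

    /// Read the color at column `x`, row `y`.
    fn get(&self, x: usize, y: usize) -> Pixel {
        assert!(x < self.width && y < self.height);
        self.pixels[y * self.width + x]
    }
}

/// How many pixel colors must be produced every second.
fn pixels_per_second(width: usize, height: usize, fps: usize) -> usize {
    width * height * fps
}

fn main() {
    let mut fb = FrameBuffer::new(WIDTH, HEIGHT);
    fb.set(10, 20, [255, 0, 0, 255]);
    println!("pixel at (10, 20): {:?}", fb.get(10, 20));
    println!("pixels per frame: {}", fb.pixels.len()); // 2073600
    println!(
        "pixels per second: {}",
        pixels_per_second(WIDTH, HEIGHT, FRAMES_PER_SECOND)
    ); // 124416000
}
```

So "over 2 million" per frame turns into roughly 124 million color decisions per second at 60 frames per second.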

Why do programmers use the GPU instead of the CPU to do this? In general, figuring out the color for a single pixel is a straightforward computation, even if the linear algebra involved could still send your head into a tailspin. The hard part is doing it millions of times per frame. A GPU has hundreds to thousands of computing cores across which it can parallelize the workload. On top of that, its specialized hardware supports a Rendering Pipeline, in which a number of stages operate in parallel, feeding data from one stage to the next until it hits the Frame Buffer.

The Rendering Pipeline

A GPU's rendering pipeline has a series of stages. They each have a unique role to play in the overall pipeline as it takes data coming from the application and transforms them into colors for the frame buffer. They also vary in terms of how much they can be customized with new behavior.

"Fixed" stages offer no configurability for programmers. They work the way the hardware manufacturer designed them, and that's about it.

"Configurable" stages, as the name implies, have behavior that is mostly predetermined by the hardware manufacturer, but a programmer may tweak their settings or configuration.

"Programmable" stages offer the most customization. Programmers load these stages with smaller programs called shaders. Shaders have their own programming languages which apparently are similar in syntax to C, but I wouldn't know since I don't code in C. There are different types of shaders, but in general, I think of shaders like a single function you might find in a regular application. Shaders expect some specific type of input data and other stages in the pipeline expect each shader to return zero or more pieces of data, depending on the shader type.
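Real shaders are written in a shading language, not Rust, but the "single function" mental model can be sketched in plain Rust. Everything below is hypothetical and made up for illustration: the types, the halving transform, and the alpha threshold. The point is only the shape: one kind of shader returns exactly one output per input, another may return nothing at all.

```rust
/// Input to the vertex-style function: a point on something we want to render.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Vertex {
    position: [f32; 3],
}

/// Input to the fragment-style function: data for one candidate pixel.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Fragment {
    uv: [f32; 2],
    alpha: f32,
}

type Color = [f32; 4];

/// Vertex-style shader: exactly one output for each input.
/// The "transform" here just halves the position (an arbitrary stand-in).
fn vertex_shader(v: Vertex) -> [f32; 4] {
    let [x, y, z] = v.position;
    [x * 0.5, y * 0.5, z * 0.5, 1.0]
}

/// Fragment-style shader: one color, or `None` to produce no output.
/// The 0.1 alpha cutoff is an assumption for this sketch.
fn fragment_shader(f: Fragment) -> Option<Color> {
    if f.alpha < 0.1 {
        return None;
    }
    Some([f.uv[0], f.uv[1], 0.0, f.alpha])
}

fn main() {
    let vertices = [
        Vertex { position: [2.0, 4.0, 6.0] },
        Vertex { position: [0.0, 0.0, 0.0] },
    ];
    // The pipeline calls the function once per input, independently.
    let positions: Vec<[f32; 4]> = vertices.iter().map(|v| vertex_shader(*v)).collect();
    println!("{:?}", positions);

    let fragments = [
        Fragment { uv: [0.25, 0.75], alpha: 1.0 },
        Fragment { uv: [0.5, 0.5], alpha: 0.0 },
    ];
    // `filter_map` drops the inputs for which the shader returned nothing.
    let colors: Vec<Color> = fragments.iter().filter_map(|f| fragment_shader(*f)).collect();
    println!("{} of {} fragments produced a color", colors.len(), fragments.len());
}
```

Because each call depends only on its own input, the calls can run in any order, or all at once, which is exactly what the GPU's many cores exploit.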

That's cool, but what are the pipeline stages really?

As an example, OpenGL's documentation of its rendering pipeline has a pretty good overview of each stage. (More on what OpenGL is later.)

There are way more stages than what I'm about to describe, but this is the way I like to frame it in my brain (for now). In general, these are the main pipeline stages:

Rendering Pipeline

A real rendering pipeline is way more involved than this

  1. The Vertex Shader is among the first stages in the rendering pipeline.
    1. The vertex shader receives a single vertex, which refers to an individual point on something we want to render, plus this vertex's attributes (like its positional coordinates). It maps these coordinates to a coordinate system called homogeneous clip space, performs some texture and lighting calculations, and returns the vertex's new position represented in this homogeneous clip space.
  2. The Clipping, Screen Mapping, and Rasterization stages are fixed by the GPU manufacturer.
    1. These stages are responsible for taking the vertices output by the vertex shader, clipping away whatever falls outside the visible volume, and mapping what remains into yet another coordinate system called normalized device coordinates (NDC). Vertices and surfaces that are still inside the bounds of the NDC space continue through the pipeline; the rest are discarded (or "clipped"). The rasterization step takes all the surfaces to be rendered and breaks them down into individual elements called fragments.
  3. The Fragment Shader is yet another programmed stage in the pipeline.
    1. A fragment is a group of data that corresponds to one pixel or a group of pixels on screen. A fragment shader's duty is to take a single fragment at a time and determine what color that fragment (and thereby a set of pixels) should be.
  4. The Fragment tests and post-process operations stages include a set of tests for each fragment/pixel to determine further whether the color generated by the fragment shader should continue on toward the final frame buffer.
    1. These tests include a depth test to help the GPU determine whether one fragment is occluding (blocking from view) another fragment. These tests also include a few more tests which I won't be getting into, but I'll name and link them for further reading: Scissor Test and Stencil Test.
  5. Finally, data that has passed all the previous stages' tests gets color blended into the frame buffer. We call it a "blend" because there may already be data in the frame buffer during this writing step.
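The fixed-function steps in stages 2 and 4 can be imitated on the CPU in a few lines. This is a rough sketch under stated assumptions, not how any GPU implements them: it uses the OpenGL-style clip volume (-w <= x, y, z <= w), treats smaller depth as closer, tests single points rather than clipping whole triangles, and overwrites the stored color instead of blending. All names are hypothetical.

```rust
type Vec4 = [f32; 4];
type Color = [f32; 3];

/// True when a clip-space position lies inside the visible volume.
/// Assumes the OpenGL convention: -w <= x, y, z <= w.
fn inside_clip_volume(p: Vec4) -> bool {
    let [x, y, z, w] = p;
    w > 0.0 && x.abs() <= w && y.abs() <= w && z.abs() <= w
}

/// Perspective divide: homogeneous clip space -> normalized device coordinates.
fn to_ndc(p: Vec4) -> [f32; 3] {
    let [x, y, z, w] = p;
    [x / w, y / w, z / w]
}

/// Screen mapping: NDC x and y in [-1, 1] -> pixel coordinates, with row 0 at the top.
fn to_screen(ndc: [f32; 3], width: usize, height: usize) -> (usize, usize) {
    let sx = ((ndc[0] + 1.0) * 0.5 * (width - 1) as f32).round() as usize;
    let sy = ((1.0 - ndc[1]) * 0.5 * (height - 1) as f32).round() as usize;
    (sx, sy)
}

/// A color buffer plus a depth buffer of the same size.
struct RenderTarget {
    width: usize,
    height: usize,
    color: Vec<Color>,
    depth: Vec<f32>,
}

impl RenderTarget {
    fn new(width: usize, height: usize) -> Self {
        RenderTarget {
            width,
            height,
            color: vec![[0.0; 3]; width * height],
            // Start "infinitely far away" so the first fragment always passes.
            depth: vec![f32::INFINITY; width * height],
        }
    }

    /// Depth test, then write. Returns whether the fragment survived.
    fn write_fragment(&mut self, x: usize, y: usize, depth: f32, color: Color) -> bool {
        assert!(x < self.width && y < self.height);
        let i = y * self.width + x;
        if depth >= self.depth[i] {
            return false; // occluded by something closer that is already stored
        }
        self.depth[i] = depth;
        self.color[i] = color; // a real pipeline could blend here instead
        true
    }
}

fn main() {
    let clip = [1.0, -1.0, 1.0, 2.0];
    if inside_clip_volume(clip) {
        let ndc = to_ndc(clip); // [0.5, -0.5, 0.5]
        let (x, y) = to_screen(ndc, 5, 5); // (3, 3)
        let mut target = RenderTarget::new(5, 5);
        let first = target.write_fragment(x, y, ndc[2], [1.0, 0.0, 0.0]);
        // A second fragment at the same pixel but farther away is rejected.
        let second = target.write_fragment(x, y, 0.9, [0.0, 1.0, 0.0]);
        println!("pixel ({}, {}): first kept = {}, second kept = {}", x, y, first, second);
    }
}
```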

Render State and Registers

The programmable shaders really determine the overall success of the pipeline and its ability to render the correct images on screen.

At a high level, the pipeline is a series of stages configured to process rendering primitives up until pixel color data can be loaded into the frame buffer at the end. It's a data processing powerhouse!

The set of data moving and transforming from one stage to the next is known as the Render State. The render state does not live in RAM (or VRAM) the same way a regular application would store its data. There is no call stack or heap. Instead, shaders receive input and send output by way of registers. There are four types of registers:

  • Input registers
  • Constant registers (uniforms)
  • Temporary registers
  • Output registers

Input registers and output registers are somewhat self-explanatory. These are specially designated register locations that the pipeline can load with input before running a given shader stage, or locations from which the pipeline can take data that was placed there by a shader.

Temporary registers are where the shader program places the results of intermediate calculations. They don't persist throughout the pipeline.

Constant registers, also known as uniforms, are (usually) read-only global state shared among all parallel running shader programs of the same type. They may even be shared among different stages in the pipeline. Typical kinds of data that get placed in these registers are a view-projection matrix (by programs that use some kind of Camera object, aka most games) and model instance data.
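One way to picture the four register kinds is to map them onto the parts of an ordinary function. The Rust below is only an analogy with hypothetical names: parameters play the role of input registers, a shared read-only struct plays the uniforms, local variables play the temporaries, and the return value plays the output registers. A plain translation matrix stands in for a real view-projection matrix.

```rust
type Mat4 = [[f32; 4]; 4];

/// "Constant registers": read-only, identical for every invocation.
struct Uniforms {
    view_proj: Mat4,
}

/// "Input registers": different for every invocation.
#[derive(Debug, Clone, Copy)]
struct VertexInput {
    position: [f32; 3],
}

/// "Output registers": what the next stage picks up.
#[derive(Debug, PartialEq)]
struct VertexOutput {
    clip_position: [f32; 4],
}

fn vertex_shader(input: VertexInput, uniforms: &Uniforms) -> VertexOutput {
    // "Temporary registers": locals that vanish when this invocation ends.
    let [x, y, z] = input.position;
    let homogeneous = [x, y, z, 1.0];

    // Matrix times column vector.
    let mut clip_position = [0.0f32; 4];
    for (row, out) in clip_position.iter_mut().enumerate() {
        for col in 0..4 {
            *out += uniforms.view_proj[row][col] * homogeneous[col];
        }
    }
    VertexOutput { clip_position }
}

/// Build a translation matrix (a stand-in for a camera's view-projection matrix).
fn translation(x: f32, y: f32, z: f32) -> Mat4 {
    [
        [1.0, 0.0, 0.0, x],
        [0.0, 1.0, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]
}

fn main() {
    // Loaded once, then shared by every invocation below.
    let uniforms = Uniforms { view_proj: translation(1.0, 2.0, 3.0) };
    let inputs = [
        VertexInput { position: [0.0, 0.0, 0.0] },
        VertexInput { position: [1.0, 1.0, 1.0] },
    ];
    for input in inputs {
        println!("{:?} -> {:?}", input, vertex_shader(input, &uniforms));
    }
}
```

Passing `uniforms` by shared reference mirrors the "(usually) read-only" part: no invocation can change what the others see.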

Speaking the GPU's language

Most programmers are intimately familiar with writing applications for the CPU. They write their applications in the language of their choice, such as Rust, Kotlin, Python, etc. Then a translating helper, in the form of a compiler or interpreter, takes their program code and transforms it into a version that the CPU can understand.

How do you talk to a GPU?

Modern graphics hardware implements one or more industry standard graphical APIs. Operating systems, like Windows, MacOS, or Linux, will have SDKs and runtime libraries for applications to use these APIs, with varying degrees of compatibility.

Graphics API

Some of the most popular PC graphics APIs

The industry standard graphical API belongs to a suite of tools by Microsoft called DirectX. For the most part, DirectX is only compatible with the Windows operating system, but last I checked, there might be support coming for Linux, through Microsoft's Windows Subsystem for Linux (WSL) project.

Both OpenGL and Vulkan are open cross-platform graphical APIs. For many years, if you wanted to write a graphical application to work on multiple platforms without having to write completely different code using the different APIs, OpenGL was the way to go. People sometimes view Vulkan as the next generation of OpenGL, but it's grown to be a bit different. Vulkan is considered to be "lower level" than OpenGL and OpenGL itself enjoys continued development, so Vulkan has hardly become a replacement.

Metal is a proprietary graphics API by Apple for MacOS. For many years, MacOS allowed you to use OpenGL, but at some point the powers-that-be at Apple decided to deprecate all other APIs and force everyone interested in writing graphical programs for Apple devices to use Metal. As far as I know, Vulkan recently became "supported" for MacOS, but through another library called MoltenVK which basically runs Vulkan on top of Metal.

These aren't the only graphical APIs out there. I'm sure video game consoles have their own API, even if they might support one of the cross-platform APIs above.

Grimoire ❤ wgpu-rs

This brings me to the graphical API that I decided to use for Grimoire's development. None of the above. From the top rope, in comes WebGPU!

Don't let the name fool you. Yes, the API specification is being written by the W3C's GPU for the Web Community Group. Yes, the ultimate goal is to enable web browsers to access the GPU in a performant way. But despite the name, the specification and the implementation work just as well for standalone applications like a videogame.

On top of that, it interops extremely well with Rust, given that the core low level implementation is written entirely in Rust.

Grimoire is using the user-facing wrapper library called wgpu-rs.


Love this library. Check it out here

Similar to the goals of OpenGL and Vulkan, the API is cross-platform and targets DirectX (specifically DirectX 12), Vulkan, and Metal, which covers all my bases for writing a game to work on any computer.

Application Responsibilities

The graphics pipeline, its stages, and its shaders are not useful until an application (like Grimoire) wields them.

Your application is your videogame. This application must manage the entire game state: It must remember where the player character is, remember where the enemies are, remember where the enemy projectiles are and what direction they're flying, etc. On top of that, your application needs to store the information needed to render all of it: everything from the vast mountains in the distance, to the claustrophobic chambers of a dungeon. The GPU and your graphics API of choice can't assist with any of this.

It's the application's responsibility to:

  1. Submit all relevant rendering primitives to the GPU,
  2. Load and manage data inside the pipeline's constant registers, and
  3. Invoke the pipeline to begin drawing.

We've talked at length about how the rendering pipeline stores data in registers and how each stage works to receive input and spit output to the next stage like one big chain of operations. This sounds like the application just shovels loads and loads of data over to the GPU, which isn't completely wrong, but doesn't tell the whole story.

In order to submit all relevant rendering primitives to the GPU, your game needs a step that determines how likely a given instance is to actually show up on screen. For example, in a first-person shooter game, there's no use sending the GPU rendering primitives for objects behind the player. Such an object isn't in the player's field of view, so it won't be seen, and if there's no chance for the object to be seen, then we'd just be wasting cycles on the GPU.
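The "behind the player" case is the simplest version of this check, and it boils down to a dot product. The sketch below is a deliberately crude assumption of how such a filter might look: it only rejects points behind the camera's position, whereas a real engine would typically test bounding volumes against the whole view frustum. The function names are hypothetical.

```rust
type Vec3 = [f32; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// True when `object` lies in front of a camera at `eye` looking along `forward`.
/// A positive dot product means the object is on the side the camera faces.
fn is_in_front(eye: Vec3, forward: Vec3, object: Vec3) -> bool {
    dot(sub(object, eye), forward) > 0.0
}

/// Keep only the objects worth submitting to the GPU.
fn visible_candidates(eye: Vec3, forward: Vec3, objects: &[Vec3]) -> Vec<Vec3> {
    objects
        .iter()
        .copied()
        .filter(|&object| is_in_front(eye, forward, object))
        .collect()
}

fn main() {
    let eye = [0.0, 0.0, 0.0];
    let forward = [0.0, 0.0, -1.0]; // looking down the negative z axis
    let objects = [
        [0.0, 0.0, -5.0], // straight ahead
        [0.0, 0.0, 5.0],  // directly behind
        [3.0, 1.0, -2.0], // ahead and off to the side
    ];
    let kept = visible_candidates(eye, forward, &objects);
    println!("submitting {} of {} objects", kept.len(), objects.len());
}
```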

Loading and managing data inside the constant registers efficiently takes some care, and it likely involves some kind of sorting. The read-only data inside of the pipeline's constant registers are, well... constant, for all stages in the pipeline. This means that any time you need to change the data in these registers, you'd have to flush the pipeline, load the new data into the registers, then fire off the pipeline again.

For example, imagine you've got a bunch of archer enemies in your game. You've got two meshes, one for the archer enemy model, and one for the arrow projectiles they're shooting. You need to render a scene containing 10 archers and 7 arrows. To do this efficiently, you might create a uniform buffer containing the locations of all 10 archers and load it into the constant registers, then feed the pipeline with the mesh data for the archer. Once that finishes, you'll flush the pipeline, then do the same thing for the arrows: create a uniform buffer with the locations of all 7 arrows in the scene, then feed the pipeline with the mesh data to draw an arrow. If you keep switching back and forth between archers and arrows, you'll be doing more than 1 flush of the constant registers, which is a waste of effort.
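The archer example can be sketched as a sort-then-batch step. This is a hypothetical CPU-side model, not wgpu-rs code: `Mesh`, `DrawItem`, and both functions are invented for illustration, and "uniform load" here simply counts how often the mesh changes from one draw to the next.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Mesh {
    Archer,
    Arrow,
}

#[derive(Debug, Clone, Copy)]
struct DrawItem {
    mesh: Mesh,
    position: [f32; 3],
}

/// Count uniform loads for a given draw order:
/// one for the first item, plus one every time the mesh changes.
fn count_uniform_loads(items: &[DrawItem]) -> usize {
    let mut loads = 0;
    let mut current: Option<Mesh> = None;
    for item in items {
        if current != Some(item.mesh) {
            loads += 1;
            current = Some(item.mesh);
        }
    }
    loads
}

/// Sort by mesh, then collect each mesh's positions into one batch.
fn batch_by_mesh(items: &[DrawItem]) -> Vec<(Mesh, Vec<[f32; 3]>)> {
    let mut sorted = items.to_vec();
    sorted.sort_by_key(|item| item.mesh);

    let mut batches: Vec<(Mesh, Vec<[f32; 3]>)> = Vec::new();
    for item in sorted {
        let start_new = batches.last().map_or(true, |(mesh, _)| *mesh != item.mesh);
        if start_new {
            batches.push((item.mesh, Vec::new()));
        }
        batches.last_mut().unwrap().1.push(item.position);
    }
    batches
}

/// The scene from the example, in a worst-case alternating order:
/// 10 archers and 7 arrows.
fn interleaved_scene() -> Vec<DrawItem> {
    let mut items = Vec::new();
    for i in 0..10 {
        items.push(DrawItem { mesh: Mesh::Archer, position: [i as f32, 0.0, 0.0] });
        if i < 7 {
            items.push(DrawItem { mesh: Mesh::Arrow, position: [i as f32, 1.0, 0.0] });
        }
    }
    items
}

fn main() {
    let scene = interleaved_scene();
    println!("unsorted uniform loads: {}", count_uniform_loads(&scene)); // 15
    let batches = batch_by_mesh(&scene);
    println!("batched uniform loads: {}", batches.len()); // 2
    for (mesh, positions) in &batches {
        println!("{:?}: {} instances in one draw", mesh, positions.len());
    }
}
```

Drawn in the alternating order, the 17 objects need 15 uniform loads; grouped by mesh they need 2, with a single flush in between.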

Cutting it here

This blog post was actually super light on the progress I made on the Grimoire engine itself. But that's okay. I'm really excited just to have learned so much about graphics programming. I'll be excited to show some screenshots and progress in a future blog post.

That's all for now. Thanks for reading!