Let’s prototype characters that think and feel

October 14

Since this is my own website, and it’s not really meant as a professional blog, let’s try something different. Usually I like to finish whatever I’m working on first, then do a write-up afterwards and make everything look good. For this post, I want to try a new format, and write as I go, like a live blog.

Let’s try and create videogame characters that think, feel, move and speak.

To prepare for this, I’ve spent the past month learning new things:

  • C++, and building a C++ library from source using CMake
  • Writing C# wrappers for C++ libraries
  • Working with the ONNX Runtime, including CUDA and TensorRT
  • Working with Animators, including bools, triggers, transitions and root motion
  • Working with Navmeshes, including runtime baking and dynamic obstacles
  • Working with ScriptableObjects, including saving/loading from disk

I will be using a technique called greyboxing, which I learnt in college. This means that this prototype won’t have fancy artwork; it will look as simple as possible.

See you soon.

October 15

Prepared the basics today: a simple character that can walk around and pick things up. Just because this prototype is going to use simple artwork doesn’t mean it won’t be path traced in real time!

I’m recording these with a 35mm f/1.4 lens at a 21:9 aspect ratio, 23.976 fps with a 180 degree shutter and 3-point lighting. I’ve also coded a little ffmpeg tool to automatically turn these into high quality gifs.
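
For reference, the gif conversion is the standard two-pass ffmpeg palette trick; here’s a minimal sketch of a wrapper around it (the file names, fps and width are placeholders, not my exact settings):

```csharp
// Minimal sketch of the two-pass ffmpeg GIF conversion (palettegen + paletteuse).
using System.Diagnostics;

static class GifTool
{
    static void Run(string args)
    {
        using var p = Process.Start(new ProcessStartInfo("ffmpeg", args) { UseShellExecute = false });
        p.WaitForExit();
    }

    public static void Convert(string input, string output, int fps = 24, int width = 720)
    {
        string filters = $"fps={fps},scale={width}:-1:flags=lanczos";

        // Pass 1: build an optimized 256-colour palette from the source clip.
        Run($"-y -i \"{input}\" -vf \"{filters},palettegen\" palette.png");

        // Pass 2: encode the GIF using that palette for much better colours.
        Run($"-y -i \"{input}\" -i palette.png -lavfi \"{filters} [x]; [x][1:v] paletteuse\" \"{output}\"");
    }
}
```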

October 16

I added basic character stats and some debug GUI today. The next part is going to take a bit longer, so I don’t know when the next update will be. I spent the first part of 2024 creating my own game engine as a C# console application to prepare for this moment, so now it needs to be ported. It includes a full interaction engine with policies.

Once the interaction engine has been ported, it needs to be hooked up to the characters above, and we’re going to need some kind of turn based gameplay controller to manually control the characters for now.

October 19

Finished porting my interaction engine and managed to make some improvements in the process. I also added policies for navigation (calculating paths) and visibility (tracing rays) so the LLM will not attempt to interact with items it cannot reach. I also finished up the turn based gameplay controller and everything is working as it should; I’ll add some footage later.

I also helped the LLamaSharp team with their 0.18 release, since we’re going to be using that later in this prototype.

Next up is the sensory engine, which will prepare the characters to see, hear, smell and feel.

October 21

I added support for walking to interactables today. It works by tracing a ray to the interactable, subtracting the interactable’s diameter, and setting the result as a pathing target, while checking whether it’s reachable. I’ve also added sittables, which extend the interactables and have both a sit and a stand target instead of dynamically generating a destination. It’s also possible to sit down on the sittables and stand up again. No code progress on the sensory engine, because I am still thinking of a way to store the actual data, since we’re going to need to be able to transform it for both the LLM and the GUI. Simply storing interaction-sensor data isn’t enough either, since the character should be able to perceive (for example) changes in weather, and other world info.
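
Here’s a rough sketch of the pathing-target part, assuming Unity’s NavMesh API; the names (agent, interactableDiameter, etc.) are just for illustration:

```csharp
using UnityEngine;
using UnityEngine.AI;

public static class WalkToInteractable
{
    // Computes a pathing target just short of the interactable (by its diameter) and
    // checks whether a complete NavMesh path to it exists.
    public static bool TryGetTarget(NavMeshAgent agent, Transform interactable,
                                    float interactableDiameter, out Vector3 target)
    {
        // Direction of the "ray" from the character towards the interactable.
        Vector3 toInteractable = (interactable.position - agent.transform.position).normalized;

        // Subtract the diameter so we stop next to the interactable instead of inside it.
        target = interactable.position - toInteractable * interactableDiameter;

        var path = new NavMeshPath();
        return NavMesh.CalculatePath(agent.transform.position, target, NavMesh.AllAreas, path)
               && path.status == NavMeshPathStatus.PathComplete;
    }
}
```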

October 22

Added a world-space interaction debug interface and started designing the basics for the perceptual/sensory buffer.

October 23

I have finished designing the perceptual buffer and have the first version up and running.

October 24

Today I’ve been working on perceptual policies, which will decide what the character can see/hear/smell/feel using tools like ray tracing and human eye FOV calculations. I also implemented UUID4 for interactables and the perceptual buffer today. There is actually a good reason behind this, something I came up with; I will explain later.
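
To give an idea, here’s a minimal sketch of what a vision policy check could look like (FOV angle test plus a single line-of-sight ray); the field names and values are illustrative, not my actual code:

```csharp
using UnityEngine;

public class VisionPolicy
{
    public float FieldOfViewDegrees = 120f; // placeholder value
    public float MaxDistance = 25f;         // placeholder value

    public bool CanSee(Transform observer, Vector3 targetPosition)
    {
        Vector3 toTarget = targetPosition - observer.position;

        // Too far away to see.
        if (toTarget.magnitude > MaxDistance)
            return false;

        // Outside the observer's field of view.
        if (Vector3.Angle(observer.forward, toTarget) > FieldOfViewDegrees * 0.5f)
            return false;

        // Single line-of-sight ray: if something is hit well before the target position,
        // the view is blocked.
        if (Physics.Raycast(observer.position, toTarget.normalized, out RaycastHit hit, toTarget.magnitude))
            return (hit.point - targetPosition).sqrMagnitude < 0.01f;

        return true;
    }
}
```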

October 25

I finished coding the perceptual policies today; next up are continuous interactions. Currently, when an interaction gets executed (lighting a campfire, or sitting down on a chair) it gets fired once, and whether a character perceives this (and thus registers the interaction in their perceptual buffer) depends on the registered perceptual policies for that interaction. However, if a character sits down on a chair or starts dancing (which shouldn’t just fire once, but should be some kind of continuous state), and we then walk around this wall and see the character, we should still perceive their continuous interaction (and it should end up in our perceptual buffer).
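
As a rough sketch of the difference (with invented stub types, not the real engine code): one-shot interactions are evaluated against observers once, at the moment they fire, while continuous interactions keep being re-checked every frame while they are active:

```csharp
using System;
using System.Collections.Generic;

public class PerceivedInteraction
{
    public Guid Id; // UUID4 of the interaction
}

public class Observer
{
    public HashSet<Guid> PerceptualBuffer = new HashSet<Guid>();
    public Func<PerceivedInteraction, bool> Perceives; // the evaluated perceptual policies
}

public static class ContinuousPerception
{
    // One-shot interactions are handled once, when they fire. Active continuous
    // interactions are re-offered every frame, so an observer who only gets line of
    // sight later (e.g. after walking around a wall) still registers them.
    public static void Tick(List<PerceivedInteraction> activeContinuous, List<Observer> observers)
    {
        foreach (var interaction in activeContinuous)
            foreach (var observer in observers)
                if (!observer.PerceptualBuffer.Contains(interaction.Id) && observer.Perceives(interaction))
                    observer.PerceptualBuffer.Add(interaction.Id);
    }
}
```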

October 26

Started working on the continuous interactions today. I’ve also done a lot of refactoring of all the existing code. I also wasn’t happy with the single-ray vision policy, so I added a cone tracing option. Here’s an example with 8 rays @ 8 degree spread angle.
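
A minimal sketch of the cone tracing idea could look like this; only the 8 rays at 8 degrees match the example above, the rest of the names are illustrative:

```csharp
using UnityEngine;

public static class ConeVision
{
    public static bool CanSee(Vector3 eye, Transform target, int rayCount = 8, float spreadDegrees = 8f)
    {
        Vector3 forward = (target.position - eye).normalized;
        float distance = Vector3.Distance(eye, target.position);

        // Any axis perpendicular to the view direction, used to tilt the rays outward.
        Vector3 perpendicular = Vector3.Cross(forward, Vector3.up);
        if (perpendicular.sqrMagnitude < 1e-4f)
            perpendicular = Vector3.Cross(forward, Vector3.right);
        perpendicular.Normalize();

        for (int i = 0; i < rayCount; i++)
        {
            // Tilt the view ray outward by the spread angle, then spin it around the view axis.
            float around = 360f * i / rayCount;
            Vector3 direction = Quaternion.AngleAxis(around, forward)
                              * (Quaternion.AngleAxis(spreadDegrees, perpendicular) * forward);

            // The target counts as visible if any of the rays reaches it unobstructed.
            if (!Physics.Raycast(eye, direction, out RaycastHit hit, distance) || hit.transform == target)
                return true;
        }
        return false;
    }
}
```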

October 28

Continuous interactions are more complex than expected. Let’s take speaking, for example. Inference is done through tokens (text-based), and we can’t just mark the interaction as completed when we are done generating tokens; we need to actually keep track of the playback of the second layer of inference (text -> audio). We don’t want the second character (Alice) to reply while the first character (Bob) is still speaking. We also don’t want inference/speech to pile up (things they’ve been wanting to say for the past 10 minutes while still speaking their previous sentences), and we don’t want the second character to wait endlessly if the first character just won’t stop talking (because the first character notices the second character keeps waiting for them to finish).
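
One way these turn-taking rules could be expressed (a very rough sketch with placeholder timings, not my final design):

```csharp
using System;

public class SpeechTurnGate
{
    public TimeSpan StaleAfter = TimeSpan.FromSeconds(30); // placeholder
    public TimeSpan MaxWait = TimeSpan.FromSeconds(20);    // placeholder

    private DateTime _waitingSince = DateTime.MinValue;

    // otherIsSpeaking must track the *audio playback* of the other character,
    // not just whether their token generation has finished.
    public bool MayStartSpeaking(bool otherIsSpeaking, DateTime replyQueuedAt, DateTime now)
    {
        // A reply that has been queued for too long is stale; the caller should drop it.
        if (now - replyQueuedAt > StaleAfter)
            return false;

        if (!otherIsSpeaking)
        {
            _waitingSince = DateTime.MinValue;
            return true;
        }

        // Don't wait forever on a speaker who never stops talking.
        if (_waitingSince == DateTime.MinValue)
            _waitingSince = now;

        if (now - _waitingSince > MaxWait)
        {
            _waitingSince = DateTime.MinValue;
            return true;
        }
        return false;
    }
}
```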

Not sure how long the design of the improved interactions will take. For now, this is my updated short-term todo:

  • Continuous interactions <- you are here
  • Grabbable interactables
  • Environment/health percepts
  • Interaction response percepts
  • TTS in talk interaction
  • Interaction save/playback
  • Object based inference C# side (encoder + decoder)
  • Object based inference LLaMA side (training)
  • Character stats + basic challenge
  • First test (youtube video?)

And also a few nice to haves, but not required:

  • Queued interactions
  • Equipables

October 29

The first thing we’re going to need to do is extend the TTS implementation. This morning I added the required functionality for an IsSpeaking flag and OnSpeechStarted and OnSpeechCompleted events. I’ve also added events for navigation (OnDestinationReached).
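
Roughly, the speaking side now looks like this; the class and method names are placeholders, only IsSpeaking, OnSpeechStarted and OnSpeechCompleted come from the actual changes (OnDestinationReached lives on the navigation side and isn’t shown):

```csharp
using System;
using System.Threading.Tasks;

public class TtsSpeaker
{
    public bool IsSpeaking { get; private set; }

    public event Action OnSpeechStarted;
    public event Action OnSpeechCompleted;

    // Marks the speaker busy for the duration of audio playback, not just token generation.
    public async Task SpeakAsync(Func<Task> synthesizeAndPlay)
    {
        IsSpeaking = true;
        OnSpeechStarted?.Invoke();
        try
        {
            await synthesizeAndPlay();
        }
        finally
        {
            IsSpeaking = false;
            OnSpeechCompleted?.Invoke();
        }
    }
}
```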

October 30

Finished coding the continuous interactions this morning. I’ve decided to skip the grabbables for now; they are not required for a first prototype, so let’s not over-engineer this. If we need to transport items, we will use a backpack, which is easier to code. I’ve also improved the Interaction GUI (and thus the data that will be sent to the LLM) to respect the AlwaysVisible property on the Interactions. This means we can now see some Interactions in the world even though they cannot be executed, with the reason shown in red.

October 31

Forgot something important. Obviously we don’t want the characters to be aware of everything in the world. If we want to sit down on a chair, we first need to know if that chair exists, so we first need to discover it. Finished up the discovery code this morning, so we’re good to go!

Interactable discovery is handled every frame and stays in memory, so when a character turns around, facing their back to an interactable, or hides behind a wall, they will not forget about that interactable. We also automatically discover interactables if we perceive interactions through evaluated perception. For example, if Bob talks to Alice behind her back and Alice has never seen Bob but can hear him, she will now learn of Bob’s existence. Also, if Alice has not yet seen the chair but can hear Bob sitting down on it, she will learn about the chair’s existence. Obviously, this only works if her sensors can successfully register this.
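
A simplified sketch of the discovery memory (illustrative types and names, not the real engine code):

```csharp
using System;
using System.Collections.Generic;

public class DiscoveryMemory
{
    private readonly HashSet<Guid> _known = new HashSet<Guid>();

    public bool Knows(Guid interactableId) => _known.Contains(interactableId);

    // Called every frame with whatever the character's senses currently register.
    public void DiscoverDirect(IEnumerable<Guid> currentlyPerceivedInteractables)
    {
        foreach (var id in currentlyPerceivedInteractables)
            _known.Add(id); // turning around or hiding behind a wall never removes entries
    }

    // Called when an interaction is perceived (e.g. hearing Bob sit down on a chair):
    // both the actor and the target become known, even if they were never seen.
    public void DiscoverFromPerceivedInteraction(Guid actorId, Guid targetInteractableId)
    {
        _known.Add(actorId);
        _known.Add(targetInteractableId);
    }
}
```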

Next up are environment & health percepts, which means being able to sense (for example) changes in temperature, and to feel pain from (for example) freezing temperatures.

November 1

Created the basics for time of day & temperature today, no percepts on the character side yet.

November 2

Characters now take damage due to cold, and health percepts are in, which work according to an accumulated damage & time threshold to avoid spamming the perceptual buffer. I’ve also added interaction response percepts. For example, if we’re gathering wood, we’d want to notify the character how much wood they have gathered, or if their axe breaks.
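
The accumulation logic is roughly this (threshold values are placeholders):

```csharp
using System;

public class HealthPerceptAccumulator
{
    public float DamageThreshold = 5f;                      // placeholder
    public TimeSpan MinInterval = TimeSpan.FromSeconds(10); // placeholder

    private float _accumulated;
    private DateTime _lastPercept = DateTime.MinValue;

    // Returns true when enough damage has built up and enough time has passed,
    // so the caller emits a single percept instead of spamming the buffer.
    public bool Accumulate(float damage, DateTime now)
    {
        _accumulated += damage;
        if (_accumulated < DamageThreshold || now - _lastPercept < MinInterval)
            return false;

        _accumulated = 0f;
        _lastPercept = now;
        return true;
    }
}
```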

Ideally the health/damage thresholds would be triggered by the LLM (in batches of processed data) instead of a certain amount of damage over time, will look into this later.

November 3

I’ve been working on the TTS implementation to prepare for the next item on the todo; however, things get pretty complicated when we want to have two different speakers at the same time in an efficient way.

November 6

Here’s the deal, I’ll dump my raw thoughts here so you know what’s going on.

We can’t allow feature creep and over-engineering during prototyping, but what we do build should be done properly, so we don’t end up throwing everything away and getting into tech debt later. It’s very easy to build things fast to show off, but not everything translates well into a final product.

The interaction engine is done and the code is clean; I’m satisfied. However, for TTS, things get complicated. As a game developer, you also need to have some understanding of, and investment in, the legal side of things. I have a separate drive with all my licenses and invoices related to this project.

I’m using Piper (C++) and a wrapper (C#) to use the C++ library in the C# engine. Piper uses a phonemizer to convert text to phonemes. Piper and piper-phonemize are both released under the MIT license, which means we can use them, but piper-phonemize relies on an espeak-ng fork released under the GPL-3.0 license, which means we can’t use it.

The next version of piper will make use of epitran, which is MIT. The thing is, we don’t know how long this will take, and if it takes one or two years, I won’t be able to ship the game. I could take the gamble, keep doing what I do, and hope the piper version with epitran releases around the time I want to release my game, but I don’t like gambling.

There are multiple discussions (1, 2) about the licensing for the espeak/piper repos, which has become a difficult subject because the original developer (Jonathan Duddington) disappeared off the face of the earth years ago, so the license cannot be changed. However:

“espeak-ng code owners are fine with API use via dynamic linking for closed source program”

The thing is, I want to be sure, and I don’t like the idea of a GPL dependency somewhere down the hierarchy. I came up with a solution to create an open source voice plugin (MIT), and allow users to select their own version of the voice plugin. The game would ship with a version of the open source plugin with some basic mumble sounds, and people would be able to download, compile, or even fork and build their own plugin with piper support, if they wanted to.

Another option is to use a piper fork that does not use espeak. However, their API works completely differently. I could take their whole project apart and build a compatible API and wrapper for it, but I’m very new to C++, so this would take a long time. On top of that, they mention the “phone conversion is not nearly perfect but it should work for simple applications”, so it could be that it’s so bad that it’s not even usable, and I won’t find out until I’m done, which sucks.

Moving on, the voice library I wanted to use is libritts_r, which can be found on huggingface over here. The voices are released under an MIT license, but diving into a licensing discussion over here reveals that they were finetuned on a lessac dataset (blizzard license), which means they can not be used commercially, so we need to swap to the regular libritts (CC BY 4.0 license).

This is a lot of work, and it doesn’t even have anything to do with the technical challenges. The current C# wrapper I’m using is based on an open source wrapper, and I have added functionality for:

  • CUDA (C# and C++ side)
  • Multi-speaker model loading (C# and C++ side)
  • Async

However, to offer IL2CPP compatibility, we cannot marshal delegates that point to instance methods into native code, which the wrapper does by default. I’ve made the changes to convert these to static methods; however, the callback writes to a PCM buffer, and static methods cannot write to an instanced PCM buffer, which means the PCM buffer also needs to be static, which means we cannot use separate instanced PCM buffers for different characters in the game.

One option is to convert the PCM buffer to some kind of dictionary format: generate unique ids for speakers, use them as indexes for the PCM buffers, pass them to the C++ side when generating speech, and hand them back to the callback on the C# side to store the audio in the correct location.
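
A sketch of that option; the native delegate signature and the speaker-id plumbing here are assumptions for illustration, not the real wrapper’s API:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using AOT;

public static class PcmRouter
{
    // One buffer per speaker id, reachable from a static IL2CPP-safe callback.
    private static readonly ConcurrentDictionary<int, List<short>> Buffers =
        new ConcurrentDictionary<int, List<short>>();

    public delegate void AudioCallback(IntPtr samples, int sampleCount, int speakerId);

    // IL2CPP cannot marshal instance-method delegates to native code, so the callback
    // is static; the speaker id passed through the native side tells us where to write.
    [MonoPInvokeCallback(typeof(AudioCallback))]
    public static void OnAudio(IntPtr samples, int sampleCount, int speakerId)
    {
        var buffer = Buffers.GetOrAdd(speakerId, _ => new List<short>());
        var chunk = new short[sampleCount];
        System.Runtime.InteropServices.Marshal.Copy(samples, chunk, 0, sampleCount);
        lock (buffer) buffer.AddRange(chunk);
    }

    // Drains the accumulated PCM for one speaker (e.g. to feed an audio source).
    public static short[] Drain(int speakerId)
    {
        if (!Buffers.TryGetValue(speakerId, out var buffer)) return Array.Empty<short>();
        lock (buffer)
        {
            var result = buffer.ToArray();
            buffer.Clear();
            return result;
        }
    }
}
```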

For now, I have created a separate TTS branch and started completely from scratch; not sure which direction I’ll choose.

November 7

Alright, I’ve made my decision. I will be taking apart the piper without espeak project, and:

  • Clean up their code
  • Port my changes
  • Add compatibility for upstream API
  • Write C# wrapper from scratch

This way, we can get the best of both worlds. We can go full MIT and have everything properly set up on the licensing side, while also offering future compatibility with upstream, in case of an early epitran release.

Their code is a bit of a mess (which they also warn about in their readme), so it’s going to need some restructuring and cleaning up, but that’s fine. I’ve always looked up to C++ as some kind of elite programming language that I wanted to learn some day, so I see this as a good opportunity.

See you soon.

November 10

IPA loading is in on both the C++ and C# side.

The first version is up and running in-engine; we have audio output! Lots of cleaning up to do, but I’m very happy.

November 12

I’ve cleaned up a lot of the code so far, nothing much to write about, going at a steady pace.

November 13

Callbacks and PCM buffers now use IL2CPP compatible instances, which is the first step to supporting multiple voices at the same time.

November 15

Currently implementing SafeHandles; after that, I’ll do a first test with multiple speakers.

November 20

SafeHandle implementation is done. Currently working on some stuff behind the scenes to prepare for the next step, will probably create a part 2 of this post where we actually move to the LLM side.

November 25

Quick update before I head to bed. I built a little low power “llama box” last week. It’s a server with an i7-6700K, 16 GB DDR4 and a Tesla P4, overclocked to 1531 MHz and modified to use the CUDA sysmem fallback policy. It can finetune a 4096 context length, rank 32, 4-bit QLoRA for a 7B LLM in a single night, and allows me to train and game dev at the same time.

I’ve also been researching evaluation datasets, and prepared a tool to visualize my training results. I’ve been running experiments for the past week to make sure I understand exactly what I’m doing and have the best possible training setup ready for part 2 of this blog. I’ve been training models with DoRA, rsLoRA, Unsloth, Liger Kernel, all the way from rank 8 to rank 128, with dropouts from 0 to 0.2, warmups from 0 to 0.2, learning rates from 0.00001 to 0.00005, different schedulers and layer targets, to compare every combination in terms of training and eval loss.

As for the Piper implementation, I’m debating between finishing it up as-is, or including support for multiple backends (ROCm, TensorRT, etc.), settings and inference stats. It’s nice to be able to support AMD cards and to have settings/stats for debugging, but it’s not a requirement for now.

November 26

Decided to add the multiple backends with their settings anyway. I’m skipping the inference stats; we don’t need them.

December 4

Was suffering from sinusitis for the past 1-2 weeks, never had this before, but I’m starting to feel better now. Where were we again?

Currently, the C++ side of the TTS library uses a synthesis config (noise parameters, etc.) which is stored in the Voice. After playing around with some code changes, I noticed we can actually change speaker without re-loading the voice. Ideally, I would like to rename the Voice to a Model internally (to avoid confusion), split the synthesis config into a model config and a speaker/synthesis config, and maintain these on the C# side. This way we can load the model with the model config (persistent settings) and execute inference with our speaker/synthesis config (temporary settings). If we do this, we can have two separate characters, speaking with different voices, using the same model loaded in memory.
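
To make the idea concrete, here’s roughly what I’d like the C# side to look like; these class names are the API I have in mind, not the current wrapper:

```csharp
public class ModelConfig
{
    public string ModelPath;
    public string Backend; // e.g. "cpu", "cuda", "tensorrt"
}

public class SpeakerConfig
{
    public int SpeakerId;
    public float NoiseScale;
    public float LengthScale;
    public float NoiseW;
}

public class TtsModel
{
    // Persistent settings: the ONNX model is loaded once with the model config.
    public TtsModel(ModelConfig config) { /* load model */ }

    // Temporary settings: two characters can call this concurrently with different
    // SpeakerConfigs while sharing the same model loaded in memory.
    public short[] Synthesize(string text, SpeakerConfig speaker)
    {
        // ... inference with per-call speaker/synthesis settings ...
        return System.Array.Empty<short>();
    }
}
```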

I did a quick test with this when adding the backend selection, but noticed it was crashing on leaving play mode. I’ll re-do the backends and find a way to make this run without issues.

December 6

Re-did the backends: full async multi-speaker with different voices sharing the same loaded model. I think there’s a small issue with streaming text left (we might need this to improve LLM response times); I’ll have a look tomorrow.

December 11

Not much to write about currently. I upgraded the llamabox from 16GB to 32GB RAM (and upgraded the development PC from 32GB to 64GB RAM), added an extra 256GB SSD and upgraded its connection from 200mbit to gigabit; it’s currently training another test model.

As for the TTS implementation, I need to experiment with streaming text and cherry pick the event (speech start / speech complete) commits from the dev branch because of a bad merge, and then it should be ready to go into the main project.

December 18

Streaming text now works. It was a little difficult because doing batches of words causes unwanted pauses here and there and is really dependent on text generation speed, so it now speaks sentence by sentence. I’ve also cherry-picked the event code from the dev branch, all ready to go into the main project.
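
A minimal sketch of the sentence-by-sentence streaming (with simplified terminator handling):

```csharp
using System;
using System.Text;

public class SentenceStreamer
{
    private readonly StringBuilder _pending = new StringBuilder();
    private readonly Action<string> _speak;
    private static readonly char[] Terminators = { '.', '!', '?' };

    public SentenceStreamer(Action<string> speak) => _speak = speak;

    // Called for every token the LLM produces.
    public void OnToken(string token)
    {
        _pending.Append(token);
        if (token.IndexOfAny(Terminators) < 0)
            return;

        // Flush the completed sentence to the TTS and keep anything after the terminator.
        string text = _pending.ToString();
        int cut = text.LastIndexOfAny(Terminators) + 1;
        _speak(text.Substring(0, cut).Trim());
        _pending.Clear();
        _pending.Append(text.Substring(cut));
    }

    // Call when generation ends to speak any trailing words without a terminator.
    public void Flush()
    {
        string rest = _pending.ToString().Trim();
        if (rest.Length > 0) _speak(rest);
        _pending.Clear();
    }
}
```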

December 19

I’m not satisfied with speaking sentence by sentence, as we need to wait until we have a full sentence before the character starts speaking, and I’ve come up with an idea to improve this. Will give this a shot tonight. It’s a tricky subject because it requires a few locks in the right places to make this thread safe from both the C# and C++ sides, and we can’t allow inference for every word because it will spam and overload the inference engine. More soon.

Introduction

I’ve been thinking about starting this blog for a while now. However, I’ve been stalling because I’m not sure which direction I want to take here. Do I want to blog for an audience, or just as a logbook for myself? Do I include explanations to make things easier to read? Do I write a big introduction post with all history up to this point, or just add pieces where necessary? I guess I want to please every possible audience, but is that even realistic? Do I stall on these questions for another week, or just start right now?

Let me take you on a journey where I experiment with bleeding edge technology in videogames. The screenshots in this post have been taken in-engine, using the graphics code I have been working on for the past 3 years.

My name is Ramon Stefano. I’ve been modding and creating games since I was a kid. I started with drawing on paper, switched to 3d modeling and texturing when I was 12 years old, and then started coding. I’ve studied game design & application development in college, worked full time in racing games & API development, and have been working on my passion project, a racing simulator, for many years.

I like film/cinematography and want to create immersive, magical experiences in videogames that make you forget about the real world.

To do this, I want to bring photorealistic graphics to videogames using real time path tracing. This road from the very first videogames to photorealism, with graphics and graphics cards improving every year, is a journey we’ll only get to experience once, and I want to be a part of it.

I like to experiment with bleeding edge technology. The kind of things that are still in active research, things that are not ready, not stable, and won’t appear in mainstream videogames for the next few years.

This summer, I’ve been preparing the technology for a little side project called Tiny Adventurers, a game about clay characters coming to life and having to face challenges to survive.

Here’s what I have in mind:

  • Path traced graphics
  • Ray traced audio
  • Speech to text
  • Text to speech
  • LLM trained on personality, sensory and interaction data

To start off, to achieve path traced graphics, we’re going to need a somewhat modern graphics card.

I will be doing all the graphics programming on this RTX 3080.

The RTX 3080 has the necessary RT and CUDA cores on board which we need to do path tracing and LLM inference in real time.

We’re also going to need a VR headset. I’ve picked the Quest 1 over the Quest 2 because it’s supposed to have better colors. I prefer realistic colors over resolution.

And of course, papers, lectures and books. I got myself this nice hardcover of Ray Tracing Gems II which includes a lot of key ingredients towards building a real time path tracer.

I dropped everything maths related in school because I wanted to become a guitar teacher. Unfortunately, graphics is all about maths and statistics. Thanks to TU Wien & Utrecht University, I’ve been studying and watching lectures to get back up to speed.

I want the characters to have some kind of personality. For the past few years I’ve been reading books, watching interviews and listening to podcasts to learn about personality disorders, but I don’t know a lot about personality types. So, I decided to pick up this book about personality types called “Surrounded By Idiots”, a Dutch book for a change.

If you’ve come this far, thanks for reading. The next post will most likely be a technical deep dive into either graphics or character behavior.