049 - Initial benchmarking of the autonomous NPC architecture

January 5, 2020

Rendering Pookie's House

We started off the week by creating a mesh for the sticks that surround Pookie's house.

We created one mesh and then rendered it repeatedly in different orientations on the tiles surrounding Pookie's home - leveraging some of the Scenery and RenderableId work from the last two weeks.

[Image: Pookie's house] Rendering Pookie's house. Still need to add a few more things inside - also need to eventually become a good artist. Practice makes perfect, hopefully.

Autonomous entity movement

After getting Pookie's house rendering I wanted to quickly add some components to my specs entity component system that would allow non-player characters to move around - instead of just standing still like they do now.

I spent a few minutes planning this out on paper - then realized that it would be best to first do some reading on some of the latest techniques for designing autonomous NPCs.

My googling landed on behavior trees first - but they didn't feel quite as modular as I would like.

I then found goal oriented action planning and it seemed to fit the bill of what I was looking for. I started to get ideas and inspiration on how I could add interesting autonomous behavior to NPCs over time without things progressively becoming unmaintainable.

I started by trying to plan out the first pass at the goal oriented action planning system - but that proved a bit difficult. It's hard to design when you don't have a use case in mind just yet.

So I took it back to the basics. I only planned out the higher level data flow - making sure that it would allow me to fill in more blanks over time as I learned and had more requirements.

The flow that I landed on was to treat NPCs in much the same way we treat players. Right now players have a ClientConnectionComp which is used to send them the latest information that they know about the world once per game tick.

/// Powers clients being able to connect to the game and control an entity
#[derive(Component, Debug)]
pub struct ClientConnectionComp {
    /// The user_id for this player / client.
    ///
    /// During real gameplay this typically comes from the `users` table, or in integration
    /// tests it's a made up key into a HashMap of components.
    user_id: i32,
    /// Used to send state updates to the client by either WebSocket or MPMC channel.
    client_sender: ClientSender,
    /// Indicates that the ClientSender's connection has closed - and when that happened.
    ///
    /// After some time has elapsed the ClientConnectionMonitorSystem will remove this entity
    /// from the world.
    connection_closed_at: Option<DateTime<Utc>>,
}

There is also a PendingClientRequests struct that holds requests that the client has sent to the server; these get processed once per game tick. A client request might, for example, be to walk, or drink, or drop something.
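To make that concrete, here's a rough sketch of the shape that implies - the derive, field, and variant names here are my assumptions, not the real definitions:

/// Holds client requests that arrived since the last game tick. A system
/// drains and processes these once per tick.
#[derive(Component, Debug, Default)]
pub struct PendingClientRequests {
    /// Requests are applied in the order they were received.
    requests: Vec<ClientRequest>,
}

/// A few example request variants like the ones mentioned above.
/// (Hypothetical - the real enum surely has more variants and data.)
#[derive(Debug)]
pub enum ClientRequest {
    Walk { x: i32, y: i32 },
    Drink,
    Drop { inventory_index: usize },
}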

We want our NPCs to behave similarly. They receive their known world state, can look at it, and send client requests indicating what they want to do. From the game's perspective an NPC is really no different from a real player. They just happen to use code to decide what ClientRequestBatches to send up to the server instead of eyes, hands and a brain.

So with that planned out, the only implementation work needed to start was a quick version one that sends state to autonomous entities - and has those entities not even look at the state, instead just randomly sending up a request to move. Then over time, as I need more behavior, I can start looking at the client's state and use something similar to goal oriented action planning to decide on client requests to send.

This architecture has another benefit of allowing me to distribute NPC goal selection over multiple machines should I ever need to. Each machine would just have a TCP or WebSocket connection to the server - receive state - send up client requests. NPC autonomous behavior would be decoupled from the game server.
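As a gut check, here's a minimal sketch of what the version one NPC driver could look like (assuming the rand crate, and reusing the hypothetical ClientRequest from the sketch above):

use rand::Rng;

/// Hypothetical stand-in for the known world state the server sends clients.
pub struct ClientWorldState;

/// Version one of autonomous behavior: don't even look at the state - just
/// occasionally decide to wander one tile in a random direction.
fn decide(_state: &ClientWorldState) -> Option<ClientRequest> {
    let mut rng = rand::thread_rng();

    if rng.gen_bool(0.25) {
        Some(ClientRequest::Walk {
            x: rng.gen_range(-1..=1),
            y: rng.gen_range(-1..=1),
        })
    } else {
        None
    }
}

The important property is that decide only ever sees what a real client would see - so swapping in goal oriented action planning later means changing this one function, not the server.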

Assessing technical feasibility

There was one major assumption in our autonomous NPC architecture that needed to be validated before we could lock into the strategy.

Can we iterate over thousands of entities every game tick (around 600ms) and calculate, serialize and then send them their known world state?

If the answer to this question was yes - we'd be in a great position. If our autonomous decision making ever got too slow for the single game server we could spread it across multiple additional machines.

If the answer was no - it was back to the drawing board.

Starting the benchmarking adventure

So we needed to verify that we could iterate through thousands of entities - calculate their state - serialize their state - and send their state over some sort of ClientSender (our light abstraction over how a client is connected to the game).

I made up a rough goal of getting to sub 300ms for 10,000 connected clients. I want the game to support 5,000 connected players - so this would leave room for a maximum of 5,000 autonomous NPCs being processed alongside them.

The nature of the code under test is that it more or less looks like this (pseudocode):


// Pseudocode for how we send known state to entities every game tick.
for entity in entities {
    let mut state = ClientWorldState::new();
    for other_entity in entities {
        state.add_entity(other_entity);
    }
    entity.send_state(state);
}

All of my benchmarking is on a 6 core 2019 MacBook Pro - so I went in hoping that if I could achieve sub 300ms on my personal computer I'd be in good shape running the game on a much more powerful AWS EC2 instance.

Here's how the benchmark looks (powered by criterion):

/// Benchmark a run of the ClientStateUpdaterSystem with
/// 10_000 idle entities at the same location.
pub fn csu_10_000_idle_entities(c: &mut Criterion) {
    c.bench_function("CSU 10_000 idle entities", |b| {
        let mut world = create_world_without_entities().world;

        let mut receivers = vec![];

        for user_id in 0..10000 {
            let (sender, receiver) = crossbeam_channel::unbounded();
            receivers.push(receiver);

            world
                .create_entity()
                .with(ClientConnectionComp::new(
                    user_id,
                    ClientSender::new_mpsc(sender),
                ))
                .with(TilePositionComp::new_1x1(50, 50))
                .build();
        }

        b.iter(|| {
            ClientStateUpdaterSystem.run_now(&world);

            // Clear the buffer of state updates after each run
            for receiver in receivers.iter() {
                receiver.try_recv();
            }
        })
    });
}

Note that at the end of each iteration we run 10,000 try_recv calls. This is just to prevent the memory footprint from growing on each run of the loop by ensuring that we clear out any ClientWorldState that was sent to our fake clients. In the future I could adjust the benchmark so this doesn't happen inside of the timed section - but I wasn't going for perfection here - just a ballpark benchmark.

First take - humble beginnings

I took a first look at the numbers before changing any of the code. We were synchronously looping over 10,000 entities with a nested loop over another 10,000 entities - 100 million inner iterations per tick - so I expected things to be slow.

Benchmarking CSU 10_000 idle entities: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 171781.7s or reduce sample count to 10
Benchmarking CSU 10_000 idle entities: Collecting 100 samples in estimated 171782 s (5050 iterations)

171782 / 5050 = ~34 seconds per iteration. Very far from our 300ms target.

Try outer par join

Using specs's parallel iterator was a one line change and dropped the time per iteration to 16.6 seconds on my 6 core machine.
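Roughly, the change looks like this (a sketch - the component storages come from the snippets in this post, and the real system's run method does far more):

use rayon::prelude::*; // brings ParallelIterator::for_each into scope
use specs::prelude::*; // brings Join / ParJoin into scope

struct ClientStateUpdaterSystemSketch;

impl<'a> System<'a> for ClientStateUpdaterSystemSketch {
    type SystemData = (
        ReadStorage<'a, ClientConnectionComp>,
        ReadStorage<'a, TilePositionComp>,
    );

    fn run(&mut self, (client_conns, positions): Self::SystemData) {
        // Before: (&client_conns, &positions).join().for_each(...) ran the
        // outer loop on a single core.
        // After - the one line change - par_join fans the outer loop out
        // across all cores via rayon:
        (&client_conns, &positions).par_join().for_each(|(_conn, _pos)| {
            // Calculate, serialize and send this entity's known world state.
        });
    }
}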

Benchmarking CSU 10_000 idle entities: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 84072.6s or reduce sample count to 10
Benchmarking CSU 10_000 idle entities: Collecting 100 samples in estimated  84073 s (5050 iterations)

Note that in a real benchmark you should wait for the benchmark to finish. For our technical feasibility testing I just took the estimate that criterion prints out within a few seconds of starting the benchmark and ran with that.

Now limit clients to knowing about at most 100 other entities

We dropped things down to around 156 milliseconds per iteration by letting each client know about a maximum of 100 other entities - done by calling .filter on the inner parallel iterator loop.

Benchmarking CSU 10_000 idle entities: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 789.8s or reduce sample count to 10
Benchmarking CSU 10_000 idle entities: Collecting 100 samples in estimated 789.81 s (5050 iterations)

One of my unofficial goals is for the game to work smoothly on a dial up connection - which is 56 Kbps, or about 4,200 bytes per tick (600ms). If the average update per entity was 300 bytes then a player on dial up could only know about around 14 other entities - so update size matters.

Better yet - though - I want updates to be much smaller than that. From second to second players will for the most part barely change at all.

If a player legitimately hasn't changed in any way the diff should be one byte. If they've only changed position - the diff would be around 20 bytes.

So if we estimate around 50 bytes per player when a bunch of players are packed together, a dial up player could fit around 80 other players into a state update - and since most players won't have changed at all on any given tick, the real average should leave comfortable room for our cap of 100.
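Here's that back of the envelope math as a runnable snippet (the per-entity byte sizes above are estimates, not measurements):

fn main() {
    // 56 Kbps dial up = 56,000 bits per second.
    let bytes_per_second = 56_000.0 / 8.0; // 7,000 bytes/s
    let bytes_per_tick = bytes_per_second * 0.6; // 4,200 bytes per 600ms tick

    println!("{} entities at 300 bytes each", (bytes_per_tick / 300.0).floor()); // 14
    println!("{} entities at 50 bytes each", (bytes_per_tick / 50.0).floor()); // 84
}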

Using Take instead of Filter

I experimented with par_join on the inner loop but it was slower. There I was using .filter with an AtomicUsize to count entities and filter out everything after the first 100.

I switched back to join and things were faster (the thread overhead wasn't worth it for only 100 entities).

Then I remembered - oh, I can use take now. My jaw dropped:

The iteration time dropped to 4.6 ms!

Benchmarking CSU 10_000 idle entities: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 234.4s or reduce sample count to 10
Benchmarking CSU 10_000 idle entities: Collecting 100 samples in estimated 234.37 s (5050 iterations)
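To see why take beats the counter-plus-filter approach, here's a standalone sketch: the filter version still pulls all 10,000 inner entities for every client and merely discards the extras, while take stops the iteration as soon as 100 entities have been yielded.

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let entities: Vec<u32> = (0..10_000).collect();

    // Before: .filter with an AtomicUsize visits every entity and just
    // discards everything past the first 100.
    let seen = AtomicUsize::new(0);
    let with_filter: Vec<&u32> = entities
        .iter()
        .filter(|_| seen.fetch_add(1, Ordering::Relaxed) < 100)
        .collect();

    // After: .take(100) short-circuits - iteration ends after 100 entities.
    let with_take: Vec<&u32> = entities.iter().take(100).collect();

    assert_eq!(with_filter, with_take);
}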

HashMap with capacity

I used cargo-flamegraph to dive into what was taking time within the ClientStateUpdaterSystem.

[Image: Flamegraph of the ClientStateUpdaterSystem benchmark] Using cargo-flamegraph to analyze performance.

The number one thing that was taking up time was our HashMap.insert call.

We haven't implemented state diffing yet so I'm anticipating some time will go to that when we eventually have it.

I tried replacing the HashMap with a Vec but there wasn't much of a speedup.

I was ready to stop here - but when I chatted with my dad to let him know about my progress, he asked if there was anything I could do to get it into the 3ms range.

Somehow that question made my brain realize that the reason HashMap -> Vec didn't have much of an impact was that the bulk of the cost wasn't the insertions - it was the allocations as the HashMap grew.

I pre-allocated the HashMap with enough space for 100 entities and dropped the time to 3.96ms.
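A minimal sketch of the idea (the constant and the map's types are assumptions - the real map lives inside ClientWorldState):

use std::collections::HashMap;

/// Hypothetical stand-in for whatever ClientWorldState stores per entity.
struct EntityState;

const MAX_KNOWN_ENTITIES: usize = 100;

fn main() {
    // Before: HashMap::new() starts near-empty and reallocates several times
    // as it grows toward 100 entries - once per state update, per client,
    // per tick.
    // After: a single up-front allocation, zero reallocation while inserting.
    let mut known: HashMap<u32, EntityState> = HashMap::with_capacity(MAX_KNOWN_ENTITIES);
    known.insert(0, EntityState);
    assert!(known.capacity() >= MAX_KNOWN_ENTITIES);
}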

Benchmarking CSU 10_000 idle entities: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 200.1s or reduce sample count to 10
Benchmarking CSU 10_000 idle entities: Collecting 100 samples in estimated 200.06 s (5050 iterations)

Things would probably be faster if I switched to a Vec now - but I'm saving that as future work. No sense spending any more time optimizing now that we've achieved our proof of concept.

Final benchmarking notes

There are more allocations in the ClientStateUpdaterSystem that I can get rid of to drop the runtime down even further - but that isn't a priority right now. We've achieved our goals with this technical feasibility exploration - and can call this strategy of treating NPCs like players viable.

Right now we're optimized for a large number of entities - but in the future we want to also benchmark and optimize for a small number of entities. For example - par_join might not make sense when only 200 players are online, as the overhead of multi-threading could cause a slowdown in that case.

But these aren't problems we need to solve for now.

Overall I'm excited about the implications of this test. We should now be poised to have virtually uncapped goal planning for our NPCs. If things are slow - distribute them across machines. Nice!

Also - we might technically be able to support many more than 5,000 players - which opens up our game design in the sense that we're no longer constrained to that limit and can increase it should we see fit.

Financials

Our financials for December 2019 were:

item                              | cost / earning
----------------------------------|---------------
revenue                           | +$4.99
aws                               | -$111.19
adobe substance                   | -$19.90
GitHub LFS data pack              | -$5.00
photoshop                         | -$10.65
ngrok                             | -$10.00
chinedufn@akigi.com Google email  | -$12.00
CLion yearly subscription         | -$89.00
----------------------------------|---------------
total                             | -$252.75

Okay - I need to cut ngrok and the chinedufn@akigi.com email as I'm legitimately not using them - I'll do that this month.

We bought a CLion subscription since the Rust plugin is fantastic.

Otherwise more of the same.

Other Work Last Week

Trying to get browser tests running in CI

Spent a few hours trying to get the browser test from last week passing in my Docker container (since we use the container in CI) - but couldn't. The canvas's WebGlRenderingContext is null for some reason. I'll revisit this another day - moving on for now.

Unit testing conversations

Ran into an issue playing the game where I could not complete the quest if I had more than one of an item.

Adjusted my integration test tooling to make it easy to try a quest multiple times with different configuration.

But then I realized that validating these smaller variations in the flow was better suited to a unit test - so I added a way to unit test an individual conversation node. I'll continue to use and expand on this initial conversation unit testing infrastructure in the future as we add more dialogue to the game.

/// Fix an issue with our conversation graph where you were expected to have at least one
/// SnailBody in your inventory but we were accidentally only allowing the conversation to
/// progress if you have exactly one SnailBody.
#[test]
fn talk_to_pookie_on_quest_step_20_with_multiple_snail_bodies() {
    let mut run_args = make_conversation_run_args_production_graph();

    for quantity in 1..=2 {
        assert_talk_to_pookie_on_quest_step_20_with_multiple_snail_bodies(&mut run_args, quantity);
    }
}

fn assert_talk_to_pookie_on_quest_step_20_with_multiple_snail_bodies(
    run_args: &mut ConversationSystemRunArgs,
    quantity: u32,
) {
    run_args.set_quest_step(&QuestId::SpookieMosquitoes, 20);
    run_args.add_item_to_inventory(IconName::SnailBody, quantity);

    run_args.set_current_and_advance_to(CONVO_GRAPH_ID, None, None);

    run_convo_system(run_args).unwrap();

    let conversing = run_args.main_action_comp.maybe_conversing().unwrap();
    let current_text = conversing.current_text_and_responses().unwrap();

    assert!(current_text.text().contains(USE_PESTLE_TO_GRIND_UP_SNAIL));
}

Focusing on portability from day one

Added specs to the front-end game app and introduced the RenderSystem.

We're keeping the game-app frontend crate client agnostic.

The client passes in boxed functions to handle target specific functionality.

For example, the web-client crate passes in a Box<dyn FnMut(&RenderJobs) -> () + 'static> function that the game-app calls. The web-client uses WebGL to render the game based on the client agnostic RenderJobs (raw data) it receives. One day we'll have other clients - such as a Mac desktop client - that pass in their own set of ClientSpecificResources.

In this sense we're designing for portability from the beginning - even if we don't plan to port to other platforms anytime soon.

/// Resources that are platform dependent
pub struct ClientSpecificResources {
    /// Powers communicating with the game server backend
    pub game_server_connection: Box<dyn GameServerConn>,
    /// Used by the RenderSystem to render a frame.
    pub renderer: Box<dyn FnMut(&RenderJobs) -> () + 'static>,
}

Next Week

We'll put enough of our autonomous NPC architecture in place to have an NPC that decides when to wander around the world.

For simplicity we'll start off by ignoring its state and basing the decision off of random chance.

Over time we'll rely more and more on state and goal oriented action planning for NPC decision making.

We'll also be working on training the Hitpoints discipline this week. I have some interesting ideas for the mechanics of this training that I want to implement - and I'll be sure to dive into that in next week's journal entry.


Cya next time!

- CFN