From the Burrow

TaleSpire Dev Log 180

2020-05-19 01:19:51 +0000

Hi folks, today I’ve been poking around with fog of war and working out how the updates are going to be spread over multiple frames. I’m going to keep working on this so I can select areas and apply fog to them. For now, the mesh generated will just be a big, grey, Minecraft-looking mesh as we don’t need to worry about the visuals for a while yet.

Ree’s been working on business things today, and his work on the creature-controller has gone well too. There are still some things to work out with creature scaling over various framerates, but it’s getting closer.

Back soon with more!

TaleSpire Dev Log 179

2020-05-15 18:04:21 +0000

Nothing substantial today, I’m afraid. For the last day or so, I’ve had something like writer’s block for coding, so I’ve brought my weekend forward a bit.

I had started looking into fog-of-war, but I’ve got nothing new to report yet.

When I’m rested up, I’ll dig back into a few tile-related bugs that have been reported this week, and then I’ll get back to the fog of war.

Have a good one folks.

TaleSpire Dev Log 178

2020-05-13 15:38:48 +0000

Heya folks,

After shipping the update yesterday, I reviewed all the usages of the data from the boardAsset files in TaleSpire. There were clear patterns in the usage that will allow us to pick better data layouts and to separate the data that is needed in jobs from the data that isn’t. This description is pretty vague, but basically, it’s good news :)

I’m going to play a little with non-burst-compiled jobs in Unity as they are going to be useful in the future when I need to do simple processing on non-blittable types, but don’t want to have to introduce a sync point.

I’ve also started looking back into the bug that forced me to hold off on fog-of-war before the beta release. I had a case where rendering to a cubemap would result in the top and bottom faces being flipped. I didn’t report it at the time (which was dumb on my part), so I spent the first part of today reproducing the issue and submitting it to Unity for review.

You can find the repro that shows the issue here. Luckily it seems this issue does not occur when using a different RenderTexture constructor as reported by this lovely person here. With this knowledge, I can dig back into this and see if I can make more progress on fog-of-war again.

I can also give an update that the last bug we reported to Unity has been successfully reproduced internally, so odds are good that we will see progress on that in the future.

That’s all for now. I’ll probably make a more detailed post on the fog of war implementation once I have some more news.

Peace

TaleSpire Dev Log 177

2020-05-12 02:23:10 +0000

I’ve been a little quiet today, but I’ve been around. I’ve mainly been testing and re-checking the code that handles board format upgrades. I didn’t push out an update today as, during testing, I saw some cases where GM-blocks showed up incorrectly. I’m pretty sure these are pre-existing bugs, but it’s better to delay the release and fix them, rather than take the risk.

Tomorrow I’ll carry on testing the board format upgrading and then put out the release. Initially, the release was to add fog control to atmospheres but it now also includes:

  • A fix which means that when you return to boards, it (finally) remembers where your camera was and where it was facing
  • A fix for one instance of an error where, on rejoining a board that failed to load, you got a ‘failed to correctly assign internal id’ error.

And of course the above GM-block fixes.

After this, I’m aiming to focus on updating the asset format used by TaleSpire. I’ve been talking about doing this for about a year, but other things were always more important. The big goals are to improve load times and to make the asset data accessible to the job system. This will be huge for me as there is far too much complexity and inefficiency in the asset code due to constantly having to repack data in structures amenable to the job system.

Of course, some community projects have started consuming some of this data and, even though use of that data is not supported, I’d rather not break those. What I think we’ll do is make a portion of the data available as a separate JSON file. TaleSpire itself won’t use that file, but it’ll be handy for others. I’ll keep you posted on these changes.

Have a great evening folks,

Peace.

TaleSpire Dev Log 175

2020-04-29 10:52:33 +0000

Yesterday didn’t bring too much new from me. I submitted the potential bug to Unity, pushed out our release, and then spent most of the day catching up with social media and discord posts.

Today I’m going to try staring at the ‘deleted tiles coming back on reload/copy’ bug and see if I can make any headway on it. It’s a real nightmare of a bug as it’s hard to reproduce, and it seems to be about something that isn’t happening. It’s not throwing any apparent errors and doesn’t show up until way after the presumed event. I’m not expecting to find it today, but maybe I can rule out some potential causes.

I hope you have a good day folks!

p.s. A note to people thinking of doing the indie game thing. Do not underestimate the amount of time it takes to write updates, handle support tickets, keep up to date with your community, and do the business stuff. It’s a privilege to have these ‘problems,’ but writing updates will take you longer than coding some features :p The fewer of you there are, the more time you’ll each be spending. Shout-out to Ree & Dwarf who’ve handled all the support tickets and business-ness while doing all their other jobs too.

TaleSpire Dev Log 174

2020-04-28 00:23:03 +0000

Ah, it’s been a day that reminds you that rarely in programming can you ‘just’ do something.

On Saturday, I wanted to push a patch that fixed a bunch of bugs, one of which was related to an SSE4 requirement in the current build.

The fix should have been simple: SSE2 support has been added in later versions of the Burst compiler, so we could just upgrade the package and rebuild.

The package is still in preview, and so there can be bugs. However, sometimes that kind of tradeoff is worth it, especially given that we are still in Beta. If it passed our tests, I’d have been happy.

Alas, we quickly saw issues which only occurred in the Burst compiled jobs. Time to get testing :)

As expected, the commit that caused the issue was the one where we upgraded the packages. We then tested each version of Burst and Unity.Collections to track down the first version that caused the error.

In the release notes for the earliest problematic version we saw that one change was:

Improve codegen for structs with explicit layout and overlapping fields.

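This was a great candidate as we had a type that is essentially a C# union. You can write such structs in C# like this:

using System.Runtime.InteropServices;

// Bar and Baz stand in for any two blittable struct types.
[StructLayout(LayoutKind.Explicit)]
public struct Test
{
    [FieldOffset(0)] public int A;
    // B and C share offset 4, so they overlap like the members of a C union.
    [FieldOffset(4)] public Bar B;
    [FieldOffset(4)] public Baz C;
}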

I took our structs and removed parts until I had a simple case that showed the issue. I then moved it to a separate project, re-confirmed the packages and pulled other Unity versions to make sure it was still an issue.

I’ve made a repo for the test, and tomorrow I’ll report this as a potential issue to Unity.

Now the take-away from this is not to shit on Unity. This is an under-development build of the package explicitly available for finding these kinds of issues. This is just development.

However, it’s a fun example of how ‘just download it’ is almost never really the case in any significant project.

Hope this finds you well. Sorry I’ve not been around the community today; constantly recompiling does not make one that sociable :p

TaleSpire Dev Log 173

2020-04-25 02:29:01 +0000

Hi folks. Recently, all the dev updates have been patch release notes, so I thought I’d drop in to say hi.

Bug fixing has been intense, and we’ve covered a lot of ground. Over 100 tickets are closed on GitHub, and we have no worries about being able to tackle plenty more. After seemingly fixing one of the significant connection issues the other day, I’ve felt pretty sluggish. Feels like my body telling me to take it easy, so I’ve tried to do that for the last two days.

This evening I started pulling exception logs to see what easy wins I could find there. Where possible, we push the first exception we find to the server, and that can be quite a goldmine. It went pretty well, and the next update is going to have a bunch of random fixes in it.

The more exceptions we can remove, the fewer spurious or random bugs we will see reported. Sometimes a bug we find is just the side-effect of some earlier thing breaking. Kind of like if your car flips off the road, it might have had something to do with the wheel that fell off just before :P

The fixes had stuff like:

  • Selections still trying to highlight tiles while switching boards
  • A state-machine for board sync hadn’t registered one of its failure states
  • UI trying to access stuff that had been disposed while transitioning to the main menu

Yeah, a real mix.

Ree has (amongst other things) been looking into a GPU instancing library and Unity’s new ECS. No conclusions reached yet, but we’ll let ya know once there is news there.

We’ve also switched to using some Unity packages, which are still in preview. This is a risk, as they are still finding bugs, but we’ve done it for a couple of reasons:

  1. It allows us to dig into the ECS stuff mentioned above
  2. The preview version of their Burst compiler now seemingly supports fallback to SSE2 instead of requiring SSE4 (which has caused some users issues)

The next patch will have these changes so we can see if that helps.

Alright, it’s rather late, so I’m going to get some sleep.

Peace

TaleSpire Dev Log 172

2020-04-15 23:44:44 +0000

Good evening once again from the north, it’s been a good day.

I will be pushing a patch tomorrow that seems to fix the bug that caused certain boards to break with HardAssert errors. I have also tested some of the broken boards, and 6 of the 7 I tested were able to load again.

Let’s start with caveats. This will not fix all broken boards: tomorrow’s patch only addresses this specific bug, and we don’t yet know if all boards which had the issue are recoverable. If your issue manifested as something other than lots of HardAssert errors in the logs, then this fix won’t recover your board. However, once this is out, I will start going after other bugs so we will see what can be fixed.

Ok, now the meat.

Many folks were stuck, not being able to get into their boards because of a bug which put this error in the logs:

HardAssertFailure: HandleAddLayout: layout.AssetCount != _dequeuedAssetData.

An assert is a check that must be true for the program to continue. In this case, the presentation for a zone (see yesterday’s dev-log for the details on what those things are) had been told it had to spawn a certain number of assets. However, there was not enough asset data in the queue to satisfy that request. Given that there was no sensible thing to do at that point, it threw an exception.
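As a rough illustration of that kind of check (not TaleSpire's actual code), a hard assert can be sketched as below. The names `HardAssertFailure`, `HardAssert.IsTrue`, and `PresentationSketch.HandleAddLayout`, along with its two integer parameters, are assumptions made for this example.

```csharp
using System;

// Hypothetical exception type; the real one is not shown in the post.
public sealed class HardAssertFailure : Exception
{
    public HardAssertFailure(string message) : base(message) { }
}

public static class HardAssert
{
    // Throws when an invariant the program relies on does not hold.
    public static void IsTrue(bool condition, string message)
    {
        if (!condition) throw new HardAssertFailure(message);
    }
}

public static class PresentationSketch
{
    // expectedAssetCount: how many assets the layout says must be spawned.
    // queuedAssetCount: how much asset data is actually waiting in the queue.
    public static void HandleAddLayout(int expectedAssetCount, int queuedAssetCount)
    {
        HardAssert.IsTrue(
            queuedAssetCount >= expectedAssetCount,
            "HandleAddLayout: not enough queued asset data for layout");
        // Spawning would happen here once the check has passed.
    }
}
```

The caller is expected to catch `HardAssertFailure` somewhere sensible and report it, which is the step the post goes on to say was missing.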

Now, I should have written some code to catch that exception, tell you something had gone wrong, and given you the option of uploading the board. However, I forgot to add the check for that code path before the release. Sorry about that.

The first part of the work was what I did yesterday: removing an optimization that I was concerned could have been hiding the issue due to its complexity. This did work, and this morning, I was able to open a previously broken board.

However, that fix, whilst required, was only part of the issue. The root cause was still there.

TaleSpire has to store a lot of asset info. Each kind of asset has a GUID that identifies it. A GUID is a 16-byte value, which is pretty large and so we don’t want to store more than we have to. One simple thing to do is to group assets of the same kind together; then, we only need one GUID for all of them.

There are other values that can be treated similarly, and so we put these together in a struct called an AssetLayout. Each AssetLayout also has an ID (a uint in this case), which uniquely identifies it in the Zone.
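A minimal sketch of the grouping idea, assuming hypothetical field names (`Id`, `AssetKind`, `AssetCount`) and a made-up `LayoutBuilder.Group` helper; the post only says a layout holds the shared GUID, other shared values, and a uint ID unique within the Zone.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a layout: one GUID shared by a group of assets.
public readonly struct AssetLayout
{
    public readonly uint Id;        // unique within the zone
    public readonly Guid AssetKind; // 16 bytes, stored once per group
    public readonly int AssetCount; // how many placed assets share this layout

    public AssetLayout(uint id, Guid assetKind, int assetCount)
    {
        Id = id;
        AssetKind = assetKind;
        AssetCount = assetCount;
    }
}

public static class LayoutBuilder
{
    // Groups a flat list of placed asset kinds into one layout per kind.
    public static List<AssetLayout> Group(IEnumerable<Guid> placedKinds)
    {
        var counts = new Dictionary<Guid, int>();
        var order = new List<Guid>(); // first-seen order keeps the output stable
        foreach (var kind in placedKinds)
        {
            if (!counts.ContainsKey(kind)) { counts[kind] = 0; order.Add(kind); }
            counts[kind]++;
        }

        var layouts = new List<AssetLayout>();
        uint nextId = 0;
        foreach (var kind in order)
            layouts.Add(new AssetLayout(nextId++, kind, counts[kind]));
        return layouts;
    }
}
```

With four placed assets of two kinds, this stores two GUIDs instead of four; the saving grows with the number of assets per kind.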

An issue that was occurring was that, when a player deleted some assets that they did not place (or that came from a freshly loaded board) and then pressed undo, the layouts that were restored were not given new unique IDs. This meant the Zone potentially contained multiple layouts with the same ID.

This goes unnoticed at first, but on load of a board, we have to spawn all the assets. The information is sent off to the presentation, and that’s where we hit the issue of layouts not matching the expected numbers of assets, all because our bookkeeping had been messed up.

Luckily the undo portion was relatively easy to fix. But we still have broken boards. What we needed was for something to go and clean up the ids of the layouts.

Let’s take a slight diversion. Whenever a player places assets, we add a layout and the asset-data for whatever was just made. Over time, this will result in a lot of small layouts for the same kind of tile. We have to keep them separate at first due to how undo/redo works, but once they are out of the player’s undo/redo history (for instance, when we save the board), we have an opportunity to defragment this data.

I had not had time to implement this before release, but as luck would have it, this operation is exactly what we need to clean up the layout information in the broken boards.
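To show why defragmenting also repairs duplicate IDs, here is a small sketch under assumed names (`Layout`, `LayoutDefrag.Defragment`). It merges layouts of the same kind and hands out fresh sequential IDs. It is not the real implementation, which would also have to reorder the per-asset data to match the merged layouts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal layout: a zone-local ID, the asset kind, and a count.
public sealed class Layout
{
    public uint Id;
    public Guid Kind;
    public int AssetCount;
}

public static class LayoutDefrag
{
    // Merges layouts of the same kind and assigns fresh sequential IDs,
    // so duplicate IDs in the input cannot survive into the output.
    public static List<Layout> Defragment(IReadOnlyList<Layout> layouts)
    {
        var byKind = new Dictionary<Guid, Layout>();
        var result = new List<Layout>();
        foreach (var layout in layouts)
        {
            if (byKind.TryGetValue(layout.Kind, out var merged))
            {
                merged.AssetCount += layout.AssetCount;
            }
            else
            {
                merged = new Layout
                {
                    Id = (uint)result.Count, // fresh ID, ignores the old one
                    Kind = layout.Kind,
                    AssetCount = layout.AssetCount
                };
                byKind[layout.Kind] = merged;
                result.Add(merged);
            }
        }
        return result;
    }
}
```

Because the old IDs are discarded rather than trusted, a board whose layouts collided after an undo comes out with consistent bookkeeping.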

So an afternoon of code later that is working. For now, I’m running this on load, rather than on save as this will hopefully fix some of the broken boards. However, once the patch has been out for a while, I’ll change it to happen on save instead.

So that’s that! I will review the patch tomorrow, test it with some more boards, and then I’ll get it out to you all via Steam.

Huge thanks to Skye, lollypopunk, Al McWhiggin, Shmuky, and Mercer for sharing their broken boards with me so I could do better testing.

Have a good one folks

TaleSpire Dev Log 171

2020-04-15 00:53:18 +0000

Today has gone slow and steady.

In the morning, we wrapped up the second beta patch and got that out. I spent a couple of hours around community conversations and then got down to work.

The goal is fixing the HardAssert errors that are plaguing people. To understand where those are coming from, we have to dive into some implementation details.

TaleSpire spawns a lot of objects, and we have to spawn them on short notice. We have no way to know at what point the player will just decide to drag out 10000 tiles and expect it to just work. Unity cannot spawn that many tiles in one frame[0], and so we have to spread the load over multiple frames.

The board is divided into 16x16x16 unit zones, and each operates largely independently of the others. This is important as boards have the potential to be huge and so we will want to only spawn the tiles in the area we are in. Also, because of multi-gm support, building could be happening in any other part of the board. This means TaleSpire must allow for zones to be in memory and fully editable without having to have spawned all the tiles in the zone. [1]
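For illustration, mapping a position to its zone can be sketched like this. `ZoneCoord` and `FromPosition` are hypothetical names, and the sketch assumes the zone grid is aligned to the world origin, which the post does not state.

```csharp
using System;

public readonly struct ZoneCoord : IEquatable<ZoneCoord>
{
    public const int ZoneSize = 16; // zones are 16x16x16 units

    public readonly int X, Y, Z;

    public ZoneCoord(int x, int y, int z) { X = x; Y = y; Z = z; }

    // Floor rather than truncate, so negative positions land in the right zone.
    public static ZoneCoord FromPosition(float x, float y, float z) =>
        new ZoneCoord(
            (int)Math.Floor(x / ZoneSize),
            (int)Math.Floor(y / ZoneSize),
            (int)Math.Floor(z / ZoneSize));

    public bool Equals(ZoneCoord other) => X == other.X && Y == other.Y && Z == other.Z;
    public override bool Equals(object obj) => obj is ZoneCoord other && Equals(other);
    public override int GetHashCode() => (X * 397 ^ Y) * 397 ^ Z;
}
```

A coordinate like this works as a dictionary key, so only the zones near the player need a spawned presentation while the rest stay as data.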

To that end, every zone may have a presentation. The presentation is the visual component of a zone and is what handles spawning tiles from the zone.

One issue with spreading an operation over many frames is that by the time it completes, another operation may have already arrived and have needed to edit the tiles you are spawning. Because of this, the presentation has a queue of changes to be applied. This means an operation makes its change to the data-model and then pushes the required change (and potentially new assets) into the presentation’s queue. We then currently give four milliseconds per frame [2] for the presentations to apply as many ops as they can.
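The time-budget idea can be sketched as follows. `PresentationQueue` and `ApplyWithinBudget` are assumed names, and this version handles a single queue, whereas the post describes a budget shared between presentations.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class PresentationQueue
{
    private readonly Queue<Action> _pending = new Queue<Action>();

    public int Count => _pending.Count;

    // The data-model has already been changed; this queues the visual work.
    public void Enqueue(Action applyToPresentation) => _pending.Enqueue(applyToPresentation);

    // Applies queued ops until the budget is spent, then returns so the
    // frame can finish. At least one op is applied per call so that work
    // cannot stall, even with a tiny budget.
    public int ApplyWithinBudget(TimeSpan budget)
    {
        var timer = Stopwatch.StartNew();
        int applied = 0;
        while (_pending.Count > 0)
        {
            _pending.Dequeue().Invoke();
            applied++;
            if (timer.Elapsed >= budget) break;
        }
        return applied;
    }
}
```

Calling `ApplyWithinBudget(TimeSpan.FromMilliseconds(4))` once per frame gives the four-millisecond behaviour described above; anything left over simply waits for the next frame.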

Doing this, we give ourselves the ability to keep a reasonable frame rate. We don’t do a great job at this right now, but the pieces are in place for this to work [3].

Now let’s throw in networking. We cannot send all the tile data for each operation across the network; it’s not feasible. What we do instead (much like how RTS games handle troop movements) is to send a description of what we need to be done (the operation) and let each client apply that change. As you can imagine, all clients have to agree on the order to apply these, or each will end up with a different end result. The details of that are beyond the scope of this post, but regardless we can talk about the issue that arises from this.

Agreeing on an order and communicating the operations takes time, but games feel terrible unless actions are immediate. What we can (and do) do is apply local changes immediately and then wait to see what order is agreed upon. If it turns out we were already in the right order, then great! If not, we need to effectively revert our change, apply the changes that came before ours, and then reapply ours. Again, making this fast in TaleSpire could be its own post, so I will give a TLDR by saying the way we do it involves applying the local changes eagerly to the presentation (and not the data) and then applying the change to the data-model when it arrives.
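A toy model of this eager-apply scheme is sketched below. The state is a single integer and the names (`Op`, `OptimisticClient`, `ApplyLocal`, `OnOrdered`, `Visible`) are invented for the example; the real system works on tile data and is far more involved.

```csharp
using System;
using System.Collections.Generic;

// An operation is modelled as a hypothetical (id, apply) pair over a simple state.
public sealed class Op
{
    public readonly int Id;
    public readonly Func<int, int> Apply;
    public Op(int id, Func<int, int> apply) { Id = id; Apply = apply; }
}

public sealed class OptimisticClient
{
    private int _confirmed; // the "data-model": state after all agreed ops
    private readonly List<Op> _pendingLocal = new List<Op>();

    public OptimisticClient(int initial) { _confirmed = initial; }

    // The "presentation": agreed state plus local ops not yet ordered.
    public int Visible
    {
        get
        {
            int state = _confirmed;
            foreach (var op in _pendingLocal) state = op.Apply(state);
            return state;
        }
    }

    // Show the local change immediately so the action feels instant.
    public void ApplyLocal(Op op) => _pendingLocal.Add(op);

    // Called for every op, local or remote, in the order all clients agreed on.
    public void OnOrdered(Op op)
    {
        _confirmed = op.Apply(_confirmed);
        _pendingLocal.RemoveAll(p => p.Id == op.Id); // ours has now arrived
    }
}
```

Because `Visible` is always rebuilt from the confirmed state, a remote op that gets ordered ahead of ours is effectively "revert, apply theirs, reapply ours" without any explicit undo step. It also shows the failure mode: if an ordered op never arrives, the presentation and the data quietly disagree.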

Finally, we can talk about the bugs :P

The HardAssert issues I’m currently working on are occurring when there is a mismatch on the presentation between the number of tiles the queued operation says needs to be spawned, and the number of tiles in the inbound queue. Clearly, there is something awry here, and so I am attacking one source of complexity in the code that enqueues operations for the presentation. The code in question was meant to act as an optimization in cases when a chunk of tiles is added and removed in the same frame (which can happen in undo/redo if you are super fast). However, given the rarity of that action, and the severity of the issues, I’m removing this in favor of something simpler for me to understand and debug.

Given all of the complexity I mentioned above (and plenty I left out), you can imagine that this has to be done rather carefully.

Today I managed to make the simplification, and so tomorrow I can start looking at the HardAssert errors themselves. I’ll then be talking to people in the community who have offered to let me use their boards as test cases, and we’ll see what we can learn from there.

Now that we have covered some of the background, I can also explain why we haven’t just added local save files, as many have suggested as a temporary fix.

When code in Unity throws an exception, it does not halt the whole program as you might expect. Instead, it travels up the stack to the calling MonoBehaviour and destroys that instead. Now let us imagine that that MonoBehaviour was handling something like applying the messages which have had their order decided. In that case, we can see what will happen: our local changes will be eagerly applied, but when you save and reload, lots of stuff is missing. In fact, everything that was changed after the MonoBehaviour died will be missing as all you were seeing was the presentation of the assets.

Of course, there should be something that freaks out if a message goes missing like that, but clearly, that is not happening. It’s very curious and very likely related (assuming this theory is correct).

We can also suppose that the cases where people see duplicate assets on reload could also be related. Imagine what happens if a delete message is lost. Locally we eagerly apply the change, but the delete message never arrives to change the data. So next time you reload, the tiles are back! In some cases, it’s more subtle as tiles made afterward are still there. Going into that would make this post even longer, so I’ll pass on that for now.

The takeaway from all of this is that all of it should be fixable. I’m not expecting fundamental issues here, just compounded bugs that result in a spectrum of exceedingly aggravating and destructive behavior.

As I progress on this, I’ll keep you abreast of what I’m finding.

Have a good one!

Peace.

[0] Unity’s new DOTS systems change a lot about the performance of such operations. However, we could not move to it for the Beta as, until very recently, their hybrid renderer did not support custom per-instance data, and that is critical to TaleSpire.

[1] We apply most operations in parallel using Unity’s job system, which means we can make very large data changes very quickly. There is also a lot we can do to improve this more, but that’s a subject for when things are more stable.

[2] This obviously should be adaptive to the current load, but it isn’t yet.

[3] This is a recurring theme of the rewrite from the alpha to the Beta. We have tried to put ourselves in a place where the goals we set for Early Access are achievable, even though certain features might not exist right now.

TaleSpire Dev Log 170

2020-04-13 18:48:03 +0000

Allo, bug fixing has gone reasonably well today.

This particular issue is not resolved but is mitigated for now.

Since the launch, we have been seeing a very large number of boards fail to sync because the backend refused to acknowledge that the people uploading were GMs. The reason the backend thought this was the case was that their session no longer existed in the database.

Each time you connect, you are given a session, and it’s refreshed each time a keep-alive message is received from the game. The game sends keep-alive messages every 5 minutes. If a session is not updated for an hour, it is considered inactive and removed from the DB. It is also removed if TaleSpire signs out.
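The intended session rules can be written down as a small sketch. The real backend is not written in C#, and `SessionStore` with its methods is a made-up illustration of the rules as described, with an in-memory dictionary standing in for the database.

```csharp
using System;
using System.Collections.Generic;

public sealed class SessionStore
{
    // One hour without a keep-alive marks a session inactive (per the post).
    public static readonly TimeSpan Timeout = TimeSpan.FromHours(1);

    private readonly Dictionary<string, DateTime> _lastSeen = new Dictionary<string, DateTime>();

    // Called on connect and on each keep-alive (sent every 5 minutes).
    public void Touch(string sessionId, DateTime now) => _lastSeen[sessionId] = now;

    // Called when the client signs out.
    public void SignOut(string sessionId) => _lastSeen.Remove(sessionId);

    public bool IsActive(string sessionId, DateTime now) =>
        _lastSeen.TryGetValue(sessionId, out var seen) && now - seen < Timeout;

    // Periodic cleanup: drops every session not refreshed within the timeout.
    public int RemoveExpired(DateTime now)
    {
        var expired = new List<string>();
        foreach (var pair in _lastSeen)
            if (now - pair.Value >= Timeout) expired.Add(pair.Key);
        foreach (var id in expired) _lastSeen.Remove(id);
        return expired.Count;
    }
}
```

With a keep-alive every 5 minutes and a one-hour timeout, a healthy client should have eleven or so chances to refresh before expiry, which is why sessions vanishing points at something else misbehaving.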

I guess we could sprinkle the word ‘should’ everywhere in the last paragraph as something clearly isn’t doing what it’s meant to.

The most obvious candidate was the code that removed the old sessions, and so I temporarily disabled it to observe the behavior. This didn’t seem to have an effect, and so I needed to investigate signout as a potential source of issues. However, whilst this is happening, we are obviously losing people’s boards, which is pretty heartbreaking, so we needed something sooner than waiting for data and hoping it illuminated the problem.

Each websocket connection made to the server spawns an Erlang process that is tied to that connection. When the connection dies, the process dies (and vice versa). We can store information along with this process, which allows us to tie information to your connection (the data remains server-side). We always authenticate you before creating the websocket connection, so the process almost represents the session we are interested in.

Ultimately I’ve wanted to move in this direction for a while but have not had time to. However, with things breaking, this afternoon became an exercise in how fast I can code carefully :D

It took about 6 hours to get the DB and server patches written and tested. We deployed at around 18:10 Norway time, and so far, it appears I’d only missed one thing. That was resolved quickly, though.

With that, all of the failures I’ve seen server-side about rights to save are gone. This does not mean all board persistence issues are fixed; it only means that this one cause is being handled. I still need to understand the session issue properly and keep an eye to see how things progress.

However, it does mean that this is no longer the highest priority. The next priority is now the ‘HardAssert’ failures that are corrupting board files. That is my task for tomorrow.

There is also likely to be another patch update either in a few hours or in the morning. We’ll keep ya posted.

That’s the lot for now.

Peace
