Good-evening once again from the north, it’s been a good day.
I will be pushing a patch tomorrow that seems to fix the bug, which caused certain boards to break with
HardAssert errors. I have also tested some of the broken boards, and 6 of the 7 I tested were able to load again.
Let’s start with caveats. This will not fix all broken boards, tomorrow’s patch only addresses this specific bug, and we don’t yet know if all boards which had the issues are recoverable. If your issue manifested as something other than lots of
HardAssert errors in the logs, then this fix won’t recover your board. However, once this is out, I will start going after other bugs so we will see what can be fixed.
Ok, now the meat.
Many folks were stuck, not being able to get into their boards because of a bug which put this error in the logs:
HardAssertFailure: HandleAddLayout: layout.AssetCount != _dequeuedAssetData.
An assert is a check that must be true for the program to continue. In this case, the presentation for a zone (see yesterday’s dev-log for the details on what those things are) had been told it had to spawn a certain number of assets. However, there was not enough asset data in the queue to satisfy that request. Given that there was no sensible thing to do at that point, it threw an exception.
|Now, I should have written some code to catch that exception, tell you something had gone wrong, and given uploading the board. However, I forgot to add the check for that code path before the release. Sorry about that :|
The first part of the work was what I did yesterday. Removing an optimization that I was concerned could have been hiding the issue due to its complexity. This did work, and this morning, I was able to open a previously broken board.
However, that fix, whilst required, was only part of the issue. The root cause was still there.
TaleSpire has to store a lot of asset info. Each kind of asset has a GUID that identifies it. A GUID is a 16-byte value, which is pretty large and so we don’t want to store more than we have to. One simple thing to do is to group assets of the same kind together; then, we only need one GUID for all of them.
There are other values that can be treated similarly, and so we put these together in a struct called an AssetLayout. Each AssetLayout also has an ID (a uint in this case), which uniquely identifies it in the Zone.
An issue that was occurring was that, when a player deleted some assets that they did not place (or if they were from a freshly loaded board) and then pressed undo, the layouts that were restored were not given new unique Ids. This meant the Zone potentially contained multiple layouts with the same id.
This goes unnoticed at first, but on load of a board, we have to spawn all the assets. The information is sent off to the presentation, and that’s where we hit the issue of layouts not matching the expected numbers of assets, all because our booking had been messed up.
Luckily the undo portion was relatively easy to fix. But we still have broken boards. What we needed was for something to go and clean up the ids of the layouts.
Let’s take a slight diversion. Whenever a place places assets, we add a layout and the asset-data for whatever was just made. Over time, this will result in a lot of small layouts for the same kind of tile. We have to keep them separate at first due to how undo/redo works, but once they are out of the players undo/redo history (for instance, when we save the board), we have an opportunity to defragment this data.
I had not had time to implement this before release, but as luck would have it, this operation is exactly what we need to clean up the layout information in the broken boards.
So an afternoon of code later that is working. For now, I’m running this on load, rather than on save as this will hopefully fix some of the broken boards. However, once the patch has been out for a while, I’ll change it to happen on save instead.
So that’s that! I will review the patch tomorrow, test it with some more boards, and then I’ll get it out to you all via Steam.
Huge thanks to Skye, lollypopunk, Al McWhiggin, Shmuky, and Mercer for sharing their broken boards with me so I could do better testing.
Have a good one folks
Today has gone slow and steady.
In the morning, we wrapped up the second beta patch and got that out. I spent a couple of hours around community conversations and then got down to work.
The goal is fixing the HardAssert errors that are plaguing people. To understand where those are coming from, we have to dive into some implementation details.
TaleSpire spawns a lot of objects, and we have to spawn them on short notice. We have no way to know at what point the player will just decide to drag out 10000 tiles and expect it to just work. Unity cannot spawn that many tiles in one frame, and so we have to spread the load over multiple frames.
The board is divided into 16x16x16 unit zones, and each operates largely independently of the others. This is important as boards have the potential to be huge and so we will want to only spawn the tiles in the area we are in. Also, because of multi-gm support, building could be happening in any other part of the board. This means TaleSpire must allow for zones to be in memory and fully editable without having to have spawned all the tiles in the zone. 
To that end, every zone may have a presentation. The presentation is the visual component of a zone and is what handles spawning tiles from the zone.
One issue with spreading an operation over many frames is that by the time it completes, another operation may have already arrived and have needed to edit the tiles you are spawning. Because of this, the presentation has a queue of changes to be applied. This means an operation makes its change to the data-model and then pushes the required change (and potentially new assets) into the presentation’s queue. We then currently give four milliseconds per frame  for the presentations to apply as many ops as they can.
Doing this, we give ourselves the ability to keep a reasonable frame rate. We don’t do a great job at this right now, but the pieces are in place for this to work .
Now let’s throw in networking. We cannot send all the tile data for each operation across the network; it’s not feasible. What we do instead (much like how RTS handle troop movements) is to send a description of what we need to be done (the operation) and let each client apply that change. As you can imagine, all clients have to agree on the order to apply these, or each will end up with a different end result. The details of that are beyond the scope of this post, but regardless we can talk about the issue that arises from this.
Agreeing on an order, and communicating the operations takes time, but games feel terrible unless actions are immediate. What we can (and do) do is to apply local changes immediately and then wait to see what order is agreed upon. If it turns out, we were already in the right order, then great! If not, we need to effectively revert our change, apply the changes that came before ours, and then reapply ours. Again making this fast in TaleSpire could be its own post, so I will give a TLDR by saying the way we do it involves applying the local changes eagerly to the presentation (and not the data) and then apply the change to the data-model when it arrives.
Finally, we can talk about the bugs :P
The HardAssert issues I’m currently working on are occurring when there is a mismatch on the presentation between the number of tiles the queued operation says needs to be spawned, and the number of tiles in the inbound queue. Clearly, there is something awry here, and so I am attacking one source of complexity in the code that enqueues operations for the presentation. The code in question was meant to act as an optimization in cases when a chunk of tiles is added and removed in the same frame (which can happen in undo/redo if you are super fast). However, given the rarity of that action, and the severity of the issues, I’m removing this in favor of something simpler for me to understand and debug.
Given all of the complexity I mentioned above (and plenty I left out), you can imagine that this has to be done rather carefully.
Today I managed to make the simplification, and so tomorrow I can start looking at the HardAssert errors themselves. I’ll then be talking to people in the community who have offered to let me use their boards as test cases, and we’ll see what we can learn from there.
Now that we have covered some of the background, I can also explain why we haven’t just added local save files, as many have suggested as a temporary fix.
When code in Unity throws an exception, it does not halt the whole program as you might expect. Instead, it travels up the stack to the calling MonoBehaviour and destroys that instead. Now let us imagine that that MonoBehaviour was handling something like applying the messages which have had their order decided. In that case, we can see what will happen, our local changes will be eagerly applied, but when you save and reload, lots of stuff is missing. In fact, everything that was changed after the MonoBehaviour died will be missing as all you were seeing was the presentation of the assets.
Of course, there should be something that freaks out if a message goes missing like that, but clearly, that is not happening. It’s very curious and very likely related (assuming this theory is correct)
We can also suppose that the cases where people see duplicate assets on reload could also be related. Imagine what happens if a delete message is lost. Locally we eagerly apply the change, but the delete message never arrives to change the data. So next time you reload, the tiles are back! In some cases, it’s more subtle as tiles made afterward are still there. Going into that would make this post even longer, so I’ll pass on that for now.
The takeaway from all of this is that all of it should be fixable, I’m not expecting fundamental issues here, just compounded bugs which result in a spectrum of exceedingly aggravating and destructive behavior.
As I progress on this, I’ll keep you abreast of what I’m finding.
Have a good one!
 Unitys new DOTS systems change a lot about the performance of such operations. However, we could not move to it for the Beta as until very recently, their hybrid renderer did not support custom per-instance data, and that is critical to TaleSpire.
 We apply most operations in parallel using Unity’s job system, which means we can make very large data changes very quickly. There is also a lot we can do to improve this more, but that’s a subject for when things are more stable.
 This obviously should be adaptive to the current load, but it isn’t yet.
 This is a recurring theme of the rewrite from the alpha to the Beta. We have tried to put ourselves in a place where the goals we set for Early Access are achievable, even though certain features might not exist right now.
Allo, bug fixing was gone reasonably well today.
This particular issue is not resolved but is mitigated for now.
Since the launch, we have been seeing a very large number of boards fail to sync because the backend refused to acknowledge that the people upload were GMs. The reason the backend thought this was the case was that their session no longer existed in the database.
Each time you connect, you are given a session, and it’s refreshed each time a keep-alive message is received from the game. The game sends keep-alive messages every 5 minutes. If a session is not updated for an hour, it is considered inactive and removed from the DB. It is also removed if TaleSpire signs out.
I guess we could sprinkle the word ‘should’ everywhere in the last paragraph as something clearly isn’t doing what it’s meant to.
The most obvious candidate was the code the removed the old sessions, and so I temporarily disabled it to observe the behavior. This didn’t seem to have an effect, and so I needed to investigate signout as a potential source of issues. However, whilst this is happening, we are obviously losing people’s boards, which is pretty heartbreaking, so we needed something sooner than waiting for data and hoping it illuminated the problem.
Each websocket connection made to the server spawns an Erlang process that is tied to that connection. When the connection dies, the process dies (and vice versa). We can store information along with this process, which allows us to tie information to your connection (the data remains server-side). We always authenticate you before creating the websocket connection, so the process almost represents the session we are interested in.
Ultimately I’ve wanted to move in this direction for a while but have not had time to. However, with things breaking, this afternoon became an exercise in how fast I can code carefully :D
It took about 6 hours to get the DB and server patches written and tested. We deployed at around 18:10 Norway time, and so far, it appears I’d only missed one thing. That was resolved quickly, though.
With that, all of the failures I’ve seen server-side about rights to save are gone. This does not mean all board persistence issues are fixed; it only means that this one cause is being handled. I still need to understand the session issue properly and keep an eye to see how things progress.
However, it does mean that this is no longer the highest priority. The next priority is now the ‘HardAssert’ failures that are corrupting board files. That is my task for tomorrow.
There is also likely to be another patch update either in a few hours or in the morning. We’ll keep ya posted
That’s the lot for now.
Some people have noticed that TaleSpire will sometimes put a ♥ in the system clipboard and have been wondering why.
Our paste strings are a base64 encoded, compressed chunk of data. In order to use it in-game when need to decode and decompress it, which takes time.
Given that people want to paste a lot, we don’t want to have to go through that process every time. So we do the following:
If the system clipboard contains a string that isn’t ♥ we try to process it and paste it.
We store the processed version in an internal clipboard in the format we want
Now that we have done that, we want to use the internal version, but there is still a valid paste string in the system clipboard, which means we have to clear it or compare it to the last string we pasted. Those strings can be large, so it would be nice if we could have some shorter indicator.
We don’t want to clear the clipboard as we also want to give a nice message when a user tries to paste something that isn’t a valid paste string (we need to do this or paste can feel unresponsive, and we’ll get bug reports).
So instead, we put a little one character token into the system clipboard. Something that is rarely going to be there accidentally. I just used the Unicode heart as that felt nice at the time.
This is fast to check, and met the other requirements I had, letting us use the internal clipboard in most cases.
p.s note the above is important even if you aren’t sharing pastes as when you copy we have to populate the clipboard with the string
Well, it’s the first dev log after the beta release and in a surprise to no-one except me I’m exhausted today :D I had a good sleep, but clearly my body knows the release is done and fancies a bit more.
Today I did start looking at board sync issues however, as anything that breaks boards is a high priority.
To aid in the process as we hunt this down, I’ve added a small icon to request a board upload. It’s temporary as we want all persistence of boards to be totally transparent to players, however, obviously that is not the case right now. It also doubles as an indicator that shows you when the game knows there are changes that need to be persisted. It looks like this
After the usual 5 second delay, it will automatically upload, and then the icon will change back to its default state.
I also found a very ugly bug that could cause boards to become corrupted if the upload occurred while switching boards. What would happen is that on completion of an upload, the metadata for the current board (including its id) would be updated. However, uploads are async, so if you created a new board and if the upload took a while (multiple seconds), then it could have concluded when you were already in the new board. This would then overwrite the current board info with info from the previous board. This is daft behavior that is a hangover from the alpha.
I doubt it accounts for many of the issues seen, but it’s certainly one I’m glad we’ll be rid of soon.
I’ve implemented the change to TaleSpire, a related change to the backend, and tested on my local setup. I’m not going to put out a patch tonight as it’s 22:00 here and I have a glass of port that needs a drink and ‘breath of the wild’ that needs playing (also it’s very much tempting fate to push the last thing at night :p)
I’ll probably push this patch tomorrow.
The release was wild. I should probably write an update on just that soon. There were very last second (as in 15 minutes before release) server changes, which have resulted in DNS entries not having propagated in time (a glaring oversight on my part). Given the usual 48-72 hour propagation times, we are hoping to see a drop in the number of cases of the ‘stuck at login screen’ error by the end of Monday. After that, other cases will be a different bug, and we’ll attack that!
I’ll also be going through the github issues during the week. For now, my focus will be on the most board-breaking issues, but in time, we’ll get through them.
Thank you all for the amazing support, both in reporting issues but also just being so good about these bumps as we ride our way through them.
I hope you are all doing well and have been able to get far enough to see where we want to take this thing. Some of the creations we’ve seen you make already have been mind-blowing. This year is going to be great!
Have a great night folks,
Seeya with another update real soon.
Quick one as I’m dead tired and am getting straight to bed after this.
No surprises here; we are working on the beta :p. Bug fixes everywhere. @Ree has been working on the GM blocks, UI, input system, cutscene mode, and more. I’ve been mainly working on fixing issues in creature sync, long-running server services that clean up old sessions & board snapshots, and other stuff that is a blur.
The production servers are up, and we are now testing internally on those. My focus is bug-fixing from now to the release.
Right after some sleep of course.
Yesterday I managed to get ‘unique creatures’ working in TaleSpire again. A unique creature is a creature which has has been marked as special by the GM, it is no longer tied to a single board, and it’s hp, stats, etc. are retained across the campaign.
We use Photon for the realtime networking in TaleSpire. Anyone who is connected to the same board is connected via the same Photon ‘room’. When you are in the same room, you can send and receive RPCs, events, and synchronization messages.
However, if someone is in another room and summons a unique creature, we need to remove it from any board it was already in. We can’t use Photon for that, so instead, we use our backend. When you make tell the backend that a unique creature has entered a board, we broadcast a message using erlbus and then all listeners send a message down to TaleSpire itself. Naturally, you wouldn’t want every client receiving all messages, so we use topics. When you join a campaign, we subscribe your client to a topic named by the GUID for that campaign, and we do the same when you enter a board. This means it’s trivial to send messages to any client regardless of if they are in the same Photon room or not.
All of that appears to be behaving itself (although it could definitely feel better), and so today, I’ve been working on a selection of client-side bugs.
- Torch state not syncing
- GMs couldn’t focus the cutscene camera of players
- GMs couldn’t remove specific dice groups (clear all worked though)
All of these are now behaving again. Next, I need to have a look at tile drop-in and some cases where things remain highlighted after the selection is removed.
I think tomorrow I’ll look at line-of-sight again and hook up the bit that actually hides and reveals creatures based on whether the creature you have selected can see them. Also, I’m 99% sure I’m sampling the cubemap incorrectly when collecting Ids, so I need to look at that and see what I’m screwing up.
Good morning folks,
Yesterday went pretty well. I was able to deploy the production DB, two servers, and show that they are communicating as expected.
One thing that took me a little bit to grok was how the two Erlang nodes were meant to find each other given only the node’s name. This article talks about how epmdless is great, but how you also need something extra when the docker containers are on separate machines. We use epmdless_dist like the article shows, but the example has this code:
epmdless_dist:add_node(‘email@example.com’, 17012). % Add node into epmdless database
And I didn’t understand how erlang would know the address of my other server unless it was assuming that the
@host1.com portion was the name for the server as well as the container. It turns out this confusion was valid, and there are a few variants of the
add_node function. The one I needed was
add_node/3. Here is the spec:
-spec add_node(Node, Host, Port) -> ok when Node :: atom(), Host :: inet:hostname() | inet:ip_address(), Port :: inet:port_number()
After calling that, I could
net_adm:ping(‘firstname.lastname@example.org') from the second server, and it correctly resolved the Erlang node. The two were then correctly connected, and erlbus messages started working across them transparently. Very cool :)
My job for today is to get ‘unique creature’ support back into TaleSpire. It got ripped out during the big engine refactoring as the code was so entwined with how we used to handle creatures. This time it should end up much more straightforward.
Until the next one, Ciao
Time for another little update on what I’m poking at.
This week I’m focusing on server-side work, and the first thing I needed to look at was the new server setup. Previously the servers themselves were in a private subnet with the load-balancer being the only thing that was public-facing. With the move the websockets, I’m no longer using Amazon’s load-balancer, and so now the servers are public-facing again.
I needed a way for messages to be delivered between servers as players might be connected to separate instances in the same AZ. This should be made easy by erlang’s distribution system. However, we also want to make sure that we don’t accidentally connect servers, which shouldn’t be connected.
The first answer to this is to use erlang’s “cookie” setting. The cookie simply stops and two erlang nodes connecting if their cookies don’t match. Knowing that we set the cookie in the settings file before we push the build to aws. We also generate a random name for the node at that point. It’s a bit clumsy but will do the job for now.
One wrinkle is that our REPL is just another Erlang node, and so it’s cookie must match the server’s. To enable this, I wrote some elisp to look at the setting file of the node we are connecting to and extract the details we needed to initiate the connection. With this done, we keep the ability to connect and quickly query details or push changes if required.
I’ve also spent a little time fixing bugs on the client but nothing of significant interest right now.
The next step is to set up the production database, get two production servers set up, and check that the messages being sent across erlbus are being delivered to both nodes. I’ll also need to set up something simple for node discovery. Hopefully, that won’t take long.
Just a small Corona related update to wrap up. I’m definitely starting to come down with something, but I have no idea how much it’s going to interfere with development yet. Hopefully, it won’t, but I’ll keep you posted.
I hope you are doing well,
Back soon with more
Hi again folks!
Work is going well. For the last two days, I’ve been focusing on handling network failures and testing board sync.
Sometimes you just lose connection for a moment; in that case, we want to get you back into the game as fast as possible. It’s quite likely that, since the client dropped out, nothing significant on the board has changed. In those cases, we can quickly confirm we are still up to date, make a couple of small tweaks and carry on as if nothing happened.
If there have been changes to the state of the board that aren’t trivially to reconstruct, we simply reload the board, which will bring your client back to the state it should be.
Another thing that can happen when someone leaves is host migration. When you have multiple people playing in TaleSpire, one client is the ‘host’. The host does the book-keeping that decides the order that all changes to the board happen in. When that client leaves that job moves to another client.
However, what if there were messages that were sent to the host before it dropped out. Did they make it? If they did, have they made it back to the clients yet? How do we know when to check? Then, if we are the sender of a lost message, we need to undo the change it made. Getting that nailed down was another of today’s tasks. I have a first version in, but it could do with more testing.
I also found some smaller bugs in board sync, which are now fixed, and I squashed a few other little things on the way.
Next, I’m finishing off the changes to board-sync to make it handle failures more gracefully, and I also need to have a look at pasting of slabs saved as strings as I think that is misbehaving right now.
Until next time, Ciao!
p.s. I really was tempted to go into more detail about these fixes, but doing so would require a bunch more context, and I’d like to work on a couple more things before calling it a night.