Well, it’s the first dev log after the beta release and in a surprise to no-one except me I’m exhausted today :D I had a good sleep, but clearly my body knows the release is done and fancies a bit more.
Today I did start looking at board sync issues however, as anything that breaks boards is a high priority.
To aid in the process as we hunt this down, I’ve added a small icon to request a board upload. It’s temporary as we want all persistence of boards to be totally transparent to players, however, obviously that is not the case right now. It also doubles as an indicator that shows you when the game knows there are changes that need to be persisted. It looks like this
After the usual 5 second delay, it will automatically upload, and then the icon will change back to its default state.
I also found a very ugly bug that could cause boards to become corrupted if the upload occurred while switching boards. What would happen is that on completion of an upload, the metadata for the current board (including its id) would be updated. However, uploads are async, so if you created a new board and if the upload took a while (multiple seconds), then it could have concluded when you were already in the new board. This would then overwrite the current board info with info from the previous board. This is daft behavior that is a hangover from the alpha.
I doubt it accounts for many of the issues seen, but it’s certainly one I’m glad we’ll be rid of soon.
I’ve implemented the change to TaleSpire, a related change to the backend, and tested on my local setup. I’m not going to put out a patch tonight as it’s 22:00 here and I have a glass of port that needs a drink and ‘breath of the wild’ that needs playing (also it’s very much tempting fate to push the last thing at night :p)
I’ll probably push this patch tomorrow.
The release was wild. I should probably write an update on just that soon. There were very last second (as in 15 minutes before release) server changes, which have resulted in DNS entries not having propagated in time (a glaring oversight on my part). Given the usual 48-72 hour propagation times, we are hoping to see a drop in the number of cases of the ‘stuck at login screen’ error by the end of Monday. After that, other cases will be a different bug, and we’ll attack that!
I’ll also be going through the github issues during the week. For now, my focus will be on the most board-breaking issues, but in time, we’ll get through them.
Thank you all for the amazing support, both in reporting issues but also just being so good about these bumps as we ride our way through them.
I hope you are all doing well and have been able to get far enough to see where we want to take this thing. Some of the creations we’ve seen you make already have been mind-blowing. This year is going to be great!
Have a great night folks,
Seeya with another update real soon.
Quick one as I’m dead tired and am getting straight to bed after this.
No surprises here; we are working on the beta :p. Bug fixes everywhere. @Ree has been working on the GM blocks, UI, input system, cutscene mode, and more. I’ve been mainly working on fixing issues in creature sync, long-running server services that clean up old sessions & board snapshots, and other stuff that is a blur.
The production servers are up, and we are now testing internally on those. My focus is bug-fixing from now to the release.
Right after some sleep of course.
Yesterday I managed to get ‘unique creatures’ working in TaleSpire again. A unique creature is a creature which has has been marked as special by the GM, it is no longer tied to a single board, and it’s hp, stats, etc. are retained across the campaign.
We use Photon for the realtime networking in TaleSpire. Anyone who is connected to the same board is connected via the same Photon ‘room’. When you are in the same room, you can send and receive RPCs, events, and synchronization messages.
However, if someone is in another room and summons a unique creature, we need to remove it from any board it was already in. We can’t use Photon for that, so instead, we use our backend. When you make tell the backend that a unique creature has entered a board, we broadcast a message using erlbus and then all listeners send a message down to TaleSpire itself. Naturally, you wouldn’t want every client receiving all messages, so we use topics. When you join a campaign, we subscribe your client to a topic named by the GUID for that campaign, and we do the same when you enter a board. This means it’s trivial to send messages to any client regardless of if they are in the same Photon room or not.
All of that appears to be behaving itself (although it could definitely feel better), and so today, I’ve been working on a selection of client-side bugs.
- Torch state not syncing
- GMs couldn’t focus the cutscene camera of players
- GMs couldn’t remove specific dice groups (clear all worked though)
All of these are now behaving again. Next, I need to have a look at tile drop-in and some cases where things remain highlighted after the selection is removed.
I think tomorrow I’ll look at line-of-sight again and hook up the bit that actually hides and reveals creatures based on whether the creature you have selected can see them. Also, I’m 99% sure I’m sampling the cubemap incorrectly when collecting Ids, so I need to look at that and see what I’m screwing up.
Good morning folks,
Yesterday went pretty well. I was able to deploy the production DB, two servers, and show that they are communicating as expected.
One thing that took me a little bit to grok was how the two Erlang nodes were meant to find each other given only the node’s name. This article talks about how epmdless is great, but how you also need something extra when the docker containers are on separate machines. We use epmdless_dist like the article shows, but the example has this code:
epmdless_dist:add_node(‘email@example.com’, 17012). % Add node into epmdless database
And I didn’t understand how erlang would know the address of my other server unless it was assuming that the
@host1.com portion was the name for the server as well as the container. It turns out this confusion was valid, and there are a few variants of the
add_node function. The one I needed was
add_node/3. Here is the spec:
-spec add_node(Node, Host, Port) -> ok when Node :: atom(), Host :: inet:hostname() | inet:ip_address(), Port :: inet:port_number()
After calling that, I could
net_adm:ping(‘firstname.lastname@example.org') from the second server, and it correctly resolved the Erlang node. The two were then correctly connected, and erlbus messages started working across them transparently. Very cool :)
My job for today is to get ‘unique creature’ support back into TaleSpire. It got ripped out during the big engine refactoring as the code was so entwined with how we used to handle creatures. This time it should end up much more straightforward.
Until the next one, Ciao
Time for another little update on what I’m poking at.
This week I’m focusing on server-side work, and the first thing I needed to look at was the new server setup. Previously the servers themselves were in a private subnet with the load-balancer being the only thing that was public-facing. With the move the websockets, I’m no longer using Amazon’s load-balancer, and so now the servers are public-facing again.
I needed a way for messages to be delivered between servers as players might be connected to separate instances in the same AZ. This should be made easy by erlang’s distribution system. However, we also want to make sure that we don’t accidentally connect servers, which shouldn’t be connected.
The first answer to this is to use erlang’s “cookie” setting. The cookie simply stops and two erlang nodes connecting if their cookies don’t match. Knowing that we set the cookie in the settings file before we push the build to aws. We also generate a random name for the node at that point. It’s a bit clumsy but will do the job for now.
One wrinkle is that our REPL is just another Erlang node, and so it’s cookie must match the server’s. To enable this, I wrote some elisp to look at the setting file of the node we are connecting to and extract the details we needed to initiate the connection. With this done, we keep the ability to connect and quickly query details or push changes if required.
I’ve also spent a little time fixing bugs on the client but nothing of significant interest right now.
The next step is to set up the production database, get two production servers set up, and check that the messages being sent across erlbus are being delivered to both nodes. I’ll also need to set up something simple for node discovery. Hopefully, that won’t take long.
Just a small Corona related update to wrap up. I’m definitely starting to come down with something, but I have no idea how much it’s going to interfere with development yet. Hopefully, it won’t, but I’ll keep you posted.
I hope you are doing well,
Back soon with more
Hi again folks!
Work is going well. For the last two days, I’ve been focusing on handling network failures and testing board sync.
Sometimes you just lose connection for a moment; in that case, we want to get you back into the game as fast as possible. It’s quite likely that, since the client dropped out, nothing significant on the board has changed. In those cases, we can quickly confirm we are still up to date, make a couple of small tweaks and carry on as if nothing happened.
If there have been changes to the state of the board that aren’t trivially to reconstruct, we simply reload the board, which will bring your client back to the state it should be.
Another thing that can happen when someone leaves is host migration. When you have multiple people playing in TaleSpire, one client is the ‘host’. The host does the book-keeping that decides the order that all changes to the board happen in. When that client leaves that job moves to another client.
However, what if there were messages that were sent to the host before it dropped out. Did they make it? If they did, have they made it back to the clients yet? How do we know when to check? Then, if we are the sender of a lost message, we need to undo the change it made. Getting that nailed down was another of today’s tasks. I have a first version in, but it could do with more testing.
I also found some smaller bugs in board sync, which are now fixed, and I squashed a few other little things on the way.
Next, I’m finishing off the changes to board-sync to make it handle failures more gracefully, and I also need to have a look at pasting of slabs saved as strings as I think that is misbehaving right now.
Until next time, Ciao!
p.s. I really was tempted to go into more detail about these fixes, but doing so would require a bunch more context, and I’d like to work on a couple more things before calling it a night.
Alright, so yesterday I working on a few bugs in the code that handles the Unity GameObjects of the tiles (I’ll be referring to this as the ‘presentation’ as opposed the data-model which is separate).
I did a little work, so that undo/redo reuses portions of the
List that holds the game objects. It wasn’t too tricky, but as the presentation adds/removes assets progressively over several frames, I had to take that into account.
I fixed a bug where rotated slabs weren’t be pasted correctly as it was considering the wrong zones for the paste. This was simply because, when considering what zones to paste to, I hadn’t rotated the bounds of the slab.
I fixed a few bugs relating to how, when we have updated the state of a script, we locate the asset so we can update the visuals. In the end, I’ve opted to have a per zone presentation hashmap. This is a bit annoying as I wanted to avoid yet maintaining another data structure if we could instead navigate to the asset another way (even if it was a bit less efficient); however, this will get us out the door.
Fixed a bug where progressive loading was skipping operations that it was meant to apply to the presentation. This was a dumb mistake of just spuriously advancing through the queue in two places. Probably a screw up during refactoring.
Added a bunch of asserts performing sanity checks on the data-model. These are only included in the debug build, so I get crashes nearer the cause of the issue, rather than always having to infer what screwed up.
There was also other stuff, but it gets too into the weeds to cover in a daily dev-log.
Today I’ll be going through a tonne of the code handling the network connections and board synchronization and trying to add sensible behaviors to the failure cases. For example, when we join and try to pull the board from another player:
- what if they aren’t ready to sync
- what if they are already uploading the board
- what if they start, but then that fails
- what if they never receive the message to sync.
It needn’t be too fancy, but we need to have some logic dictating what to do, how many times to retry, and so on.
That’s all for now, Seeya
Things have been going really well.
The rewrites to fix the fundamental bugs are done, and I’ve been chasing down lots of regular bugs, mistake, typos, and other fuckups :)
It feels really nice to be back in a place where each issue is somewhat contained. A bunch are, naturally, related changes for the bug fix; however, as I’ve tracked each one down, it’s been more “Oh, of course” or “Why am I such a muppet”, rather than “Ah shit”. PROGRESS!
My plan for this next week is twofold. Firstly I’m going to spend some time finding and squashing the most obvious bugs, and then I’ll look at getting merged into master.
It’s looking likely that Ree and I are going to be able to meet up again soon. That’s going to be a bunch of fun as we’ll get to work on a lot of very user-facing stuff, and we’ll be inching closer to the build we’ll ship for the beta. Can’t wait!
Ok, that’s all for now.
Have a good one.
I forgot to write this last night, so here is the update from yesterday.
I’ve rewritten copy and paste. I don’t have proper test coverage of these yet so I will need to look into that. First though I want work through the obvious bugs that show up through play. This should keep me busy for a day.
Today work progressed pretty well. I’ve updated the code that handles tile deletion and have moved over to fixing up tests. Copy and paste still need to be updated, but I’m more likely to find fundamental issues in the code I already have, so I definitely want to get that fixed up first.
I’m fixing up small mistakes from the refactor, and once that is done (and all current tests pass), I am going to start adding tests to try and find bugs when multiple builder’s operations interleave. Ideally, I want to use my automated testing code to find the bugs, but there are some limits here.
To perform automated tests, I made a very simplified implementation of the board, and it’s core operations, we call this the model. I then take the model, an instance of the actual board class, and a stream of random board operations. I apply the stream of operations to both the model and the board and then compare the state of each to ensure we got the same results.
The model, as mentioned, is very simplified. One simplification is that it doesn’t support multiple builders. This means that while it’s great for testing against one board, it’s not so useful for testing against two boards that are being manipulated by separate players.
We still can get some useful data, though. If we take a board and have two players add and delete things, it should be the same as if a single player performed those same actions in the same order. This means for those operations, we can use our model. However, undo/redo operate on a player’s own action history and so can’t be modeled the same way.
Regardless of the limitation, it’s still useful to use the automated tester, so I’m definitely going to implement the add/delete case mentioned above.
I should have that working tomorrow if all goes well.