Why we suffered our first “crash”


Tornelo does something quite unique in the world of online chess platforms. We have the concept of a Tournament Lobby.

Imagine an OTB event and Tornelo is your Venue. The Tournament Lobby is your Playing Hall. Organisers have a choice; run one giant section/division with one set of pairings and put everyone in the same Playing Hall, or have different Divisions by rating, group or age and put each Division in a different room.

Our first major event last year had 750 players, split into 10 different Playing Halls (Lobbies). At the time this was a risk because we hadn’t run such a large event on the platform. Thankfully this first event was a success, which led to many thousands of events, of all different sizes, being run over the past 7 months.

On Saturday we attempted something new. We hosted a tournament with 1200 players in a single division. All 1200 players in the same playing hall, on the same pairings, able to see all 1198 other players in the room at the same time. Yep, you could be playing on board #600!! Prior to Saturday we’d hosted up to 610 players in one Playing Hall (Lobby), so this was an attempt at doubling the capacity of a Playing Hall.

Why would we attempt this? Most platforms don’t even have a Playing Hall. Other platform venues put 2 players in a small, private room with a chess board. The player walks in, sits down and plays the game. Then, moves to a new room to play a new game. There is no need to view the rest of the players (and in most cases no need to share a real name). Having 1000, 2000 or 5000 (virtual) rooms is no issue at all.

But, we don’t run ‘games’. We host tournaments.

Arbiter-led, scheduled events…. great experiences for players. Imagine the OTB example… you’re in a playing hall with 1000 players, and you feel a buzz! Imagine how different it feels in a small hotel room with one board, playing against one opponent. Then, move to the next room to play your next game. In theory it’s exactly the same, right? It’s still a 1000 player tournament. But, we all know it’s NOTHING ALIKE. There is something magical about being in a tournament hall with hundreds of other players – even if you are only playing against one of them. And that’s the experience we’re trying to recreate online in Tornelo.

What went wrong?

There are things in the real world you just cannot predict. Because we were the first to attempt this feat, there wasn’t even anyone who could share their experiences. I think I made the SAME mistakes in my first large OTB event too! No space for school bags, nowhere for spectators to stand, ‘traffic jams’ in the narrow aisle between the chess boards, not enough copies of the Pairings on the wall, microphone not working.

We experienced some of these problems on Saturday.

Round 1: Spectator server

The 1200 players are also spectators, each able to watch 599 other games while playing 1. That is roughly 1200 x 600 = 720,000 player-to-game connections. The same number of players in 4 divisions is 300 x 150 x 4 = 180,000. We had to deal with 4x the traffic before taking actual spectators into consideration. This ran much slower than expected! Add to the load that it’s a blitz event: the traffic which goes through in 60 minutes for a rapid game is now being transmitted in 10 minutes for each blitz game. So it’s 4x the traffic in one-sixth of the time.
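The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the simplified model in this post (every player can watch every game in their own Playing Hall); the function name and parameters are made up for illustration and are not part of Tornelo.

```python
def spectator_streams(players: int, divisions: int = 1) -> int:
    """Count player-to-game connections when every player can watch every
    game in their own Playing Hall (the simplified model from this post)."""
    players_per_hall = players // divisions
    games_per_hall = players_per_hall // 2
    return players_per_hall * games_per_hall * divisions


single_hall = spectator_streams(1200)              # 1200 x 600
four_halls = spectator_streams(1200, divisions=4)  # 300 x 150 x 4

# Blitz squeezes the moves of a game into about 10 minutes instead of 60.
time_compression = 60 / 10

print(single_hall, four_halls)           # 720000 180000
print(single_hall / four_halls)          # 4.0
print(time_compression)                  # 6.0
```

Under this rough model, 4x the traffic in one-sixth of the time works out to about 24x the per-minute load of a rapid event split into 4 divisions.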

Unfortunately, this overload also contaminated the live-game servers and about 20% of games were really slow to get started.

Solved: During the round we disabled a number of spectator features, returning the server to normal speed before Round 2.

Round 2: Weird error message

A number of players reported a weird “Failed to execute ‘removeChild’ on ‘Node’” error message. Diagnosing this error was a bit of a distraction, as it turned out to be Google’s auto-translate feature interacting badly with the page when a user tried to search on the Standings or Pairings page.

Round 3: White screen of death

When a player first visits a Lobby it can take a second or two to load. During this time the server is sending information to the client and the client renders that information in the browser. This round we had lots of players taking 3 minutes or more to load the page. But why? Once you’re on the Lobby page you should never need to ‘reload’… the new data shows up automatically. Except in this case we’d updated our code and also added an extra server, so all players in the event needed to ‘reload’ the page.

Reloading takes a lot of data. So 1200 players all clicked reload at the same time and guess what happened? The lobby server started running slowly and players were seeing nothing, so guess what they did? Yep, clicked reload…. which made the server run even SLOWER!

Players were now seeing the ‘White screen of death’, which is really just the loading page…. and sitting there for 3+ minutes waiting for the Lobby or their game to appear. Once you get into this loop of reloading and queues on queues, it’s pretty hard to recover without everyone sitting on their hands for 5 minutes.
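The reload loop described above can be illustrated with a toy queue model. This is only a sketch, not how the Tornelo servers work: the capacity of 20 page loads per second, the 10-second patience and the first-come, first-served queue are all made-up assumptions. The one idea it borrows from the story is that a reload abandons the previous request, but the server still spends time answering it.

```python
from collections import deque
from typing import Optional


def players_loaded(seconds: int, players: int, capacity_per_sec: int,
                   patience_sec: Optional[int]) -> int:
    """Return how many players have a loaded Lobby after `seconds`.

    patience_sec=None means nobody ever clicks reload.
    """
    queue = deque((p, 0) for p in range(players))  # (player, request number)
    latest = [0] * players    # newest request number sent by each player
    sent_at = [0] * players   # second at which that newest request was sent
    loaded = [False] * players

    for now in range(1, seconds + 1):
        # The server answers up to capacity_per_sec requests this second.
        for _ in range(min(capacity_per_sec, len(queue))):
            player, request_no = queue.popleft()
            if request_no == latest[player]:
                loaded[player] = True
            # Otherwise the answer goes to a page that was already reloaded.
        if patience_sec is None:
            continue
        # Impatient players click reload and rejoin the back of the queue.
        for player in range(players):
            if not loaded[player] and now - sent_at[player] >= patience_sec:
                latest[player] += 1
                sent_at[player] = now
                queue.append((player, latest[player]))
    return sum(loaded)


patient = players_loaded(300, players=1200, capacity_per_sec=20, patience_sec=None)
impatient = players_loaded(300, players=1200, capacity_per_sec=20, patience_sec=10)
print(patient, impatient)  # 1200 200
```

In this toy model everyone gets in within a minute if nobody touches the reload button. With a 10-second patience only the first 200 players ever load, because every reload sends a player to the back of a queue that is still full of abandoned requests.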

Round 4: Loss of confidence

By this point some Players were getting really annoyed. Annoyed players complain to the arbiters and even if only 10% of players were complaining, that is a FLOOD of messages that overloaded the senses of the organising team. Chess players love any opportunity to complain. But this time, rightly so, their experience was bad. But why so bad? They know it’s a massive event, surely they are expecting delays? What’s the big deal if you need to wait for 3 minutes for your game to load?

Ok, so here let’s go back to our OTB tournament analogy. Let’s imagine that in our massive playing hall there are 600 boards, 1200 players and 1198 spectators standing around each board. And now the arbiter puts the pairings on the wall and makes an announcement “Players, find your places, you may start your opponent’s clock as soon as you get to your board!”

Ouch, this is really pouring gasoline onto the fire! In a theoretical world every player sees the pairings at the same time, every player can click Play Now at the same time and be instantly teleported to their board and so everyone starts together. But, the real world doesn’t work this way. The pairings go up on the wall and some players are in the toilet, or outside chatting, or watching YouTube, or playing a casual blitz game…. then, there is a large queue in front of the pairings and it takes you time to find your name on the page, then there is a bit of a crowd in the aisle on your way to the game and when you finally reach your board and sit down your clock is already ticking! Why? Because your opponent somehow got to the board first.

This seems pretty unfair. As a player you did everything possible: you ran to the pairings, elbowed your way to the front, pushed everyone aside to reach your board and sit down…. and still you’ve lost on time without playing a move!! In Blitz (3+2) this is particularly painful. Especially if the server is running slower than expected (as it was in this case), or a player’s connection is weak, their laptop is old or their mouse clicking skills are slow…. you can imagine any number of situations which delay you by 20 seconds…. and now you are starting with a time disadvantage!

In retrospect we should have updated the tournament settings on Tornelo and forced players to wait until an opponent was present before the clock started. Here are the stats:

  • 60% of all games were started by one player BEFORE the opponent was ready to start (and in 58% of these games the opponent lost on time before they arrived)
  • 19% of all games were started by the Arbiter (before both players were ready to begin)
  • 37% of all games were lost before one of the players arrived at their board
  • 64% of all games ended due to a loss on time
  • Round 1 – 182 games lost on time before one player arrived
  • Round 2 – 173 games lost on time before one player arrived
  • Round 3 – 240 games lost on time before one player arrived
  • In 100% of games, at least one player arrived

With about 80% of all games starting before both players were ready (60% started by one player and 19% by the Arbiter), it’s no wonder the complaints were coming in thick and fast. This was preventable for sure. Maybe the slow server would have meant players never reached their games, or maybe there would be unacceptable lag on the games, but these are different problems!
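The setting we should have used can be written down as a simple rule. The sketch below is purely illustrative: the `Board` class, the `may_start_clock` function and the `require_both_present` flag are hypothetical names, not the actual Tornelo settings or code.

```python
from dataclasses import dataclass


@dataclass
class Board:
    white_present: bool = False
    black_present: bool = False


def may_start_clock(board: Board, requested_by: str,
                    require_both_present: bool) -> bool:
    """Decide whether a request to start the clock is allowed.

    requested_by is "player" or "arbiter" (hypothetical labels).
    """
    if requested_by == "arbiter":
        # Arbiters can always start the clocks themselves if needed.
        return True
    if require_both_present:
        # Safer setting: wait until the opponent has arrived.
        return board.white_present and board.black_present
    # Setting used on Saturday: whoever arrives first may start the clock.
    return board.white_present or board.black_present


only_white_here = Board(white_present=True)
print(may_start_clock(only_white_here, "player", require_both_present=False))   # True
print(may_start_clock(only_white_here, "player", require_both_present=True))    # False
print(may_start_clock(only_white_here, "arbiter", require_both_present=True))   # True
```

With the flag switched on, a player who reaches the board first simply waits, and only the Arbiter can force a start.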

What now?

Despite abandoning this event after just 3 rounds, I’m quite confident that we will be able to host 1200 players or more in a single Lobby at some point in the very near future! We’ve started working on:

  • Changing the way our settings work so it’s not possible to fuel the fire by allowing players to start clocks before the opponent is ready. Arbiters can still start the clocks themselves if needed.
  • Building an option for high-stakes events to use a Dedicated Private Server
  • Ensuring the Spectator server cannot overflow to any game server
  • Continuing to optimise the Lobby so minimal data transfer is required, speeding up the load times

Thank you to everyone who was involved in this event. Please stay ambitious and positive. Don’t shy away from making mistakes. Mistakes are how we learn and improve!

If you want to be the next event to try a 1000+ player Lobby, please let me know!

UPDATE: The event was ‘redesigned’ and restarted the following day – this time in 4 Playing Halls instead of one…. as expected, this 4-event structure ran very smoothly! Our playing halls work really well for up to 600 players, but they still need some renovations to cope with double those numbers!


2 Comments

  • RAJAGOPALAN C V
    June 30, 2021 7:37 pm

    It is amazing how Tornelo is taking the Chess World by storm. It is just like a 2 year old kid participating in an Olympic Long Distance Event. Many major events are on Tornelo now. There are issues like server overload / traffic when the size of an event crosses 900. Up to 600 it is smooth and enjoyable. But at 950+ issues slowly erupt. No one can stop the player from refreshing & reloading when the “Pairing – Gone Live” alert is on the “Chat” screen.

    Zoom is the platform to over see the players on line & whether this is adding fuel to the fire is to be analyzed. With the on coming National Schools Events in Asia & the FIDE cadets events of July 2021 / Aug 2021 – will be the litmus test for Tornelo.

    The local servers by the Event Managers will be one of the best solutions. On the contrary, is it possible to use some space from the players own machine & say 100 Mb or so & update the server once the game is over eventually erasing the local cache data – freeing up the space for the next game.. This will also prevent an appreciable amount of server traffic.

    Another point is the clocks. There are very good handles like pausing, time addition, arbiter token – beautiful features in Tornelo. Here can we simulate the OTB condition. In a OTB black starts white clock & if both are absent an Arbiter starts the clock. Can this feature – a player starting the clock after answering ” Is your opponent – present” affirmative.

    It is great to play or monitor the games played on Tornelo. Still it is a long way to go in establishing Tornelo as a sturdy, reliable & the most dependable Chess Platform.

    Best wishes !!!

    • Thanks for your kind words!

      I agree that we still have some stability issues for 1000+ player divisions. Because they are quite rare, we don’t get to test the performance in the real world and collect data very often.

      The recent issue was resolved and we are really confident of 1000+ players. On Sunday you should see 950 players playing in the one event, and I think there will be no problems.

      Zoom certainly does add complexity – it steals a lot of bandwidth and can make the player connection slow. But as you say, local servers will be a good step to take soon.
