The inefficiency of horizontally scaling websockets with the websocket-rails gem

I'm trying to understand how to horizontally scale a real-time tic-tac-toe web app. I'm using Ruby on Rails to serve the content, websockets to update the state of a game, and Heroku to host it all. Since Heroku's websocket architecture is fairly typical, this question is not specific to Heroku; it applies to websocket-based apps in general...

"The WebSocket protocol introduces state into a generally stateless application architecture. It provides a mechanism for creating persistent connections to a node in a stateless system (e.g. a web browser connecting to a single web process). Because of this, each web process is required to maintain the state of its own WebSocket connections. If application data is shared across processes, global state must also be maintained."

To solve this, Heroku and others recommend using a global message queue to maintain global state among the processes...

"Imagine a chat application that pushes messages from a Redis Pub/Sub channel to all of its connected users. Every web process would have a collection of persistent WebSocket connections open from active users. Each user would not, however, have its own subscription to the Redis channel. The web process would maintain a single connection to Redis, and the state of each connected user would then be updated as incoming messages arrive."

Using my tic-tac-toe app as an example, what if I had to horizontally scale my web app to 10 dynos/processes to handle a large number of users playing my game? User1 connects to my web app and is assigned by the load balancer to dyno1/process1. User2 connects to my web app and is assigned to dyno2/process2. If user1 and user2 subscribe to a private channel using websockets, they will NOT be able to communicate with each other, since they're on separate dynos. To solve this, I could use a global message queue that both dynos subscribe to. That way, when user1 makes a move, I would send that move data, along with the name of the private channel it's associated with, to the global message queue. Then all connected dynos (including dyno2) could broadcast the move data to any of their own clients that subscribe to the named channel. On most dynos, that would be no clients at all.
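To make the setup concrete, here is a minimal sketch of the fan-out I have in mind. It is only a simulation under stated assumptions: `GlobalBus` is an in-memory stand-in for the global message queue (for example a Redis Pub/Sub channel), and `Dyno`, `client_move`, and the channel name `game-42` are hypothetical names I made up for illustration. None of this is the websocket-rails API.

```ruby
# In-memory stand-in for the global message queue (e.g. Redis Pub/Sub).
# All class and method names are hypothetical, for illustration only.
class GlobalBus
  def initialize
    @dynos = []
  end

  def attach(dyno)
    @dynos << dyno
  end

  # Every attached dyno receives every message, whether or not it has
  # a local subscriber for the channel.
  def publish(channel, payload)
    @dynos.each { |dyno| dyno.receive(channel, payload) }
  end
end

class Dyno
  attr_reader :name, :received_count, :delivered

  def initialize(name, bus)
    @name = name
    @bus = bus
    # channel name => users whose websocket is held by THIS process
    @local_subscribers = Hash.new { |hash, channel| hash[channel] = [] }
    @received_count = 0
    @delivered = []
    bus.attach(self)
  end

  def subscribe(user, channel)
    @local_subscribers[channel] << user
  end

  # A locally connected client made a move: hand it to the global queue.
  def client_move(channel, payload)
    @bus.publish(channel, payload)
  end

  # Called for every message on the bus; forwards only to local subscribers.
  def receive(channel, payload)
    @received_count += 1
    @local_subscribers.fetch(channel, []).each do |user|
      @delivered << [user, channel, payload]
    end
  end
end

bus = GlobalBus.new
dynos = (1..10).map { |i| Dyno.new("dyno#{i}", bus) }

dynos[0].subscribe("user1", "game-42") # user1 landed on dyno1
dynos[1].subscribe("user2", "game-42") # user2 landed on dyno2

dynos[0].client_move("game-42", { cell: 4, mark: "X" })

dynos.each do |d|
  puts "#{d.name}: received=#{d.received_count} delivered=#{d.delivered.size}"
end
```

In this simulation all ten dynos receive the move, but only dyno1 and dyno2 have anyone to deliver it to. The other eight do a lookup and discard the message, which is the overhead I'm asking about.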

If that's how it works, am I correct to understand that this global message queue acts as a huge bottleneck that defeats the benefits of horizontal scaling in the first place, since every dyno/process has to process moves by users that aren't even connected to it? In terms of message handling, that's essentially the same as having a single dyno/process handle all the users.

My questions are...

1. Am I understanding this correctly?
2. If not, what am I missing?
3. If so, is there a better solution to horizontally scaling a real-time tic-tac-toe web app?
4. One way I could optimize this solution is to bypass the message queue when two users are connected to the same dyno/process. How can I tell if two users in a private channel are connected to the same dyno/process?
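For question 4, this is roughly the check I imagine, sketched under assumptions: each process keeps its own registry of which users are connected to it per channel, and the process already knows both participants of a game (say, from the game record). `LocalRegistry`, `all_local?`, and `route` are hypothetical names of mine, not something websocket-rails provides as far as I know.

```ruby
require "set"

# Hypothetical per-process registry of locally connected users per channel.
class LocalRegistry
  def initialize
    @members = Hash.new { |hash, channel| hash[channel] = Set.new }
  end

  def join(channel, user)
    @members[channel] << user
  end

  def leave(channel, user)
    @members[channel].delete(user)
  end

  # True when every expected participant is connected to THIS process.
  def all_local?(channel, participants)
    local = @members.fetch(channel, Set.new)
    participants.all? { |user| local.include?(user) }
  end
end

# Decide how a move should travel: straight to local sockets, or through
# the global message queue because someone is on another dyno.
def route(registry, channel, participants)
  registry.all_local?(channel, participants) ? :local_broadcast : :global_queue
end

registry = LocalRegistry.new
registry.join("game-1", "user1")
registry.join("game-1", "user2")
registry.join("game-2", "user3") # user4 is connected to a different dyno

puts route(registry, "game-1", %w[user1 user2]) # both players are local
puts route(registry, "game-2", %w[user3 user4]) # opponent is elsewhere
```

This only tells a process whether everyone is local to it; it says nothing about which other dyno a missing user is on, so anything not fully local would still have to go through the queue.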

I apologize for the long post, but it's a complicated question and that's the shortest I could make it and still feel comfortable getting my point across. Thanks in advance for your wisdom!