Collaborative AR:
Comparing Approaches

David A. Smith
17 min read · Aug 6, 2020

Bit-identical replicated computation using Croquet.

I think Croquet is the best thing I’ve seen in computing over the last 10 or 15 years. It solves some important and massive problems in extremely elegant ways. It really could become a new kind of operating system for the whole Internet.

– Alan Kay, Turing Award Winning Computer Scientist

Human collaboration mediated by computers is fundamentally about providing a “shared truth” to all participants that they can then explore and act on together. The goal of a collaboration medium is to transmit information and ideas in a way that is not just clear but accurate, while incorporating modifications from any of the participants instantly and unambiguously. Simple examples we use today are a shared whiteboard or text editor.

Augmented Reality will be the most powerful communication medium ever created and will enable far richer and more powerful applications. The AR systems of the near future will enable every application and every object to be shareable and collaborative, including every bit of information, every user action, and every response by the computer. Sophisticated simulations, where we dynamically compute complex interactions and behaviors, including physics and system modeling, will be instantly communicated and explored with other users. In AR, I will see what you see. You will see what I do, as I do it. Collaboration must be a part of the kernel of the future AR-centric operating systems; one should think of this collaboration layer as the missing protocol of the Internet.

Augmented Reality is not just the next wave of computing, it is a fundamental shift in how we will engage with the world and each other. AR will replace your PC, your phone, your tablet. It will be an always on and always on YOU supercomputer. Most important, this powerful platform will revolutionize the way humans communicate with each other and with the computer ecosystem that we will be surrounded by. Collaborative AR is the ultimate symbiosis of human/machine/human.

There are several ways to create collaborative applications, ranging from extremely minimal replicated event systems with virtually no support of shared state, to perfectly replicated shared real-time simulations. The more sophisticated and powerful approaches provide richer collaborative interactions than is possible with the simpler models. Even better, it turns out that these more powerful approaches can also simplify application development.

True Collaborative AR has several requirements:

  1. Instantaneous shared actions. Actions must immediately translate into changes in the shared experience for every participant. Even the slightest latency in interactions ruins the user’s perception of the liveness of the experience. This is particularly true when the participants are face-to-face. Latency should be under 10 milliseconds when we each see each other engaging with the shared world.
  2. Shared state. The participants must see and maintain a shared world. This means that any set of actions by the shared world’s participants must result in the same transformation and view for all of them.
  3. Dynamic join. New users must be able to join a session already in progress at any time. This may not matter for certain applications like Zoom calls where you can get caught up by other participants or a screen share. In multiplayer games, a lobby is often used to collect the participants and then launch them all with the same initial state. However, AR applications will require that a new participant be able to join a dynamic session already in progress, which means they need to replicate and synchronize with the current shared state of the world.
  4. Verification of synchronization. Each of the peers needs a way of verifying that they continue to be synchronized with the others. If they find that they have diverged (gone out of sync), they need a strategy to re-join the shared state.
  5. Rich vocabulary. There must be an unlimited “vocabulary” of messages between systems and users. AR will require an extremely rich collection of these actions and ideas, many of them not yet understood, to ensure a vital and extensible communication medium. Further, unanticipated message types should be incorporated dynamically. AR is not just a consumption platform — it is a live creation/development environment. Virtually any kind of engagement and interaction must be allowed for and made visible and consistent for all participants. Multiplayer games have a limited vocabulary — often, something like “move”, “shoot”, “kill” or “die”. Rarely is there any live manipulation of the world beyond those elements, which means that the complexity of these games is limited. Creating and editing objects, when available, is performed locally. The objects are then injected into the world so any dynamic collaborative interaction and extension is impossible.
  6. Replicated simulation. Replicated, responsive time-based simulation is essential and will form the foundation of rich collaborative interaction. This means that the shared world is more than just a simple state system but can evolve dynamically while responding seamlessly to user events. Perhaps the richest example is a world built on a complex physics simulation — where replicating real-time physics in a multiplayer experience as part of game play is close to impossible with traditional approaches. Future responsive user interfaces will require this. Instantaneous shared simulation is what will enable AR to become the ultimate communication and cognitive exploration tool it is destined to be.

We are not just describing how some applications will work in AR — this is an operating system level capability that ALL applications and interactions will require. It is a computational foundation of AR.

There are many approaches and variations to implement collaboration. We will focus on four of the most common. It is possible for each of these to satisfy at least one of the requirements above in some way, but only one approach satisfies every requirement.

Replicated Events are by far the simplest to implement, but also the most brittle. Events are sent directly to the other users via peer-to-peer networks, or via a simple central server that everyone is connected to (sometimes referred to as a star configuration).

Replicated State is really a replicated database — every user maintains their own version, with various strategies used to achieve and maintain consistency. This typically requires a limit on the complexity of the events, and in the worst case would be unable to ensure consistency without having each user share their entire state with all others.

Centralized State (aka client/server) ensures a “shared truth” by placing it in a single, central server. Events are sent to this server, which computes the required state changes and sends them to the participants so that they can display the results.

Replicated Computation is the most powerful approach, and perhaps the easiest to program with. It is like the star network used in the Replicated Events approach, but each participant runs a bit-identical, deterministic virtual machine. Identical events, received by all participants at identical times according to a shared clock, trigger identical changes in their shared states.

Replicated Events

This approach is the quickest and easiest to create and can be the basis for simple multiplayer games and apps. A user event is broadcast to all the other participants, who all respond to that event in an appropriate way. The simplest replicated event approach is peer-to-peer, where each user broadcasts their events to all other users. The Photon multiplayer engine provides an excellent example of this. You must provide your own authoritative logic to ensure consistency — either with a central server or ad hoc management. This is very doable for simple applications — it is just extremely limited.

Though Replicated Events satisfy the first requirement of True Collaborative AR, instantaneous user actions, the simplicity of the event communication without any central concept for managing replicated state or simulation leads to complexity and increased cost in the management of the event-execution results. Ad hoc approaches must be devised to ensure and maintain the “shared truth” among participants.

This is also an extremely brittle approach to providing a shared world. The most significant problem is that order of events is not guaranteed. If user A sends an event at about the same time that user B sends an event, user C may see A’s event followed by B’s, whereas user D may see B’s event followed by A’s. This would mean that C and D may be out of sync. Even worse, if user A and B apply their own events locally before they send them so that they execute them immediately, then it is even more likely that they will be out of sync with their peers when their events arrive. This Replicated Events approach then requires each application, and perhaps each object within the application, to have a strategy for maintaining consistency across participants.
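The ordering problem is easy to reproduce. The sketch below is only an illustration: the two events (`add_ten` and `double`) and the integer "world" are hypothetical stand-ins for any pair of actions whose effects do not commute.

```python
# Hypothetical events: each one transforms a shared integer "world".
def add_ten(state: int) -> int:
    return state + 10


def double(state: int) -> int:
    return state * 2


def apply_in_order(initial: int, events) -> int:
    """Apply events one after another, as a participant would on receipt."""
    state = initial
    for event in events:
        state = event(state)
    return state


# User C happens to receive A's event first; user D receives B's first.
state_seen_by_c = apply_in_order(1, [add_ten, double])  # (1 + 10) * 2
state_seen_by_d = apply_in_order(1, [double, add_ten])  # (1 * 2) + 10

print(state_seen_by_c, state_seen_by_d)  # 22 12
```

Both users received exactly the same two events, yet they now disagree about the world, and nothing in the event stream itself tells them so.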

A simple fix to the ordering problem is to route all user events through the same point — a star network, where every event is sent through a server at the center of the star and then redistributed to the participants. The star server must process one event at a time, which ensures that all participants receive the same events in the same order. This process must include the local messages as well. That is, A must send its event through the server, which sends it back to A to be acted upon in the same way the other participants would receive that message. Communicating events between the client and server increases latency. This is somewhat offset by locating the server nearer to the participants.
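A minimal sketch of such a star server might look like the following. The class names, the sequence counter, and the in-memory "network" are assumptions made for illustration; a real server would deliver messages over sockets.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    name: str
    log: list = field(default_factory=list)

    def receive(self, seq: int, sender: str, payload: str) -> None:
        # Every participant, including the sender, acts only on server-ordered events.
        self.log.append((seq, sender, payload))


@dataclass
class StarServer:
    participants: list = field(default_factory=list)
    next_seq: int = 0

    def submit(self, sender: str, payload: str) -> None:
        """Process one event at a time: number it, then redistribute it to everyone."""
        seq = self.next_seq
        self.next_seq += 1
        for participant in self.participants:
            participant.receive(seq, sender, payload)


a, b, c = Participant("A"), Participant("B"), Participant("C")
server = StarServer([a, b, c])
server.submit("A", "move")   # A does not apply this locally first
server.submit("B", "paint")

print(a.log == b.log == c.log)  # True
```

The price of the shared order is visible in the code: A only acts on its own event after the round trip through `submit`.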

Some approaches co-locate the server with one of the participants — but this is also a problem, particularly in gaming applications, as it provides an unfair latency advantage to the hosting participant.

In addition, the lack of any model for replicated state or simulation leads to enormous challenges in creating the robust, scalable system required for collaborative AR. It is impossible to maintain shared state without introducing considerable complexity. Developers create ad hoc solutions for each situation that are typically unstable and do not scale. It is also hard or impossible to determine when peers have lost synchronization, presuming they ever had it.

Replicated State

The Replicated State approach transmits changes between users with the goal of “eventual consistency”. When multiple users make concurrent changes, their copies of the state may be different for a time, but eventually all users’ systems will converge. The approach has evolved to solve a relatively narrow problem set, such as document editing, and does a fine job at that. Examples of Replicated State-based applications are Google Docs and Sheets, Office 365, Figma and Trello. These are all document creation/editing tools, so they have well defined requirements and interactions.

An important advantage of this approach is that you can operate immediately on local data and objects. The overhead is in synchronization and stabilization. You will see what you do instantly — the remote participants will take some time.

The best way to think of Replicated State is as multiple, mostly consistent, databases, where although each user event may result in a divergence of the state, over time these databases will converge. There are several variations of the Replicated State approach. Some are based on peer-to-peer communication, such as conflict-free replicated data types (CRDT), and some use a server coordinating the state, such as operational transforms (OT). Some use a simplified CRDT with the aid of a server, which is more robust.
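To make "diverge, then converge" concrete, here is a sketch of one of the simplest well-known CRDTs, a grow-only counter. It is not taken from any of the products named above; the class and replica names are illustrative only.

```python
class GCounter:
    """Grow-only counter: each replica only ever increases its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen for that replica

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the per-replica maximum is commutative, associative and
        # idempotent, so merges can arrive in any order, any number of times.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


alice, bob = GCounter("alice"), GCounter("bob")
alice.increment(2)  # concurrent local edits, each visible locally at once
bob.increment(3)
diverged = (alice.value(), bob.value())  # the replicas disagree for a while

alice.merge(bob)  # state is exchanged at some later time
bob.merge(alice)
print(diverged, alice.value(), bob.value())  # (2, 3) 5 5
```

Note that the only operations available are the ones the data type was designed around (`increment` and `merge`), which is the static vocabulary discussed next.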

Replicated State suffers from several challenges when applied to real-time collaboration for AR. First, it is slow to see what others are doing; its objective is not speed, but consistency. Second, it is difficult to test for divergence. Since the convergence process operates without creating a single, shared state, there is no efficient way to compare state between users. A central server that maintains the “truth” of the world state can help with this. Third, it cannot be used for any kind of simulation. The events and transformations are atomic and transactional. Fourth, its full set of operations must be defined in advance. It has a static vocabulary.

One advantage that can be claimed for this approach is that it enables the merging of offline edits or even edits from multiple disjoint groups into a single shared state. Of course, in today’s connected world, the importance of this capability is diminishing and certainly does not address the literally immediate requirement of live collaboration.

The Replicated State approach can be and is used for shared document editing, which we will certainly do within our future AR world, so it has a place in the AR ecosystem. But there is no way it can be used as the main orchestrated interaction approach.

Centralized State (aka Client/Server)

Centralized State is based on using a server as the authority on the shared “truth” of the world. Each participant is somewhat free to do what they want locally, sending relevant changes to the server, which then integrates them into its own instance of the world state and transmits updates back to the participants. Most AAA multiplayer games use Centralized State, or a world server, to manage the interactions and maintain consistency of their worlds; World of Warcraft, EVE Online, and Fortnite are a few examples. The Improbable game engine is an interesting, large-scale example of a centralized server not just maintaining consistency but itself running dynamic simulations of the world in the cloud.

These multiplayer systems typically employ large server infrastructure to manage the state of the simulation. Interactions from each client are sent to the world server, to arbitrate and integrate into its world state. The server broadcasts world updates to all the clients, which then update their views onto that world. This creates a client/server bottleneck that limits the complexity and frequency of updates that can be handled without overwhelming the server or its communication channels. These systems are usable only for relatively simple user interactions.

The server plays several roles. It arbitrates the validity of the users’ actions, which are either accepted, modified, or denied, and then communicates the results to all users. As an example, a user in a shared multiplayer game might attempt to walk through a wall. The server computes the wall collision and sends an update requiring the user to take a half-step backward. Some of today’s games have this behavior — you move, but shortly afterwards your avatar bounces back. The remote users do not see the bounce — they only see that you stopped in front of the wall and could go no further.
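The wall example can be sketched in a few lines. Everything here is hypothetical: a one-dimensional corridor, a wall at `WALL_X`, and a client that moves optimistically before the server replies.

```python
WALL_X = 5.0  # hypothetical wall position in a one-dimensional corridor


def arbitrate_move(requested_x: float) -> float:
    """Server side: accept the requested position, or clamp it at the wall."""
    return min(requested_x, WALL_X)


class Client:
    def __init__(self, x: float):
        self.x = x

    def predict_move(self, dx: float) -> float:
        """Apply the move locally right away, before the server has answered."""
        self.x += dx
        return self.x

    def reconcile(self, authoritative_x: float) -> None:
        """Adopt the server's answer, rolling back any rejected movement."""
        self.x = authoritative_x


player = Client(x=4.5)
predicted = player.predict_move(1.0)       # the player briefly appears inside the wall
authoritative = arbitrate_move(predicted)  # the server clamps the position
player.reconcile(authoritative)            # the avatar bounces back half a step

print(predicted, authoritative, player.x)  # 5.5 5.0 5.0
```

Remote users only ever receive `authoritative`, which is why they see the player stop at the wall rather than bounce.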

The Centralized State approach is an improvement over both Replicated Events and Replicated State, primarily because it contains the concept of a “shared truth”, which is by definition identical for all participants because it exists on the server, a single shared resource. It also, by definition, provides a model/view architecture: the server maintains the model, and the clients maintain the view. Its biggest challenge is that it must somehow communicate this truth to the participants. This alone dramatically limits the complexity of the interactions and simulations that can be generated and experienced. Shared physics simulations of any complexity are problematic simply because of the enormous bandwidth required to update the simulation on the participants’ machines; aside from simple examples, it is not possible. Local interactions with these simulations are even more difficult, as the results of these interactions cannot easily be shared.

There is also complexity in managing the various state changes that are required on the client by changes in the server: in particular, rollbacks of actions and sometimes of entire local simulations. This often leads to frustrations in that the perception of truth (“I thought I shot you”) does not match up with the server’s authoritative truth (“no, you missed”).

Using this approach, the server requires substantial physical infrastructure, located remotely in large data centers at varying distances from the players. This worsens the latency between user actions and observed outcomes, which comprises the transmission time of an action from user to server, plus the server compute time, plus the return transmission of the result back to all users.

Initiatives to place real-time processing on the wrong side of the latency wall have always been doomed to failure because, even though bandwidth and latency are improving, local computing performance is improving faster.

–Tim Sweeney, Epic Games CEO

Replicated Computation

Replicated Computation enables a much richer medium of interaction and user experience for collaboration. The Croquet TeaTime protocol is the best example of this. The core TeaTime concept is that every participant is running a bit-identical instance of a shared virtual computer — a computer that runs in software emulating a physical computer. These virtual computers run in lock step on all the participating machines using a shared clock, provided by a lightweight external server called a Reflector.

In Croquet TeaTime, these identical virtual computers are running exactly the same program at the same (virtual) time. Even a complex computation such as a physics simulation is guaranteed to proceed identically for all participants, indefinitely. Ensuring that all events capable of affecting the computation, such as user interactions, are processed by all participants in the same order and at the same logical times guarantees that the events affect each participant’s experience in exactly the same way. Perfect replication of even complex, dynamically modified computations does not require the heavyweight state transmission needed by other approaches. This freeing up of bandwidth means that the level of complexity of shared computations and interactions is significantly greater than that supported by any other approach. This delivers an “as real as life” collaborative AR experience.

The Reflector is a simple edge-based server that does not manage any application state. It has two jobs. The first is to generate regular, timestamped heartbeat messages that it sends to all participants, so they all move their computations forward in time at the same rate. The second is to receive an event from a participant, add a timestamp, and then redistribute that timestamped message to all participants, including the sender.

The Reflector ensures that all events are well-ordered and have a unique timestamp. There is no possibility of collisions of user events, or questions about order of operation. The Reflector processes messages using a first in/first out method (FIFO).
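The two jobs of the Reflector can be sketched as a toy, in-memory class. This is not Croquet's implementation or API: the class, its method names, the millisecond clock, and the `seq` tie-breaker used to keep every stamp unique are all assumptions made for illustration.

```python
from collections import deque


class Reflector:
    """Toy stand-in for a Reflector: it stamps and rebroadcasts, and holds no application state."""

    def __init__(self):
        self.now = 0          # logical clock, in assumed milliseconds
        self.seq = 0          # tie-breaker so every message stamp is unique
        self.inbox = deque()  # first in, first out
        self.clients = []     # one delivery callback per participant

    def join(self, deliver):
        self.clients.append(deliver)

    def submit(self, sender, payload):
        self.inbox.append((sender, payload))

    def _broadcast(self, kind, sender=None, payload=None):
        message = {"time": self.now, "seq": self.seq, "kind": kind,
                   "sender": sender, "payload": payload}
        self.seq += 1
        for deliver in self.clients:
            deliver(message)  # the sender is a client too, so it gets its own event back

    def tick(self, elapsed_ms):
        """Advance the clock, flush queued events in arrival order, then send a heartbeat."""
        self.now += elapsed_ms
        while self.inbox:
            sender, payload = self.inbox.popleft()
            self._broadcast("event", sender, payload)
        self._broadcast("heartbeat")


log_a, log_b = [], []
reflector = Reflector()
reflector.join(log_a.append)
reflector.join(log_b.append)

reflector.submit("B", "paint red")
reflector.submit("A", "move left")
reflector.tick(50)

print([m["kind"] for m in log_a])  # ['event', 'event', 'heartbeat']
```

The Reflector never inspects `payload`; it only attaches the stamp, which is why it can stay small and sit close to the users.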

Note that the Reflector only governs the pace of the heartbeats and any interleaved events. The participants’ computations can, if needed, proceed at much finer time granularity. One mechanism in support of this is future messages — messages that will be executed by objects at a specific future time. These messages are placed in a sorted queue that is part of the shared virtual machine. Since the execution of the action that generates a future message is guaranteed to be replicated on all participating virtual machines, the content of this queue of future messages is also perfectly replicated. The virtual machine not only contains the current state of the simulation, it contains the future transformations of that state! The arrival of a timestamped heartbeat or event message from the Reflector is the trigger for executing the computation up to and including that timestamp, including the ordered evaluation of all suitably-timed messages in the queue.
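The future-message mechanism can be approximated with a priority queue that lives inside the replicated state. The sketch below is an assumption-laden toy rather than the TeaTime virtual machine: `VirtualMachine`, `future`, `advance_to`, and the ten-unit step interval are hypothetical names and values chosen for the example.

```python
import heapq


class VirtualMachine:
    """Toy replicated machine: its state includes the queue of future messages."""

    def __init__(self):
        self.time = 0
        self.state = {"ball_x": 0}
        self._queue = []  # heap of (due_time, seq, method_name, arg)
        self._seq = 0     # keeps ordering deterministic when due times tie

    def future(self, delay, method_name, arg=None):
        heapq.heappush(self._queue, (self.time + delay, self._seq, method_name, arg))
        self._seq += 1

    def advance_to(self, timestamp):
        """Run every queued message due up to and including `timestamp`, in order."""
        while self._queue and self._queue[0][0] <= timestamp:
            due, _, method_name, arg = heapq.heappop(self._queue)
            self.time = due
            getattr(self, method_name)(arg)
        self.time = timestamp

    def step(self, _arg):
        self.state["ball_x"] += 1
        self.future(10, "step")  # the simulation reschedules itself

    def handle_event(self, timestamp, payload):
        self.advance_to(timestamp)
        self.state["ball_x"] += payload["push"]


# The same stamped stream, as a Reflector would deliver it to every participant.
stream = [("heartbeat", 25, None), ("event", 40, {"push": 5}), ("heartbeat", 50, None)]

vm_a, vm_b = VirtualMachine(), VirtualMachine()
for vm in (vm_a, vm_b):
    vm.future(10, "step")
    for kind, timestamp, payload in stream:
        if kind == "event":
            vm.handle_event(timestamp, payload)
        else:
            vm.advance_to(timestamp)

print(vm_a.state, vm_b.state)  # {'ball_x': 10} {'ball_x': 10}
```

Only three messages crossed the network, yet each machine executed five simulation steps between them, at a finer granularity than the heartbeat.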

What this also implies is that if we take a snapshot of the virtual machine at any instant and provide it to a newly joining user, that user now has all the state needed to advance the computation, identically to all other users, in response to the Reflector’s ongoing event stream. A new user can dynamically join an accurate complex simulation “on the fly”! Snapshots are also used to capture the state of a suspended session, so that when users rejoin, they can resume the experience exactly where they left off.
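A small sketch shows why a snapshot is sufficient for a late joiner. The `TinyVM` class, its repeating ten-unit tick, and the use of `copy.deepcopy` as the snapshot format are assumptions for illustration; the point is only that the pending queue travels with the state.

```python
import copy
import heapq


class TinyVM:
    """Toy machine with a repeating scheduled tick plus externally stamped events."""

    def __init__(self):
        self.time = 0
        self.total = 0
        self.pending = [(10, 0)]  # heap of (due_time, seq) for the scheduled tick
        self.seq = 1

    def advance_to(self, timestamp):
        while self.pending and self.pending[0][0] <= timestamp:
            due, _ = heapq.heappop(self.pending)
            self.total += 1  # effect of the scheduled tick
            heapq.heappush(self.pending, (due + 10, self.seq))
            self.seq += 1
        self.time = timestamp

    def handle(self, timestamp, amount):
        self.advance_to(timestamp)
        self.total += amount

    def snapshot(self):
        # The snapshot includes the pending queue, not just the visible state.
        return copy.deepcopy(self.__dict__)

    @classmethod
    def from_snapshot(cls, snap):
        vm = cls.__new__(cls)
        vm.__dict__ = copy.deepcopy(snap)
        return vm


veteran = TinyVM()
veteran.handle(15, 3)
veteran.handle(32, 4)

newcomer = TinyVM.from_snapshot(veteran.snapshot())  # joins mid-session

for timestamp, amount in [(47, 1), (60, 2)]:  # the ongoing stamped event stream
    veteran.handle(timestamp, amount)
    newcomer.handle(timestamp, amount)

print(veteran.total, newcomer.total)  # 16 16
```

The newcomer never saw the first two events, but because the snapshot carried the queue of future work, it stays identical to the veteran from that point on.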

The identical operation of the virtual machine for every participant means that a programmer can approach the development of their experience as if programming for a single computer, dramatically simplifying the programming task. Having all the code for user interactions and simulations run on the client is a simpler and, in general, far better performing approach than centralized systems. This is especially important for upcoming AR and VR devices, where zero latency in event and simulation updates will be required.

There are numerous benefits in Croquet TeaTime’s Replicated Computation approach versus the other three approaches:

Minimal server footprint. Near zero server compute cost. No need for centralized compute servers to maintain state. All computation is performed locally on each client device.

No server code. The clients define everything needed — the data servers exist for persistence (save and load) and the Reflectors exist only to manage replicated time and events.

Decentralized. Reflector servers can be placed anywhere on the Internet — most importantly, on edge devices like 5G cells.

Instant server migration. A session’s Reflector can be moved anywhere on the Internet to minimize latency between it and its users, and the switch is performed instantly. This is the 5G edge compute model.

Shortest possible latency. The only latency is round-trip time between client and Reflector. The Reflector does not perform any computation except to add a timestamp to the message that is broadcast to the clients.

Minuscule bandwidth. Only external user interaction events and heartbeat events are distributed to the clients. This is a nominal stream of data, far less than other systems require.

Model/View. The client uses a model/view architecture. The Replicated Computation is defined within the model, while the view responds to changes in the model state. This means that the view can be implemented in almost any way in any system — it just needs to be responsive to the model.

Complexity of simulations. Deterministic simulations can contain any degree of complexity that is locally computable. This enables far more powerful collaborative applications and a far richer user experience.

Broadcast simulation. Croquet TeaTime can simultaneously share any world, view-only, with any number of users. The basis of replication is shared messages (a very lightweight stream), so given a copy of the virtual computer and a data stream from the main participants, observers can explore the world along with those players. Because this is a broadcast message stream, there is no limit to the number of simultaneous viewers.

Secure. TeaTime messages are encrypted end-to-end. Reflectors do not maintain state and have no ability to read the user messages — they only add a timestamp and broadcast the encrypted message to the other participants.

Verification of synchronization. Taking a snapshot of the current state of the virtual machine is also a replicated event which is run on every user’s system. Each participant generates a hash of the state; by confirming that all users have provided the same hash we ensure continuing synchronization.
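The hash comparison in the last item can be sketched as follows. The canonical JSON encoding, the choice of SHA-256, and the majority rule for deciding who has diverged are assumptions made for this example; the text above only requires that all participants report the same hash.

```python
import hashlib
import json
from collections import Counter


def state_hash(state: dict) -> str:
    """Hash a canonical encoding so that equal states always give equal digests."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def out_of_sync(hashes_by_user: dict) -> list:
    """Return the users whose hash differs from the most common one (assumed rule)."""
    majority, _ = Counter(hashes_by_user.values()).most_common(1)[0]
    return sorted(user for user, h in hashes_by_user.items() if h != majority)


reports = {
    "alice": state_hash({"ball_x": 10, "time": 50}),
    "bob": state_hash({"time": 50, "ball_x": 10}),  # same state, different key order
    "carol": state_hash({"ball_x": 11, "time": 50}),  # has diverged
}

print(out_of_sync(reports))  # ['carol']
```

Only the digests need to be exchanged, so the check costs a few dozen bytes per participant regardless of how large the shared world is. A participant flagged this way could then re-join from a fresh snapshot.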

Croquet was designed to power the Augmented Reality collaborative experience. Like any good protocol, Croquet TeaTime is simple but extremely powerful. It is the foundation of the next wave of computing and more importantly of communication.

Not Mutually Exclusive

Each of the four approaches discussed has its own strengths, and they are not necessarily mutually exclusive. The wealth of the cloud in compute resources and data is still a critical part of any interesting platform. Replicated Computation is client side, so it is not particularly good at managing extremely large-scale worlds or extreme simulations on its own. Certain kinds of large-scale multiplayer games will still require large server infrastructure to manage the “world” environment. To be clear, large server infrastructure cannot be used to achieve instant live shared interactions and certainly not True Collaborative AR experiences.

A Replicated Computation model like TeaTime may still require a Replicated State to handle certain kinds of multiplayer interactions. For example, Croquet uses a simplified operational transformation model for managing collaborative text editing. However, TeaTime alleviates most of the burden of maintaining the integrity of the document with multiple users issuing commands. The number of possible combinations among different types of commands is reduced from nine to just four.

The key thing about all the world’s big problems is that they have to be dealt with collectively. If we don’t get collectively smarter, we’re doomed.

–Doug Engelbart

Collaboration and the Augmented Conversation

Augmented Reality is not just the next wave of computing, it is a fundamental shift in how we will engage with the world and each other, and how we will understand and solve the massive problems we are faced with.

AR is the ultimate communication medium. This next generation of computing capability will not just allow us to extend and annotate the real world — far more importantly, it has the potential to allow us to create and explore completely new worlds that we build from scratch as part of our everyday discussions. It enables the Augmented Conversation, where human communication mediated by computers will enable us to dynamically express, share and explore new ideas with each other via live simulations as easily as we talk about the weather. It is essential that AR provide a perfectly replicated experience to all the participants — not only must they be able to view this evolving idea space, but they must be able to directly contribute to it. Every human interaction and response must be immediate, and the shared simulations must be equally responsive.

Human thought is defined more by how we communicate than any other feature. The amplification of human communication by these new symbiotic tools will help us not just in solving the problems of today’s world, but in constructing a far better one. Croquet and TeaTime were created specifically to enable this powerful collaboration/communication capability. Their focus on immediate and perfect “shared truth” removes a fundamental barrier to finally enabling the Augmented Conversation.

You can check out Croquet and the SDK here: https://croquet.io.

Written by David A. Smith

AR, VR, AI, 3D Pioneer. I invented 3D portals and crates in games. I wrote the first 3D adventure/shooter and created Croquet, which redefines collaboration.