Updates to Repository Sync Semantics

Published on: Aug 24, 2023

We’re excited to announce that we’re rolling out a new version of atproto repositories that removes history from the canonical structure of repositories, and replaces it with a logical clock. We’ll start rolling out this update next week (August 28, 2023).

For most developers with projects subscribed to the firehose, such as feed generators, this change shouldn’t affect you. These changes will only affect you if you’re doing commit-aware repo sync (a good rule of thumb is whether you’ve ever passed earliest or latest to the com.atproto.sync.getRepo method) or are explicitly checking the repo version when processing commits.

Removing Repository History

Repositories on the AT Protocol are like Git repositories, but for structured records. Just like Git, each commit to an atproto repository currently includes a pointer to the previous commit. However, this approach has caused a couple of pain points:

  • Record deletions are difficult to process. If a user deletes a record, that commit needs to be erased from their repository to match their intent.
  • Increased storage cost. Maintaining repo history can cause a 5-10x increase in repo size.

We attempted to resolve both of these in the current model through rebases (discrete moments when the history of a repository is deleted/mutated, like in Git). However, this is a tricky and sensitive operation that is expensive to conduct and complex to communicate across the network.

Using a Logical Clock for Repositories

To address the above issues, we’re replacing the prev pointer in commits with a logical clock. We originally published our intention to do so a few weeks ago. These are the changes we’re making to the way we handle repository history:

  • Incrementing the repo version to 3
  • Making the prev field on repo commits optional
  • Adding a new required rev (revision) field which is a logical clock
  • Removing or adjusting commit-aware repo sync mechanisms

Note: If you explicitly verify the version of a repo commit or do strict type checking on repo commits (which you shouldn’t, since the spec allows unspecified fields!), you will need to make that check inclusive of version 3.
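As a rough illustration of the note above, a lenient commit check might look like the following. This is a minimal sketch, not the reference implementation: the commit is assumed to already be decoded into a dict, and check_commit and SUPPORTED_VERSIONS are hypothetical names chosen for this example.

```python
# Assumption: this consumer still accepts v2 commits during the transition.
SUPPORTED_VERSIONS = {2, 3}


def check_commit(commit: dict) -> None:
    """Raise ValueError if a decoded commit is not usable by a v2/v3-aware consumer."""
    version = commit.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported repo version: {version!r}")
    # rev is required from v3 onwards.
    if version >= 3 and not isinstance(commit.get("rev"), str):
        raise ValueError("v3 commits must carry a rev")
    # prev is optional and only a hint, so it is not checked here.
    # Unknown extra fields are deliberately ignored rather than rejected.
```

The check only looks at the fields it needs, so a commit that carries additional, unspecified fields still passes.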

To facilitate backwards compatibility with software that is still running repo v2, we will continue setting the prev field on commits in the interim.

Even though we are setting the prev field, this can be considered a “hint” and the history is no longer considered a canonical part of the repository.


Repository Revisions

The new sync semantics for the repository rely on a logical clock included in each signed commit.

This “revision” takes the form of a TID and must be monotonically increasing.

The included revision serves a few functions:

Ordering

The clock provides a simple ordering mechanism for encountered repos or commits. If a consumer encounters the same repo from two different sources, each with a valid signature and structure, the revision gives a simple mechanism to determine which is the most recent repository.
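The ordering rule can be sketched in a few lines. The example assumes that revs are TIDs in their usual string form (13 characters from a sortable base32 alphabet), so that plain string comparison matches clock order. It also assumes both repos have already passed signature and structure checks. The names is_tid and most_recent are hypothetical.

```python
# Assumption: TIDs are 13-character strings over this sortable base32 alphabet,
# so lexicographic order of the strings matches the order of the clock values.
TID_ALPHABET = "234567abcdefghijklmnopqrstuvwxyz"
TID_LENGTH = 13


def is_tid(value: str) -> bool:
    """Loose syntactic check for a TID string."""
    return len(value) == TID_LENGTH and all(ch in TID_ALPHABET for ch in value)


def most_recent(a: dict, b: dict) -> dict:
    """Return whichever of two already-verified commits has the later rev."""
    for commit in (a, b):
        if not is_tid(commit["rev"]):
            raise ValueError(f"not a TID: {commit['rev']!r}")
    return a if a["rev"] >= b["rev"] else b


from_relay = {"rev": "3jzfcijpj2z2a", "source": "relay"}
from_pds = {"rev": "3jzfcijpj2z2b", "source": "pds"}
winner = most_recent(from_relay, from_pds)  # the copy fetched from the PDS
```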

Sync

When syncing a repository, revisions give a series of signposts that allow you to request everything from a given repo since a previously seen version. Because revisions are ordered and monotonically increasing, the provider does not necessarily need the exact revision that the consumer is asking for (as it would with a commit hash). Instead, they can provide all repo contents since the latest version of the repo they remember that precedes the requested revision.

The PDS for instance will track the revision at which each repo block or record was introduced into a repository. If a consumer asks for every block or record since a given revision, the PDS has a simple mechanism by which to give that information, without needing a complicated sync algorithm.
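To make the bookkeeping concrete, here is a minimal in-memory sketch of such an index. It is not how the PDS actually stores this information. BlockIndex and its methods are hypothetical, CIDs are shortened to placeholder strings, and revs are assumed to be TID strings that sort lexicographically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BlockIndex:
    """Hypothetical index mapping a block CID to the rev that introduced it."""

    introduced_at: Dict[str, str] = field(default_factory=dict)

    def put(self, cid: str, rev: str) -> None:
        # Keep the earliest rev: a block reused by a later commit is not new.
        self.introduced_at.setdefault(cid, rev)

    def since(self, rev: Optional[str]) -> List[str]:
        """CIDs introduced strictly after rev, or every CID when rev is None."""
        if rev is None:
            return sorted(self.introduced_at)
        return sorted(cid for cid, r in self.introduced_at.items() if r > rev)


index = BlockIndex()
index.put("cid-a", "3jzfcijpj2z2a")
index.put("cid-b", "3jzfcijpj2z2c")
index.put("cid-a", "3jzfcijpj2z2c")  # reused block, original rev is kept
new_blocks = index.since("3jzfcijpj2z2a")
```

Answering "everything since rev X" then reduces to a single filter over the index, with no need to walk commit history.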

Stale Reads

Finally, a logical clock on the repo gives us a mechanism through which we can detect stale reads. (We actually already snuck this in with an optional revision field on v2 repos!)

Repo revisions may be returned in response headers to most requests. A client will know their own repo’s current revision and can compare that with the upstream service’s revision.

We use this today on the PDS to paper over some read-after-write concerns that are inherent in eventually consistent architectures. Some clients may use these headers to alert their users that their PDS is “out of sync” with other services in the network (for instance an AppView).
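A client-side staleness check along these lines could look like the sketch below. The header name is an assumption, since this post does not name it. upstream_is_stale is a hypothetical helper, and revs are again assumed to be TID strings that compare lexicographically.

```python
from typing import Mapping

# Assumed header name (not stated in this post); compared case-insensitively.
REPO_REV_HEADER = "atproto-repo-rev"


def upstream_is_stale(local_rev: str, headers: Mapping[str, str]) -> bool:
    """True if the upstream service reports a rev older than the one we hold."""
    lowered = {name.lower(): value for name, value in headers.items()}
    upstream_rev = lowered.get(REPO_REV_HEADER)
    if upstream_rev is None:
        # No rev reported, so nothing can be concluded about staleness.
        return False
    return upstream_rev < local_rev
```

A client could use a True result to show an "out of sync" notice or to overlay its own recent writes on the response.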

Available sync methods

Below is an enumeration of the available sync methods in the com.atproto.sync namespace along with the changes entailed in this repo update and their deprecation status.

getRepo

This is the primary RPC sync method. It allows a consumer to download an entire copy of a repository. Optionally, it allows them to signal the last revision they saw so that the provider may be able to send less data.

Changes

  • Remove optional latest & earliest params
  • Add optional since param (rev of the last seen commit)

Backwards-compatibility path

  • If a consumer sends latest or earliest, they are simply ignored & the consumer will get the full copy of the repo

Implementation notes

  • With the optional since param, there is no expectation that a service provides only the blocks created since that rev. We call this a “coarse diff” as additional blocks may be provided.
  • The PDS has a simple way of calculating blocks since some rev. If a service has no such mechanism, it is free to send the entire repository along.
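From the consumer side, the notes above suggest two things: pass since when you have a previously seen rev, and tolerate receiving blocks you already hold. The following is a sketch under those assumptions. The host and DID are placeholders, and get_repo_url and merge_blocks are hypothetical helpers; parsing the CAR response is out of scope here.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def get_repo_url(host: str, did: str, since: Optional[str] = None) -> str:
    """Build a getRepo request URL, adding since only when a rev is known."""
    params = {"did": did}
    if since is not None:
        params["since"] = since
    return f"https://{host}/xrpc/com.atproto.sync.getRepo?{urlencode(params)}"


def merge_blocks(store: Dict[str, bytes], received: Dict[str, bytes]) -> int:
    """Merge a coarse diff into a local block store; return how many were new."""
    added = 0
    for cid, data in received.items():
        if cid not in store:
            # Blocks we already hold are expected in a coarse diff and skipped.
            store[cid] = data
            added += 1
    return added


url = get_repo_url("pds.example.com", "did:plc:abc123", since="3jzfcijpj2z2a")
```

Because the merge is idempotent, it behaves the same whether the provider sends a tight diff or the entire repository.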

subscribeRepos

This is the primary streaming sync method. It provides a stream of repo commits and their related diffs.

Changes

  • Added new required rev field to the commit event (rev of the current commit)
  • Added new required since field to the commit event (_previously_ emitted rev for the repo of the current commit)
  • We no longer send out rebase events (though they are still technically supported in the schema)
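One way a consumer might use the new rev and since fields is to detect missed events. The sketch below assumes commit events have been decoded into dicts with repo (the DID), rev, and since keys. apply_commit_event is a hypothetical name, and the "resync" outcome stands in for whatever recovery the consumer prefers, for example calling getRepo.

```python
from typing import Dict


def apply_commit_event(last_rev_by_repo: Dict[str, str], event: dict) -> str:
    """Classify a commit event as 'applied', 'skipped', or 'resync'."""
    did, rev, since = event["repo"], event["rev"], event.get("since")
    last = last_rev_by_repo.get(did)
    if last is not None and rev <= last:
        # Replayed or out-of-date event: this rev is not newer than what we hold.
        return "skipped"
    if last is not None and since != last:
        # The event builds on a rev we did not see, so we missed something.
        return "resync"
    # First event seen for this repo, or since matches our last rev exactly.
    last_rev_by_repo[did] = rev
    return "applied"
```

The state is only advanced on "applied", so a consumer that resyncs can set the repo's entry to the rev it fetched and carry on from there.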

Backwards-compatibility path

  • We continue sending prev in events
  • New events will validate against the previous schema

Future Changes

  • Deprecate support for rebases
  • Possibly deprecate the required prev field
  • Possibly deprecate the full route in favor of a new streaming v2 endpoint (TBD)

getLatestCommit

Takes the place of getHead (we’re moving away from “head” as a term).

Changes

  • Changed name of root property on response to cid
  • Added new rev property to response
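During the transition, a client that may talk to both older and newer services could normalize the two response shapes. This is a small sketch that assumes the responses have been decoded into dicts; normalize_latest_commit is a hypothetical helper.

```python
def normalize_latest_commit(response: dict) -> dict:
    """Map a getHead-style or getLatestCommit-style response to {cid, rev}."""
    # Prefer the new cid property and fall back to the old root property.
    cid = response.get("cid", response.get("root"))
    if cid is None:
        raise ValueError("response has neither cid nor root")
    # rev is absent from getHead-style responses, so it may be None.
    return {"cid": cid, "rev": response.get("rev")}
```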

listBlobs

Same changes as getRepo: switch from latest & earliest to since.

getRecord

No changes.

getBlob

No changes.

getBlocks

No changes.

listRepos

No changes.

Deprecated methods

These methods will continue to be supported for an interim period but will eventually be fully deprecated.

getCheckout

Deprecated in favor of the new getRepo.

The functionality is the same as getRepo with no since set.

getHead

Renamed to (and thus deprecated in favor of) getLatestCommit.

Fully Deprecated

These methods will be removed immediately upon release of repo v3.

getCommitPath

The method no longer has meaning with history-less repos.


If you have questions about these changes, join us on GitHub Discussions here.

