Applying the Saga Pattern

Caitie McCaffrey

Recorded at GOTO 2015


I'm Caitie McCaffrey. I'm a distributed systems engineer; this is my disco hoodie, and I program in it sometimes. I currently work at Twitter on infrastructure and platform, specifically observability. I just joined in January, and I live in San Francisco. Prior to that I spent six-plus years working in the games industry on titles like Gears of War 2 and 3, and then Halo 4, which is where I spent the bulk of my time, building the services, networking, and distributed systems that power a lot of the entertainment experiences you see when you turn on your Xbox and start playing. That's my Twitter handle, and I have a tech blog, so if you want to talk to me on the Internet, feel free.
Like I said, I worked on Halo. I joined in 2010, when 343 Industries was a studio created inside of Microsoft to take over the Halo franchise from Bungie, who had decided they wanted to go make other games (they went and made Destiny). I was web service dev number two hired into this tiny studio of about 30 people, to figure out what we were going to do with the old Halo stack. The original games had services: Halo 1, 2, 3, and then Reach and ODST all had services to help you play and experience the game. But they were built the way we used to build services, where the game talked to a stateless service that you could linearly scale out, and then you had one giant SQL database that housed the source of truth for all of the Halo games. We realized that wasn't going to scale for us when we looked at what we wanted to do for Halo 4 and the future of the franchise going forward.
Halo 4 ended up seeing 1.5 billion games played and 11.6 million unique users, and that wasn't going to fit on a single giant SQL database anymore. So we rewrote all of the services and moved to Azure, and we ended up using Azure Table storage as our largest form of NoSQL storage; that's a key-value store, for those of you who aren't familiar with it.

Now we had a challenge. We used to do things with transactions on a single database; we were very used to programming against transactions on one giant database, and that kept our system consistent, which was really nice. But we no longer had that, because all of our data was split across multiple partitions. So we thought really hard about how we were going to do this. For me, thinking really hard definitely involves going to a bar, reading a bunch of papers, and drinking bourbon. We came across the saga pattern, from a paper published in 1987 that I'm going to tell you about, and we actually went and implemented it so we could process statistics in a "transactional" way. I put that in quotes because we are giving up some things: as you may have seen in an earlier talk, we do not have serializability. But it does give us a way to guarantee consistency from the beginning to the end of the transaction, and to know what to do in failure scenarios.

So today I'm going to talk about sagas: give you a little bit of motivation for why we need them, then talk about the original paper and its contribution to the database space (it actually came out of the database community), then how you do this in a distributed world, because it is a little bit different and there are additional challenges and constraints you need to impose, and then how we actually did this in Halo 4.
Our systems used to be really simple. You'd have an app or a website that talked to a stateless service, which talked to a giant canonical database, the source of truth, and so you got transactions, serializability, and ACID; things looked like they happened sequentially. Atomicity is the idea that either it all happened or it didn't, and we don't see the state of things being processed in between. Consistency here is application-level consistency, not linearizability: it's basically that the system is still in a valid state at the end of the transaction, whether the transaction completed or did not. Isolation means these transactions do not affect each other if they're running concurrently. And durability is the idea that once we've committed something, it persists, it's long-lived, and we don't lose data.
But now we're in this world of service-oriented architectures and microservices, and you often don't have canonical sources of truth anymore. When we write applications we're interacting with some of our own APIs, or with someone else's APIs, and they all have their own backing stores. There's no one canonical source of truth, and we don't get transactions anymore; we just can't have them.
One way we've tried to solve the distributed transaction problem is two-phase commit. Two-phase commit is an atomic commitment protocol, basically a specialized version of a consensus protocol, so sometimes it's okay, and it has been implemented and used at small scale. I'll go through it very quickly. You have a prepare phase, where a single coordinator (the coordinator is special in the system, and a single point of failure) proposes to all of the resources involved: hey, we're going to go do this thing. All of the resources get to vote yes, I want to do this thing, or no, I do not, and the coordinator collects those votes. Once the coordinator has received all of the votes, we enter the commit phase: if everyone said yes, the coordinator tells everyone to commit, and that's how it achieves atomicity, because everyone commits at the same time. If anyone says no, it tells everyone to abort. So there's no state where you could have the car reserved but not the hotel in your travel app. Everyone at the end must acknowledge that they're done, so that in a real distributed system the coordinator can know what happened.

This is kind of nice, and people actually use it, but it doesn't scale all that well. In failure conditions you can have up to O(N²) messages going through your system. The coordinator is a single point of failure: if it dies at any point during the transaction, the transaction is just aborted. The more resources you have, the more latency there probably is, you're bounded by the slowest resource in the system, and there's a larger chance of the coordinator failing. And there's reduced throughput through your system, because you're holding locks on these resources as you operate over them, so it's kind of slow.
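Just to make those two phases concrete, here's a minimal sketch of the coordinator's side; the Resource interface and method names are my own placeholders rather than any particular library, and a real coordinator would also have to durably log its decision so it can recover if it dies between the phases.

    # Minimal two-phase-commit coordinator sketch (illustrative names only).
    class Resource:
        def prepare(self, txn_id) -> bool:
            """Vote yes/no; a yes means this resource holds locks until commit/abort."""
            raise NotImplementedError
        def commit(self, txn_id): ...
        def abort(self, txn_id): ...

    def two_phase_commit(txn_id, resources) -> bool:
        # Phase 1: prepare -- every participant must vote yes.
        votes = []
        for r in resources:
            try:
                votes.append(r.prepare(txn_id))
            except Exception:
                votes.append(False)          # an unreachable resource counts as a "no"

        # Phase 2: commit if unanimous, otherwise abort everywhere.
        if all(votes):
            for r in resources:
                r.commit(txn_id)
            return True
        for r in resources:
            r.abort(txn_id)                  # releases any locks taken during prepare
        return False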
It's also worth pointing out that none of the large cloud providers implement a two-phase commit protocol, and they specifically say so; Azure has a blog post on this. It just doesn't scale well, and it's not something they want you using in their systems, because they're worried about all of these things.

And then Google has Spanner. This is how they do their distributed databases; it's another paper that came out of Google, and I highly recommend reading it. I'm going to cover it briefly. It's also a great paper for challenging your notion of time, because one logical notion of time doesn't really exist. Spanner is how Google does globally distributed databases and transactions between databases across the world, and the way they make all of this work is their TrueTime API: they have GPS and atomic clocks installed in all of their data centers. It's also worth pointing out that Google has fiber between all of their data centers, so they're not going over the normal network that most of us use, which is much faster. What the TrueTime API does is take all of the inputs from the system, including the GPS and the atomic clocks, and calculate a bound on when an event occurred, and they use that to create a total ordering in the system. But they're only getting down to a zero-to-seven-millisecond window: clock synchronization is still not a solved problem, they've just made the uncertainty window really, really tiny, and so it works pretty well. Obviously Google Spanner is not available to all of us; it's really expensive and it's proprietary. So distributed transactions are, once again, not solved for the masses, and synchronization is not solved.
What I'm trying to get at here is that distributed transactions are really hard, and they're really expensive, especially if you want serializability and ACID. So can we do better? Can we get distributed transactions with serializability and ACID? Well, I'm not a researcher, I just build large-scale distributed systems, and right now the answer is no, we can't: because of the CAP theorem we can't have nice things, and distributed systems are terrible. But we can start to give up some of these really tough constraints, like serializability and ACID, and ask: what can we give up and still program in the model that we're used to, and still have some guarantees?
This is where sagas come in. Sagas actually come from the database literature: the paper was published in 1987 by Hector Garcia-Molina and Kenneth Salem at Princeton University, and it looks at how to do long-lived transactions on a single database. They had this problem where a transaction that consumes a lot of resources or runs for a really long time, like computing a bank statement or anything that has to scan a large range of history, creates a bottleneck even on a single database, because it holds locks on a bunch of different resources the whole time it's running. They knew they couldn't keep full ACID and serializability, but they asked: what can we do to make this faster and still get some guarantees? So they came up with the concept of a saga, and they proposed it for long-lived transactions on a single database.

They defined a saga as a long-lived transaction that can be written as a sequence of transactions that can be interleaved with one another. Either all of the transactions in the sequence complete successfully by the end of the saga, or compensating transactions are run to amend the partial execution.
Let's break that down so it's a little more understandable. A saga is a collection of sub-transactions: you take whatever large thing you want to do and break it down into smaller transactions. Each of those sub-transactions can have ACID semantics on the single database. The one gotcha is that they cannot depend on one another, which is what gives you the interleaving: T2 can't take an input from T1, and there can't be any required ordering between them. If you can break your work into pieces like that, you have the start of a saga. Each transaction in the saga also has a compensating transaction, and the same independence principle holds for those as well. The thing about compensating transactions is that they semantically undo the transaction they correspond to. The paper acknowledges that, because we're dividing the work into these pieces, you sometimes can't return the system to exactly the same state: if you execute a transaction and then have to compensate for it, you may not be able to get back to the state the system was in before the transaction started.
For instance, if one of your transactions sends an email, you can't unsend that email, but you can send a follow-up email that says, hey, here's what happened and here's how we're correcting it. So it's really a failure management pattern. Once we've defined all of this, we get a nice guarantee: either all of the transactions completed, so the saga was successful and whatever unit of work we wanted to do has finished, or some of the transactions completed and all of their compensating transactions have run, so the saga failed but we're back in a semantically consistent state, equivalent to where we were before we started the saga. This is actually a really nice model to program against, because it's very similar to transactions: we're keeping consistency and application correctness throughout our system.
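As a rough sketch of that definition, using the talk's travel example (all names here are illustrative): a saga is just a sequence of sub-transactions, each paired with the compensation that semantically undoes it.

    # A saga per the 1987 definition: sub-transactions T1..Tn, each paired with
    # a compensating transaction C1..Cn that semantically undoes it.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SagaStep:
        name: str
        transaction: Callable[[], None]   # Ti: must not depend on other steps' outputs
        compensation: Callable[[], None]  # Ci: semantically undoes Ti

    # Stand-in booking functions for the travel example.
    def book_hotel():    print("hotel booked")
    def cancel_hotel():  print("hotel cancelled")
    def book_car():      print("car booked")
    def cancel_car():    print("car cancelled")
    def book_flight():   print("flight booked")
    def cancel_flight(): print("flight cancelled")

    # The guarantee: either T1..Tn all run, or T1..Tj run and Cj..C1 run.
    book_trip: List[SagaStep] = [
        SagaStep("hotel",  book_hotel,  cancel_hotel),
        SagaStep("car",    book_car,    cancel_car),
        SagaStep("flight", book_flight, cancel_flight),
    ]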
What we've traded off here, though, is atomicity for availability. These sub-transactions execute independently of one another, so you will see pieces of the saga completing before the whole saga has completed. If you need atomicity, this doesn't work for you, but for a lot of our applications that's okay. I also just really like sagas as a whole because they are a failure management pattern. Building distributed systems, we should plan for everything to fail, especially in the cloud, especially if you don't own your own networks; even if you do own your own networks, you should plan for everything to fail. Forcing your application developers to think about failure first and foremost in the design is going to help you build more robust, less fragile systems, because you're not just coding the golden path: you actually have to think about what happens when you go off the golden path.
So how does this work on a single database? Take our travel application: you want to book a trip, and instead of one large transaction that tries to reserve everything for the trip, you split it up into logical units of work like booking a hotel, booking a car, and booking a flight, plus the cancellation transactions for each reservation in case the saga fails.

The way this works on a single database is that you have a process called the Saga Execution Coordinator, or SEC. The SEC lives in-process with your database, so it shares the same fate; we're not in a distributed world just yet. Its job is to execute these sub-transactions and, in the case of failure, to start applying compensating transactions. You also have a saga log, which is just like your normal database log. You commit messages to it like begin saga, end saga, abort saga, and then the begin and end messages for every sub-transaction and compensating transaction.

To give you an idea of what this looks like, a successful saga goes like this. I want to book a trip, so I start the saga in the log. I start booking the hotel; at this point I now own that hotel resource, and the rest of the world can see it, because we don't have atomicity. I continue: I begin booking the car, I end booking the car, that completed, I now have the car resource. And finally, if I can get my flight, I can go on my trip and we can say we successfully completed the saga. It's also nice to note that in your application you can order things in a risk-centric way: sometimes there are penalties for cancelling a flight, so maybe that's the last thing you want to do, because you don't want to have to roll it back. There's a reason I'm showing you the ordering this way.
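For that happy path, the saga log ends up holding something like the records below. The exact record shapes are my own; the point is just that begin/end markers for the saga and for each sub-transaction are durably logged.

    # Saga log records for a successful "book trip" saga, in commit order.
    successful_saga_log = [
        {"saga": "trip-42", "record": "begin_saga"},
        {"saga": "trip-42", "record": "begin_txn", "txn": "book_hotel"},
        {"saga": "trip-42", "record": "end_txn",   "txn": "book_hotel"},
        {"saga": "trip-42", "record": "begin_txn", "txn": "book_car"},
        {"saga": "trip-42", "record": "end_txn",   "txn": "book_car"},
        {"saga": "trip-42", "record": "begin_txn", "txn": "book_flight"},
        {"saga": "trip-42", "record": "end_txn",   "txn": "book_flight"},
        {"saga": "trip-42", "record": "end_saga"},
    ]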
So what happens when things fail and things break? We have unsuccessful sagas. The paper actually discusses a couple of different recovery modes, backward recovery and forward recovery and a couple of others, but I'm going to talk about backward recovery because I think it's the most common and the most useful. This is the idea that as soon as we have a failure, or we can't complete a transaction, or something happens, we start rolling back and applying the compensating transactions for anything that had already run.

An unsuccessful saga looks like this in the log. We begin the saga, we start booking the hotel, and that succeeds, so we own that reservation. Then, for whatever reason, booking the car fails, maybe because there wasn't an available rental car that day. So now we need to roll back and start applying compensating transactions: we apply the compensating transaction for the car booking, which frees any resources that happened to have been taken, and then we free the hotel reservation as well. Now we're back in semantically the same state. I'm sad because I don't get to go on my trip, and I can try to rebook later, but the application is still in a consistent state, even though we held on to that hotel resource for a brief period of time and now someone else can go and take it. It's just like a normal transaction on a single database: the sub-transactions themselves are ACID, the saga either completes successfully or it doesn't, and an abort on a single database is like someone cancelling the whole thing.
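Putting the happy-path and rollback walkthroughs together, a minimal single-process SEC with backward recovery might look like this sketch; it reuses the SagaStep shape from earlier, and the log here is just an in-memory list standing in for the saga log.

    def run_saga(steps, log):
        """Execute sub-transactions in order; on failure, compensate in reverse."""
        log.append("begin_saga")
        completed = []
        for step in steps:
            log.append(f"begin_txn {step.name}")
            try:
                step.transaction()
            except Exception:
                # Backward recovery: abort, then semantically undo what finished.
                log.append("abort_saga")
                for done in reversed(completed):
                    log.append(f"begin_comp {done.name}")
                    done.compensation()
                    log.append(f"end_comp {done.name}")
                log.append("end_saga")
                return False
            log.append(f"end_txn {step.name}")
            completed.append(step)
        log.append("end_saga")
        return True

Calling run_saga(book_trip, log=[]) with the earlier steps either returns True with every step logged, or returns False with the completed steps compensated in reverse order.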
Now, sagas in a distributed system. The paper actually comments on this, and it does my favorite thing that early database literature does: it says that due to space limitations we only discuss this on a centralized system, although clearly it can be implemented in a distributed database. I laughed, because this is my life. So thanks, Dr. Garcia-Molina. We're actually going to go through and implement this. There are blog posts written about this pattern, but I don't like any of them, because I don't think they give you enough detail to implement the full guarantee. So we're going to talk about how you actually do that, why it's different in a distributed world, and the additional restrictions we have to apply in order to get the same guarantee.

We're back in this world where we are no longer on one canonical database source of truth; we're probably operating with a bunch of services that are each holding different data, using whatever data store they choose and applying their own constraints. This actually translates really nicely: we can still break the unit of work into requests, like book hotel, book car, book flight, and the corresponding cancellation requests. So far, so good.
I'm going to do a little term redefinition here, because I don't like using the word transaction in the distributed sense; it's not really a transaction, it's generally going to be a request. To me, transaction implies ACID semantics, and these requests could be ACID if the service gives you that guarantee, but it's really up to each service to define what guarantees you get. Hopefully they give you consistency and durability; you're probably not going to get isolation or atomicity from any of them, but maybe, depending on what API you're interacting with. So you have sub-requests, and you still have compensating requests as well, which semantically undo the requests.

A successful distributed saga looks exactly the same, and we still need a log, but now it's not co-located with a database, because we don't have a single database. So we have to have a durable, distributed log that lives somewhere. This might be Kafka, Azure Service Bus, or whatever you want to use; it just needs to function as a log.
You still need a Saga Execution Coordinator. This is, once again, a process, and it isn't co-located with anything. It's not special: I want to point out that this thing is not like our coordinator in two-phase commit. It can die, it has no state, and it doesn't do anything special; our source of truth is still the log. The SEC is a process that interprets and writes to the saga log, applies the saga's sub-requests, and, in the case of failure, decides when it needs to start applying compensating requests.

Let's walk through what this looks like, because things are a little different now. A service commits a message to the saga log saying: I want to start a saga, a large unit of work. It can commit a bunch of data along with that; these aren't just start and commit markers anymore, they can include everything needed to process the request. A Saga Execution Coordinator gets spun up, or is already there, sees that it needs to start processing this saga, reads it, and figures out what to do. It first commits a message to the log saying it's going to do the first request in the saga, and that has to commit successfully before it can do anything else. It then sends the request to the service responsible for handling it, and once that service responds, it commits an end message for that request to the log.

I walked through this fairly slowly to show you that we now have something like four additional points of failure that we did not have on a single database, where everything was co-located and the SEC, the log, and the transactions all essentially shared the same fate if anything crashed. We now have a bunch of places where things can fail again. It does the same thing for the second request: commit the start message to the log, make the request, commit the end message to the log; and the same with the third: commit the message, send the request, receive the response, and then end the saga. This is a successful saga, it all still works the same way, and life is good.
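Here's a sketch of that happy-path SEC loop, under the assumption that log.append is a durable append and send_request is whatever RPC or HTTP client you use; the key point is that intent is committed to the log before each request is sent, and completion is committed after.

    def run_distributed_saga(saga_id, requests, log, send_request):
        log.append({"saga": saga_id, "record": "begin_saga",
                    "requests": [r["name"] for r in requests]})
        for req in requests:
            # 1. Durably record intent to issue this sub-request (at most once).
            log.append({"saga": saga_id, "record": "start_request", "name": req["name"]})
            # 2. Call the service that owns this piece of work.
            response = send_request(req)
            if not response.ok:
                log.append({"saga": saga_id, "record": "abort_saga"})
                return False            # compensation is handled separately (see below)
            # 3. Durably record that the sub-request finished.
            log.append({"saga": saga_id, "record": "end_request", "name": req["name"]})
        log.append({"saga": saga_id, "record": "end_saga"})
        return True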
So what happens when things fail? When do we need to start applying these compensating requests? We still have the idea of an aborted saga. A service could respond: no, I'm not going to do this thing for you, because I just can't; or I'm not available; or you don't have access to do this. A start request can also simply fail, with whatever HTTP or other failure response, or because the service isn't there. And in some cases, if the SEC crashes, we might have to start compensating, because we have not applied any additional constraints on these sub-requests: they don't have to be idempotent, there's nothing special about them, you can do whatever you want with them. We'll talk more about SEC crashes in a second.

Let's walk through a failure case. We've done the first thing: whatever the first request is, it completed successfully. I've now started the second request, and the service says no, I'm not going to do this, I want you to abort the saga, or it just returns a failure. The SEC then commits an abort-saga message to the log, so now we know we're in rollback and need to start applying the compensating requests. It does the same thing as when it's applying a regular request: it commits the start message to the log, sends the compensating request to the service, hopefully that succeeds, and then it commits the end message to the log. Then the same for the next one, because it's reading the log and sees that it still has to apply the compensating request for service one as well. So that's great, and we still have roughly the same kind of guarantee we had before.

But what if compensating requests fail? We're now in a world where they're not transactional, so they can fail. Because they can fail, we need to be able to retry them indefinitely until they succeed, and this imposes an extra constraint on our system that we didn't have in the database world: our compensating requests have to be idempotent, so we can replay them until they succeed. That's a little bit different from the normal single-database case.
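Here's a sketch of what an idempotent compensating request handler can look like; keying the cancellation by saga id (my own convention here, not from the talk's slides) is what makes retrying it any number of times converge to the same state.

    # The dict stands in for the hotel service's own store: saga_id -> reservation.
    reservations = {}

    def cancel_hotel(saga_id: str) -> dict:
        reservation = reservations.pop(saga_id, None)
        if reservation is None:
            # Already cancelled (or never booked): still report success, so a
            # retried compensation converges instead of erroring forever.
            return {"ok": True, "already_cancelled": True}
        # ... release inventory, issue any refund, send the follow-up email ...
        return {"ok": True, "already_cancelled": False}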
And then finally, what happens when the SEC fails? Like I've said, this process is not special; we can just spin up another one to continue, and it doesn't even have to be on the same machine. We do have to determine whether the saga was in a safe state when the SEC crashed or not. A safe state is one where every sub-request that was started is complete, so you have both the start and the end logged, because at that point we know what happened: we just left off somewhere, and we can pick the saga back up wherever it stopped. If you're in an aborted state, that's also safe, because compensating requests are idempotent and we can just keep replaying them; even if I had only committed the start of one, I just send it again.

Where we have to start applying rollback is when we get into a state of uncertainty: we started a request and we don't have the end. Maybe the SEC crashed before it even sent the request; maybe it crashed after it got a response back. I don't know whether the service ever saw it, and it's not safe to replay that request, because we haven't put any additional constraints on normal requests. If, in your system, you can also make your normal requests idempotent, you never have to worry about SEC recovery at all; you just bring it back up and start reprocessing. But I think making all requests idempotent is a tall order to impose on your normal method of execution. So what happens is: if the SEC comes back up and the saga is in an unsafe state, it commits an abort-saga message to the log and starts the compensating requests.
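That decision can be expressed as a small function over the log; this sketch follows the record shapes from the earlier sketches and just encodes the safe-versus-unsafe rule from the talk, nothing more.

    def recover(saga_id, log_records):
        records = [r for r in log_records if r["saga"] == saga_id]
        kinds = {(r["record"], r.get("name")) for r in records}

        if ("end_saga", None) in kinds:
            return "done"                      # nothing to do
        if ("abort_saga", None) in kinds:
            return "resume_compensation"       # safe: compensations are idempotent, just replay

        started  = {n for (k, n) in kinds if k == "start_request"}
        finished = {n for (k, n) in kinds if k == "end_request"}
        if started == finished:
            return "resume_forward"            # safe: every started sub-request completed
        # Unsafe: a sub-request was started but we don't know if the service saw it.
        return "abort_and_compensate"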
So essentially what we've done here is define some messaging semantics on top of these requests. Our sub-requests are at-most-once: they will get delivered zero or one times. Our compensating requests are at-least-once: they will get delivered one or more times. You need to know that when you're designing your saga, to make sure the systems you're making these requests against can handle it. But now we're back in a world where we have the same saga guarantee that we had on a single database, and that's really great: I know that either my whole saga has completed and I'm in a semantically consistent state for my application, or my whole saga has not completed and I'm still in a semantically consistent state for my application. Being able to program within those bounds is a big win.
So, just to recap distributed sagas: they're very similar to the single-database version, except that now you need a distributed, durable saga log, and you can use whatever you like for that. You need an SEC process; it's not special, but you do need something that will continually spin it up and make sure it's running. And your compensating requests now have to be idempotent. That's different, but I don't think it's too tall an order for us as application developers: if you think about RESTful services, POST is not idempotent, but DELETE is, or at least semantically it's supposed to be.
Okay, so let's go back to this guy. This is the Master Chief, the main character of the Halo series, in Halo 4. Like I said before, when we started looking at how to move from the single-database world into this partitioned Azure NoSQL world, we ran into some problems. I'm going to talk about the statistics service, the main service in Halo that controls everything about your player; I wrote the majority of this service. We have very, very detailed analytics on what you're doing at essentially all times in the game, and we get them by having a giant blob of data uploaded to us at the end of every game, and sometimes while you're playing. One of the problems we ran into is that people care a lot about these statistics: they get really, really upset when you screw them up, and it's always the best game they've ever had, shockingly. And gaming is a real thing: people play games like Halo for money, for competition and prize money, and people like to talk trash with their friends. We also have this whole website you can go deep-dive through, and that's just a fraction of what we track per player.

Some things to know about the statistics service: we could have 1 to 32 players per game. When we did this migration to fully cloud, storage-as-a-service infrastructure, all of our player data was in Azure Table, which is a key-value store, and each player had their own partition. So now you're talking about writing data to maybe 32 partitions, and it still has to look like one unit of work: either all succeed or all fail. We were, you know, a little stumped, but sagas came in and helped us do this.
If you've seen any of my Halo talks before, where I talk about Orleans, this is what I normally show. We built the bulk of the services using an MSR actor framework called Orleans. That's not super integral to this, but the little diamonds are actors, or grains, from Orleans. What happens is the Xbox sends us a bunch of statistics; the game actor, or game process if you're not familiar with the actor model, aggregates those, writing them out to blob storage, and then at the end says: hey, all the players, please go update your statistics. Then all of those player actors write to their corresponding partition in Azure Table storage. And then I do this hand-wavy song and dance of "it doesn't fail," or "there's a way to deal with failure."

This is what it actually looks like, because it can fail. What actually happened is we had the Xbox talking to our stateless front-end service, which we wrote in F#, which would then log a message to Azure Service Bus, a queueing service and message broker in Azure that you can use. It would commit the payload of stats to the Service Bus, saying: hey, go start the saga for this game. That's what we used as our saga log. We had these stateless router grains, router workers, that would just notice that, hey, something got committed, so we need to spin someone up to go process this work. And we treated our game actors, our game processes, as the SEC coordinator: all the routers did was notice that new work was there, and then they would spin up the right game grain to go handle it. The game grain then acted as the SEC at that point: it would send all the messages and then commit back to the Azure Service Bus log.
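Very roughly, that wiring looked like the sketch below, with a plain in-memory queue standing in for Azure Service Bus and bare functions standing in for the router and game grains; this only shows the "front end commits the payload, a router notices it and spins up the SEC for that game" flow, not the actual Orleans code.

    import queue

    saga_log = queue.Queue()   # stands in for the Azure Service Bus queue used as the saga log

    def front_end_upload(game_id, stats_payload):
        # The stateless front end's job, sketched: durably enqueue the whole
        # payload along with the intent to start a saga for this game.
        saga_log.put({"record": "begin_saga", "game": game_id, "stats": stats_payload})

    def router_loop(dispatch_to_game_grain):
        # Stateless router workers: notice new work and spin up the right game grain.
        while True:
            msg = saga_log.get()
            if msg["record"] == "begin_saga":
                dispatch_to_game_grain(msg["game"], msg["stats"])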
So what happened in failure? In failure, you maybe don't want to just roll back statistics, because if I've committed your statistics and you can see them, even though your friends can't yet, and then I roll them back, that looks really jarring: you had a hundred kills and now you have ninety-eight. That's a really bad user experience. Also, we wanted to process this data eventually anyway; it's not like we were going to throw it away just because we failed to write it for one player. So we implemented forward recovery, which is another mode specified in the saga paper. It basically says this saga always needs to succeed, so just recover forward; don't roll back. The basic premise is that you checkpoint: you figure out when you're at a safe state, you checkpoint those safe states, and on failure you restart from the last safe state and roll forward. Luckily for us, any player succeeding in their write was essentially a safe state, so we never had to do any rollback; we would just retry the saga later if someone failed to write. I'll walk you through what that looked like.
our SEC he talked to all of the players
and the players talked to all of their
own partitions and so say like player
free couldn't write to its partition
because we blew our ups budget on that
storage account or something I don't
know
and so like three of the players will
now be able to see their statistics for
that game on Halo Waypoint and player 3
will not and that's ok we like people
knew that our statistics processing
scene with someone asynchronous
generally halo players are a little
narcissistic so they weren't looking at
their friends stats they were looking
more at their stats so we actually like
never had complaints about this so that
was fine and so we would then have to go
and replay throughout the system we
would have to go Bri play layer player 3
right so we would actually put this the
message on and like back off on
processing this for a while to give the
system a chance to catch up so you don't
just want to like hammer your storage
account again if it's already like
saturated so but because of this if
you're noticing I'm not rolling back and
I'm actually replaying the request to
player through again and because great
the game drain didn't know where that
failed it didn't know if it wrote and
then like just failed to like get it
back or like what had happened and so so
Ford requests or if you're gonna do
forward recovery in your system in a
distributed system then the sub requests
also have to be idempotent so we were
able to do some nice little tricks and
rely on our databases consistency to
ensure that we weren't like double
counting statistics or everything all of
our requests right opponent because it
was essentially just set operations
anyway cuz you're just like adding like
requests so then when the saga
got retried later and for recovery the
game grain would send it to our stats
grain player
three he would successfully store his
statistics and now we know we've
processed all of the data in in this
game and then we can just sort of like
we would discard it we didn't actually
have like a persistent log of everything
because that wasn't useful to us we
actually stored the data we needed and
then threw it away and so at this point
we now have that guarantee that like all
of the players in the in the game had
processed their to statistics and our
system was in a consistent state and
everyone could see their state this
statistics for the game cool so I'm
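A rough sketch of that forward-recovery loop: the idempotency here comes from writing each game's stats under a per-game key, so re-applying the same game overwrites rather than double-counts. The dictionary stands in for a player's Azure Table partition, and all names are illustrative.

    import time

    # player_partition[player_id][game_id] -> that player's stats for that game.
    # Writing the same (player, game) twice just overwrites with the same value,
    # which is what makes the sub-request safe to replay under forward recovery.
    player_partition = {}

    def write_player_stats(player_id, game_id, stats) -> bool:
        try:
            player_partition.setdefault(player_id, {})[game_id] = stats
            return True
        except Exception:
            return False   # e.g. the partition's throughput budget is exhausted

    def process_game_saga(game_id, per_player_stats, max_attempts=5, backoff_s=1.0):
        """Forward recovery: keep retrying until every player's write has succeeded."""
        pending = dict(per_player_stats)
        for attempt in range(max_attempts):
            for player_id, stats in list(pending.items()):
                if write_player_stats(player_id, game_id, stats):
                    del pending[player_id]           # checkpoint: this player is a safe state
            if not pending:
                return True                          # saga complete, everyone's stats are visible
            time.sleep(backoff_s * (2 ** attempt))   # back off before hammering storage again
        return False                                 # leave the saga on the log to retry later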
Cool. So I'm lying a little bit about what we actually did, and this is where I like to come in and tune systems, because there's a trade-off here. If you're going to write back to the Azure Service Bus log for 32 players, and that log has to be consistent (you want a CP log, because you don't want to just drop data or have errors in there), it's kind of expensive to talk to that thing 32 times per game. So we optimized for our failure case: writing to a single player partition was not expensive, because players were only doing that about once every 20 minutes, when they finished a game. For us it was easier to just retry; it was a better trade-off, network latency and all things considered, to just retry everything, because we knew we were idempotent. That's a trade-off you can make. So when we had a failure, forward recovery would actually just re-blast everyone and say: process your statistics. The ones that had already stored theirs wouldn't double-count or re-store; they'd just say, hey, I processed this correctly, and we were done.

I like to point this out because I think as an industry we hide some of the shortcuts we take, but I think it's actually really valid to take these shortcuts when you're tuning a very bespoke instance of a system. This was a very one-off instance; you could go build a general distributed saga execution coordinator system, but you can also just build one that works for your system, and that's totally valid. This is a pattern for you to utilize in building your applications and making sure they are correct.
So, just to recap: sagas are these long-lived distributed transactions, and you should consider using them in your system, because I think they're a really helpful way to think, design, and program in a model that we're used to programming against. You are trading off atomicity for availability, so if you cannot tolerate seeing partial execution of the things happening in the saga, you can't do this; but in most cases we can, and then we can take corrective actions. My skew in the world is towards highly available systems: games, social media, things like that, where there's a very real business consequence to not being able to take actions and do things. If you can't play Halo, that affects the amount of money we make, and that's bad, so making this trade-off is fine for us, and if we had to take corrective actions we were going to go do that. And finally, it's a failure management pattern: it helps you build more reliable, more robust, less fragile systems, and I like it as a pattern on the whole because of that.
Finally, I want to thank a bunch of people who helped me out with this talk: Peter Bailis, Ines Sombra, Tom Santero, Kyle Kingsbury, Jeff Hodges, and Clemens Vasters. Without them I could not have done this. And now, if you have questions, I'm happy to take them.