Shopify's Architecture to handle 80K RPS Celebrity Sales

Simon Eskildsen

Recorded at GOTO 2017

thank you Kevin
my name is Simon as I said I work as an
infrastructure lead at a company called
Shopify based out of Canada
a company that powers a fairly large
amount of online commerce actually both
here in Scandinavia but also in North
America so if you bought something
that's not been on Amazon there's a good
probability that you may have been
through a Shopify store and use some of
the infrastructure that we're going to
talk about today the talk today is going
to be a fairly high level and I'm gonna
be talking a lot about the architecture that
we built at Shopify to support these
very very large sales that we see from
some of the largest or biggest American
celebrities and some of the largest
brands both in Europe and also in North
America so we're gonna be going over
especially the architecture that
we've evolved over the past five years
and we won't go super deep into any
particular point but rather give a
comprehensive overview of all the
different components how they fit
together and hopefully inspire something
that I think can be leveraged in a lot
of other companies especially SaaS
companies that can have a similar
architecture to ours
the clicker's having trouble so Shopify is a
company that helps people sell both
online and in many other places and
about five years ago we faced a bit of a
fork in the road we were starting to see
these customers that were sort of online
first when retailers went online about 7
8 10 years ago
often they'd had a physical presence and
then moved online but we were starting
to see a lot more customers that started
with an online presence and they also
started to require new ways of selling
with social media becoming very very big
over the past decade and
especially five years ago smartphones
were out in everyone's hands and
Instagram Facebook Twitter and so on had
become quite large and this enabled new
ways of selling online that people
hadn't quite been able to do in the past
we call this a flash sale and the
idea of a flash sale is that you
announce a product sometimes minutes in
advance sometimes days in advance
sometimes hours in advance that you're
going to release some kind of exclusive
product at this day and time and as you
can imagine that can drive an
astonishing amount of traffic to one
site from just one minute to the
other it's one minute before noon and
everything is good but at noon you have
hundreds of thousands of people coming
in trying to buy some of the same
products at the same time placing an
enormous strain on the system you can
imagine this from the Super Bowl or
Kylie Jenner if some of you know her
famous for being famous and she's
out of the Kardashian family selling
lipstick and she can drive an
enormous amount of traffic
and Kanye can do that as well so about
five years ago it was not these very large
Super Bowls and things like that that
were starting to use this way of selling
but rather still customers that could
drive an enormous amount of traffic
and we were faced with a fork in the
road do we become a company that tries
to support these sales or do we just
tell them to go and find another
platform a platform that actually didn't
exist but it's a very hard thing to be
profitable from it takes thousands of
engineering hours to be able to scale to
these sales when you haven't architected
for it from day one we chose the path of
trying to support these we chose to see
these sales as a way to identify
bottlenecks and it's a bit of a canary
in a coal mine
as to what is going to break next the
flash sales of today tell us something
about what the traffic is going to look
like one or two years from now
(wrong direction) this was such a
powerful decision at the time and so
important to our CEO who's very
technical that he wrote an internal
essay on why we support flash sales this
is really important in our company
philosophy that we want to in the
face of adversity become stronger and
not weaker and use it as an
opportunity to grow when something bad
happens we want to come out stronger and
he wrote an essay about why flash sales
were such a case we have a very strong
culture internally of trying to inject
chaos into various parts of the
organization and try to come out
stronger as a result if you drop a piece
of glass it breaks if you drop a rock
it's indifferent but what can you drop
that becomes stronger as a result you
know if you go down to the gym you run
on a treadmill you come back a couple
days later and you can run further and
faster how can you build software in
that way and flash sales was such a
thing for our organization it was a way
for our infrastructure to become
stronger just a couple of things before we get
into the meat of the talk and with this
preface of how important flash
sales have been to the history of our
application here's some of the numbers
to keep in the back of your head for the
scale that we're running at we support
about half a million merchants paying
merchants all over the world
we processed almost six billion dollars
in the last quarter we run up to 80,000
requests per second to our Ruby on Rails
processes we do about 42 deploys a
day to this and we have over 2,000
employees a large fraction of those
engineers deploying all the time the
mental model that I wanted to
incorporate for this talk is that we
have about three different tiers that
we're going to talk about we have the
traffic tier with all the traffic
coming in and which is responsible for
getting your requests from your home
network all the way to our
infrastructure serving back the page that
you're requesting we're gonna be talking
about the application and the data
tiers that work together to serve those
pages and we're gonna be talking a bit
about some of these arrows how do we
failover between regions with zero
downtime how do we shard our application
and how do we balance those shards as
some shops grow bigger so that Kanye and
Kylie are not on the same shard but
rather spread out and how do we get from
the traffic layer into your application
layer and failover regions how does all
of this stuff work together you should
have a pretty good idea for how that
works for us by the end of this talk
we're going to be starting by talking
about the traffic tier as I mentioned
the traffic tier to me is the layer that
is responsible for how to get the
request from A to B in this case from
your home network to our network we're
going to be talking about how we do
global routing because this is something
that we face a lot of debate with
internally we're going to be talking
about a technology called OpenResty
which is absolutely incredible but quite
underused we're going to be talking
about how we protect ourselves from bots
during sales how we serve cache hits from
the load balancers instead of the
application tier orders of magnitude
faster and how we throttle the checkouts
during very very large sales to
understand how your request gets from
the customer to our data centers or
regions as we call them we need to
understand a little bit about how the
internet works this is a very high-level
overview of how the internet works and
how the routes propagate if you are
Facebook or Google and you have one
domain and not a bunch of customers
pointing their DNS to your domain then
you can do a lot of really complex
traffic engineering at the DNS layer but
when you have about a million domains
from half a million different customers
pointing to your IPs you have a very
limited amount of control at your
traffic level you only have the IPs that
the customers are
pointing to you can't control
anything at the DNS level so we did
something that hasn't had a lot of use
in the industry for those of you who
know about networks it's called BGP TCP
anycast and I'll explain here how it
works how does traffic get from the
customer to our regions with this BGP TCP
anycast so what happens is that Shopify
has IPs as you can see here we have
this /24 this block of 256 IPs
and our regions will announce to the
Internet that these IPs can be routed by
these regions so if the ISP sees that IP
it should be routed to Shopify
so what Shopify will do is that we will
connect our routers into our ISPs
and say hey we own these IPs to our
adjacent ISPs so we may have let's
say two ISPs that we connect directly to
from our regions and we tell them if you
encounter these IPs send them to us
because that is Shopify traffic for some
Shopify store those
ISPs then tell their adjacent neighbors
that the ISPs that they're plugged into
know how to route them and like that
it's propagating through this entire web
you can think of this as a bit of a
gossip algorithm where every neighbour
or every ISP is constantly telling
its neighboring ISPs which IPs it
knows how to route this entire
propagation you can imagine ISPs
ranging from Asia to Europe to America
to South America of these IPs that entire
internet broadcast takes about seven
minutes which is either amazing or very
slow depending on where you're coming
from so now we have this connected web
how does that customer then get their
traffic to the region well they make a
request right so in this case they're
going to a walrus store and it's
pointing to the Shopify owned IP that is
part of that block that IP block that
the regions are announcing so it's
going to their ISP and then it either
needs to turn left or right depending on
which region is closest to that ISP in
this case we're going right at least
from where I'm standing and the request
is then routed to the region this works
really really well for when you're
pointing all these customer domains
directly to IPs and you don't have too
much control over it so now we know how
the traffic gets to Shopify now I want
to talk a little bit about what happens
when that request enters our network we
use the technology called OpenResty
OpenResty is perhaps one of the
best additions we've made to our
technology stack in the past four years
it enables you to script your load
balancers your nginx load balancers with
Lua you can do pretty much everything
that you desire you can customize your
load balancing algorithm you can
redirect to other data centers you can
change the headers you can inspect
cookies you can serve different content
based on requests and cookies you can do
external network calls you can do all of this
stuff all in the comfort of
nginx which is extremely fast and
extremely stable I think OpenResty is
quite underrated and I don't hear quite
enough talk about it because it is so
stable so fast and so scalable I want to
talk about a couple of modules that we
built with this it really has become our
hammer and nail to solve a lot of
traffic related issues and we've gotten
away with warding off a lot of DDoSes
and DoSes just from simple Lua modules
it's really powerful to be able to
script your load balancing tier and
here's just a very quick example of what
it looks like this is a very bare-bones
nginx config that is just listening on a
port and in this case serving some
content by Lua and you can do all kinds
of things here right you can do
customized load balancing algorithms like
you can imagine doing Network IO in this
block you could do pretty much anything
so one of the modules I want to explain
is that we have this big problem of bots
there is surprisingly as we learned
a big market for sneakers so there's a lot of people
out there sneakerheads and they collect
sneakers and to get these sneakers
you need to be there when the sale
happens and you need to be really lucky
because there's limited inventory maybe
there is you know two hundred thousand
people competing for 5,000 pairs of
shoes when that happens there is an
economy here where you can buy these
shoes and if you buy them you don't wear
them you sell them on eBay for maybe two
times three times four times the price
of course merchants don't want that they
don't want a secondhand market charging
more than they were and not controlling
the customer experience so they were
asking us to come up with a way to
detect these bots and ban them so we
took out our hammer and nail OpenResty
and built some modules that
helped us solve this problem the Kafka
logger module is one that we've had for
a long time every single request that
comes in to Shopify is logged into Kafka
a distributed message bus and all the
messages there are relayed into our data
warehouse but we can also consume these
messages in real time and make decisions
about them so you can imagine that
there's a Kafka cluster here and for a
checkout we log that checkout into
Kafka then there's a stream
aggregator and this is the bot squasher
it listens on Kafka and tries to in all
of these different events see whether
there's a suspicious pattern is there
something that doesn't look like a human
is it refreshing a page very very
rapidly in a way that a human in a
Chrome browser would never do or is
there something else at stake this is
a fairly sophisticated algorithm to try to
figure out what is going on it is not an
OpenResty module it's just a simple
consumer I think written in Go in this
case and when it finds something an IP
or user agent or something else that
looks like a bot it can tell the rule
banner which is written in nginx
and it's just a simple DSL that allows
you to ban various patterns very
simple patterns like user agents and
IPs and the bot squasher can then
based on the assumptions made from the
Kafka stream ban an IP or something else
with the rule banner on all the load
balancers now we can reject these bots in
a few microseconds and people can write
very sophisticated algorithms at the
bot squashing layer to ban these bots
very rapidly when we deployed this we
saw a fairly sizable decrease in the
amount of requests coming to Shopify and we
had to turn it off again because we were
so flustered like this was way too many
but turns out that a very very large
fraction of our traffic was actually
just bots another thing we do with
OpenResty is that we serve cache hits so
we can if you go and browse a Shopify
product there's probably someone else
maybe browsing the same page the same
product later so we can cache that if
nothing changes we can go to the
application tier and serve that cache
hit which is stored in memcache but we
can do this orders of magnitude faster
if we do that at the nginx layer just
because of the overhead of the Ruby VM
and a whole stack that we have there
which is very difficult to optimize at
this point so you can imagine a request
coming in it's getting some collection
filled with walrus products and
you have a cache miss you get to the web
application tier but when you are
fulfilling that request getting all the
data from the data stores and rendering
the response back to the user
we're also
filling the response into the cache so
someone else can be served it really really
simple stuff nothing novel but on the
next hit what can happen is that edge
cache which is an OpenResty module will
check the cache directly at the load
balancing tier this is fairly
simple right so now we can get it
directly from the cache and not even
touching the application tier making us
able to handle orders of magnitude more
requests per second with a fairly small
addition to the stack this is more
complex than it looks at least for our
stack because generating cache keys
and things like that has turned out to
be quite complex but for smaller
applications this might be quite simple
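
The cache-hit path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Shopify's module: the `EdgeCache` class, the `render_page` callable, the `(host, path)` cache key, and the in-memory dictionary standing in for memcache are all hypothetical, and real cache key generation is, as noted, much more involved.

```python
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy stand-in for a page cache checked at the load balancer (hypothetical)."""

    def __init__(self, render_page: Callable[[str, str], str]) -> None:
        self._render_page = render_page  # the slow application tier
        self._store: Dict[Tuple[str, str], str] = {}  # stands in for memcache
        self.app_calls = 0  # how often the application tier was touched

    def get(self, host: str, path: str) -> str:
        key = (host, path)  # assumed cache key: shop domain plus path
        if key in self._store:
            # Cache hit: answered without touching the application tier.
            return self._store[key]
        # Cache miss: render in the application tier and fill the cache.
        self.app_calls += 1
        body = self._render_page(host, path)
        self._store[key] = body
        return body

    def invalidate_shop(self, host: str) -> None:
        # Drop every cached page of one shop, e.g. after a product change.
        for key in [k for k in self._store if k[0] == host]:
            del self._store[key]


cache = EdgeCache(lambda host, path: f"<html>{host}{path}</html>")
first = cache.get("walrus.example", "/collections/all")   # miss, fills cache
second = cache.get("walrus.example", "/collections/all")  # hit
```

The second request returns the same body while `app_calls` stays at one, which is the whole point: repeat traffic never reaches the expensive tier.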
we don't have this turned on for
everyone but we do have this turned on
for some larger merchants and the reason
why this is hard for us to turn on is
that people on Shopify can customize
their themes completely and getting that
to play very well with caching has
turned out to be more difficult than we
thought the last one I want to talk
about is the checkout throttle some
merchants come in and can drive so much
traffic that we can't handle the amount
of writes that they can do on a checkout
and you can't cache a write so you need
to do something at some point there's so
many writes going into the database for
adding the checkout or running
scripts that modify the cart all these
kinds of very complex operations
that happen when you're checking out
getting shipping rates processing
payments that we have to throttle it
very few merchants ever hit this I would
say probably less than 10 or 20 unique
merchants in the lifetime of Shopify
have ever hit this throttle but when it
does happen it's very important that all
the shops that are on the same shard as
you do not suffer under your sale so
what we do is that when a request comes
in for a very busy store to the checkout
we will put that request into a queue
again happening at the load balancing
tier now while you're in the queue
you're redirected to a wait area which
is completely customizable for the
merchant again we strive as hard as
possible to not put people in this at
all but we need to have a safeguard in
case it does happen we control the
rate at which we leak from the queue
with a throttle and then when the
throttle is done we
will redirect you back to the checkout
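
As a rough illustration of the queue-and-leak behaviour just described, here is a small Python sketch. The `CheckoutThrottle` class, the `/wait` path, and the idea of a fixed number of admissions per "tick" are assumptions made for the example; the talk does not say how the real rate is chosen.

```python
from collections import deque
from typing import Deque, List


class CheckoutThrottle:
    """Toy leaky queue in front of a busy shop's checkout (hypothetical)."""

    def __init__(self, admits_per_tick: int) -> None:
        self.admits_per_tick = admits_per_tick
        self.waiting: Deque[str] = deque()

    def arrive(self, customer: str) -> str:
        # Every checkout request for the busy shop joins the queue and the
        # customer is sent to a wait area (path name is made up here).
        self.waiting.append(customer)
        return "/wait"

    def tick(self) -> List[str]:
        # Leak at most `admits_per_tick` customers back to the checkout,
        # in arrival order.
        admitted: List[str] = []
        while self.waiting and len(admitted) < self.admits_per_tick:
            admitted.append(self.waiting.popleft())
        return admitted


throttle = CheckoutThrottle(admits_per_tick=2)
for name in ["ann", "bob", "cho", "dee", "eve"]:
    throttle.arrive(name)
rounds = [throttle.tick() for _ in range(3)]
```

Because the queue lives in front of the application, the write rate the database sees is bounded by `admits_per_tick` no matter how many customers arrive at once.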
now we've talked a little bit about the
traffic tier some of the OpenResty
modules we do and how traffic gets all
the way to Shopify now I want to dig a
bit into how does our application layer
work and how does our data tier work I'm
gonna talk about both of them at the
same time because they're quite
intermingled a very important concept for
us is what we call a pod
some companies may call this a shard
but a shard to us means a MySQL
like a relational shard a pod to us
encompasses more a pod to us is a small
Shopify small isolated Shopify that can
run anywhere in the world can run in a
cloud it can run in a data center can
run anywhere
so a pod is an isolated unit of Shopify
that has one or more shops you can
look at it like this if we have let's
say pod 7 pod 2 and pod 14 they all
have a bunch of shops on them these
shops don't talk to each other the pods
don't talk to each other they're
completely sharded and isolated from
each other giving us some very nice
scaling capabilities this is a
fundamental isolation principle
that we call the shop isolation
principle all shops must be isolated
from each other if you want to do
something cross shop you have to use the
data warehouse or some other mechanism
because doing things across hundreds of
shops hundreds of pods in multiple
regions is not something that's going to
scale at least to do in real time so
another thing I want you to note here is
that these pods are not the same size
some have more shops than others and
we're gonna address that problem later
but first I want to look at what's in a
pod so a pod is one or more stores what
does that mean well it means that all
the stateful layers of that pod are
isolated so a pod has a
MySQL it has a memcache instance it
has Redis instances and it has cron
workers that are doing periodic work so
all the stateful data of all of these
shops is isolated
from all the other pods and all the other
shops now there is a tier that is
that is shared and those are the workers
so when a web request is handled or a
job is handled it's shared between the
pods now why is that why would that tier
be shared the reason is because of the
sales remember the premise of this is
that we have to architect for these
massive sales and if
pod 2 is experiencing a massive sale
some kind of launch then they may need
to be able to get perhaps sixty to
seventy percent of all the worker
capacity to handle that sale so the
stateless stuff is shared and the
stateful stuff is isolated by pod now
why wouldn't you have an auto scaler for
this sounds like the canonical
thing to solve with a cloud
autoscaler right so when you have a sale
you just start spinning up more
instances the problem is that spinning
up those instances is going to take tens
of seconds and during that time you're
going to be serving a lot of errors and
have a very bad experience for all the
customers coming in so we don't do that
and maybe one day we can maybe we can do
some really crazy things so that we can
spin up these workers fast enough with
an auto scaler but today it's not the
case another stateless thing that is
shared is the traffic tier that we
talked about before where all the
OpenResty modules are running and the load
balancing is happening another tool that
I just want to introduce while we're
here to test the stateless tier is that
we have a load testing tool called
Genghis what Genghis does is that it
runs Lua scripts that define a flow
through our application and allow us to
test with an astonishing amount of load
using cloud instances what happens when
we send for example 10,000 people per
minute through the checkout you
can customize this as much as you want
with Lua scripts that allow you to say
things like oh this customer is gonna go
to the product page you're gonna browse
around a bit they're gonna be a little
bit indecisive then we're gonna add some
products remove some products from the
cart checkout the entire thing and pay
for it and then you multiply that with
maybe 10,000 6000 the interface to this
is really simple we have a Slack bot
which just responds when you tell it to
enqueue some Lua script the JSON file
here that you're enqueueing describes
which Lua script to use which store to
target how long to do it and at what
rate how many of these executions do you
want per minute so remember that problem
from before of all of these pods being
of different sizes this year that's a
problem that we've been trying to
address in the past what we've done is
that when we needed to provision a new
pod because we've had too many shops we
replicated everything in that old pod
that we wanted to
split in half over to a new pod then for
a short duration of time we would change
the routing layer to have half the shops
go to the new pod and the other half to
the old pod and then start cleaning up
a very operational manual amount of work I
think the checklist for doing this was
like a hundred items so we wanted to
make this better and less risky and
have fewer big hand wavy movements
to move all of these shops between pods
with less downtime so we wanted to be
really really good at moving shops
between pods and we wanted to see if we
could do this with the minimal the
smallest amount of downtime possible we
shouldn't punish large merchants such as
a large Shopify store that may take
hours to move and give them hours of
downtime even if they can schedule it
that's a terrible experience for a
merchant that maybe has an opportunity
cost there of perhaps several million
dollars that they're losing in that
transition so what the pod balancer does
is that it looks at these pods and it
looks at the shops and then the pod
balancer will make some decisions about
how to balance these pods so you can
imagine in this case it sees that pod
2 is a bit bigger it moves one shop
to pod 7 but in the interim pod 14 grew one
of these shops is occupying three units
three shop units this could be a shop
that has large sales it could be a shop
commerce solution and has grown
tremendously in size some of these shops
are bigger than others just in terms of
size scale sales whatever they do and
the pod balancer just continues to work
right so it moves some other shops from
pod 14 over to some of the other pods
but the complexity increases because you
also have sign ups you have new shops
coming in and you have other shops
growing and the pod balancer just keeps
working at it the pod balancer keeps
trying to fill up all these pods to
equal size before we had this we would
have some pods that were maybe twice as
big as others giving you a bit of a
noisy neighbor problem where on some
of these shards you were
experiencing more incidents than before
so what happens now when they're all
filled up
well you create a new pod and we're
working this year on making that really
easy and the pod balancer is simply making
a decision to create a new pod and then
moving the shops over
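
One way to picture the balancing loop is the greedy sketch below, written in Python. It is only an illustration under assumptions: the talk does not describe the actual algorithm, and the `rebalance` function, the "shop units" size metric, and the rule of always moving a shop from the heaviest pod to the lightest one are all choices made for this example.

```python
from typing import Dict, List, Tuple

Pods = Dict[str, Dict[str, int]]  # pod name -> {shop name: size in shop units}


def rebalance(pods: Pods, max_moves: int = 1000) -> List[Tuple[str, str, str]]:
    """Greedily move shops from the heaviest pod to the lightest pod."""
    moves: List[Tuple[str, str, str]] = []
    if not pods:
        return moves
    for _ in range(max_moves):
        load = {pod: sum(shops.values()) for pod, shops in pods.items()}
        heavy = max(load, key=lambda p: load[p])
        light = min(load, key=lambda p: load[p])
        gap = load[heavy] - load[light]
        # Only shops smaller than the gap make the two pods more even;
        # prefer the one whose size is closest to half the gap.
        candidates = [
            (abs(gap - 2 * size), shop)
            for shop, size in pods[heavy].items()
            if 0 < size < gap
        ]
        if not candidates:
            break  # no single move improves the balance any further
        _, shop = min(candidates)
        pods[light][shop] = pods[heavy].pop(shop)
        moves.append((shop, heavy, light))
    return moves


pods = {
    "pod2": {"a": 1, "b": 1, "c": 1, "d": 1},
    "pod7": {"e": 1, "f": 1},
    "pod14": {"big": 3},  # one shop occupying three shop units
}
moves = rebalance(pods)
loads = {pod: sum(shops.values()) for pod, shops in pods.items()}
```

A real balancer would also have to weigh sign-ups, growth, and when to create a new pod, which this sketch leaves out.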
we want it to be something that is always on
and something that we never have to
touch it just creates pods continuously
this is not something that is perfect
yet and there's
still some manual labor involved here
but this is where we're going and it's
going quite well so far I want to just
talk a little bit about how we move
those shops from shard to shard how can
we do that with minimal downtime
so if we just look at two of the
data stores here like Redis and
MySQL we can imagine if we want to move
a shop from pod 9 to pod 23
the very naive way of doing that is
that you copy the shop by going through
every single table and you find all the
records that belong to the shop with a
simple query like this give me all the
orders where the shop ID is this give
me all the products where the shop
ID is this put it into the target shard
and while that is happening you don't
accept any writes because the problem is
that if you have a shop and while you're
copying that shop it's inserting a checkout
but you've just copied the checkout
table that checkout is going to be lost
so you need to basically lock the shop
while you're copying it for a small shop
that's not a problem that may just take
a minute maybe less than a minute but for a large shop
this can take a very long time if you
don't allow any writes while you're
copying it what we're working on now is
not doing this lock move unlock but
instead we look at the MySQL binlog
and stream these events as we're copying
the shop to avoid locking the shop and
having downtime while we're copying it
the binlog is in MySQL
vocabulary where every modification to
the database is logged so if you change
a row insert a new check out it is
registered in the binlog as events you
know insert a new check out for this
shop insert a new product for this shop
so what the pod balancer is
going to do and we're starting to work on now
is that the pod balancer is going to
stream the binlog and look at new
events that are coming in that haven't
been copied at the bottom so the bottom
tries to copy all the old data and
the binlog tailer component of the
pod balancer is replicating new events
so old at the bottom new at the top this
means that we don't have to lock the
shop for an extended duration of time
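
The copy-while-tailing idea can be simulated in plain Python. This is a toy model under stated assumptions, not the real tool: rows are a dictionary, the binlog is a list of `(operation, row_id, value)` tuples, and the function name `copy_shop_without_lock` and the `writes_during_copy` parameter are invented for the example. A real implementation has to deal with ordering between the bulk copy and the log stream, which this sketch sidesteps by always reading the current source value.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, str]  # (operation, row id, new value)


def copy_shop_without_lock(
    source: Dict[int, str],
    binlog: List[Event],
    writes_during_copy: List[Tuple[int, str]],
    batch_size: int = 2,
) -> Dict[int, str]:
    """Bulk-copy old rows in batches while replaying new binlog events."""
    target: Dict[int, str] = {}
    position = len(binlog)        # start tailing from the current end of the log
    pending = list(source)        # row ids that exist when the copy starts
    live_writes = iter(writes_during_copy)

    while pending:
        # "Old at the bottom": copy the next batch of pre-existing rows.
        batch, pending = pending[:batch_size], pending[batch_size:]
        for row_id in batch:
            target[row_id] = source[row_id]

        # Simulate the shop staying open: a write lands on the source
        # and is recorded in the binlog.
        write = next(live_writes, None)
        if write is not None:
            row_id, value = write
            source[row_id] = value
            binlog.append(("upsert", row_id, value))

        # "New at the top": replay any events we have not applied yet.
        while position < len(binlog):
            _, row_id, value = binlog[position]
            target[row_id] = value
            position += 1
    return target


source = {1: "order-1", 2: "order-2", 3: "order-3"}
binlog: List[Event] = []
target = copy_shop_without_lock(
    source, binlog, writes_during_copy=[(4, "checkout-4"), (2, "order-2-paid")]
)
```

In this run the new checkout (row 4) and the update to row 2 both arrive while the copy is in progress, and the target still ends up identical to the source without the shop ever being locked.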
now when that's done and we can now
imagine this is a
discrete moment in time where the shop
is completely replicated now we need
to move it to the new pod by updating
the routing information about it so
first we lock the shop no writes can
come in we interrupt all jobs all web
requests that are going on for that shop
for a brief couple of seconds then we go
into a routing data store the data store
that is storing the mapping
from shop to pod and from pod to data
center and we update the shop's pod ID in
there immediately this is reflected and
things start going to the new pod we
unlock the shop and the shop has been moved
then we remove it from the MySQL
layer we move the jobs and other things
that are in Redis over as well which are
asynchronous and now the shop has been
completely moved you multiply this by
running a thousand of these in parallel
and suddenly you can move a lot of shops
to a lot of pods very very rapidly and
you also add on top of this making
really good decisions about which shops
you're moving at what time to keep all
those shards balanced so now we've
talked a bit about traffic we've talked
a bit about how the application tier
works with these pods now I want to talk
briefly about how we connect these
two how does the traffic layer know how
to route the request to the correct pods
for this we have a very well named
component called Sorting Hat what Sorting
Hat does is that it looks at requests
that are coming in to the Shopify
network and tries to figure out which
pod that request belongs to it needs to
go to the region where that pod is
active and it needs to work with things
like failovers and shop moves and all of
these different components this is a
very very core part of the podding
architecture and one of the first things
that we worked on so again we have the
traffic layer and we have another
OpenResty Lua module called Sorting Hat what
sorting hat does is that it looks at
these it knows which pods are active and
which pods are inactive so in a region
a pod is active or inactive there's
always a hot replica pair so that pod
2 can become active in region A if
there's a catastrophe or we need
to do some maintenance in that region
and of course the other way around too
so they're always set up in pairs then
we have this routing database from
before that we use when we move shops
Sorting Hat is the consumer of that it
is the only thing that queries it and
asks about where shops are living so if
you have a customer coming in getting
their products for sneakershop.com
Sorting Hat is going to ask the routing
layer to route sneakershop.com
sneakershop.com is gonna return back a shop ID
it is active in region B so Sorting Hat
then routes the traffic there this is
very simple it's I think a couple
hundred lines of Lua querying this data
store and doing this thousands and
thousands of times a second this is what
all what ties this whole part potting
architecture together and makes the
traffic layer marry the data and
application layer the last thing I want
to talk about is how we fail these
regions over right we have the active
pod and we have the inactive pod how do
we flip them over and how do we do that
with minimal disruption for this we have
the pod mover the pod mover moves pods
between regions with minimal
downtime we're now at a point where
we're getting very very close to having
almost zero downtime regional failovers
this is really important for us because
we think this is such a core thing that
if someone wants to do maintenance in a
cloud region without any
repercussions they should be able to
move all the pods to another region do
their maintenance and move them back
without any risk this avoids this
problem of having very expensive staging
production environments that emulate an
entire data center when instead you can
just get really really good at moving
pods between data centers so what the
pod mover does is that if region B is
having some kind of incident it may be
out of power we may be doing
maintenance it may be subject to a
hurricane it may be subject to flooding
whatever happens we need to be able to
move that region over this is the pod
mover's job and it can do this very very
rapidly the way the pod mover works is
that it goes through a series of steps
to fail over a region the first thing it
does is disable cron in both regions so
we're not doing new periodic jobs the
second thing it does is that it goes
back to that routing data store the data
store that keeps all the information
between shop to pod to region mappings
and where the
pod balancer was looking and modifying the
shop to pod mapping
the pod mover is responsible for the pod
to region mapping so it updates it what
happens now is Sorting Hat that thing
that was looking at requests and routing
them to the appropriate regions routes
the request to the new target region
we then fail over MySQL to the target
region that little animation from
before we then enable cron and we
transfer the jobs that have not yet been
executed in the source
to the target to then be executed
later it's fairly simple at a high level
of course the devil is in
the details here but this is at a high
level what happens when we do a pod move
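The steps just described can be sketched as one sequential script. This is a minimal illustration in Python, not Shopify's implementation: the `Region` class, `move_pod`, `route`, and the dictionary standing in for the routing data store are all hypothetical names, and re-enabling cron only in the target region is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for the state a region holds for one pod."""
    name: str
    cron_enabled: bool = True
    mysql_writable: bool = False
    pending_jobs: list = field(default_factory=list)


def move_pod(pod, source, target, pod_to_region, log):
    """Move one pod from source to target as a series of sequential steps."""
    # 1. Disable cron in both regions so no new periodic jobs start.
    source.cron_enabled = False
    target.cron_enabled = False
    log.append("cron disabled")

    # 2. Update the pod-to-region mapping in the routing data store.
    #    Sorting Hat reads this mapping, so requests now go to the target.
    pod_to_region[pod] = target.name
    log.append("routing updated")

    # 3. Fail MySQL over: the target becomes writable, the source does not.
    source.mysql_writable = False
    target.mysql_writable = True
    log.append("mysql failed over")

    # 4. Re-enable cron (assumed here: in the target region only).
    target.cron_enabled = True
    log.append("cron enabled")

    # 5. Transfer jobs that were not yet executed in the source.
    target.pending_jobs.extend(source.pending_jobs)
    source.pending_jobs.clear()
    log.append("jobs transferred")


def route(shop, shop_to_pod, pod_to_region):
    """Sorting Hat style lookup: shop -> pod -> region."""
    return pod_to_region[shop_to_pod[shop]]
```

Between steps 2 and 3 the target database is not yet writable, which is the window where requests can fail.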
so what about the errors right I
mentioned these I mentioned that we
wanted to minimize the errors
when sorting hat starts to route to that
inactive region the database is not yet
writable there if someone is doing a
check out they're gonna have that
dreaded experience that everyone in this
room has probably tried you're checking
out and you get an error back when
you've just entered your credit card
details and now you have to call someone
and ask what's going on we don't want
that scenario so we sat down and thought
about how we could fix this again we
took out our hammer and nail OpenResty
and solved this problem if you have
MySQL replicated in two data
centers they can't both be writable at
the same time one will be writable one
will be readable there will be a couple
seconds of downtime where people have
these terrible experiences but what if
you can pause the requests at the
traffic tier if you can pause the
requests while they can't be served in
the new region and you know that they're
going to do a write you can do basically
zero downtime failovers so the pauser
module will pause requests in the middle
of failovers to avoid serving errors so
a customer comes in they're trying to
create a new checkout during a failover
during those critical seconds where the
database is not yet writable in the new
region so we add them into a queue we
drain that queue through a throttle to
avoid too many people coming in at the
same time and the throttle is of course
disabled while we're failing over the
region and MySQL has not failed over yet
then the pauser will finally give you
an HTTP 200 response later so
instead of giving you this dreaded
error page when you've entered in your
credit card details you're just waiting
five seconds this works now and it's in
production for us
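The pausing idea can be illustrated with a small simulation. This is a sketch under stated assumptions, written in Python rather than as the OpenResty module the talk refers to: the `Pauser` class, its `handle` and `tick` methods, and the `drain_per_tick` throttle setting are hypothetical names chosen for illustration.

```python
from collections import deque


class Pauser:
    """Holds write requests while a failover is in progress (illustrative only)."""

    def __init__(self, drain_per_tick):
        self.paused = deque()             # requests held back during the failover
        self.failover_in_progress = False
        self.drain_per_tick = drain_per_tick  # throttle: max releases per tick

    def handle(self, request_id, is_write):
        """Return a status code now, or None if the request is being held."""
        if self.failover_in_progress and is_write:
            self.paused.append(request_id)
            return None
        return 200

    def tick(self):
        """Release held requests through the throttle; nothing drains mid-failover."""
        if self.failover_in_progress:
            return []
        released = []
        while self.paused and len(released) < self.drain_per_tick:
            released.append(self.paused.popleft())
        return released
```

A held request is answered only after it is released, so the customer sees a slow response rather than an error. This sketch lets new writes through immediately once the failover ends, even if older requests are still queued; a real implementation would have to decide how to order them.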
so this changes the pod mover flow a
little bit all this stuff is the same
but sorting hat when it routes requests to
the target region it inspects the request
and makes a good judgment on whether or
not this is a request that would perform
a write if it will it will hold the
request back at the load balancing tier
then it will fail over MySQL
resume the requests enable cron and then
transfer the jobs really really simple
and this has gotten us down to very very
few errors during failovers we want to be
so confident in these failovers that people
can just issue a command in slack and
fail over a region we already have this
working but we're not quite confident
enough yet to do it in production pods
but it will be just weeks before we're
there you type in a command and then you
fail over a pod you don't have to care
about the customer disruption because
there should be none and we work as hard
as possible to make that true this is a
core primitive of the pod architecture
if there's an
incident someone should just be able to
come in in the middle of the night fail over all
the pods and go back to bed maybe some
day this will be automated but this
is still a tremendous step from where we
were years ago the last thing I
want to note is just another
one of these
second-order benefits of this pods
architecture right now we're starting to
experiment more with the cloud and
with this architecture it's just a
matter of putting
a couple pods in the cloud of course
there's a lot of work going on to get
all these stateful tiers and the
stateless tiers in the cloud but when we
have it there we can just gradually move
shops there without disrupting the rest
of the platform so we have a little gas
pedal where slowly we move shops to the
cloud and as we gain that operational
expertise with the cloud we can move all
of Shopify over there this really marks
the end of my talk I hope you have a
pretty good idea now of how Shopify
works at a high level and maybe some of
this you can take home and think
about maybe some of this will be
applicable and especially if you're a
SaaS company a lot of this should be
applicable I've talked about all these
components the pods the pod balancer the
pod mover and sorting hat thank you
really cool talk thank you Simon
there's also some really good
questions and one of them is I
understand how you move pods but how do
you move them if the region is down or
unreachable yeah quick question so okay
let me go back to this slide because
it's quite close here so the
question is what happens when
region B here is on fire that region is
unreachable how do you fail it over well
there's certain things you can't do but
you can do checks right so the problem
one of the problems would be well the
database is not completely caught up in
the target in the new region so you
have to make a judgment call right at
this point that process is still
manual and it will tell you hey you're
going to lose this much data if you
perform this failover are you sure you
want to do it so the pod mover is as
resilient as possible best case it will
do a pod move with all the jobs all the
data moved but in cases where it can't
it will ask you questions it will ask
you are you okay with losing this much
data are you okay with not transferring
jobs and it will still perform the best
effort thing that we can do cool
there's a question about like it
sounds like the pod mover is run in
multiple instances how do you avoid
race conditions between them and how to
avoid smaller shops being moved all the
time yeah I think the slide is
appropriate to show that so when we
talked about the pod mover initially
what we wanted to do is well okay let's
design it so that when you update the
routing information there's a bunch of
subscribers one that's responsible for
jobs one that's responsible for stopping
cron one that's responsible for failing
over one that's responsible for sorting
hat and then they all subscribe to just
like a kind of pub sub system but
distributed coordination is
very difficult so we didn't want to do
it it's too complex and the ROI is just
not high enough so this is literally one
script that is running on a machine
somewhere right and which machine it
will run on is of course resilient to
which regions are down and things like
that but it's one machine doing a bunch
of sequential steps that is resilient
sounds like that's how you would avoid that
another question
here is the customization of nginx
a task for feature developers
or operations the nginx task the
OpenResty oh yeah okay
so we have it's primarily been a place
where infrastructure engineers have made
modifications to the OpenResty layer but we
have had some application engineers
do various modifications for example to
redirect fractions of traffic and do
tests and things like that
so I would say it's probably like 90%
infrastructure engineering at this tier
and 10% product engineering
there's another question about like how do
you implement multi-tenancy within a
pod yeah okay
this is almost a talk in itself but
sharding is essentially we shard by the
key so the shop ID and every time you do
a query you were always in the context
of the shop so you can imagine that if
you're doing a web request or a job you
can only query data that is belonging to
that shop because every query is
appended with the shop ID so it does
like select all from orders where shop
ID is this or select product where ID is
this and shop ID is this so they're all
in the same schema we
don't run hundreds or thousands of
schemas because that's not something
that that's not a very normal workload
for my sequel and we want the workload
to try to be as normalized as possible
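As a rough illustration of queries that are always scoped to a shop inside one shared schema, here is a small Python sketch. It uses the standard library's sqlite3 as a stand-in for MySQL, and the `ShopScope` class, the `orders` table, and its columns are hypothetical; the talk does not describe how the application actually enforces the scoping.

```python
import sqlite3


class ShopScope:
    """Runs queries that are always constrained to one shop_id (illustrative)."""

    def __init__(self, conn, shop_id):
        self.conn = conn
        self.shop_id = shop_id

    def orders(self):
        # Every query carries the shop_id condition.
        rows = self.conn.execute(
            "SELECT id, total FROM orders WHERE shop_id = ? ORDER BY id",
            (self.shop_id,),
        )
        return rows.fetchall()

    def order(self, order_id):
        # Looking up by primary key still includes shop_id.
        row = self.conn.execute(
            "SELECT id, total FROM orders WHERE id = ? AND shop_id = ?",
            (order_id, self.shop_id),
        )
        return row.fetchone()


# One shared schema for all shops; each row records which shop owns it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, shop_id INTEGER NOT NULL, total INTEGER)"
)
conn.execute("CREATE INDEX orders_by_shop ON orders (shop_id)")
conn.executemany(
    "INSERT INTO orders (id, shop_id, total) VALUES (?, ?, ?)",
    [(1, 10, 500), (2, 10, 750), (3, 20, 120)],
)
```

With this setup, a scope for shop 20 cannot read order 1, which belongs to shop 10, even when it asks for that order by its ID.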
that's the very high level overview
one other thing I'll note is that we do
this sharding at the application level
we don't have a proxy or something like
that when we implemented sharding that's
not something we were comfortable with
so it's application level sharding
sharded by the shop ID key the schema is
shared between the shops on the pod and
every single table has a shop ID key
embedded in it so that we can do these
queries efficiently really cool
I think one of the two last questions
there's a
question about Kafka is it
something that would replace the MySQL
binlog for moving shops I'm not
entirely sure how to interpret that
question but I'll give it my best if the
entire MySQL binlog went into Kafka can you
use that instead the answer is yes we
currently don't relay the binlog into
Kafka not sure exactly why it's not
something that we've really found a use
for tailing the binlog works really
really well for us right now moving to
Kafka maybe that's something that we
would do if there were
other use cases for it but right now I
don't really see that the ROI on
using Kafka over the binlog is
very high and I think the last question
somebody asked are you looking into
yes yes like those little cloud
regions there on the right that's
series in production yes okay okay I
think that was it give Simon
a hand and remember to rate in the app
a few fingers good and also check out
Simon's blog he has a really cool blog
where he writes about a lot of this and google
his name and you will find it it's
really good but give him a hand thank you