Self-Service Infrastructure

Daniele Sluijters

Recorded at GOTO 2016


Alright, so I'm Danny. I'm a site reliability engineer for Spotify. We like to joke a lot of the time that our job is actually much more about site unreliability than the actual reliability of it. A lot of times when you hear the term site reliability, people go: okay, so you're one of the ops people who just sits there and takes care of the fires. Traditionally that would have been true. In our case, most of our infrastructure developers are what we call site reliability engineers. So our job is, sure, to help fight a fire every now and then, but usually it's to build infrastructure and infrastructure services. What we do at Spotify is try to provide services for other teams that build their own services, so they can manage their own capacity, so they can request more capacity, so they can do their own thing, essentially.

What we don't want to be is Gary. Gary is that person everyone has had in a company with physical hardware: you'd call him up and go, "Gary, I want seven servers," and Gary on the other end of the line goes, "Pfff, okay, I can do this in three weeks for you." Then you wait for three weeks and Gary comes back with seven servers, and they're assigned to you for the lifetime of those servers or your lifetime, whichever expires first. That's something that just doesn't scale very well. People do things, people build stuff on top of each other, they just go for it, and you end up with Gary being the little boat around there trying to put out the fires while people mess around with their configurations and do all kinds of stuff. What we wanted to do when we originally started was make sure that people can very easily go: okay, I want five servers, clickety, and 20 minutes later we're ready to go.

Essentially you want to go from this complete chaos to a world where you just have happy developers using a very simple tool to do exactly what it is that they want to do, which in this case is manage their capacity.

There are two very important things that we figured out when we design infrastructure tools. The first is what we call "the common case is frictionless". If what we want to provide is the ability to manage their own capacity, then the tool needs to do exactly that. It needs to be a thing where you go: I have service X, I want capacity Y, I want to click a button, and I want the service to deal with it. The fact that the service can also ask you about, hey, what's your rack diversity, what color front panel do you want, all those kinds of things: it's cool if it can do that, but that's probably not the primary use case. So make sure that the end thing you're offering is actually a very pleasant, very simple experience that does that one thing and does it very well. That said, just because it does the one thing really well and really easily doesn't mean that it should prevent you from doing everything else. It should still be possible; it just shouldn't necessarily be the default path that you take.

One of the other things that's been very interesting is that sometimes it's totally fine to introduce a bit of friction in the paths that are possible but not desirable. It tends to be that if you make something a tiny bit more complicated, even if it's an extra click or two, people might just go: well, it's not worth it, let's just use the default way and adapt to that. So you get to a much more similar, homogeneous environment.

Then the thinking was: the management system is solved, you don't need to deal with hardware anymore, people just go into the Google Cloud console and magically they'll manage their capacity that way. And we were like: awesome, so we can throw the system away. Then we went into the Google Cloud manager and looked at it, and this is kind of what we saw. This is what you need to do to start up a Compute Engine instance. Now, the defaults that are there don't actually work for us, because we have our own base image and other things.

And then there's the second column, the second part of the page, and there's actually a third one, and then there are all the different little sub-tabs where you need to set things like firewall rules, metadata, and additional things that you might need. So we figured: well, that doesn't work, because what you end up with is people going nuts figuring out all these options, and then you end up with 300 or 400 different instances, each configured slightly differently for every one of your things. We don't want that.

So instead we went back to our system and upgraded it a bit, and what you get is the capacity tab in our microservices dashboard. Spotify is built up of about six, seven, eight hundred (I've lost count at this point) microservices. Every team owns one or more of these, and they manage capacity for these services individually.

So what you can do here is very simply say: okay, for my service, in this case the pod service, which is a magical JSON endpoint that returns information about pods, I want capacity in the little blurred-out thing (which I wasn't actually allowed to show you), which is basically the sites and things that we have available. I want it in a pool, and the pool is just a grouping: if you have two different types of instances that you want, you can group them in different pools and treat them as such. You select the type of instance, you select how many of them you want, you click create, and the pool manager will just go: okay, I'll deal with this for you.

If this is on Google Cloud, it will be nearly instant: it'll take less than a minute for your instances to be available. You see the ticker go up and you get all your instances. If this is on premise (because this also works for on-premise, including the whole instance group kind of thing, because we ported that back over), it'll take somewhere between 20 and 25 minutes for all your servers to come online, because our provisioning system is actually fairly smart about it. It does things in parallel, so if you ask for seven servers it takes 25 minutes and not seven times 25 minutes.
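
To make the point about parallel provisioning concrete, here is a minimal Python sketch. It is not Spotify's pool manager: the `CapacityRequest` fields simply mirror what the capacity tab is described as asking for (service, site, pool, instance type, count), and `provision_instance`, `provision_pool`, the example values, and the simulated delay are all hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapacityRequest:
    """The handful of fields the capacity tab asks for (names are assumed)."""
    service: str
    site: str
    pool: str
    instance_type: str
    count: int


def provision_instance(request: CapacityRequest, index: int, delay: float) -> str:
    """Hypothetical stand-in for installing one machine; it only sleeps."""
    time.sleep(delay)
    return f"{request.service}-{request.pool}-{request.site}-{index}"


def provision_pool(request: CapacityRequest, delay: float = 0.2) -> List[str]:
    """Provision every instance of the request concurrently, one worker each."""
    with ThreadPoolExecutor(max_workers=max(1, request.count)) as executor:
        futures = [
            executor.submit(provision_instance, request, i, delay)
            for i in range(request.count)
        ]
        # Collect results in submission order so names stay predictable.
        return [future.result() for future in futures]


if __name__ == "__main__":
    req = CapacityRequest("pod", "site-a", "default", "standard", 7)
    started = time.monotonic()
    hosts = provision_pool(req)
    elapsed = time.monotonic() - started
    print(f"{len(hosts)} instances in {elapsed:.1f}s")
```

Because every install runs in its own worker, the total wall time is roughly that of the slowest single install rather than the sum of all of them, which is the behavior described for the seven-server case.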

We talk a lot about these services and these things, but sometimes all you want to do is build a simple CLI. Sometimes the whole web service isn't necessary. Sometimes all it takes is just: okay, I want to do this one thing, just provision machines for a service, and that's it. But a lot of the time what we run into is CLIs that are extremely complicated, where you type a thing and it goes "no, wrong arguments", and you type help and you get three pages' worth of options. You need to set: okay, I have a service, and I have a location, and I have a pool, and then I need to care about the types and the cores and this and that, and you end up specifying a whole command line. Super cool if you can do that, but what people just want to do is: hey, I have this service, I want X more capacity, I want it in this location maybe, and just deal with it for me.

So at that point we just went: okay, we'll look up your current service, we'll figure out what kind of instance types you're using, we'll figure out a bunch of other things, you've told us where you want it, and we'll spin it up for you.
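
A CLI in that spirit could look roughly like the Python sketch below. This is an illustration under assumptions, not the actual Spotify tool: the `capacity` program name, the `--site` flag, the `plan` function, and the hard-coded `CURRENT_POOLS` lookup are all hypothetical. The point is only that the user supplies a service and a count, and everything else is filled in from what the service already runs.

```python
import argparse
from typing import Dict, List, Optional

# Hypothetical record of what each service already runs. A real tool would
# query an inventory system instead of a hard-coded dict.
CURRENT_POOLS: Dict[str, Dict[str, str]] = {
    "pod": {"site": "site-a", "pool": "default", "instance_type": "standard"},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="capacity", description="Add capacity to a service."
    )
    parser.add_argument("service", help="name of the service to scale")
    parser.add_argument("count", type=int, help="how many more instances")
    # The location is optional: by default the current site is reused.
    parser.add_argument("--site", help="override the site to provision in")
    return parser


def plan(argv: Optional[List[str]] = None) -> Dict[str, object]:
    """Turn a short command line into a fully specified request."""
    args = build_parser().parse_args(argv)
    try:
        current = CURRENT_POOLS[args.service]
    except KeyError:
        raise SystemExit(f"unknown service: {args.service}")
    return {
        "service": args.service,
        "count": args.count,
        "site": args.site or current["site"],
        "pool": current["pool"],
        "instance_type": current["instance_type"],
    }


if __name__ == "__main__":
    # Equivalent to running: capacity pod 3 --site site-b
    print(plan(["pod", "3", "--site", "site-b"]))
```

The less common choices stay reachable through optional flags, while the common case needs only two positional arguments.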

In the end, what we end up with is a bunch of very simple tools that people can use to manage their own capacity. There is no more Gary; there's no person that does this for you. And there aren't any tutorials to follow either. It's not like we have a thing that goes: okay, you need to click through these things in Compute Engine. No, you have a web interface, part of the thing where you manage your service, that very simply allows you to scale capacity up and down. If you want more, you just increase the little counter for one of your instance pools; if you need less, you just reduce it and we'll take care of killing the instances for you.

And the clock is blinking at 30 seconds, so I guess we've got some time for questions.

So how do you determine that something is not being used? Are you actually monitoring network traffic or something like that?

There's a few things. Our microservices dashboard has a lifecycle management part to it, so once a service goes deprecated we can actually automatically clean up, because no one should be using it anymore. It's the responsibility of the team to, at that point, have dealt with their stakeholders and made sure that no one is using this, so we can safely remove it. The other one is actually where we're looking at whether the instances are doing anything, looking at whether network traffic is going there.
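
The two signals mentioned in that answer, lifecycle state and observed activity, can be combined in a few lines. The Python sketch below is hypothetical throughout: the `Instance` fields, the `"deprecated"` state string, the `bytes_last_day` metric, and the threshold are illustrative assumptions rather than details from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instance:
    name: str
    service: str
    bytes_last_day: int  # hypothetical network traffic metric


def cleanup_candidates(
    instances: List[Instance],
    lifecycle: Dict[str, str],
    idle_threshold_bytes: int = 0,
) -> Dict[str, List[str]]:
    """Split instances into those safe to remove and those worth a look."""
    result: Dict[str, List[str]] = {"remove": [], "review": []}
    for inst in instances:
        if lifecycle.get(inst.service) == "deprecated":
            # The owning team has declared the service unused.
            result["remove"].append(inst.name)
        elif inst.bytes_last_day <= idle_threshold_bytes:
            # Still live on paper, but nothing is talking to it.
            result["review"].append(inst.name)
    return result


if __name__ == "__main__":
    fleet = [
        Instance("old-0", "legacy", 120),
        Instance("pod-0", "pod", 5000),
        Instance("pod-1", "pod", 0),
    ]
    print(cleanup_candidates(fleet, {"legacy": "deprecated", "pod": "production"}))
```

In this sketch idle instances of a live service are only flagged for review, since low traffic alone does not prove that nobody depends on them.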

Are you planning on open sourcing any of these tools?

Yeah, so our provisioning systems are currently on the roadmap. There's a few Spotify-specific things in there that we need to get rid of before we can open source them, but it's being worked on.

Cool, thank you.