Keeping Your Cloud Footprint in Check

Coburn Watson

Recorded at GOTO 2015

So I'm Coburn Watson, glad to be here. I
work for Netflix, and I'm going to talk
about a large cloud footprint and how to
keep it in check. If you can, please rate
me; I'd love to hear what you think,
whether you think it went well or poorly.
I manage the performance and reliability
team at Netflix. I've been there for
about three years; I worked with Adrian
there for a period of time. My
organization handles a fair amount of
things in the area of performance and
reliability, primarily on the cloud
component. We reduce time to detect and
time to resolve, so our major goal is to
improve the availability of the service
for Netflix end users. Does anybody use
Netflix here? All right, higher than I
thought. I don't know, there are people,
but we have plans. As a team we build a
lot of tooling, and on my performance
team we optimize usage of the cloud. I'm
going to talk today from a lean
perspective primarily around efficiency
of our cloud footprint and some of the
strategies we have there. I have a team
that steers global traffic, so we also
own a number of the monkeys down here.
There's Chaos Kong, that's the big one,
and then we wield Chaos Monkey within the
team. Kong is where we actually take
traffic and shift it from one AWS region
to another. We do that about once a
month: we run it through peak and
then we fail it back, and we use a
combination of rerouting and DNS to do
that. We also use that to inject chaos in
the environment. Yesterday in Papers
We Love there was a discussion about
failure injection, and we do that quite a
bit: we actually exercise our
services on a daily basis to make sure
that we can short-circuit them out of
there, and I'll cover that. We also
drive best-practice adoption, but I'm not
talking about that today. As a company
we're approaching 70
million subscribers in over 50 countries.
Back to our value as a company, which is
fairly commonly shared internally: we
want to win moments of truth so when you
sit down in your living room and you
have the choice of selecting something
to watch we would hope that you select
Netflix and that's that's our goal at
the end of the day is to make it a very
great experience in terms of content for
you to drive to and it's great to make
money at it as well
We do about 3 billion hours of video a
month right now. We have a pretty big
cloud footprint; I'm going to talk about
that. We have our own homegrown CDN, so we
deliver about 99%, a little more, of our
content off of a FreeBSD-based appliance
that we create, and we give
it to people if they want it. In the
U.S. I think we're close to thirty-seven
percent of the Internet traffic at night.
We just passed porn, which is a big
milestone for us; got to keep
that difference there. And we
have a strong original slate. From a
content perspective I enjoy a lot of
these, and we continue to release more
and more shows. We're really seeing
more and more originals that are
region-specific, and then we try to get
global rights. We just released one
from Mexico, as an example, that's
global, which I think is a great show.
Our open source offerings are a big
focus of what we do as a company; I
picked a bunch of them
out here. Atlas is our large
monitoring framework. It collects
somewhere on the order of about two
billion time series a minute from our
cloud and makes it queryable.
There's usually about a one to two
minute delay, and you get
response times in about three seconds
for most of your queries, so it's very
efficient from that perspective. If
people have heard of Asgard, that's sort
of our Amazon Web Services console
replacement for infrastructure as a
service. There's going to be some great
news this fall, knock on something:
we're going to be releasing a new open
source project, which is our continuous
delivery framework. It takes Asgard, but
it's like on steroids. It's really
awesome, so keep your eyes out
for that. Ice is where we look at our
cost. We get a big Amazon bill each
month; I think at the end of the month
it's somewhere on the order of maybe
400 million lines. We consume it
and turn it into nice pretty
graphs and data to understand where our
cost is. Vector is a performance tool my
team created, which gives on-host,
high-resolution metrics. Check it out,
it's a lot of fun. And Hystrix is our
framework for dependency commands. We
talked about letting services fail;
Hystrix is the framework that lets you
open circuits when a service is
misbehaving.
I'm just going to pick up a water here.
When you open a circuit, you stop talking
to that service and you throw a fallback.
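A minimal sketch of the open-circuit-with-fallback idea, written in Python rather than the Java that Hystrix itself uses. The class name, the thresholds, and the two row functions are hypothetical and chosen only for illustration; this is not the Hystrix API, and a production breaker would probe the service more carefully before closing the circuit again.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: after repeated failures, skip the call and use the fallback."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Simplification: after the cool-down, close the circuit completely.
        if self.clock() - self.opened_at >= self.reset_after_s:
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()  # circuit open: stop talking to the service
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def personalized_rows():
    # Hypothetical dependency that is currently misbehaving.
    raise TimeoutError("personalization service is misbehaving")


def unpersonalized_rows():
    # Hypothetical fallback content; the row titles are placeholders.
    return ["Popular on Netflix", "Trending Now"]


breaker = CircuitBreaker(max_failures=2)
rows = [breaker.call(personalized_rows, unpersonalized_rows) for _ in range(3)]
print(rows[-1], "circuit open:", breaker.is_open())
```

The first two calls fail and fall back; the third never reaches the failing service because the circuit is already open.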
I don't have any examples here, but a
good example would be: you see your
personalized rows, and when that service
has a problem you see unpersonalized
rows. That lineage-based fault injection,
we're actually working with
one of those professors from
Berkeley on that right now. So I want to
talk about our priorities as a company. I
think in general, when you look at your
engineering efforts, you have priorities
in your organization. If you're a
keep-the-lights-on organization, maybe
you have a product you just want to
maintain, sustain, and continue to get
maintenance revenue off of. When I was at
HP that was a number of our products. I
don't think a lot of you are in that
space. At Netflix we prioritize in this
way, from top to bottom. Innovation is
really our foundation. We feel that if we
don't innovate quickly (think about
Adrian's discussion of hypothesis-based
testing; there are 500 tests, a thousand
tests going on every day), we're not
going to win in the space. That's key to
us, and probably eighty percent of our
engineering investment lies in
innovation. Reliability is on top of
that: if the service isn't available,
it's not much use. And then efficiency is
up there at the top. I sort of hang out
in the upper two; I don't spend as much
time in the lower innovation area. There
aren't many of us that spend time in the
efficiency space, which is where I think
we're fairly lean, so we try to do it
behind the scenes, and I'll go into that
a bit. But the organization recognizes
that these are both important, and so
everybody is a stakeholder in the game:
freedom and responsibility. It's just
that there are fewer of us trying to move
the needle forward on those. Those two on
the bottom there, innovation and
reliability, have costs associated with
them, and that can lead to significant
inefficiency. So here's a little
screenshot of our
new continuous delivery tool. We make
capacity available on demand for all
Netflix engineers. There's a little box
that pops up; if you want a thousand
instances you just type in a thousand
instances and they show up in production.
Our commit-to-cloud goal is to let people
get their
code out there in a matter of
We have this gigantic single production
account. It allows us to maximize reuse,
which I'm going to talk about. It also
simplifies how we purchase capacity, and
by being in one big account it just
allows people to have the capacity they
want. We also burst into on-demand, so
I want to make sure that people
understand we depend on the elasticity
of the cloud. We have a certain
percentage we try to keep it under, but
in general, on any given day,
we're probably going into on-demand some
percent, and then we backfill on top of
that. So you might see that this could
actually introduce some fairly
significant cost risk if there weren't
some strategies to constrain it.
Reliability brings with it a lot of
costs as well. We have a red/black
push model, so when you push your code to
production through Asgard, or through our
new tool that's coming out, you'll
actually stand up a new version of your
code, direct traffic to it, and tear down
the old one. If something goes wrong you
just point the traffic back right away
and deal with it the next day. Having
that overlap of capacity, you need to be
efficient in it, but still it's
additional capacity. We have multiple
levels of redundancy in our system.
Within an AWS region we actually
over-provision across three availability
zones to support the loss of one of them,
so you have about fifty percent
additional capacity in each AZ, and that
again is more capacity you have to buy.
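The fifty percent figure follows from simple arithmetic: if N zones share the load evenly and the remaining N-1 must absorb everything when one is lost, each zone needs 1/(N-1) of the peak rather than 1/N. A small Python sketch of that calculation follows; the function name and the 3,000-instance peak are illustrative assumptions, not figures from the talk.

```python
def per_zone_capacity(peak_instances, zones, zones_lost=1):
    """Instances each zone must hold so the surviving zones can carry the full peak."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    baseline = peak_instances / zones       # even split, no redundancy
    required = peak_instances / surviving   # even split across the survivors
    overhead = required / baseline - 1.0    # extra capacity as a fraction
    return baseline, required, overhead


# Hypothetical peak of 3,000 instances spread across three zones.
baseline, required, overhead = per_zone_capacity(peak_instances=3000, zones=3)
print(f"baseline per AZ: {baseline:.0f}, required per AZ: {required:.0f}, "
      f"overhead: {overhead:.0%}")
```

With three zones the overhead is one half; with four zones it would drop to one third, which is why the number of zones matters for the cost of this kind of redundancy.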
And then one of the big spenders
is our global redundancy. As I talked
about, we exercise this capability once
a month, failing user traffic back and
forth between regions. This year we're
actually working on failing it across
the three regions equivalently by
Christmas time; that's our next
big goal. But when you think about it, we
purchase these heavy reservations to
cover our capacity. I call them
guarantees; I don't know if Amazon refers
to them as guarantees necessarily, but
when you're at our scale you want to have
the capacity there, so that when you fail
over it's available. So we actually have
enough capacity behind the scenes in the
regions to support failing our load from
any region to another at any given time.
It's something we pay for 24 by 7.
Now, purchasing heavy reservations has
some implications, in that you're
actually paying an upfront cost, usually
for like a three-year reservation, and
then you're paying a 24 by 7 fee whether
you use that capacity or not.
The reason we do that, besides
reliability, is it gets us probably a
70-plus percent discount, and I'm giving
a low end there. Compared to
on-demand it's a big financial savings
as long as you use it; otherwise you're
just paying for stuff you aren't using.
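To make the "as long as you use it" condition concrete: a reservation is paid for every hour, so with a discount d off the on-demand rate it only beats on-demand when the instance is actually in use for more than a fraction 1 - d of the time. The Python sketch below assumes the upfront fee is folded into one effective hourly rate; the $1.00 rate is a made-up placeholder, and 70% is the rough discount quoted in the talk.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cost_on_demand(hourly_rate, utilization):
    """On-demand is billed only for the hours actually used."""
    return hourly_rate * HOURS_PER_YEAR * utilization


def yearly_cost_reserved(hourly_rate, discount):
    """A heavy reservation is billed 24x7, used or not, at the discounted rate."""
    return hourly_rate * (1.0 - discount) * HOURS_PER_YEAR


def break_even_utilization(discount):
    """Utilization above which the reservation is cheaper than on-demand."""
    return 1.0 - discount


rate, discount = 1.00, 0.70  # placeholder hourly rate; rough discount from the talk
print(f"break-even utilization: {break_even_utilization(discount):.0%}")
for utilization in (0.10, 0.50, 1.00):
    on_demand = yearly_cost_on_demand(rate, utilization)
    reserved = yearly_cost_reserved(rate, discount)
    cheaper = "reserved" if reserved < on_demand else "on-demand"
    print(f"utilization {utilization:.0%}: on-demand ${on_demand:,.0f}, "
          f"reserved ${reserved:,.0f}, cheaper: {cheaper}")
```

Under these assumptions, a reservation that sits idle more than about seventy percent of the time costs more than simply paying on-demand for the hours used.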
So it's a bucket of cash falling over if
you don't keep an eye on it. Those are the
two big cost implications of how we
maximize our velocity for engineers.
Efficiency. So, at the top of the
mountain, I'm going to talk about how we
look at efficiency, how we set goals, and
some of the strategies we put in place to
control that. Actually, I should say not
control that: guide it along with
context. Having these
efficiency goals is really key. I've
obfuscated some of the metrics here, but
at the top we have our KPI, which
you can probably assume has
something to do with streaming. Or it
could be logging; we have a big logging
framework, and it probably tracks right
with it. Now, what we do is we decompose
costs; these are at, I think, the
director level, and this is based on Ice
data. We get the Amazon billing file, we
feed it through a system, it goes into
Hive, and we generate Tableau reports on
top of it. What this shows is your cost
contribution to the cloud costs overall
as a function of your key performance
indicator. You can see how it shoots
up, and then suddenly there's a big spike
in our activity and it goes
down, and it fluctuates a bit. This
gives individuals at the highest level
of the organization the context to
understand what percentage of the spend
they are, and whether it's tracking with
the business metrics or not. This gets
published weekly. There's also another
tab you can click on that will show
absolute cost, so you can see how much
you actually cost, so that's really handy.
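A rough sketch of the kind of report described above: each organization's share of total cloud spend, and its cloud cost per unit of the business KPI, week by week. Everything in this Python example is hypothetical (the organization names, the dollar amounts, the KPI values); it only illustrates the arithmetic, not the actual billing pipeline.

```python
from collections import defaultdict

# Hypothetical weekly billing rollups: (week, organization, cost in dollars).
billing_rows = [
    ("2015-W30", "edge", 120_000.0),
    ("2015-W30", "playback", 80_000.0),
    ("2015-W31", "edge", 150_000.0),
    ("2015-W31", "playback", 90_000.0),
]
# Hypothetical business KPI per week (for example, streaming hours in millions).
kpi_by_week = {"2015-W30": 700.0, "2015-W31": 720.0}


def cost_report(rows, kpi):
    """Return {week: {org: (share_of_spend, cost_per_kpi_unit)}}."""
    totals = defaultdict(float)
    for week, _org, cost in rows:
        totals[week] += cost
    report = defaultdict(dict)
    for week, org, cost in rows:
        report[week][org] = (cost / totals[week], cost / kpi[week])
    return dict(report)


report = cost_report(billing_rows, kpi_by_week)
for week in sorted(report):
    for org, (share, per_kpi) in sorted(report[week].items()):
        print(f"{week} {org:<9} share={share:.1%} cost/KPI=${per_kpi:,.2f}")
```

Plotting cost per KPI unit over time, rather than raw cost, is what shows whether an organization's spend is tracking the business metric or growing faster than it.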
There are a lot of discussions among VPs
saying, hey, look how much you
cost, look how much I cost. I think
it creates a little gaming
system there between the
divisions, which is good; it's
healthy competition. But usually, once
someone realizes that their cost profile
is not what they want it to be, the first
thing they do is turn to the
engineering managers under them and
say, what is going on here? Why are
we growing so fast? So this was our
first cut, and it
was pretty good, but what we
realized was that to be effective we had
to push that cost context down to the
engineering managers at the first level,
and their teams, and the services they
own, because they can't [unclear].
They would turn to their
managers and say, why did we grow thirty
percent over the past week or month or
what have you? So we leverage some Ice
data, and then internally we use our
Atlas monitoring system, which tracks
instances used by
auto scaling group, and then we use a
normalized cost. We don't want to
penalize someone for on-demand, so there's
a normalized cost in here. If I were
an engineering manager and I received
this (it gets sent out monthly), I would
see that my services together cost
close to two million dollars a year, and
here's my distribution of my services.
Think of microservices; we've
grouped them behind the scenes, so a
microservice might be, say, a servlet
tier plus memcached plus Cassandra. We
sort of lump them, so we have this mapping
behind the scenes that lets us aggregate
it. We maintain the reporting system, and
so they can pretty easily see what the
largest cost component is of their
footprint. They actually see
month over month what the change was, so
this one dropped by seven percent
whereas this one went up by twenty-three
percent. It lets them start to
understand where that spend is going. If
they click on one of the items on the
left, where they can do a breakout,
they'll actually get it broken down by
those tiers I talked about, you know, the
service tier, memcached, Cassandra. They
can actually see which element is
growing within there. When I first
started doing this a couple of years ago,
someone was like, hey, our cost
is running off in the weeds, why don't
you figure that out? So I'd go to Ice and
start pulling down all this data, get
some CSVs, start generating
spreadsheets, and I'd say, here's where
we're spending. And in about two
weeks it was completely invalid, because
in our architecture, at the rate at
which people push, it's only valid for a
couple of days. It could be there
was an instance running around for an
A/B test and then it went away. So we
found that we had to make sure we had a
system that was generating this
dynamically and updating it; otherwise it
was always out of date. As you
increase the velocity of your
engineering efforts, you have to increase
the velocity of your tracking and cost
tools as well, or else it's just going to
be a monthly spreadsheet, and the
first thing people will say is, oh, we
aren't using that much anymore. But
again, it's sort of that context versus
control: we try to push the data to the
teams so they can make better decisions,
and we see that in most cases
they do. Maximizing sharing: I
talked about buying reservations. In
terms of the size of our footprint, I try
not to give exact numbers, but across the
three regions we run somewhere
between, say, fifty and a hundred thousand
instances in a given day on Amazon, of
all different sizes. I didn't ask, who
uses Amazon here? OK, so you guys,
to a moderate level,
know the terminology. For EC2, which is
the compute, that's the biggest component
we use. If I go to Amazon and I say, I
want to buy an instance from you and I'm
going to pay you up front for three
years to get this great discount, they
say, great, no problem: what region, what
instance type, what availability zone? So
you suddenly have a lock-in, where if
someone goes and uses another instance
type in another AZ, I can't leverage that
capacity for their service, because it's
very fine-grained in how the
accounting is performed. That's another
reason why we have a single
production account. We have all these
people pushing on a regular basis; they
can share with each other, and we
have fewer, larger pools. People are
doing red/black pushes; some teams are
doing it on a daily basis,
others maybe once or twice
a week. Outside of that they have these
bursts into their pools, and we end up
having capacity and discount to cover
that. So we don't micromanage it. I don't
look at people and say,
well, you're going to push today but
they're pushing tomorrow. We just assume
if we do burst beyond that, we go into
on-demand, and that's the benefit of the
cloud for us. So we maximize that
shared capacity. I'm going to talk a
little bit more about that, but if you
look at all of Amazon's EC2 offerings,
seventy-five percent of our reservations
are on only eight EC2 instance types.
These could be like m3.xlarge, m3.2xlarge,
m3.large; we have some big ones in
there as well. This helps us, behind
the scenes, manage fewer, larger pools.
In the next couple of years (it was
mentioned we have a Mesos-scheduler-based
framework that does some of our
batch activity) we're actually
considering, looking out a year or two,
how we can containerize a large part
of our workload and actually abstract a
lot of this away. I have some
concerns; it may have some benefits around
workload preemption and shared capacity.
But from an instance migration planning
standpoint, we migrate instances today
and it's a real pain. It makes us less
agile, and it's sort of
a hassle to go to a team and say, hey,
Amazon's released a new instance type,
it's time to spend a bunch of time
getting you onto the new instance type.
We want to get away from that, but
that's just another aspect of
sharing. This is updated fairly
dynamically. We build a lot of these
little tools that give visibility to
teams, and you can drill down on the ASG,
on the different
zones, and it gives that real-time view
for everybody to look at where their
capacity is laid out. So we encourage
borrowing, and borrowing goes beyond just
the single prod account. When
you set up a set of Amazon
accounts for your business (yesterday
someone mentioned, I think in the
DevSecOps talk, that they had 1400
accounts for security purposes; that
might be a little complex to accomplish
what we did in terms of linking),
Amazon will give you a consolidated
bill. When you set up your accounts,
before you launch the first instance, you
want to say to the AWS people, make sure
that these accounts have my availability
zones lining up. What happens
then is, when I purchase a reservation
in prod, if I have a linked account, like
my encoding account that's doing video
encoding, and they launch an instance
in the same availability zone during
a period when I'm not using that
reservation in the prod account, I get
the billing benefit. That's what we call
our internal spot market, and we really
try to maximize that. That's how we
recover a lot of that cost of 24 by 7.
So we dynamically auto scale. This
is a period of a week, if you guys can
see it. That's one of our big online
services that is running every day in
one region; this is in US East. Where it
changes colors you can see there is an
overlap of capacity, but they basically
reactively auto scale with load. We also
have something we haven't open sourced,
which Adrian or someone mentioned, which
is our predictive auto scaling, and
so we run, like, an FFT, look at the
previous week's workload, and set the
min, our floor for these, to a certain
value. For the most part we have almost
as much capacity as we need on a given
day, and then we can burst beyond
that if we need it. But if you look at
these gigantic troughs over here, that's
500 up to about 2,700, and these are
pretty big, like 16-vCPU systems. So
all of this dead space in
here is capacity that we've paid for
that we're not using. My VP looks
to me and says, Coburn, why are we not
using that capacity? That becomes a
challenge for
me, because then I have to turn to other
teams and say, can you help me use
this capacity? We recently created a new
API that actually allows a user to
consume all of the reservation usage, all
of the on-demand usage, and the billing
data, or sort of the borrowing data, so
they can actually look in semi
real time and say, hey, this AZ has a
bunch of spare capacity. The encoding
team will spin up a bunch of jobs, and
then every five minutes they'll look to
see if that buffer is getting too small,
and if so they'll tear it down. This
auto scaling group on the bottom, that's
a precompute service. Every day Netflix
calculates your recommendations, which is
probably no surprise, and you can see
that it fits in the trough in general.
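The borrowing loop described above might look roughly like the following Python sketch. The internal API that reports reservation and on-demand usage is not public, so `reserved` and `in_use` are plain numbers standing in for whatever it returns; the buffer size, the load figures, and the function name are likewise assumptions made for illustration.

```python
def borrow_adjustment(reserved, in_use, borrowed, safety_buffer):
    """How many borrowed instances to add (positive) or tear down (negative).

    reserved      -- instances covered by reservations in this zone and type
    in_use        -- instances the owning services are running right now
    borrowed      -- instances the batch team is currently running on top
    safety_buffer -- headroom to leave so the owners can scale up freely
    """
    spare = reserved - in_use - borrowed - safety_buffer
    if spare >= 0:
        return spare                  # room to grow into the trough
    return -min(borrowed, -spare)     # give capacity back, never below zero


# One simulated day sampled at a few points; all numbers are made up.
reserved = 2700
owner_load = [500, 900, 2000, 2650, 1200]
borrowed = 0
history = []
for in_use in owner_load:
    borrowed += borrow_adjustment(reserved, in_use, borrowed, safety_buffer=200)
    history.append(borrowed)
print(history)
```

In practice the check would run on a timer (every five minutes in the talk's description) and the adjustment would translate into starting or terminating batch instances.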
Now, this is fairly primitive, in that
it's a scheduled action within Amazon
and we just stand it up and tear it down,
but they're almost non-overlapping for a
large amount of the capacity. I don't
have the data for the borrowing accounts,
say our data engineering account, which
runs large Hadoop jobs, or our encoding
account, which encodes the video, but if
you overlaid that you would see a lot
more of this spare capacity going away,
which is really what we shoot for, and
they get that billing benefit. That's
over about a week. Optimization plays a
part as well. I have the performance
engineering team, and we do direct
consultation for really big services. We
have about four hundred microservices,
and in the distribution of size there's
probably ten percent that represents
fifty percent of the capacity. When they
have a big shift, we can engage with them
and help optimize. For the rest of them
we create tooling so they can look at the
performance of their service whenever
they want, because at Netflix every
engineering team is
about 40 engineering teams that own
these 400 microservices. Everybody has
someone on call; they handle their design,
development, operation, push, pager. I
don't own anybody's operational stuff
for their services; I'm sort
of a coordination point for reliability,
but every team owns every aspect. They
can implement it in JavaScript, they can
use Clojure, they can use Scala, it
doesn't matter. We are heavily JVM-based,
but having that decoupling allows
everybody to handle their own
operational aspects. Here's an example of
two separate flows. The one on the left
is when you have a new feature or a new
service coming out. That tends to be
a little bit more of a special case, in
which it's developed, it's deployed, we
keep a little closer eye on it, and as we
scale it up we optimize if needed.
Sometimes, at our current scale, there are
services which come out of the gate that
are very inefficient, and we have to get
in and profile and optimize in a number
of ways. The one on the right is
day-to-day development. You develop
your code and push it through a
red/black, but as part of that, through
our continuous delivery framework, you
have a canary, and our automated canary
analysis framework is sort of our gate
for a production push. Teams adopt it to
various levels. The edge team I showed
you here, which scales up to 3,000
instances, that's our most complex
service. They sit on the edge; tens of
billions of devices are talking to it
every day. They
have to incorporate about 700 jar files
every time they build, from all the
clients that they talk to in the
architecture. They push every day in most
cases (I grabbed last week's data, which
had an issue), but in general they push
every 24 hours on a schedule. It goes out,
so that tier gets replaced every day, and
it rolls in a global way across the
production environment. But we run
canaries, and that helps with efficiency
as well, because we don't stand
everything up and see if it fails. We
stand up a canary and a baseline. The edge
team probably has about 3,000 metrics
that they've grouped into various
categories, and they have a score. If that
score, I think, is below seventy percent
the push won't go out; if the score is
above seventy percent the push goes out.
They've fine-tuned it over time, but it
automates it and it rolls it. The canary
will catch both
functional and performance deviations.
Big-bang performance
deviations are things that we catch right
away, and then we optimize before they
actually deploy it out, so they don't
cause any capacity issues. But it's up to
the teams to implement canary analysis
and catch that on their watch. We tend
to be too big for most commercial tools.
We've brought a few in before and
we've killed them, and not
in a bad way; I mean, we're sort of
their performance lab. If you have
somewhere between fifty and a hundred
thousand things running,
it's a lot of data for these
systems to consume, and the systems
themselves tend to have very broad use
cases: we want to create a CTO
dashboard, and we also want to show you
response time and look at your RUM data.
We tend to have very specialized use
cases, around, hey, I have an engineering
team that wants to look at the
transaction flow across tiers, using a
framework that's much like Google's
Dapper, that has response time
information and transaction tracing. This
utility, Slalom, that the performance
team created, actually shows demand on a
service; this is a function of your
request rate and response time. You can
double-click on the bar and it will show
you your demand from upstream into your
downstream microservices, and it lets you
see the flow through the environment.
The value here is, if I needed to
optimize my service, I can fairly quickly
understand how much time is spent in my
service versus my dependencies and go
work on those optimizations as need be.
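One common way to turn request rate and response time into a single demand number is Little's law: the average number of requests in flight equals the arrival rate times the average time each request spends in the system. Whether Slalom computes demand exactly this way is not stated in the talk, so treat the following Python as an illustrative assumption; the dependency names and all figures are invented.

```python
def demand(requests_per_second, avg_response_time_s):
    """Average number of requests in flight (Little's law: L = lambda * W)."""
    return requests_per_second * avg_response_time_s


def time_breakdown(total_response_time_s, dependency_times_s):
    """Split a service's response time into its own time and time in dependencies.

    Assumes the dependency calls happen one after another; with parallel
    calls this simple subtraction would overstate the dependency time.
    """
    in_dependencies = sum(dependency_times_s.values())
    own = max(total_response_time_s - in_dependencies, 0.0)
    return own, in_dependencies


# Invented numbers for a hypothetical service and two downstream calls.
in_flight = demand(requests_per_second=2000, avg_response_time_s=0.25)
own, deps = time_breakdown(0.25, {"ratings": 0.05, "catalog": 0.10})
print(f"in flight: {in_flight:.0f}, own: {own * 1000:.0f} ms, "
      f"dependencies: {deps * 1000:.0f} ms")
```

Here more than half of the response time sits in dependencies, which is the kind of split that tells a team whether to optimize its own code or to talk to a downstream owner.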
This view
right here probably incorporates, I
want to say, about 9,000 instances' worth
of data, and it aggregates it up on the
right. Sometimes we find there are
aspects of our stack which don't give us
the visibility we want. Brendan Gregg is
on my performance team; you might have
heard of him. He does a lot of
performance work, and he gets very
frustrated when he has to use multiple
tools to solve a problem, like, hey, I'm
going to go use perf to look at system
data, and then I'm going to go find out
where CPU time is spent in the JVM, so
I'm going to install an agent, I'm
going to start using this profiler. He
gets really frustrated; he just
wants to do it from the system level all
the way through system and user space.
So he started doing some
sampling against the JVM, and all this
used to be blank. Here, the green
stuff, that's user space code.
The problem was that the JVM
clobbers the frame pointer register. It's
an old optimization, but as a result,
when you get the stack frames from perf,
you can't really reassemble an accurate
stack of where time is spent. So he went
in about six months ago and found the
assembly code where that was happening,
changed it, and started working with the
HotSpot developers on the
mailing list, and got the patch in. I
think two weeks ago Java 8
Update 60 was released, and now you can
run with a
certain XX flag and you actually get the
full stack traces, full
stack frames. Then you can also run a
mapping agent that maps addresses to
symbols, so you can actually generate
this in a really lightweight fashion and
see where all your CPU time is spent on
your system, from system through user
space. So that's an example of creating a
custom tool. I didn't think we'd get
it through Oracle, but we did, so they
obviously thought it was high value, and
it was a simple change. We're very
happy now that we have that, because it
eliminates a lot of workarounds we had
to do before with multiple tools. He's
blogged about it quite a bit, so you can
actually check that out. One last thing I
want to call out is about our goal of
being lean. We want to purchase capacity,
we want to have these reservations to
make sure that the capacity is there when
we need it, which can lead to waste. But
you'll have teams that will come to us
and say, hey, I did a push today and I
couldn't get, you know, an r3.8xlarge
for some reason in one availability
zone. And the answer is, really, we're not
going to purchase so much capacity that
we're never going to have what's called
an ICE (not the Ice tool, but an
insufficient capacity exception). So we
built a tool that layers on; I think it
consumes CloudTrail events through
another utility we have inside called
Chronos, which is our auditing framework.
The capacity
team actually gets reports that show
when we're getting ICEd and in what
availability zone. If it starts to be
sort of a major situation, where we can't
consistently get an instance type we want
in a certain availability zone, we'll open
a ticket with Amazon and have a
discussion with them, and there might be
a shortfall. It also gives us good
visibility into when our deployment
model is changing and we need to start
purchasing more reservations in that
family. The data around these ICE
events is actually available to you
directly from Amazon, so
it's handy to keep an eye on. It was
mentioned that I should talk about some
wins, so I actually put some data in
here. I tried to be a little bit loose,
but in terms of internal borrowing,
this is a big deal for us, because we've
been working on this for probably two
years. There have been requests
of me for a while to make this happen,
but I've had dependencies on other teams
being able to consume the capacity. For
our encoding team, we had a data point
from June: they used about 130 235,000
instance hours from the prod account
that they didn't really have to pay for.
If they had had to buy that capacity
they would have had to spend about two
hundred thousand dollars a month, so that
was a pretty big savings. We had
already paid for it, so we actually still
paid the two hundred thousand, but it
just didn't go unused. Our data
platform team, the Hadoop team that uses
EMR, they're switching instance
families, so this might not still be the
case, but they were saving on the order
of about a million dollars a year by
standing up a custom cluster at night to
run very special jobs. They called it the
supercluster, and then they would tear it
down at a certain time in the morning. So
this is an example: you buy the
reservations, you lay the cash out for
them, but then you make sure behind the
scenes you get teams who are interested
in leveraging that. It's basically free
for them, and it looks great because it
doesn't show up. Well, the reserved hourly
rate shows up in their chart, but it's
not the same as the amortized sort of
upfront plus reserved; sort of funny math
inside. But teams love borrowing this
capacity. Nobody's used it for Bitcoin or
anything; there's a lot of it there. I
think we'd just create an application
called something unique, Coburn's app. So
in summary, I think it's really important
(we've done this for a while
now) to figure out what that balance is
of your innovation versus your
efficiency. As an organization
you're going to put more focus on one or
the other. Innovation, hands down, is what
we're focusing on. I try to stay as much
as possible out of the way of innovation.
My goal is to get engineers capacity
when they need it, when they want it, and
as much as they want, and then behind the
scenes (I wouldn't call it playing games)
get the context, balance all of these
different dependencies, and make sure
that we're efficient at the end of the
day. It's more of a soft goal for us;
if I don't meet my efficiency goal
I don't think I'm going to get fired, but
it gives us
a target to shoot for. When I
have a discussion with managers I can
talk about a concrete goal. So it's
probably not a real surprise, but it's
good to understand where you fall in
that spectrum. Pushing that data down to
the team level: you really want the
individuals who are closest to the
deployment model to understand what
their cost is. We find a lot of cases
where someone doesn't have an efficient,
say, deployment pipeline; they might be
doing it manually. They stand up a
service or an ASG, they stand up another
auto scaling group, they forget about it,
and it's running out there. This
sort of data helps them determine how to
become more efficient in their
development models. It's not a
finger wag at them; it's just,
hey, here's your data, and most
people act on it appropriately. I think
it was Ralph Waldo Emerson who said the
cloud is not a destination, it's a
journey, or something along those lines;
it was probably life. But for us it's
really been a journey. Adrian was there
and gave birth to this Netflix cloud
baby, and that was
somewhere in the 2009-2010 time frame,
and just this year we're getting our
final application out of the data center.
So it has been a journey. When we
talk about this stuff, sometimes people
become a little bit in awe of how we do
all this at this scale. It's been a
process; it takes a
bit of time. So don't focus
on efficiency so much on day one, but as
you go, over time build up that strategy
and those tools to help you achieve what
you need to do as a business. There's a
Netflix spaceship taking off. Please
rate the session, and I'll see you
tonight if you have any questions.