Building a Distributed Build System at Google Scale

Aysylu Greenberg

Recorded at GOTO 2016

Get notified about Aysylu Greenberg

Sign up to a email when Aysylu Greenberg publishes a new video

all right so let's talk about building a
distributed build system at Google scale
and Before we jump into it let's discuss
with Google scale means usually when
people say Google scale it's meant to
sound impressive and you know like system proved itself because it
can support the Google scale so let's
see what it actually means in numbers so
we have 30,000 engineers in over 40
offices and the distributive build
system that I work on is supports all of
those engineers across the globe and
there is a roughly 45,000 Kamil's bad
are submitted to the codebase every day
15,000 of those are by humans by Google
engineers and then 30,000 are by
automated scripts so these are the
release scripts these are the
integration automation scripts that
change the configuration and so on and
so on you may be surprised to hear the
following but even though we have lots
of talented engineers it's every time I
works on any large system it felt like
we never had enough people so if I have
to do something twice this just becomes
very annoying and so I try to automate
that as much as possible and that's you
know where our reliance on the automated
scripts comes in the source code is
roughly 2 billion lines of code so that
means the distributed build system needs
to be able to support building all those
2 billion lines of codes in the binaries
and so on and every day about half of
the codebase changes so that means 1
billion lines of code changes need to be
constantly compiled tested builds
released and so on so distributed build
system at Google is code build rabid and
it roughly gets 5 million build and test
requests per day and it generates
petabytes of artifacts these build
artifacts are the binaries that your
build system generates these are the
binaries that the tests are being run
against and so on and all of this is
done in one repository
when they sow the open source bill tool
from Google came out and there was a
mention of the monolithic repository
that we use there's a lot of questions
and a lot of confusion haven't we moved
away from monolithic repositories what
are we doing here why why does Google
use one repository so let me take a bit
of an aside and clarify the advantages
well for us so what it's like to work in
one repository
so first of all it's very easy to have a
linear revision history so when am I
when the system is failing in production
and I want to figure out which of the
reveal versions of the binary was
released and which of the code did are
going to this release and which didn't
it would be very confusing how to try to
match up all the different libraries and
all the difference other binaries that
will lengthen into my binary to figure
out what code is running but having
linear revision history just makes it
very easier to be able to tell okay this
is where was the cutoff for what we
release and everything is grants
reference as you might imagine Google's
code base is polyglots so the code is
written in Java C++ Python girl and a
few other languages and being able to
reference and being able to find use and
Co sites across the code base across all
the different languages it's incredibly
important for us to be to be able to
track the history figure out what the
use cases are how our clients use the
code and so on and you may be wondering
how exactly do can we prevent that
releases a work-in-progress versions of
libraries get from getting out so for
that we use components it's similar to
get sub tree or get sub components can
you raise of hands for people who are
using it okay I see if your hands yeah
so it's basically it's to be able to
tell what is the vetted good version by
the library team and what is the
work-in-progress version so if I'm
realigning some sort of library that
does string manipulation I don't really
care to use the you know the previous
version or differentiate between the
versions oh I care about what was the
last vetted good version for this
and then just move on and if I'm
interested in building for ahead for
their unreleased versions I can also do
that by just relying on the
work-in-progress versions as well and
work in progress versions are similar to
snapshots if you are in the Java world
and you're using maven and the whole
ecosystem so honestly if you think about
it having a central repository of
artifacts the way maven is central for
instance does it and having something
that you can build from source from a
monolithic repository it's just two
sides of the same coin
there are some advantages to having to
being able to build from source such as
having predictable repeatable builds
from source so for instance if the
reporter if the artifact gets corrupted
in maven repository then we would have
to be able to detect this and we could
store md5 hashes and prevent and it
prevents us downloading something that
we don't want to download corrupted
files or prevent us help us detect if
the binary is what it claims to be with
doing it from source we know that every
single time at the given revision it's
going to be the same binary over and
over again and also when we give it
different flags when we're compiling
with different options we want to make
sure that it always produces the same
binary there's lots of optimizations to
avoid having to compile the same
artifacts over and over again so we
don't actually end up paying the cost of
building from source over and over so
for a lot of base libraries we'll just
end up storing them and cashing them and
reusing them for all the other builds
that advantage of building from sources
as much as possible so if I'm interested
in writing a new service that relies on
the unreleased but in the future very
useful service to me or using a client
library that was not released yet I can
do so without having to negotiate
without having to coordinate with many
other different teams and figure out
what exactly is that
when is that client ready so that I can
use it for my own use cases you can
learn more about how the source system
works in Rachel Parsons talk that she
gave at scale I wouldn't go too much
into detail about the source system
because today we'll talk about
distributed build system which is built
on top of that and uses the source
system so we discussed for Google scale
means and what challenges it presents
just given the sheer number of people
working on a sheer number of people
contributing to the code base as well as
the size of the code base so let's talk
about the distributed build system how
many people in the audience have a
pretty good idea for the distributed
build system does alright well hopefully
a little bit more of you by the end of
this talk so honestly not every team not
every organization will ever reach the
point of needing a distributed build
system it serves a very specific problem
for people so let's look at what the
evolution is generally up from the
single desktop bill to to a distributed
build system so to begin with we'll have
one a software engineer working on a
project and the project is in green and
maybe they're doing some sort of
computer vision projects they'll be
relying on the opencv so they built two
will go fetch if OpenCV dependency from
the clouds and then link it and build it
for them and now they'll have the local
version of open CD on their desktop so
now to make sure we're on the same page
let's talk about what build means what
does it mean to build a binary and what
build scenario goes as follows so we
have a project with dependencies defined
and your dependencies may be defining
some sort of domain-specific language
usually people start an XML or llamo and
then a build tool which is worse than
one single desktop will find those
dependencies and this bill could be
make Gradle I'm guessing the audience's
familiar with those tools and his use
them can I see short of hand
whoo okay great everybody awesome so
then so this bill to a wall build a
project with the dependencies and then
it will download the build artifacts and
now once I have the build artifact
that's the binder that I can use to
release it and so on now for the test
scenario who basically you go through
almost the same steps who are will have
it defined a project with dependencies
and the Builder will be able to find
those dependencies and build them for us
but the binary itself is not quite as
useful for us in the testing era but we
want to do is we want to run the test
and then the output that we interested
in is getting the results of the test
okay so now that we are on the same page
about what I mean when I say building
and testing let's move to the Hulk the
build system evolves over time so we'll
have one person working on a project
that gets very very popular so or maybe
gets organizational support and now our
team is working on it and the team will
have to maintain its own machines own
serve on the side that will be doing all
the builds and releases and would be
integrating with all the continuous
integration frameworks like Jenkins and
so on so now a different team in the
same organization if it's a computer
vision company I will probably also rely
on open CV and I will have to build that
in have a dependency on that and build
their projects and then get another team
and so what we see here is that every
single team has to have their own
machine set aside were now they need to
understand and deal with the overhead of
maintaining operating and keeping
up-to-date and fixing all the bodies on
their build system at which point the
team spent some fraction of their time
instead of working on the project where
their skills and talents are most useful
on doing this so this is the time when
team like mine will come in and provide
the shared infrastructure for the teams
so now we will actually maintain operate
the system the distributed build system
and then the teams can just come in and
request to build and test their projects
and they will just get the results in
the Carib
so if you have teams of machine learning
experts they can just focus on building
the best models that they have instead
of worrying about all the infrastructure
details and this infrastructure that I'm
talking about here is very similar to
the kind of infrastructure that Travis
CI would be using people use Travis CI
in the audience anybody okay yeah great
tool so let's talk about build Abbott
the googles distributed build system is
you can see it in a way as being based
on the clouds so bezel is the open
source Google's internal built oh that
was open source last year and build avid
basically builds a distribute layer on
top of that so what does build a Badou
barabbas it's in the middle of the build
and test stack of the infrastructure so
the types of teams and projects that
would be using build rabbit our
continuous integration system to be able
to run the test release infrastructure
the artifacts that the release
infrastructure will later on deploy and
so on individual Google engineers as
well as teams that are there can't use
any of the other existing builds and
test infrastructure and they're trying
to do something else with the build
and also the integration testing
infrastructure that will be able to call
the build and then set up all the
environment that you need for the
integration testing and run the tests on
it and I'll build a bit itself relies on
many different components so first of
all and you it will need to get fetch
the codes the source code at the
revision that a client requested from
the source system and as I mention you
can learn more about the source system
at Google from Rachel pots let's talk at
scale and I'll link it later in the
slides and so build rapid sits on top of
blaze basically it's a distributed layer
on top of blaze that's the internal tool
known as beso and external world
and then blaze will perform the builds
for us and then it will put all those
artifacts into the build artifact
storage and though I would be able to
serve those artifacts to the user
requesting them so bill rabbit sits very
nicely in this middle of the stack
performing specifically the builds and
tests and dealing with a distributed
layer for the build and test needs so we
talked about what distributed build
system is we talked about all the
different challenges that it has to deal
with because it has to work at Google
scale so let's talk about actually
building they build system and
specifically we'll talk about evolution
of build Rabbid you can think about it
as the architecture has evolved from
being a push to pull model and we'll see
why in a second so bill rabbit started
out as an experimental project and it
very quickly turned to become a key
piece of Go Go's developing
infrastructure so there was a lot of
different teams that needed this and
apparently they needed soon and we
thought and so the initial design of it
the initial architecture was very simple
and rightfully so because it didn't need
to be any more complex because at the
time I was being built we didn't know
who'd be using it and where the need
would be and so it was built as a simple
push model where we have a client for
user as a pretty sick client and oh it
needs to do send the build work at the
test request to the build I worker and
given that the client library soda
pretty thick thick then we upload a lot
of the discovery and load balancing
concerns to the build a bit scheduler
now there are couple issues with this
design that we discovered later as we
started pushing the ceiling of the
scalability that we could provide and
that's the routing so given that we have
a thick line it needs to understand a
lot of different build system specific
concerns and also in order for it you do
routing it needs to do a lot of their
understanding of the scheduler responses
and figuring out what it means for the
systems avail
ability and so on so the internal
routing becomes very difficult there is
a pretty complicated a communication
protocol because all the different
outputs that the client might be
interested in such as artifacts that
they will request from the worker and
also they build progress information so
if we're running a test and there's 100
tests and our test suite we want to see
which of the tests actually failed of
those 100 and they would get this
information again over the same channel
from the worker and so when we started
hitting the scalability ceiling we will
move to the pool motto which would call
boat service so it roughly looks this
way so let me walk you all through this
so basically now the user instead of
making the recognition to the schedule a
direct connection to the worker they
will be putting it into the persistent
queue they build a quest because at the
end of the day user doesn't actually
care to talk to the worker immediately
what they care about is that their
builds and tests request gets executed
at some point in the near future and
they just want to know what's going on
with it is it still and cute is it being
processed what's going on
and so now they will put it into the
persistent queue and then at some point
when there is capacity build a bit of
work what would you cured and then it
will start streaming they build progress
information to a separate service that
is responsible for storing and serving
the build progress and also any of the
artifacts that are created by the build
system will be put into a different
service so that the user can retrieve
those build artifacts later and so now
the user if they're actually interested
in the build progress then they can
subscribe they can opt-in to get the
information about the progress of the
build and they may may not be interested
in getting the binary back so if they
doing the test in the testing scenario
they don't actually care about the
binary so then they will just request it
if they need it if they're actually
trying to do releases from the build
artifacts storage so for once I can go
back specifically for the rest of the
talk I'll talk about how we re
architected and specifically how we r e
the request into the persistent cure and
then the bilderberg workers started to
queuing those requests after the
persistent queue I actually led the
launch for this part of reworking the
execution engine and it was an efforts
across five different teams in three
different geographic zones and two
was very tricky for us is we didn't have
the luxury of having a downtime we
couldn't just tell people just wait a
couple hours like maybe on Saturday on
let me just rebuilt our system and
launch a new one so given those concerns
it very much felt like having to replace
a jet engine and it flies and I tried to
search online for the images that would
really communicate well what that task
was about and the only thing I could
find is Aero refueling so apparently you
can't refuel a plane in midair but you
can replace a jet engine so I had to
draw something to represent really what
it felt like it that was probably me
trying to you know figure out how to
launch that thing and replace that while
I was still flying and keep it flying
so to recap a bit so the architecture
tube went from is very simple client
push model request the scheduler finds
the worker pushes the work onto the
worker and then gets the results back
and then the architecture that we move
to is there this built service the pool
model where the user would just put the
work on the persistent queue and then
the worker will pull the work and then
reports all the different things and
then when the user interested if the
user is interested in their different
outputs then they would take them when
they need them if they need them and
actually I should say so as I mentioned
a little earlier client was able to get
they built progress information as well
as their artifacts by connecting
directly to the worker and so now we
split that up into two different
components so they build artifact system
as well as the build progress enforce
service all of those are done maintained
and developed by my team so how did we
reimplemented rhe architected build
zero downtime how did we manage to
switch out our engine and with fluid for
the rest of the talk I would talk about
some of the challenges and how we
actually achieve this with actually zero
downtime which is honestly surprising to
me ice I'm still recovering from the
fact that it actually happened so so the
first thing that was very important to
us is to be able to migrate back ends
first because we started hitting the
scalability limits we're now our users
were no longer able to use to the extent
that they wanted to it was important
actually able to withstand the load and
are scalable to the when we need them
accounting for growth as well so by
migrating in the backend first it allows
us to prove that every single part of
the system is actually up to speed and
has room for growth the opposite would
have been if we decided to just rebuild
the client and give it to the new user
what would happen in that case well
first of all if we realize that our
implementation or our even architecture
of the backends was not sufficient for
the load then what have to tell the
client to go back oh can you use the one
from two weeks ago oh no no that one is
also bad can you go back from four weeks
ago and given they all the different
Google engineering teams that are using
this they which would create enormous
amount of communication overhead that
was just not feasible that did not make
sense so we needed to focus on the
backends and prove that they can support
the new loads before we consider the
launch successful and then we needed to
create a lot of throw of the codes
because our clients rely on our system
to be running and I will talk about
compatibility window in a second but for
backwards compatibility we need to make
sure that with new features and with new
implementation we the outputs that we
produce to the clients from the users
perspective are no worse than what we
used to and our compatibility window is
actually several months so what that
means is if the user is using a client
library from six months ago they expect
that their outputs are at least as good
as even if we switched off the backhands
and the other challenge is that not
always can you actually tell can you
actually compare apples to apples the
different outputs because you're
architected system the system operates
very differently and so and so it's very
important to make sure that we can run
this analysis and figure out what does
it even mean to compare the two
different types of outputs in order to
enable this we needed to target launch
friendly clients before actually
transitioning everyone so as I mentioned
there is three different backends that
we work on and I worked on the bill
rabbit like the execution engine and it
securing from a persisting queue and
then the builds off facts and build
progress service and what was
interesting is if we discovered their
different clients they were actually
more amenable to us making this changes
so for the secure we found that a
another developer infrastructure team
was willing to put the effort it was
actually needed this to happen and so
they were able to help us out and by
having a shared experience shared lingo
shared vocabulary about the different
challenges and also shared concerns we
were able to do the launch successfully
because both the user and us were
interested in the success of this and
for build artifacts on the other hand
there was a different client that was
actually friendly towards the launch
which was the deploy system and the
release system and we were able to work
much more closely with them so for each
of the components there was a different
launch friendly client and we needed to
make sure that we do a good job for them
first before we can say okay the system
is good enough to launch for everyone
else and what this meant is we needed to
decouple launch of services if we had to
rely on all the three different
components being ready at the same time
what that means is that would probably
not have launched anytime soon because
every single component would have its
different robots at different points in
time and so relying on everyone to be
ready at the same point and providing
the same
quality which is unreasonable so and
given that we had all the three
different types of users that we
targeting for the launch we needed to
work under the assumption that none of
the other news services will be ever
ready for us and that also went with the
lots of throwaway code that was written
for the transition when we needed to
anticipate all the old scenarios and
then if these two services were up if
one of them was up and so on and so on
all the different permutations just all
these bridges on top of bridges just so
we could launch with high confidence
another important thing for the launch
was having maximum visibility into the
system state as a distributed systems
engineer the least interesting thing I
could find out for myself is that the
service the system failed I expect that
it would fail it's surprising to me when
about is how did it fail oftentimes it's
very tricky to ask the user to reproduce
their belt or even have to reproduce it
for them because sometimes you have to
wait for that particular scenario and
all the places fall into the perfect
space where it causes the failure and so
having maximum visibility into this is
essential what that means is we needed
to figure out the whole debug ability
story but that's also expensive to set
up making sure that all the logs are in
place in making sure that given any
requests we can like do distributed
tracing or any of those approaches are
costly and so for the initial launch
they're also not necessary not necessary
many times but having a good idea about
which logs am I going to look at if this
type of failure happens for a look at
info logs binary logs where do I find
what happened how do you put pieces
together was very important and then the
rest of the investment into debug
ability just followed up towards this
was surprising to a lot of us when we
declared that we were complete and we're
like okay let's just try to launch it
into the QA and just see what happens
and the system very quickly failed it
barely even was up and running
and that was surprising because at the
time every single team participating in
this had sufficient number of
integration tests at sufficient number
of unit tests we had everything that we
knew about in place ready and off high
quality were lots of us signed off on
this so practicing a lot this whole
release process allowed us to launch
with high confidence what this meant is
we had very specific checklist for we
followed and because these launches take
some time to prepare just giving the
sheer scale of it it was very important
that everybody was on the same page
about what would happen and sometimes
people do - their life circumstances
we're not able to participate at certain
times and we join in later and others
and making sure everybody knows what's
going to happen when we launch and what
will happen every single failure
scenario and what everyone would do so
that somebody doesn't have to be woken
up at 4 a.m. because they're the only
expert who knows what to do with it
was essential for us and that goes for
the rollback plan as well knowing how to
roll back our new architecture our new
release was essential to us otherwise it
would be like releasing a monster out of
the cage and having no way of you know
getting them back in and seeing what
went wrong so rollback plan gave us that
extra bit of confidence for now release
4 we knew that if something went wrong
the users will have minimal to no impact
whatsoever and we could continue with
our old infrastructure i with the old
architecture until we're able to try
again and prove that it works so that if
you becoming more interested in how the
tools and build systems work overall
there's a some of the references that I
recommend and I'll post the slides
online as well just to call them out
this is the the first one is the talk
that I mentioned a bi-racial Potvin at
scale conference you can learn more
about how the source system is able to
support the Google scale demands the
build artifact storage blog post is
talking about how we store in all the
different challenges that we run into
storing and
they distributed built actions I didn't
discuss it as much but it's also very
interesting so it's used by the blaze
system by the basil and basically we try
to get maximum parallelism out of the
builds that we're doing so every build
will have a bunch of dependencies and
then we can construct a build a penalty
graph where were able to figure out
which parts of it can be executed in
parallel until they meet the point where
they joined and everything needs to be
linked together so this is the system
that does maximum paralyzation for us
and then if you're interested how all of
this fits together into one of the users
of builder and one of the users of the
build system is the continuous
integration system there's a great paper
on that and I would like to extend my
thanks to people who helped me refine
this talk and I would like to call our
kidding McCaffrey thanks for inviting me
here it's been a lot of fun trying as
preparing for the to give this talk and
this is our high for today so I'll take
any questions you have and please
remember to rate the session when you
head out of the room thank you