The Evolution of Hadoop at Spotify Through Failures and Pain

Josh Baer & Rafal Wojdyla

Recorded at GOTO 2015



Thank you, and hello again. In this talk we're going to illustrate the evolution of Hadoop at Spotify. Regardless of the reason you're here, whether you have your own cluster, you want to build a new cluster, or you just want to learn about Hadoop, we hope this talk will be useful for you, because we're going to bring up real-world examples of incidents, issues and bugs that we had and how we tackled them at Spotify. Please keep in mind, though, that every single cluster is different, so what worked for us may not work for you. But we hope that the next time you see a similar issue, you can take this knowledge and make a more data-driven decision about it.

Before we dive into the details, let's first introduce ourselves. I'm Rafal and this is Josh; Josh is the product owner and I'm an engineer on the Hadoop squad. We both kind of fell in love with Hadoop about two years ago when we moved to our beautiful Stockholm office and started supporting one of the biggest Hadoop clusters in Europe, which was a huge responsibility but also a great privilege. Now that you know who we are, let's take a look at the agenda of the talk. First we're going to talk about the issues and pain points from when we were just trying to keep our cluster up and running, then how we gained focus and stabilized the cluster, and at the end we'll finish with the future and the current state of our infrastructure. So let's get started.
Sure. So to start off, as Rafal mentioned, we're going to talk about the beginning of big data at Spotify, when we first started with Hadoop. But before we do that, let's back up a second and talk about what Spotify is. I realize it's probably pretty well known in Scandinavia, but just in case, so we're all on the same page: Spotify is a music streaming service; some people think of it as iTunes in the cloud. It was launched in 2008. There are two main tiers, a free tier and a premium tier that costs around ten euros a month and gives full access to our whole catalogue of songs. It's available in 58 countries, and I just looked: over 48 of those countries average more than a million streams per day. Some monthly numbers: we have over 75 million monthly active users, those users choose from a catalog of over 30 million songs from over 4 million artists, and they combine to play around a billion streams every day.

So that's a lot of data, and because it's a lot of data we have a pretty massive data infrastructure to process it. We just upgraded: we added another three or four hundred nodes and got up to around 1,700 nodes, so we think we're probably the biggest Hadoop cluster in Europe, and if we're not, we'd love to hear how big your Hadoop cluster is. We have 62 petabytes of storage, we get over 30 terabytes of incoming logs every day generated by users, we run about 20,000 Hadoop jobs per day, and those jobs generate over 400 terabytes of data. So that's a lot of data, and you might be wondering what we do with all of it.
Now let me ask you a question real quick. OK, so if you use Spotify, you're in luck: a few months ago we released a new feature inside the app, and what's really cool about it is that it gives you a custom playlist based on your running tempo. We found that running with a beat, when you're in rhythm with your cadence, improves performance by as much as fifteen percent. What's really great is that we combine your personal listening habits and the characteristics of the songs you like to listen to, things like the mood and the energy, along with the songs' beats per minute, to generate a bunch of different customized running playlists with familiar and interesting music at every beats-per-minute level for all our active users.
Another feature that we've recently launched, which you might be familiar with if you're a Spotify user, is called Discover Weekly. It's two hours of personalized, recommended music delivered every Monday to all of our 75 million monthly active users, and it has really been one of the most successful features we've ever launched. Every Monday we get a flurry of Twitter activity saying how great the playlist is this week and how it's the best one yet, or, if we're late with the data, a bunch of Twitter complaints about how the playlist isn't there yet. So we know it's really important to our users. Behind the scenes it runs a variety of different machine learning techniques that are all based on features extracted from the data.

These features are pretty cool, but before we dive into them, let's first take a look at a very simple one. This is the top list for Denmark: the top 50 songs that you, the users, are listening to in Denmark.
So let's spend two seconds and think about how you would implement this feature. If you think about it, the simplest, most brute-force solution would probably be to just throw this data into a database and then run a simple query like the one over here: you select a track ID and an artist ID, you count, you group by, you add a limit, and voila, you have your feature. Pretty simple.
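
For illustration, here is a minimal sketch of that brute-force approach in Python with SQLite. The table and column names are invented, and this is not the actual query from the slide, just the shape of it.

    import sqlite3

    conn = sqlite3.connect("streams.db")  # hypothetical database of raw stream events

    # Naive top list: count streams per track in Denmark and keep the top 50.
    top50 = conn.execute(
        """
        SELECT track_id, artist_id, COUNT(*) AS streams
        FROM stream_events
        WHERE country = 'DK'
        GROUP BY track_id, artist_id
        ORDER BY streams DESC
        LIMIT 50
        """
    ).fetchall()

    for track_id, artist_id, streams in top50:
        print(track_id, artist_id, streams)
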
The problem is that as soon as you start getting the amount of data that Josh was talking about, remember, 30 terabytes of data per day that we have to ingest, this starts to be an issue. So we knew right from the very beginning that we needed something that would scale better with the amount of data we're getting. Also, the top list is just one type of report; we have to create lots of different reports and do lots of different calculations on the data, so we have to be flexible when it comes to the type of calculations we're doing. And the third reason: we're getting different data from different sources in different structures, so we have to be flexible about the data and its structure. We basically needed something like Hadoop, and that's why we started playing with Hadoop back in 2009, very early in the life of Hadoop, and we had a couple of learnings about it at the very beginning. So now we know that it scales pretty well, that it's flexible when it comes to what kind of workloads you want to run on top of it, and, with the schema-on-read principle, that it's also flexible when it comes to what kind of data you want to store on it. But then there was another cool finding, which is Hadoop streaming. Hadoop streaming is a feature of Hadoop that allows you to implement MapReduce jobs in a language other than Java, and that was important for Spotify because Spotify back then was a big Python shop. It allowed us to take this extensive knowledge of Python and put it on top of Hadoop, and we started implementing data-intensive pipelines using Python. Basically we got a lot of insights out of the data, and that allowed us to make data-driven decisions and make Spotify as successful as it is today.
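
To make the Hadoop streaming idea concrete, here is a minimal sketch of the kind of Python MapReduce job it enables; this is not one of Spotify's actual pipelines, and the tab-separated column positions are invented. The mapper and reducer simply read from stdin and write to stdout, which is all Hadoop streaming requires.

    #!/usr/bin/env python
    # --- mapper.py: read tab-separated log lines, emit "track_id<TAB>1" ---
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:            # skip malformed lines
            track_id = fields[1]       # hypothetical column position
            print("%s\t1" % track_id)

    #!/usr/bin/env python
    # --- reducer.py: sum the counts per track_id (input arrives sorted by key) ---
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

A job like this is submitted with the hadoop-streaming JAR, passing the two scripts as -mapper and -reducer together with -input and -output paths.
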
But to be honest, I'm missing one important piece of the story. I'm saying that, yes, you can write these pipelines, get these insights and build all this knowledge, but the first problem you're probably going to hit is how to move the data to Hadoop in the first place. That sounds like a simple problem: well, we just store it on HDFS. That's the simple answer, but it's actually not that easy, so let's take a look at this problem from Spotify's perspective.

At Spotify we have this notion of access points, which are the machines that all the clients connect to. Every time you open your client and you want to request a playlist or anything like that, any back-end service, you go through an access point, and every time you request something from an access point, it logs that on that specific machine in a tab-separated format. That introduces a lot of specific types of issues. The first one is that the data is pretty raw and dirty, so we have to make sure that our ETL pipelines are rock solid and that every single record is clean, and that is actually not so simple with tab-separated values. Another thing: with tab-separated values, schema evolution is not that simple either, and it's specifically difficult at Spotify because we have lots of small teams. If you want to change something in the structure, if you want to add or remove a field, you have to talk to lots of different teams, and it's not uncommon for a team to change something without talking to anyone; then you learn about it only by having an incident downstream in the pipeline. So that's another issue. There's another set of issues we were having because we have a Hadoop cluster in only one data center, in London, so we have to move all the data to that one single place, and that introduces issues with networking; and what happens if we get duplicated data, we have to deduplicate it later, and so on and so on. Oh, and there's one more important issue: at Spotify we treat our logs very carefully and we have to make sure we get all of them, because we treat them like financial transactions; we have to pay our artists and make sure they're satisfied with Spotify. So logs are super important for us.
Faced with that problem, we implemented our first iteration, which was the log archiver, and that solution actually lasted very long: it was in place from 2009 to 2013. You will be surprised, because it was very trivial. It was just a set of Python scripts that would basically compress the logs and then rsync or SCP files between machines all the way to the data center in London, and those jobs were basically run from cron.
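
As a rough illustration of how simple (and fragile) that first iteration was, a cron-driven shipping script could have looked something like the sketch below. This is not Spotify's actual code; the paths and destination host are made up.

    #!/usr/bin/env python
    # Compress local access-point logs and ship them towards the London data
    # center over SCP. Meant to be run from cron on every access point.
    import glob
    import subprocess

    DESTINATION = "log-gateway.example.net:/incoming/logs/"  # hypothetical host and path

    for path in glob.glob("/var/log/access-point/*.tsv"):
        subprocess.check_call(["gzip", "-f", path])                # compress the log in place
        subprocess.check_call(["scp", path + ".gz", DESTINATION])  # copy it across
        # No retries, no monitoring, no alerting: exactly the weakness described
        # next, so when anything failed an engineer had to do this by hand.
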
What kind of issues did we have? Well, because it was cron-driven and because it was a trivial, manual set of Python scripts, it would fail a lot. There was no proper monitoring, there was no proper alerting, so every time there was an issue our engineers would have to go to a specific data center and SCP files manually all the way to London to make sure all the logs were there. And if we had to scale our access points, add more machines or remove an access point, that would be more manual work for our engineers. It was a huge, magnificent failure. But before you burn it too much, I must also introduce an important part of Spotify culture, best summarized by two words: embrace failure. The idea is pretty simple: when you're failing, you're learning a lot, which beats moving slowly and being afraid to fail. In the case of the log archiver we actually learned a lot about what not to do with log delivery; for example, you don't make it rely on cron if you want to add nodes and scale up all around the world. These failures and learnings guided our next generation of log delivery, which we built on top of Kafka.
Now let me ask another question: is anybody in this room using Kafka in their production environment? Is anybody familiar with it? OK. So Kafka is basically a messaging and queuing system that was open sourced by LinkedIn. It has a publish-subscribe model, or in Kafka's terms, producers and consumers. We tried a few different messaging systems when we were evaluating the next generation of our log delivery, and we found pretty early on that Kafka just worked the best. Latency from our access points, as Rafal mentioned, to our HDFS cluster went from hours down to seconds, and this opened up a wide variety of new use cases for the data, for example real-time processing with Apache Storm.

This is a simplified picture of what our architecture looks like. We have the 30 terabytes of data generated by users connecting to our access points, and that's passed along to site-local groupers that consume all the events produced by the access points; they compress them, they encrypt them, and they send them over the internet to London, where they get consumed into our Hadoop cluster.
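
To illustrate the publish-subscribe model, here is a small sketch using the kafka-python client. Note this is a modern client API used purely for illustration; it is not the Kafka 0.7 setup or the custom end-to-end delivery layer described in the talk, and the broker address and topic name are made up.

    from kafka import KafkaProducer, KafkaConsumer

    # An access point would act as a producer, publishing log events to a topic.
    producer = KafkaProducer(bootstrap_servers="broker.example.net:9092")
    producer.send("access-point-logs", b"2015-10-01T12:00:00\tDK\tsome-track-id")
    producer.flush()

    # Downstream, a consumer reads the same topic and hands events on to HDFS.
    consumer = KafkaConsumer("access-point-logs",
                             bootstrap_servers="broker.example.net:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=1000)   # stop iterating when idle
    for message in consumer:
        print(message.value)
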
To be pretty honest, we've had a lot of problems even with this log delivery system. A lot of that is related to the Kafka version we're using: we're on Kafka 0.7, which is a little bit of an older architecture, and it's also due to the fact that we have an end-to-end delivery system built on top of Kafka, which is what really gives us reliable delivery, because these logs, as we said, are pretty important. But our log delivery system is constantly evolving with every new issue that we hit, and in fact right now we're in the process of evaluating a bunch of different log delivery systems so we can improve it and fix some of the bugs that exist in our current system.
So now that we have data inside of Hadoop, you might want to start actually using it. You run a few jobs, and you might schedule them, for example, in cron like this and run the jobs at predictable times. This might work initially, but what happens if a previous job fails? You don't want to process over incomplete data, because that's actually worse than not processing at all: you'll have inconsistent results down your pipeline. Faced with this challenge, one of the early engineers at Spotify, Erik Bernhardsson, tackled it by creating a new tool. It was called Luigi, because it handles the plumbing of Hadoop jobs and also because it's green and Spotify's green, and that's cool. Luigi is a workflow orchestrator written in Python that allows you to define job dependencies programmatically. So if, for example, your royalty calculation pipeline depends upstream on some ETL jobs, it's going to make sure those dependencies are complete before it runs the royalty calculation jobs, and if they're not complete, it's going to schedule them. It's been a very successful project: it was first open sourced by Spotify in 2011 and it's used all over the world at hundreds of different companies; we're finding new ones each and every day, including some really big ones like Foursquare and Stripe. Unfortunately we don't have much time to go into the details of Luigi, but if you're interested we encourage you to go to the GitHub page down there and check it out.
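
As a flavour of how Luigi expresses this, here is a minimal sketch with two dependent tasks. The task names, parameters and outputs are invented for illustration; this is not Spotify's real pipeline.

    import luigi

    class CleanLogs(luigi.Task):
        """Hypothetical ETL step that produces cleaned logs for one day."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("cleaned-%s.tsv" % self.date)

        def run(self):
            with self.output().open("w") as out:      # stand-in for the real cleaning job
                out.write("track-id\tartist-id\tcount\n")

    class RoyaltyReport(luigi.Task):
        """Depends on CleanLogs; Luigi makes sure it is complete before running this."""
        date = luigi.DateParameter()

        def requires(self):
            return CleanLogs(self.date)

        def output(self):
            return luigi.LocalTarget("royalties-%s.tsv" % self.date)

        def run(self):
            with self.input().open("r") as logs, self.output().open("w") as out:
                out.write(logs.read())                # stand-in for the royalty calculation

    if __name__ == "__main__":
        # e.g. python royalties.py RoyaltyReport --date 2015-10-01 --local-scheduler
        luigi.run()
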
So now that we had data on HDFS and a scheduler we could schedule jobs with, we started getting more and more engineers crunching the data, which was pretty cool. And what comes with engineers at Spotify is feedback: there's good, positive feedback and there's constructive negative feedback, and one piece of it was that there was no data catalog, so it was hard to find data sets on HDFS. Every now and then an engineer would have to go to HDFS, ls some directories and maybe manually cat them, like in this case, looking for the boat in the data lake. Every single execution of the HDFS client takes a couple of seconds, because it has to load a JVM, it has to load libraries, and so on, and that was a pain point for engineers. Because we like to experiment, we decided to experiment with the RPC protocol in HDFS, and that's how another open source project from Spotify was born: snakebite. Snakebite was originally created by Wouter de Bie, and in a nutshell it's a pure Python HDFS client, which means that everything happens inside Python; there's no Java involved. It's actually very simple and very intuitive when it comes to simple read operations on HDFS. And how fast is it? Let me show you on another slide. Here you can see 100 executions of the vanilla HDFS client on top, and below you can see 100 executions of the snakebite client, and you can see that it's roughly 10 times faster than the vanilla HDFS client, which is pretty cool. What's also nice is that it uses fewer resources than the vanilla HDFS client, less memory and less CPU, and that is pretty handy, especially if you have a service that interacts with HDFS a lot. At Spotify we have such a service, and it's Luigi, because to schedule jobs Luigi has to do a lot of existence checks on HDFS, and when you start running tens of thousands of jobs that can actually overload the machines. We actually had this issue, so we decided to move from the HDFS client to the snakebite client, and what we noticed immediately afterwards was that our checks were more stable and faster and scheduling worked more promptly, which is pretty cool. I encourage you to take a look at the GitHub page and search for snakebite.
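
Here is a small sketch of what using snakebite looks like; the namenode address and paths are placeholders.

    # Pure-Python HDFS access with snakebite: no JVM start-up cost per call.
    from snakebite.client import Client

    client = Client("namenode.example.net", 8020)     # hypothetical namenode address

    # List a directory; ls() returns a generator of dicts describing each entry.
    for entry in client.ls(["/user/analytics"]):
        print(entry["path"], entry["length"])

    # The kind of cheap existence check Luigi performs thousands of times a day.
    print(client.test("/user/analytics/_SUCCESS", exists=True))

Snakebite also ships a command-line client, so the same kind of listing works from the shell without paying the JVM start-up cost each time.
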
So now we had a Hadoop cluster up and running, we had Kafka loading data into Hadoop, and we had tools like Luigi and snakebite so that all developers around Spotify could access data and run Hadoop jobs. And we started to run into a new issue: developers were running a lot of jobs, more and more of them, and the problem was that the team that managed the Hadoop cluster was also the team writing jobs, and they were also developing and supporting tools like snakebite and Luigi, so they didn't have a lot of time for Hadoop maintenance. If you're familiar with Hadoop, a Hadoop cluster without time for maintenance is an accident waiting to happen, and after a particularly hairy incident that caused a multi-day outage, we decided we had to change something. The decision was to form a team. The team started with Rafal, myself and another engineer at Spotify, and we had a very simple mission when we started. The first part was that we had to migrate to a new distribution of Hadoop that included YARN; the second part, and the most important part, was that we had to make Hadoop reliable. So you might be wondering how we did.
Let me show you in this graph. The first section is when Hadoop was essentially ownerless at Spotify. We had a lot of random issues and outages that caused downtime around the company, and Hadoop was kind of a dirty word, because even though it was really easy to write and run jobs on the cluster, you weren't so sure it was going to be up when you actually needed the results. The second section is when we started the Hadoop team at Spotify; we addressed a lot of the low-hanging fruit and started to improve reliability right away. The third section is when we upgraded our Hadoop cluster to a distribution that included YARN. There were some complications involved in that, and it was also complicated by changing distributions. In the first quarter of 2014 we added namenode high availability, and that introduced problems of its own because of the scale and size of our cluster; nothing's ever as easy as the Hadoop books say it is. Then the fourth section, I guess, is when we actually started to get pretty reliable and predictable. We did this by making the Puppet configurations we use to control the Hadoop cluster really rock solid, we added a lot of monitoring and alerting so that we were aware of issues before they happened and could prevent them, and we also built some infrastructure to make upgrades easier. We were actually starting to be seen as a model team around Spotify, because we were really improving reliability and addressing our users' biggest problems.

That was all cool, except at the end you can see the last bar, which is the last quarter, and you can see a significant drop in the availability of our Hadoop cluster. That is mostly because of our process of scaling the Hadoop cluster from 1,200 to 1,700 nodes. As Josh said, yes, we had Puppet, we had monitoring and alerting, and we could prevent things we knew about, but as it turns out there was another class of issues we were starting to hit, and that class of issues was basically bugs in the Hadoop code. So the lesson is that when you scale Hadoop, you also scale the hidden bugs that come with the code, and in our case we started seeing them as soon as we scaled past, let's say, 1,200 nodes. Here you can see two specific issues that were super deadly to us; in small clusters they wouldn't really be that deadly, or they wouldn't cause downtime, but in our case both of them basically brought the whole cluster down for more than one day.
Let's take a look at the first issue, for example. It's an issue with open files and failover: when you fail over from one namenode to another in an HA setup, the to-be-active namenode has to go through all the open files and make sure they are still open and valid. That process was implemented in a very poor, very inefficient way; the whole description is in the issue. In a small cluster that wouldn't be a problem; in our case, where we have tens of thousands of open files at any given point, it basically brings the whole cluster down. What is funny is that this issue actually made us upgrade from one Hadoop version to another, from 2.2 to 2.6, and right after the upgrade we ran into the second issue, which also brought the whole cluster down and is also related to failover. So remember: when you scale a Hadoop cluster, you also scale the hidden bugs in the code.

Yes, so there were a lot of bugs that we ran into that were in the Hadoop code. There were also a lot of other challenges we ran into just with our Puppet configuration and our need to constantly adapt to change. But there were also some issues we ran into that were totally, one hundred percent preventable, as I'll talk about right now.
So a little bit over a year ago the Hadoop team was doing a pretty good job. We were pretty proud of the reliability of the new cluster and of the success we'd had, and around the company people were coming up to us and saying, you're doing such a great job. So we thought we'd go celebrate, and we decided to go out to a bar in Stockholm, have a few good beers and just celebrate our success. While we were walking to this bar, we all got a message coming in on our phones. We opened it up, and the title said something like this: "I think I made a mistake." When we opened the email, we realized that one of the users in New York had run a command on the cluster that looked something like this. Maybe someone can see what's wrong here. Maybe now? When I realized the issue, I looked up and I saw Rafal's face, and it was like this: mother of god, what have they done. It turned out that the user in New York had accidentally put a space in the command he was running, in between his team's name and the folder he actually wanted to delete, and he had wiped out his entire team's directory of data. This was over a petabyte of data that had been collected over months and months of pretty intense Hadoop processing.

So we were standing there on the sidewalk in sunny Stockholm trying to decide what to do. Should we go back to the office, shut down Hadoop, and try to recover some of the blocks before they get permanently deleted, or do we just continue on to the bar and, you know, deal with it tomorrow? Fortunately we didn't actually have to make that decision, because another user from the same team as this guy in New York replied to the thread saying: don't worry about it, we can actually regenerate some of the most critical data in just a few days. Hooray, we just saved a lot of space on our Hadoop cluster.

From this we learned a few really important lessons. The first one is from one of our colleagues, who always says: sit on your hands before you type. Especially if you're removing with skipTrash in HDFS, or you're using some kind of superuser, before you hit that enter key, make sure that what you typed is actually what you want to run on the cluster. The second one is that users always want to retain their data, and as we found out in this specific case, this team could regenerate their critical data and really only needed a small portion of that one petabyte. If we had known this beforehand, if we had actually challenged them a little bit harder, we could have saved a lot of money and space on the cluster. The third one is that you should remove superusers from your edge nodes. If you're familiar with superusers, they basically have global access, super permissions, on your Hadoop cluster, and at Spotify we have this Swedish ideal of equality, which means that all engineers at the company have sudo access on the machines they have access to. So if this user had used the superuser and had put that space a little bit earlier in the command, me and Rafal might not be up here giving this talk today. The fourth lesson we learned is that moving to trash is actually a client-side implementation in Hadoop, and in snakebite we hadn't quite implemented it, so we celebrated a little failure, did a little hacking, and you can now safely remove files with snakebite. So that was a pretty clean and kind of funny one, but a few weeks afterwards there was another incident.
We had this external consultant at Spotify whose goal was basically to certify our cluster, which means he would go through all the different parts of the cluster and the configuration and then say whether the cluster was healthy or not and what we had to improve. The first few days were actually pretty smooth, and we were kind of happy with ourselves and proud, because he didn't find anything; we were like, yes, we're doing great. But on day number three, due to a miscommunication and a misconfiguration, one of the team members killed our standby namenode. That was fine, because it's the standby namenode, except then there was another miscommunication and we killed our active namenode, which means there was no master for HDFS, which means there's no HDFS, which means there's no Hadoop and no processing, and at our scale that means around two hours of downtime. It was specifically bad because this external consultant was right there. But that wasn't nearly as bad as day number four. On day number four we're sitting in another room, with the consultant, the team and our managers, talking about the incident and overall about the certification, and the consultant is saying something along the lines of: your testing and deployment procedures are like the wild wild west. That was very difficult to listen to, but in the end he was right, and we knew it. So right after the meeting the whole team went to a room and decided we were not going to leave until we came up with a plan to solve this issue. We came up with something that may actually be pretty simple and obvious to you, which is a pre-production cluster. The pre-production cluster is made out of exactly the same class of machines, with a very similar, almost identical configuration, and we created a set of smoke tests that we can run on that pre-production cluster to make sure that every part of the system is well integrated and works perfectly fine. Then we can use that pre-production cluster and those smoke tests to deploy changes first to pre-production, run the smoke tests, get instant feedback, and then decide whether we actually want to deploy or not. That actually changed the way we test and deploy to production, and it has worked pretty well: for our recent upgrade, for example, we were able to discover two issues through it that would most probably have caused a serious incident in production if we had not caught them beforehand.
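
Spotify's actual smoke tests are not shown in the talk, but the idea can be sketched as a tiny script that exercises the basic HDFS write/read path on the pre-production cluster and fails loudly if anything is off; a real suite would also submit a trivial job, check YARN, log ingestion and so on. The test path below is made up.

    #!/usr/bin/env python
    # Minimal HDFS smoke test: write a file, read it back, clean up.
    import subprocess
    import sys
    import tempfile

    TEST_PATH = "/tmp/preprod-smoke-test.txt"   # hypothetical path on the pre-production cluster

    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("smoke")
        local_file = f.name

    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_file, TEST_PATH])
    content = subprocess.check_output(["hdfs", "dfs", "-cat", TEST_PATH], text=True)
    subprocess.check_call(["hdfs", "dfs", "-rm", "-skipTrash", TEST_PATH])

    if content.strip() != "smoke":
        sys.exit("smoke test failed: read back %r" % content)
    print("smoke test passed")
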
While the Hadoop squad was trying to make the infrastructure stable and working on pre-production and all that kind of stuff, there was another effort going on in data, and that was to move from Python to the JVM. As we said before, Spotify was a big Python shop, and we implemented lots of pipelines in Python on top of Hadoop streaming. Over time we realized that this was an issue, because we started seeing lots and lots of failures in these kinds of pipelines: every time someone made a change in the Python code it was very hard to test, so people would basically throw it at the cluster and waste resources just to get feedback on whether they had made a typo in the code, or whether there was a mismatch in the types, because of the dynamic nature of Python. That was one class of issue. Another was that there was basically no testing infrastructure, so people would just throw pipelines at the cluster. And the performance wasn't there either, because of the nature and architecture of Hadoop streaming and of Python itself. So one of the engineers at Spotify, David Whiting, did an extensive overview of all the frameworks you can use on Hadoop, and you can find the links over here. Afterwards there was a discussion about all the frameworks, and we decided to choose Apache Crunch as the supported framework for running MapReduce batch jobs at Spotify. There were a couple of reasons behind it; I'm going to bring up three of them, which we think are the most important ones. The first one is that you get real types, so you get type safety, which means you get compile-time errors: you don't have to throw the job at the cluster to verify that it's a proper job without typos or a schema mismatch. Another one is that you get a high-level API, which means you can start thinking in terms of joins, group-bys, cogroups and other kinds of fancy functions instead of the old map-and-reduce paradigm, which is very nice and makes the whole pipeline less verbose and easier to maintain over time by different engineers. The third reason is the performance itself, which comes from the JVM; let's take a look at the graph over here. On this graph you can see a benchmark of Crunch and Hadoop streaming, and this is a benchmark of our production workloads, not a synthetic one. This specific graph shows map throughput in megabytes per second; on the left you can see Crunch and on the right Hadoop streaming, and you can see that Apache Crunch is roughly 8 times faster on average, and almost seventy-five percent of all the Apache Crunch jobs are faster than all the Hadoop streaming jobs, which is a pretty good reason to move to the JVM. Also, we were able to come up with a pretty neat testing environment for Apache Crunch on top of a mini cluster, which basically means we can now get our developers to create tests for their pipelines, which is also very nice.
So that was a lot of evolution; let's just quickly review what we talked about. At the beginning we discussed the difficulties of even getting data onto our HDFS cluster and how we solved that using Kafka. Then we talked about some of the challenges we had when we were first starting out writing Hadoop jobs and using cron as the scheduler, and how we fixed those by using tools like Luigi and snakebite. Then we talked about some of the issues we had early on with availability and how we solved that by creating a team focused only on the Hadoop infrastructure, and also how we improved reliability by doing things like proper monitoring and alerting. Then Rafal talked about how we started to really focus on performance and improved it by moving things from Python and Hadoop streaming to Apache Crunch. In this last section we're going to talk about the future: what we're working on now and what we're planning to focus on over the next six to twelve months for Hadoop at Spotify.

This is a graph of the growth of Hadoop usage versus Spotify end users since 2012, when Spotify had just crossed over 10 million users. Since that time Spotify has grown six hundred and fifty percent, which is pretty great growth, but the user growth doesn't compare at all to the Hadoop usage growth: the demand for compute resources at Spotify has grown over four thousand percent, as you can see.
So what caused so much growth? We attribute it to three main things. The first one is pretty obvious: with increased Spotify end users you're going to have a lot more data. All those users are listening to a lot more songs, which means that all the pipelines you already have, things like the royalty payment pipelines and the top charts, have to process a lot more data, and they need a lot more compute resources to do that. The second one is increased use cases. When we started out with Hadoop at Spotify we used it just for analytics and our reporting pipelines; we didn't use it for a ton of things. These days, Discover Weekly is, at the end of the day, all driven by Hadoop and all these massive machine learning and graph processing jobs, and we've really seen that all the different use cases mean you're going to be running a lot more jobs and you're going to need a lot more processing power. The last one is actually really interesting: we've added a lot of engineers at Spotify who process data, and that's driven growth a lot. In 2014 Spotify acquired a company called The Echo Nest, based in Boston in the US. They're a music company that's just obsessed with music intelligence, and before they came to Spotify they had actually never used Hadoop, but when they started and had access to our treasure trove of user data, they became hooked, and now they're some of our heaviest users and run some of our most massive jobs.

What we learned through all this growth is an important realization: scaling machines is actually kind of easy. We have alerts, we have proper monitoring, we have a really solid Puppet configuration that allows us to add new machines, and it has allowed us to go from 120 machines a little over two years ago to the almost 1,700 machines we have today without much trouble. We have run into some issues, as Rafal mentioned, with the recent addition of 400 or 500 nodes, going from 1,200 to 1,700, but for the most part it's been pretty smooth. Scaling people, though, is actually really hard. We're still a relatively small team at Spotify, but we support hundreds of users that are processing data, with different levels of expertise, from the beginner to the expert, and we have trouble keeping up with all their problems. So you might wonder what we're planning to do about that.
We're starting to automate feedback. We want to provide information about Hadoop jobs to users immediately after job completion: did your job fail with a specific error? Maybe we know about that error and can link to the JIRA ticket, and maybe the workaround for it. Maybe your job is launching with incorrect user permissions, so we'll try to put up a warning sign before you run your job, so you don't waste all the compute resources only to come up with nothing. While we're working on this, we've deployed a few things that are already helping us and might help you too if you're running Hadoop. The first is a project called Inviso. This was written and released by Netflix a little over a year ago. Inviso allows you to see what's going on with a cluster in real time and provides some really nice visualizations that make it obvious, for example, when a single job is dominating the cluster and using all the resources. It also contains some pretty cool visualization tools that let you drill down to the individual job level and see the life of the job and how it's performing, and you can use that to improve job performance. I'd really encourage you to check it out.

The other thing we've been doing seems so obvious that we've been really surprised at how effective it's been. Every quarter we publish a newsletter that contains all kinds of different information and statistics about Hadoop jobs: things like the growth of Hadoop compute demand, the increase in storage, the top failing jobs on the cluster in that quarter or in a month, or plain old Hadoop reliability. We always get really great feedback around this newsletter, which is kind of cool, but a few months ago it was even more effective, because it identified a single job that was running every day and wasting over ten percent of the cluster's resources because it was always failing. Before we published the newsletter, the user that owned this job didn't even realize it was running every day on the cluster and failing, but when he saw his name in the newsletter, the social peer pressure alone caused him to immediately go unschedule his job and vow to us that he'd never launch a job on the Hadoop cluster again without properly testing it and kind of certifying it with us. So just from the social peer pressure we improved user behavior, which was great for us, because we didn't have to go in and, you know, manually tell him that he had to stop doing that stuff.
Since this is a big data talk, we have to mention Spark; there's the Spark slide. So yeah, we are evaluating Spark and we've been playing with it since the very beginning. We had some issues with it, and pretty recently we started an experiment with Apache Zeppelin on top of Apache Spark. You've probably heard about Spark, but how many of you have heard about Zeppelin? Cool, so there's a couple of people. Great. You can think of Zeppelin as IPython on steroids on top of Spark. What it gives you is the ability to basically dive into the data, slice and dice it a little bit, and get a result in a nice visual way, so it's like a notebook experience on top of your Spark. The way we want to use it is as a glue tool where you can connect to all the different pieces of our infrastructure, fetch the data in, quickly process it, get a nice visual result, and then the engineer can decide whether there's value in that data and whether to build a production-ready pipeline on top of Scalding or Crunch or anything like that. So that's where we see it; we have some good results from the very beginning and we strongly encourage you to take a look at Zeppelin.
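
The kind of quick slice-and-dice you would type into a Zeppelin notebook paragraph can be sketched with PySpark as below. This uses the modern DataFrame API purely for illustration, and the dataset path and column names are invented.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zeppelin-style-exploration").getOrCreate()

    streams = (spark.read.option("sep", "\t")
               .csv("hdfs:///logs/cleaned/2015-10-01")     # hypothetical data set
               .toDF("user_id", "country", "track_id"))    # invented column names

    # Ad-hoc question: what are the 50 most streamed tracks in Denmark that day?
    (streams.filter(streams.country == "DK")
            .groupBy("track_id")
            .count()
            .orderBy("count", ascending=False)
            .limit(50)
            .show())

In Zeppelin, a paragraph like this gives you the table or chart straight back in the notebook, which is what makes it useful as a glue and exploration tool.
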
This actually brings us to almost the end, which is the two takeaways. This is kind of the most important slide of the talk. Two takeaways. One is that there's no golden path, especially when it comes to big data; there are no established patterns, they are kind of emerging, and we are still defining them. So when you have a problem, when you have to design a system that will deal with data, we would encourage you to take a look at the problem and try to implement the simplest solution you can come up with; try to avoid premature optimization as much as possible. It's an anti-pattern generally, and in big data it's like the root of all evil. The second takeaway is that evolution is an ongoing process, so whatever you plan to design or implement has to be designed in a way that makes it easy to iterate over it. Basically, you create something, then you will most likely have to throw it away very shortly afterwards and implement something on top of it, and you have to build it in a way that makes it easy to swap out. These are two kind of simple takeaways, and a bit generic too, but if you live up to them, it will make your life much easier as a Hadoop administrator or data engineer. That brings us to the last slide, which is: we are hiring. We need engineers in New York and Stockholm, so if you want to join you can talk to us afterwards or write to us on Twitter. And that's the end. Thank you very much; I think we have time for questions.
All right, thanks guys. So, who wants to start? This one, because you happen to be close.

Hi, how do you deal with schema changes? You mentioned it's a problem with the format.

So the question is how we deal with schema changes. OK, so most of it is actually implemented as a schema repository right now, where we have a project called the log parser, which basically defines what the fields are, and every time you make a change, you change it in that log parser. But you have to make sure that everyone knows about that change and that all the pipelines downstream are up to date with it, and that doesn't really work all the time. In the most recent architecture we're working on a more centralized schema repository, where you would fetch the schema from that repository depending on the data that you read; so let's say you want to process something from two years ago, the schema for it would be in that repository and you could fetch it. But it's not there yet. And I can't remember if we mentioned it, but we do use Apache Avro, so that helps out a lot.
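
This is not Spotify's schema repository, but a small example of why Avro helps with schema evolution, using the fastavro library: a field added with a default can still be read from records written with the old schema. The schemas here are invented for illustration.

    import io
    from fastavro import writer, reader, parse_schema

    v1 = parse_schema({
        "name": "StreamEvent", "type": "record",
        "fields": [{"name": "user_id", "type": "string"},
                   {"name": "track_id", "type": "string"}],
    })
    v2 = parse_schema({
        "name": "StreamEvent", "type": "record",
        "fields": [{"name": "user_id", "type": "string"},
                   {"name": "track_id", "type": "string"},
                   {"name": "country", "type": "string", "default": "unknown"}],
    })

    buf = io.BytesIO()
    writer(buf, v1, [{"user_id": "u1", "track_id": "t1"}])   # written with the old schema
    buf.seek(0)
    for record in reader(buf, reader_schema=v2):             # read with the new schema
        print(record)   # the new field comes back filled with its default
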
OK, there's one over here.

You mentioned that the cluster was not stable in the beginning. What was the reason for that, before the Hadoop team was started?

Yeah, so we had a bunch of different issues, and a lot of them were really low-hanging fruit. Nobody was managing the data growth, so every once in a while we'd be expanding the cluster only when we ran out of disk space; it was always reactive, that was the problem. We'd run into ninety percent disk utilization, which is kind of the danger zone, because it means individual disks are one hundred percent full, or you have individual nodes that are over-full when the cluster hasn't been properly balanced, and that'll cause a bunch of job failures, and then you just run into all kinds of crazy edge cases in Hadoop. That was one of the big reasons. There were also some other really interesting issues we ran into; I'd have to go back and look, but there were some bugs and some misconfigurations, because Hadoop is not the best documented code. There was even a configuration parameter where the documentation said it was in milliseconds but it actually turned out to be seconds, and I remember that was one massive issue.

There was also all sorts of small stuff, like Josh is saying. There was no alerting, so, for example, we would run out of space on the disk and the namenode would just die. Puppet was not running, so when there was a change, for example in the network, which is kind of serious, it would not be propagated to the namenode, and all of a sudden no one could connect to the namenode, which is also a problem. Or disks would fail, or would not have the right setup on the namenodes, and other problems like that. So kind of simple, low-hanging stuff.

Actually, there were alerts; there were just way, way too many of them and they were totally useless. I remember when I first started at Spotify, your email starts a few weeks before your actual first day, and I had over 10,000 emails, and all of them were, you know, "standby namenode disk is full". We had the over-alerting problem: so many alerts that they were absolutely useless, and they weren't very specific.

There was also an issue with dead datanodes. I remember we made a change to the time it takes for the namenode to mark a datanode as dead, and that one was, I think, the milliseconds-versus-seconds one, so we wanted to make it longer but we actually made it shorter, which made the whole thing worse. So there were some bugs, there were some misconfigurations, and there was a lot of low-hanging fruit.

OK, some more questions?
All right, how big is your pre-production cluster?

Right now it's 12 nodes... no, we expanded it, to I think around 40 nodes, but only recently. It was 12 nodes for a long time, but then we recently had to upgrade, about a month or a month and a half ago, and we had to bump up capacity, because with 12 nodes against a 1,700-node production cluster it was hard to actually move data there to do pre-production tests. So we bumped up the capacity, I think to around 30 or 40 nodes, but it's pretty elastic.
OK, more questions for the guys running the biggest Hadoop cluster in Europe? You guys are just going to be kicking yourselves when you get home and realize the question you wanted to ask. All right.

So the question is whether we use physical nodes or virtual ones. We have a physical Hadoop cluster in that data center in London. Yep. We actually had a slide with the specs, but we decided to remove it. We run Dell servers, 1,700 of them, and we have, I think, three different generations of servers: our oldest ones are about three or four years old, and our newest ones were delivered about three months ago. They're pretty beefy: we have 12 disks per machine, four-terabyte disks, and about twenty-four cores.
Backups?

Yes. So it's kind of a funny story: the log archiver we talked about before actually did our backups too. It would write to HDFS and it would also write to Amazon, I think it was S3, so that if we needed to restore, we could just reprocess the data from raw. If we ever actually had to do it, though, it would be a different story, because getting the data back from S3 and reprocessing it with old formats and stuff would be a nightmare. So when we killed it, we had to come up with a new solution. What we do now is that we don't copy our raw data, because we keep the recent history inside Kafka; there are mechanisms inside Kafka that allow you to persist data for a certain period of time. Instead, for all of our core data sets we make backups: we copy them, encrypt them, and copy them into the cloud. Now if we want to restore them, hopefully it'll be a lot easier: we can find out what's been deleted, copy just that data set back and reprocess from there, instead of going all the way back to the raw logs.

We also use snapshots. Yes, that's right, which is another cool HDFS feature: snapshots basically mean that you cannot easily delete a snapshotted directory, which prevents you from deleting everything; at least you will get a warning, an error, saying that you cannot delete everything because there's a snapshot there.
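
For reference, enabling and taking an HDFS snapshot is a two-step operation; here it is sketched from Python via the standard hdfs CLI. The directory name is made up, and allowing snapshots on a directory requires admin rights.

    import subprocess

    TEAM_DIR = "/team/analytics"   # hypothetical directory to protect

    # An administrator first marks the directory as snapshottable...
    subprocess.check_call(["hdfs", "dfsadmin", "-allowSnapshot", TEAM_DIR])

    # ...then snapshots can be taken. While snapshots exist, the directory
    # cannot simply be deleted out from under them.
    subprocess.check_call(["hdfs", "dfs", "-createSnapshot", TEAM_DIR, "daily-backup"])
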
And a funny little story about snapshots. At our shop we do a kind of Scrum-ish style, and we had a sprint demo with a lot of our stakeholders in the room. We were talking about how we did snapshots and how they would make it harder for some user to accidentally delete his entire team's directory; we had implemented this feature and tested it out in our pre-production environment, and it worked fine. We decided that in our demo we were going to enable it on the production environment, but what we hadn't talked about was whether we should actually test it on production too. So one of our engineers went a little rogue and decided, you know, I've read about it, it's going to work. So right there, in front of everybody, he goes: OK, delete root. And there was about a second or two where we were sitting there like, oh no, oh no, what's going to happen. Luckily it said: you can't do this, because of snapshots. So it was a successful demo, but probably a little bit unnecessary; we could have demonstrated that in pre-production just as easily.

I think we're glad you didn't have to learn from failure there, because then we wouldn't have gotten to hear you today. So, all right everyone, thank you. Please remember to submit a rating, and also remember that if you have other questions or feedback for these guys, you can put it in the app and it will get to them. All right.