So yeah, my talk today is Bare Metal DevOps. Who am I? I moved here to Chicago — it'll be three years next week — three years ago I moved over from the UK. I work for a local trading company called DRW. Yes, we are hiring; I suppose everyone is used to that by now, so you can come and talk to us outside at any stage and ask any more questions. I'll probably be at the speakers' clinic at some stage today as well.

But yeah, bare metal DevOps. Initially my working title was going to be Heavy Metal DevOps, but I got a lot of feedback that a lot of people don't think heavy metal's cool anymore. Also, Metallica — at least when I was a teenager — was my favorite metal band, but I think the wheels fell off with them when they sued Revlon because Revlon released a lipstick color called Metallica, and it just went downhill from there. So instead of talking about that heavy metal, I'm going to be talking about this heavy metal. This is a photo of one of the aisles full of machines that I look after.
From quite the early days of DevOps — when it actually got a name — I started to get a bit frustrated by the underlying assumption, whenever people were talking about tools and techniques and how you need to work, that you're creating a web service running on the cloud somewhere, and that DevOps only applies to you if you've got those constraints. For me, this is the infrastructure I look after. We've got no internet-facing web services, which actually makes things a lot easier from a general security point of view, and we don't use AWS anymore — we'll get into that a bit later. And yet, even with this infrastructure, I cannot do what I do without applying what I like to think of as the way of DevOps.

The core principles that I'm going to outline briefly apply wherever you're doing DevOps, but I'll be focusing primarily on how these principles are different within your own infrastructure, because I believe that the people who are doing web apps out on the cloud actually need DevOps culture and DevOps practices to make their lives better, to make their jobs more enjoyable, and to deliver more value to the businesses they work in. So to me the core principles are: agile infrastructure — being able to change the infrastructure around you as you're working; treating your machines as cattle; automation; and communication.
Starting with agile infrastructure: you must use configuration management. Whether you're using machines out in the cloud or your own, you still need to have some type of configuration management. I don't care — it doesn't matter what flavor of configuration management you're using, as long as it fits in your head and you can manage your systems appropriately. The whole point being that when you bring up a new machine, or a new type of machine, or you reprovision a machine to do a specific task, you need to be very confident that when that machine comes up and is ready in production it'll be exactly how you expect it to be, without any human intervention needed.

You need to be able to test and rebuild and repurpose hardware as well. When you're dealing with your own hardware you can't just terminate a running instance and then forget about it: you've got hardware that's actually taking up space, potentially consuming power, and sitting in an accounting book somewhere slowly depreciating. But you still need to use these principles of testing and rebuilding regularly to ensure that you don't have any bit rot within your machine images, within your configuration, within how your systems expect to run. Ultimately it's about getting feedback on your systems and about getting deployments completed faster.
The next trick is to still treat your machines as cattle — as disposable entities that can go up and down, that can fail, and you still don't worry if they disappear. Again, in my specific environment we're not using containers; containers still don't have the network performance that we need, so we don't have any CoreOS or SmartOS or OpenStack or anything like that. But the principle still applies: you need to be able to treat your machines as cattle.

Part of this, though, when you're not running in containers and you're running on bare metal, is how you start managing your operating system images — how you handle upgrading or downgrading between releases of the operating system. In our case we primarily use Ubuntu. At the moment we still have some 12.04 machines, we have some 14.04 machines, and we've been testing a little bit with 16.04, but we're not quite there yet, primarily because of the great evil that is systemd. We still think of the operating system image management that we do as very similar to how you'd manage your machine images within a cloud context. It's very similar, but how we bootstrap the system changes a little bit.
So, within our boot process, all our machines PXE boot. Straight away they'll come up and PXE boot initially, but a plain PXE boot doesn't provide much: the only information it really gives the DHCP server when it initially comes up is its MAC address. Especially when you've got different types of networks and different types of interfaces — and network cards do fail and you have to swap them out — a MAC address is no longer a reliable identifier for a specific machine.

So the very first thing we do is simply chainload into iPXE. iPXE is a lovely piece of open source which allows you to do a lot of scripting and a lot of automation around how you PXE boot a system. What we actually do is that our very first iPXE script that gets chainloaded has a look at the DMI table on the machine, and the thing it pulls out and sends to us is the actual serial number of the machine, because that, although not a hundred percent guaranteed, is a lot more stable than a MAC address. It then makes an HTTP call to a web service with the serial number, saying "right, this is the serial number of the machine I'm on, what configuration do I need?" That returns a separate iPXE script which then gets chainloaded again, and that one specifies the VLAN the machine needs to come up on, specifies the operating system distribution, and specifies the specific kernel version that we need to run, and that then allows the system to be chainloaded and come straight up.
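To make that flow a bit more concrete, here is a minimal sketch of what a serial-number lookup service along those lines could look like. This is an illustration I'm adding, not our actual service: the hostnames, the inventory format, and the exact iPXE and live-boot parameters are assumptions.

    # Sketch only: iPXE chainloads http://boot.example/config?serial=XXXX and
    # gets back a second iPXE script selecting VLAN, kernel and root image.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    # Hypothetical inventory keyed by chassis serial number (in reality this
    # would come from a proper inventory system, not a dict in the code).
    INVENTORY = {
        "ABC1234": {"vlan": 102, "distro": "ubuntu-14.04",
                    "kernel": "vmlinuz-3.13-generic"},
    }

    IPXE_TEMPLATE = """#!ipxe
    vcreate --tag {vlan} net0
    dhcp net0-{vlan}
    kernel http://boot.example/{distro}/{kernel} boot=live fetch=http://boot.example/{distro}/root.squashfs
    initrd http://boot.example/{distro}/initrd.img
    boot
    """

    class ConfigHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            qs = parse_qs(urlparse(self.path).query)
            serial = (qs.get("serial") or ["unknown"])[0]
            machine = INVENTORY.get(serial)
            if machine is None:
                self.send_error(404, f"unknown serial {serial}")
                return
            body = IPXE_TEMPLATE.format(**machine).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ConfigHandler).serve_forever()

The point of the design is simply that the machine identifies itself by something stable (the serial), and everything else — VLAN, distribution, kernel — is looked up server-side.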
We then rely on the Debian live-boot packages to also pull down the actual root filesystem as a SquashFS, and by having the kernel and the root filesystem decoupled we can test new kernels. There's a surprising number of bugs in various kernels at the moment; we've had bugs primarily around network cards and around filesystems, given some of the abuse we put the bits and pieces under. You'll notice we're not doing any disk configuration yet, but we still have filesystem bugs — we'll get to those a bit later. The Debian live-boot system then loads a SquashFS — a squashed filesystem — over HTTP, and that is our actual root filesystem. It's kept in memory, so our root filesystem is just read out of RAM very quickly, and that's our base machine.

After that, it checks out some configuration scripts from Git which set the hostname, set up the DHCP client in the operating system, and configure the network interfaces. There are some classes of machines that will do network bonding; some of them run with one-gig interfaces, some with ten-gig interfaces, and sometimes we'll chop and change between them. Ultimately we rely on a set of scripts to get the machine wired up, based on the machine configuration that came back from the HTTP config service. The very last step is to run the Chef client to actually configure the machine as we need it, and that Chef client run will then, if we are using the local disks in the machine at the time, configure the machine's disks as they need to be and mount that filesystem wherever underneath the filesystem tree we need it. So that way we're managing our machines as disposable, replaceable entities: if we get a new class of hardware, or if we need to expand, we just order more of the same type of machine, give them the same template to run, and up they come.
Part of that Chef setup — and we've been running on this system for over four years now — is also configuring the RAID controllers in our older machines. We don't use many RAID controllers these days; most of our systems, when we do need some kind of RAID, use software RAID, because processors these days are so ridiculously fast that software RAID is cheap and easy. But some of our older, bigger machines still have a mix of RAID controllers, so we've got a Chef lightweight resource provider that, given a bit of JSON metadata, will configure the RAID controller as we need it.

This is probably the only step that requires some human intervention, if you change a machine from one class to another. We're a bit paranoid about the state on disk, so if the RAID controller configuration is the same and the filesystem's got the same label, we'll just mount the filesystem; but if it isn't, we'll stop and say "hey, there may be valuable data on here, I'm not going to blow it away." It just needs a human to go and remove the filesystem, remove the RAID configuration, and run the Chef client again, and everything is reconfigured as we need. That way we still keep our stuff quick and easy.
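As a rough illustration of that guard rail — our real version is a Chef resource in Ruby driven by JSON metadata, so treat this Python sketch and its helper commands as stand-ins rather than the actual implementation:

    # Illustrative only: mount the data disk if it matches what the role
    # expects, otherwise stop and demand a human, never reformat automatically.
    import subprocess
    import sys

    # What this machine's role says the data disk should look like
    # (made-up values standing in for the role/JSON metadata).
    EXPECTED = {"raid_level": "raid10", "fs_label": "research-scratch"}

    def current_fs_label(device: str) -> str:
        # `blkid -s LABEL -o value /dev/md0` prints just the filesystem label.
        out = subprocess.run(["blkid", "-s", "LABEL", "-o", "value", device],
                             capture_output=True, text=True, check=False)
        return out.stdout.strip()

    def current_raid_level(device: str) -> str:
        # Software-RAID case: `mdadm --detail` prints a "Raid Level :" line.
        # (A hardware controller would need the vendor CLI instead.)
        out = subprocess.run(["mdadm", "--detail", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            if "Raid Level" in line:
                return line.split(":", 1)[1].strip()
        return "unknown"

    def ensure_data_disk(device: str, mountpoint: str) -> None:
        if (current_raid_level(device) == EXPECTED["raid_level"]
                and current_fs_label(device) == EXPECTED["fs_label"]):
            subprocess.run(["mount", device, mountpoint], check=True)
            return
        # Anything unexpected might be someone's valuable data: refuse to wipe
        # it and make a human remove the filesystem/RAID config first.
        sys.exit(f"{device}: layout does not match {EXPECTED}; manual intervention required")

    if __name__ == "__main__":
        ensure_data_disk("/dev/md0", "/data")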
Which brings us to the automation side of things: bringing new hardware into our fleet. We run quite a few different hardware classes right now, and most of them actually run with pretty much an identical configuration, so we'll have hardware from different vendors still running with the same boot script and the same Chef configuration, and the only differences may be the type of processor, the number of processors, and the amount of RAM. As far as configuring the machine and bringing it up goes, that isn't really important to the configuration management.

The hard parts come around firmware upgrades and diagnostics. As far as firmware upgrades go — especially running Ubuntu on some of the commercial platforms that we have — we have to spend a bit of time up front, and it's even worse with network card firmware upgrades. We've discovered network firmware bugs in some of our controllers, so there are some brands of network card that we'll never touch again, purely because the quality of the firmware in their cards is so poor. There actually seems to be a correlation: the harder it is to apply a firmware upgrade to a network card, the more likely it is that terrible-quality firmware is in there in the first place.

There's also collecting diagnostics. These machines are sometimes still quite expensive and we do have support contracts on them, so being able to automatically send hardware diagnostic information into your monitoring system matters: to tell you that a failure has happened and how severe it is, but also to get enough data back that you can go to the vendors, help them acknowledge that there's a problem — which is sometimes hard — and then actually ship out the replacements and get it all replaced.
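As a generic example of the kind of plumbing involved — not our tooling, and the alert hook here is just a placeholder — something as simple as scraping `ipmitool sensor` on the BMC and forwarding anything that isn't "ok" already gets you a long way:

    # Generic illustration: pull sensor state from the BMC and forward
    # anything that doesn't look healthy to your monitoring/paging system.
    import subprocess

    def read_sensors():
        out = subprocess.run(["ipmitool", "sensor"],
                             capture_output=True, text=True, check=True)
        for line in out.stdout.splitlines():
            # ipmitool prints pipe-separated columns:
            # name | value | unit | status | thresholds...
            cols = [c.strip() for c in line.split("|")]
            if len(cols) >= 4:
                yield cols[0], cols[1], cols[3]

    def push_alert(sensor, value, status):
        # Hypothetical hook: replace with your real monitoring client.
        print(f"ALERT {sensor}: value={value} status={status}")

    if __name__ == "__main__":
        for sensor, value, status in read_sensors():
            if status.lower() not in ("ok", "ns", "na"):
                push_alert(sensor, value, status)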
The other side is communication. Communication is no longer only with your cloud provider; it now covers things like our data center teams and network teams, security — especially as we're getting bigger, security becomes more of a concern, something people care more about — and also purchasing and vendor management.

There is a terrible habit with the tier 1 vendors, as they like to call themselves. They all run in different cycles, so what happens is that they'll release a really, really good piece of hardware that a lot of people like, and they'll use that to try to get into shops that aren't buying hardware from them. Often there'll be a feature in there that we like, a feature that we need, and we'll go and buy that platform from the vendor — and then they'll basically decide, "right, we've hooked you as a client, you're going to buy from us forever more, we're not going to invest in the next generations of the hardware." So their quality drops, and they get really surprised when we stop buying hardware from them because their quality is going down. So when we find people on the technical side within the vendors who understand us and the way we work, we try to keep a relationship going and keep communicating with them about which features are important and which problems we're having, so that hopefully they maintain them. Some of the vendors are starting to get a little bit better about maintaining that relationship with us and managing the quality of the hardware, and some of them just don't get the message, and we constantly get a carousel of different hardware representatives on the vendor side trying to sell us hardware without actually understanding our business, how we work, and what some of the problems are that we face.
There are also some additional considerations you need to think about when you're dealing with your own hardware that don't really apply otherwise — often you'll see the symptoms of these problems in the cloud, but there'll be very little, if anything, you can actually do about them. These things are mechanical sympathy, networking, monitoring, and forecasting.

As far as mechanical sympathy goes, you really need to understand some of the mechanics of what is going on underneath your infrastructure, especially when you're trying to get as much performance as you can. To me that's one of the primary reasons we still run our own infrastructure, besides cost — every time there's a platform refresh we compare our costs against what the cloud providers have out there, and we're consistently about a third cheaper for our workloads than what the providers offer at the time — but behind that is actually being able to understand what's going on. A good example early on, while we were building our infrastructure, was purely around disks. We deal with a reasonable amount of data — I wouldn't call us big data, I mean we've only got a petabyte or two that we're dealing with — but a lot of it is still on spinning disks. One of our developers, who started as a young intern with us, was trying to move data from where we ingest it and convert it and then pull it out — this is primarily around staging data and getting it out — and he was frustrated at how slowly the data was streaming across. We weren't saturating our 10 gig network at the time, and the bottleneck was the disks. It took him a while to understand that, especially with spinning disks, the more writes you try to do in parallel to a spinning-disk filesystem, the more seeks you're going to introduce, the worse your latency is going to get, and the worse your disk throughput is going to get, because your heads are now seeking all the time to keep up with the different write streams. Maintain a single write stream on a filesystem if you can; that way there's a lot less jitter on the heads, and things get better.
There's also another analogy, when it comes to stacking up virtual machines, which Sam Newman uses — he's running the microservices track, and his version is specifically about plastic containers, but he was talking about virtual machines at the time. We were working together at a client, when I was at ThoughtWorks with him, who ran a website and needed to scale their web traffic. They had a bunch of VMware clusters at the time running their web application servers, and they decided, "well, our website response is getting a bit slow, we're not handling enough clients, so what we need to do is provision more virtual machines on exactly the same hardware to cope with this load." And Sam said: that's the same as saying "I'm going to get more socks into this drawer by putting dividers into the drawer" — taking up more space to help me wedge more socks in. It doesn't work that way.

Unfortunately, we get the same approach when it comes to threading, often from junior developers who haven't had much experience yet. They'll think, "if I've got one thread doing something — especially when it comes to moving data around — and I'm running on a 12-core machine, then I can just spin up 12 threads and obviously it'll go 12 times faster." But no: there was a job just recently, doing some data validation, that was spinning up 12 threads, and a single pass through the data took four and a half minutes. Cutting it from 12 threads to one, the exact same job finished in three minutes. So you can get some quite big gains in performance and throughput just by stopping, stepping back, and thinking about what's happening underneath within your infrastructure.
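If you want to see that effect for yourself, a toy benchmark along these lines will do it. The figures depend entirely on the disk underneath, and this is purely illustrative — it is not the validation job I mentioned:

    # Toy illustration: the same amount of writing split across more threads is
    # not automatically faster, and on a spinning disk the extra seeking can
    # make it slower. Point the directory at the filesystem you care about.
    import os
    import tempfile
    import time
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = b"x" * (1 << 20)   # 1 MiB per write
    CHUNKS_TOTAL = 512         # roughly 512 MiB of writes in total

    def writer(path: str, chunks: int) -> None:
        with open(path, "wb") as f:
            for _ in range(chunks):
                f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())

    def run(threads: int, directory: str) -> float:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            for i in range(threads):
                pool.submit(writer,
                            os.path.join(directory, f"stream-{threads}-{i}.dat"),
                            CHUNKS_TOTAL // threads)
        return time.perf_counter() - start

    if __name__ == "__main__":
        with tempfile.TemporaryDirectory() as d:
            for n in (1, 4, 12):
                print(f"{n:2d} writer thread(s): {run(n, d):6.2f}s")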
Maintaining high availability in these kinds of conditions can sometimes be hard as well, because when people spread out their load they often won't think about the thundering herd problem, or about how load will shift if you have a failure with some of your cattle underneath the load.

The next piece to worry about is networking, and it's not just a matter of having an IP address to connect to, or of how I get my data from A to B. We're on our third network topology for just the research infrastructure that I look after, because as we grow, as we scale, as we get more machines and different types of workloads running within our environment, the way the data flows within the infrastructure changes. So we need to be able to keep an eye on how that data is flowing, where the bottlenecks are, and where the bottlenecks are going to be, so that we can scale up appropriately.

Which then comes to cabling. We've got a fair bit of 40 gig networking within our infrastructure as well, and the original 40 gig optics — just a single SFP module — right now go for about one and a half thousand dollars for a single small optical module. So if you're going to run 40 gig fiber between two points, you're spending three thousand dollars on optics for each link, one on each switch, and then less than 100 bucks for a cable. But if you've got these switches pretty much within a cab, within 10 meters of each other, for less than a hundred bucks you can get a direct attach cable which will do exactly the same thing. So your topologies and your cabling are no longer just about how the machines are interconnected; you also need to start thinking about how you physically place these things in space to optimize your costs and your expenditure on the different types of parts. This also covers protocols: what types of protocols are you going to be running within your network, what kinds of protocols are you going to be using between your network and other networks, and how do you scale the networks up?
Other problems that we run into with cabling specifically — and again this comes back to the problems with old firmware in old network cards — include keeping fiber clean. Fiber relies on light, and if bits of dust or dirt get stuck in the end of either the optics or the fibers, you'll often see connection downgrades or connections drop, because something is interfering with the light. Kinks matter too: kinks in the cable, because of the way the light actually refracts internally within a fiber optic cable. Kinks also have an adverse impact on copper cables, on your twisted pair cables, because if you kink a cable too much, one side of a link may try to downgrade the speed of the connection, and if the network card on the other side isn't getting that message correctly you can get speed mismatches and all sorts of other problems. So for managing the physical side of things, you need good people on your side who understand the importance of managing these cables and keeping things clean and tidy, but also, when you see strange behavior within your environment, being able to identify what's causing it in the first place is actually really important.
The next thing that changes when you're running your own infrastructure is your approach to monitoring, because monitoring is no longer just "is the instance up, is my application running, am I getting the throughput that I expect from my application" — it also comes down to the hardware itself. ECC memory checks are something we've noticed seem to have some seasonal variability: the move from winter into spring generally causes a spike in ECC error rates. Around April we had probably about a six-week period where pretty much every day we'd see at least one ECC error alert — just a single alert. I was keeping track of the solar weather, and no, the Sun was pretty quiet back then. The only thing we can think is that — especially here in Chicago, where the winters can be very dry with the cold — as spring comes along the humidity changes; could it be something to do with the different kinds of charged particles floating through the data center, bits and pieces like that? So to be able to tell the difference between "this is just seasonal variability" and "oh, I've actually got a component that's failing," you need historical data to trend and track and to figure out what's going on.
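On Linux, the raw numbers for that kind of trending usually come from the EDAC counters in /sys. A small sketch, assuming the EDAC driver for your platform is loaded (paths can vary by kernel and hardware):

    # Sketch only: log corrected/uncorrected ECC counts per memory controller
    # with a timestamp, so you can tell seasonal drift from a failing DIMM.
    import glob
    import time

    def ecc_counts():
        for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
            try:
                with open(f"{mc}/ce_count") as f:   # corrected errors
                    ce = int(f.read())
                with open(f"{mc}/ue_count") as f:   # uncorrected errors
                    ue = int(f.read())
            except OSError:
                continue
            yield mc.rsplit("/", 1)[-1], ce, ue

    if __name__ == "__main__":
        ts = int(time.time())
        for controller, ce, ue in ecc_counts():
            # Ship these somewhere with retention (Graphite, a CSV, anything).
            print(f"{ts} {controller} ce={ce} ue={ue}")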
The same thing happens, not quite as badly, with hard drives. SMART is the way you monitor usage as well as general health information from hard drives, both spinners and SSDs, and you learn to pick up the different tells from the different drives. One of the things you'll start seeing from a hard drive, when you've got either a failing drive controller or a bad cable, is that the drive itself will start reporting ECC errors on the data being written to it as a UDMA transfer from the controller. By keeping track of those — if you see a spike in them, it generally means you've got a capacitor failing in one of your disk controllers, and we've had a couple of those. You also sometimes get issues on the drive itself that aren't actually data related; we've only really seen this with some of our SSDs, where an SSD may not be able to internally maintain the power level for the writes we're busy hitting it with, so the power will actually brown out on the SSD's own board and you get unexpected resets. The SSD will then start reporting "hey, I've just rebooted myself, I don't know why, but this was an unexpected reset within the drive itself." By tracking those things you can start tracking the status of your disks.
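A rough sketch of watching for those tells with smartctl. Attribute names differ between drive vendors — the UDMA CRC counter is the usual one for the cable/controller case — so treat this list and the thresholds as examples rather than a definitive set:

    # Sketch: flag non-zero raw values for a few "tell" attributes.
    import subprocess

    WATCHED = ("UDMA_CRC_Error_Count",         # often a cable/controller problem
               "Unexpected_Power_Loss_Count",  # some SSDs report brown-outs this way
               "Power-Off_Retract_Count")

    def raw_attributes(device: str):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            parts = line.split()
            # Attribute rows look like:
            # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
            if len(parts) >= 10 and parts[0].isdigit():
                yield parts[1], parts[9]

    if __name__ == "__main__":
        for name, raw in raw_attributes("/dev/sda"):
            if name in WATCHED and raw.isdigit() and int(raw) > 0:
                print(f"warning: {name} raw value {raw}")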
Fans are important, and so is environmental monitoring. As well as our own data centers — because we're a trading company — we have machines out in shared colocation facilities which are managed by third parties, and sadly not all data center managers are created equal. We've had issues: out on the East Coast we actually had an issue at a certain data center outside of trading hours. While trading was going on, the air conditioning would be running perfectly, but once trading was done they assumed heat dissipation wouldn't be a problem, so to save on power costs they were actually shutting down the aircon within the data center during overnight hours. The thing is, during a hot summer night the machines are still consuming power, so we'd actually see temperature spikes: during trading hours things were cool, but things would heat up overnight, and after enough cycles of this you start to see fan problems and other problems. It was ultimately through us providing graphs from our monitoring of the environmental temperatures back to the vendor that we were able to say: look, we can see you're doing this dodgy stuff, we can see the problems you're creating — either refund us or fix the problems. So it's important to understand the impact these environmental conditions have. Again, it's the monitoring — having the data in place around not only your applications and your systems but also the environments your systems are running in — that lets you keep better track of what's going on.

Because big power fluctuations don't just shorten the life cycle of components. Probably about 15 years ago — I'm really starting to feel old — there was an incident when I was maintaining a lot of Sun hardware, and we had an air conditioner fail in a data center in a warehouse; I was working in retail at the time. We were running big Sun 6500 systems, which have CPU blades that mount sideways, and it had gotten so hot within the data center that the heat sinks glued on top of the processors had actually fallen off, dropped down, and shorted out the system board next to them. So heat can not only cause small bits of component damage, it can take out a whole system by the wheels literally falling off.
The next thing that's different when you're running your own hardware is forecasting: being able to understand how your infrastructure is growing and what your needs and requirements are going to be. When you're running out in the cloud you can just spin up new instances and away you go, and retiring machines when you're done with them is pretty simple. But for us, we've got things like purchasing lead time. For example, we ordered a new batch of storage for one particular system — we put the order in in April and we still don't have the systems yet; they were supposed to ship last week, actually. We'd ordered a batch of a specific type of Seagate drive in them, and Seagate announced last week that they're cutting back production on it. Further up in our supply chain, our forecasters had been expecting, "well, Seagate, you're going to be churning out so many thousands of these drives," and now all of a sudden Seagate has unexpectedly said, "we're cutting back production of these because we're losing too much money on them." So now we're having to switch drive suppliers, and that has a lot of knock-on effects between our capacity needs for these systems and them actually being available. Hard drives are a very good example: a couple of years ago there were floods in Thailand, which not only delayed things badly, but it took many, many months for production to ramp back up to the levels where people could actually get the hard drives they needed again. With this kind of supply chain disruption, you don't know when it's going to hit.

So as far as capacity goes, the rule for me is always make sure that you're never running at a hundred percent capacity, because you always need to be able to grow with your bursts and your spikes, but also to be able to run safe experiments when we're trying new types of hardware or new configurations, or if there's a new type of research that may or may not last the three years that we depreciate our hardware over. Being able to at least repurpose the hardware that you have — you should always be thinking about that when it comes to your purchasing decisions.
Amazon Web Services. Within our infrastructure, we actually started building the research infrastructure that I look after out on Amazon. We started back in the early days of EC2: we kept all our data in S3, and all our research was done on EC2 instances. We'd spin them up when we needed them and shut them down when we didn't, and that worked really well for a good number of months — we got a lot of research done and it paid for itself. But we already have to have the data center staff in-house, and we already have to have the networking staff in-house for our primary trading activities, and we were running the numbers on how our research was growing. Initially we had researchers just trying out new models, new ways of researching, new ways of looking at the data. Once we had a constant stream of "this is research that we need to run every night to help us trade tomorrow," we reached the stage where we constantly had research running 24/7, and at that stage the Amazon model kind of breaks down — it's really built for when you need bursty availability. So at that point the costs around Amazon started to get quite bad. Even now, whenever we do a new hardware refresh, especially for compute, we compare the cost of buying that hardware ourselves against what the same workload over three years at our typical usage would cost on Amazon, and we're still about a third cheaper running it in-house than running our type of load on Amazon. So we're constantly keeping an eye on what the costs are out there, because if it does flip, some of these problems go away, and we should keep an eye on what's available out there.
But there were still a lot of problems, at least in the early days when we had our own circuits into Amazon. Because of the amount of data we were moving, we actually had leased circuits into Amazon, and the biggest problem we had there was primarily with S3. As Amazon added new capacity into S3, they would roll out new IP address blocks for new storage clusters, and maintaining those IP address block changes within our internal routing tables became quite an onerous process, because Amazon wouldn't simply republish them through the BGP peering on the link — they would send us an email every week saying "all these ranges are now available across these links and these ones are not." If we didn't update our routing table in sync with Amazon, then when we pushed a whole new bunch of data up to Amazon, people in the office would start complaining, because that traffic would end up routing out over our office internet connection and blowing out our internet, trying to get data up into S3 for our research.
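These days that particular chore is easier to automate, because Amazon publishes its address ranges as a machine-readable feed. A small sketch of pulling the S3 prefixes for a region and handing them to whatever manages your static routes:

    # Sketch: pull AWS's published ip-ranges.json and extract S3 prefixes for
    # one region; actually applying routes is left as a print statement.
    import json
    import urllib.request

    RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def s3_prefixes(region: str):
        with urllib.request.urlopen(RANGES_URL) as resp:
            data = json.load(resp)
        return sorted(p["ip_prefix"] for p in data["prefixes"]
                      if p["service"] == "S3" and p["region"] == region)

    if __name__ == "__main__":
        for prefix in s3_prefixes("us-east-1"):
            print(prefix)  # feed these into whatever manages your static routes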
So ultimately we spent a good while sizing up how we wanted to store the data internally, rolling that out, and then pulling all our data back from S3 before we actually disconnected our Amazon relationship. But every now and again, as Amazon rolls out new services and as our needs change, we'll spend time looking at what's out there, just to make sure that what we are doing is still cost effective and the right solution for what we have.

Often in the past when I've spoken about DevOps, I've reiterated the importance of asking questions — of not getting stuck in a rut of "I'm doing it this way now, so this is the correct way to do it for all time." That is never the right mindset to be in. You should always be exploring: why are we doing the things that we are doing, what are other people doing to solve similar problems, and are there better ways I can change what I do — to get better value for the organization I work for, but also to make my life easier and more fulfilling, to focus on different and new, interesting problems?
gotchas I've spoken about some of these
before variable quality of hardware
being the top of the list there's also
the quantum state of old machines there
is you know and this is this is tightly
coupled with the perception from some
parts of the business that depreciated
hardware is free it's with old machines
especially they may have may be up and
running and you could actually run them
through a number of reboot cycles and
they'll be fine and so you can upgrade
and downgrade your kernel you can give
them different workloads but we're
nearing the end of actually moving our
research infrastructure from one data
center to another primarily around kinda
size and cost and cooling considerations
as as compute resources become denser
these days you know where we had the
infrastructure just wasn't able to
provide us the density of cooling that
we needed but what we had is we had
machines that had been powered on and
running for you know over a year to 18
months and we rebooted them a couple of
times but now after we'd powered them
down and move them you plug them back in
and they just refuse to come up and so
you know that to me that that
depreciated hardware so again as you're
treating them as cattle it's like well
you've had your free lunch this one is
dead you know send it send it to the
knackers yard and move on to the next
one but depreciated hardware is not free
because when you rely on the old
hardware like that when it dies
underneath you you've got to then think
about your forecasting your supply chain
people who are lying on that capacity
perhaps and so then you know you need to
make sure that when that card where dies
you can replace it and there's also the
the human cost of actually maintaining
that hardware diagnosing the hardware
diagnosing the problems and flipping
Again, platform and vendor changes: make sure that when you're building your infrastructure, the building blocks that you use to get your machines into the state you expect aren't tightly coupled to any specific technology or vendor tool. Make sure you can always pull them out — and ideally, especially when you're running within a system like Chef or Puppet or all the rest, which make it very easy to inspect the type of machine and the type of system you're on right now, just say, "well, I'm running on this brand of machine," or "I'm using an LSI RAID controller right now, so I need to run the LSI tools to configure it."

So, to me, this wraps up with what DevOps is not. To me, DevOps is not only about deploying to the cloud, or to containers, or to VMs. There is value in all of those things, but you need to make sure that the value you get out of them is worthwhile compared to the effort it takes to build and maintain the infrastructure around them. It is definitely not limited to websites or web services — although we do have websites and web services internally that we present to our internal clients, none of it is public facing. And, very importantly, it's not just another name for your sysadmins, your network admins, your DBAs.
One of the other things, especially when it comes to people defining what DevOps is: there's a mailing list that was set up a while ago where there's a bunch of us who've been doing DevOps for a long time, and every now and again an argument will flare up along the lines of "we need to define DevOps," because there are many snake oil salesmen out there trying to sell you the DevOps and to say "DevOps is this." So probably the only rule of thumb is that I'm only going to say what DevOps is not, because I currently feel that if someone comes to you and says "DevOps is this," they're trying to sell you something and you need to run away. We're still trying — I'm still trying — to define to myself internally what DevOps is. To me, a lot of this can roll into the whole NoOps philosophy that's coming up, where even when you're running within a cloud infrastructure, you need at least one person within your software team who understands the mechanical sympathy aspects of where you are running: how you provision these things, how you monitor them, how you maintain them, and how you clean them up at the end. Because with us, when we're buying hardware, that hardware has to be paid for and accounted for, and a similar thing happens up in the cloud, where people will spin up instances and just leave them running for ages and forget about them — someone still has to pay that cloud provider bill at the end of every month. So you need to ensure that you're getting the value out of these things.
Thank you. Any questions? If you have questions, please put them into the app. Also, please vote if you haven't voted, and please provide comments for Chris if you haven't. We have seven minutes for questions. Number one: what kind of security considerations do you have between the machine and your config service, the SquashFS, and so forth?
So: anyone can get hold of the SquashFS, because it is basically a bare machine image. We actually use debootstrap to create that initial SquashFS image in a directory and then bundle it up; it's very much a minimal install with a few extra packages added on, so there's nothing sensitive in there. Going up the stack, our Chef clients use Chef's built-in security model for managing the APIs and the bits and pieces, and we use encrypted data bags as well, so there is a piece where we use encrypted data bags for certain bits of information. Things like the root password within the SquashFS are fairly well known, but very early on — pretty much the first thing we do in a Chef client run — we set the root password. Then again, that's just exactly the same hash that's sitting in /etc/shadow; we've got the same hash just sitting there as plain text, so you can see the hash, but good luck cracking it. Moving up the stack, when it actually comes to logging on to the machines — some users have access, to diagnose their jobs and to get at their data — what we do, as part of the Chef groups and roles and environments we have, is define that all our machines are linked into our corporate Active Directory infrastructure, so these groups of machines have been paid for by this business unit, and only people within these groups are allowed to log into those machines.
Again, we use SSH to manage all of that. We do keep track of local kernel vulnerability exploits, so we have rolling reboots going on within our infrastructure to upgrade the kernels — there was the glibc stuff earlier this year — to keep those maintained. But ultimately the value of attacking these systems internally is pretty low, because it would only be members of a specific group who have access to those machines anyway, and thankfully we don't have that many people trying to get into our infrastructure. We do keep track of sudo logs, though. One of our researchers, when she found out that we get the emails for sudo violations, got very excited — you know the xkcd where the "this sudo incident has been reported" messages go to Santa, who keeps track of who's trying to sudo on systems they shouldn't? She was just really excited that I was Santa within our infrastructure and got to see the alerts of people trying to run sudo on systems they're not allowed to. Hopefully that answers the question. Cool, thank you. Next question:
Are you just using direct attached storage, or do you have a SAN? If you're using a SAN, how has your experience been with reliability? Yes — we only use direct attached storage; we do not use SANs within our infrastructure. Probably the closest thing we have is home directories, which are stored on a NetApp, so we've got NetApps for home directories and that's about it. Generally I have found SANs to be overrated and more complication than they're worth. The flip side is that the primary system we have for feeding the market data and the research data into our compute nodes is actually something we've ended up building in-house: it's large clusters of local machines basically using a key-value store, literally shooting blocks of data out across the network to local applications that are then just consuming the data they've asked for. So we've got our own protocols and various bits and pieces around that to keep the compute nodes saturated with data, because the complexity and the cost overhead of SANs just isn't worth it for us. Any SAN vendors in the room? Good, good.
you have an internal virtualization
cloud if so what are you doing ah yes
internal virtualization cloud there was
a stage about five years ago maybe maybe
four years ago where we did stand up
OpenStack across some older machines for
a while and we and we tested it a little
bit there was an asynchronous routing
bug that we had between the actual
machines themselves coming out and then
that the response packets coming back
from the internet where it was just
causing some throughput issues for us so
as far as stability when it ran for
about 18 months with very very little
care and feeding but eventually we
scrapped it because we just needed to
retire that hardware and we couldn't be
bothered about rebuilding it somewhere
else I'm very interested in some of the
stuff smart OS is doing with in with
their their containerization I've been a
big fan of solaris zones and
virtualization since I was an alpha
tester on solaris 10 but you know the
only doctor instances we run in
production on our team our team fortress
2 server for Friday afternoon games sort
doctor was intended for right pretty
You touched on this briefly, but what are your thoughts on the platform-as-a-service options available from cloud providers, versus the infrastructure-as-a-service options to run your own setups in the cloud? What type of services — what's around, what do you think about PaaS? Oh yeah, I think PaaS is good; I play around with it a little bit for some of my side projects. There are some bits and pieces, like Glacier, that I'm looking into again, just because — especially when it comes to tape backups — we've had to pull a fair amount of data from tape recently, and we've had something in the order of a five percent tape failure rate, which is actually pretty good for tape, especially tapes older than two years, but it's still too much of a failure rate for us. So I like keeping an eye on what's out there and playing with what's out there, and there is some good stuff. The key thing is ultimately to focus on your workload: what does your workload profile look like, how sensitive is your data, can your data be out there or can it not? We've got regulators that cover what we can and can't do as well in some cases, but there's a lot of data — generally market data is public data after a certain age, at least from an exchange's point of view — so there's lots of stuff that we'd like to have out there; it's just that the costs are still not in the right ballpark yet for it to make sense for us.
Okay, we're out of time — it's lunchtime now. I don't know if Chris is doing a speaker clinic; probably — come and bug him. Yeah, probably. Otherwise I'll be over at the DRW stand as well. Thanks very much.