Tasty Topics for Distributed Sys: Novel Approaches Using Topic Filtering

Tom Fairbairn

Recorded at GOTO 2016


so welcome to this session now I'm going
to be talking today about tasty topics
so the idea behind this session really
is just to give you some insights into
the kind of work we do at Solace in
terms of helping customers solve the
problems that they see so hopefully here
there'll be some tools and techniques
that you can use in your development
tasks so that's the idea really it's not
a vendor pitch if you want a vendor
pitch we've got some great ones at our
booth please come visit so my name is
Tom Fairbairn I've been with Solace on
and off now for about four and a half
years I started off in our Singapore
support team actually helped set that up
doing technical support so kind of
production support for customers that
kind of thing then back to London about
two years ago where I moved into the
pre-sales team at Solace so helping our
customers solve the problems that they
have in moving data around so Solace is
all about this idea of open data
movement so a flexible platform that
enables you to move data around
in any format you please without trying
to tie you in to a specification okay so a
quick word from the track host please
do rate the sessions this isn't my first
GOTO but [inaudible]
so your feedback really is
important because obviously we want to
make sure this material is as useful to
you as possible if you do have questions
and I don't know you're shy or you like
that more than speaking out please
do feel free to ask questions with the
app if you want to ask questions while
I'm talking please just stick your hand
up and I'll come to you yeah quite happy
to take questions as I talk rather than
wait until the end alternatively
if it's a more complicated question then
please leave it to the end okay let's get
started so kind of the base of this talk
is the idea of topics and the idea of
kind of publish-subscribe systems so I'm
hoping that most of you are aware of
what publish-subscribe means in terms
of a pattern for information exchange
but just to kind of level set here give
you some revision so you understand exactly
what I'm talking about we're talking
about exchanging information in
distributed systems okay so this could
be applications running on different
hosts it could be on the same host it
could be applications running in
different data centers running globally
okay so one of the key features of that
then is this idea of decoupling so a
failure in one application should not
impact the exchange of that data other
than obviously that application being
unavailable so how does that work well
we have a publisher who is responsible
for sending data and he publishes his
data to a topic some call it a subject but
essentially it's a tag a piece of
metadata that ideally describes what
that data is and that enables people who
need to receive that data our
subscribers to pick and choose exactly
what information they want to
receive now the key feature of this is
that to give us our decoupling maybe
our subscriber isn't available at any
given time maybe there are multiple
subscribers in which case it gives us
this thing called fan-out so you may
have for instance lots of people using
their mobile phones to get updates on say stock
prices but you can also have multiple
publishers publishing to the same topic
right a topic is not confined to
a particular publisher and that gives us
fan-in right where you have multiple
producers you may have only one
subscriber listening to that data an
example of that might be some kind of
big data cluster
that's listening to everything that's
being produced in your organization to
then later run batch analytics on
it now something you don't normally
associate with publish-subscribe is this
idea of persistence okay so persistence
of data data that is resident within the
publish-subscribe system until a
subscriber is ready to receive that data
normally what happens is it's kind of a
live if you're there you're there if
you're not you're not kind of system but some of
these publish-subscribe systems do offer
you persistence and in fact we offer a
kind of hybrid you can opt to have both
and I'll talk about that in a
little while but the key here is the
subscriber is registering interest in a
topic it's saying I want to receive data
of a particular type and your publisher
is then tagging data with that metadata
okay so let's talk about a topic then what is
the topic I've already mentioned the
word tag but actually that's a bit of a
lead-in to this slide we just described
what a topic is then it's a piece of
metadata it's going to be a string so
that's easily understandable now the
temptation sometimes can be I'm just
going to label the data i'm just going
to give it a very simple tag that
describes it but there's a problem with
that in that if we have multiple
subscribers some of those subscribers
may be interested in different sets of
that data if you think about your Venn
diagram where your universal
set is all the data that's being
produced your subscribers are going to
be small circles within that Venn
diagram and some of those circles are
going to overlap so let's take this food
example imagine we're some kind of
grocer some kind of food retailer and
we're looking for slices of things
slices of food so we might have apple
slices and we might be tempted to just
tag our apple slices data with something
like food apple slices but then if we're
listening to everything that is slices
we have to make a decision if all we
have is a simple tag so
what we could do is we could list every
type of slice right so we might have
apple slices orange slices hmm there's
a problem with that though we've kind of
broken our decoupling because if we
now start producing peach slices and our
subscriber is interested in all types of
slices we now have to update his list
and we've broken our decoupling right
because an update to the producer means
that we have to update the subscriber
end and that's not what we want to do
hmm ok well maybe we could do a string
search that's a good idea what we can do
is we can go along to the end of the
string and look for the word slices hmm
well that's okay but the problem is
there you've generated a kind of
dependency on the exact structure of the
tag you now have a structured tag
effectively it's just that you're using
string searches to find that structure
and then there's another problem too you
see you've got to do a string search and
that's going to be different in
different languages maybe you'll be
using a different language to do that
maybe you'll be implementing the
string search yourself hmm that leads to
inconsistency one application might do
string searching in one way another
application might do string searching in
a different way hmm that's not ideal hmm
and there's another problem too but
before I go into that let's just look at
how we could do it if we make our topic
hierarchical so an example over there is
using a slash a bit like a folder
delimiter that gives us more flexibility
what it enables us to do
if we have wildcarding represented here
by the asterisk or star is that we
can very easily select only slices and
we don't care what happens at the other
levels of the hierarchy and that's key
because then it doesn't matter what the
producers do it doesn't matter if we add
a new product like peach slices our
topic matching will always match what we
want which is slices so this
hierarchical approach means that you can
decouple what's being produced from
what the subscribers are interested in
another benefit of this is if you've got
a subscriber who's only interested in
peaches might be peach slices might be
whole peaches we can use exactly the
same kind of structure we can just look
at peach only and wildcard the types of
slices and only get peaches and
where that becomes really useful is if
later on after we've implemented our
topic we want to have a completely new
type of conversation going on that we
never anticipated when we designed this
it's just a matter of publishing to the
right topic and creating the right
subscription so the whole idea here is
that a topic is not the same as a tag if
you use a hierarchical topic system with
wildcarding you get so much flexibility
and you get this true decoupling between
the consumers and producers of the data
and the topic structure itself okay so
just to ram that point home why would I
care about this well it's simpler if
you're using your publish-subscribe
system to do all this filtering for you
you don't have to worry about that in
your application you're not implementing
a string search routine for instance it's
consistent no matter what you're doing
in your information exchange you're
always using the same pattern to select
the data you're interested in and that's
your wildcard
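
As a minimal sketch of the wildcard matching described above, the snippet below assumes slash-delimited topics and a `*` that matches exactly one level of the hierarchy. The function name `topic_matches` and the `food/<item>/<form>` layout are illustrative assumptions, not a Solace API; in a real deployment the broker would do this matching for you.

```python
def topic_matches(subscription: str, topic: str) -> bool:
    """Return True if a slash-delimited topic matches a subscription.

    A "*" level in the subscription matches any single level in the topic.
    """
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    if len(sub_levels) != len(topic_levels):
        return False
    return all(s == "*" or s == t for s, t in zip(sub_levels, topic_levels))


# Hypothetical topics a food retailer might publish to.
published = ["food/apple/slices", "food/peach/slices", "food/peach/whole"]

# One subscriber wants every kind of slice, another wants everything peach.
all_slices = [t for t in published if topic_matches("food/*/slices", t)]
all_peaches = [t for t in published if topic_matches("food/peach/*", t)]
```

Adding a new product such as `food/mango/slices` would need no change to either subscription, which is the decoupling the talk is after.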
and here's another point too if we go
back to this slide here oops not that
one or that one as I go backwards here
there we go in the example where
we're doing for instance a string
search we're having to receive every
single message and search through the
topic and that's really inefficient
right you're getting all that data and
throwing away lots of it if you take the
wildcard approach and you let your pub/sub
system do that for you you only get
the data that you're actually interested
in and that means that your application
has to do a lot less work and in the
case of the
internet of things where you have low
power gateways for instance talking over
something like 3G there you probably
care how much information you're
exchanging and only swapping the data
that's of interest is probably really
very important too so just quickly let's
think about the topic itself okay we've
got this idea of wild cards and a
hierarchical topic structure where is
this topic created is it created when we
send the message is it an administered
object that we have to create somewhere
is it something we have to create on the
broker right do we have to say I want
the topic now and I'm going to set up
that topic is the topic created at the
receiver well how we like to do things
at Solace is that actually the topic
itself is just a property of the message
you create it actually on send and it
just becomes a piece of metadata on the
message itself it's not an administered
object in any sense at all on the
broker what actually happens is the
receiver registers interest in the
topic and he creates the subscription
so your administered objects
are actually subscriptions and not the
topic itself and that means that your
sender is free to send any message to
any destination provided he has the
permissions what it also means is that
the topic can be created on message send
and so it can be different for every
single message you send it's completely
dynamic and when we come to one of the
use cases we'll see how that can be used
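
To illustrate the point that subscriptions are the administered objects and the topic is only message metadata, here is a toy in-memory broker. It is a sketch under assumptions: `ToyBroker`, its methods and the topic names are hypothetical and do not represent the Solace API. Note that no topic is ever declared; the broker stores only subscription patterns and filters each message as it arrives.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Handler = Callable[[str, str], None]


def topic_matches(subscription: str, topic: str) -> bool:
    """Single-level "*" wildcard match on slash-delimited topics."""
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    return len(sub_levels) == len(topic_levels) and all(
        s == "*" or s == t for s, t in zip(sub_levels, topic_levels)
    )


class ToyBroker:
    """Hypothetical in-memory broker: it stores subscriptions, never topics."""

    def __init__(self) -> None:
        self._subscriptions: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, pattern: str, handler: Handler) -> None:
        self._subscriptions[pattern].append(handler)

    def publish(self, topic: str, payload: str) -> int:
        """Deliver to every matching subscription; return the delivery count."""
        delivered = 0
        for pattern, handlers in self._subscriptions.items():
            if topic_matches(pattern, topic):
                for handler in handlers:
                    handler(topic, payload)
                    delivered += 1
        return delivered


broker = ToyBroker()
received: List[Tuple[str, str]] = []
broker.subscribe("food/*/slices", lambda topic, payload: received.append((topic, payload)))

# The topic exists only as metadata on each message that is sent.
broker.publish("food/apple/slices", "in stock")
broker.publish("food/mango/slices", "new product")  # first use of this topic
dropped = broker.publish("food/peach/whole", "not a slice")  # filtered by the broker
```

Because filtering happens inside `publish`, the subscriber's handler never sees the message it is not interested in, which is the efficiency argument made earlier.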
all right so much for revision then
let's get into some details okay data
formats sounds pretty tedious but
hopefully this is going to become clear
later so let's imagine we're working for
a funky dynamic open source startup right
so you know it's fairly informal you
know I'm Tom I'm a magician or hey and
all you need to know really is my mobile
phone number to get hold of me
when things go wrong that's
great it's all working very well and in
fact it works so well we're taken over
we're bought out by a bigger company and
they want some structure they want to know
who you report to and who reports to you
so your data format is going to change
okay it's not a big problem right we
just change our code update all of
our applications everything's happy ah
but then you get the phone call turns
out there's a reporting application that
can't be upgraded just yet don't know
why could be regulatory hmm so we need
to run these two data formats at the
same time ah that's a problem do we want
to write our code so that it understands
both data formats hmm that's one
approach but probably the problem of course is
that you'll end up with legacy code hanging
around the place maybe a bit of
technical debt nobody can be bothered to
go in and trim off that code you no
longer need well here's an alternative
approach first of all let's just think
about our traditional CI/CD pipeline
we have our basic format here JSON
that's what our data format is
if we tag our source code files in this
case using Git we can tag them with
which data format that is and then when
we come to do our builds we can use that
tag and here we do it right so in this
case I'm using JSON so we tag our
class which in JSON terms means that
by tagging the
class we're then tagging the data format
okay and then what we can do is we can
take our Git tag name and add it to
the topic so what we've done here is we've
extracted information from our version
control system and put it directly in
the topic because it's
completely dynamic now what's great
about that is that then regardless of
what the data format is we automatically
create our topics and the data format
becomes part of that topic so for instance
when I send my v1 data it will only go
over the v1 topic what's more
I've built all my code using
that v1 tagged code so all of my data
formatting routines will understand v1
and I can run both of them in parallel
because my v1 applications are never
going to get my v2 data format because
they've been implicitly built to
understand only the format that my topic
is for quite a simple idea but it's
quite powerful it enables you to be
completely confident that your
application is going to get the data in
the format that it's expecting okay
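
A minimal sketch of the versioned-topic idea, assuming the build injects the version-control tag into a constant. `DATA_FORMAT_TAG`, `build_topic`, `accepts` and the choice to put the tag at the first topic level are all hypothetical; the talk does not say where in the topic the tag goes or how the build passes it in.

```python
# Hypothetical: in a real pipeline the build would set this from the
# version-control tag that the data-format class was built from.
DATA_FORMAT_TAG = "v1"


def build_topic(*levels: str, format_tag: str = DATA_FORMAT_TAG) -> str:
    """Prefix every published topic with the data-format tag."""
    return "/".join((format_tag, *levels))


def accepts(app_format_tag: str, topic: str) -> bool:
    """An application only listens to topics carrying its own format tag."""
    return topic.split("/", 1)[0] == app_format_tag


v1_topic = build_topic("employee", "update")
v2_topic = build_topic("employee", "update", format_tag="v2")
```

An application built from the v1 tag subscribes only under `v1/`, so it can run alongside v2 publishers without ever receiving a payload it cannot parse.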
so that's one idea that's quite a
nice neat one let me talk about another
case study now authorization so imagine
in our pub/sub system maybe we want some
people to be unable to see data that
they're not supposed to see so a good
example of that is let's say you have a
mobile application and you're logging in
and you're getting for instance an
account balance right your account
balance is coming over a topic but you
don't want anybody else to see your
account balance okay so a neat use of
this kind of push style of technology in
this case is as and when your
balance is updated we can push that
update to you rather than relying on you
requesting an update for your balance
okay so how we can do this then is we
can have a separate application this one
here what I've called subscription app
that manages your subscriptions on
behalf of other applications so let's
imagine that our any app over here is
out on the internet now the problem
there is that hmm if I make a balance
request I don't really want him knowing
which topics to talk to because he then
could start subscribing to topics
he's not supposed to hear so for
instance he might say ah I know my
balance updates are coming in on balance
update Tom maybe let's see what Harry's
balance update is by subscribing to
balance update Harry so what we
do here is we have this separate app
that is under your control probably
running in your core network and when
our internet application here our any
app connects he sends a request
on a different topic and says I
want to request a balance
and I'm Tom now subscription app
up here can then decide what topics any app
with user Tom is able to subscribe to okay
so that's a bit of core business logic
that you've got running in subscription
app but and here's the clever bit the
subscription app does the subscription on
behalf of any app so subscription app
tells the publish-subscribe system any app
is now subscribing to balance update
Tom and what's key about that is that
any app doesn't even know what topic he's
listening to doesn't even know what the
subscription is all he knows is that
he's connected to the
publish-subscribe system and the data
starts coming toward him so if any app
tries to do something clever and tries
to subscribe to something that he's not
entitled to he can't even do any
subscriptions because they've all been
applied for him so that means that your
authorization can be completely dynamic
if we suddenly decide that any app
needs to access other data and is
entitled to that we can then start
applying subscriptions on his behalf at
any point during this conversation okay
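
The on-behalf-of pattern can be sketched with a toy broker in which untrusted clients are denied direct subscriptions and only a trusted subscription app may add them. Everything here is an assumption made for illustration: the class and method names, the `balance/update/<user>` topic layout and the entitlement table are hypothetical, not a Solace API.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, Iterable, List, Set, Tuple


class ToyBroker:
    """Hypothetical broker where clients may not add their own subscriptions."""

    def __init__(self) -> None:
        self._subscriptions: DefaultDict[str, Set[str]] = defaultdict(set)
        self.inboxes: DefaultDict[str, List[Tuple[str, Any]]] = defaultdict(list)

    def client_subscribe(self, client_id: str, topic: str) -> None:
        # Untrusted clients are always refused.
        raise PermissionError(f"{client_id} may not subscribe directly")

    def subscribe_on_behalf_of(self, client_id: str, topic: str) -> None:
        # Only the trusted subscription app is expected to call this.
        self._subscriptions[client_id].add(topic)

    def publish(self, topic: str, payload: Any) -> None:
        for client_id, topics in self._subscriptions.items():
            if topic in topics:
                self.inboxes[client_id].append((topic, payload))


class SubscriptionApp:
    """Trusted app holding the business logic for who may see what."""

    def __init__(self, broker: ToyBroker, entitlements: Dict[str, Iterable[str]]) -> None:
        self._broker = broker
        self._entitlements = entitlements

    def handle_request(self, client_id: str, user: str) -> None:
        # The client never learns which topics were applied for it.
        for topic in self._entitlements.get(user, ()):
            self._broker.subscribe_on_behalf_of(client_id, topic)


broker = ToyBroker()
app = SubscriptionApp(broker, {"tom": ["balance/update/tom"]})
app.handle_request(client_id="any-app-1", user="tom")

broker.publish("balance/update/tom", 100)
broker.publish("balance/update/harry", 999)  # never reaches any-app-1
```

Granting access to more data later is just another `subscribe_on_behalf_of` call from the subscription app, with no change to the client.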
what about monitoring then we've got
this publish-subscribe system and we're
going to be interested in what's going
on on there to make sure that our system
is healthy okay so maybe there's an
application there that's under really
heavy load can't cope with the messaging
rates maybe there's somebody listening
but they're listening for far too much
they've got the volume turned up too high
how are we going to deal with that what
about somebody who's shouting too loud
he's spewing out messages all over the
place and people actually aren't
particularly interested in what he's
saying what about him what's going on with
him well he's stuck
he's just completely stopped all right so
there are a couple of things you
need to watch out for we need to
make sure that people aren't consuming
too much resource we need to make sure
that our applications are healthy and
can keep up with our messaging flow and
we need to make sure that applications
haven't stopped completely well here's
an idea what about if we use our
publish-subscribe system to distribute
that data in the Solace world we have a
monitoring and management API and rather
than having everybody who needs that
information firing against the API
why not just have a single monitoring
app that listens to everything that it
needs to know and then publishes
that information over the pub/sub bus
and that way anybody that needs that
information can just connect in as a
normal subscriber and receive that
monitoring information what I
would say there is it's a case of
us in terms of a pub/sub system eating
our own dog food right we're going to
distribute our information by our
information distribution server so what
you can do then is it's really very easy
to attach dashboards for instance so you
know you might have somebody who wants
to build a web dashboard so you might
want to do that using REST you might
have a mobile application and you know
maybe your operations people like to go
home on the weekend and they don't want
to have to come into your office and maybe
having something on their mobile phone
to see what's going on is a great idea
for them and so that's it so what we end
up there with is a very quickly
moving slide this is an example of what
we can do if you come by our stand we'll
show you this but here's an example of
where we're sending out
statistics
on our publish-subscribe system
and then subscribing to that information
so for instance we've got message rates
we've got data throughput we've got
operational status here
and in fact any of the management
information that we can produce on that
platform we publish over it in this kind
of format okay all right what about
another use case then replay okay
so this is quite popular in the
microservices kind of approach so in
microservices people talk about polyglot
persistence and one of the concerns
there is what happens if my application
fails do I want to have another style
of database there that is going to
persist my state information so that I
can bootstrap myself on a restart
well the point is in a microservices
kind of architecture state is really
represented by what happened on our
message stream right so not just in terms
of where we are in the data flow but
also configuration information can be
sent as messages so for this we use this
idea of a queue that can subscribe to
topics that are published to so we are
mixing this publish-subscribe model
with kind of queuing because queues
are becoming subscribers and what we
can do with our queue is we can
subscribe to multiple topics it's fully
dynamic you can just add
subscriptions during messaging operation
so what that means in this replay of
state idea is that we can create a state
queue that listens to all of the information
that's relevant for our application
state so it could be configuration data
and it's probably going to be our app
data too our data stream and the point is
then we have a queue that contains all of
the state information and if our
application fails and needs to restart
we can simply restart and start browsing
the messages on the queue without
consuming them and that way we can fully
replay the whole state of the world from
where we are
there's some other things we can do too
we have this idea of a last value queue
which essentially listens to what you're
publishing and stores only the last
message that you published so when we're
bootstrapping we replay our state of the
world and the last piece of information
we need is how far did I get in sending
my last message before I crashed and our
LVQ will hold that information for you
if this state information is time-bound
if it's only useful for a certain amount of
time we can tag messages with a time
to live and so they will then pop off the
queue when they're no longer useful so
for instance we might have a process
that is doing maybe overnight batch runs
for instance so configuration
information from the previous night is
of no interest so we set our TTL to
expire those messages off the queue
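
The three queue behaviours just described (browsing without consuming, a last value queue, and time to live) can be sketched in a few lines. This is a toy model under assumptions: the class names, the `expires_at` field and the explicit `now` argument are hypothetical choices made so the example is deterministic, and they do not reflect how any particular broker implements these features.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Message:
    topic: str
    payload: str
    expires_at: Optional[float] = None  # None means the message never expires


class StateQueue:
    """Keeps every state message so a restarted app can replay them."""

    def __init__(self) -> None:
        self._messages: Deque[Message] = deque()

    def enqueue(self, message: Message) -> None:
        self._messages.append(message)

    def browse(self, now: float) -> List[Message]:
        # Drop messages whose time to live has passed, then return the
        # rest in order without consuming them.
        self._messages = deque(
            m for m in self._messages if m.expires_at is None or m.expires_at > now
        )
        return list(self._messages)


class LastValueQueue:
    """Holds only the most recent message it has seen."""

    def __init__(self) -> None:
        self._last: Optional[Message] = None

    def enqueue(self, message: Message) -> None:
        self._last = message

    def peek(self) -> Optional[Message]:
        return self._last


state = StateQueue()
state.enqueue(Message("config/batch", "nightly settings", expires_at=100.0))
state.enqueue(Message("orders/new", "order 1"))
state.enqueue(Message("orders/new", "order 2"))

lvq = LastValueQueue()
lvq.enqueue(Message("progress/sent", "order 1"))
lvq.enqueue(Message("progress/sent", "order 2"))

replayed = [m.payload for m in state.browse(now=50.0)]
replayed_again = [m.payload for m in state.browse(now=50.0)]  # browsing is repeatable
after_expiry = [m.payload for m in state.browse(now=150.0)]   # config has expired
```

On restart an application would browse the state queue to rebuild its view of the world, then peek at the last value queue to find how far it had got before it failed.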
last but not least I was talking earlier
about this idea of topic dynamism
and how our topics are created
on the fly and you can have a single
topic per message and this is a great
example of how you can do that
a colleague of mine actually
presented this [inaudible]
let's imagine that we are a delivery
company and we have vehicles with
deliveries on them traveling around and
we want our user to be able to
start up his application and
discover how close the vehicle is
to him now we could publish updates from
every single vehicle and have the
application process all those messages
and find the ones that are of interest
but that entails doing this
kind of geographic matching location
matching on every single update from
every vehicle hmm what if there's
something a bit more simple than that
well let's imagine we put the location
of our vehicle in the topic now the key
point here is that that means that every
single update from every single vehicle
is on a completely new topic and that
makes use of our topic dynamism
okay so we've got our vehicle location
in there as we said earlier if we want
to identify a particular vehicle we can
use our wildcards to do that we can
just say I don't care where car 21 is I
just want to know I just want to get it
but let's think about what we can do
with our wildcards then if we match
part of the location there let's say
we've matched the least significant
digit what that actually corresponds to
is a rectangle on a map actually at the
equator it will be a square but in most
cases it's a rectangle because we're
matching everything from in this case
latitude 51.50 to 51.59 but similarly
we can do the same for longitude
and that will give us a little
rectangle on the map well if you think
about it all we've got to do is scan up
and down the significant digits with our
wildcard and that increases and reduces
the size of the rectangle that we're
matching so actually we can match much
larger rectangles we can match a
rectangle of just about any size we
choose just by moving our wildcard up
and down the significant digits hmm okay
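
To make the link between a wildcarded coordinate and a rectangle concrete, here is a small sketch. It assumes coordinates are published as non-negative, fixed-format decimal strings and that the wildcard sits after a prefix that includes the decimal point; `prefix_bounds` and `in_rectangle` are hypothetical helper names, and the sample coordinates are made up.

```python
from decimal import Decimal
from typing import Tuple


def prefix_bounds(prefix: str) -> Tuple[Decimal, Decimal]:
    """Half-open range [low, high) of coordinates whose text starts with prefix."""
    decimals = len(prefix.split(".")[1]) if "." in prefix else 0
    low = Decimal(prefix)
    high = low + Decimal(1).scaleb(-decimals)  # add one unit in the last digit
    return low, high


def in_rectangle(lat_text: str, lon_text: str, lat_prefix: str, lon_prefix: str) -> bool:
    """Prefix matching on both coordinates is equivalent to a rectangle test."""
    return lat_text.startswith(lat_prefix) and lon_text.startswith(lon_prefix)


# Wildcarding after "51.50" covers a hundredth of a degree of latitude;
# moving the wildcard one digit left makes the rectangle ten times taller.
small = prefix_bounds("51.50")
large = prefix_bounds("51.5")

inside = in_rectangle("51.507", "0.127", lat_prefix="51.50", lon_prefix="0.12")
outside = in_rectangle("51.517", "0.127", lat_prefix="51.50", lon_prefix="0.12")
```

Each digit the wildcard moves changes the side of the rectangle by a factor of ten, which is the scanning up and down the significant digits that the talk describes.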
well that gives us rectangles but we want
a circular area how are we going to do
that well if anyone is familiar with
Fourier theory it turns out that any
circular shape can be more
or less approximated by successive
rectangles so all we do is add
increasingly refined rectangles and we
can get just about any shape we want so
if we were to look at our circle here
you can see that we've generated these
rectangular subscriptions that very
closely match the
circle and in fact we can do any polygon
we can get the accuracy down to the
meter level because that's the accuracy
of our latitude and longitude now the
key thing here is that for this circle
we generate the subscriptions once
and once we've done that our application is
then receiving any data that matches
that circle so what that means is that
now all of a sudden just by doing some
subscription generation we are now
geographically matching in that boundary
there and we will get updates from all
the vehicles within that match in real
time with no extra computation so we've
turned what is quite a difficult problem
of having to do lots of matching on
every message into something that's
simply a matter of generating the right
subscriptions so that's
quite powerful and how do we do it the
actual algorithm if you're interested is
we take the world and divide it
into four quadrants which match
subscriptions on the most significant digit in
this case and we see if our area of
interest matches that particular
quadrant and if it doesn't we throw it
away if it does what we do is then move
down our significant digits generating
our smaller subscriptions and we just
keep doing it so we divide our quadrant
up into 100 smaller quadrants and throw
away those that don't match and keep
those that do and then we do that for
every single one of those that do match
and just keep going until we get the
desired level of accuracy yeah it's quite
neat quite a neat little algorithm last
point on this thing is how are we going
to deploy our subscription logic we could
do it via a library right so every
application makes a call into the
library and that generates the
subscriptions for it hmm that's a bit of a
pain though because then you've got
dependencies on the library and
maybe you've got different
languages that you're using not ideal
well do you remember earlier I was
talking about our subscription app that
was used for security well we can do
precisely this we can have our
subscription app understand
our polygon algorithm and generate the
subscriptions for us so then what
happens is my user comes in and says I'm
interested in the information within a
certain area and tells the subscription
app what that area is our subscription
app generates the subscriptions and
applies them on behalf of our user
application so as far as the user
is concerned all he does is
connect send the area of interest and
all of a sudden he's got a live real-time
data stream and he's had to do no
computation whatsoever so it's
incredibly lightweight what's more any
changes we make to our subscription
algorithm we only need to apply them once so
it makes deployment really easy if we
need to scale this subscription app we
can just horizontally scale it because
we're just listening on our pub/sub bus
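
The subdivision algorithm described above, the kind of logic a subscription app could run on a user's behalf, might look roughly like the sketch below. It rests on several simplifying assumptions that are not from the talk: coordinates are non-negative so the four sign quadrants are skipped, the circle is measured in plain degrees on a flat plane, and a cell that lies wholly inside the circle is emitted early rather than refined further. All names are hypothetical.

```python
import math
from typing import List, Tuple

Cell = Tuple[int, int, int]  # (decimal digits, latitude index, longitude index)


def cell_bounds(cell: Cell) -> Tuple[float, float, float, float]:
    depth, i, j = cell
    size = 10.0 ** -depth
    return i * size, (i + 1) * size, j * size, (j + 1) * size


def intersects(cell: Cell, clat: float, clon: float, radius: float) -> bool:
    """True if the closest point of the cell is within the circle."""
    lat0, lat1, lon0, lon1 = cell_bounds(cell)
    nearest_lat = min(max(clat, lat0), lat1)
    nearest_lon = min(max(clon, lon0), lon1)
    return math.hypot(nearest_lat - clat, nearest_lon - clon) <= radius


def inside(cell: Cell, clat: float, clon: float, radius: float) -> bool:
    """True if all four corners of the cell are within the circle."""
    lat0, lat1, lon0, lon1 = cell_bounds(cell)
    return all(
        math.hypot(lat - clat, lon - clon) <= radius
        for lat in (lat0, lat1)
        for lon in (lon0, lon1)
    )


def cover_circle(clat: float, clon: float, radius: float, max_depth: int) -> List[Cell]:
    """Cover a circle with prefix rectangles, refining one digit at a time."""
    pending: List[Cell] = [
        (0, i, j)
        for i in range(math.floor(clat - radius), math.floor(clat + radius) + 1)
        for j in range(math.floor(clon - radius), math.floor(clon + radius) + 1)
    ]
    result: List[Cell] = []
    while pending:
        cell = pending.pop()
        if not intersects(cell, clat, clon, radius):
            continue  # throw away cells that miss the area of interest
        depth, i, j = cell
        if depth == max_depth or inside(cell, clat, clon, radius):
            result.append(cell)
        else:
            # Split into 100 smaller cells: one more digit on each axis.
            pending.extend(
                (depth + 1, 10 * i + di, 10 * j + dj)
                for di in range(10)
                for dj in range(10)
            )
    return result


def to_prefixes(cell: Cell) -> Tuple[str, str]:
    """Render a cell as the latitude and longitude prefixes to wildcard after."""
    depth, i, j = cell

    def fmt(index: int) -> str:
        if depth == 0:
            return f"{index}."
        return f"{index // 10 ** depth}.{index % 10 ** depth:0{depth}d}"

    return fmt(i), fmt(j)


cells = cover_circle(51.505, 0.125, radius=0.02, max_depth=3)
prefixes = [to_prefixes(cell) for cell in cells]
```

Each pair in `prefixes` would become one wildcard subscription. Raising `max_depth` gives a closer fit to the circle at the cost of more subscriptions.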
so that's really all I had in terms of
use cases we have obviously a lot more
those are some of the interesting aspects of
topic routing and filtering that maybe
you could apply to what you're doing
this is a developer conference so we
always have to have a code snippet right
so this is my last code snippet and it
goes something like this if your topic
is not excuse me your topic is not just
a tag then you get access to these topic
routing and filtering facilities like
geolocation like
versioning like message
replay and that's it thank you very much
hope you found that useful
[Applause]