Coordination-Free Computations

Christopher Meiklejohn

Recorded at GOTO 2015

Get notified about Christopher Meiklejohn

Sign up to a email when Christopher Meiklejohn publishes a new video

hi so I'm Chris Mikkel john i am at
bosch oh I won't touch them like again
because I was irritating but so the name
of this talk is coordination free
computations the title and the abstract
don't really reflect the work as it is
now but it did at the time that I was
invited to give this talk so we're going
to change things up we also didn't have
a name for a lot of this stuff because
it's like research and we like you know
change names and do all sorts of fancy
stuff so the new title slide is this so
what I'm going to talk about today is a
programming language that we're building
it's called last and it's for building
distributed eventually consistent
computations so this is joint research
under a EU funded research project
called sync free so I get the flag
because you're supposed to do that and
i'm christopher michael john i'm at
bosch oh and my collaborator on this
project is a peter ambroy in belgium you
might have heard of him he's done some
stuff with programming languages a few
times so anyway we're going to kind of
talk to the motivation so it's
interesting having the talk you know the
last sort of talk on the last day
because what you do is you just like
spend the entire conference working on
slides you don't see anything which
sucks but what's really good about it is
that Kyle gave a really amazing keynote
where he talked about all these problems
in distributed systems and how do you
reason about concurrent operations and
how do you think about concurrency and
then they all give a talk where she kind
of talked about weakening some of this
stuff and still having some of the
guarantees and Katie as well so it's
really great because they basically did
the background section of my talk for me
so I don't have to so so thank you to
both of them and thank you to Kyle so so
what kind of well kind of go through the
motivation so if you saw Kyle stuff Kyle
kind of made this point that CRT T's are
really really awesome because they
preserve these concurrent operations
that happen in a distributed system and
we're going to talk about why CRT T's
are good and then we're going to talk
about why they're not enough and what
we're trying to do to adapt this to make
CR DTS more expressive in more rich okay
so the general idea here about the
project is that synchronization is most
expensive and impractical when I refer
to synchronization I'm referring to
things like I have a database and I have
a copy of this database
saying you know you're open I have a
copy of this database in Virginia and I
want to synchronize some state between
these two so synchronization is really
expensive and for a lot of applications
synchronization is is kind of
Impractical right so we're going to kind
of talk about two cases why we feel that
synchronization is impractical so the
first one is um Rovio entertainment so
rovio makes mobile games you might have
heard of them they made this game called
angry birds and what they have is they
have shared state between clients so if
you think that like me and Katie both
play Angry Birds and we share the same
profile because we want to you know kind
of double our points and be the best
Angry Birds player in the system we have
shared state we're sharing a profile
we're sharing scores we're sharing
accomplishments the problem here is that
you have the shared state and clients go
offline and you don't want your system
to stop working you don't want to go
into the tube and like open up angry
birds and it's like sorry you got to
come back online or sorry sign on at
five o'clock so we can compute a
round-up axis and like share your scores
with everybody right so so this is
something that that we just don't have
and we can't support in this model we
need clients to always be able to make
progress you know depending on offline
and whether you know you know and when
clients go offline you know you have
this idea that if me and Katie both go
offline and we do things I do it at two
o'clock and shoot is at four o'clock
then we both come on at five those
operations appear to the system to be
concurrent we don't really know which
one influence the other we don't know
which one is correct and we don't know
this kind of a Internet of Things kind
of buzz where do you think that's
happening lately and the idea here is
that you're going to have like some time
shared state but you're definitely going
to have disjoint state and the state is
going to be aggregated and you might
have multiple ways of aggregating the
state you might have different
distribution mechanisms you might have
things going into the centralized DC and
then maybe a sensor can give another
sensor its information and it can share
that data on its behalf so again the
problem that you have is you have this
state that might be repeated on the
network or reordered on the network you
might be computing aggregates things
have to reason about what wins you have
to reason about what I should do if
should I count this you know if I have a
counter that's at five and you know if I
have a candidate's five and then all of
a sudden I get two updates from parties
that have the same replicas and it both
goes to ten
did it go to 10 or did it go to 15 you
don't know right it could have gone
anywhere in between so fundamentally the
problem that we're dealing with when we
think of these systems is that there's a
there's effectively no total order in
the system really we have these
replicated state these clients make
progress everything's offline things
come back online on certain periods and
we have no way of really reasoning about
the order of the events that occur in
wall clocks well if you went to kyles
can you know hopefully he demonstrated
that wall clocks don't work clocks go
backwards clocks go forwards you can
skew clocks forwards and backwards you
can have a noisy neighbor on amazon ec2
and your clock could go backwards you
can't use clocks and you definitely
there's this whole thing called doom
stones that we won't get into but it's a
really interesting thing that you can
read about the citation here that we
have is spanner spinners and interesting
papers as Google's like kind of globally
distributed coordination database
coordinated database and even in this
paper they kind of say well even with
the best hardware these GPS transceivers
that we have in all the machines we have
atomic clocks even with all of this
stuff all of this money all this
equipment the best we can do is
approximate something within a very very
small window but we still don't know so
they build a consensus algorithm
essentially around like understanding
that you can kind of do things with this
window and you can work with it but
again like that doesn't really solve the
fundamental issue right so what is kind
of fundamental here and the idea is we
have concurrency that has to be
reconciled by the user or the
application developer mmm excuse me mmm
so I get talking like Joe Plumstead
sometimes and I have to like slow myself
down so we're going to walk through an
example of why this is a problem so here
are we going to have two replicas of the
same object will assume they start out
with the empty state and what we have
here is at this time here we set so
there's a register so like Kyle talked
events can you know your register you
set a value so we set the value to 1 on
replica a and then we send the state to
line comes in gets the state and it has
one so then concurrently in the system
they might have an eight different
physical time spoken currently as the
system is able to reason about so
logical time we've set the value on
replica a2 b2 and we set the value
different be three
replica p and now these merge messages
happen and we end up with I don't know
what do we end up with and what is the
system do so systems like react and
react to 0 specifically basically have
this kind of allow mole allow mole true
sibling thing and what it will do is it
will store both values will store both
values and when you go to read you're
going to get two and three and it's up
to you to make a decision on how to
resolve this so this isn't really like
the best abstraction this isn't the most
useful thing that you could essentially
have to build a large scale to choose
your system which is already hard but we
do have a solution right so we have this
thing called CRT tease that a bunch of
people have mentioned and I guess I'll
kind of talk about them too because
they're cool I guess and and what they
stand for is conflict-free replicated
data types so there there are data
structures that are designed for
distribution and these data structures
have a deterministic resolution function
so they say in the event of concurrency
we're either going to arbitrate to some
value we're going to apply some bias
we're going to do something but we
guarantee that we know how to merge
these values and we can do it
automatically so if you look at systems
like by you they if by you is one of the
first systems to kind of pose this idea
that you could have a server divine
merge function that the user specified
and this was a way to resolve these
concurrent operations Co deities our way
to formalize this so what does CRT T's
look like they have this deterministic
resolution function they come in a
variety of flavors we have maps like a
dictionary we have sets bunch of
different types we have counter as a
bunch of different types we have multi
value registers we have last reuter wins
registers and then there's like a causal
time last writer wins register and we
have graphs and in react today we
support the first four and the graphs
there's a there's a paper kind of pseudo
code implementation done by Mark
supieras group who is the first to
formalize the cotp model and the
property that CR dt's realize in the
system is a property that we call
monotonic strong eventual consistency so
this is kind of a refinement on the
strong eventual consistency property
that was originally posed in this paper
if you're interested and what it says is
that if you have replicas in these
replicas are correct that as soon as the
system guarantees that all updates are
delivered to all the copies of the
object in the system so every object
observes the same updates regardless of
ordering regardless of batching and
regardless if messages are delivered
multiple times
you'll end up with the same stage so
I'll end up with equivalent state so
these objects effectively kind of
deterministically resolve to a value
that you can program again so you don't
have to think that a message reordering
might cause an object to be divergent on
two different replicas so to explain all
of this crazy stuff said people might
not know how many people know about crd
tease look at this guy no socio deities
oh we're going to talk to an example so
let's think about that last example so
supports writing natural numbers so 0 up
to infinity and art of our resin are the
merge function will be max so in this
example we have these concurrent
operations the emerges come in and
regardless of the ordering of these
operations regardless and things are
duplicated or batch differently we end
up with the right result because the max
is always going to move towards 3 so
this merge function you know if you want
to get into the heavy math the merge
function and it's I'm really having met
but the merge function is essentially a
commutative associative and I idem pit
in and the function kind of defines this
monotonic growth of the data structure
over time so we'll look Jesus will talk
about a little bit more complicated
example so we're going to look at a set
so one of the possible set types in in a
crdt literature is called the oor set so
this is an observed remove set so what
best as possible a sequential data type
version of set so what this says is that
under concurrent operations so at the
same time if I add and remove that if I
add and remove the same element the ad
will win so the ad will win and when
these objects merge it will still be in
the set and the reasoning for this ad
bias there's also a remove bias version
as well but the reason for the advice is
because from a programmers point of view
you can't necessarily remove items you
haven't observed unless you blindly
shoot updates without reading so this is
the closest approximate because you
can't map it correct like directly to a
sequential version of a set because it
doesn't make sense because that's it
we'll walk to example so here it might
be a little small I apologize to those
of you in the back we're going to have
three replicas of the same object please
in this example these dotted lines
represent merges and we'll see that some
of them take longer than others and
we're going to have this three tuple
here so this three tuples going it so
this is
going to be the user observed value and
this is going to be the metadata and
rest of the presentation and in this
will assume that the street rupal
represents the value and then a set of
unique identifiers of additions and then
a set of unique identifiers of removals
so what you do is if I want to determine
if one is in the set I take the
difference between the ad set and the
removed fit if the ad set is a superset
of the removes then the object is in the
set when i remove an element what i do
is i take all the observed additions and
i add them to the remove set so I Union
them into the remove set so in this
example I start off and I have two
concurrent operations on this data item
I add one and I add one so OneNote see I
generate a unique identifier called be
and I apologize for the reuse of letters
that wasn't thinking clearly when I made
this and at node a i generate a unique
identifier of a so I add those to the
set okay and then we see that we send
these merge messages so it's assuming
like optimistic replication we're not
talking about a quorum majority quorum
system here and then I have a remove so
this remove on on node C on replica see
hasn't observed any of this state from a
yet the merge messages haven't arrived
so C can only remove the additions that
it's observed so what it does is it
takes this it takes this be from the ad
set and it you use that into the remove
deliver all the messages in the system I
end up with the same thing so as these
merge messages come in I Union these ad
sets and I Union the remove sets and the
data structure monotonically advances
the state continues to grow and we end
up with the desired result so that's
great and like it's really great to have
these individual data items that you can
operate over but that how can you build
a program how can you build a higher
level kind of thing that has the same
properties that these individual data
items have so a little back story the
reason the language is called laughs is
because the lattice a semi lattice
effectively is the core abstraction that
we're using to model these data types so
in the same like kind of style of
etymology is list list where list is the
core abstraction that you can pose are
as our lattices that are composable so a
little history lesson on the name so
what we're trying to do is use these
lattices is the core abstraction and
then build larger programs that deal
with these lattices so we're going to
consider this example where we have two
data items
but we have two independent data items
so these are two crdt sets we're not
going to consider the metadata here and
what we want to do is we can compose
them using like an intersection right
and we'll get get the intersection it
will compute three and whenever these
data items change we want this
intersection to recompute kind of in a
dataflow style and then we want to
update this final set and what we need
to ensure here is that we can't naively
just kind of move it like transporting
these values here I can't just apply a
normal intersection against the user
observed value what I need to do is have
this metadata that we're not seeing here
that was in the other diagram I need to
have that metadata transparently sense
so the user should program like this but
it should preserve the metadata through
all those trans transformations and then
again I may want to do this I want to
make the output crdt and then commun Oh
apply a function application on it so I
of just multiplies the element in the
set times itself and then i have this 9
and that's another crdt so i have this
fragment and I want this fragment to be
the composition of these two also have
that merge property to also have that
strong eventual consistency property and
i also want this program fragment to
have that strong eventual consistency
property as well therefore i want the
entire application to have this strong
eventual consistency property and this
property will allow me to take these
programs and replicate them I can
replicate the data items I can replicate
the entire application code as well and
then I can guarantee that this one and
this one are merger ball they're
comparable via partial order on the crdt
and they have all these nice convergence
properties so that if they receive
updates in a different order I guarantee
that the entire program ends up being
correct so the problem with this is this
function application and data
composition is non-trivial gee who would
have thought right I mean I wouldn't be
up here if it was trivial right so so in
this example we're going to take the
previous thing from before and we're
going to apply a function so this is a
function that's going to take the
elements in a set and map it to two so
this is the function applied to the
values at replica see so as replica see
changes we apply the transformations to
the items here so as the mergers come in
we derive it and now we have two empty
set to great ok now I can do that here
so I can remove that replica be for the
sake of the diagram and I can apply that
to so as these merge messages come in as
I applied these functions I excuse me I
just apply the functions against
user specified value and great so now
you have that one and I have this one
and I have these two things but there's
a problem right because the problem is
is that this function is applied to the
external value right and this external
value doesn't map that metadata through
so now if for some reason this function
applied to replica see happens to be
behind and it's missing an update and I
run into a scenario where I have to and
I have the empty set I don't know how to
resolve that and I'm back to the
situation where the user needs to say
what needs to happen so mapping this
metadata is critical and the difference
we see here is that the user wants to
program against a user observed value of
the data structure they want to think
about the values of the data structure
and not the metadata but we want the
applications to actually work against
the state we want this state to be
transformed we want to apply these
transformations across all these
compositions and all these programs and
have this wonderful beautiful convergent
system and the metadata mapping is
non-trivial and we can't merge things
without the metadata so there's a bunch
of papers that have talked about this
Neal Conway originally kind of brought
this idea up in bloom where some of it
where he kind of talks about some of the
scope dilemmas and problems with
composition and functions and why you
have to seal functions and freeze
functions Russell Brown a co-worker of
mine dealt very very struggled for a
year with trying to figure out how you
do this inside one data item correctly
and thank God a quick check because we
found some insane race cases that would
have raised conditions that would take
like series of inter leavings to produce
and finally we wrote a paper about well
okay we finally figured it out for a
single object can we break it across
objects and no you're back to the same
problem so there's been a bunch of work
that's been happening over the past two
years on this kind of idea so what do we
propose so I'm proposing this language
called last it's a it's a language so
it's semantics performing for doing data
type composition it's a distributed
runtime built on top of react or so the
distribution model that's used in react
the database provides the distribution
of CR DTS and some anti entropy
properties and finally it's right now
it's an erlang library we hope to turn
this into a language Peter very much as
a PL person moving into the distributed
systems world and I'm a distributed
so we're hoping that we can meet on
something really nice and have a
language that maybe runs on like llvm or
something and and you know we can run it
all over the place and push you to
clients but right now for the purpose of
doing the research we're focusing on
Erlang which is kind of our core
competency at this point and the model
kind of realizes crd teases streams of
state changes so we like to think that
you know crdt is an object that
monotonically evolves over time at a
single replica so you have these partial
orders applied at each replica and we
want to connect these through these
monotonic processes so processes that
kind of read and make a change so you
know simplify that if you think about
that graph I showed earlier with the
intersection we just want to have it so
that when the inputs change they change
some output and we want to ensure that
the state is always growing we're always
kind of getting that monotonic property
that allows us to merge always carrying
provides a one-to-one mapping a CR DTS
and a many-to-one mapping of CR DTS as
you show as I showed through the earlier
slide with the application the primitive
operations that the language provides is
a monotonic read operation so this is
kind of a modified version of the read
upper the threshold read operation that
Lindsay Cooper kind of published the
alvars work about it provides a session
guarantee that kind of goes cross nodes
so you can do a read with a causal
context and ensure that you always read
a future object and you never see an
earlier object in the system it has an
update option that allows you to add
things and change their duties otherwise
the system would be useless it has
functional operations map filter and
fold it has set theoretic operations
such as product union and intersection
and finally all these operations are
lifted to operate over the metadata so
you don't specifically focus and program
with the metadata so we'll look at an
example of a map so holy God that got
cut off at the top oh god damn it um but
like so I mean the idea here is that you
know we have a function application here
this doesn't even what a real like
distribution like this would look like
because you have all these failures that
happen in here that Kyle talked about
like this evil cloud that's in the
middle where all these lines go all over
the place so we have this function
application and we have this function
all observe these changes at all
different times but this merge property
and carrying this metadata through
ultimately allows us to have a program
that has this metadata that allows us to
reason about the order that things
happened where objects are how the far
they behind and kind of get this
composed crdt system so what does the
architecture of this language look like
so it relies on realizing the system is
a bunch of variables in this kind of
shared data store that they mutate and
you'd say that like each node kind of
has some subset of those variables we
integrate with leveldb bit kaska Nets
leveldb is obviously the most optimized
storage engine we can use to persist
things that have the language actually
run pretty well EDSA specifically used
so we can quick check the implementation
generate lattices and then kind of make
sure that everything happens correctly
things go to the correct value and we do
this using the quick tech statham
extensions which are provided in the
Erlang version of eq c we use the crd
tease that were built by russell brown
and sean cribs in the react ET open
source library and react ed provides a
purely functional Erlang implementation
of all those crdt that talked about
before there's a bunch of new
implementations just last night at david
greenberg at two sigma released a
closure version of the optimized our set
in a open source library which is really
fantastic and he tested it very heavily
with I believe you use test out check
which was written by a call you have a
former colleague of mine which is
similar to quick check we have
centralized semantics so we can run on a
single node and we can kind of have the
system execute on a single node over a
single copy of a crdt for testing and
kind of simulating district different
distribution models we have a
distributed kind of semantics as well
that realized crd teased as a single one
and we have some proofs around how we
guarantee anti entropy and progress of
the system we have an entire paper about
this so if you're curious about how all
depth about it in the react core
implementation of this CR DTS are
distributed over a consistent hash like
using consistent hashing with hash based
partitioning we used majority quorums to
ensure fault tolerance and high
availability and in the event of
failures in the network we have an anti
entropy protocol that can repair things
so this is all part of just making sure
the guarantees of the system are held
and finally we have a hybrid model that
we're using to explore some of these
different Internet of Things
applications and mobile applications
that allow us to do arbitrary
so the next part of the talk i'm going
to show some code and then i'm going to
kind of talk about three example kind of
applications and then we'll kind of talk
about related work and where we're going
with the research so the syntax is kind
of familiar so this is erling if it
looks like you know if you're an erlang
programmer this should be pretty
familiar erling doesn't really have a
type system so given that we have to
declare things with types you have to
know if you're using it observe remove
set or a G set and we provide semantics
to map between these things so you can
map and observe remove set into a G set
and things like that we have this update
here so here we're creating a observer
moveset i'm calling it s 1 then we're
going to basically add elements 1 2 and
3 to s 1 this a here at the end is the
actor so this is a unique actor every
participant in the system has to
identify itself by a unique actor if we
want to ensure that we want to capture
the capture all the concurrency and if
you're interested in that you can read
the Sharon bus result which kind of
talks about that in detail and some very
heavy math and then finally here we
create a second set of the same type and
we met between them using this function
and you know this is just the prototype
syntax our single the you know we have
an entire kind of extension that removes
distribution from the picture completely
so we can kind of test things and
explore different distribution models
and you can see it's basically the same
you just have to supply a level DB
instance or something like that to know
how to persist the data structures so
we're going to talk about the ad counter
so the ad counter is the first example
that we were building for rovio
entertainment so this ad counter is
basically a counter for advertisement
impressions we want to push the
advertisements to the client we want to
have the system disabled advertisements
at a minimum of fifty thousand
impressions so Rovio is okay with
displaying and add more but they have to
guarantee under contractual obligation
that they have to display a minimum
number of ads doesn't cost that it
doesn't really cost them any more any
money to display more and finally we
will get clients to be able to display
these ads while they're offline so to
look at the information flow we have a
graph that looks like this which is
really small so we're going to zoom in
so in this part of the graph here we're
going to just kind of like take two so
we're going to have some rovio counters
and some Riot Games counters these are
ads these represent ads we're going to
kind of put them in a group so we're
going to add them into a set so we have
a set of ads we
in these ads together and then we can
kind of compute this product with
contract so contracts are just kind of a
you know to make the example a little
bit more complicated contracts basically
say that you know an ad is displayable
so we compute the cartesian product
across ads and contracts to get this
adds contracts and then we filter
basically on ads with contracts so this
is equivalent to a sequel join here the
product and filter combination and then
finally we push all these counters to
the client so the client actually each
of these mobile clients has all the
counters or some subset of the counters
and the mobile client will increment
those counters as advertisements are
just played and they use a unique
identifier so they properly capture the
concurrency in the system finally
periodically they'll send their
advertisements back to the to the main
data center let's say and then we just
have a process that waits until that
advertisement counter is above 50
thousand it disables the ads re
propagates through the graph and the
next time the client comes on line it
gets the it gets the change so the
information flow in this graph is
completely monotonic everything just the
state is always growing we have the
cycle in the graph that kind of pushes
data back through we enforce you know
this infinite process of sending data
through the graph through this metadata
so this is monotonic read operation that
prevents us from reading an object if
it's in the past because there's Kyle
sending his chemo you have this
interesting thing in a distributed
system well you might write an object
one and then write it 2 and then read it
in its one again right so we have to
have a primitive in the language that
enforces that we always read into the
future because that's and we do this
using a causal context so we effectively
use like version vectors and dotted
version vectors and those kind of
logical time techniques to prevent us
from reading an earlier value in the
system because failures will inevitably
happen we're going to look at the code
not because I want you to read the
syntax and all the comments I wrote
because i want you to see the size of
the code so in this example we have this
alignment of the slide is really
irritating but in this example we have a
client that kind of just you know
displays ads when it receives a message
to display them we have a server that
just basically this is a five should be
a 50,000 sorry it reads and it basically
blocks until the counter gets to that
point so this is the use of the
monotonic read operation and then we
have the rest of the code so this
section is responsible for just creating
the ads so we create ads and then we put
and sets those ads have unique
identifiers we compute the product and
filter basically just kind of two very
simple functions we initialize a bunch
of clients and we initialize a bunch of
servers and all of this code is in our
repo you can run it and just make tests
and it won't just run the whole thing
for you and you can play around with it
so so traditionally when we think about
distribution we would look at this graph
we would say well how do i distribute
this on our network well normally you'd
do this right you would say okay this is
the clients this is the server I'll run
this in a data center or on this on the
client right but there's nothing that
says that the distribution needs to be
that rigid in our model we could do this
or we could do this and the benefit we
have here is that the system itself is
fully composable all of these boundaries
here these communication lines are
enforced using crd tease with those
monotonic read properties with that
causality information so we can
guarantee that we can kind of say well
what I'm really going to do is I'm going
to run this in amazon's Virginia data
center and then I'm going to run this
component like maybe close to the edge
so I run this at a point of presence may
be in San Francisco and in Dublin then
maybe I run one of these in each of the
EU member states right so I run one in
in Belgium I'll run one in France and
maybe I'll run one in Portugal and then
finally we say well these are where the
clients run so we're really aiming to
build a system that allows for this
arbitrary kind of distribution of the
language as well and by having these
rich data structures that allow us to
have realized how things change in the
network and understand when data is
stale and when data is up to date we get
this arbitrary composition so there's a
property that's nice and this is a
have the prototype of the language
semantics which is the subject of the
first paper that we're working that's
under submission and then the
distribution boundaries is kind of the
focus of our next work so the second
application is thinking about
materialized view so I really like the
idea of materialized views I want
materialized views and react but
consistent system or hard because you
have still replicas and how do you
distribute the programs and God usually
everybody's uses JavaScript MapReduce
and I have to debug some like crazy
javascript vm sing and some ancient
version of spider monkey so I really
don't want that stuff so I want
materialized views so how can we you
know how many people here know what
materialized views are all right a fair
amount right so any
inverted index and solar is a
materialized view you know in sequel you
have materialized visa use all the time
as Neil Conway said to me once you know
all great problem that every great
problem in computer science is just
materialized view maintenance and I
think that's a really good quote and we
know if he originally said it but he's
the one who said it to me so I'll
tribute to him and he might hate me for
it but whatever so I mean materialized
views the goal here is to have
incremental e updating we want to
propagate only Delta's you know the
schema based on you know the schema is
kind of problematic in react but we're
going to try to build something that's
similar to like the 2i mechanism and
react and this model could be
generalized to something like voldemort
cassandra or any of these dynamo style
systems so what do we want to do so we
want to think about the database as a
stream of updates so react is a logical
unit here represented by this one circle
and we have a stream of updates come in
so this is update to key one two three
and then update to key one and two but
the way these systems usually distribute
data it logically not not always in the
implementation is as we learn but is
that you have these disjoint replica
sets so you have disjoint replica sets
and they don't have overlapping data and
then these updates kind of go here so
for simplicity the diagram we're going
to assume the updates to k1 go to
replica set 1 K 2 to 2 and so forth and
then within each replica set like in
react today you have full replication
between the nodes so we have three nodes
and all of these three nodes are
effectively equivalent so what are we
think here so there's an interesting
analogy here that you can do to the crdt
world is that you know we have these
things called computational crd tease
which is being worked on by a group at
Nova and they say well you can have a
crdt that only observes part of the
updates that's a really good kind of
analogy to how this is right it's very
similar to say well a computational crdt
that only observe some subset of the
updates it's similar to replica sets
right so if we think of react the
database it's just a huge map of CRT T's
it's just a dictionary then we can say
well you know okay that's fine then this
is related to computational co duties
and then inside of here we can think
that every node is essentially a crdt
because the crdt property says that all
copies observe all updates eventually so
we have this strong convergence property
in here so when we think about
materialized views we can think of well
react is essentially a database that
gets a bunch of updates and stores them
and when you read an object that's
effectively materializing an object from
the log right so you take all the
updates in the logged understand blindly
and we merge all of them and then we
give you the object and that's
effectively materializing that object
right so what we want to do is we want
to have a way where we can kind of use a
last program to define how to derive
some other data item from another data
item in the system which is effectively
materialized view maintenance so in this
example trolled again but in this
template so imagine we have a template
for kind of how to read data out of
react and write data to react well what
we could do is we can build the last
program that says okay I can just take
all those updates and the strong
convergence property I can take all
these updates and I can build a list of
all the objects in the system and what I
can do is I can you know if these we're
going to assume that these are like map
kind of data structures for the
simplicity of the diagram we can see
well what we can do is we also can build
from this object so we can compose these
as we're thinking about the last
programs we can build one that says well
I can collect all the people who are
over 65 and then I can collect all the
people over 80 and I can incrementally
update the a tea set from the 65 set and
I can distribute this view I can have
multiple copies of this view and I can
guarantee that I can merge these I have
recently guarantees I've fault-tolerance
guarantees then we also can build this
one of the people named Chris like me
and that's the one we'll talk about so
how do we distribute this so what we can
do is we can take this person named
Chris program which reads these values
out of the trdt so we think react is
just a big collection of objects and we
map that set into the set of objects
that are named Chris so we cut this up
and we have this we but we basically do
some clever program rewriting to create
a bunch of mini little programs and we
put them at the nodes and so this is
kind of similar to a MapReduce model
right if I have MapReduce than I could
say well I have you know I'll run the
maps at all the nodes to collect all the
people named Chris and then I can join
them all and produce them and get the
full list of people named course so we
have again this property that says well
given the nodes within a replica set all
observe the same updates eventually we
don't guarantee the order we don't
guarantee the batching we don't dare
into plication since all the objects in
the system on the node will receive that
if I actually put those programs at the
nodes as well those programs will affect
effectively have the same result so I
don't have to worry about multiple
copies of the same index getting out of
sync because I know they'll eventually
converge to the right
and I can incrementally update them and
then what I just need to do is I just
need to some across the replica sets and
we do some of this stuff today in react
ski listing and 2i mechanisms I just
summon this sum is not like addition
this is like commutative some so then
when I execute the program I just
collect the result here and I just kind
of merge them emerge that yeah yeah yeah
yeah I said that very poorly but you
know what i meant but i was trying to
distinguish it from like adding you know
but like I meaning like a generalize
like but yes we'll talk later but so
then I merge them so this effectively
this coordinator here is effectively a
reduced function if you think of like a
map reduce style a kind of reducer right
so it collects these results so these
are kept up to date and then I basically
build the object here so what we also
can do with this model is that we can
cache these results at the coordinator
so we can cash the results of the
executions here and we can we can
disseminate these through gossip as well
and then what we can do is when we
execute a query we can do a time-bound a
query we say I ok it's ok if data is a
little bit stale but I want to ensure
that the query execute within a bound a
certain time bound certainly in C bound
and then I can take any object that I
received and merge that with the cache
copy and guarantee that that will
converge correctly as well because the
cache copy just looks like a previous
version of the crdt an older version of
the crdt so we can do some aggressive
caching there which is nice because
again in some of these mechanisms today
so if you look at if you look at some of
the problems that things like you know
elasticsearch have with the leucine
indexing mechanism is that those indexes
assume that data is fully disjoint and
if you happen to contact two nodes that
for some reason have the same data item
and your computing you know you'll get
two copies back and they'll be different
the index will show that I had two hits
on two different objects and it won't no
one could just be a delayed copy that
you don't want to see and then this
obviously gets much worse if you compute
like a Cartesian product and the result
of the solar query as well because
you'll see that explode so it's very
important in those models that when you
execute that search query like in like
in distributed solar and in solar cloud
as well you'll see you'll kind of see
the results get kind of finicky so we
have some other work that we added
logical clucks to some of the solar
indexing stuff and we kind of show how
you can apply the same
makes that exist in CR DTS today to
these two some of these distributed
search engines as well so so if we if we
think about Internet of Things and we
look at this model that we had before
this is the react model where we say
okay the updates come in they logically
are kind of applied at these replica
sets and then they kind of go to these
nodes so how does this model kind of
apply to the Internet of Things model so
if we kind of change the information
flow here and you think that well these
are sensors and these sensors are
generating updates and eventually those
updates are aggregated upstream and we
want to execute so the program's no
longer finding people named Chris
because that's a really poor application
of the Internet of Things sensor network
but you know a more realistic
application would be if I have a bunch
of temperatures and I want to alarm if
any of the temperatures happen to be
over 90 degrees which is very common use
case for the Internet of Things stuff we
can kind of apply the same model here we
we derived from a single application to
the edge and then we can have these edge
applications execute this program and we
can determine the sensors that are over
and realistically in these models you
kind of eliminate this aggregation
upstream so a lot of the Internet of
Things kind of applications we're
looking at that exists today kind of
aggregate all of their data upstream or
try to aggregate all of their data
upstream there's been a bunch of work on
looking at the problem of the industrial
Internet of Things model where you know
you're aggregating data from your
sensors into some HDFS thing in your
centralized DC but now the problem that
you have is your sensors also generate
data like analytics data on the sensors
so sensors might be failing as well so
you kind of have the problem that you
also need to do sensor kind of network
stuff on top of the sensors that you
have deployed and a lot of the kind of
work that is happening it was just a
paper a few weeks ago kind of talks
about well I can't aggregate all of this
stuff upstream it's impractical to
aggregate all of this back to a
centralized data center so we need to
push computations the edge so there are
some interesting applications of RX and
a corba style IDL which is really crazy
but I don't know why somebody would use
that today but you know so people are
looking at this problem as well so if
you saw been Christians in stock
yesterday at
he talked about some of that RX stuff as
well so again what are we trying to do
so we're trying to execute computations
at the edge we're trying to have it so
you write these programs where you think
about all of your data and then we're
trying to do this transparent
distribution and we're trying to use
crdt so that we can kind of replicate
things to make it fault tolerant and
have kind of these nice ordering
properties that allow us to know if
we've seen objects in the system or not
and kind of reason about stuff that way
so what kind of just briefly look at
some of the other systems that exist
today for doing some of the stuff so
most of the work that we've done is
based on distributed oz so distributed
oz is a distributed semantics for the
odds programming language for its
deterministic dataflow programming model
again this model focus is on immutable
data it focuses on single assignment
variables so kind of the same
programming paradigm applies but you
kind of have to change things when
you're dealing with mutable data as CEO
duties are our earlier work that we've
kind of talked about before and I talked
about this last year at recon is that we
kind of have a deterministic data flow
model that operated over lattices but
wasn't as rich as the model that I just
talked about today bloom L a lot of
people are like well this is very
similar to bloom so it is similar to
bloom some of the fundamental problem of
the fundamental differences are is that
bloom functions effectively seal at a
given time so if i have a lattice so
their lattices also don't support
removal kind of without coordination so
I can't use something as rich as and
observe remove set in the blue model as
it exists today additionally since since
these things are sets that always grow
if you have a function that's applied to
that set effectively like a function
might compute like a boolean over set if
the cardinality is greater than 10 or
something once that trigger is met at
the front the function effectively seals
it no longer distributes it to the graph
and when you read that value it's just
always that so that's set you know so
obviously that ceiling property and the
idea that you can't use things that you
remove we kind of remove that
restriction on our model so that's kind
of a an extension of that work again the
bloom stuff is kind of a logic
programming approach declarative like
kind of logic programming approach
ours is trying to be a more functional
approach so we're trying to make it very
similar to functional languages the
alvars work is very similar the l VARs
work focus is on a de Toron
deterministic concurrency on a single
notes the elv ours does not think about
distribution another criticism of the
alvars work is that you have to have
like kind of a priori knowledge of how
your data structure is going to change
you always well that the threshold read
primitive always needs to know the state
and that's problematic when you use an o
are set because the o.r set generates
unique identifier no local these unique
identifiers at the nodes that are
performing the update and given that you
can't threshold on something like a no
are set without knowing what those
unique identifiers will be and if you
threshold on the value you have to
freeze so there's some similar work and
lindsay has another paper called joining
forces that kind of tries to bridge
these two models and we kind of extend
that work as well so if you're
interested in that and this is a haskell
library so you can play around with that
similar the D streams work kind of
focuses on doing kind of having
computations that track lineage there
are some interesting ideas if you know
the data is kind of if you don't have
this property where data is immutable
and you're going to kind of store that
data in HDFS and you can rerun these
computations or or fix things that are
failing the lineage information is kind
of not enough if you have mutable data
or if you have data at the edge so we
are trying to kind of say that the
lineage work is not enough you kind of
need this causal history information to
cut to do some of this stuff correctly
and finally a hummingbird is a very
similar model this is a this is an
analytics platform that Twitter
developed and wrote a paper about
something birds a similar model
hummingbird uses a commutative semi
groups we use CRT T's commutative I
dumped it in semigroups so it's kind of
a slightly modified model but it has a
ton of similarities so this is an
interesting paper if you're curious on
building systems like that so to wrap up
I'm just going to kind of talk about
where we're going with the work so the
first one is invariant preservation so
if we go back to that add counter
example and we think about how I gave
those ads out to the clients the clients
eventually talked back to the server and
then the server would disable the add
that I Clinton next time the client read
that's great but that only allows us to
enforce a lower bound a minimum of the
number of advertisements that we want to
display so what happens if we want to
enforce an upper bound
so some of the work that of altar and
new nose group have done at inova focus
on using like escrow techniques where
the counters that we send to the client
have an upper bound the client can only
increment that counter a bunch of times
before it has to stop the trade-off
there is that if you exhaust all those
impressions before you come back online
what do you do right so that it's
interesting to think about well can we
have this kind of hybrid model where
some of those counters are capped
because of some restriction that the
client places where some of those might
not be so there's an interesting model
and they have a variant of crd teas that
allow for this escrow kind of thing so
it tracks where the leases are going and
synchronization freeway which is
different than the original escrow and
demarcation work if you're familiar with
that we're looking at causal plus
consistencies this is a big thing as
part of the research group at the sink
free research group is exploring calls a
plus consistency causal plus consistency
extends the causal consistency model
tumor jable transactions so you
effectively have transactions that never
ever roll back if you're operating over
crdt you always can just continue moving
forward this kind of extends like the
Eiger gentle rain and that that series
of transaction work that's happened and
we have a research database that we're
working on called antidote which is the
other project that I work on this part
of sync free which has which has an
example protocols for causal plus
consistency so this is so one of mark
students one of mark's students did his
PhD work on this and I believe they have
a proof that shows this is the strongest
this is I'm sorry this is the strongest
consistency protocol you can provide
while being a hundred percent available
finally all of the stuff I showed you
today use those sets that store those
three tuples that track every Edition
that's ever happened at the set
obviously that grows unbounded so that's
a problem g so again we know how to
solve this we have a data structure
called the observed remove set without
tombstones or the o.r swat and this data
structure uses version vectors it
effectively use this kind of a dotted
version vector technique divergent
vector set technique to store to kind of
collapse this metadata down so you don't
have to worry about garbage collection
so you only have to worry about the
number of actors
in your system instead of the number of
operations is really nice we don't have
it implemented but we know how it should
work so this is just a matter of doing
some engineering and finally Delta State
CR DTS and operation based crd T's are
two ways of reducing the amount of data
that you have to send on the network the
amount of data you have to propagate so
this is really important work and
deforestation is some interesting work
as well if you think about if we do this
if we do this arbitrary distribution so
deforestation there's a paper that
Phillip water published and it focuses
on intermediate tree elimination so if
you do like a map a map and map the
compiler can basically optimize it out
so languages like Haskell and closure
have a belief closure has this but have
in Scala as well and so it's interesting
to say well how do you apply this to an
arbitrary distribution model right so
now if things were distributed you know
things that you can't basically do
fusion on but can you have a system that
derives a better distribution model that
would allow you to do less computation
so this is a very interesting area that
I've talked to some people at pfl about
so finally there's open source code you
can check it out we got this paper that
you could read and this one and there's
this conference called recon that's
pretty cool in a house poke at and I
spoke at and bachelor unze and you
should come to it and if you like
talking about computer science he's just
MIT a talk to it so i highly recommend
that and yeah i would be in trouble if i
didn't mention that this research is
funded by the sink free project which is
a seventh framework a seventh framework
programme funded project under the EU so
i think you'd i'm very much for all of
the money and everything that they've
done to let us continue moving forward