Navigating Unstructured Data - Availability vs. Analytics in NoSQL

Matthew Brender

Recorded at GOTO 2015


Alright, hi everyone. I'm Matt Brender, a developer advocate at Basho. I'm not going to talk much about what we do at Basho, except to say that we build Riak, an open source distributed database. That's why I've been researching a lot of other NoSQL databases and how people build a distributed system as part of a larger architecture. It really comes down to discussing unstructured data, which is a confusing term, and we'll get into a few theories. If you want to go learn how to develop an app, I'm not going to show any source code here; there's plenty of code in our GitHub repositories. There's actually something called the Taste of Riak that will walk you through installing a client in your favorite language and getting your first code into our distributed database. We have a dev cluster you can run locally, spin it up in Vagrant, all that stuff. But this isn't about us; it's about trying to solve a much more complex problem. I want to talk about some of the theories and architectures I've seen along the way, which I find really fascinating, because at heart I'm more of a sysadmin than a coder, more of an ops person than a developer. I straddle the two.
When we talk about unstructured data, it's not that there are no semantics to it or no schema whatsoever; it's that it's just different from what we're used to. We're seeing a lot more data that looks like this JSON blob of a tweet at GOTO Chicago from me, as opposed to something that fits very cleanly into a nice table with rows and columns. And that's because the quantity of data sources is a little absurd these days. We have social media data streaming from everywhere you could possibly want. You can latch onto APIs like GitHub's and get every commit known to mankind happening right now. Every device we have, the few on our persons and the many more in our homes, is streaming logs in real time. The schemas change, the quantities change, but the one thing we're sure of is that there's a lot of value in correlating these disparate data types, and that's interesting.
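For reference, here's a hypothetical sketch of the shape of such a blob; the field names are illustrative, not Twitter's actual API schema:

```python
# A hypothetical tweet-like JSON blob. Nested objects, lists, and
# optional fields make this awkward to flatten into fixed rows and columns.
tweet = {
    "id": "584905167129571328",
    "text": "Listening to a talk on NoSQL at #GOTOChgo",
    "user": {"handle": "mjbrender", "followers": 2400},
    "entities": {"hashtags": ["GOTOChgo"], "urls": []},
    "created_at": "2015-05-11T14:32:00Z",
}
```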
If only we had a good term for that. I haven't heard many, so I'm going to go with the one I hear the most; if I don't hear at least one groan in here, you'd be the first audience: big data. We all hate it. It's trying to capture the idea that there are quantities of information that don't fit on a single system, or that come in different forms than we're used to, and Postgres just doesn't seem to be cutting it, or MySQL isn't the best, and I don't have enough budget for the Oracle database they're trying to sell me. So here we are, some mash-up of ops and dev, and we get to be lion tamers with all this data coming in through different streams and different architectures, trying to get real information out of it. But anyone who has ever inherited an architecture knows you don't get to start from scratch. You don't get some beautiful playground where you get to go with the best choice on day one. You're usually building a prototype, and that prototype sits on something very fragile if it actually achieves the scale you're attempting. So how do we start thinking about these systems, and what are the opportunities to iterate on them along the way?
Well, let's think about what big data actually encompasses, even if you're still groaning at the term. There's a good definition: data growing in velocity, volume, and variety. Velocity: we used to be perfectly fine with hourly or daily reports; now we want information as close to real time as possible. Volume: it used to be nothing to have kilobytes, but we're growing to terabytes, and there are projections of something like 40 zettabytes; the numbers are getting absurd for the amount of information we want to store. And variety: it used to fit into nice rows and columns, and now we're dealing with disparate types of logs, schemas that change on the fly, information that isn't easily normalized or denormalized, and yet you still want to correlate it. You still want your data scientists, or you yourself, to put it together into something of value to the business.
What I find interesting is that this is fundamentally just a distributed systems problem. Because data doesn't easily fit into the buckets we're used to using, we have to think about how we coordinate across multiple machines. If you want to understand the computer science behind that, there are two very smart people in the back, both presenting today, who will be going into some of the theory and practice of it. I'm going to talk from the strategy side; let's think practically, as nerds who are newer to this.
So when you're looking at data variety, the information just doesn't fit naturally into a single type of database. As opposed to trying to massage and transform information on the fly from your application layer right down into your initial persistence layer, it's better to store raw events in something that can handle unstructured data types and parse them after the fact. And achieving the scale we're working at these days requires more than one computer.
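As a minimal sketch of that "store raw, parse later" idea, assuming nothing more than a generic key-value interface (a plain dict stands in for the store here):

```python
import json
import time
import uuid

# A dict standing in for a generic key-value store with put/get semantics.
store = {}

def store_raw_event(source, payload):
    """Persist the event exactly as it arrived; no schema enforced on write."""
    key = f"{source}:{int(time.time())}:{uuid.uuid4()}"
    store[key] = json.dumps(payload)  # raw JSON blob, schema-free
    return key

# Later, a batch job can parse and normalize on read instead of on write.
key = store_raw_event("github", {"type": "push", "repo": "basho/riak"})
event = json.loads(store[key])
```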
When you look at velocity, there's no one-stop shop. There isn't a single platform, and you shouldn't believe anyone trying to sell you one, that will do everything from batch to real-time analytics for you, or store information at the scale and variety you're dealing with today. And with volume, there are different heuristics you want to apply for small amounts of data versus very large quantities. You might be absolutely happy storing terabytes of information on HDFS or another distributed file system, but if you're expecting anything remotely low in response time on the front end, that's an incredibly inappropriate choice; you need something that gives you latency guarantees.
So because of that, there's a solution that comes up over and over again. It's also not the most useful term in the world, but NoSQL is a helpful conclusion. How many people came in here very familiar with NoSQL? Okay, so a portion of you are familiar. Really, just think of it as anything that doesn't comfortably fit into a relational database. Sometimes it comes in the form of a graph, sometimes it's columnar, sometimes at the very basic level it's key-value: data stored in ways other than what we've been doing for the last 40 years. And when we talk about NoSQL, there are a lot of options. This slide is intentionally unhelpful as a graphic, because there are a number of varieties: key-value, document stores, graphs. They all tend to get smushed together, but they solve slightly different problems in slightly differently optimized ways. Whether you're trying to solve a problem of availability at scale, or find insights on the fly in a sub-millisecond way, or do text search versus relationship analysis, you have to understand that each tool has its own use case. Right now there isn't a single tool that does it all, and that's for a very good reason: there are quite a lot of trade-offs. I want to go through three in particular that come up as we analyze different NoSQL platforms.
When we're evaluating a NoSQL solution, I've found it helpful to look at three categories of information about it. First, what does it do when it comes to consistency; what is it guaranteeing? If you saw Kyle's keynote on Jepsen, you've got a very deep understanding of consistency and what it means; I'll talk about it from a higher level and take a step back. Second, conflict resolution: what happens when there's the opportunity for conflicts, which you will eventually, always, have in a distributed system? Either you have high availability or you have no conflicts; it's really one or the other. And third, partitioning: how are you separating information and spreading it out over the cluster to guarantee the sort of performance and scalability that you need?
The definition we always come back to started out as just a conjecture from Eric Brewer: the CAP theorem. I'm going to take a second to belabor the point in case people aren't familiar with it. I find it helpful to understand that the CAP theorem is really just asking: in the case of a partition in the network, what should I expect to happen? Will clients on each side of that partition still be able to read and/or write to the database they can see, or is it somehow not going to be available? Are we going to sacrifice availability to guarantee some form of consistency? The only thing we can be certain of across multiple nodes is that partitions will occur, so which property are we willing to tune, and at what cost? People tend to break this out into pairs. An AP system, which is what we're generally talking about with the majority of NoSQL solutions, provides higher availability, which means lower latency, but at the cost of some of the consistency, which we'd have to resolve down the line. A CP system won't be available during a partition, but it will be able to scale across multiple nodes. So it's a question of whether you expect an error from your system, or whether you expect an answer even if that answer isn't the most up to date. A CA system is what you're used to: a Postgres database, a MySQL database.
Just to map those out a little and give some examples: I'm most familiar with Riak, the database we make at Basho, and that's a tunable AP system. Cassandra is similar; they're both based on the Dynamo paper from Amazon in 2007, which goes through lessons for building distributed systems in a highly fault-tolerant way, recognizing that there are more important things than consistency alone. If you value consistency over availability, you can use a strongly consistent platform like MongoDB or Redis. Redis, of course, is in-memory only, so it's more of a caching layer than something you'd use for persistence; you have to understand the layers you're going to build on top of each other.
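As a minimal sketch of what "tunable" means in practice, here's roughly how per-request quorum values look with Riak's Python client; the API details are from the 2015-era riak-python-client and should be treated as an assumption (the Taste of Riak docs are the authoritative reference):

```python
import riak

# Connect to a local dev cluster over protocol buffers (port is illustrative).
client = riak.RiakClient(pb_port=8087)
bucket = client.bucket("carts")

# w=3: wait for three replicas to acknowledge the write (more consistent, slower).
obj = bucket.new("user-42", data={"items": ["book"]})
obj.store(w=3)

# r=1: answer from a single replica (lower latency, possibly stale).
fetched = bucket.get("user-42", r=1)
```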
Now, focusing in on the AP systems, because I think everyone agrees that's where it gets interesting. When you're going across multiple nodes, if you have a thousand nodes and you get a partition down the middle, you don't want all of them to become unavailable; you'd rather have two 500-node clusters that are still available. But inevitably, if we're using a key-value database like Riak and I write to the same key on each side, how do I know, when that partition heals, that I'm going to have information worth having? Or is it just going to be smashed together in the way Kyle was showing with Jepsen, where information that shouldn't be overwritten in fact is? How do we guarantee the information is correct? It comes down to the way in which you deal with conflict resolution.
There are really two main options. You have last write wins, where you're trusting the clock on your system, or you use some sort of causal context; these are called vector clocks, or dotted version vectors, in various platforms, and they're basically algorithms by which you can guarantee that information is written in a certain order. So take a step back and think about it from a theoretical standpoint: can I rely on the clocks in my system, or do I need a causal chain that I can resolve on my application side? It's important to understand the difference and be ready to anticipate either that loss or that answer, even if the answer is in fact not the most recent one you expected; that's the nature of a distributed system.
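A minimal vector-clock sketch of that causal-context idea; real implementations such as Riak's dotted version vectors carry more bookkeeping, but the comparison logic at the core looks like this:

```python
# One counter per actor/node; a clock is just a dict of {node: count}.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a has seen everything clock b has (b happened before a)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    """Neither clock descends from the other: a true conflict (sibling values)."""
    return not descends(a, b) and not descends(b, a)

# Two clients update the same key on opposite sides of a partition:
left = increment({}, "node-a")    # {"node-a": 1}
right = increment({}, "node-b")   # {"node-b": 1}
assert concurrent(left, right)    # conflict: surface both, let the app resolve
```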
Next up: when you're dealing with multiple systems, you ultimately have to partition where information is stored. A system like Mongo, the CP system, has a master-slave methodology: each portion of the data has a master and then multiple slaves, other nodes that can resolve reads. Compare that with something using a consistent hashing table, as Cassandra and Riak both do, where data is stored in partitions, usually called vnodes, and distributed evenly across all available nodes in the cluster. Now think about the failure scenarios in these different architectures. If a node goes down in a consistent hashing ring, you have to have some logic by which information is read from other nodes; that's true of both Cassandra and Riak. In the master-slave methodology, there's the case where the single node that goes down is the master, and in that window a slave needs to be elected as the new master, so there's a portion of time where your database is actually unavailable, which is not the case in a consistent hashing design. It's important to keep in mind as you build these systems: can I expect that a single node going down will bring down my infrastructure? You have to recognize that master-slave methodologies do carry that risk, while a consistent hashing design works around it: you define some sort of quorum, and as long as that quorum of nodes is available, your system will be available.
To see that in a little greater context, this part is Riak-specific, so you'll have to excuse me. With data in Riak, say we have a four-node cluster: data is distributed by the consistent hashing algorithm, and chunks of information are spread evenly across each node. A portion of the consistent hashing table is stored on each vnode, which then lands on one of the physical nodes, ultimately meaning data is spread evenly across all nodes. What makes that important is that you want predictable scalability from your infrastructure. With a master-slave methodology you usually have to define where data will go in your application logic, adding semantics on top of what the database gives you, while with a consistent hashing ring the system disperses data for you automatically, based on an algorithm. That saves you application logic, and it gives you a more linearly scalable model.
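A minimal consistent-hashing ring sketch; the vnode count and hash function are illustrative choices, and Riak and Cassandra handle all of this internally:

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hashing ring with several vnodes per physical node."""

    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several points (vnodes) on the ring.
        self._ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next vnode on the ring."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["node1", "node2", "node3", "node4"])
print(ring.node_for("user-42:cart"))  # placement decided by the ring, not the app
```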
So let's switch gears and talk about a few tactics for designing your information and how you store it; let's put a thinking cap on. I'm asked a lot, or I usually hear the statement: "My data is too complex for key-value databases. This NoSQL thing is fine, but I need a relational database with nice transactions and the certainty and stability I'm used to." The thought experiment I've found that gets people into the right mindset is asking: what questions do you anticipate needing to ask? On a per-user basis, what orders will they have? In a hierarchical tree of information, what are the subtrees? If you can answer that question in a nice atomic way, if you can write that information down, then key-value is an absolutely perfect architecture for your needs. To say it in a simpler way, we can go old school: the original key-value data store is the Dewey Decimal System. Every time you write information down on the card, it has a definitive answer on the other side, and you can go get it. Aggregating data that way is actually very effective, and when you can rely on each key being available in this method, you can scale in a very nice way.
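A minimal sketch of designing keys around the questions you anticipate; the key naming scheme here is made up for illustration:

```python
# A dict standing in for a key-value store. If "what are this user's
# orders?" is the question you expect, make the answer itself a key.
store = {}

def put(key, value):
    store[key] = value

put("user:42:profile", {"name": "Matt", "city": "Boston"})
put("user:42:orders", ["order:1001", "order:1002"])  # the answer, precomputed
put("order:1001", {"total": 24.00, "items": ["book"]})

# One atomic read answers the anticipated question; no join required.
orders = store["user:42:orders"]
```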
Stolen from a fantastic blog post about data modeling, here's an example of thinking about the branches of information on an e-commerce site: some arbitrary amount of that information is treated as a single tree and stored as a single aggregate. This is denormalized, so you're actually going to have some repeated data, but it's a little JSON blob; it doesn't cost much space, and it's going to be so much quicker on a read that it's worth your time, which is a totally different way of thinking from a relational database.
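A hypothetical version of such an aggregate, with the whole order tree stored as one JSON blob under one key:

```python
# Customer details repeat in every order (denormalized, not joined),
# but one read of one key returns the whole tree.
order = {
    "order_id": "1001",
    "customer": {"id": "42", "name": "Matt"},  # repeated per order
    "items": [
        {"sku": "BK-7", "name": "Designing Data Apps", "qty": 1, "price": 24.00},
    ],
    "shipping": {"city": "Chicago", "status": "shipped"},
}
# store.put("order:1001", json.dumps(order))  # one aggregate, one key
```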
Another blog post that really influenced the way I've been understanding how to design data types looks at them from the angle of a stream processing system, seeing events in a more immutable way. This is borrowed, thankfully, from Martin's talk at /dev/winter in London. He suggests that instead of formulating data in a way where you're updating a table, or updating a value in place, or reading a key and putting that key's value back, you try to think of everything as a state at a point in time, and save those as raw events. That will let you think about your architecture in a totally different way. His example is adding items to a cart: the first time, you could have a quantity of one of whatever you're buying; then you could update it to three, and then you could update it to two, as opposed to keeping some relative value and adding and subtracting. Each of these events is a true statement; they're accurate in and of themselves. What's nice about that is that in the case of a conflict you can use something like the dotted version vector logic we were just talking about and resolve it with more certainty. You could also feed the raw information into some other system and see: oh, my users actually changed their cart a few times before this, and that might have some interesting downstream ramifications.
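A minimal event-sourcing sketch of the cart example; every event records absolute state at a point in time, never a relative change:

```python
import time

# An append-only list of events; each records "quantity is now N",
# never "add 2", so every event is true in and of itself.
events = []

def record(cart_id, sku, quantity):
    events.append({
        "cart": cart_id, "sku": sku,
        "quantity": quantity,   # absolute state, not a delta
        "at": time.time(),
    })

record("cart-9", "BK-7", 1)   # first add
record("cart-9", "BK-7", 3)   # customer changes their mind
record("cart-9", "BK-7", 2)   # and again

# Current state is just the latest event; the full history stays replayable
# for downstream analysis ("users changed this cart twice before buying").
latest = max(
    (e for e in events if e["cart"] == "cart-9" and e["sku"] == "BK-7"),
    key=lambda e: e["at"],
)
```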
I also wanted to talk about some of the architectures that are quite interesting, which came up in our initial conversation at the keynote. Just to remember the kind of problems we have to deal with as systems people, looking at lots of different systems built up: try to report on something like this. This was the real data platform behind LinkedIn before they started migrating and building what is now Apache Kafka. They had a number of application services with queries on top, going into messaging queues, leading to more apps, leading to key-value stores, with log aggregation on the side being stored across disparate data types. There's really no clear pathway along which information travels, in any way that would give you an opportunity to replay information, in that beautiful way we were just talking about, and do some interesting analytics on top of this.
What they proposed, and what they have since moved to at LinkedIn, is something more like this, and I really love this architecture. It's obviously easier to read, but the idea is: applications store to some local information, some sort of database that's appropriate for the application, say, if you're trying to do some analysis on relationships; and they also write that data to something persistent, with a guaranteed persistence level, like an AP NoSQL system that scales linearly. Then you can pull that information either directly out of the key-value store, or also store it into a search system like Elasticsearch and run queries on top of that. What I love about this is that it then feeds into some sort of streaming data platform; they of course recommend Kafka, they built it, but you can use AP NoSQL solutions as well for a streaming system where information can be pulled out from that central source. You could use a database like that, but Kafka is really cool in that it gives you an offset for your information: each stream is a repeatable list of data in a certain order that you can pull off and feed into all these different systems. Ultimately it can drop down into something like Hadoop, and you can do further batch processing after that.
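A minimal sketch of the idea behind those offsets, using a toy in-memory log rather than Kafka's actual API: a stream is an append-only ordered list, and each consumer tracks its own position, so any downstream system can replay from wherever it left off:

```python
class Log:
    """Toy append-only log with Kafka-style offsets."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        self._entries.append(event)
        return len(self._entries) - 1   # the event's offset

    def read_from(self, offset):
        return self._entries[offset:]   # repeatable, in order

log = Log()
log.append({"type": "page_view", "user": "42"})
log.append({"type": "purchase", "user": "42"})

# A search indexer and a Hadoop batch job can each consume independently:
for event in log.read_from(0):          # the batch job replays everything
    pass
```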
So when we start thinking about architectures, we can think about something like an error-analysis system, where you store information primarily in your NoSQL solution and either pair it with a Solr cluster or use a document store where you can do that search and pull information out right there. If you're trying to find other patterns, you could do what we saw in that top-level architecture, where applications use a multi-client write model to write to multiple data stores simultaneously, and then have some sort of batch system on top of that; it's almost always a batch system, pulling information either out of the faster, lower-latency products or out of storage on HDFS. You can also take information from NoSQL through some sort of ETL process, where you extract your streams, transform your information, and push it into a messaging queue now that it's in a cleaner format, and then pull that into whatever analytics platform you're using today.
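A minimal sketch of that multi-write-plus-ETL pattern; kv_store, search_index, and queue here are hypothetical clients, and the field names are made up:

```python
def handle_error_event(event, kv_store, search_index):
    """Multi-client write: the app writes the raw event to both stores."""
    key = f"error:{event['service']}:{event['ts']}"
    kv_store.put(key, event)          # durable, linearly scalable AP store
    search_index.index(key, event)    # Solr/Elasticsearch-style search

def etl(kv_store, queue, keys):
    """Extract raw events, transform them, push the clean form downstream."""
    for key in keys:
        raw = kv_store.get(key)
        cleaned = {"service": raw["service"], "message": raw["msg"].strip()}
        queue.publish("errors.cleaned", cleaned)  # analytics pulls from here
```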
Just to visualize that a little, again borrowing from the amazing work by the Kafka team: picture a web service pushing information directly to an event stream. The raw information can then be stored in something like a NoSQL database, and you can also aggregate that information and store it to another bucket on the same database, or somewhere different. There's a lot of flexibility in building these things; it's really about thinking through how you're storing immutable information. Looking at another architecture that explains it from a response-time angle: relational databases, NoSQL databases, and other apps all expect very low latency; they can feed information into a streaming platform like Kafka, and then you can have these levels of batch analytics following after.
And then, it was brought up briefly, but there's the lambda architecture, which supposedly can beat the CAP theorem but is really just a non-linearizable way of seeing information: both a speed layer for the information that's arriving continuously and a slower batch layer, with the two aggregated once requests come downstream. The architecture of these things can be quite complex, but it has the benefit of giving you both your faster data and your batch data.
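A minimal sketch of the lambda architecture's query-time merge, assuming a batch view recomputed periodically and a speed layer covering only events since the last batch run:

```python
# Two views of the same metric; neither alone is the full answer.
batch_view = {"page_views:user-42": 1000}   # slow, complete, hours old
speed_view = {"page_views:user-42": 7}      # fast, recent events only

def query(key):
    # Serve the aggregate of both layers at request time.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_views:user-42"))          # 1007
```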
So just to take a step back: the reason this is all interesting is that you can't analyze information you haven't been able to store. Storing information ends up being a very complex problem that you need to take into consideration, and as you do so there are a great many challenges to balance: what are my latency expectations for each of these applications, how am I going to maintain them, how do they handle conflict resolution? What you find very quickly is that most of this works at very small scale; it gets complex once you go across multiple systems, which gets into the genuinely hard problem of distributed systems.
To summarize some of the core problems we're dealing with, there are some terms here I like. I like thinking of NoSQL as really just a collection of highly available, scalable systems that fall within the scope of the CAP theorem; unstructured data is just any sort of information you're trying to parse; and then there are the architectures we were talking about, and some of the tools. I've certainly ended early, so I'm happy to bring up more specific cases and discuss them.