[Music]
So, from minutes to seconds: how we use volatile storage to power our dispatching algorithms, and how we went on a journey from matching captains to customers within minutes to matching them within seconds. That's the title. We will talk about storing fast-moving data using in-memory storage, storing volatile data for efficient lookups, real-time decision making using sub-millisecond lookups, and some recommended practices.
But before we delve into the presentation, into the solution, let us spend a few minutes on who we are. We are Careem, founded in 2012. The word Careem comes from the Arabic word Karim, which means generosity: giving back to the community or to your circle. We were founded in 2012, born and bred in the Middle East; we are the first unicorn of the region, and we call ourselves a unicamel instead of a unicorn. Why do we exist? We exist to simplify and improve the lives of people and to build an awesome organization that inspires. At the moment we are simplifying lives with our ride-hailing platform; in the future we will delve into further verticals around the same problem.
Now before we go further, let's talk about our region. Our region is slightly different from most of the mature markets in the western side of the world. The infrastructure is virtually non-existent, and whatever there is is poor, unclean, and unsafe. Most importantly, women form about 29% of our demographic, yet the majority of these women are unable to work because of the poor infrastructure. Since Careem, we have seen an uptick in the number of women who are able to work by using our ride-hailing platform, so we have, in that respect, changed the region as well.
Before we go deeper into the solution, I would like to set the context of some of the problems, the realities that we face, because those realities form the basis of some of the decisions that we took. Ground conditions: as I said, poor infrastructure; people are stuck in traffic for hours; the region is populous and security-challenged in some of the cities. Estimating ETAs is very challenging because intersections are quite far apart, and due to the high driving speeds in some cities, if a captain, a driver, misses an intersection, then the ETA to reach the destination increases exponentially. Social norms: our region is a very traditional region and still has a lot of social norms; in some parts of the region women cannot drive yet, and in some parts women cannot share a ride with a man. All these conditions together put a great strain on, and limit, our solution space, and we have to tackle them to deliver a compelling experience to our region.
Basics: we call the drivers, the people who form the supply part of our system, captains. Before we go deeper: the word driver has a very negative connotation in our society, in our region particularly, and we wanted to change that. We don't want them to be just drivers or chauffeurs, but rather owners of their own destiny, who build and work with us to change the region. So we call them captains. Who is a captain? A captain is someone who is in command. All of our drivers are captains, and that's what we refer to them by in the later slides as well. Our captains are the essence of whatever we do; they are the face, they are the people behind the wheel, driving our mission to improve the lives of people.
Now that we have set the notion of ground conditions and what we call captains, what is a marketplace? It's a conduit where we match customers with captains, supply with demand. A customer is someone who wants to travel from point A to point B or further, and a captain is someone who has a vehicle and wants to drive to earn money. A vehicle is not restricted to being a four-wheel car; it can be a bus, a bicycle, or a motorbike as well. So a marketplace is the conduit where we match supply with demand, and we need to make real-time decisions because of the real-time nature of our marketplace. Customers these days will not wait long on any app; they will change their decision, they can grab a taxi or, even worse, move to the competition.
Our marketplace has three important characteristics that we needed to solve for. Reliability: this whole ride-hailing platform, our business, is about reliability, and reliability has different meanings for different actors of the system. For one important actor, the customer, it's the trust that they will be able to get a car any time they need one, any time they want to go anywhere, because if they have that trust they will be return customers and keep coming back to your platform. For captains, it's the trust that they will be able to use our system to make a living; that's the most basic trust we want to instill in our system. The second important aspect is match quality. We call a match of the highest quality when we can match a customer with a captain who can reach the customer in the least amount of time, that is the ETA, while also spending the least amount of fuel, so that the captain does not spend his own cash while driving to pick up a customer in order to make money. And tracking: as I said previously, our region is a bit security-challenged, so we need to provide the ability for our customers to track their relatives or loved ones, and to track them at a very granular level. That's where tracking comes in and forms a very important aspect of our system, and we want to be able to serve our customers by pinging as often as possible.
Now, if you remember a basic ground condition, intersections are quite far apart; so if your captains are pinging far apart in time, the last known location could be very far from his or her current location, and that ties into the tracking aspect as well. That said, when we started off we had humble captains; the region is not super rich, so our captains did not have the extra cash to buy expensive smartphone devices like iPhones or high-end Samsungs, but rather basic Chinese handsets that served them very well. These devices themselves form a very important part of the equation, because low-end devices generally sacrifice GPS quality: they have chips that are not very accurate and add noise to the GPS readings, which makes metering and tracking a little difficult as well. Plus, our region is always in the neighborhood of a desert, so summer temperatures range from 46 to 56 degrees centigrade. What happens is that even though the captains are driving air-conditioned cars, metal-bodied phones heat up, and this impacts GPS and data connectivity and also the CPU cycles. So we cannot have a very CPU-intensive app running on the captains' phones, because it will drain more battery to run the CPU and further heat up the device itself.
Then there is limited bandwidth. In a ride-hailing platform, if a captain is sending their locations to the system quite often, it will be a very chatty app, and chatty apps need a lot of bandwidth, they consume a lot of megabytes. But bandwidth in our region is very expensive: even today, 1 GB of bandwidth will cost you somewhere around 75 to 100 dirhams per month, which is a lot of money for humble captains. So we wanted to ensure that we use the least amount of bandwidth while still delivering a compelling experience. In some cases we figured out that a lot of the bandwidth was being consumed by the HTTP handshakes themselves, so tackling that problem was also important. Plus, we wanted to make sure that the payloads were small enough, without sacrificing the data that we are sending. The only known solution for reducing the payload was compression, but even then you need to ensure that the compression algorithm is not eating up all your CPU; there is no point in compression if the device cannot handle the extra CPU work. These were some of the issues that we had when we started off.
Now, as you build a solution to this ride-hailing problem, you always want to be able to measure the quality of your solution, of what you have delivered. We thought hard about how we could measure the quality of the solution, and these four metrics form the core of it. First, ETA. ETA has two dimensions: one is the promised ETA, what we show the customer, "hey, we can get you a car in two minutes". But ensuring that the customer actually gets that same service is another matter, so we call the actual time it took for the captain to reach the customer the actual ETA. In a highly functional, very transparent, well-performing system, the delta between the promised and the actual ETA needs to be very small. It doesn't matter that the ETA is nine minutes at the time of booking, but if in reality it takes fifty minutes, then the experience gets bad for the customer and you may never see them again; whereas if, for a nine-minute promise, the captain reaches in ten minutes, that's fine, things happen on the road. So the delta between the two needs to be very small.
Second, the time to make the match. As I said, customers want to spend a very small amount of time on the app itself, and we want to improve lives by helping them get a very quick match. So we need to ensure that the time it takes for the system to find the best captain, match them, and send out the notification that "this is your captain, on its way" is very small. When we started off it was quite high, but as we went on we realized this time needed to be very small, preferably in milliseconds, although seconds was also good enough previously. Third, the age of the captain's location. When you are tracking a captain, it's important that you know, as close to real time as possible, the most recent location of that captain, not a location from three or four minutes ago. In some parts of our region driving speeds can be up to 120 kilometers per hour, and within three minutes a captain could be five to six kilometers away from the previous location, which makes the whole purpose of tracking redundant.
Fourth, the ratio of requests matched. This actually gives you an insight into how well your system is doing. It will also depend on whether you have enough supply; if demand is greater than supply, this metric will also be bad, but let's assume that you have enough supply to meet your demand. If that condition is true, this ratio gives you a very good numerical insight into how well your algorithm is performing. We cannot fulfill 100% of the requests; that's nirvana, that's what we all want to achieve, but it's not possible. We want to be in the range of 70 to 80 percent: can we match 70% or 80% of the requests and convert them into rides and revenue?
Condensing all of the previous slides into four basic points: we want to find the best captain within the minimum amount of time; we want the ability to provide an upfront ETA, which we call the promised ETA; we want the lowest possible delta between the promised and the actual ETA; and we want the ability to look up a captain's location and status for tracking purposes.
Let's go into take one. When you are starting off, and we love simplicity, you build a very simple solution: take MySQL, build the data structures, the tables, and start sending everything over HTTP and HTTPS. Now, this worked, but given the metrics you will see as we go on, it did not deliver a compelling experience, and we will go through the reasons as well. Before we get to the reasons, it's important to state some of the scale that we had at the time take one was built.
Now, when the system went live, it worked a lot of the time, but performance was a big factor. Performance was hurt primarily because of deadlocks. We were fresh out of college, and we built a highly normalized system where each table had a foreign key to its parent. Even if a child table that has a reference to a parent is being updated in separate transactions, it can still run into a deadlock, because it will wait for the other transaction to complete; that was actually a bug with MySQL at that time. When we started off we had MySQL 5.6, which had no usable geospatial support, so finding nearby captains was not trivial, and we did it by using the Haversine formula, which calculates the spherical distance between two coordinates.
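As a rough illustration of that calculation (a minimal sketch in Java, not Careem's production code), the Haversine great-circle distance between two coordinates looks like this:

```java
// Minimal Haversine sketch: great-circle distance between two lat/lon points in kilometers.
public final class Haversine {
    private static final double EARTH_RADIUS_KM = 6371.0;

    public static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * c;
    }
}
```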
We wanted to ping as often as possible, but given the issues that we had, we had to reduce the frequency to one ping per captain every sixty seconds. All of this affected our ETAs and customer experience, and it also had a cost impact, because we had to provision faster servers to get around the deadlock issues. This was the performance that we measured at that time: okay, but not excellent. These were the important factors: because of the downward spiral of deadlocks, the application freezing, and downtime, reliability was bad, at about forty percent. We were only able to match forty percent of the requests, which meant we were investing more into the system but getting very few returns in revenue. On the business side, our uptime was about 95 percent at that time.
Now, at Careem we have a culture: when we fail, we always do a retro and learn from it and improve our systems, rather than blaming and shaming whoever built the system. So here are a few learnings we had along the way. We learned that we do not like deadlocks, and there were a few strategies for avoiding deadlocks that we considered, all of which had a cost. Some of the strategies were: avoiding the use of foreign keys, but that would lead to data quality problems; acquiring exclusive locks, which is very tricky to get right and has its own set of throughput issues; or changing our logic so that everything goes through a queue and no two updates happen on the same row at the same time, but that would have been a bad design because scaling it out would be difficult; or we could change our MySQL version, move to 5.7, and try things out. We did move to 5.7, and it did help, but it was not helping where there were scaling needs. We needed a mechanism to support microsecond lookups, because if a lookup is slow, then during an algorithm where you are matching multiple captains most of your time is spent doing lookups instead of doing the actual work, which is finding whether this captain is the best for this request or not; and that involves a lot of things: calculating ETAs, building models, running models to decide whether this captain will be able to make the trip in time. The learning I like the most is that a coordinate, a location, is a vector, a multi-dimensional attribute, and storing it in a single column is pretty difficult; indexing it is pretty difficult too. What became clear is that we needed a mechanism for representing a coordinate as a scalar value.
Now, when we were building the next version of the system, we didn't just want to be simple, we wanted to be thoughtful about the system, so we identified some of the pillars that our system should support. Schema-less: we should be able to change the model, the schema, at any time without requiring downtime. In MySQL, if you have a table with a very high number of transactions happening on it, changing or altering the schema without taking downtime is not trivial; it's super difficult to get right. So the new system should be as close to schema-less as possible. Buffered queues: in a ride-hailing platform you have a consistent supply, dedicated captains working on your platform, but you also have transient or part-time captains that come onto your system to earn money. Typically at morning or evening rush hours you will see more and more captains coming online, or coming off their day jobs and wanting to make money on the same ride. This meant the load on the system would double during those times, and we wanted the ability to handle this sudden increase gracefully, so we decided we needed some kind of buffer or queue mechanism, instead of a direct API endpoint, to deal with those spikes.
Next, persistence. We do need persistence: let's say I'm a customer, I will want to go back and look at my trips. But most of that persistence is not needed in real time; it can be done in an offline manner, as a slower process. Dispatching, on the other hand, needs real-time information; it needs information right now, right this second, or very close to it. So that's why we chose Redis as our caching mechanism, queues to buffer the pings that are coming in, and DynamoDB as our long-term persistent storage, written by an offline system in parallel to the real-time storage of pings in Redis. Why Redis? Redis has a unique architecture: it is single-threaded, and it is effectively lock-free in the sense that you don't run into the same issues of multiple updates on the same key causing a deadlock. It has a rich set of data structures, strings and sorted sets, where it takes care of sorting at the database level instead of your application doing the sorting, and it provides a lot of built-in operations to process data very fast. Pipelining is a neat technique where you can use one connection to serve thousands and thousands of requests without waiting for each response.
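As a sketch of what pipelining looks like from Java (using the Jedis client purely for illustration; the host and key names are made up):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class PipelineDemo {
    public static void main(String[] args) {
        // Send many commands over one connection without waiting for each reply.
        try (Jedis jedis = new Jedis("redis-host", 6379)) {   // hypothetical host
            Pipeline p = jedis.pipelined();
            for (int i = 0; i < 10_000; i++) {
                p.set("captain:" + i + ":ping", "lat,lon,timestamp");
            }
            p.sync(); // flush everything and read all responses in one round trip
        }
    }
}
```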
Plus, it also provides a read scale-out strategy by using a primary and replica configuration, where you can scale out reads, and it provides failover support as well. So this is how take two looked: pings coming into an SQS queue, being read by a worker, and being stored in the various Redis clusters. At the same time they were being written to DynamoDB, which had a much more relaxed SLA, where if a ping was persisted a little later it was no problem, whereas a ping coming into the real-time system had to be written in near real time. That's why that part was given more computational power than the offline persistence part. So that was take two at a high level.
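A minimal sketch of what such a worker loop could look like is below; the queue URL, table name, and key layout are invented for illustration, and AWS SDK v1 plus Jedis are assumed rather than taken from the talk:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import redis.clients.jedis.Jedis;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PingWorker {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.example.com/pings";   // hypothetical queue
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        Table pingsTable = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient())
                .getTable("captain_pings");                  // hypothetical table
        ExecutorService offline = Executors.newFixedThreadPool(4);

        try (Jedis redis = new Jedis("redis-host", 6379)) {  // hypothetical host
            while (true) {
                for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
                    String body = m.getBody();                // e.g. "captainId,lat,lon,ts"
                    String captainId = body.split(",")[0];

                    // Real-time path: the latest ping must be visible immediately.
                    redis.set("captain:" + captainId, body);

                    // Offline path: durable history, written asynchronously with a relaxed SLA.
                    offline.submit(() -> pingsTable.putItem(new Item()
                            .withPrimaryKey("captainId", captainId)
                            .withString("ping", body)));

                    sqs.deleteMessage(queueUrl, m.getReceiptHandle());
                }
            }
        }
    }
}
```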
Now let's look at some of the data structures that we had. The first one is a very simple key-value structure: the key is a captain ID and the value is the latest ping that came into the system for that captain. It was easy to look up because the keys were deterministic.
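For instance, a sketch of that lookup with Jedis (the key format here is an assumption, not the exact production schema):

```java
import redis.clients.jedis.Jedis;

class CaptainPings {
    // Latest ping per captain, keyed by captain ID.
    static void store(Jedis jedis, String captainId, String pingJson) {
        jedis.set("captain:" + captainId + ":last_ping", pingJson);
    }

    static String latest(Jedis jedis, String captainId) {
        return jedis.get("captain:" + captainId + ":last_ping"); // O(1) lookup by key
    }
}
```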
This next one is a bit different. To solve for looking up captains near a location, near the request, Redis did not give us that capability out of the box at the time, so we used the sorted set structure a little differently. We defined a key as a composite of a geohash and the product that the captain was serving. Say we had an offering called business cars; we divided our entire geographical region into geohashes. Geohashes have this wonderful property that the fewer characters you use, the more zoomed-out a level you get. We took level 5, which gives you cells a few kilometers across, so you can cover your whole city using not that many geohashes. So we created pickup indexes: a business-cars-plus-geohash index, where all captains that are qualified for business cars and are in that geohash are placed. Within that sorted set, Redis provides an attribute they call the score, a double, and we use that value to store the timestamp of the captain's last ping. Then, when we look up captains, we just say: give us all captains in this key whose score is between the current time minus 30 (or 180) seconds and now, and Redis returns a pre-sorted set of captains to your application. When a captain who was pinging in one geohash moves further along the road, we remove them from the previous sorted set and add them to the sorted set of the geohash they moved into. This really simplified our captain lookups and made them blazingly fast.
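A minimal sketch of that indexing scheme (again with Jedis; the key format and the 30-second window are illustrative assumptions):

```java
import redis.clients.jedis.Jedis;
import java.util.Collection;

class GeohashIndex {
    // Key = product + level-5 geohash; score = Unix timestamp of the captain's last ping.
    static void recordPing(Jedis jedis, String product, String geohash, String captainId, long ts) {
        jedis.zadd("pickup:" + product + ":" + geohash, ts, captainId);
    }

    // When a captain crosses into a new cell, move them between indexes.
    static void moveCell(Jedis jedis, String product, String from, String to, String captainId, long ts) {
        jedis.zrem("pickup:" + product + ":" + from, captainId);
        jedis.zadd("pickup:" + product + ":" + to, ts, captainId);
    }

    // Captains seen in this cell within the last 30 seconds, already sorted by Redis.
    static Collection<String> freshCaptains(Jedis jedis, String product, String geohash, long now) {
        return jedis.zrangeByScore("pickup:" + product + ":" + geohash, now - 30, now);
    }
}
```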
Just as a reference, we create a grid of 3x3 geohashes: the cell itself and all its neighbors. The dot, being the customer request, we figure out all captains in those nine geohashes, the geohash the customer is in plus the neighboring geohashes. When this is done in parallel it gives blazing speed; we were able to get all the eligible captains within a fraction of a second.
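A sketch of that neighbor lookup, using the open-source ch.hsr.geohash library to compute the level-5 cell and its eight neighbors (the library choice and key format are assumptions, not necessarily what Careem used):

```java
import ch.hsr.geohash.GeoHash;
import redis.clients.jedis.Jedis;
import java.util.ArrayList;
import java.util.List;

class NearbyCaptains {
    // All captains pinging recently in the customer's level-5 cell and its 8 neighbors.
    static List<String> find(Jedis jedis, String product, double lat, double lon, long now) {
        GeoHash center = GeoHash.withCharacterPrecision(lat, lon, 5);
        List<GeoHash> cells = new ArrayList<>();
        cells.add(center);
        for (GeoHash neighbor : center.getAdjacent()) {
            cells.add(neighbor);
        }
        List<String> captains = new ArrayList<>();
        for (GeoHash cell : cells) {
            captains.addAll(jedis.zrangeByScore(
                    "pickup:" + product + ":" + cell.toBase32(), now - 30, now));
        }
        return captains;
    }
}
```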
Let's just look at the scale that we had when take two went live: forty-seven cities, sixty-six million customers, around two hundred and fifty thousand captains, and a large number of requests every day. What this actually shows is that we were able to reduce our ETAs and make the delta between the promised and actual ETAs consistent; if your ETAs are consistent, the experience is predictable. It was still okay rather than super excellent, but it was close to very good and delivered a compelling experience to our customers, as a result of using a lock-free architecture where we didn't have to worry about deadlocks. Reliability went up quite a bit, and our uptime also hit that magical ninety-nine point nine nine number. Just as an insight, our New Relic dashboards show how fast our lookups were, to drive home that point. So, as a summary: we were able to increase the frequency of pings that a captain sends from once every minute to four times every minute, that is, a ping every fifteen seconds, which gives a very close to real-time location. We improved our customer ETAs and experience, because we were dispatching captains that were really close to the customer and able to reach the customer in a short amount of time. As a side benefit, because of a ping every fifteen seconds, we were able to track captains at a very granular level. And on top of that, we were able to reduce the time to match from over two minutes to fifteen seconds; that was the biggest win that we had.
Now, going a little bit under the hood: we are a proud Java company, we use Spring Boot as our framework, and we use Elastic Beanstalk for scaling our applications. Looking deeper at Redis, it has different modes. Standalone, where you have just one Redis node; we do not recommend it for use in a distributed or highly scalable system. Another mode is primary-replica: you have one primary node and multiple replicas, and with Sentinel support you get failover in case your master dies. And there is cluster mode, which shards the data that you have across different keys.
Some of the hygiene that we learned or experienced along the way: always have multiple replicas, not just one replica in your cluster. Always configure backups of your data from a replica instead of the master, because taking the backup can stall the node for a few seconds, and you don't want that on the master. For writes, never write directly to an individual primary node's address; connect through an endpoint that follows failover, such as a Sentinel-managed or cloud-provided primary endpoint, so that in case of a failure your application does not suffer downtime. Always set aside a certain percentage of memory for Redis's internal operations, including replication, because Redis is an in-memory replication system. Always scale out your reads instead of reading from the primary: Redis doesn't strictly guarantee it, but in practice delivers near-zero replication lag, so your application can take advantage of that by reading from the replicas and writing to the master, which will reduce the load on the primary quite a bit. Remember that Redis is single-threaded, so no matter how many cores your server has, it only really utilizes one or two cores of your system; that's why it's very important to scale out your reads if you have a high read volume.
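A sketch of routing reads to replicas with Lettuce (assuming Lettuce 5.2+ and an illustrative endpoint; not the talk's actual configuration):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class ReplicaReads {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create();
        StatefulRedisMasterReplicaConnection<String, String> conn = MasterReplica.connect(
                client, StringCodec.UTF8, RedisURI.create("redis://primary-endpoint:6379")); // hypothetical endpoint
        conn.setReadFrom(ReadFrom.REPLICA_PREFERRED); // reads go to replicas when available, writes to the master
        String ping = conn.sync().get("captain:42:last_ping");
        conn.close();
        client.shutdown();
    }
}
```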
In my personal opinion, if you want to take advantage of Redis to the fullest, always choose the right client library. There are a lot of client libraries out there; Jedis and Lettuce in their basic usage are fairly low-level, so move towards cluster-aware setups like clustered Lettuce, or Redisson, which is far more mature than the basic clients and offers good support out of the box, whether you are using it with AWS, any other cloud provider, or your own solution.
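For example, a minimal sketch of a cluster-aware Lettuce client (the seed node address and key are made up):

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class ClusterClientDemo {
    public static void main(String[] args) {
        // The client discovers the rest of the cluster topology from this seed node.
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create("redis://cluster-seed-node:6379")); // hypothetical seed node
        StatefulRedisClusterConnection<String, String> conn = client.connect();
        conn.sync().set("captain:42:last_ping", "lat,lon,ts"); // routed to the right shard by key slot
        conn.close();
        client.shutdown();
    }
}
```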
Now we went even further with scaling, because business, as we say, is booming. We wanted to increase the frequency of pings for the different products we were building; we wanted to enable even better dispatching, dispatching even closer captains, and knowing where a captain is in a really real-time manner. Plus we had more captains on the network, and a growing amount of captains' data for tracking purposes, while the business wanted the same performance as always. We also wanted to ensure that we have a healthy data service, a system where everything is balanced: you are not utilizing it at a hundred percent, but utilizing it such that it always gives you some headroom for a spike in traffic and for sudden scaling. These are some of the numbers that we were now dealing with: 80 cities, 15 million customers, nearly half a million captains, 150 million requests, around 140 million pings every day, and nearly 50 million lookups.
So we went with clustered Redis, the cluster mode that uses sharding. You can configure it to either shard according to your own specification or to automatically distribute keys equally amongst all the shards. We went with clustered Redis, and each primary had its own set of replicas, so each shard is also read-scalable and highly available. The best part about Redis sharding is that it's application agnostic: you don't have to change your application code or your application configuration to take advantage of it. I think that's the only data store I've seen that does that at this moment. This is how take three looked: instead of one single Redis, we distributed the data into different shards, with each shard having its own replicas deployed in different zones, to provide really strong failover support in case a zone goes down or there is any abnormal activity. And this is the same performance slide copied again, because we delivered the same kind of performance. That's all from me, thank you.
[Applause]
Sorry, just before we finish: we have an R&D centre in Berlin. If you want to change people's lives and at the same time solve some amazing and challenging technological problems, come talk to us; we have a booth here as well, and you'll have fun. Thank you.
I've got two questions here; if there are any other questions, please ask. Did you consider any other NoSQL database aside from Redis? Not really, because Redis was providing what we needed; we said that we wanted to store the data in memory, so we went with Redis. We did consider Memcached alongside Redis, but Memcached has a somewhat odd eviction algorithm when it comes to maintaining stale data, so we went with Redis, and not much beyond that.
I've got a second question: do you background check your captains, and do you have any responsibility when problems occur between captains and customers? Yes. As I said, we are in a security-challenged region and the customers we get expect us to deliver that security, so yes, we do background checks. But we don't do them ourselves; we outsource them to law enforcement agencies or to private contractors working closely with law enforcement agencies. Plus we do a thorough background check where we verify a captain's address and everything physically as well, but that's not done by us per se.
Okay, that's the two questions. Does anyone else have a question? We have a couple of minutes left. Thank you very much.