Language as an Interface

Spencer Kelley

Recorded at GOTO 2016



Hi guys, thanks Brian. There we go. Yeah, so conferences like these are at their very best when a senior engineer who's had ten years of experience in some undocumented field is able to take the hour and talk about the things they've learned, and how hard it is. This is definitely not that: I'm a crappy web developer with no experience in linguistics or anything similar, who's had an unreasonable level of success with this library I built as a sort of side project. It's called nlp_compromise. The idea is that it's a natural language toolkit, sort of similar to what's out there already, except it's one JavaScript file. At the end of the day you can npm install it, and it's around the same size as jQuery compressed. It's not the most accurate natural language processing engine, that's for sure, but it's good enough; hence the name nlp_compromise. And the surprising level of success I've had with this thing has made me think that I've actually stumbled into a hole where this software has not been written yet, and it should be.

So this is the most upvoted comment ever on TechCrunch, from four or five years ago, when Uber was just starting. The commenter says, "I have 30 years of experience in the car-sharing industry," and then gives these great arguments for why Uber is not going to work. The takeaway from this is that sometimes being a non-expert is actually a good thing, because you run straight into the problems that other people knew you were going to run straight into, and sometimes you're able to pull them off.
So I'll talk briefly about how I got into this stuff, how I fell down the rabbit hole of natural language processing. I was working at a startup. We were two weeks away from launching (it's a bit of a joke, because every startup is two weeks from launching, but we really were), and we were ready for user testing. Our main call to action was something like this, where there's a search box and we're expecting the users to type things in. I was the search guy. I imported all the Wikipedia articles into Elasticsearch, and we were expecting things like this, where people would type in something we knew. But what happened when we had users come in? The first person typed in "London in the rain", not a Wikipedia article. The next person typed in "that new gold iPhone", and we didn't have that either. So it was brutal, and it was all my fault, and my boss called me into his office that evening. Guys, I had a suspicion I might get fired, but he's a very nice guy. He said, "Spencer, you know, we have this term 'London in the rain'. We don't have the full match, but we do have the word 'London', so why don't we just chop up each word and search for them as a fallback when we don't have the full thing?"

Now, I knew that some of our terms are two words or three words, like "London, Ontario in the rain" or "Paris Hilton in the rain" and so forth, so it's not quite that easy. What I knew we had to do is an n-gram. If you took Computer Science 101, it's just like a substring, or like a subsequence, but you grab all of the two-word sequences, or the three-word sequences, and so on. The problem is that we were running this on keystroke. So even if somebody types in four words, that's ten HTTP requests per keystroke, and of course they could type a lot more, which threatens to crash our search API, but maybe even their browser if it was too much. So I knew this was a problem, this wasn't going to work, and we had two weeks to get it going.
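Just to make that concrete, here's a minimal sketch of the kind of n-gram expansion I'm describing. This is my illustration of the idea, not code from the actual product:

```js
// Generate every contiguous word sequence (n-gram) of a query.
function ngrams(query) {
  const words = query.trim().split(/\s+/);
  const results = [];
  for (let n = 1; n <= words.length; n++) {
    for (let i = 0; i + n <= words.length; i++) {
      results.push(words.slice(i, i + n).join(' '));
    }
  }
  return results;
}

// A four-word query already yields ten candidate searches:
console.log(ngrams('that new gold iphone').length); // 10
```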
So I did what I think anybody would have done in my situation, or at least a sort of clumsy strategy. First, I made a list of stop words, so like "the", "a", and "in". These are really frequent words, and they're words that we definitely know we don't want to search for, so we ignore them. The second thing I did is in the second column: terms that started with the word "in" or ended with the word "in" or "of". Terms like this are very unlikely to be something we're looking for. An exception would be "Of Mice and Men", but you can imagine there are very few of them. So what I did is I googled: what are "in" and "of"? What is the part of speech? You know, there exists a branch of linguistics. I learned that at least "in" and "of" are called prepositions, and there's a handful of other ones like "at" and "for" and things like this. So by filtering out terms that started or ended in a preposition, we were able to get about half of them gone. And the last column, of course, would be "the rain". When we're already searching for "rain" ("rain" on its own could be a pop album, or the pop star from Korea, or something), "the rain" is redundant when you already have "rain", because "the" and "a" are determiners. So just by applying those three filters, and it was like a small afternoon of JavaScript, I was able to get these ten n-grams down to three, which sort of saved the day, and we were able to go forward with that.
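As a rough sketch of what that afternoon of JavaScript looked like (my reconstruction, with abbreviated toy word lists, not the real code, and reusing the ngrams() function from the sketch above):

```js
// Toy versions of the three filters: stop words, preposition edges,
// and redundant determiners.
const stopWords    = new Set(['the', 'a', 'an', 'in', 'of', 'at', 'for']);
const prepositions = new Set(['in', 'of', 'at', 'for', 'on', 'with']);
const determiners  = new Set(['the', 'a', 'an']);

function keep(term) {
  const words = term.toLowerCase().split(/\s+/);
  const first = words[0];
  const last  = words[words.length - 1];
  if (words.length === 1) return !stopWords.has(first);               // filter 1
  if (prepositions.has(first) || prepositions.has(last)) return false; // filter 2
  if (determiners.has(first)  || determiners.has(last))  return false; // filter 3
  return true;
}

console.log(ngrams('london in the rain').filter(keep));
// ['london', 'rain', 'london in the rain'] -- ten candidates down to three
```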
So, given the success I had there: these are the existing tools, sort of the landscape of natural language processing as it is today. NLTK is probably the most popular one. It's also several gigabytes; you need to clear off room on your hard drive to use it. And the other ones: I'm not critical at all of these things, they're excellent. My favourite one is the Illinois tagger (I mean, we're in Chicago, but it's made at the campus down the road, Urbana-Champaign I think it's called). Anyway, then you have these third-party ones, which sometimes are actually just wrappers for the most common libraries. There's of course a lot of value added with them and so forth, but it's a very big deal: if you're at a hackathon, or you have two weeks before your startup needs to launch, it's going to be tough to put these in your stack. These are very serious products to use. And what's worse is that a lot of them were made, you know, in the 80s, or at least have their history from the 80s, so if you want to get, like, XML out of them, good luck a lot of the time. It's a different culture. I'm not too critical, of course, but I think I found a place where this was actually useful, because remember, in like three lines of JavaScript, the sort of hacky crowbar method, I was able to get my ten n-grams down to three.
So I felt like it could be hacked. I didn't want to sit down and write my own natural language processing engine, but I started to. Because we think of language as a gigantic thing; we spend most of our lives reading, writing, or speaking it. But how big is it really? Like, can it be hacked? This image is the printout of Wikipedia, and currently it's 5.1 million articles. But is that how big language is? Short answer: it's really small, and we're going to see that happen.

So this is Zipf's law. This is George Zipf; he has a German name, but I think he was an American linguist in the 40s. What he did is he looked at the words of all the languages in the world and did a frequency analysis, and found that they all fit exactly the same curve. No matter if it's Arabic or German or English, the top one (it would be "the" in English) is the most frequent word, then you have this sort of fall-off where it's like "chair" and "table", and then you have words that you'll hear once or twice in your life. So that's what happens when we're actually using language. Another thing about Zipf's law is that it also applies to characters, in some sort of freakish coincidence. Like, "e" is the most frequent letter or character in English, and then it falls off really quickly as well. People have looked at this in, like, whale songs and found that Zipfian distribution there as well. Some people talk about it in the search for aliens, as in: how do you detect that a message is communication, or language, and not some natural phenomenon? People want to look for Zipf's law in that. Anyway, it's pretty interesting, but I won't go into it. The most important thing to know is that it's just a very steep curve, so if we draw our cutoff at the right point, we can actually cover, you know, ninety-five percent of English really quickly.
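To see why the steep curve helps, here's a little self-contained sketch (mine, with synthetic Zipf-like counts, nothing from the talk) of finding how many top words you need to cover 95% of a corpus:

```js
// Given word counts sorted most-frequent-first, count how many words
// are needed to cover a target share of all word occurrences.
function wordsNeeded(counts, target = 0.95) {
  const total = counts.reduce((sum, c) => sum + c, 0);
  let running = 0;
  for (let i = 0; i < counts.length; i++) {
    running += counts[i];
    if (running / total >= target) return i + 1;
  }
  return counts.length;
}

// Synthetic Zipf-like counts: frequency proportional to 1/rank^1.5.
// The exponent is made up, but the shape is the point: the curve is
// so steep that the head carries nearly all of the usage.
const counts = Array.from({ length: 50000 }, (_, i) => Math.pow(i + 1, -1.5));
console.log(wordsNeeded(counts)); // a few hundred words, not tens of thousands
```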
Same thing goes for vocabulary. When you talk about how big somebody's vocabulary is, there's a lot of this cheeky sort of attitude. It's like with IQ, how people talked about IQ as some sort of scientific measurement; the same goes for vocabulary size too. Most estimates settle around this 30,000-word mark. It's supposed to be like those lines going above it if you went to Yale or something, but it's not that different. If you take all of the works of Shakespeare and you split them up by words (it's of course a little bit more complicated, because some terms are multiple words and so forth), it comes to around that thirty thousand mark as well. So it's not going to be double that, and it's not going to be half of that. If we give it a little space, say a 50,000-word mark just to make our math easy, that's actually still really small. If you print out 50,000 words into a text file (I did it), how big is it? That's 600 kilobytes. If you have a startup and you have a full-screen background image on your website, that's like two or three megabytes, right? You're talking every word of English that you'll ever speak. If you had a little microphone on your ear from the time you're born to the time you die, all of those words are smaller than one image on a website.
And this is the distribution of words in WordNet. If you don't know WordNet, it's pretty great; it's basically the one interesting project in linguistics that everybody uses. It's like the benchmark. It's a dictionary, but what makes it interesting is that it's a dictionary split by sense. So the word "hot" would appear as, like, sexy, but also as hot as in temperature, and also hot as in electricity going through a wire ("the wire is hot") and so forth. It splits it out like that, and it does all that sort of tricky lumping and splitting that you're not going to want to do yourself if you're doing a project using language. But anyway, this is the distribution of words in WordNet, and the main thing to note is that it's seventy percent nouns. Most of the natural language processing toolkits I looked at (in fact, all of them), when they're trying to recognize a word they don't know, where they didn't see it in their lexicon and they don't recognize it with any of their other rules or tools, they always fall back to a noun. And it makes sense. You know, imagine I say the word "lactobacillus" and you're not a biologist. You don't know what that is, but you're not thinking that you're going to lactobacillus to the store, so it's not a verb. Another thing: you're not going to say that person is really lactobacillus. Most things are nouns. So if you just assume the fallback is a noun, you can chop all of the nouns out and only keep the words you'd otherwise get wrong. So we've gone from six hundred kilobytes to 180 kilobytes.
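A minimal sketch of that fallback idea (my own toy lexicon, not the library's actual data):

```js
// If a word isn't in the lexicon, assume it's a noun. Because ~70% of
// words are nouns, the lexicon only needs to store the non-nouns.
const lexicon = {
  quickly: 'Adverb',
  walk: 'Verb',
  the: 'Determiner',
  // ...nouns are deliberately absent
};

function tagWord(word) {
  return lexicon[word.toLowerCase()] || 'Noun'; // the noun fallback
}

console.log(tagWord('lactobacillus')); // 'Noun'
```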
Can we go any further? This is how words conjugate. Like, we know that if you know the word "tomato" and you know the rules of English grammar, you also know the word "tomatoes", and potentially "tomatoey", if something has a lot of tomatoes. In the software industry we're pretty guilty of abusing this one: "quick" and "quickly", like "fastly" and things like this. In sports journalism and commentary you have a lot of abuse of this: "aggressive", "aggression", "aggressiveness". I'm not a grammar snob, but you see how words sneak between different tenses, and it's usually just the suffix that changes, in English at least. Of course there are exceptions, but you see the verbs tend to move around, and so if you know one of them, you know all of them. I'm not going to talk about verb conjugation in depth, but you can imagine it's just a series of transformations. If you have a really delicate set of regular expressions, you can cover most of them.
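As an illustration of what those suffix transformations look like, here's a hedged sketch, far smaller than any real rule set:

```js
// A few toy pluralization transforms: ordered (pattern, replacement)
// pairs, applied first-match-first. The real list is much longer and
// full of exceptions.
const pluralRules = [
  [/(ch|sh|s|x|z)$/i, '$1es'], // church -> churches
  [/([^aeiou])y$/i, '$1ies'],  // family -> families
  [/o$/i, 'oes'],              // tomato -> tomatoes
  [/$/, 's'],                  // default: just add an s
];

function pluralize(noun) {
  for (const [pattern, replacement] of pluralRules) {
    if (pattern.test(noun)) return noun.replace(pattern, replacement);
  }
  return noun;
}

console.log(pluralize('tomato')); // 'tomatoes'
```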
So, yeah: we've gone from six hundred kilobytes, to 180, to 110 kilobytes. Remember, this is every word you're going to hear in your whole language, in your whole life, in 110 kilobytes. If you're a web developer, you're going to know this graph: the file sizes of the most common libraries, and they're growing as time goes on, of course. But remember how small 110 kilobytes uncompressed is. This is unheard of, small enough to import on almost every website. So you can see how I sort of fell into this, where I started believing in this being a thing we can actually do and use, and that it would be a useful tool.
Okay, so we talked about the lexicon. This is the sequence of how I do part-of-speech tagging in nlp_compromise. The first one is the lexicon: we establish a really big list of words, and if the word happens to be in that list, we know the part of speech. The second thing is suffix rules, which we talked about briefly. If the word ends in "ly", for example, it's really likely to be an adverb. You have exceptions like "fly" or "comply", but most of the words in English that end in "ly" are like "quickly", "immediately", and so forth. So if you have really careful regular expressions, you can actually identify words by their suffix as well. The third level, which we'll talk about as well, is sentence-level, sort of Markov-chain type deals; I'm going to wait until the next slide to talk about it. But anyway, these are the three passes of how nlp_compromise does its part-of-speech tagging, and I get pretty accurate results from it.
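Putting the first two passes together, here's a condensed sketch of that pipeline. This is my simplification, not the library's actual source:

```js
// Passes 1 and 2 of the tagger: lexicon lookup, then suffix rules,
// then the noun fallback. (Pass 3, the sentence-level fix-ups, is
// sketched further below.)
const lexicon = { she: 'Pronoun', could: 'Verb', the: 'Determiner' };

const suffixRules = [
  [/ly$/, 'Adverb'], // quickly, immediately
  [/ed$/, 'Verb'],   // walked, hammered
  [/tion$/, 'Noun'], // aggression
];

function tag(word) {
  const w = word.toLowerCase();
  if (lexicon[w]) return lexicon[w];          // pass 1: the lexicon
  for (const [pattern, pos] of suffixRules) { // pass 2: suffix rules
    if (pattern.test(w)) return pos;
  }
  return 'Noun';                              // fallback: assume a noun
}

console.log(['she', 'walked', 'quickly'].map(tag));
// ['Pronoun', 'Verb', 'Adverb']
```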
This is an actual printout of the file. When I'm talking about a list of regular expressions, I'm not joking; it's quite brutal. I know there are cleaner and better ways to do this, but it works pretty well, it runs in under a millisecond, and it's quite accurate as well. You can see we support things like times, and am/pm, and stuff like that.

So this is the sentence-level rule. When we're talking about "she could walk the walk", "walk" in the first position is a verb and "walk" in the second position is a noun. This is where the sentence-level processing comes in. If we have the sequence verb-determiner-verb, we know that something's up; we know we've misinterpreted the sentence. When I'm talking and you're listening, and vice versa, we're doing this all the time in our heads: as we hear the words coming in, we make a little mistake somewhere, and then as we hear the next words, we change our assumptions. And that's just what I did. You can get a corpus from Wikipedia or anywhere and just count the number of times that you see, say, the verb-determiner-verb situation, and they're very rare.
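Here's roughly what one of those sentence-level corrections could look like. This is a hedged sketch of the idea, not the project's real implementation:

```js
// Pass 3: if a verb follows a determiner, we've probably mis-tagged
// it; 'the walk' makes 'walk' a noun.
function fixSentence(tags) {
  for (let i = 1; i < tags.length; i++) {
    if (tags[i - 1] === 'Determiner' && tags[i] === 'Verb') {
      tags[i] = 'Noun'; // determiner + verb almost never happens
    }
  }
  return tags;
}

// 'she could walk the walk'
console.log(fixSentence(['Pronoun', 'Verb', 'Verb', 'Determiner', 'Verb']));
// -> ['Pronoun', 'Verb', 'Verb', 'Determiner', 'Noun']
```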
Anyway, I thought this was going to be a huge part of part-of-speech tagging in nlp_compromise, but it's really become a very small part of the project. If you look at all the routines that we run when you import a novel or something into nlp_compromise, this method actually gets hit very few times.
That was surprising. So, in short, to summarize: you can make an accurate part-of-speech tagger very easily. It's not to discredit the hard work that's been done by brilliant people over careers and careers in this industry, but it's actually quite easy to get to the mid-80s percent accuracy. There are sometimes these conferences, you know, where all the computational linguists go and fight tooth and nail over somebody getting 92.8 percent accuracy on a test versus somebody getting 93.1, like it's the biggest difference. But remember, with these terms, you may read a sentence and interpret it one way, and I may read it and interpret it another way; there's some subjective amount to it. So I don't place a lot of value in these numbers. There are more important things to chase than accuracy when using language in software, and we'll talk a little bit about them. But it is quite easy, and if you want to roll your own, you can do it in an afternoon. It's fine.
So this is where I've gotten to with this project. It's quite interesting, and I didn't know what I was doing when I started, but I get emails from people with at-retail-bank-dot-com email addresses saying how useful it's been in their work. I'm also getting, you know, job offers and things like this. I've since quit my job at the startup, and I'm more or less doing this sort of work full time now, which is crazy to me, because I really don't know what I'm doing. This is one example I thought was funny, where we negate a sentence like "keep on rockin' in the free world". Is it "doesn't keep on rockin' in the free world"? "Don't keep on"? Or "keep on rockin' in the non-free world"? Like, how do you do it? It was really a joke, but you see on GitHub, I've folded in 14 pull requests; people are using stuff like this. I don't know what they're using it for, but it's quite fun.
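For flavour, negation in the library looks something like this. I'm sketching against a later version of the compromise API, so treat the method names and the exact output as assumptions:

```js
// Negating the verbs in a sentence with compromise.
// (API sketch from a later release of the library.)
const nlp = require('compromise');

const doc = nlp('keep on rockin in the free world');
doc.verbs().toNegative();
console.log(doc.text());
// something like: "don't keep on rockin in the free world"
```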
And, maybe I wasn't sure if I was going to say this, but maybe I will: it's a bit of a comment on the industry of computational linguistics that this doesn't exist already. There are so many people working on this stuff, and the culture is to work really hard toward a deadline and write an academic paper. Your focus is not on having something playful that inspires people to use it in their weekend project; it's just not what people are doing. So stuff like this gets a huge response, at least that's what I've found. Here's another example.
I've tried to make this stuff extensible, like a plugin, like how jQuery did it. I made a really simple one that converts text into the valley-girl type of speaking, where after every copula you add "like", so "it is a cool library" becomes "it is like a cool library", and we're trying to make stuff like this really easy. I don't know, this is a bit of a cheeky example; someone made a Chewbacca one, and stuff like that. But you can see how this stuff is fun to do, and people like doing it. My favourite of all is the one that transforms newspaper headlines and adds "cyber" in front of every noun, so "Cyber-Putin bans the cyber nuclear weapons", or anyway. It's quite fun, but it's also a useful tool, because language should be able to be manipulated like this. Another thing I've been working on is sort of search-and-replace, or matching: a regex-like match for language. So, examples like this, where after every determiner you want to insert a word. There should be a really quick syntax to be able to do that; right now, if you want to do that just with characters, it's not so easy.
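The match syntax being described looks roughly like this in later versions of compromise. Hedged: the selector syntax and method names have evolved since this talk, so take this as an approximation:

```js
// A regex-like match over tags instead of characters: insert a word
// after every determiner. (Sketch against a later compromise API.)
const nlp = require('compromise');

const doc = nlp('the quick fox jumped over the lazy dog');
doc.match('#Determiner').insertAfter('cyber');
console.log(doc.text());
// something like: "the cyber quick fox jumped over the cyber lazy dog"
```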
And the most important slide of all in this talk is this one. It's easy to get carried away when you have an 85 percent accuracy or an 86 percent accuracy, but the most important thing to remember is this slide. These are two sentences: "we gave the monkeys the bananas because they were ripe" and "we gave the monkeys the bananas because they were hungry". One of the supposedly trivial things to do in NLP is called anaphora resolution, or pronoun resolution, where "they" in the second sentence refers to the monkeys, because monkeys are things that can be hungry, and "they" in the first sentence refers to the bananas, because bananas are, you know, fruits, and fruits are things that can be ripe, and so forth. So in order to interpret this, the very simplest thing you could possibly do in a sentence, just figuring out what the pronoun refers to, is impossible until you have all this knowledge: that monkeys are animals, and animals are, you know, marsupials, I don't know. There's a level of knowledge that's necessary to fully understand this. So if anyone tells you they're getting, you know, ninety-five-plus percent accuracy on anything like this, ask them where their knowledge base is, like, when did they invent AI? I'm not sure they'd have it.
they'd have hit so usually when you try
to understand a sentence and sort of
going from the bottom up we gave them a
kisa bananas as a string of characters
so this is usually how we work with
language as computer programmers and we
we do like sort of state machine regular
expressions so ga ve and so forth I
would encourage you when you're working
with language to take a step up I'm not
sure if if I've been too much of a
master manipulator but working with part
of speech tags is much easier than
working with characters because you know
you have all these different situations
One thing that the people involved in nlp_compromise are working really hard on is contractions. Contractions are things like "he isn't", or apostrophe-s, or "n't" and so forth. You don't want to get caught up on stuff like that. If the user is typing into, you know, a white box, they're going to type in a lot of crazy stuff, and you don't want your developers to have to be writing cases to handle this contraction and cases to handle that one. It's really useful if you have a tool like this: it's just one method call, and you have these tags, and then you can handle them yourself.
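In today's compromise, that one method call looks something like this. Again, this is an API sketch from a later version of the library, so treat it as an assumption:

```js
// Expanding contractions so downstream code never has to see them.
const nlp = require('compromise');

const doc = nlp("he isn't the kind of guy who'd do that");
doc.contractions().expand();
console.log(doc.text());
// "he is not the kind of guy who would do that"
```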
The next level, which is something I don't know anything about but find interesting, is called dependency parsing. If you've taken a linguistics class, you've seen people write out, like, a branching structure (a syntax tree, it's called) where the word "give" here applies to a noun, it applies to the monkeys: "we gave", and then everything after that is what they gave. It's hard to do. It's really hard to do. I haven't read enough about it to start doing it in nlp_compromise. I think I should, and I think I'd like to try it at least. But the last level, of course, is what I talked about: the sort of knowledge-engine strategy, where you know what a monkey is, you know what a banana is, you know that giving means, like, the voluntary transfer of property, blah blah. People used to call this level Cyc, or OpenCyc, in the 80s. People used to call this level Freebase, and both of those projects have sort of died in spectacular fashion. A lot of people are calling it Wikidata now. That level, and I roll my eyes when people say that, I don't know what that's going to look like. A lot of people are working on it, and a lot of smart people are working on it; sometimes the solutions are too smart. So I don't know what that's going to look like, but that's of course the interesting part.
So, last slide: the things that we're working on right now, if you want to get involved. When people talk about GitHub projects as a sort of panacea of democracy, well, I've run into a lot of mean maintainers of software projects. My background is from Wikipedia, so I always feel like people should be able to do a lot of different things to get help, so I insist on being the most thoughtful and nicest maintainer on GitHub. Please get involved. One thing that's being discussed right now is the immutability issue that a lot of people are talking about in JavaScript. If you want to change a verb to another tense, for example, and you want to chain that into other things: right now every method returns this, except for some of them, which don't return this, so it's a little bit tough. Some of the things we're talking about are doing, like, event-machine sorts of logging, so you can reverse it and time-travel and stuff like that. There are some neat ideas there, but they will mean a big API refactor.
The other thing is speed and performance. Right now nlp_compromise will understand a sentence in about eight milliseconds on average, which doesn't sound like very much, but if you put a novel in (a few thousand sentences, at eight milliseconds each), it may be like 20 seconds or something like that, which is fine for some situations, but definitely not all of them. I'm certain we can get that down to half with some caching or something like that; if you're interested in that, please contact me. And the other thing, of course, is splitting out to different languages. A lot of these really clever natural language processing engines are able to, from the get-go, do any language you want; you can put Klingon into it and it will understand it. We're definitely not there. I would like to start working on a similar rule-based, small JavaScript library that conjugates verbs in Spanish, Portuguese, French, and so forth. One of the things (and this is another criticism of the computational linguistics field) is that it's really hard to get data. As a web developer, most people here know that if you want to try something strange and funky, you will find that a hundred other people have tried it, and it's on Stack Overflow and stuff like this. There's no Stack Overflow for papers, and emailing people and saying "hey, can I use that data set?", you never get the data. It's just not a culture of sharing. So the hard part of that bullet point is actually getting a good data set of verb conjugations to train on.
The last thing, of course, is what I've said before: I'm trying to make this the d3 of text. One thing that was really helpful with d3.js is the sort of demo scene around it, where it was really easy to make something cool and share it around, and the code was there to copy and paste. So I'm trying to think of ways to do this, and if anyone has a good idea, let me know. I'm thinking of maybe Tonic or something similar, where you can make some sort of cool example and share it. So, that's that. I'm from Toronto, if you ever want to get a beer or coffee. We have a Slack group if you want to join it, and I'm spencermountain on Twitter. Thank you very much. I think we have a few seconds for questions.
Audience: [inaudible question]

Oh yes, so to repeat the question, and it's a good one: is there not some sort of systematic bias in every interpretation? Like, is there a standard for grammatical parsing? That's a good point. I recently got in some trouble with, oh my, don't, um, I won't do it. It's interesting; I think it's an awesome discussion. Okay, yeah, no, that's a very good point, and it's fair, and it would be a criticism of the entire natural language industry. Oh sorry, yeah, exactly. So to repeat the question, or the comment: is the interpretation of meaning, no matter how complicated it is, whether it's a metaphor or anything, something that can be handled with a rule-based system as well? Yes, yes, I think so. In fact, I've worked on a lot of things I'm not allowed to talk about that are very similar. It's something that you will want to do statistically, right? Where, say, in Britain, if you say "hammered", or what is it, "smashed", one of them means tired: "hammered" means tired in Britain, and "hammered" means drunk in America. So there are going to be things like that, and also, you know, class and stuff like that. There's a lot that goes into how someone interprets things, so you need to know a lot about them. But it's like a statistical thing: nine times out of ten, "hammered" means smashed, or drunk, or whatever. So it's a very interesting problem, and I encourage you to work on it. Cool, thank you very much.