Mining Synonyms

Lester Litchfield
4 min read · Mar 6, 2018

I was recently asked to look into synonyms as part of the research we’re doing on search widening for product search in an ecommerce context.

The intuition is that when you search our site for “campervan”, chances are you’re also willing to look at “RVs” and “mobile homes”. Now, I’m a big fan of word2vec-style word embeddings, so I thought, easy, they’ll be close in the embedding space. “Easy,” I said.

Well, I was wrong. It’s hard.

Why? Well, the first problem I encountered was that words with the same parent (co-hyponyms) are also close, especially members of the same set. Colours are the clearest example: blue and red are very close in most word embedding spaces, but they’re terrible synonyms for product search!

Even worse, antonyms are often very close, e.g. “wrong” and “right”. Eek.
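
You can see the problem for yourself with a quick check. Here’s a minimal sketch using gensim; the saved model name is a placeholder, and the specific neighbours you get will of course depend on your own corpus:

```python
# Minimal sketch: inspect nearest neighbours to see co-hyponyms and antonyms
# ranking highly. "vectors.kv" is a placeholder for a saved gensim model.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("vectors.kv")

# Neighbours of "blue" are typically dominated by other colours (co-hyponyms).
print(wv.most_similar("blue", topn=10))

# Antonym pairs also tend to score uncomfortably high on cosine similarity.
print(wv.similarity("wrong", "right"))
```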

After doing quite a bit of research, my favourite paper was this one:

A Minimally Supervised Approach for Synonym Extraction with Word Embeddings

In short, we can use relative closeness as an improvement on plain closeness. The intuition here is that within a word’s top-n nearest neighbours in the embedding space, synonyms are proportionately closer than the other items. When comparing embeddings, we usually care about the angle between vectors rather than their magnitudes, and cosine similarity is the most common measure. The crux of the above paper is:

relative cosine similarity = cosine similarity / sum of the top-n cosine similarities

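Here’s a minimal sketch of that calculation with gensim. The saved model name, the query word and topn=10 are placeholder assumptions, not the exact setup we used:

```python
# Minimal sketch of relative cosine similarity (rcs) over a word2vec space.
from gensim.models import KeyedVectors

def relative_cosine_similarity(wv, w1, w2, topn=10):
    # rcs(w1, w2) = cos(w1, w2) / sum of cosines from w1 to its top-n neighbours
    top_sum = sum(sim for _, sim in wv.most_similar(w1, topn=topn))
    return wv.similarity(w1, w2) / top_sum

wv = KeyedVectors.load("vectors.kv")  # placeholder path

# Score the top-10 neighbours of a query term; pairs whose rcs is well above
# 1/topn are disproportionately close and make better synonym candidates.
candidates = [(w, relative_cosine_similarity(wv, "campervan", w))
              for w, _ in wv.most_similar("campervan", topn=10)]
for word, score in sorted(candidates, key=lambda t: t[1], reverse=True):
    print(word, f"{score:.3f}")
```
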
Here’s a random slice of some of the synonym candidates I found using this technique — the total list is thousands long:

gas, lpg
lease, rent
rent, rentals
mitsi, mitsibishi
laptop, notebook
garden, gardening
caravan, rv
bracket, mount
gun, pistol
art, artwork
pants, trousers
st, street
volkswagen, vw
volkswagon, vw
clothes, clothing
lawnmower, mowers
fridge, refrigerator
big, huge
fire, fireplace
360, xbox360
super, ultra
computer, desktop
stationwagon, wagon
houses, properties

Not bad! However, a lot of our problems came from users searching for model numbers and names that could be either one or two words long. For this, I decided to use a simpler approach.

First I used NLTK’s collocations. This is a great function that finds likely two-word entities such as ‘free form’ or ‘free range’.

My data set for this problem was around 15M search terms (along with the search terms that followed them, which we had used earlier for word substitution).
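
As a rough sketch, finding those collocations looks something like this. The frequency filter and the PMI measure are my assumptions about reasonable choices, not the exact settings we used:

```python
# Minimal sketch: find likely two-word entities in the search queries with NLTK.
# `search_terms` stands in for the ~15M raw query strings described above.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokenized_queries = [query.lower().split() for query in search_terms]

finder = BigramCollocationFinder.from_documents(tokenized_queries)
finder.apply_freq_filter(20)  # drop rare pairs; the exact cut-off is a guess

# Rank word pairs that co-occur far more often than chance, e.g. ("free", "range").
bigram_measures = BigramAssocMeasures()
collocations = finder.nbest(bigram_measures.pmi, 30000)
```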

From this I generated 30,000 likely collocations. I then looked up the frequency of the concatenated version of each collocation and applied a threshold (a sketch of this check follows the list below). This generated a lot of gold:

klr 650 => klr650
gtx 1080ti => gtx1080ti
pay now => paynow
shock absorbers => shockabsorbers
play doh => playdoh
jim beam => jimbeam
pinch weld => pinchweld
mg 5150 => mg5150
off roader => offroader
jaqui e => jaquie
chilly bin => chillybin
tn 1070 => tn1070
p8 lite => p8lite
humming bird => hummingbird
330 ci => 330ci
la mer => lamer
bi focal => bifocal
640 lte => 640lte
over alls => overalls
upside down => upsidedown
crown lynn => crownlynn
et al => etal
ql 550 => ql550
tinker bell => tinkerbell
pre loved => preloved
ae 101 => ae101
fly tying => flytying
wilder people => wilderpeople
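
The concatenation check boils down to something like this sketch. The threshold value is illustrative, and the variable names carry over from the collocation sketch earlier:

```python
# Minimal sketch: keep a collocation only if its concatenated form is also a
# common search term, since that means users genuinely write it both ways.
from collections import Counter

# Frequency of every single-token query term in the logs.
term_freq = Counter(tok for query in search_terms for tok in query.lower().split())

MIN_JOINED_FREQ = 50  # illustrative threshold, not the value used in production

one_or_two_words = []
for w1, w2 in collocations:
    joined = w1 + w2  # e.g. ("play", "doh") -> "playdoh"
    if term_freq[joined] >= MIN_JOINED_FREQ:
        one_or_two_words.append((f"{w1} {w2}", joined))
```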

Basically, these were terms that people are unsure whether to write as one word or two. This was drastically affecting recall in our search results, as a search for one form wasn’t showing results for the other.

And finally, I looked for part word synonyms — i.e. shortenings that occur a lot in New Zealand and Australian slang!

For these I took 3- and 4-character n-grams from all the words in our vocab. I then looked up the frequency of these n-grams and applied a threshold. If an n-gram was common, I compared its distance to the full word in the word2vec embedding space to check whether the two occurred in similar contexts. This also generated a nice list (a sketch of this check follows the sample below). Here’s a small sample:

camo => camoflage
camo => camouflage
carb => carburator
carb => carburetor
carb => carburetors
carb => carburettor
choc => choclate
choc => chocolate
cig => cigarette
coax => coaxial
coil => coilpack
cre => crescent
cres => crescent
cre => cresent
cres => cresent
cyl => cylinder
cyl => cylinders
dif => differential
diff => differential
draw => drawer
draw => drawers
draw => drawes
elec => elecric
elec => electic
elec => electri
elec => electric
euro => european
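
Here’s roughly what that check looks like as code. It’s a sketch under a few assumptions: I only take leading n-grams (which matches the examples above), the thresholds are illustrative, and `vocab`, `term_freq` and `wv` carry over from the earlier sketches:

```python
# Minimal sketch: find slang-style shortenings by checking that a short
# character n-gram of a vocabulary word is (a) a frequent term itself
# and (b) close to the full word in the word2vec space.
MIN_NGRAM_FREQ = 100   # illustrative threshold
MIN_SIMILARITY = 0.5   # illustrative cosine-similarity cut-off

shortenings = []
for word in vocab:
    for n in (3, 4):
        gram = word[:n]  # leading n-gram, e.g. "carb" from "carburettor"
        if gram == word or term_freq[gram] < MIN_NGRAM_FREQ:
            continue  # skip whole words and short forms nobody actually uses
        if gram in wv and word in wv and wv.similarity(gram, word) >= MIN_SIMILARITY:
            shortenings.append((gram, word))
```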

Closing thoughts

From a modelling perspective, it’d be nice to think that word embeddings captured synonymy in angular distance, but that doesn’t appear to be the case as yet. It would be interesting to tweak the word2vec process to capture these relationships. I’ve seen research where people use patterns such as “either x or y” or “y or x” to try to capture lexical relationships, but this too seems muddy to me. If you have thoughts, feel free to share below!
