Word embeddings in 2017: Trends and future directions

stablemap | 158 points

https://github.com/kudkudak/word-embeddings-benchmarks has a pretty nice evaluation of existing embedding methods. Notably missing from this article are GloVe ( https://nlp.stanford.edu/projects/glove/ ) and LexVec ( https://github.com/alexandres/lexvec ), both of which tend to outperform word2vec on both intrinsic and extrinsic tasks. Also of interest are methods which perform retrofitting, i.e. improving already-trained embeddings; morph-fitting (ACL 2017) is a good example. Hashimoto et al. (2016) shed some interesting insight on how embedding methods perform metric recovery. Lots of exciting stuff in this area.
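For reference, an intrinsic word-similarity evaluation with that benchmarks package looks roughly like this. This is a sketch from memory of its README-style usage, so the fetch_* helpers and the evaluate_similarity signature are assumptions; check the repo for the exact API:

    from web.embeddings import fetch_GloVe
    from web.datasets.similarity import fetch_MEN, fetch_WS353
    from web.evaluate import evaluate_similarity

    # Download pretrained GloVe vectors (can take a few minutes the first time)
    w = fetch_GloVe(corpus="wiki-6B", dim=300)

    # Spearman correlation between embedding similarities and human judgements
    for name, data in {"MEN": fetch_MEN(), "WS353": fetch_WS353()}.items():
        print(name, evaluate_similarity(w, data.X, data.y))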

serveboy | 7 years ago

No mention of StarSpace (from Facebook)? It figures, with the rapid pace of innovation these days.

StarSpace can compute 6 types of entity embeddings, of which word embeddings are just one type. It's a whole family of algorithms.

https://github.com/facebookresearch/Starspace/

visarga | 7 years ago

My question is what are they really good for.

I mean king = queen - woman + man

That's the kind of thing we have ontologies for.

This article mentions that word embeddings are useful inside translators, but from the viewpoint of somebody who wants to extract meaning from text, what use is something that doesn't handle polysemy and phrases?
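For what it's worth, the analogy arithmetic itself is a one-liner once you have pretrained vectors loaded, e.g. with gensim. A minimal sketch, assuming you have downloaded the GoogleNews word2vec binary (the path is an assumption; any word2vec-format file works):

    from gensim.models import KeyedVectors

    # Load pretrained word2vec vectors
    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # king ~ queen - woman + man
    print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))

That still assigns a single vector per surface form, though, so it doesn't answer the polysemy point.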

PaulHoule | 7 years ago

I also think that there is still room for improvement for embeddings based on other contexts, as pointed out in the blog post. Another example from this year is leveraging dictionary entries as external context: http://aclweb.org/anthology/D17-1024 (*)

Selecting context words differently is also an option for improvement. Using dependency structures to "filter" the context window seems to work better than "filtering" by subsampling frequent words, which illustrates that there is room to grow. We may see other ways of selecting context words in the future, given how widely embeddings are used as a building block, especially lately with the StarSpace hype advocating the idea of general-purpose, task-agnostic embeddings.
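To make the dependency-based context idea concrete, here is a minimal sketch in the spirit of Levy & Goldberg's dependency-based word2vec: instead of a fixed linear window, each word's contexts are its syntactic neighbours. It uses spaCy purely for illustration; the model name and the exact context labelling are assumptions:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
    doc = nlp("Australian scientist discovers star with telescope")

    for tok in doc:
        # Linear window of +/-2 words, the standard word2vec-style context
        window = [t.text for t in doc[max(0, tok.i - 2):tok.i + 3] if t.i != tok.i]
        # Dependency contexts: children plus the head, labelled with the relation
        dep_ctx = ["{}/{}".format(c.dep_, c.text) for c in tok.children]
        if tok.head is not tok:
            dep_ctx.append("{}-of/{}".format(tok.dep_, tok.head.text))
        print(tok.text, "| window:", window, "| dep:", dep_ctx)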

Or we may also find that the expected improvements are insignificant compared to the gains from the model learned on top of those embeddings for downstream tasks, especially when that model fine-tunes the embeddings for its specific task...

(*) disclaimer: I am a co-author

cgravier | 7 years ago