Discussion &amp; links to various implementations: <a href="https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;MachineLearning&#x2F;comments&#x2F;eg1wr3&#x2F;reformer_the_efficient_transformer_anonymous_et&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;MachineLearning&#x2F;comments&#x2F;eg1wr3&#x2F;ref...</a>

the same with the demos for completing a phrase based on Crime &amp; Punishment... having all recently seen the text completion demos of GPT-2 the Reformer examples are decidedly underwhelmingI mean, I&#x27;m sure it&#x27;s a great new technique and all

I&#x27;d be curious to see how it handles infill. This seems like it has potential.

The demonstration on images is underwhelming at best. It is only marginally better than extending vertically the bottom row of pixels.

I think what they did in the paper is to increase the chances by multiple rounds of hashing, up to 8 times. They show experimentally that 8 times was good enough to be equivalent to a regular transformer.

Wouldn&#x27;t the important embedding be of the token at position A and the token at position B rather than the positions themselves? Since there&#x27;s a lot fewer tokens than positions you&#x27;re fairly likely to get the two tokens to hash together at least a few times.edit: I&#x27;d image the position itself (ie: word number in text rather than token) could be embedded using sin-cosine or by breaking it up into chapter&#x2F;paragraph&#x2F;word. Seems more meaningful and efficient than word number in text. That would prevent this issue on that side of things.

I haven&#x27;t read this paper yet, but to answer your question:because bucket choice is a discrete decision - and discrete decisions are hard to pass gradients through

Why is that any worse than, say, starting with randomly initialized weights in general?

There is no argument for why the LSH would work well, especially at the beginning of training. As the weights are initially random, bucket assignment would be random as well. If predicting at position A requires info from position B, but they are not in the same bucket, there will be no gradient to get the query embedding of A closer to the key embedding of B. The reversible layer trick is neat though.

One neat trick is that you can extend GPT-2 117M&#x27;s context window from 1024 up to 30k on a TPU, since TPUs can allocate up to 300GB of memory for backprop. <a href="https:&#x2F;&#x2F;twitter.com&#x2F;gwern&#x2F;status&#x2F;1218001309435072513" rel="nofollow">https:&#x2F;&#x2F;twitter.com&#x2F;gwern&#x2F;status&#x2F;1218001309435072513</a>It&#x27;s not quite 1M words, but a 30k context window is big enough for e.g. most midi songs.

This seems like a big deal. An asymptotic reduction in the resource explosion created by larger attention windows should allow the development of substantially more complex models here.

They are doing the same thing - using less memory by hashing.The hashing trick in VW hashes multiple same words into one integer, not the same as reformer, but similar to how reformer puts similar vectors together.With VW&#x27;s ngram&#x2F;skipgram features, you get the same kind of effect - similar strings hash into the same hash.So locality sensitive hashing = (is around about the same thing as) ngram&#x2F;skipgram on strings plus hashing trick.

The Google paper&#x27;s hashing has, as best I can see, nothing to do with the Vowpal Wabbit&#x27;s &#x27;hashing trick.&#x27;The VW hashing trick is about hashing your input data (ie: words, fields, etc.) into an array to lower storage requirements and deal with novel data at run time.The google paper is about ordering the intermediate states of the neural network (ie: vectors) while preserving distance. This is done so you can chunk the resulting ordered list and perform computations on individual chunks (and their neighbors).The only thing in common I see is the fact they both use the word hashing.

I&#x27;m going to write this out more clearly, because I&#x27;m still getting downvotes for my correct answer.Why neural networks?
<a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Universal_approximation_theorem" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Universal_approximation_theore...</a>Can polynomials do this? (Yes)
<a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Stone%E2%80%93Weierstrass_theorem" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Stone%E2%80%93Weierstrass_theo...</a>What is transformer and attention? 
<a href="https:&#x2F;&#x2F;pathmind.com&#x2F;wiki&#x2F;attention-mechanism-memory-network" rel="nofollow">https:&#x2F;&#x2F;pathmind.com&#x2F;wiki&#x2F;attention-mechanism-memory-network</a>Attention = Polynomial (x2,x3 etc.)Polynomial = interaction. VW flag -interaction1 layer transformer = xx. (x^2)2 layer tranformer = xxx. (x^3)3 ... etcWhat is reformer?
Transformer where LSH is applied.One type of LSH is SimHash. ngrams of strings, followed by 32 bit hash.Vowpal Wabbit -n flag for ngrams.vw -interact xxx -n2 -n3 and you get ngrams + 32 bit hash doing SGD over a vector.This vector is equivalent to a 2 layer reformer.Non-linear activation is not needed because polynomials are already nonlinear.So vw + interact + ngrams (almost)= reformer encoder. (if reformer uses SimHash, then they are identical).Transformer&#x2F;Reformer have an advantage, the encoder-decoder can learn from unlabeled data.However, you can get similar results from unlabeled data using preprocessing such as introducing noise to the data, and then treating it as noise&#x2F;non-noise binary classification. (it can even be thought of as reinforcement learning, with the 0-1 labels as the reward using vw&#x27;s contextual bandits functionality. This can then do what GAN&#x27;s do - climb from noise to perfection).

I guess the simpler proof that they are the same thing would be: Do they work the same?

Downvoters, please see<a href="http:&#x2F;&#x2F;matpalm.com&#x2F;resemblance&#x2F;simhash&#x2F;" rel="nofollow">http:&#x2F;&#x2F;matpalm.com&#x2F;resemblance&#x2F;simhash&#x2F;</a><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SimHash" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SimHash</a>Simhash, a type of local sensitive hashing - using hash functions on ngrammed data.That is exactly what Vowpal Wabbit does.

Vowpal Wabbit has been doing this &#x27;hashing trick&#x27; since the 200s.It also the feature interaction, which are the same thing as a layer in transformers (all against all matrix).So it seems like they are still catching up to where John Langford and crew were over a decade ago.And, the vowpal wabbit approach is extremely fast to train because it&#x27;s only doing stochastic gradient descent on a linear function - linear regression. Transformers are much slower to train.EDIT: Downvoters, please see my last leaf to see why they&#x27;re effectively the same. The guy responding here seems unfamiliar with all the functionality of vowpal wabbit.

Eh? My understanding from the article was that long is anything beyond a couple paragraphs, many (though maybe a minority by count) of the Wikipedia pages are much longer than this.

Probably, but it seems like it would run into diminishing returns except on the longest articles, because AFAIK the articles in wikitext are provided alphabetically, and consecutive articles may have little or nothing to do with each other, rendering the very wide window pointless.

I wonder if this could be used for the Wikipedia compression challenge?

How does accuracy compare in Nlp tasks vs XLnet?
If we can have XLnet accuracy and fast inference on a single gpu, that would be revolutionary!

This looks like building blocks from cryptography inspiring ML

Nothing to do with electronics, or with robots in disguise...

The transformer term in AI has been around for a few years now. I agree it can be confusing but the article also defines it by the 5th sentence.

Yeah, like &quot;Reformer, the Efficient Transformer for Machine Learning&quot;.

Would it be reasonable to add something to the tittle so it&#x27;s clear it has nothing to do with electronics? 
Maybe it&#x27;s just me.

The input examples are photographs, how can one take an SVG photograph..?

Any papers &#x2F; blog posts &#x2F; GitHub repos you can recommend to learn about using SVG images for neural networks?

so many smart people and still using fuzzy PNG instead of SVG