Phi-3 Technical Report

varunvummadi | 410 points

Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.
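To make the "distillation via synthetic data" idea concrete, here is a minimal sketch of the data-level version: a big teacher model writes "textbook-like" text and a small student trains on it with the ordinary next-token loss. This is not what the paper spells out in detail; the model names and the prompt below are placeholders.

    # Minimal sketch of distillation via synthetic data: a big teacher writes
    # training text, a small student learns from it with the usual LM loss.
    # Model names and the prompt are placeholders, not what Microsoft used.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "big-teacher-model"    # stand-in for a GPT-4-class model
    student_name = "small-student-model"  # stand-in for a ~4B model

    teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
    student_tok = AutoTokenizer.from_pretrained(student_name)
    student = AutoModelForCausalLM.from_pretrained(student_name)

    # 1) Teacher generates "textbook-like" synthetic text from a seed prompt.
    prompt = "Explain how a hash table handles collisions, with a short example."
    inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = teacher.generate(**inputs, max_new_tokens=256, do_sample=True)
    synthetic_text = teacher_tok.decode(generated[0], skip_special_tokens=True)

    # 2) Student trains on the synthetic text with the standard next-token loss.
    batch = student_tok(synthetic_text, return_tensors="pt", truncation=True)
    outputs = student(**batch, labels=batch["input_ids"])  # causal LM loss
    loss = outputs.loss
    loss.backward()  # in a real run this sits inside an optimizer loop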

modeless | 11 days ago

Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it; for one, these benchmarks are not so hard to game. We'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2

Phi-3-small 7b: 74.9

Phi-3-medium 14b: 78.2

Phi-2 2.7b: 58.8

Mistral 7b: 61.0

Gemma 7b: 62.0

Llama-3-In 8b: 68.0

Mixtral 8x7b: 69.9

GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)

oersted | 11 days ago

This shows the power of synthetic content: 3.3 trillion tokens! This approach can produce a model that is smaller and more efficient than one trained on organic text, and it won't be able to regurgitate NYT articles because it has never seen any of them. This is how copyright infringement claims can be sidestepped.
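On the regurgitation point, the naive way to test it is to check whether long n-grams of model output appear verbatim in a protected corpus. A toy sketch, not from the paper, with made-up strings:

    # Naive sketch of a verbatim-regurgitation check: slide an n-gram window
    # over the model's output and see whether any window appears in a
    # reference corpus. The strings here are placeholders.
    def ngram_overlap(output_text: str, corpus_text: str, n: int = 8) -> float:
        out_tokens = output_text.split()
        if len(out_tokens) < n:
            return 0.0
        windows = [" ".join(out_tokens[i:i + n])
                   for i in range(len(out_tokens) - n + 1)]
        hits = sum(1 for w in windows if w in corpus_text)
        return hits / len(windows)

    # A high ratio would suggest the model reproduces source text verbatim
    # rather than paraphrasing it.
    sample_output = "the quick brown fox jumps over the lazy dog again and again"
    reference = "the quick brown fox jumps over the lazy dog"
    print(ngram_overlap(sample_output, reference, n=5))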

visarga | 11 days ago

They have started putting some of the models on Hugging Face: https://huggingface.co/collections/microsoft/phi-3-6626e15e9...
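If anyone wants to kick the tires, a minimal sketch with transformers; the model id below is taken from that collection, and whether trust_remote_code is still required may change, so check the model card:

    # Minimal sketch of running Phi-3-mini locally via transformers.
    # The instruct variant name and the need for trust_remote_code may change;
    # check the model card on Hugging Face.
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )

    chat = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [{"role": "user", "content": "Explain quaternions in two sentences."}]
    out = chat(messages, max_new_tokens=128, return_full_text=False)
    print(out[0]["generated_text"])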

pkoiralap | 10 days ago

I won't believe it till I try it for myself. Phi-2 was the clear worst of the 20 LLMs we evaluated (it was also the smallest, so that was expected).

But it was slow for its size, generated the longest responses with the most hallucinations, and produced the most empty responses. It was also ranked as having the lowest-quality answers.

mythz | 11 days ago

Tried it: as soon as you ask something outside the head of the training data distribution it starts hallucinating like crazy. This isn't surprising to me as a researcher: you need the associative memories of a larger model to cover the tail with at least something. That said, it'll likely work well on specific narrow tasks once fine-tuned. Just don't expect it to really "beat GPT-3.5" at the general chat use case.

ein0p | 10 days ago

If I were Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.

brcmthrowaway | 11 days ago

Phi-2 was useless for practical purposes except if you want to show your friends that it can write a poem. Llama 3 8B was slightly better but is still in the same category; it's complete trash at coding vs GPT-4. Llama 3 400B "iS OPen SoURce!", but no, you will still need to pay for access, because most people cannot practically afford an A100 and set it up properly.

What I'm trying to say is that user experience is now as key as model smarts, and these models that barely touch GPT-4 cannot beat OpenAI as a whole package right now.

m3kw9 | 11 days ago

Hugging Face Paper Page and Discussion: https://huggingface.co/papers/2404.14219

abidlabs | 11 days ago

Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chatbot?
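Not with Phi-3 yet, but for a narrow-domain informational bot the usual first step is plain retrieval before any fine-tuning: embed the domain documents, pull the top-k matches per query, and stuff them into the prompt of a small instruct model. A rough sketch under those assumptions (the embedding model name and prompt wording are just illustrative):

    # Minimal RAG sketch for a narrow-domain chatbot: embed the docs once,
    # retrieve the closest ones per query, and prepend them to the prompt.
    # The embedding model name and prompt wording are illustrative choices.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "Our support line is open 9am-5pm on weekdays.",
        "Refunds are processed within 14 days of the return being received.",
        "Premium accounts include priority shipping at no extra cost.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
        top = np.argsort(-scores)[:k]
        return [docs[i] for i in top]

    query = "How long do refunds take?"
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # `prompt` would then be fed to Phi-3-mini (or any small instruct model).
    print(prompt)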

blackoil | 11 days ago

This paper broke ArXiv's HTML generator: https://github.com/arXiv/html_feedback/issues/1090

anticensor | 9 days ago

That's a whole lot of Zhangs!

ur-whale | 11 days ago

Hm, roughly 84 authors on one "scientific" paper. I wonder if this says something about (a) the quality of its content, (b) where academic (?) paper publishing is headed, (c) nothing at all, or (d) something else entirely.

smartmic | 11 days ago

I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.
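For context on why a static, public multiple-choice set is easy to game: MMLU-style scoring usually just compares the probability the model assigns to the answer letters A-D after the formatted question, so any leakage of the questions into training data inflates the score. A rough sketch of that scoring (the model id is a placeholder):

    # Rough sketch of MMLU-style multiple-choice scoring: format the question
    # with options A-D and compare the probability the model assigns to each
    # answer letter as the next token.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-small-causal-lm"  # placeholder; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    question = "Which planet is known as the Red Planet?"
    options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}

    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution after "Answer:"

    # Score each option by the logit of the first token of " A", " B", etc.
    scores = {k: logits[tok.encode(" " + k, add_special_tokens=False)[0]].item()
              for k in options}
    print(max(scores, key=scores.get))  # predicted letter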

simonw | 11 days ago

Both previous Phis have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway though.

Havoc | 11 days ago

Fewer tokens than Llama 3 (3.3T vs 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't go into much detail about.

hackerlight | 11 days ago

insane

maximsicora | 11 days ago