bel8 2 hours ago [-]
It's a start and I welcome competition but I don't think I ever used small cloud models like Haiku 4.5. They are cute but for serious coding they tend to waste your expensive time.
And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot
I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.
If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
SwellJoe 2 minutes ago [-]
I've been doing benchmarking of various models for finding hard security bugs, and my faith in Haiku (and Sonnet, even) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both for finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.
And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.
There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).
fnordpiglet 16 minutes ago [-]
I use larger models to organize work into a topologically sorted task graph and pin smaller models to the tasks depending on their complexity, with a larger model evaluating the work and patching where necessary. This uses Haiku quite often for routine work. As a result I’m able to do multi-hour, highly complex work with superior results and a much lower bill: a parent orchestrator can do a massive amount of work within a single context window by effectively organizing work, reviewing quality, and integrating where needed. I don’t use Haiku directly, but it’s often 30-40% of any major effort’s token use. This improves time to completion as well as cost, and I find Haiku is better at following literal instructions and plans without “second guessing,” while Opus-class models second-guess constantly in their thinking.
As such, Haiku isn’t a waste of my time; it saves enormous amounts of time for me. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end: the dynamics of multi-agent workflows of varying capability are not a lot different from the dynamics of a 1000-engineer organization.
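The task-graph idea above can be sketched roughly as follows. This is a minimal illustration, not the actual orchestrator: the task names, complexity scores, threshold, and model tier labels are all hypothetical, and the only real API used is Python's standard `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (complexity 1-10, set of prerequisite tasks).
TASKS = {
    "design_schema": (8, set()),
    "write_migration": (3, {"design_schema"}),
    "write_handlers": (5, {"design_schema"}),
    "write_tests": (2, {"write_handlers", "write_migration"}),
}


def pick_model(complexity: int, threshold: int = 6) -> str:
    """Pin a model tier to a task; the tier names are placeholders."""
    return "large-model" if complexity >= threshold else "small-model"


def plan(tasks: dict) -> list[tuple[str, str]]:
    """Return (task, model) pairs in an order that respects dependencies."""
    # TopologicalSorter expects a mapping of node -> its predecessors.
    graph = {name: deps for name, (_, deps) in tasks.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, pick_model(tasks[name][0])) for name in order]


if __name__ == "__main__":
    for task, model in plan(TASKS):
        print(f"{task:16} -> {model}")
```

A real setup would presumably have the larger model produce the graph and the complexity estimates; here they are hard-coded so the ordering and pinning logic can be run on its own.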
lukevp 5 minutes ago [-]
Got anything from your orchestrator you could share that’s usable by others? Sounds like how I’d like to work but is difficult to get going from scratch
lambda 1 hours ago [-]
Yeah, seems like this is in the range of Qwen 3.6, Gemma 4, Nemotron 3 Super, and the like. There are a lot of models, including much smaller, cheaper ones (like Qwen 3.6 35B-A3B), that are similarly competitive with Haiku. I can run these on my laptop; I don't need to rent them from Microsoft.
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
vidarh 23 minutes ago [-]
Haiku does quite well if given a detailed plan. That means much more detail than you would otherwise provide, but you can still save over e.g. having Opus or Sonnet do everything by having them expand their initial plans into more specific levels of detail and feed them to Haiku (or similar-level models).
I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
GaryBluto 2 hours ago [-]
Almost exactly the same story here. I've also had little to no refusals from DeepSeek, with its Chinese values meaning substantially less friction when it comes to things like reverse engineering, finding copyrighted files, working with dubiously-sourced source code, et cetera. I don't think I'd go back to Copilot even if they dropped prices by 90%.
nate 2 hours ago [-]
The small stuff has its place. I have this Safari extension and needed a way to quickly title people's chat histories. Haiku is the fast, cheap thing to come up with decent titles for blocks of text. I feel like there's a bunch of those little things lying around that you need a model for. I'm even finding Apple's Foundation Model is super useful for stuff like that. Even summarizing an article. It's like equally awful at doing it, but gets enough done to still be useful as a way to be like "oh yeah, this article is actually worth reading"
seanlinehan 1 hours ago [-]
Small models are super useful. But I'm skeptical of their use for coding in particular, which is what this model is advertised for.
hparadiz 2 hours ago [-]
The $20/month ChatGPT plan that comes with Codex is good value. Even just having premium ChatGPT is nice. I get rate limited regularly but it still lets me do most things.
tedggh 1 hours ago [-]
The $100/month plan is excellent value. I don’t understand how that’s not the default option for all professional developers. If people don’t produce any value writing code, like playing around and experimenting with vibe coding, I understand. But if software development is your actual income, and assuming you live in a wealthy country, $100/month is nothing for a tool like Codex.
hparadiz 1 hours ago [-]
Work pays for my work stuff and I have both claude and codex there. On the personal side I sometimes go days without using it. It's more like my assistant to do annoying terminal shit on my home computer and like personal projects I guess. It's plenty for that.
alkonaut 2 hours ago [-]
Won’t (presumably) all the market actors converge on similar pricing? If OpenAI stopped operating on subsidies and charged the true costs, and their most token-hungry customers were the ones that switched to Anthropic and others, then those providers’ pricing model switch would also be around the corner.
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
stefan_ 2 hours ago [-]
Anthropic & co charge API users much more, not least to demolish the middlemen low-effort plays like Cursor and Copilot. To not own the model is not viable in 2026.
swores 55 minutes ago [-]
Sorry, what do you mean by "To not own the model is not viable in 2026"?
I assume I'm misunderstanding you (likely my fault), because the way I read that is that you're saying nobody should currently be using models owned & hosted by companies like OpenAI and Anthropic, while clearly a huge number of people are using those in 2026 despite not owning them.
verdverm 2 hours ago [-]
I've been having really good results with DeepSeek-v4-flash, qwen-3.6-moe, and the older gemini-3-flash-preview. (recent geminis suck hard)
Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
partiallypro 1 hours ago [-]
> "GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs"
AI is expensive and it has been heavily subsidized. If you think $20/mo flat for Codex/Claude will hold up against a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.
emsign 2 hours ago [-]
I wonder when THEY make it illegal to vote with your wallet.
camelmel 3 hours ago [-]
Huh, according to that model card this is a 137B total parameter model.
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
giancarlostoro 3 hours ago [-]
The takeaway is that this model is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet" competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot; maybe it was part of their deal with OpenAI? Not sure.
mdasen 2 hours ago [-]
Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
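A quick arithmetic check of the sizes being compared, taking the model names in this thread at face value (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B). The helper name is only for illustration.

```python
def percent_smaller(small: float, large: float) -> float:
    """How much smaller `small` is than `large`, as a percentage."""
    return round(100 * (1 - small / large), 1)


# Parameter counts in billions, read off the model names quoted in the thread.
MAI_TOTAL, MAI_ACTIVE = 137, 5
QWEN_TOTAL, QWEN_ACTIVE = 35, 3

if __name__ == "__main__":
    print("total vs total:          ", percent_smaller(QWEN_TOTAL, MAI_TOTAL))    # 74.5
    print("active vs active:        ", percent_smaller(QWEN_ACTIVE, MAI_ACTIVE))  # 40.0
    print("Qwen active vs MAI total:", percent_smaller(QWEN_ACTIVE, MAI_TOTAL))   # 97.8
```

On these numbers the roughly 75% figure matches total parameters, while a figure near 98% only appears when Qwen's active parameters are set against the larger model's total; active-to-active the gap is about 40%.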
stingraycharles 1 hours ago [-]
I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.
minraws 3 hours ago [-]
They did release MAI-Thinking-1 to compete with Sonnet. Totally not sure why that isn't at the top here.
giancarlostoro 3 hours ago [-]
Good question, and I missed that entirely!
kristjansson 2 hours ago [-]
> 137B-A5B
Yeah, not a 5B param model as the earlier title implied!
wetpaws 3 hours ago [-]
[dead]
GaryBluto 2 hours ago [-]
What's with the lack of Microsoft design language on the website? It's painfully obvious they're trying to emulate Anthropic's style here and it looks tacky.
foltik 2 hours ago [-]
Definitely vibed microslop, the giveaway is the broken header and scrolling on mobile.
lanyard-textile 1 hours ago [-]
The broken header is an incredible distraction. I can't believe this slipped through.
shrinks99 29 minutes ago [-]
Brand guidelines and web design pretty much don't exist any more as far as I can tell. Gotta get it out yesterday, and the only way to do that is vibe coding, styling be damned.
Handy-Man 1 hours ago [-]
That's neither Microsoft nor Anthropic design. It's from their acquisition of Inflection AI. Even Copilot mobile app design is basically what was Inflection's design
singhkays 1 hours ago [-]
I've always wondered where Consumer Copilot's design language was from.
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to Serif typography and warmer colors when Mustafa/Microsoft AI segment came on which was completely different from the rest of the keynote. Now it makes sense!
i_have_an_idea 2 hours ago [-]
maybe it was coded by Claude
winfredJa 2 hours ago [-]
i think it is AI generated.
petercooper 49 minutes ago [-]
"It’s not just smarter; it’s leaner"
gedy 44 minutes ago [-]
This is needlessly embarrassing, seems like a small thing, but it makes them look... desperate?
stringfood 2 hours ago [-]
A little too minimalist - only a few hundred words on the entire page!
hmokiguess 3 hours ago [-]
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model then delegate structured tasks to these smaller ones? Would appreciate hearing the opinion of someone who has done and tested both paths.
linuxhansl 3 hours ago [-]
I am using Opus 4.x at work, and these "smaller" (20-80bn, 3-4bn active) models at home.
Unfortunately there is no comparison, yet (IMHO anyway).
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
0123456789ABCDE 2 hours ago [-]
>Is the play to plan/design/architect with a heavier model then delegate structured tasks to these smaller ones?
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
Yes. Divide execution of a change into separate responsibilities. Designate the main chat as the "orchestrator", Opus. You designate a goal, then tell it to grind until it gets there using the following sub-agents in sequence:
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
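The loop above can be sketched as follows. This is only an outline under stated assumptions: `run_agent` is a hypothetical stand-in for whatever call your harness uses to launch a sub-agent, and the per-step token costs are made-up placeholders so the control flow can be run as-is.

```python
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    token_budget: int                      # e.g. 1_000_000 for the whole session
    tokens_used: int = 0
    improvements: list[str] = field(default_factory=list)

    def run_agent(self, role: str, model: str, cost: int) -> str:
        """Hypothetical sub-agent call; a real harness would invoke a model here."""
        self.tokens_used += cost
        return f"{role} done by {model}"

    def run(self) -> int:
        """Repeat execute -> review -> self-improve until the budget is spent."""
        rounds = 0
        while self.tokens_used < self.token_budget:
            self.run_agent("step execution", "sonnet", cost=100_000)
            self.run_agent("review", "opus", cost=20_000)
            # The review step records improvement ideas (to a file in the original).
            self.improvements.append(f"round {rounds}: notes from review")
            self.run_agent("self-improvement", "opus", cost=30_000)
            rounds += 1
        return rounds


if __name__ == "__main__":
    orch = Orchestrator(token_budget=1_000_000)
    print(orch.run(), "rounds,", orch.tokens_used, "tokens")
```

Because the budget is only checked at the top of each round, the last round can overshoot it slightly; a stricter version would check before every sub-agent call.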
veselin 2 hours ago [-]
Claude Code itself spins up a lot of its subagents with Haiku. The model has a low hallucination rate, so it is great for exploration tasks. I guess that will be the best use of this model as well. That is a lot of tokens: many tasks spin up multiple exploration agents before the planning or fixing, which is then just a few tool calls.
killermouse0 3 hours ago [-]
I was wondering the same. I guess it makes sense to use a heavyweight model to make the entire design and split the work so that smaller models (possibly local ones?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness?
yaodub 2 hours ago [-]
[dead]
axi0m 2 hours ago [-]
From my experience, smaller models like Haiku 4.5 have indeed shown very convincing results on specific, scoped tasks (themselves generated by a more capable model such as Opus 4.6). We use this kind of workflow in production to optimize speed, efficiency, and costs.
ojr 3 hours ago [-]
I use Gemini 3 Flash. I've seen the Claude Code setups; people bullish on Anthropic are driving up token usage, but I am able to produce outcomes with a fraction of the money.
hmokiguess 3 hours ago [-]
Do you mind sharing your workflow? What do you mean by a fraction of the money? In my case, personally, I've yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing" as they say, so it's hard to see a scenario in which the plan is expensive for the value I get.
ojr 2 hours ago [-]
I spend around $20 a month on API fees using my own harness, https://slidebits.com/isogen. Nothing too special: I prompt, it produces file changes using grep and vector search, and I can individually choose which files to accept.
For comparison someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up github branches through a vm sandbox just to make a single feature change. One agent to code and the other to review.
dist-epoch 3 hours ago [-]
If you don't hit a limit running Opus, it means you are very much in the loop.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
hmokiguess 3 hours ago [-]
What’s your prompt for this? The way you described it made it seem like there’s a generalizable way I can go about this. I just rely on a testing pipeline instead, so I can’t think of why I would need to proactively find holes where tests haven’t already done that for me.
Marha01 2 hours ago [-]
I use a similar workflow. Here is my refactoring and code quality prompt that I regularly run:
Perform a thorough analysis of the <project_name> project (the code and the documentation).
- Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
- Look for refactoring opportunities and ways to improve code quality and organization.
- Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
- Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
- Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
- Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
- Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
- Brainstorm ideas for improvements of the code and docs.
After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.
hmokiguess 31 minutes ago [-]
Thank you!
dist-epoch 2 hours ago [-]
tests will not find inconsistent naming, duplicate functions, scenarios you have not thought about testing
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.
hmokiguess 31 minutes ago [-]
Noted, thank you! Appreciate it
lanthissa 2 hours ago [-]
I used to use Opus for everything, but that's not an option once you move to a multi-agent system unless you're working on something like high-end research. I could easily spend 3k a day if I was using Opus as just a normal dev.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think Chinese models would work too, but we can't use those atm.
Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator Opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
cush 2 hours ago [-]
Implicitly, yes. A lot of harnesses will invoke small models to do small changes, saving time and tokens.
newusertoday 3 hours ago [-]
plan using opus execute using local
glaslong 2 hours ago [-]
I keep trying to, because I really want to make qwen 3.6 35b work for end implementation of a fleshed out spec (mostly for local data privacy reasons).
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
ebbi 2 hours ago [-]
I use it for smaller changes that I need to make, mainly on UI fixes or some easy logic fixes.
scotty79 2 hours ago [-]
In DeepSWE anything from Anthropic is a whole class lower than what's achievable with gpt-5.5
So by using Opus you are using a "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.
altmanaltman 2 hours ago [-]
I actually find planning/design easier with a smaller model and implementation with a larger one. I'm mostly working manually with the model on planning and design, the decisions are mine, and smaller models are faster. And when there's a clear design/way forward, the bigger models are usually better at understanding the overall context and applying the specific patch they were assigned. I call it the 1-2 punch system, where you throw the first light punch and then the harder punch when it's actually important to hit properly. I know it goes against the standard of throwing the biggest model at design, but in my experience the bigger models try to do TOO MUCH and take a lot of time, which is not good in the design/arch/boilerplate phase.
capten 3 hours ago [-]
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
npn 1 hours ago [-]
from what I understand, it's because unlike the other models, MAI models haven't yet been fine-tuned on the synthetic datasets specifically designed to boost the benchmark scores.
redrove 3 hours ago [-]
It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
necubi 3 hours ago [-]
It's 5B active params in MoE, not 5B total params (total is 137B).
bgirard 2 hours ago [-]
> It’s about bang for buck.
Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used; that's an implementation detail to me. I care about price / performance / latency.
Flere-Imsaho 3 hours ago [-]
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.
That's what I'm betting on anyway.
girvo 21 minutes ago [-]
Step 3.7 Flash on my Asus GB10 based mini pc is incredibly close to that today. I’m very impressed, and that’s without MTP to boost performance
thewebguyd 3 hours ago [-]
That seems to be what Microsoft is betting on also based on what was shown at the BUILD keynote today + that new surface ultra and the surface mini PC with the new Nvidia chip. Nadella really played up local AI as the main use case they have in mind.
search_facility 3 hours ago [-]
MoE basically works that way already; Qwen etc. with low active params (the A-number in the name) allow running inference on big models locally (only active params have to fit into memory)
dist-epoch 3 hours ago [-]
The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".
cwillu 1 hours ago [-]
What is with people reimplementing window scrolling badly?
Thanks! I've changed the top link to the blog post and put the other links in the toptext.
deckar01 2 hours ago [-]
If only they had launched that yesterday I might have avoided Copilot auto model selection using a 9x model, quietly burning my monthly quota in a single afternoon.
arunkant 11 minutes ago [-]
Why do websites still hijack scrolling? It sucks
OsrsNeedsf2P 4 hours ago [-]
So it's trained on the SWE Bench Pro evalset
topsycatt 2 hours ago [-]
That's not accurate. Take a look at the paper to see what it is trained on! And specifically decontamination is called out in A.4
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
Copilot brand is tarnished, so time to bung everything under MAI?
layer8 1 hours ago [-]
Maybe the next Windows update will change This PC back to MAI Computer. ;)
efields 3 hours ago [-]
Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
whalesalad 2 hours ago [-]
some kind of scroll hijack going on for sure, feels terrible on firefox+macos
HDBaseT 12 minutes ago [-]
I instantly close websites which use this weird scroll hijacking and slow animation nonsense.
Let me slide as fast and unrestricted as I want. I do not want to "transition" to the next paragraph.
This trend needs to stop.
mentos 3 hours ago [-]
Shouldn’t the next model focus be on system design rather than code?
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
dist-epoch 3 hours ago [-]
Have you tried system design with LLMs? I find them pretty good at suggesting 5 architectures for a problem and then iterating on the solutions.
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
tosh 3 hours ago [-]
not open weight or at least I did not find anything indicating open weight
ggcr 2 hours ago [-]
:(
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
smcleod 1 hours ago [-]
I don't see the point in comparing yourself to Haiku which is not only useless for coding but also old. No thanks Microsoft.
onlyrealcuzzo 3 hours ago [-]
Gemma 4 26B-A4B scored exceptionally well with 20% fewer params, so this isn't unprecedented.
giancarlostoro 3 hours ago [-]
Mark Zuckerberg must be in crisis. Microsoft releasing models that compete with Claude's models. Meanwhile the only thing anyone knows about Mark's models is that they help you get hacked more easily.
ggcr 2 hours ago [-]
Meta recently launched Muse Spark [1] and they themselves compare against Claude Opus 4.6 Max.
Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.
I have had good results adding Muse Spark's contemplate mode as a roundtabler for complex questions, but you can't turn off their data ingestion for training, so that is a shame.
yuppiepuppie 2 hours ago [-]
Wait… I think he has moltbook IP as well that he can scale up.
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
giancarlostoro 2 hours ago [-]
I don't understand his plan. If I were him I'd either have just gone all in on making RAM, which would become very lucrative, or would have focused on building programming models. They've built some key open source technologies, but it's as if Mark Zuckerberg cannot run anything that isn't a social media company / project.
npn 2 hours ago [-]
I personally do not like Microsoft, but congrats to them on releasing this model.
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
cainxinth 52 minutes ago [-]
Claude Haiku 4.5 results with 60% fewer tokens. Sounds good, but they don't list token costs.
Yeah this website is horrendous to use. What were they thinking?
BadBadJellyBean 3 hours ago [-]
You mean "what was the LLM thinking?"
infraredshift 3 hours ago [-]
[dead]
bguberfain 3 hours ago [-]
It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.
ComputerGuru 3 hours ago [-]
Microsoft has been releasing LLMs for years.
ipsum2 3 hours ago [-]
Sort of. Phi models were just trained on GPT outputs though.
kingstnap 2 hours ago [-]
For those that don't know about this. Phi was announced with a paper called "Textbooks are all you need". What they did was use GPT 3.5 and created synthetic textbook chapters and exercises.
They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).
Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper from Meta AI that does a lot of similar work.
not_a_bot_4sho 2 hours ago [-]
By design. The whole point of Phi is the "textbooks is all you need" theory on curated training data, as opposed to kitchen sinks.
lemonish97 3 hours ago [-]
They were mostly distilled or fine-tuned OAI models.
Havoc 2 hours ago [-]
huh? The granite series isn't distilled
wirybeige 2 hours ago [-]
Granite is IBM
jwitthuhn 3 hours ago [-]
And occasionally un-releasing them like with WizardLM.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
randomsc 1 hours ago [-]
“Build for developers, not benchmarks” is the worst marketing shot I've ever heard
sssilver 40 minutes ago [-]
It claims that, then promptly proceeds to showcase a bunch of benchmarks.
hootz 3 hours ago [-]
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
throwaw12 3 hours ago [-]
> I always prioritize speed over raw intelligence for flash models.
This model might have a perfect speed:
import random
words = ["fast", "tokens"]  # placeholder vocabulary, not from the original
for _ in range(100):
    print(random.choice(words))
OsrsNeedsf2P 2 hours ago [-]
Leave it long enough, and it'll print the works of Shakespeare!
striking 3 hours ago [-]
To be clear about the size of the model: MAI-Code-1-Flash is 137B A5B.
gslepak 3 hours ago [-]
Would be cool if this were an open model.
Computer0 1 hours ago [-]
I went to VSC specifically to avoid the pricing I started experiencing on Cursor. After this change I have no reason to stick with GH Copilot, I'd rather keep buying OR credits.
mat0 16 minutes ago [-]
how long until they rebrand this shit as copilot?
jMyles 3 hours ago [-]
I'd really like to get back to an autocomplete flow, ideally with some shared and optimized context with the relationship with my larger agent models.
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
verdverm 2 hours ago [-]
You aren't wrong, the field is moving to a world where we do less in the code editor, so autocomplete is not needed any more. I've only manually edited code a few times in the last month. Haven't used autocomplete in 6+ months since I left Copilot to build my own agent harness (I'm now mainly using OpenCode)
ilia-a 2 hours ago [-]
I mean they are comparing themselves to Haiku of all things, geez that's not a good start...
Marciplan 3 hours ago [-]
"Build for developers, not benchmarks" Shouldn't that be.. Built?
kylehotchkiss 3 hours ago [-]
"superintelligence team"
Why not assign them to make Windows good :D
LoganDark 3 hours ago [-]
"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.
zb3 3 hours ago [-]
So it's not an open model while not being much better? Meh.
freediddy 3 hours ago [-]
Is 51% good enough to use reliably? There's no world in which I use an AI agent that gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
VygmraMGVl 3 hours ago [-]
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
IanCal 2 hours ago [-]
51% does not mean it randomly gets things wrong half the time.
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
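To make that concrete, here is a minimal sketch of how a 51% aggregate score can hide very uneven per-task reliability. The task categories and counts are invented for illustration and are not taken from any benchmark; the 80% threshold is likewise an arbitrary assumption.

```python
# Hypothetical per-category results: (tasks passed, tasks attempted).
# These numbers are made up to illustrate the point, not measured.
results = {
    "rename/refactor": (45, 50),
    "add unit test": (40, 50),
    "cross-module bug": (12, 50),
    "concurrency fix": (5, 50),
}

total_passed = sum(passed for passed, _ in results.values())
total_attempted = sum(attempted for _, attempted in results.values())
overall = total_passed / total_attempted  # aggregate pass rate across all tasks

# Categories where the model is dependable enough to delegate to (assumed 80% bar).
reliable = {
    name for name, (passed, attempted) in results.items()
    if passed / attempted >= 0.8
}

# Pass rate if you only hand the model the categories it is reliable on.
routed_passed = sum(results[name][0] for name in reliable)
routed_attempted = sum(results[name][1] for name in reliable)
routed_rate = routed_passed / routed_attempted

print(f"overall: {overall:.0%}, routed: {routed_rate:.0%}")
```

With these made-up numbers the aggregate is 51%, but routing only the two dependable categories to the model gives 85%, which is the sense in which predictable failure modes make a mid-scoring model usable.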
vancekai 3 hours ago [-]
[flagged]
ghord 3 hours ago [-]
[dead]
briangao 1 hours ago [-]
[dead]
pzo 2 hours ago [-]
TL;DR: this is just a Claude Haiku alternative; you can probably skip the whole article.
Ozzie-D 3 hours ago [-]
[dead]
fooker 4 hours ago [-]
[flagged]
falcor84 4 hours ago [-]
Please share the script
fooker 3 hours ago [-]
print(ExpectedOutput)
66yatman 3 hours ago [-]
Please share
mattlondon 3 hours ago [-]
Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?
klardotsh 3 hours ago [-]
They're comparing to Haiku, not Opus. Haiku is currently at 4.5.
Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you know how a model performed at the time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the Western trend of locking models up.)
0vermorrow 3 hours ago [-]
Latest Haiku (smallest Anthropic Model) is version 4.5, they haven't released a new version, hence the comparison to that.
As such, Haiku isn’t a waste of my time; it saves me enormous amounts of time. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end - the dynamics of multi-agent workflows of varying capability are not a lot different from the dynamics of a 1000-engineer organization.
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
I assume I'm misunderstanding you (likely my fault), because the way I read that is that you're saying nobody should currently be using models owned & hosted by companies like OpenAI and Anthropic, while clearly a huge number of people are using those in 2026 despite not owning them.
Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
AI is expensive and it has been heavily subsidized. If you think $20/mo flat for Codex/Claude will hold up against a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench Pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench Pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
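For a sense of the size gap behind those two scores, a small sketch, assuming the `137B-A5B` and `35B-A3B` labels follow the usual mixture-of-experts shorthand of total parameters followed by active parameters per token. The `parse_moe_label` helper is hypothetical and only handles labels in exactly that shape.

```python
import re

def parse_moe_label(label: str) -> tuple[float, float]:
    """Split a label like '137B-A5B' into (total, active) billions of parameters."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", label)
    if match is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    return float(match.group(1)), float(match.group(2))

# Labels as quoted in the comment above.
models = {
    "MAI-Code-1-Flash": "137B-A5B",
    "Qwen3.6-35B-A3B": "35B-A3B",
}

sizes = {name: parse_moe_label(label) for name, label in models.items()}

for name, (total, active) in sizes.items():
    print(f"{name}: {total:g}B total, {active:g}B active ({active / total:.1%} of weights per token)")

# How many times more total parameters the Microsoft model carries.
total_ratio = sizes["MAI-Code-1-Flash"][0] / sizes["Qwen3.6-35B-A3B"][0]
print(f"total-parameter ratio: {total_ratio:.1f}x")
```

Under that reading, the Microsoft model holds roughly four times the total weights for a 1.5-point lead on the quoted benchmark, while the active parameter counts (5B vs 3B) are much closer.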
Yeah, not a 5B param model as the earlier title implied!
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to Serif typography and warmer colors when Mustafa/Microsoft AI segment came on which was completely different from the rest of the keynote. Now it makes sense!
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
https://code.claude.com/docs/en/model-config#opusplan-model-...
edit: you can make it work with sonnet for planning, and haiku for execution, or any other combination you fancy to work with.
https://code.claude.com/docs/en/model-config#control-the-mod...
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
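The execute/review/self-improve cycle described above can be sketched roughly as follows. This is only an illustration of the control flow under a token budget: `orchestrate` and the three phase callables are hypothetical stand-ins, not part of Claude Code or any real agent API, and each phase is assumed to take a token cap and return the number of tokens it actually used.

```python
from typing import Callable, List, Tuple

# A phase takes a token cap and returns how many tokens it consumed (assumed contract).
Phase = Callable[[int], int]

def orchestrate(
    execute: Phase,
    review: Phase,
    improve: Phase,
    token_budget: int,
    step_token_cap: int = 100_000,
) -> Tuple[int, List[str]]:
    """Cycle execute -> review -> improve until the session budget is spent."""
    spent = 0
    phases_run: List[str] = []
    phases = (("execute", execute), ("review", review), ("improve", improve))
    while spent < token_budget:
        spent_before = spent
        for name, phase in phases:
            remaining = token_budget - spent
            if remaining <= 0:
                break
            # Never hand a phase more than the per-step cap or what is left overall.
            spent += phase(min(step_token_cap, remaining))
            phases_run.append(name)
        if spent == spent_before:
            break  # no phase consumed anything; avoid spinning forever
    return spent, phases_run

# Stub phases with made-up costs, standing in for the smaller and larger models.
spent, phases_run = orchestrate(
    execute=lambda cap: min(cap, 100_000),
    review=lambda cap: min(cap, 20_000),
    improve=lambda cap: min(cap, 30_000),
    token_budget=300_000,
)
print(spent, phases_run)
```

Capping each phase separately mirrors the stated rationale of keeping every step small enough to follow instructions closely, while the outer budget bounds total cost.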
I also work on a consumer AI application https://apps.apple.com/us/app/slidebits-studio/id1138731130
For comparison someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up github branches through a vm sandbox just to make a single feature change. One agent to code and the other to review.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think Chinese models would work too, but we can't use those atm.
Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator Opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
So by using Opus you are using "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used; that's an implementation detail to me. I care about price / performance / latency.
That's what I'm betting on anyway.
MAI-Thinking-1 - https://news.ycombinator.com/item?id=48374362 - June 2026 (64 comments)
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
Let me slide as fast and unrestricted as I want. I do not want to "transition" to the next paragraph.
This trend needs to stop.
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.
[1] https://ai.meta.com/blog/introducing-muse-spark-msl/
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).
Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper from Meta AI that does a lot of similar work.