OpenAI's New AI Agent, Forge API Launches, FrontierMath Benchmark

Plus, Google releases software code for AlphaFold 3

Key Takeaways

  • OpenAI's "Operator," an AI agent expected to launch in January 2025, aims to execute tasks directly on users' computers, competing with Anthropic and Google in consumer-facing AI tools.

  • Nous Research’s Forge API introduces reasoning enhancements via Monte Carlo Tree Search, Chain of Code, and a model mixture strategy, outperforming competitors in math reasoning benchmarks.

  • Qwen 2.5 demonstrated SOTA coding capabilities, matching GPT-4o’s performance across multiple programming languages and benchmarks.

  • FrontierMath evaluates AI systems on challenging mathematical reasoning, with experts confirming its rigor and AI’s limited ability to solve complex problems.

  • Meta’s Watermark Anything Model redefined watermarking as segmentation, achieving over 85% accuracy in detecting watermarked areas and resilience to manipulations like splicing.

  • CDXFormer improved spatial-temporal context analysis in satellite images using XLSTM, achieving SOTA performance across benchmarks with reduced computational costs.

Got forwarded this newsletter? Subscribe below👇

The Latest AI News

It’s been a little while, but there were plenty of releases last week. OpenAI isn’t showing any signs of slowing down and is discussing the release of an AI agent in January 2025. Sutskever brought up some interesting results from scaling up pre-training as well. We also saw a bunch of cool models from Google with unique applications, including flood forecasting and predicting molecular structures.

The release of Qwen 2.5 was surprising as well with how well it performed on various benchmarks, along with a new math benchmark that received praise from some of the most well-known mathematicians in the world.

OpenAI and Rivals Pivot from Scaling to Advanced Reasoning

We’re seeing a shift in the AI industry’s approach to developing LLMs. Companies like OpenAI are moving away from the "bigger is better" philosophy of simply scaling up models with more data and computing power. Instead, they're exploring more sophisticated techniques that mimic human-like reasoning.

OpenAI's recently released o1 model is a perfect example of this new direction. It uses "test-time compute," a technique that enhances AI models during the inference phase. This method lets models generate and evaluate multiple possibilities in real time, dedicating more processing power to challenging tasks that require complex reasoning.
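
As a rough illustration, the simplest form of test-time compute is best-of-N sampling: draw several candidate answers and keep the one a scorer likes best. The sketch below uses toy stand-ins for both the model and the scorer; nothing here reflects o1's actual internals.

```python
import random

def generate_candidates(prompt, n, seed=0):
    """Stand-in for sampling n candidate answers from a model
    (here: random guesses for the toy task 'what is 7 * 8?')."""
    rng = random.Random(seed)
    return [rng.randint(50, 60) for _ in range(n)]

def score(prompt, answer):
    """Stand-in for a verifier/reward model; here we simply score
    by closeness to the true product."""
    return -abs(answer - 56)

def best_of_n(prompt, n):
    """More samples -> more inference-time compute -> better odds
    that at least one candidate scores highly."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda a: score(prompt, a))
```

Spending more compute (a larger `n`) makes `best_of_n` increasingly likely to surface the correct answer, which is the core intuition behind allocating extra processing to hard problems at inference time.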

Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, mentioned results from scaling up pre-training have plateaued, marking a transition from "the age of scaling" to "the age of wonder and discovery."

Some effects of this shift could include:

  • It may reshape the AI arms race, with companies focusing on developing more efficient reasoning techniques rather than just increasing model size.

  • The demand for computational resources could change, which might affect companies like Nvidia that have dominated the AI chip market.

  • It could lead to more distributed, cloud-based servers for inference.

Other major AI labs, including Anthropic, xAI, and Google DeepMind, are reportedly working on their own versions of these advanced reasoning techniques. This means we might be in a time where innovation in model architecture and training methods may become as crucial as raw computational power.

OpenAI Prepares 'Operator' Launch and Shares LLM Optimization Strategies

Previously, OpenAI released a lightweight library called Swarm. OpenAI has now revealed plans for "Operator," an upcoming AI agent tool set for release as early as January 2025, initially offered as a research preview via the developer API. The tool is designed to execute tasks directly on users' computers.

Operator is expected to compete with Anthropic's "Computer Use" feature and Google's rumored consumer-focused agent, potentially offering general-purpose capabilities in web browsers. As of right now, details on Operator's unique advantages aren’t entirely clear, although it aims to simplify task execution.

OpenAI also released a post detailing how to optimize LLMs, since varying requirements for accuracy, method selection, and production-readiness make optimization difficult and call for a structured approach tailored to specific use cases.

They recommend beginning with prompt engineering for simple tasks, then progressing to RAG for dynamic context and to fine-tuning for consistent behavior and task-specific accuracy.

Afterwards, they suggest establishing an evaluation set to diagnose failures and iteratively refine optimization methods, ensuring tools like RAG or fine-tuning are applied when prompt engineering alone falls short.
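
A minimal version of such an evaluation set might look like the sketch below. The `model` function and the questions are hypothetical stand-ins for whatever pipeline is being optimized; the point is that failures are collected for diagnosis, not just counted.

```python
def model(prompt):
    """Stand-in for the pipeline under test (prompt engineering,
    RAG, or a fine-tuned model)."""
    canned = {"capital of France?": "Paris", "2 + 2?": "4"}
    return canned.get(prompt, "I don't know")

# (question, expected answer) pairs
EVAL_SET = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("author of Hamlet?", "Shakespeare"),
]

def evaluate(model_fn, eval_set):
    """Return accuracy plus the concrete failures, so you can tell
    whether missing context (a job for RAG) or inconsistent behavior
    (a job for fine-tuning) is to blame."""
    failures = [(q, model_fn(q), want)
                for q, want in eval_set if model_fn(q) != want]
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

acc, failures = evaluate(model, EVAL_SET)
```

Inspecting `failures` after each change is what makes the iteration loop principled rather than guesswork.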

On the legal side of things, a New York judge dismissed RawStory's lawsuit against OpenAI, which might set a precedent for future cases involving AI training on copyrighted data. The dismissal was primarily based on technical grounds, though the judge's reasoning touched on broader issues related to AI and copyright.

The court found that authors aren’t plagiarized during AI training because the process involves a broad set of data and doesn’t result in verbatim copying.

Forge API's Reasoning Power and Copilot Arena's Coding Leaderboard

Hermes 3 70B is able to perform well on various reasoning benchmarks. (Source)

Nous Research is launching two new projects: the Forge Reasoning API Beta and Nous Chat. Nous Chat is a simple chat platform featuring the Hermes 3 70B language model, which is available for free.

The Forge Reasoning API Beta is being released to a select group of users, focusing on testing the architecture of their reasoning system. It allows users to enhance any popular model with a code interpreter and advanced reasoning capabilities.

Moreover, the API supports multiple models (Hermes 3, Claude Sonnet 3.5, Gemini, GPT-4) and allows users to combine models for enhanced output diversity.

Here’s what’s involved in terms of reasoning layer architectures:

  • MCTS (Monte Carlo Tree Search): Iteratively builds decision trees for planning problems through selection, expansion, simulation, and backpropagation. 

  • CoC (Chain of Code): Connects reasoning steps to a code interpreter, improving code and math capabilities. 

  • MoA (Mixture of Agents): Allows multiple models to collaborate on a query, providing more complete and diverse outputs.
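
To make the first of these concrete, here is a minimal MCTS sketch on a toy problem (finding a hidden 4-bit path). It runs the four phases named above, but it is only an illustration of the algorithm, not Forge's actual implementation.

```python
import math, random

TARGET = (1, 0, 1, 1)          # hidden "solution path" for the toy problem
DEPTH = len(TARGET)

def reward(path):
    """Fraction of moves matching the hidden target."""
    return sum(a == b for a, b in zip(path, TARGET)) / DEPTH

class Node:
    def __init__(self, path):
        self.path, self.children = path, {}
        self.visits, self.value = 0, 0.0

def ucb(parent, child, c=1.4):
    """Upper Confidence Bound: balances exploitation and exploration."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node, trail = root, [root]
        # 1. Selection: descend via UCB while nodes are fully expanded
        while len(node.path) < DEPTH and len(node.children) == 2:
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
            trail.append(node)
        # 2. Expansion: add one unexplored child
        if len(node.path) < DEPTH:
            move = next(m for m in (0, 1) if m not in node.children)
            child = Node(node.path + (move,))
            node.children[move] = child
            trail.append(child)
            node = child
        # 3. Simulation: random rollout to a terminal state
        path = node.path
        while len(path) < DEPTH:
            path = path + (rng.choice((0, 1)),)
        r = reward(path)
        # 4. Backpropagation: update statistics along the visited trail
        for n in trail:
            n.visits += 1
            n.value += r
    # Best first move = most-visited child of the root
    return max(root.children, key=lambda m: root.children[m].visits)
```

After a few hundred iterations the visit counts concentrate on the branch that actually leads toward high-reward states, which is exactly the planning behavior a reasoning layer wants.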

Nous Research claims Hermes 70B augmented with Forge outperforms larger models from Google, OpenAI, and Anthropic in reasoning benchmarks. Specifically, they mentioned superior performance in the AIME evaluation, which focuses on competition-grade math questions.

Ever wondered what model would take the first place spot for coding? LMArena released a code completions leaderboard using data from the previous month to answer that question.

Nine popular models were evaluated, including open-source, code-specific, and commercial models. Top performers were DeepSeek V2.5 and Claude Sonnet 3.5, with Elo ratings of 1074 and 1053 respectively. The evaluation process included randomization of model pairings and positions, and standardized parameters for fair comparison.
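
For readers unfamiliar with Elo, ratings translate directly into win probabilities. A small sketch of the standard formulas (an illustration of the rating system, not the leaderboard's exact pipeline):

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Update A's rating after one battle: score_a is 1 (win),
    0.5 (tie), or 0 (loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# With the leaderboard's ratings, DeepSeek V2.5 (1074) is expected to
# beat Claude Sonnet 3.5 (1053) only slightly more often than not:
p = elo_expected(1074, 1053)   # about 0.53
```

The 21-point gap between the top two models therefore corresponds to a near coin-flip per battle, which is why randomized pairings and positions matter for a fair comparison.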

Moreover, a free AI coding assistant called Copilot Arena was launched recently, providing paired responses from different state-of-the-art LLMs. It has been downloaded 2,500 times on the VSCode Marketplace, served over 100,000 completions, and accumulated over 10,000 code completion battles. The tool offers paired code completions and inline editing features.

A novel prompting technique was also developed to enable chat models to perform code completions, especially for "fill-in-the-middle" (FiM) tasks. The method involves generating code snippets and post-processing them, rather than forcing models to output in FiM format directly. This approach drastically reduced formatting errors across various models.
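
The post-processing step might look something like the sketch below. The exact heuristics Copilot Arena uses aren't described here, so treat this as an illustrative guess: strip markdown fences from the chat reply, then trim any echoed prefix or suffix so only the middle remains.

```python
def extract_middle(raw, prefix, suffix):
    """Post-process a chat model's reply into a clean FiM insertion."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # drop the opening fence (possibly with a language tag)
        lines = lines[1:]
        # drop the closing fence if present
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines)
    # chat models often re-emit the surrounding code; trim echoed context
    if text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[: len(text) - len(suffix)]
    return text

prefix = "def add(a, b):\n    "
suffix = "\n"
raw = "```python\ndef add(a, b):\n    return a + b\n```"
middle = extract_middle(raw, prefix, suffix)   # "return a + b"
```

Letting the model answer naturally and cleaning up afterwards avoids forcing it into an output format it was never trained on, which is where the formatting-error reduction comes from.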

Tackling Climate, Protein and Math Challenges

Google continues to work on unique AI applications, previously introducing a whale bioacoustics model. Now, they’ve developed a new flood forecasting model with improved reliability and coverage. At a 7-day lead time, the model's reliability is comparable to the current best available nowcasts. It expands coverage to 100 countries with verified data and up to 150 countries with data based on virtual gauges.


The improved model quality and new evaluation approach have expanded coverage. Google Flood Hub now reaches users in over 100 countries, up from 80. This expansion enables Google to provide critical flood information to 700 million people worldwide, up from 460 million previously. 

The model incorporates DeepMind's medium-range global weather forecasting model as input. Training data has been increased from 5,680 gauges to nearly 16,000 gauges. The LSTM-based architecture has been improved to better combine different weather products and increase robustness to missing data.

We saw an open-source implementation of AlphaFold 3 called LIGO a couple months back. Google DeepMind has now released the software code for AlphaFold 3, allowing non-commercial use. This comes six months after initially withholding the code, which drew criticism from scientists (as you’d expect).

AlphaFold 3 can model proteins interacting with other molecules, including potential drugs. The code is now downloadable, but model weights are only available to academics upon request. Several companies have developed AlphaFold 3-inspired models, including Baidu, ByteDance, and Chai Discovery. 

AI in math also saw advancements with FrontierMath - a new benchmark designed to test the limits of AI systems in mathematical reasoning. The benchmark aims to capture a snapshot of contemporary math and evaluate AI's progress towards innovative thinking needed for scientific research.

FrontierMath deals with the saturation issue that other benchmarks face. (Source)

A quick rundown of the new benchmark’s key features:

  • All problems are new and unpublished to prevent data contamination

  • Solutions are automatically verifiable, enabling efficient evaluation

  • Problems are "guessproof" with a low chance of solving without proper reasoning

  • Each problem demands hours of work from expert mathematicians
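
The "automatically verifiable" property works because each problem resolves to a definite, computable answer, so a script can grade submissions with an exact check instead of a human or an LLM judge. A toy analogue (the problem and numbers below are invented and far easier than anything in FrontierMath):

```python
def is_prime(n):
    """Trial-division primality test, fine for small toy values."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def verify(answer):
    """Automatic verifier for the toy problem: 'find the smallest
    prime p > 1000 with p % 4 == 1'. The checker recomputes the
    unique correct value, so no human grading is needed."""
    p = 1001
    while not (is_prime(p) and p % 4 == 1):
        p += 1
    return answer == p

correct = next(n for n in range(1001, 2000) if is_prime(n) and n % 4 == 1)
```

Because the answer space is huge and the check is exact, blind guessing has essentially zero success probability, which is what "guessproof" means in practice.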


The benchmark has even been validated by Fields Medalists like Terence Tao (considered the best mathematician in the world), Timothy Gowers, and Richard Borcherds.

Qwen 2.5 Matches GPT-4o as X Expands Grok Access

Qwen2.5 showed promising results on various coding benchmarks. (Source)

Qwen2.5-Coder-32B-Instruct caught us off-guard last week, since it’s claimed to be the current SOTA open-source code model, matching GPT-4o's coding capabilities. It excels in code generation, repair, and reasoning across multiple programming languages.

The model scored 73.7 on the Aider code repair benchmark, comparable to GPT-4o. It performs well across 40+ programming languages, scoring 65.9 on McEval and 75.2 on MdEval.

The series includes six model sizes: 0.5B, 1.5B, 3B, 7B, 14B, and 32B. Moreover, both Base and Instruct versions are available for each size. As you’d expect, performance scales positively with model size across various benchmarks. The series also outperforms other open-source models across all sizes on core datasets.

In other news, X began to test a free version of Grok. Previously, Grok was limited to premium, paying users of the platform.

Sounds great, but there are some restrictions that free users will need to keep in mind:

  • 10 queries per two hours with the Grok-2 model

  • 20 queries per two hours with the Grok-2 mini model

  • 3 image analysis questions per day

xAI launched Grok-2 in August with image generation capabilities. The model recently gained the ability to understand images. These features were previously exclusive to Premium and Premium+ users, but free users can now try them out as well.

DeepL Voice Translates as BRIA RMBT2.0 Removes Backgrounds

DeepL, a German startup valued at $2 billion, has launched DeepL Voice, a real-time audio translation service. The service can "hear" 13 languages and provide translated captions in 33 languages supported by DeepL Translator. Currently, DeepL Voice outputs translations as text, not audio, focusing on live conversations and video conferencing.

Translations can appear as "mirrors" on smartphones for in-person meetings or as side-by-side transcriptions. For video conferencing, translations appear as subtitles, with Microsoft Teams being the only integrated platform so far. There's no API for the voice product yet, as DeepL is working directly with partners and customers.

In other news, BRIA AI released a new background removal model called RMBG v2.0. It's designed for separating foreground from background across various image categories. The model is trained on a diverse dataset including stock images, e-commerce, gaming, and advertising content.

It aims to rival leading source-available models in accuracy, efficiency, and versatility. The model is particularly suitable for enterprise-scale content creation where content safety, legal compliance, and bias mitigation are crucial.

Advancements in AI Research

Some interesting papers came to light last week, ranging from the CDXFormer model for remote sensing change detection to the critical evaluation of domain-adaptive pretraining for medical applications. Meta FAIR also released a paper looking at the issue of localized image watermarking.

CDXFormer: A New Approach to Remote Sensing Change Detection

Researchers from Zhejiang University tackled the critical challenge of effectively identifying changes in remote sensing images across complex and varied environmental conditions. They recognized that existing methods like Convolutional Neural Networks, Transformers, and Mamba-based models struggle to balance performance and computational efficiency when analyzing spatial-temporal contexts.

To address these limitations, they developed CDXFormer, a new approach that uses Extended Long Short-Term Memory (XLSTM) technology. Their method introduces a scale-specific feature enhancement layer with two key components: a Cross-Temporal Global Perceptron for semantic-accurate deep features and a Cross-Temporal Spatial Refiner for detail-rich shallow features. 

Additionally, they implemented a Cross-Scale Interactive Fusion module to progressively integrate spatial information and global semantics. It achieved SOTA performance across three benchmark datasets, improving F1 scores by 0.22%, 1.08%, and 7.46% compared to previous top-performing methods.

Crucially, the model maintained high efficiency, using only 16.19 million parameters and 3.92 GFLOPs, significantly lower than competing approaches.

Medical Models vs. General-Domain LLMs: New Insights into AI's Effectiveness in Healthcare

Overview of evaluation approach. (Source)

A recent study from Carnegie Mellon University and Mistral AI looks at the effectiveness of domain-adaptive pretraining (DAPT) for medical applications of LLMs and VLMs. They aimed to address the prevailing assumption that DAPT consistently enhances performance on medical tasks, particularly in answering medical licensing exam questions.

To investigate this, they conducted a comprehensive evaluation comparing seven medical LLMs and two medical VLMs against their corresponding general-domain base models. 

They optimized prompts for each model independently and accounted for statistical uncertainty in their comparisons. The results revealed a surprising trend: most medical models didn’t consistently outperform their general-domain counterparts. In fact, medical LLMs only surpassed base models in 12.1% of cases, with ties in 49.8% and underperformance in 38.2%.
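
The paper's exact statistical procedure isn't detailed here, but "accounting for uncertainty" in such comparisons typically means something like a bootstrap confidence interval over per-question results, declaring a tie when the interval contains zero. A sketch with hypothetical data:

```python
import random

def bootstrap_diff_ci(medical, base, iters=5000, seed=0):
    """Percentile bootstrap CI for the accuracy difference between a
    medical model and its general-domain base on the same questions.
    If the interval contains 0, the comparison is a statistical tie."""
    rng = random.Random(seed)
    n = len(medical)
    diffs = []
    for _ in range(iters):
        # resample question indices with replacement (paired resampling)
        idx = [rng.randrange(n) for _ in range(n)]
        d = (sum(medical[i] for i in idx) - sum(base[i] for i in idx)) / n
        diffs.append(d)
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

# hypothetical per-question correctness (1 = right, 0 = wrong)
medical = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] * 10
base    = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1] * 10
lo, hi = bootstrap_diff_ci(medical, base)
```

With equal underlying accuracies, the interval straddles zero, which is how many apparent "wins" collapse into the 49.8% of ties reported above.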

These findings suggest SOTA general-domain models may already possess strong medical knowledge and reasoning capabilities when prompted appropriately. 

MIT Study Provides Causal Evidence of AI's Impact on Scientific Discovery and Innovation

A study from MIT provides the first causal evidence of AI's impact on real-world R&D, showing significant boosts to scientific discovery and product innovation. The research exploits the randomized introduction of an AI tool for materials discovery to over 1,000 scientists in a large U.S. firm's R&D lab.

The study addresses key questions about AI's role in innovation, including its effects on the pace and direction of scientific breakthroughs, as well as its impact on scientists themselves. To investigate these issues, the researchers analyzed detailed data on each stage of R&D, from initial material discovery to patent filings and product prototypes.

Key findings revealed AI-assisted researchers discover 44% more materials, leading to a 39% increase in patent filings and a 17% rise in downstream product innovation. 

Interestingly, the technology had strikingly disparate effects across researchers. While top scientists nearly doubled their output, the bottom third saw little benefit. The study found AI automated 57% of "idea-generation" tasks, shifting researchers' focus to evaluating AI-suggested materials. Top scientists leveraged their domain knowledge to prioritize promising AI suggestions more effectively.

It was reported that 82% of scientists experienced reduced job satisfaction due to decreased creativity and skill underutilization, so there are definitely some challenges that still need to be addressed in workforce adaptation to AI.

Meta FAIR Introduces WAM: Redefining Watermarking as a Segmentation Task

Meta FAIR’s paper addresses the challenge of localized image watermarking, which traditional methods struggle to handle effectively. They aimed to solve the problem of watermarking specific areas within an image, allowing for multiple distinct watermarks and improved robustness against image manipulations like splicing and inpainting.

To tackle this issue, the researchers introduce the Watermark Anything Model (WAM), which redefines watermarking as a segmentation task. WAM consists of an embedder that imperceptibly modifies the input image and an extractor that segments the received image into watermarked and non-watermarked areas while recovering hidden messages. 

The model employs a two-stage training process, first focusing on robustness at low resolution without perceptual constraints, then fine-tuning for imperceptibility and multiple watermark handling.

They used deep learning techniques, including a graph neural network architecture for the embedder and a vision transformer-based extractor. They incorporated a Just-Noticeable-Difference (JND) map to modulate watermark intensity and improve imperceptibility. 

The model demonstrates the ability to locate watermarked areas in spliced images and extract distinct 32-bit messages from multiple small regions, even when they occupy as little as 10% of the image surface. Notably, WAM achieves over 85% mIoU in detecting watermarked areas and over 95% bit accuracy when hiding five 32-bit messages in 10% areas of images, even after manipulations like horizontal flipping and contrast adjustment.
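
Both reported metrics are easy to state precisely. The sketch below computes IoU (the per-image quantity that gets averaged into mIoU) and bit accuracy on tiny hypothetical inputs; the masks and message here are made up for illustration.

```python
def iou(pred, truth):
    """Intersection-over-union of two binary masks (flattened lists)."""
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    return inter / union if union else 1.0

def bit_accuracy(decoded, message):
    """Fraction of correctly recovered watermark bits."""
    return sum(d == m for d, m in zip(decoded, message)) / len(message)

# hypothetical 4x4 image: the watermark occupies the top-left 2x2 block
truth = [1, 1, 0, 0,
         1, 1, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0]
pred  = [1, 1, 0, 0,
         1, 0, 0, 0,   # one watermarked pixel missed by the extractor
         0, 0, 0, 0,
         0, 0, 0, 0]

message = [1, 0, 1, 1, 0, 0, 1, 0] * 4       # toy 32-bit payload
decoded = message[:-1] + [1 - message[-1]]   # one bit flipped on extraction

score_iou = iou(pred, truth)                 # 3/4
score_bits = bit_accuracy(decoded, message)  # 31/32
```

Framing watermarking as segmentation is what makes an overlap metric like mIoU the natural yardstick, alongside bit accuracy for the recovered messages.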

Conversations We Loved

Anthropic’s CEO had a 5-hour discussion with Lex Fridman last week, giving us some interesting insights into the scaling hypothesis. Meanwhile, another conversation about the evolution of LLMs by Andrew Ng also caught our attention.

Evolution of LLMs and Agentic Workflows

Andrew Ng shared insights on the evolving landscape of LLMs and their increasing optimization for agentic workflows. Ng highlights a significant shift in LLM development, moving beyond consumer-facing question-answering to more complex, iterative processes that enable AI agents to perform sophisticated tasks.

Ng observes that while LLMs have been primarily tuned for direct human interaction, there's a growing trend towards optimizing them for agentic behaviors. This includes capabilities like tool use, function calling, and even computer operation, as demonstrated by Anthropic's recent release. He emphasizes the potential of these advancements to dramatically boost agentic performance in AI applications.

The conversation outlines a three-stage progression in agentic LLM development:

  • Prompting existing LLMs for agentic behaviors

  • Fine-tuning models for specific, high-value applications

  • Major LLM providers integrating agentic capabilities directly into their models

Ng predicts that this trend will lead to significant performance gains in AI agents over the next few years, opening up new possibilities for complex, multi-step AI workflows. 

Dario Amodei Discusses Scaling Hypothesis and Responsible AI Development

In a conversation on the Lex Fridman Podcast, Dario Amodei, CEO of Anthropic, offered valuable insights into the state of AI and his company's approach to responsible AI development.

Amodei delved into the scaling hypothesis, which suggests that increasing the size and computational power of neural network models leads to significant capability growth across diverse tasks. He highlighted how larger models like GPT and CLIP have shown dramatic improvements when scaled up with more data and compute power. 

Amodei believes AI systems matching or exceeding human abilities across domains could be achievable as soon as 2026 or 2027, though he acknowledges uncertainties remain.

A key focus of the discussion was Anthropic's Responsible Scaling Plan (RSP), aimed at mitigating potential risks as AI becomes more powerful. This plan involves testing models for autonomous behavior and potential misuse, with escalating safety precautions as capabilities increase. Amodei emphasized the necessity of regulation in AI to address risks like malicious use and loss of human control.

Amodei shared insights into Anthropic's work on the Claude AI model as well, which is being developed with human values in mind through iterative testing and refinement. 

Frameworks We Love

Some frameworks that caught our attention in the last week include:

  • FinRobot: Framework for equity research that integrates quantitative and qualitative analysis through a multi-agent Chain of Thought system

  • XiYan-SQL: Natural language to SQL framework that improves query generation quality and accuracy

  • VoiceLab: Open-source framework for testing and optimizing voice agents, providing tools for custom metrics, model migration, and prompt testing. 

If you want your framework to be featured here, reply to this email saying hi :)

Money Moving in AI

Writer saw a successful Series C funding round for $200 million, while two key acquisitions took place last week: Anysphere acquired Supermaven and Red Hat acquired Neural Magic. Note that the sums for these acquisitions weren’t revealed.

Writer Raises $200 Million in Series C Funding Round

Writer, an enterprise generative AI platform, raised $200 million in a Series C round at a $1.9 billion valuation. The funding will support product development, including AI agents, customizable guardrails, and no-code tools, cementing its position in enterprise AI. Writer's Palmyra models, tailored for business needs, have attracted clients like Salesforce, Uber, and Intuit, highlighting its success amidst intense competition.

Anysphere Acquires Supermaven

Anysphere, maker of the AI-powered code editor Cursor, has acquired AI coding assistant Supermaven for an undisclosed sum. Supermaven's technology, including its low-latency AI model Babble, will enhance Cursor's upcoming Tab AI version for context-aware, intelligent coding, especially for long sequences. 

The merger aims to combine advanced model capabilities with a seamless editor UI, accelerating product development and maintaining Supermaven’s plugins. 

Red Hat Acquires Neural Magic

Red Hat has acquired Neural Magic (also for an undisclosed sum), a startup focused on optimizing AI models to run efficiently on standard processors and GPUs, enhancing hybrid cloud AI performance. Neural Magic, founded in 2018, offers tools like vLLM for model serving, which Red Hat will integrate into its platforms like OpenShift AI and Red Hat Enterprise Linux AI. 

This acquisition aligns with Red Hat's goal to expand its AI capabilities in flexible, open-source environments.