OpenAI's Swarm, First Open-Source Multimodal MoE Model, TensorWave’s AMD Challenge

Plus, OpenAI’s new benchmark could reshape AI agent development

Last Chance: Claim Free Tickets for the In-Person RetrieveX Conference on Oct 17 in San Francisco.

Come hear from the creators of PyTorch, Albumentations, Meta Chameleon, Kubeflow, and CAFFE, along with leaders from Microsoft, AWS, Bayer, Flagship Pioneering, Cresta, and Omneky, on how to build the best retrieval for AI.

If you're an executive who's considering or working on GenAI projects, fill in the form below for a complimentary conference ticket - hurry, because tickets are limited! Please note that the conference is in-person only.

Date: October 17, 10:30am - 7pm PT
Venue: The Midway, 900 Marin St, San Francisco

Key Takeaways

  • OpenAI released Swarm, a lightweight library for building multi-agent systems that allows dynamic switching between specialized agents during conversations.

  • Aria, the first open-source multimodal native MoE model, was released, showing state-of-the-art performance on various multimodal and language tasks.

  • BioNTech unveiled several AI initiatives, including the Kyber supercomputer and Bayesian Flow Network models for protein sequence generation.

  • OpenAI's MLE-bench, evaluating AI agents on machine learning engineering tasks, found that the best setup achieved bronze medal level in 16.9% of Kaggle competitions, improving to 34.1% with multiple attempts.

  • A study introducing Compositional GSM showed significant disparities in LLM performance between standard and more complex math problems, highlighting gaps in reasoning capabilities.

Got forwarded this newsletter? Subscribe below👇

The Latest AI News

Model releases seemed to slow down last week, but we did get the first open-source, multimodal native MoE model, along with a new library from OpenAI.

Meanwhile, applied AI saw some notable progress, with AI being used to calculate carbon footprints and a batch of updates unveiled at BioNTech’s inaugural AI Day.

OpenAI's New Tool for Flexible and Dynamic AI Agent Interactions

Example of how Swarm works. (Source)

OpenAI released Swarm - a lightweight library for building multi-agent systems. It’s similar in concept to existing frameworks like CrewAI and LangChain that also help with the creation of multi-agent systems. Unexpected, but certainly welcome.

Swarm provides a stateless abstraction to manage interactions between multiple AI agents. It also allows for dynamic switching between specialized agents during conversations. Worth noting that it doesn’t rely on OpenAI's Assistants API, so it offers more flexibility and control.

It lets devs create distinct agents, each with specific roles, instructions, and functions. These agents can interact dynamically based on pre-defined handoff logic, allowing for seamless task switching as conversations or workflows progress.

Since the framework doesn’t maintain internal state between function calls, it lets agents pass control to other agents in real-time based on criteria or conversation flow. The handoff is simple—just return the next agent to engage. 

Under the hood, Swarm runs on the Chat Completions API and converts plain Python functions into tool calls, a lightweight approach that allows for easier integration with various models and custom implementations.

Swarm uses context variables to maintain and update the state throughout multi-agent interactions. 
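To make the agent and handoff mechanics concrete, here’s a minimal sketch following the pattern in Swarm’s README - the triage and refunds agents, and the helper functions, are made up for illustration (running it requires an OpenAI API key):

```python
from swarm import Swarm, Agent

client = Swarm()

def transfer_to_refunds():
    """Returning another Agent from a function triggers the handoff."""
    return refunds_agent

def process_refund(context_variables, order_id: str):
    # context_variables carries shared state across turns and agents.
    user = context_variables.get("user_name", "customer")
    return f"Refund for order {order_id} initiated for {user}."

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right specialist agent.",
    functions=[transfer_to_refunds],
)

refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Help the user process a refund.",
    functions=[process_refund],
)

response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I want a refund for order 1234"}],
    context_variables={"user_name": "Alex"},
)
print(response.messages[-1]["content"])
```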

Aria: The World's First Open-Source Multimodal Native MoE Model Debuts

Aria showed SOTA results on benchmarks like MathVista and DocVQA. (Source)

We saw the release of the first open-source, multimodal native MoE model called Aria by Rhymes AI.

It's pre-trained from scratch on a mixture of multimodal and language data. It’s definitely no slouch as it showed SOTA performance on a wide range of multimodal and language tasks like MMMU and LongVideoBench. It’s adept at following instructions across both multimodal and language inputs, excelling in benchmarks like MIA-Bench and MT-Bench.

Aria processes text, images, video, and code simultaneously - all without needing separate setups for each type. In terms of multimodal capabilities, it has a long context window of 64K tokens. Aria can caption a 256-frame video in just 10 seconds. 

It consists of a vision encoder and a MoE decoder, with the vision encoder operating in three resolution modes: medium, high, and ultra-high. The MoE decoder has 66 experts in each layer, with 2 shared experts and 6 activated experts per token.
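To give a rough sense of what "2 shared experts and 6 activated experts per token" means, here’s a simplified PyTorch sketch of that routing scheme. The dimensions, module layout, and naive per-token loop are illustrative assumptions, not Aria’s actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedAriaStyleMoE(nn.Module):
    """Toy MoE layer: 2 always-on shared experts plus a router that
    activates 6 of the remaining 64 routed experts for each token."""

    def __init__(self, d_model=512, d_ff=1024, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        # Shared experts process every token.
        shared_out = sum(expert(x) for expert in self.shared)
        # The router picks the top-k routed experts per token.
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)  # both (n_tokens, top_k)
        routed_out = []
        for t in range(x.size(0)):
            token_out = sum(
                w * self.routed[e](x[t])
                for w, e in zip(weights[t].tolist(), idx[t].tolist())
            )
            routed_out.append(token_out)
        return shared_out + torch.stack(routed_out)

layer = SimplifiedAriaStyleMoE()
tokens = torch.randn(4, 512)      # four token embeddings
print(layer(tokens).shape)        # torch.Size([4, 512])
```

Only the selected experts run for a given token, which is how MoE models keep the number of activated parameters much smaller than the total parameter count.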

World Labs' Google Cloud Partnership, TensorWave's AMD Challenge, and Intel's AI-Enabled Processors

Fei-Fei Li’s World Labs (which came out of stealth last month) has selected Google Cloud as its primary compute provider to train its "spatially intelligent" AI models. The startup is putting a significant portion of its $230 million funding round toward GPU server licensing.

World Labs has a big focus on developing multimodal AI models capable of processing, generating, and interacting with video and geospatial data, which are all pretty computationally demanding.

World Labs’ partnership with Google Cloud is non-exclusive, so the startup can potentially explore other cloud providers in the future. For now, though, Google Cloud hosts the majority of World Labs’ workloads, and Google aims to retain this business long-term.

Nvidia has seen a lot of competition step up to the plate in the chip market in recent months, with last week being no exception. TensorWave is going against the grain by launching a cloud platform that only offers access to hardware from Nvidia rival AMD for AI workloads. Their main goal is to democratize AI by offering more affordable compute access.

TensorWave uses AMD Instinct MI300X GPUs for AI workloads, claiming its MI300X instances outperform Nvidia's H100 in running (but not training) AI models, particularly text-generating models like Meta's Llama 2.

TensorWave rents GPU capacity by the hour with a minimum six-month contract. Pricing ranges from approximately $1 to $10 per hour, depending on workload requirements and GPU configurations. Interestingly enough, the company aims to be more cost-effective than competitors due to the lower cost of AMD GPUs compared to Nvidia's H100.

In terms of growth, TensorWave is already generating $3 million in annual recurring revenue. The company expects to reach $25 million in revenue by the end of the year. TensorWave plans to scale up to 20,000 MI300X GPUs. The company intends to bring AMD's next-gen MI325X GPUs online as early as November/December 2024.

We also saw a new release from Intel called the Intel Core Ultra 200S series, which features built-in Neural Processing Units (NPUs), making these Intel's first desktop processors with integrated AI capabilities.

From Packages to Products: Amazon's AI Innovations in Delivery and Shopping

Amazon is introducing vision-based technology to its electric vehicle fleet. It’s designed to help drivers prioritize packages by highlighting packages to deliver with a green circle and flagging incorrect packages with a red light.

The VAPR (Vision Assisted Package Retrieval) system aims to save drivers from having to stop and manually search for packages, reducing the time spent per stop from 2–5 minutes to under a minute.

It also includes an audio cue to confirm if the driver has selected the correct package, which gets rid of the need for handheld devices that drivers currently use for package scanning and tracking.

Turns out the VAPR system has been in development since early 2020, with Amazon considering unique delivery challenges like lighting and space constraints inside vans. There are plans to deploy the VAPR system in 1,000 of its electric Rivian vans by early 2025, after testing the technology in select markets like Boston. It’s important to note that Amazon is Rivian’s largest shareholder with a stake of 16.6%.

Amazon also released AI Shopping Guides to help users find products based on specific features, offering tailored suggestions and product information for over 100 product types, including TVs, headphones, and skincare. 

It’s a more visual way to filter products, replacing traditional tick-box menus with interactive options for factors like brand, use case (e.g., sport or gaming), and connectivity type. Each guide includes educational content and customer insights to help users better understand product features and make more informed purchasing decisions.

The AI guides appear automatically during searches when relevant, or users can directly explore them through Amazon’s mobile website and apps for iOS and Android.

AI for Refining Carbon Accounting and Updates from BioNTech’s Inaugural AI Day

AI saw some climate tech applications last week, with Forward Earth focusing on using AI to automate the calculation of complex CO2 footprints. The co-founders, former executives at carbon accounting startup Planetly, launched Forward Earth because they weren’t happy with Planetly’s acquisition and shutdown by OneTrust.

BioNTech recently had its inaugural AI day where they unveiled a bunch of key updates. 

Here are the main ones you need to know:

  • AI Scaling Across Immunotherapy Pipeline: BioNTech detailed its strategy to scale AI capabilities throughout its immunotherapy pipeline, using AI to drive innovation in areas like DNA/RNA sequencing, proteomics, protein design, and immunohistochemistry.

  • Launch of Kyber Supercomputer: InstaDeep (BioNTech’s AI subsidiary) unveiled Kyber - a near exascale supercomputer aimed at enabling high-performance computing for large-scale AI and biotechnology research.

  • Bayesian Flow Network (BFN): BioNTech presented BFN generative models for protein sequence generation, showing the potential for advancements in personalized vaccine development and targeted therapies.

  • DeepChain™ Platform and External Partnerships: The DeepChain™ multiomics design platform was introduced for external partnerships after success in BioNTech’s internal projects, including the mRNA-encoded antibody RiboMab™ platform.

Adobe Launches Free Web App for Content Credentials, Addressing Past AI Training Controversies

Adobe is launching a free web app to improve the process of applying Content Credentials to images, videos, and audio files.

It’s an interesting turn of events since we saw Adobe use Midjourney-generated images to train their Firefly model in the past, even though they said the model was a “commercially safe” alternative.

It might be Adobe’s response to the trust it lost around ethical AI practices. They’re introducing tools that let creators protect their work and opt out of AI training datasets, which shows they’re adapting their practices and policies for a more ethical approach.

Advancements in AI Research

One paper that really stood out last week explored the emergence of intelligence in rule-based systems. OpenAI also released a new benchmark for evaluating AI agents on machine learning engineering tasks, another benchmark put language agents to the test in scientific discovery, and researchers provided a fresh perspective on assessing LLM reasoning.

How Rule Complexity Shapes AI Behavior

Framework overview. (Source)

Researchers introduced a new approach to understanding the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules.

Previously we saw the release of LifeGPT, which also advanced work on cellular automata. This paper used elementary cellular automata (ECA) as a framework to generate behaviors ranging from simple to highly complex.

They also trained separate GPT-2 language models on datasets generated by individual ECAs and evaluated the models' "intelligence" through performance on downstream logical reasoning tasks.
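For context, an elementary cellular automaton updates a row of binary cells with a rule number from 0 to 255 that fixes each cell’s next value from its three-cell neighborhood. Here’s a minimal sketch of how such training sequences could be generated; the sequence formatting is our own, not necessarily the paper’s:

```python
import random

def eca_step(state, rule):
    """One elementary cellular automaton update with periodic boundaries."""
    n = len(state)
    return [
        (rule >> (state[(i - 1) % n] * 4 + state[i] * 2 + state[(i + 1) % n])) & 1
        for i in range(n)
    ]

def generate_sequence(rule, width=32, steps=16, seed=0):
    """Produce successive ECA states as text, e.g. for language model training."""
    random.seed(seed)
    state = [random.randint(0, 1) for _ in range(width)]
    rows = []
    for _ in range(steps):
        rows.append("".join(map(str, state)))
        state = eca_step(state, rule)
    return "\n".join(rows)

# Rule 110 is a classic Class IV rule with structured yet complex behavior.
print(generate_sequence(rule=110))
```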

A positive correlation was found between the complexity of ECA rules and the downstream performance of models trained on them. Models trained on Class IV ECA rules, which showed structured yet complex behaviors, performed optimally. 

Interestingly, it showed that models can learn complex solutions even when trained on simple rules, which is likely due to overparameterization. They presented a hypothesis that by learning to incorporate past states, models develop generalizable logic that can be reused across tasks.

From Bronze to Gold: OpenAI's MLE-bench Challenges AI Agents in Real-World ML Tasks

In addition to Swarm, OpenAI also introduced MLE-bench last week. It’s a new benchmark for assessing the capabilities of AI agents in machine learning engineering tasks. 

This benchmark aims to provide a rigorous measure of progress in autonomous ML engineering agents, addressing the growing interest in using AI to automate scientific workflows.

Key aspects of MLE-bench include:

  1. A diverse set of 75 Kaggle competitions across various domains, including natural language processing, computer vision, and signal processing.

  2. Careful curation to ensure tasks are challenging and representative of contemporary ML engineering work.

  3. The ability to compare AI agent performance directly with human-level performance using Kaggle leaderboards.

The researchers evaluated several frontier language models on MLE-bench using open-source agent scaffolds. They found the best-performing setup to be OpenAI's o1-preview with AIDE scaffolding, which achieved at least a bronze medal level in 16.9% of competitions.

Performance significantly improved when agents were given multiple attempts per competition, with o1-preview's score doubling from 16.9% to 34.1% when allowed 8 attempts. Moreover, agents performed well on competitions solvable with well-known approaches but struggled with debugging and recovering from missteps.

ScienceAgentBench: Putting AI to the Test in Scientific Discovery

ScienceAgentBench is a new benchmark for evaluating language agents in data-driven scientific discovery tasks. It addresses the growing interest in using AI to automate scientific workflows, while also highlighting the need for more rigorous assessment of these systems.

102 diverse tasks were extracted from 44 peer-reviewed publications across four scientific disciplines: Bioinformatics, Computational Chemistry, Geographical Information Science, and Psychology & Cognitive Neuroscience.

There was a big focus on code generation, requiring agents to produce complete Python programs for data analysis and visualization tasks. They also used careful quality control measures, including expert validation and strategies to mitigate data contamination concerns.

They evaluated five open-weight and proprietary LLMs using three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Surprisingly, they found simpler approaches like self-debug usually outperformed more complex frameworks in both performance and cost-efficiency.

Results show that even the best-performing agent, Claude-3.5-Sonnet using self-debug, could only solve 34.3% of tasks with expert-provided knowledge. It’s definitely an eye-opener about the challenges AI agents face in automating complex scientific workflows.

New Benchmark Reveals LLM Reasoning Disparities

Graph comparing reasoning performance on GSM8K and Compositional GSM accuracy. (Source)

Researchers from Mila, Microsoft Research, and Google DeepMind introduced a new approach to evaluating the reasoning capabilities of LLMs in grade-school math problems. 

They introduced Compositional GSM, a two-hop version of the GSM8K benchmark that challenges LLMs to solve chained math problems.
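To illustrate the two-hop setup with a toy example of our own (not an actual benchmark item): the numeric answer to one GSM8K-style question is substituted as a variable X into a second question, so a model only gets credit if it solves both in sequence.

```python
# Toy illustration of a two-hop "compositional" problem: Q1's answer
# becomes the variable X inside Q2. Scoring uses only the final answer.
q1 = "A baker makes 4 trays of 12 muffins and sells 10 of them. How many muffins are left?"
a1 = 4 * 12 - 10  # 38

q2_template = ("A school orders X muffins. They are packed into boxes of 19. "
               "How many boxes are needed?")
q2 = q2_template.replace("X", str(a1))
a2 = a1 // 19  # 2

print(q2)
print("Final answer:", a2)
```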

They also evaluated various LLMs, including open-source and proprietary models like Gemini, Gemma2, LLAMA3, GPT, Phi, Qwen2.5, and Mistral families.

There were significant disparities between LLMs' performance on standard GSM8K problems and the more complex Compositional GSM tasks. Smaller, more cost-efficient, and math-specialized models showed larger reasoning gaps.

For instance, GPT-4o mini, which nearly matches GPT-4o on standard benchmarks, showed a 2-12x worse reasoning gap on Compositional GSM.

The study also showed that instruction-tuning effects vary across LLM sizes, with smaller models showing more improvement on standard GSM8K but less on Compositional GSM. Extensive math specialization didn’t necessarily improve performance on these compositional tasks.

Conversations We Loved

A post about a new architecture which might be able to improve on o1’s ability to scale inference-time compute gained a lot of attention - and for good reason.

Another discussion about LongCite, a new method to improve the trustworthiness of AI outputs, was also one to look at.

Pause Tokens and Parallel Reasoning: Entropix's New Approach to AI Cognition

A discussion about Entropix came up last week, which is an innovative architecture anonymously released by an AI researcher. 

It highlighted a new approach to replicating and potentially improving upon OpenAI's latest o1 model's ability to scale inference-time compute - essentially improving its capacity to 'think' before responding.

What stood out was the architecture's use of uncertainty measurement (formally defined as entropy and varentropy) to improve reasoning.

The model inserts pause tokens like "...wait" when uncertain about the next best tokens or thoughts, prompting it to reflect and produce additional chains of thought. This approach allows for dynamic scaling of inference-time compute for more profound and strategic thinking.
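Here’s a rough sketch of that uncertainty signal; the thresholds, the pause-token string, and the decision rule are illustrative assumptions rather than the actual Entropix code:

```python
import torch
import torch.nn.functional as F

def entropy_varentropy(logits):
    """Entropy and varentropy (variance of surprisal) of the next-token distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(-1)
    return entropy, varentropy

def choose_next(logits, entropy_threshold=3.0, varentropy_threshold=3.0):
    """Sample greedily when confident; inject a pause token when both signals are high."""
    entropy, varentropy = entropy_varentropy(logits)
    if entropy > entropy_threshold and varentropy > varentropy_threshold:
        return "...wait"  # nudge the model to reflect before continuing
    return int(torch.argmax(logits, dim=-1))

logits = torch.randn(50_000)  # stand-in for one step of next-token logits
print(choose_next(logits))
```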

Makes us wonder how this method could potentially unlock more powerful reasoning capabilities from smaller models like Llama 3.2 1B, which can run locally on a laptop.

Perhaps advancements in the underlying attention mechanism, rather than relying solely on prompt engineering or methods like Monte Carlo Tree Search, might be key to unlocking the true potential of inference-time compute.

Boosting LLM Trustworthiness with Fine-Grained Citations

Raschka brought an interesting paper called LongCite to the spotlight, one that probably slipped under the radar. It introduces a new approach to boost the trustworthiness of LLMs in long-context question answering tasks.

What makes this paper important is its focus on making AI-generated content more trustworthy. Misinformation is common in today’s world, so equipping LLMs with the ability to generate precise, sentence-level citations could definitely raise human confidence in AI outputs. Not only does that help address concerns about hallucinations, but it also helps us verify information.

The method used also stands out, since they leveraged existing LLMs to create a dataset of long-context Q&A instances. It’s an example of step-wise refinement that shows how AI can be adapted to deal with pressing issues.

Maybe we’ll see a shift toward models that prioritize citation and accountability as core functionalities in the future.

Frameworks We Love

Some frameworks that caught our attention in the last week include:

  • IterComp: Combines the strengths of multiple diffusion models to improve compositional text-to-image generation. 

  • AvatarGo: Generates animatable 4D human-object interaction (HOI) scenes directly from textual inputs.

  • SimNPO: An unlearning framework for LLMs that aims to remove unwanted data influences and the associated model capabilities.

If you want your framework to be featured here, reply to this email saying hi :)

Money Moving in AI

AI applications in mineral discovery seem to be making good progress as KoBold Metals secured $491 million after a massive discovery. Meanwhile, Basecamp and Braintrust secured $60 million and $36 million respectively.

KoBold Metals Secures $491 Million

KoBold Metals, an AI-powered mineral discovery startup, is close to raising over half a billion dollars, having already secured $491 million of a targeted $527 million round. 

This funding comes on the heels of KoBold's discovery of what might be one of the largest high-grade copper deposits in history, showing the potential of AI in mineral exploration. 

Basecamp Raises $60 Million in Series B Funding Round

A London-based startup building an AI agent for biology and biodiversity insights, Basecamp Research, has raised $60 million in a Series B funding round led by Singular.

The company aims to create an AI that can not only answer questions about biology but also produce new insights beyond human capabilities, with its BaseFold model already claiming to outperform DeepMind's AlphaFold 2 in certain protein structure predictions.

Rad AI Secures $50 Million in Series B Funding Round

A startup focusing on generative AI for healthcare called Rad AI secured $50 million in Series B funding led by Khosla Ventures, which brings their total capital raised to over $80 million. The CEO attributed a big part of their success to strong business metrics, including tripling year-over-year revenue and adoption by more than a third of all US health systems.

Braintrust Raises $36 Million in Series A Funding Round

Braintrust focuses on empowering teams to build robust LLM-enabled applications. They announced a $36 million funding round, bringing their total funding to $45 million. In particular, they’re working toward building features like being able to share code in dev environments.