A new approach to AI development is redefining how large language models are trained, distributing the work across machines connected over the internet and unlocking new data sources.
A New Frontier in AI Development: Distributed Machine Learning
Researchers have made a breakthrough in developing large language models (LLMs) with distributed machine learning, an approach that allows these massive models to be trained across the internet without relying on traditional data centers.
Distributed machine learning enables efficient processing and analysis of large datasets by breaking them down into smaller, manageable chunks.
This approach leverages multiple machines or nodes to train models in parallel, significantly reducing training time and increasing scalability.
Key benefits include improved model accuracy, enhanced data security, and optimized resource utilization.
Some studies report that distributed machine learning can accelerate training by as much as 10x compared with traditional single-machine methods, though the gains depend on the workload and network conditions.
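To make the idea concrete, here is a minimal sketch of synchronous data parallelism in Python: a toy dataset is split into shards, each simulated node computes a gradient on its own shard, and the averaged gradient drives every shared update. The data, node count, and learning rate are all illustrative.

```python
import numpy as np

# Minimal sketch of synchronous data parallelism: the dataset is split into
# shards, each "node" computes a gradient on its shard, and the gradients
# are averaged before every weight update. All values here are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                             # toy features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1000)   # toy targets

n_nodes = 4
shards = np.array_split(np.arange(len(X)), n_nodes)  # one shard per node
w = np.zeros(8)                                      # shared model weights

for step in range(200):
    grads = []
    for shard in shards:              # in practice: runs on separate machines
        err = X[shard] @ w - y[shard]
        grads.append(X[shard].T @ err / len(shard))  # local gradient
    w -= 0.05 * np.mean(grads, axis=0)               # consolidate and update
```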
The Power of Crowdsourced Training
The latest model, Collective-1, was developed by two startups, Flower AI and Vana, which collaborated to create it. The method uses GPUs dotted around the world, fed with a mix of private and public data, and could disrupt the dominant way of building artificial intelligence.
Flower AI created techniques that enable training to be spread across hundreds of computers connected over the internet. The company’s technology is already used by some firms to train AI models without needing to pool compute resources or data. Vana provided sources of data including private messages from X, Reddit, and Telegram.
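Flower's actual tooling and API are not shown here; the plain-NumPy sketch below only illustrates the federated-averaging pattern such systems coordinate, in which raw data never leaves each client and only model weights cross the network.

```python
import numpy as np

# Sketch of federated averaging: each client trains on its own private data,
# and only model weights (never raw data) travel over the network. This is
# illustrative plain NumPy, not Flower's actual API.
rng = np.random.default_rng(1)

def local_train(w, X, y, lr=0.05, steps=20):
    """Run a few gradient steps on one client's private shard."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Each client holds data the coordinating server never sees.
clients = [(rng.normal(size=(200, 8)), rng.normal(size=200)) for _ in range(5)]
global_w = np.zeros(8)

for rnd in range(10):                    # federated rounds
    updates = [local_train(global_w, X, y) for X, y in clients]
    global_w = np.mean(updates, axis=0)  # server averages returned weights
```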
Collective-1 is small by modern standards, at 7 billion parameters, but its developers believe the approach can scale to far larger models. Flower AI plans to train a 100-billion-parameter model later this year, which its developers say could change the way everyone thinks about AI.
An artificial intelligence (AI) model is a mathematical system trained to perform a specific task or solve a specific class of problem.
These models use algorithms and statistical techniques to analyze data, identify patterns, and make predictions or decisions.
Common types of AI models include neural networks, decision trees, and support vector machines.
AI models can be trained on large datasets to improve their accuracy and performance.
They are widely used in applications such as image recognition, natural language processing, and predictive maintenance.
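As a toy illustration of one model type named above, the snippet below fits a small decision tree with scikit-learn; the dataset is invented for the example.

```python
# Toy example of training a decision tree: the model learns a simple rule
# from labeled examples, then applies it to unseen input (needs scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Features: [hours_studied, hours_slept]; label: passed the exam (1) or not (0).
X = [[1, 4], [2, 8], [6, 7], [8, 6], [3, 5], [9, 8]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[7, 7]]))  # [1]: the learned split on study time generalizes
```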
A Shift in Power Dynamics
Distributed model-building could also upset the power dynamics that have shaped the AI industry. Currently, companies build their models by combining vast amounts of training data with huge quantities of compute concentrated inside data centers stuffed with advanced GPUs. They rely heavily on datasets created by scraping publicly accessible material, including websites and books.

This approach means that only the richest companies and nations with access to large quantities of powerful chips can feasibly develop the most powerful and valuable models. Even open-source models are built by companies with access to large data centers.
However, distributed approaches could make it possible for smaller companies and universities to build advanced AI by pooling disparate resources, or allow countries that lack conventional AI infrastructure to network several data centers together to build a more powerful model.
The Benefits of Distributed Learning
The new approach allows the work normally done inside a large data center to be performed on hardware that may be many miles away and connected over a relatively slow or variable internet connection. This process is slower than conventional training but is more flexible, allowing new hardware to be added to ramp up training.
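One common way to tolerate slow, variable links, sketched below under illustrative assumptions, is to let each worker take many local steps between infrequent synchronizations and to let newly added hardware adopt the current consensus weights when it joins mid-run. This is a generic low-communication pattern, not Flower's actual implementation.

```python
import numpy as np

# Sketch of low-communication training over a slow link: workers take many
# local steps between rare synchronizations, and a new machine can join
# mid-run. Illustrative only; not Flower's or Google's actual scheme.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))
y = X @ rng.normal(size=8)

def local_steps(w, idx, steps=50, lr=0.02):
    """Many cheap local updates on one worker's slice before any sync."""
    w = w.copy()
    for _ in range(steps):
        batch = rng.choice(idx, size=32)
        w -= lr * X[batch].T @ (X[batch] @ w - y[batch]) / 32
    return w

workers = [np.arange(0, 1000), np.arange(1000, 2000)]  # initial data split
w = np.zeros(8)

for sync_round in range(20):
    if sync_round == 10:                      # new hardware joins mid-run
        workers.append(rng.choice(2000, size=500, replace=False))
    # Infrequent sync: average everyone's locally evolved weights.
    w = np.mean([local_steps(w, idx) for idx in workers], axis=0)
```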
Researchers at Google demonstrated a new scheme for dividing and consolidating computations called DIstributed PAth COmposition (DiPaCo) that enables more efficient distributed learning.
To build Collective-1 and other LLMs, Flower AI developed a new tool called Photon that makes distributed training more efficient. Photon improves on Google's approach with a more efficient way to represent the data in a model and a more efficient scheme for sharing and consolidating training.
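The article describes Photon only at a high level. As one generic illustration of representing shared model data more compactly, the sketch below transmits half-precision weight deltas instead of full-precision weights; this is an assumption made for illustration, not Photon's actual mechanism.

```python
import numpy as np

# Illustration of shrinking what crosses the network: ship half-precision
# weight *deltas* rather than full float64 weights. This is a generic
# compression idea, not a description of Photon's internals.
def encode_update(new_w, base_w):
    return (new_w - base_w).astype(np.float16)   # 4x smaller than float64

def apply_updates(base_w, deltas):
    # Consolidate by averaging the decoded deltas onto the shared weights.
    return base_w + np.mean([d.astype(np.float64) for d in deltas], axis=0)

base = np.zeros(8)
local_models = [base + np.random.default_rng(i).normal(scale=0.01, size=8)
                for i in range(3)]
base = apply_updates(base, [encode_update(w, base) for w in local_models])
```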
Unlocking New Data Sources
The distributed approach is likely to unlock new kinds of data, including decentralized and privacy-sensitive data in healthcare and finance, without the risks associated with data centralization. This could be particularly beneficial for smaller companies and universities that may not have access to large data centers but can contribute their data to models like Collective-1.
Data sources refer to the origin of data, including databases, spreadsheets, and external files.
They can be categorized into internal and external sources.
Internal sources include company databases, customer relationship management systems, and enterprise resource planning software.
External sources encompass social media, online reviews, and public datasets.
Accurate identification and utilization of reliable data sources are crucial for informed decision-making in business and research.
Flower AI’s partner, Vana, is developing new ways for users to share personal data with AI builders, allowing them to specify which end uses are permitted and even to benefit financially from their contributions.
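A hypothetical sketch of what such a user-controlled contribution record might look like appears below; the field names and the revenue-share mechanism are assumptions for illustration, not Vana's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for a user-controlled data contribution, in the spirit
# of what Vana describes. Field names and the revenue-share mechanism are
# assumptions for illustration, not Vana's actual schema.
@dataclass(frozen=True)
class DataContribution:
    owner_id: str
    payload_uri: str            # where the contributed data lives
    permitted_uses: frozenset   # e.g. frozenset({"pretraining", "evaluation"})
    revenue_share: float        # fraction of model earnings owed to the owner

def may_use(contribution: DataContribution, purpose: str) -> bool:
    """A trainer checks permission before touching the data."""
    return purpose in contribution.permitted_uses

msg = DataContribution("user-42", "vault://msgs/42",
                       frozenset({"pretraining"}), 0.001)
assert may_use(msg, "pretraining") and not may_use(msg, "advertising")
```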