The reliance on public data — mostly web data — to train AI is holding back the AI field. That’s according to Daniel Beutel, a tech entrepreneur and researcher at the University of Cambridge, who co-founded a startup, Flower, to solve what he sees as a growing problem in AI research.
“Public, centralized data is only a tiny fraction of all the data in the world,” Beutel told TechCrunch in an email interview. “In contrast, distributed data (the data that’s trapped on devices like phones, wearables and internet of things devices or in organizational silos, such as business units within an enterprise) is much larger and more comprehensive, but out of reach for AI today.”
Flower, which Beutel co-founded in 2020 with Cambridge colleagues Taner Topal and Nicholas Lane, the ex-head of Samsung’s AI Center in Cambridge, is an attempt to “decentralize” the AI training process through a platform that allows developers to train models on data spread across thousands of devices and locations. Relying on a technique called federated learning, Flower doesn’t provide direct access to data, making it ostensibly “safer” to train on in situations where privacy or compliance is a concern.
“Flower believes that, once made easy and accessible because of the fundamental advantages of distributed data, this approach to AI will not only become mainstream, but also the norm for how AI training is performed,” Beutel said.
Federated learning isn’t a new approach. First proposed in academia years ago, the technique entails training AI algorithms across decentralized devices holding data samples without exchanging those samples. A centralized server might be used to orchestrate the algorithm’s training, or the orchestration might happen on a peer-to-peer basis. But in any case, local algorithms are trained on local data samples, and the weights — the algorithms’ learnable components — are exchanged between them to generate a global model.
“With Flower, the data never needs to leave the source device or location (e.g., a company facility) during training,” Beutel explains. “Instead, ‘compute goes to the data,’ and partial training is performed at each location where the data resides — with only training results and not the data eventually being transmitted and merged with the results of all other locations.”
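To make the mechanics concrete, here is a minimal sketch of one common way to merge locally trained weights: weighted federated averaging, in which a central server averages each client's weights in proportion to how much data that client holds. This is an illustrative assumption, not Flower's API or necessarily the aggregation method it uses. The function names (`local_train`, `federated_average`, `run_federated_training`, `make_client`), the toy linear model and the synthetic data are all hypothetical.

```python
import random

def local_train(weights, data, lr=0.2, epochs=5):
    """Train a tiny linear model y = w*x + b on one client's private data.

    Only the updated weights are returned; the raw data never leaves here.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2 * err * x
            grad_b += 2 * err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Merge client weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def run_federated_training(clients, rounds=60):
    """Server loop: send out the global weights, collect updates, merge them."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_train(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights

def make_client(rng, n, lo, hi):
    """Hypothetical synthetic data: each client sees a different range of x."""
    return [(x, 3.0 * x + 1.0) for x in (rng.uniform(lo, hi) for _ in range(n))]

rng = random.Random(0)
clients = [
    make_client(rng, 20, -1.0, 0.0),  # e.g. a phone
    make_client(rng, 30, 0.0, 1.0),   # e.g. a wearable
    make_client(rng, 50, -1.0, 1.0),  # e.g. a company facility
]
w, b = run_federated_training(clients)
print(f"global model: w={w:.4f}, b={b:.4f}")
```

In this toy setup every client's data follows the same underlying rule (y = 3x + 1), so the merged model should land close to w = 3 and b = 1 even though no client ever shares a data point. Real deployments face harder conditions, such as clients whose data follow different distributions, which a sketch like this does not address.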
Flower recently launched FedGPT, a federated approach to training large language models (LLMs) comparable to OpenAI’s ChatGPT and GPT-4. Currently in preview, FedGPT lets companies train LLMs on data spread around the world and on different devices, including data centers and workstations.
“FedGPT is important because it allows organizations to build LLMs using internal, sensitive data without sharing them with an LLM provider,” Beutel said. “Companies also often have data spread around the world, or in different parts of the organization, that are unable to move or leave a geographic region. FedGPT lets all of this data be leveraged when training an LLM while still respecting concerns over privacy and data leakage, and laws restricting data movement.”
Flower is also partnering with Brave, the open source web browser, to spearhead a project called Dandelion. The goal is to build an open source federated learning system spanning the more than 50 million Brave browser clients in use today, Beutel says.
“AI is entering a time of increasing regulation and special care over the provenance of the data it uses,” Beutel said. “Customers can build AI systems using Flower where user privacy is strongly protected, and yet they are still able to leverage more data than they ever could before … Under Flower, due to federated learning principles, an AI system can still successfully be deployed and trained under different constraints.”
Flower’s seen impressive uptake over the past several months, with its community of developers growing to just over 2,300, according to Beutel. He claims that “dozens” of Fortune 500 companies and academic institutions are Flower users, including Porsche, Bosch, Samsung, Banking Circle, Nokia, Stanford, Oxford, MIT and Harvard.
Buoyed by those metrics, Flower — a member of one of Y Combinator’s 2023 cohorts — has attracted investors like First Spark Ventures, Hugging Face CEO Clem Delangue, Factorial Capital, Betaworks, and Pioneer Fund. In its pre-seed round, the startup raised $3.6 million.
Beutel says that the round will be put toward expanding Flower’s core team of researchers and developers and accelerating the development of the open source software that powers Flower’s framework and ecosystem.
“AI is facing a crisis of reproducibility, and this is even more acute for federated learning,” Beutel said. “Due to the lack of widespread training on distributed data, we lack a critical mass of open-source software implementations of popular approaches … By everyone working together, we aim to have the world’s largest set of open-source federated techniques available on Flower for the community.”