What’s Next After The Open Source AI Definition?
By Ben Cotton, Head of Community, Kusari
The Open Source Initiative (OSI), steward of the Open Source Definition (OSD), is finalizing an Open Source AI Definition (OSAID). "Open source AI" is a term begging for definition: model vendors, seeking competitive differentiation, have called their models "open source", but there's no shared understanding of what that means. Meta, for example, calls its Llama model open source, even though its license restricts certain commercial uses, a clear violation of the OSD's rule against discriminating by field of endeavor.
The OSD has served as the generally accepted definition of "open source" in the software space, and it inspired similar definitions for hardware (the Open Source Hardware Definition) and data (the Open Knowledge Foundation's Open Definition). It stands to reason that the OSI is an ideal organization to drive the creation of a vendor-neutral definition of "open source AI."
But the OSI's reputation has created a problem for the OSAID: the definition falls short of what many long-time supporters and open advocates expect. The latest draft (release candidate 1) requires, at a minimum, a "data description" that indicates how the training dataset was selected, labeled, and so on. Model creators are supposed to share the training data itself when possible, but it's easy for them to argue that the data is unshareable, in which case sharing is not required. Because the training data shapes the model's output, critics argue that a model whose data is withheld can't achieve the degree of "open" we've come to expect in the software world.
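To make that concrete, here is a minimal sketch of the kind of metadata a "data description" might carry for a model whose creators assert the data cannot be shared. The field names and values are my own illustration; the OSAID does not prescribe a schema:

```python
# Hypothetical example of a "data description"; the keys here are
# illustrative inventions, not language drawn from the OSAID text.
data_description = {
    "sources": ["public web crawl", "licensed book corpus"],
    "selection": "pages filtered by a quality classifier",
    "labeling": "instruction pairs annotated by contractors",
    "data_shared": False,
    "reason_unshared": "vendor asserts third-party licensing restrictions",
}
```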
A flawed definition is better than no definition. It gives us a common vocabulary so that when a vendor calls their model “open source”, we can say “yes it is” or “no it isn’t” with confidence. After all, collective pressure is how we’ve enforced the definition of “open source” in software. The risk, of course, is that the community may not be willing to enforce a definition of “open source” that it does not agree with.
If the broader community is willing to treat the OSAID as better than nothing, then one of two things might happen. Applying the OSAID to real-world models may highlight the flaws in the initial definition, and future revisions (there will certainly be revisions; the OSD itself has been revised several times over the decades) may address critics' concerns by requiring full sharing of training data with no exceptions. Alternatively, real-world application may show the concerns to be unfounded: models that meet the OSAID may prove sufficiently auditable and modifiable to meet the needs of users.
If the broader community does not stand behind the OSAID, then "open source AI" remains functionally undefined. Either another organization drives a definition that gains consensus, or "open source AI" comes to mean whatever a model vendor decides it means. The resulting confusion will make it difficult for model users to know what they're getting, and opacity is dangerous when AI is used to write code, provide customer support, and perform other key business functions.
In either case, it may be time to expand our vocabulary. "Open source" is essentially a binary attribute: a software license is either open source or it is not. This mostly works, although movements such as Ethical Source and source-available licenses like the Business Source License show that there are overlapping-but-different views of what "open" means. AI models are software, yes, but also configuration and data. The openness of those different pieces can vary, so we need a richer way to describe them.
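As a thought experiment, and emphatically not a proposal from the OSI or anyone else, a richer description might label each component of a model separately. This sketch invents its own scale and field names purely for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class Openness(Enum):
    """One component's degree of openness (a hypothetical scale)."""
    OPEN = "open"              # shared under an OSD-style license
    RESTRICTED = "restricted"  # shared, but with use restrictions
    DESCRIBED = "described"    # documented but not shared
    CLOSED = "closed"          # neither shared nor documented

@dataclass
class ModelOpenness:
    """Per-component labels instead of a single open/closed bit."""
    weights: Openness
    inference_code: Openness
    training_code: Openness
    training_data: Openness

# A model with openly licensed weights and code whose training data
# is only described would be summarized like this, rather than with
# the single word "open source":
example = ModelOpenness(
    weights=Openness.OPEN,
    inference_code=Openness.OPEN,
    training_code=Openness.OPEN,
    training_data=Openness.DESCRIBED,
)
```

Something along those lines would let us say precisely which parts of a model are open, instead of arguing over a single word that was coined for a simpler situation.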