The rise of open source AI models: transparency and accountability challenged

June 24, 2024


As the era of generative AI progresses, more and more firms are participating in its development, and the models themselves have become increasingly diverse.

Amid this AI boom, many firms have touted their models as “open source,” but what does that actually mean in practice?

The concept of open source has its roots in the software development community. With traditional open source software, the source code is freely available for anyone to view, modify, and distribute.

Essentially, open source is a collaborative approach to knowledge sharing that has driven software innovations such as the Linux operating system, the Firefox web browser, and the Python programming language.

However, applying the open source ethos to today's massive AI models is anything but easy.

These systems are often trained on huge datasets containing terabytes or petabytes of data, using complex neural network architectures with billions of parameters.

The computing resources required cost tens of millions of dollars, skilled staff are scarce, and intellectual property is usually closely guarded.

We can see this at OpenAI, which, as the name suggests, was once an AI research lab largely committed to the open source ethos.

However, this ethos quickly eroded when the company smelled money and needed to attract investment to achieve its goals.

Why? Because open source products are usually not focused on profit, and AI is expensive and valuable.

Given the explosive growth of generative AI, developers such as Mistral, Meta, BigScience (the group behind BLOOM) and xAI are releasing open-source models for further exploration, while preventing firms like Microsoft and Google from accumulating too much influence.

But how many of these models are really open source, and not just in name?

Clarifying how open open source models really are

In a recent study, researchers Mark Dingemanse and Andreas Liesenfeld from Radboud University in the Netherlands analyzed a number of well-known AI models to find out how open they are. They examined several criteria, such as the availability of source code, training data, model weights, research papers and APIs.

For example, it turns out that Meta's Llama and Google's Gemma models are merely “open weight,” meaning that the trained model is released for public use without full transparency regarding its code, training process, data, and fine-tuning methods.

At the other end of the spectrum, the researchers highlighted BLOOM, a large multilingual model developed by a collaboration of over 1,000 researchers worldwide, as an example of true open-source AI. Every element of the model is freely available for inspection and further research.

The paper evaluated over 30 models (both text and image), but these six illustrate the big differences among the models that claim to be open source:

BloomZ (BigScience): Completely open on all criteria, including code, training data, model weights, research papers, and API. Highlighted as an example of truly open source AI.
OLMo (Allen Institute for AI): Open code, training data, weights and research papers. API only partially open.
Mistral 7B Instruct (Mistral AI): Open model weights and API. Code and research papers only partially open. Training data not available.
Orca 2 (Microsoft): Model weights and research papers partially open. Code, training data and API closed.
Gemma 7B Instruct (Google): Partially open code and weights. Training data, research papers and API closed. Described by Google as “open” rather than “open source”.
Llama 3 Instruct (Meta): Partially open weights. Code, training data, research papers and API closed. An example of an “open weight” model without fuller transparency.

A comprehensive breakdown of the extent to which various AI models are open source. Source: ACM Digital Library (open access)
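To make the comparison concrete, the six ratings above can be encoded as a small table and ranked. The following is only a rough sketch: the numeric weights (1 for open, 0.5 for partially open, 0 for closed), the summed score, and the names `LEVEL_SCORE`, `openness_score` and `rank_models` are assumptions made for illustration, not the scoring method used by Dingemanse and Liesenfeld.

```python
# Illustrative only: the 1 / 0.5 / 0 weights and the summed score are
# assumptions made for this sketch, not the scoring used in the study.
LEVEL_SCORE = {"open": 1.0, "partial": 0.5, "closed": 0.0}

# Order of the criteria in each ratings tuple below.
CRITERIA = ("code", "training_data", "weights", "paper", "api")

# Ratings transcribed from the six-model summary in this article.
RATINGS = {
    "BloomZ": ("open", "open", "open", "open", "open"),
    "OLMo": ("open", "open", "open", "open", "partial"),
    "Mistral 7B Instruct": ("partial", "closed", "open", "partial", "open"),
    "Orca 2": ("closed", "closed", "partial", "partial", "closed"),
    "Gemma 7B Instruct": ("partial", "closed", "partial", "closed", "closed"),
    "Llama 3 Instruct": ("closed", "closed", "partial", "closed", "closed"),
}


def openness_score(levels):
    """Sum the per-criterion scores for one model."""
    return sum(LEVEL_SCORE[level] for level in levels)


def rank_models(ratings):
    """Return (model, score) pairs, most open first."""
    scored = [(name, openness_score(levels)) for name, levels in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_models(RATINGS):
        print(f"{name:<20} {score:.1f} / {len(CRITERIA)}")
```

A single number like this hides which components are missing, so it is best read alongside the per-criterion ratings rather than as a replacement for them.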

A lack of transparency

The lack of transparency surrounding AI models, especially those developed by large technology firms, raises serious concerns about accountability and oversight.

Without full access to the model's code, training data, and other key components, it becomes extremely difficult to understand how these models work and make decisions. This makes it difficult to identify and fix potential biases, errors, or misuse of copyrighted material.

Copyright infringement in AI training data is a prime example of the issues that arise from this lack of transparency. Many proprietary AI models, such as GPT-3.5, GPT-4, GPT-4o, Claude 3 and Gemini, are likely trained using copyrighted material.

However, since the training data is kept confidential, it is nearly impossible to identify specific data within this material.

The New York Times’ recent lawsuit against OpenAI shows the real consequences of this challenge. OpenAI accused the NYT of using prompt engineering attacks to expose training data and trick ChatGPT into reproducing its articles verbatim, in order to prove that OpenAI's training data contained copyrighted material.

“The Times paid someone to hack OpenAI’s products,” OpenAI said.

In response, Ian Crosby, the New York Times' chief legal counsel, said: “What OpenAI strangely mischaracterizes as 'hacking' is just using OpenAI's products to look for evidence that they stole and reproduced copyrighted works from The Times. And that's exactly what we found.”

In fact, this is only one example from an enormous stack of lawsuits that are currently stalled, partly because of the opaque and impenetrable nature of AI models.

This is just the tip of the iceberg. Without robust transparency and accountability measures, we risk a future where unexplained AI systems make decisions that have profound impacts on our lives, our economy and our society, yet remain shielded from critical scrutiny.

Calls for openness

There have been calls for firms like Google and OpenAI to grant access to the inner workings of their models for the purpose of safety assessment.

The truth, however, is that even AI firms don't really understand how their models work.

This so-called “black box” problem arises when one tries to interpret and explain the specific decisions of the model in a way that is understandable to humans.

For example, a developer may know that a deep learning model is accurate and performs well, but may have difficulty determining exactly which features the model relies on to make its decisions.

Anthropic, developer of the Claude models, recently conducted an experiment to learn how Claude 3 Sonnet works, explaining: “We tend to treat AI models like a black box: something goes in and a response comes out, and it's not clear why the model gave this particular answer and not another. This makes it hard to trust that these models are safe: if we don't understand how they work, how will we know that they won't give harmful, biased, untruthful, or otherwise dangerous answers? How can we trust that they are safe and reliable?”

This experiment illustrated that AI developers do not fully understand the black box of their AI models, and that objectively explaining the outputs is an extremely tricky task.

In fact, Anthropic estimated that opening the black box would require more computing power than training the model itself!

Developers are actively attempting to combat the black box problem through research such as Explainable AI (XAI). The goal of this research is to develop techniques and tools to make AI models more transparent and interpretable.

XAI methods aim to offer insight into the model's decision-making process, highlight the most influential features, and generate human-readable explanations. XAI has already been applied to models used in high-risk applications such as drug development, where understanding how a model works can be critical to safety.

Open source initiatives are critical for XAI and other research that seeks to penetrate the black box and bring transparency to AI models.

Without access to the model's code, training data, and other key components, researchers cannot develop and test techniques to explain how AI systems actually work and identify the specific data they were trained on.

Regulations could further confuse the open source situation

The European Union's recently passed AI Act is set to introduce new rules for AI systems, with provisions specifically addressing open source models.

Under the law, general-purpose open source models up to a certain size are exempt from comprehensive transparency requirements.

However, as Dingemanse and Liesenfeld point out in their study, the exact definition of “open source AI” under the AI Act is still unclear and could become a point of contention.

The law currently defines open-source models as those released under a “free and open” license that allows users to modify the model, but there are no requirements for access to training data or other key components.

This ambiguity leaves room for interpretation and potential lobbying by corporate interests. The researchers warn that refining the open source definition within the AI Act “is likely to create a single pressure point targeted by corporate lobbies and big companies.”

There is a risk that without clear, robust criteria for what truly constitutes open source AI, the regulations may inadvertently create loopholes or incentives for firms to engage in “open washing”, that is, co-opting openness for legal and PR reasons while continuing to treat important elements of their models as proprietary.

In addition, the global nature of AI development means that different regulations in different jurisdictions could further complicate the situation.

If major AI producers such as the US and China take different approaches to openness and transparency requirements, this could lead to a fragmented ecosystem in which the extent of openness varies widely depending on where a model originates.

The study's authors stress that regulators must work closely with the scientific community and other stakeholders to ensure that any open source provisions in AI laws are based on a deep understanding of the technology and the principles of openness.

As Dingemanse and Liesenfeld said in a discussion with Nature: “It is fair to say that the term open source will acquire unprecedented legal significance within the countries where the EU AI Act applies.”

The practical implementation could have far-reaching implications for the future direction of AI research and application.
