Over 16,000 artists’ names have been linked with the non-consensual training of Midjourney’s image generation models.
The Midjourney artist database is attached to an amended lawsuit filed against Stability AI, DeviantArt, and Midjourney as Exhibit J, and appears in a recently leaked public Google spreadsheet, part of which can be viewed on the Internet Archive here.
Artist Jon Lam shared screenshots on X from a Midjourney Discord chat where developers discuss using artist names and styles from Wikipedia and other sources.
The spreadsheet is believed to have originally been sourced from Midjourney’s development team and matches the leaked Discord chats from Midjourney developers, which allude to artists’ work being mapped to ‘styles.’
By encoding artists’ work as ‘styles,’ Midjourney can efficiently recreate work in their style.
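To illustrate what mapping artist names to ‘styles’ could look like in practice, here is a purely hypothetical sketch in Python. The mapping, token names, and prompt format are invented for illustration only and are not Midjourney’s actual implementation:

```python
# Hypothetical sketch: a spreadsheet-like index mapping artist names to
# style identifiers, used to condition an image-generation prompt.
# All names and tokens below are invented for illustration.
STYLE_INDEX = {
    "artist_a": "style_0001",
    "artist_b": "style_0002",
}

def build_prompt(subject: str, artist: str) -> str:
    """Append the style token associated with an artist's name, if one exists."""
    token = STYLE_INDEX.get(artist.lower())
    return f"{subject}, {token}" if token else subject

print(build_prompt("a castle at dusk", "Artist_A"))  # → "a castle at dusk, style_0001"
print(build_prompt("a castle at dusk", "unknown"))   # → "a castle at dusk"
```

The point of the sketch is that once names are reduced to lookup keys, invoking an artist’s style becomes a mechanical string operation — which is what critics mean by artists being “dehumanized to styles.”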
Lam writes, “Midjourney developers caught discussing laundering, and making a database of Artists (who’ve been dehumanized to styles).”
Lam also shared videos of lists of artists, including those used for Midjourney styles and another list of ‘proposed artists.’ Numerous X users stated their names were on these lists.
Midjourney developers caught discussing laundering, and making a database of Artists (who’ve been dehumanized to styles) to train Midjourney off of. This has been submitted into evidence for the lawsuit. Prompt engineers, your “skills” are not yours https://t.co/wAhsNjt5Kz pic.twitter.com/EBvySMQC0P
One screenshot appears to show Midjourney CEO David Holz announcing the addition of 16,000 artists to the training program.
Another shows a Midjourney developer discussing how one would have to “launder it” through a “Codex,” though, without context, it’s hard to say whether that refers to artists’ work.
Others (not Midjourney employees) in the same conversation discuss how processing artwork through an AI model essentially detaches it from copyright.
One says, “all you have to do is just use those scraped datasets and conveniently forget what you used to train the model. Boom legal problems solved forever.”
How legal cases are developing
In legal cases filed against Midjourney and Stability AI — and also against OpenAI, Meta, and Google (for text-based work rather than images) — artists, writers, and others have found it difficult to prove their work is actually ‘inside’ the model verbatim.
That would be the smoking gun they need to prove copyright violations.
Copyright, in general, remains poorly defined in the era of AI. AI models are trained on data that has to come from somewhere, and what better source for that data than the internet?
Developers ‘scrape’ what is termed ‘open,’ ‘open-source,’ or ‘public’ data from the internet, but again, these concepts are poorly defined. It would be fair to say that when AI developers smelled the coming gold rush, they seized as much ‘open’ data from the internet as they could and used it to train their models.
Legal processes are slow; AI moves at lightspeed by comparison. It was easy for developers to outflank copyright law and train models long before copyright holders and the law governing intellectual property could react.
The response is now underway, but both the AI training process and the technical process of generating AI outputs (e.g., text or images) from user inputs challenge the nature of intellectual property law.
Specifically, it’s a) hard to prove that AI models were definitively trained on copyrighted material and b) hard to prove that their outputs replicate copyrighted material closely enough.
There’s also the issue of accountability. AI companies like OpenAI and Midjourney at least partly used data harvested by others rather than harvesting it themselves. So wouldn’t the original data scrapers be liable for infringement?
In the context of this latest situation at Midjourney, its models, like others, reproduce a blend of the works contained in their training data, and artists can’t easily prove which of their pieces were used.
For example, when a recent copyright case against Midjourney, Stability AI, and DeviantArt was dismissed (it has since been refiled with new plaintiffs), Federal Judge Orrick identified several defects in the way the claims were framed, particularly in their understanding of how AI image generators function.
The original lawsuit alleged that Stability AI, in training its Stable Diffusion model, stored compressed copies of the images.
Stability AI refuted this, clarifying that the training process involves extracting attributes such as lines, shades, and colours, and developing parameters based on these attributes rather than storing copies of the images.
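The distinction Stability AI is drawing — a model that retains learned parameters derived from image attributes, rather than copies of the images — can be illustrated with a toy sketch. Everything here (the hand-picked features, the linear model) is a deliberately simplified stand-in and not how Stable Diffusion actually works:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image):
    # Toy stand-ins for attributes like lines, shades, and colours:
    # overall brightness plus average vertical/horizontal gradients.
    return np.array([
        image.mean(),
        np.abs(np.diff(image, axis=0)).mean(),
        np.abs(np.diff(image, axis=1)).mean(),
    ])

def train(images, targets, lr=0.1, steps=200):
    weights = np.zeros(3)  # the only thing the "model" keeps after training
    for _ in range(steps):
        for img, y in zip(images, targets):
            feats = extract_features(img)
            pred = feats @ weights
            weights -= lr * (pred - y) * feats  # gradient-descent step
    return weights

images = [rng.random((8, 8)) for _ in range(4)]  # four 64-pixel "training images"
targets = [0.2, 0.5, 0.7, 0.9]
w = train(images, targets)
print(w.shape)  # → (3,)
```

After training, `w` holds just three numbers; the 64-pixel training images cannot be reconstructed from them. Whether diffusion models at billions of parameters memorize more than this toy suggests is precisely what the litigation disputes.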
Orrick’s ruling highlighted the need for the plaintiffs to amend their claims to more accurately represent how these AI models operate.
This includes a clearer explanation of whether the claim against Midjourney stems from its use of Stable Diffusion, its independent use of training images, or both (as Midjourney is also accused of using Stability AI’s models, which allegedly rely on copyrighted works).
Another challenge for the plaintiffs is demonstrating that Midjourney’s outputs are substantially similar to their original artworks. Orrick noted that the plaintiffs themselves admitted that output images from Stable Diffusion are unlikely to closely match any specific image in the training data.
As of now, the case is alive, with the court denying the AI companies’ most recent attempts to dismiss the artists’ claims.
Gen Ai techbros would have you believe the lawsuit is dead or thrown out, no, the lawsuit is still alive and well, and more evidence and plaintiffs have been added to the casefile.
Updated Casefile here. https://t.co/uTqs6grWRE
LAION dataset usage thrown into the mix
Legal cases filed against Midjourney and co. also highlight their potential use of the LAION-5B dataset – a compilation of 5.85 billion internet-sourced images, including copyrighted content.
Stanford recently blasted LAION for containing illicit sexual imagery, including child sexual abuse material, as well as sexist, racist, and otherwise deplorable content – all of which now also ‘lives’ inside the AI models that society is beginning to rely on for creative and professional uses.
The long-term implications are hotly debated, but the fact that these AIs were possibly trained, first, on stolen work and, second, on illegal content doesn’t cast a positive light on AI development in general.
Midjourney developer comments have been widely lambasted on social media and the Y Combinator forum.
It’s very likely that 2024 will cook up more fiery legal debates, and the Wild West chapter of AI development could be coming to a close.