A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.
The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. The revelation led to an avalanche of writers expressing shock on social media that their work had been included without their permission, and sharing concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.
A CBC News analysis of the dataset, called Books3, identified thousands of Canadian authors and books in both official languages.
Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three quarters of CBC’s Canada Reads contenders and Scotiabank Giller award nominees are featured, along with over a third of all Governor General’s Literary Award finalists.
Authors shocked to find their books used to train AI without permission
Some of Canada’s most famous authors were shocked to find that their books have been used without their permission to train artificial intelligence software. The Writers’ Union of Canada says it is considering a lawsuit, but one law professor says it’s not clear if using the books to train AI is illegal.
Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.
“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.
Korman told CBC News he had read about the dataset and knew some of his books were in it.
“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”
Massive datasets like Books3 are used to train artificial intelligence models to interpret human language — as in read and write like us. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.
But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.
“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”
Canadian author ‘flattered and concerned’
Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.
“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.
Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.
Like Korman, Hayden Taylor is concerned about copyright violations of his work.
“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”
Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.
“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”
‘Unbelievably disrespectful’
CBC News also found that one out of every six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.
“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.
Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.
“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”
According to Degen, the Books3 dataset is a violation of Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths, because it accessed entire works of art without prior approval by the artist.
“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.
He says WUC is in “deep research phase” and looking at all possible legal remedies, including launching a lawsuit.
The Current (24:02): Could AI put authors out of business?
Hundreds of writers have learned that their books have been used to train artificial intelligence to spit out imitations. Bestselling authors Sean Michaels and Linwood Barclay discuss what AI might mean for human creativity and artist compensation.
Legality of dataset unclear, copyright expert says
Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable if the existence or use of Books3 is illegal under Canada’s copyright law.
“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”
In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”
She says what is essential to understand is that massive datasets like Books3 are mainly used by AI as data points to understand patterns in language. That’s not the same as authorship, Craig says.
“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”
Multiple U.S. lawsuits
The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.
Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset and it was removed last August.
As It Happens (5:59): Authors launch lawsuit accusing OpenAI of pirating their books to train ChatGPT
The Authors Guild, a U.S. trade group for writers, filed the proposed class-action on Tuesday on behalf of 17 plaintiffs. One of them, novelist Douglas Preston, spoke to As It Happens host Nil Köksal.
The lawsuits mention tech companies OpenAI — the company behind ChatGPT — Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.
Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.
On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Meta, Facebook’s parent company. The ruling states that based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”
“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”
One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab which created and launched Books3 in October 2020.
In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”
The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”
Why Montreal writers want AI to stop stealing their work
Local writers, such as Heather O’Neill, Trevor Ferguson and Rosemary Sullivan, say they’re interested in participating in legal action against artificial intelligence companies for using their writing to train bots to mimic their writing styles.
Ottawa may review copyright law
Canada’s government is also pondering whether copyright law should be changed with respect to the challenges posed by AI.
This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”
“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.
Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.
“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”
When CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted the prose contained specific Indigenous words and cultural references and the work sounded “eerily like it could be me.”
He joked that AI should be renamed Artificially Indigenous.
“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”
METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3
To identify authors, CBC News used Python and regular expressions (regex) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were put through the ISBNdb worldwide database to retrieve their title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files could not be identified (4.5%). Upon inspection, 1,284 files were found to be completely empty and 205 files were duplicates; all were excluded from this analysis.
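For readers curious about what the extraction step might look like, here is a minimal sketch in Python. It is an illustration only, not CBC's actual code: the regular expression, the function names and the sample text are all assumptions, and the article does not say whether CBC validated check digits. The lookup against ISBNdb is left out entirely.

```python
import re

# Hypothetical pattern (not CBC's): the label "ISBN", an optional "-10" or
# "-13" suffix, then a 10- or 13-character code that may contain hyphens
# or spaces between digits.
ISBN_PATTERN = re.compile(
    r"ISBN(?:-1[03])?:?\s*((?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx])"
)


def normalize(raw):
    """Strip hyphens and spaces so only the code characters remain."""
    return re.sub(r"[-\s]", "", raw).upper()


def is_valid_isbn(code):
    """Check the standard ISBN-10 or ISBN-13 check digit."""
    if len(code) == 10:
        if not re.fullmatch(r"\d{9}[\dX]", code):
            return False
        # ISBN-10: weights run from 10 down to 1; "X" stands for 10.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(code)
        )
        return total % 11 == 0
    if len(code) == 13:
        if not code.isdigit():
            return False
        # ISBN-13: weights alternate 1, 3, 1, 3, ...
        total = sum(
            int(ch) * (1 if i % 2 == 0 else 3)
            for i, ch in enumerate(code)
        )
        return total % 10 == 0
    return False


def extract_isbns(text):
    """Return the valid ISBNs found in text, in order, without repeats."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        code = normalize(match.group(1))
        if is_valid_isbn(code) and code not in found:
            found.append(code)
    return found


# Made-up sample text standing in for the front matter of one file.
sample = (
    "Copyright 2010. ISBN 978-0-306-40615-7 (print)\n"
    "ISBN-10: 0-306-40615-2\n"
    "ISBN 978-0-306-40615-8\n"  # wrong check digit, so it is skipped
)
print(extract_isbns(sample))
```

Checking the check digit is a design choice in this sketch: it filters out digit strings that merely look like ISBNs, at the cost of dropping codes that were mistyped in the source file.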
VANCOUVER – Contract negotiations resume today in Vancouver in a labour dispute that has paralyzed container cargo shipping at British Columbia’s ports since Monday.
The BC Maritime Employers Association and International Longshore and Warehouse Union Local 514 are scheduled to meet for the next three days in mediated talks to try to break a deadlock in negotiations.
The union, which represents more than 700 longshore supervisors at ports, including Vancouver, Prince Rupert and Nanaimo, has been without a contract since March last year.
The latest talks come after the employers locked out workers in response to what they said was “strike activity” by union members.
The start of the lockout was then followed by several days of no engagement between the two parties, prompting federal Labour Minister Steven MacKinnon to speak with leaders on both sides, asking them to restart talks.
MacKinnon had said that the talks were “progressing at an insufficient pace, indicating a concerning absence of urgency from the parties involved” — a sentiment echoed by several business groups across Canada.
In a joint letter, more than 100 organizations, including the Canadian Chamber of Commerce, Business Council of Canada and associations representing industries from automotive and fertilizer to retail and mining, urged the government to do whatever it takes to end the work stoppage.
“While we acknowledge efforts to continue with mediation, parties have not been able to come to a negotiated agreement,” the letter says. “So, the federal government must take decisive action, using every tool at its disposal to resolve this dispute and limit the damage caused by this disruption.
“We simply cannot afford to once again put Canadian businesses at risk, which in turn puts Canadian livelihoods at risk.”
In the meantime, the union says it has filed a complaint to the Canada Industrial Relations Board against the employers, alleging that the association, in direct contact with union members, threatened to pull existing conditions out of the last contract.
“The BCMEA is trying to undermine the union by attempting to turn members against its democratically elected leadership and bargaining committee — despite the fact that the BCMEA knows full well we received a 96 per cent mandate to take job action if needed,” union president Frank Morena said in a statement.
The employers have responded by calling the complaint “another meritless claim,” adding that the final offer to the union, which includes a 19.2 per cent wage increase over a four-year term, remains on the table.
“The final offer has been on the table for over a week and represents a fair and balanced proposal for employees, and if accepted would end this dispute,” the employers’ statement says. “The offer does not require any concessions from the union.”
The union says the offer does not address the key issue of staffing requirements at the terminals as the port introduces more automation to cargo loading and unloading, which could potentially require fewer workers to operate than older systems.
The Port of Vancouver is the largest in Canada and has seen a number of labour disruptions, including two instances involving the rail and grain storage sectors earlier this year.
A 13-day strike by another group of workers at the port last year resulted in the disruption of a significant amount of shipping and trade.
This report by The Canadian Press was first published Nov. 9, 2024.
The Royal Canadian Legion says a new partnership with e-commerce giant Amazon is helping boost its veterans’ fund, and will hopefully expand its donor base in the digital world.
Since the Oct. 25 launch of its Amazon.ca storefront, the legion says it has received nearly 10,000 orders for poppies.
Online shoppers can order lapel poppies on Amazon in exchange for donations or buy items such as “We Remember” lawn signs, Remembrance Day pins and other accessories, with all proceeds going to the legion’s Poppy Trust Fund for Canadian veterans and their families.
Nujma Bond, the legion’s national spokesperson, said the organization sees this move as keeping up with modern purchasing habits.
“As the world around us evolves we have been looking at different ways to distribute poppies and to make it easier for people to access them,” she said in an interview.
“This is definitely a way to reach a wider number of Canadians of all ages. And certainly younger Canadians are much more active on the web, on social media in general, so we’re also engaging in that way.”
Al Plume, a member of a legion branch in Trenton, Ont., said the online store can also help with outreach to veterans who are far from home.
“For veterans that are overseas and are away, (or) can’t get to a store they can order them online, it’s Amazon,” Plume said.
Plume spent 35 years in the military with the Royal Engineers, and retired eight years ago. He said making sure veterans are looked after is his passion.
“I’ve seen the struggles that our veterans have had with Veterans Affairs … and that’s why I got involved, with making sure that the people get to them and help the veterans with their paperwork.”
But the message about the Amazon storefront didn’t appear to reach all of the legion’s locations, with volunteers at Branch 179 on Vancouver’s Commercial Drive saying they hadn’t heard about the online push.
Holly Paddon, the branch’s poppy campaign co-ordinator and bartender, said the Amazon partnership never came up in meetings with other legion volunteers and officials.
“I work at the legion, I work with the Vancouver poppy office and I go to the meetings for the Vancouver poppy campaign — which includes all the legions in Vancouver — and not once has this been mentioned,” she said.
Paddon said the initiative is a great idea, but she would like to have known more about it.
The legion also sells a larger collection of items at poppystore.ca.
This report by The Canadian Press was first published Nov. 9, 2024.