A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.
The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. The revelation led to an avalanche of writers expressing shock on social media that their work had been included without their permission and sharing concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.
A CBC News analysis of the dataset, called Books3, identified thousands of Canadian authors and books in both official languages.
Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three quarters of CBC’s Canada Reads contenders and Scotiabank Giller Prize nominees are featured, along with over a third of all Governor General’s Literary Award finalists.
Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.
“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.
Korman told CBC News he had read about the dataset and knew some of his books were in it.
“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”
Massive datasets like Books3 are used to train artificial intelligence models to interpret human language, that is, to read and write like us. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.
But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.
“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”
Canadian author ‘flattered and concerned’
Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.
“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.
Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.
Like Korman, Hayden Taylor is concerned about copyright violations of his work.
“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”
Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.
“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”
‘Unbelievably disrespectful’
CBC News also found that one out of every six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.
“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.
Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.
“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”
According to Degen, the Books3 dataset is a violation of Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths, because it accessed entire works of art without prior approval by the artist.
“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.
He says WUC is in “deep research phase” and looking at all possible legal remedies, including launching a lawsuit.
Legality of dataset unclear, copyright expert says
Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable if the existence or use of Books3 is illegal under Canada’s copyright law.
“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”
In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”
She says what is essential to understand is that massive datasets like Books3 are mainly used by AI as data points to understand patterns in language. That’s not the same as authorship, Craig says.
“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”
Multiple U.S. lawsuits
The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.
Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset and it was removed last August.
The lawsuits mention tech companies OpenAI — the company behind ChatGPT — Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.
Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.
On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Facebook’s parent company, Meta. The ruling states that based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”
“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”
One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab which created and launched Books3 in October 2020.
In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”
The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”
Ottawa may review copyright law
Canada’s government is also pondering whether copyright law should be changed with respect to the challenges posed by AI.
This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”
“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.
Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.
“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”
After CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted the prose contained specific Indigenous words and cultural references and the work sounded “eerily like it could be me.”
He joked that AI should be renamed Artificially Indigenous.
“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”
METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3
To identify authors, CBC News used Python programming with regular expressions (RegEx) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were put through the ISBNdb worldwide database to retrieve their title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files could not be identified (4.5%). Upon inspection, 1,284 files were completely empty and 205 files were duplicates; all were excluded from this analysis.
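The ISBN extraction step described above could look roughly like the sketch below. This is not CBC News's actual code: the pattern, the function names and the checksum filtering are all assumptions made for illustration. It searches a file's text for ISBN-10 and ISBN-13 candidates, strips hyphens and spaces, and keeps only codes whose check digit is valid, so that unrelated number strings are discarded before any database lookup.

```python
import re

# Hypothetical pattern (not from the article): an optional 978/979 prefix,
# nine digits, then a check character (a digit, or X for ISBN-10),
# with optional hyphens or spaces between characters.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?\d(?:[- ]?\d){8}[- ]?[\dXx]\b")


def is_valid_isbn(code: str) -> bool:
    """Return True if a normalized ISBN-10 or ISBN-13 has a valid check digit."""
    if re.fullmatch(r"\d{9}[\dX]", code):
        # ISBN-10: weights 10 down to 1, X counts as 10, total divisible by 11.
        values = [10 if ch == "X" else int(ch) for ch in code]
        return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0
    if re.fullmatch(r"\d{13}", code):
        # ISBN-13: alternating weights 1 and 3, total divisible by 10.
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(code))
        return total % 10 == 0
    return False


def extract_isbns(text: str) -> list[str]:
    """Return the unique, checksum-valid ISBNs found in text, in order of appearance."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        # Normalize: drop separators and uppercase a trailing x.
        code = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn(code) and code not in found:
            found.append(code)
    return found
```

In a workflow like the one described, each .epub.txt file's contents would be passed to a function such as `extract_isbns`, and the resulting codes would then be looked up in a bibliographic database. Validating the check digit first is a design choice in this sketch, not something the methodology states.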
Data collection: Valérie Ouellet and Shaki Sutharsan (Oct.-Nov. 2023)
Data analysis and verification: Valérie Ouellet and Sylvène Gilchrist (Oct.-Nov. 2023)
MONTREAL – Matt Coronato scored the game-tying goal and the overtime winner in a dramatic finish, but video coach Jamie Pringle was the hero on Tuesday night.
Before Coronato powered a Calgary Flames comeback, Brendan Gallagher appeared to give the Montreal Canadiens a 3-1 lead with 8:24 remaining in the third.
Pringle, however, instantly flagged the goal for offside. Then the Flames challenged successfully, and Coronato did the rest as Calgary flipped the script and won 3-2 in overtime.
“I was just saying that a post is normally a goalie’s best friend, but I think the video coach is now number two,” Flames netminder Dustin Wolf said.
Canadiens forward Josh Anderson had set up Gallagher on an odd-man rush, though it was unclear in real-time if Anderson had full control of the puck when he entered the Flames’ zone backward.
The Bell Centre’s roof nearly blew off with Canadiens fans celebrating like it was a sure thing, but Pringle thought otherwise.
“We’ve always been confident in Jamie,” Flames head coach Ryan Huska said. “He’s the best guy in the league. So another situation where he flashed it up, challenge right away.
“We don’t get this win if it’s not for the courage that he showed. You have a great guy in that chair for a reason. And Jamie did a great job for us, keeping us in this game tonight.”
Pringle, a 49-year-old from Picton, Ont., who’s also known as “Chips,” is in his 14th season with the Flames.
And it wasn’t the first time he played a crucial role in a victory this season.
In Calgary’s 4-1 win over the Edmonton Oilers on Oct. 13, the video coach successfully challenged two goals, including one Corey Perry deflection that the hockey world was convinced should have counted.
Pringle made the snap decision anyway, even though a failed challenge would put his team on the penalty kill.
“He’s hot this year,” forward Blake Coleman said. “You know what? He needed to redeem himself after a tough last year. We had some good chats down the stretch, and he’s been on fire.
“I’d say of all the guys on our team, he’s probably the one who hasn’t missed a night so far.”
Coronato showed up at the right time on Tuesday.
The 21-year-old winger tied the game with 2:46 remaining in regulation when he cruised into the slot and went off the post and in. He then buried the winning goal seven seconds into the extra period, coming one second shy of tying the fastest OT goal in NHL history.
“He’s remarkable. He’s had so many chances to score, and he’s kind of been snake-bit a few times,” Wolf said. “To see him score on two unbelievable shots, that’s a scouting report on him, his shot’s lethal.”
“The kid can shoot it,” Coleman added. “Couple big ones.”
Coronato, a 13th overall pick in the 2021 NHL draft, spent most of last season in the American Hockey League with the Calgary Wranglers.
This season, he’s played two games in the AHL and eight in the NHL. And with performances like Tuesday’s, he can expect plenty more in the big leagues.
“Sometimes with younger players, you put them in the American League for a bit and it’s hard on them,” Huska said. “There’s a long-term plan for sure. We know how good he’s going to be for us. We just want to make sure that we are putting him in situations that he’s going to be ready for and be able to have success.
“He’s done an excellent job of preparing himself to play, and we saw the result of his effort tonight.”
The Flames (7-5-1) picked up their second win in seven games to kick off a three-game road trip. Meanwhile, the Canadiens (4-7-2) dropped their fourth in a row ahead of four games away from home.
“We didn’t throw up on ourselves tonight, but we still feel a bit sick to our stomachs,” head coach Martin St. Louis said, referencing a post-game assessment he delivered after a 6-3 loss last week in Washington.
The Canadiens didn’t paint a picture of doom and gloom in the dressing room despite coming a couple minutes shy of securing two points and snapping their skid, but St. Louis said his players should leave this game “hungry” to get in the win column.
“If I was in their shoes, I’d wish we played tomorrow,” he said. “That’s what I would want to feel like. That’s what I want to be like.”
This report by The Canadian Press was first published Nov. 5, 2024.
ST. LOUIS (AP) — St. Louis Blues forward Dylan Holloway left Tuesday night’s contest against the Tampa Bay Lightning and departed the rink on a stretcher after being struck by a puck late in the first period.
Holloway was hit in the neck area by a puck with 2:37 remaining in the period, and proceeded to finish his shift, continuing to participate in the play before skating to the bench under his own power.
As play was stopped with 1:11 remaining for a high-sticking penalty that was later called off, teammates started calling and gesturing for assistance.
Blues trainer Ray Barile and medical staff from both teams tended to Holloway for several minutes before emergency medical technicians carted him off the bench on a stretcher.
“I was just sitting beside him and saw something was happening,” Blues forward Alexey Toropchenko said. “I told Ray. He knows what he’s doing. I was just kind of curious to what’s going on. Doctors came in and, like, I think everything is good right now. But we were worried, everybody.”
Holloway was seen raising his arm as he was carted off. The Blues later announced that Holloway was alert and stable and was rushed to a St. Louis area hospital for further observation.
“I think the only way I can put is if you’re at work, and you get a call, and one of your family members is sick, and you rush to the hospital,” Blues coach Drew Bannister said.
“Holly’s a family member. That was tough. I thought we, as a group, showed a lot of fortitude, and the way mentally being able to push through that, because the easiest thing to do is your head goes somewhere else. But, we were able to get updates on Holly and kind of put our minds at ease a little bit and refocus ourselves.”
Due to the nature of the injury, referees Wes McCauley and Cody Beach sent the teams to their locker rooms and started the first intermission after Holloway was transported off the bench.
“It’s hard,” Blues captain Brayden Schenn said. “It’s your teammate. Then we got news that he’s going to be fine. And then, you have to wrap your head around it a little bit and go play a hockey game again, right?
“So that’s just, unfortunately, the reality of the sport, and it took us awhile to get going.”
St. Louis rallied to score three goals after falling behind 1-0 early in the second period to beat Tampa Bay 3-2.
WINNIPEG – Nino Niederreiter showed his veteran savvy in his 900th NHL career game on Tuesday.
The Winnipeg Jets forward scored twice and Connor Hellebuyck made 21 saves in a 3-0 victory over the Utah Hockey Club that kept the team’s early-season success rolling with a fourth consecutive win (12-1-0).
On his first goal, the 32-year-old Niederreiter lifted a Utah opponent’s stick in Winnipeg’s end, allowing the Jets to get the puck and head toward the visitors’ net.
Niederreiter then joined the rush, deked and put the puck around netminder Karel Vejmelka for a 2-0 lead at 7:30 of the third period with his sixth goal of the season.
“Obviously, the game wasn’t very pretty,” Niederreiter said. “There wasn’t a whole lot of flow out there. I think that is something that we knew and just had to stick with and do the little things right.
“Eventually, we would create our own luck and that’s what happened there.”
And what about his deke in front of 12,932 fans at Canada Life Centre?
“I still got it somewhere in there,” Niederreiter said with a smile. “It’s a great feeling, like I said. It’s a cool night to score a goal like that.”
His second goal — the 230th of his career — was into an empty net with 2:59 remaining. He also has 225 assists for 455 career points.
Gabriel Vilardi scored the first goal at 17:57 of the second period on the power play and Adam Lowry picked up two assists.
Hellebuyck recorded his second shutout of the season and 39th of his career.
Niederreiter signed a three-year contract extension with the Jets last December. The $12-million deal kicked in this season.
He’s now scored against 33 NHL teams, including the Jets.
“It’s a cool stat, but I think it also says that I’ve been traded a few times,” he said. “But I guess it gives me the chance to do that.”
Niederreiter was drafted in 2010 by the New York Islanders (fifth overall), becoming Switzerland’s highest NHL pick.
He’s also played for the Minnesota Wild, Carolina Hurricanes and Nashville Predators before being traded to the Jets in February 2023.
Jets head coach Scott Arniel was impressed by Niederreiter’s quick-thinking stick lift.
“We’ll throw that on the old system video,” he said. “But that’s just going the distance, coming all the way back and he creates that.
“We’re never out of it. You never know how a puck’s going to bounce. He just kept coming and obviously we turned that offence the other way.”
Arniel said the team recognized Niederreiter’s milestone.
“That’s special. That’s a lot of games,” Arniel said. “We had a little tribute to him, saw all his pictures from all the jerseys he’s worn and the places he’s played.
“He hasn’t changed a bit. He’s a big power forward and that line I thought was really good. They take that (Clayton) Keller line on, those skill guys. They did a really good job.”
Niederreiter is on a line with Lowry and Mason Appleton.
“Those guys on the PK were really strong,” Arniel added. “When that line plays like that they’re a force, they’re hard to handle. They wear teams down because they spend so much time in the offensive zone.”
Utah (5-5-3) ended a run of picking up points in three consecutive games (1-0-2).
Vejmelka stopped 25 shots for Utah in its second game of a four-game road trip.
“They know what to expect of each other. They play a really, really structured game, and they were patient tonight,” Utah head coach Andre Tourigny said of the Jets.
“I think that was a good chess game. They got one on the power play and from there they waited for the opportunity to have a killer goal. They did a good job.”
NOTES: Jets defenceman Josh Morrissey picked up his 14th assist of the season when his point shot with five seconds left in a power play was tipped in by Vilardi. … Kyle Connor had his franchise-record, season-opening points streak end at 12 games. He almost picked up an assist until Vilardi tipped in Morrissey’s shot.
This report by The Canadian Press was first published Nov. 5, 2024.