We combed through a huge trove of data used to train AI and found thousands of books by Canadian authors | Canada News Media

A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.

The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. The revelation prompted an avalanche of writers to express shock on social media that their work had been included without their permission, and to share concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.

A CBC News analysis of the dataset, called Books3, identified thousands of Canadian authors and books in both official languages.

Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three-quarters of CBC’s Canada Reads contenders and Scotiabank Giller Prize nominees are featured, along with over a third of all Governor General’s Literary Award finalists.

 

Authors shocked to find their books used to train AI without permission

 

Some of Canada’s most famous authors were shocked to find that their books have been used without their permission to train artificial intelligence software. The Writers’ Union of Canada says it is considering a lawsuit, but one law professor says it’s not clear if using the books to train AI is illegal.

Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.

“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.

Korman told CBC News he had read about the dataset and knew some of his books were in it.

“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”

Canadian author Gordon Korman says he wants to know how 28 of his young adult books found their way into the Books3 dataset, a huge trove of data used by artificial intelligence companies to train their large language models. (gordonkorman.com)

Massive datasets like Books3 are used to train artificial intelligence models to interpret human language, that is, to read and write as people do. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.

But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.

“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”

Canadian author ‘flattered and concerned’

Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.

“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.

Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.

Like Korman, Hayden Taylor is concerned about copyright violations of his work.

Author Drew Hayden Taylor from Curve Lake First Nation just north of Peterborough, Ont., has nine books in the now defunct Books3 dataset. (David Hall/CBC)

“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”

Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.

“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”

‘Unbelievably disrespectful’

CBC News also found that one in six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.

“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.

Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.

John Degen, the executive director of The Writers’ Union of Canada, says the Books3 dataset is a violation of Canadian copyright law and believes it should be addressed legally and by Parliament. (Alexis Raymon/CBC)

“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”

According to Degen, the Books3 dataset is a violation of Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths, because it accessed entire works of art without prior approval by the artist.

“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.

He says WUC is in “deep research phase” and looking at all possible legal remedies, including launching a lawsuit.

LISTEN | Best-selling authors on what AI means for human creativity:

The Current (24:02): Could AI put authors out of business?

Hundreds of writers have learned that their books have been used to train artificial intelligence to spit out imitations. Bestselling authors Sean Michaels and Linwood Barclay discuss what AI might mean for human creativity and artist compensation.

Legality of dataset unclear, says copyright expert

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable whether the existence or use of Books3 is illegal under Canada’s copyright law.

“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, co-authored a 2022 submission calling for the Canadian government to broaden copyright law to allow for AI research and analysis, including text and data mining. (Alexis Raymon/CBC)

In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”

She says it is essential to understand that massive datasets like Books3 are mainly used by AI as data points for learning patterns in language. That’s not the same as authorship, Craig says.

“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”

Multiple U.S. lawsuits

The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.

Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset and it was removed last August.

LISTEN | These authors say OpenAI stole their books to train ChatGPT: 

As It Happens (5:59): Authors launch lawsuit accusing OpenAI of pirating their books to train ChatGPT

The Authors Guild, a U.S. trade group for writers, filed the proposed class-action on Tuesday on behalf of 17 plaintiffs. One of them, novelist Douglas Preston, spoke to As It Happens host Nil Köksal.

The lawsuits mention tech companies OpenAI — the company behind ChatGPT — Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.

Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.

On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Meta, Facebook’s parent company. The ruling states that, based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”

“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”

One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab which created and launched Books3 in October 2020.

In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”

The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”

 

Why Montreal writers want AI to stop stealing their work

 

Local writers, such as Heather O’Neill, Trevor Ferguson and Rosemary Sullivan, say they’re interested in participating in legal action against artificial intelligence companies for using their writing to train bots to mimic their writing styles.

Ottawa may review copyright law

Canada’s government is also considering whether copyright law should be changed to address the challenges posed by AI.

This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”

“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.

Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.

“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”

When CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted that the prose contained specific Indigenous words and cultural references, and that the work sounded “eerily like it could be me.”

He joked that AI should be renamed Artificially Indigenous.

“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”

METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3

To identify authors, CBC News used Python and regular expressions (regex) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were run through the ISBNdb worldwide database to retrieve each book’s title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files (4.5%) could not be identified. Upon inspection, 1,284 files were completely empty and 205 files were duplicates; all of these were excluded from this analysis.
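The ISBN extraction step could be sketched roughly as follows. CBC News did not publish its code, so the pattern, the function names (`find_isbns`, `is_valid_isbn10`, `is_valid_isbn13`) and the checksum filtering are illustrative assumptions rather than the newsroom's actual method. The ISBNdb lookup is left out.

```python
import re

# Hypothetical pattern: the literal "ISBN" (optionally "-10" or "-13"), then a
# run of digits, hyphens and spaces that may end in the ISBN-10 check char X.
ISBN_CANDIDATE = re.compile(
    r"ISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,15}[0-9Xx])"
)


def is_valid_isbn10(code: str) -> bool:
    """Check the ISBN-10 weighted checksum (weights 10 down to 1, modulo 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", code):
        return False
    total = sum(
        (10 if ch == "X" else int(ch)) * weight
        for ch, weight in zip(code, range(10, 0, -1))
    )
    return total % 11 == 0


def is_valid_isbn13(code: str) -> bool:
    """Check the ISBN-13 checksum (alternating weights 1 and 3, modulo 10)."""
    if not re.fullmatch(r"\d{13}", code):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(code))
    return total % 10 == 0


def find_isbns(text: str) -> list[str]:
    """Return the distinct, checksum-valid ISBNs found in text, in order."""
    found: list[str] = []
    for match in ISBN_CANDIDATE.finditer(text):
        # Drop hyphens and spaces so "978-0-306-40615-7" becomes 13 digits.
        code = re.sub(r"[\s-]", "", match.group(1)).upper()
        if (is_valid_isbn10(code) or is_valid_isbn13(code)) and code not in found:
            found.append(code)
    return found
```

Validating the checksum discards digit runs that merely look like ISBNs. A pattern this simple can still miss ISBNs that are formatted unusually or followed directly by other digits, which is one plausible reason some files need a fallback such as reading the author and title from the text.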

To identify Canadian and Québécois authors, CBC News compared the full names of writers in Books3 against a list of 7,800 Canadian and Québécois writers, including: the online member directories of the Writers’ Union of Canada (WUC) and the Union des Écrivaines et des Écrivains Québécois (UNEQ), all past contenders/winners of the Canada Reads competition (2002-2022), all longlisted/shortlisted books for the Scotiabank Giller Prize (1994-2023), Trillium Book Award (1994-2023) and Governor General’s Literary Awards (1936-2023) — in French and English — and all writers who benefited from the Writers’ Trust of Canada (1976-2023) programs or awards. Additionally, CBC News compared Books3 titles against a list of 195,000 documents published in Quebec since 2010. Every author match was verified against book titles and the author’s country of citizenship, place of birth and biography to ensure that writers with the same name living in another country weren’t included.
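The name comparison could be sketched along these lines. The normalization rules (case folding, stripping accents and punctuation) and the names `normalize_name` and `match_authors` are assumptions made for illustration; the methodology only states that full names were compared and that every match was then verified.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    letters_only = re.sub(r"[^a-z0-9 ]+", " ", without_accents.casefold())
    return " ".join(letters_only.split())


def match_authors(dataset_authors, reference_authors):
    """Return dataset author names whose normalized form is in the reference list.

    These are only candidates: each match still needs the manual check
    against titles, citizenship and biography to rule out namesakes.
    """
    reference = {normalize_name(name) for name in reference_authors}
    return [name for name in dataset_authors if normalize_name(name) in reference]
```

Normalizing both sides lets "Patrick Sénécal" in a reference list match "Patrick Senecal" in file metadata. Exact matching on normalized names cannot tell namesakes apart, which is why a verification pass like the one described above remains necessary.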

Data collection: Valérie Ouellet and Shaki Sutharsan (Oct.-Nov. 2023)
Data analysis and verification: Valérie Ouellet and Sylvène Gilchrist (Oct.-Nov. 2023)

 


RCMP arrest second suspect in deadly shooting east of Calgary


 

EDMONTON – RCMP say a second suspect has been arrested in the killing of an Alberta county worker.

Mounties say 28-year-old Elijah Strawberry was taken into custody Friday at a house on O’Chiese First Nation.

Colin Hough, a worker with Rocky View County, was shot and killed while on the job on a rural road east of Calgary on Aug. 6.

Another man who worked for Fortis Alberta was shot and wounded, and RCMP said the suspects fled in a Rocky View County work truck.

Police later arrested Arthur Wayne Penner, 35, and charged him with first-degree murder and attempted murder, and a warrant was issued for Strawberry’s arrest.

RCMP also said there was a $10,000 reward for information leading to the arrest of Strawberry, describing him as armed and dangerous.

Chief Supt. Roberta McKale told a news conference in Edmonton that officers had received tips and information over the last few weeks.

“I don’t know of many members that when [we] were stopped, fuelling up our vehicles, we weren’t keeping an eye out, looking for him,” she said.

But officers had been investigating other cases when they found Strawberry.

“Our investigators were in O’Chiese First Nation at a residence on another matter and the major crimes unit was there working another file and ended up locating him hiding in the residence,” McKale said.

While an investigation is still underway, RCMP say they’re confident both suspects in the case are in police custody.

This report by The Canadian Press was first published Sept. 13, 2024.

The Canadian Press. All rights reserved.


26-year-old son is accused of his father’s murder on B.C.’s Sunshine Coast


RICHMOND, B.C. – The Integrated Homicide Investigation Team says the 26-year-old son of a man found dead on British Columbia’s Sunshine Coast has been charged with his murder.

Police say 58-year-old Henry Doyle was found badly injured on a forest service road in Egmont last September and died of his injuries.

The homicide team took over when the BC Coroners Service said the man’s death was suspicious.

The team says in a statement that the BC Prosecution Service has approved one count of first-degree murder against the man’s son, Jackson Doyle.

Police say the accused will remain in custody until at least his next court appearance.

The homicide team says investigators remained committed to solving the case with the help of the community of Egmont, the RCMP on the Sunshine Coast and in Richmond, and the Vancouver Police Department.

This report by The Canadian Press was first published Sept. 13, 2024.

The Canadian Press. All rights reserved.




Metro Vancouver’s HandyDART strike continues after talks break with no deal


 

VANCOUVER – Mediated talks between the union representing HandyDART workers in Metro Vancouver and its employer, Transdev, have broken off without an agreement after 15 hours of negotiations.

Joe McCann, president of Amalgamated Transit Union Local 1724, says they stayed at the bargaining table with help from a mediator until 2 a.m. Friday and made “some progress.”

However, he says the union negotiators didn’t get an offer that they could recommend to the membership.

McCann says that in some ways they are close to an agreement, but in other areas they are “miles apart.”

About 600 employees of the door-to-door transit service for people who can’t navigate the conventional transit system have been on strike since last week, pausing service for all but essential medical trips.

McCann asks HandyDART users to be “patient,” since they are trying to get not only a fair contract for workers but also a better service for customers.

He says it’s unclear when the talks will resume, but he hopes next week at the latest.

The employer, Transdev, didn’t reply to an interview request before publication.

This report by The Canadian Press was first published Sept. 13, 2024.

The Canadian Press. All rights reserved.
