CBC News analysis finds thousands of Canadian authors, books in controversial dataset used to train AI | Canada News Media

A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.

The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. It led to an avalanche of writers expressing shock on social media that their work had been included without their permission and sharing their concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.

A CBC News analysis of the dataset, called Books3, identified thousands of Canadian authors and books in both official languages.

Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three quarters of CBC’s Canada Reads contenders and Scotiabank Giller Prize nominees are featured, along with over a third of all Governor General’s Literary Award finalists.

 

Authors shocked to find their books used to train AI without permission

 

Some of Canada’s most famous authors were shocked to find that their books have been used without their permission to train artificial intelligence software. The Writers’ Union of Canada says it is considering a lawsuit, but one law professor says it’s not clear if using the books to train AI is illegal.

Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.

“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.

Korman told CBC News he had read about the dataset and knew some of his books were in it.

“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”

Canadian author Gordon Korman says he wants to know how 28 of his young adult books found their way into the Books3 dataset, a huge trove of data used by artificial intelligence companies to train their large language models. (gordonkorman.com)

Massive datasets like Books3 are used to train artificial intelligence models to interpret human language, that is, to read and write like us. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.

But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.

“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”

Canadian author ‘flattered and concerned’

Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.

“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.

Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.

Like Korman, Hayden Taylor is concerned about copyright violations of his work.

Author Drew Hayden Taylor from Curve Lake First Nation just north of Peterborough, Ont., has nine books in the now defunct Books3 dataset. (David Hall/CBC)

“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”

Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.

“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”

‘Unbelievably disrespectful’

CBC News also found that one out of every six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.

“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.

Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.

John Degen, the executive director of The Writers’ Union of Canada, says the Books3 dataset is a violation of Canadian copyright law and believes it should be addressed legally and by Parliament. (Alexis Raymon/CBC)

“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”

According to Degen, the Books3 dataset is a violation of Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths, because it accessed entire works of art without prior approval by the artist.

“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.

He says WUC is in “deep research phase” and looking at all possible legal remedies, including launching a lawsuit.

The Current (24:02): Could AI put authors out of business?

Hundreds of writers have learned that their books have been used to train artificial intelligence to spit out imitations. Bestselling authors Sean Michaels and Linwood Barclay discuss what AI might mean for human creativity and artist compensation.

Legality of dataset unclear, copyright expert says

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable if the existence or use of Books3 is illegal under Canada’s copyright law.

“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, co-authored a 2022 submission calling for the Canadian government to broaden copyright law to allow for AI research and analysis, including text and data mining. (Alexis Raymon/CBC)

In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”

She says what is essential to understand is that massive datasets like Books3 are mainly used by AI as data points to understand patterns in language. That’s not the same as authorship, Craig says.

“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”

Multiple U.S. lawsuits

The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.

Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset and it was removed last August.

As It Happens (5:59): Authors launch lawsuit accusing OpenAI of pirating their books to train ChatGPT

The Authors Guild, a U.S. trade group for writers, filed the proposed class-action on Tuesday on behalf of 17 plaintiffs. One of them, novelist Douglas Preston, spoke to As It Happens host Nil Köksal.

The lawsuits mention tech companies OpenAI — the company behind ChatGPT — Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.

Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.

On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Facebook’s parent company, Meta. The ruling states that, based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”

“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”

One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab that launched Books3 in October 2020.

In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”

The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”

 

Why Montreal writers want AI to stop stealing their work

Local writers, such as Heather O’Neill, Trevor Ferguson and Rosemary Sullivan, say they’re interested in participating in legal action against artificial intelligence companies for using their writing to train bots to mimic their writing styles.

Ottawa may review copyright law

Canada’s government is also pondering whether copyright law should be changed with respect to the challenges posed by AI.

This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”

“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.

Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.

“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”

When CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted the prose contained specific Indigenous words and cultural references and the work sounded “eerily like it could be me.”

He joked that AI should be renamed Artificially Indigenous.

“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”

METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3

To identify authors, CBC News used Python with regular expressions (RegEx) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were run through the ISBNdb worldwide database to retrieve each book’s title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files (4.5%) could not be identified. Upon inspection, 1,284 files were completely empty and 205 files were duplicates; all of these were excluded from this analysis.
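The ISBN extraction step described above could look something like the following sketch. It is an illustration only, not CBC News's actual code: the pattern, the checksum validation and the function names (`extract_isbns`, `is_valid_isbn10`, `is_valid_isbn13`) are all assumptions, and the ISBNdb lookup is left out entirely.

```python
import re

# Assumed pattern: an ISBN-13 (978/979 prefix) or an ISBN-10,
# with optional hyphens or spaces between digits.
ISBN_PATTERN = re.compile(
    r"\b(?:97[89][- ]?(?:\d[- ]?){9}\d|(?:\d[- ]?){9}[\dXx])\b"
)


def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum (weights 10 down to 1, modulo 11)."""
    if len(isbn) != 10 or not isbn[:9].isdigit():
        return False
    if not (isbn[9].isdigit() or isbn[9] == "X"):
        return False
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(isbn)
    )
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 checksum (alternating weights 1 and 3, modulo 10)."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0


def extract_isbns(text):
    """Return the distinct, checksum-valid ISBNs in text, in order of appearance."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        isbn = re.sub(r"[- ]", "", match.group()).upper()
        if (is_valid_isbn13(isbn) or is_valid_isbn10(isbn)) and isbn not in found:
            found.append(isbn)
    return found
```

Validating the checksum is a design choice of this sketch: it filters out digit runs such as phone numbers that happen to fit the pattern, at the cost of dropping ISBNs that were mistyped in the source file.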

To identify Canadian and Québécois authors, CBC News compared the full names of writers in Books3 against a list of 7,800 Canadian and Québécois writers, including: the online member directories of the Writers’ Union of Canada (WUC) and the Union des Écrivaines et des Écrivains Québécois (UNEQ), all past contenders/winners of the Canada Reads competition (2002-2022), all longlisted/shortlisted books for the Scotiabank Giller Prize (1994-2023), Trillium Book Award (1994-2023) and Governor General’s Literary Awards (1936-2023) — in French and English — and all writers who benefited from the Writers’ Trust of Canada (1976-2023) programs or awards. Additionally, CBC News compared Books3 titles against a list of 195,000 documents published in Quebec since 2010. Every author match was verified against book titles and the author’s country of citizenship, place of birth and biography to ensure that writers with the same name living in another country weren’t included.
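As a rough illustration of the name comparison described above, the sketch below matches dataset author names against a reference list after normalizing case, accents and spacing. The normalization rules and the names `normalize_name` and `find_candidate_matches` are assumptions made for this example; the methodology only says that full names were compared and that every match was then verified by hand.

```python
import unicodedata


def normalize_name(name):
    """Fold case, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def find_candidate_matches(dataset_authors, reference_authors):
    """Return (dataset name, reference name) pairs whose normalized forms agree.

    These are candidates only: namesakes living in other countries would still
    need to be ruled out manually, as the methodology describes.
    """
    reference = {normalize_name(name): name for name in reference_authors}
    matches = []
    for author in dataset_authors:
        key = normalize_name(author)
        if key in reference:
            matches.append((author, reference[key]))
    return matches
```

For example, `find_candidate_matches(["patrick senecal"], ["Patrick Sénécal"])` pairs the two spellings, which would then go on to the title and biography checks.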

 


Christian McCaffrey is placed on injured reserve for the 49ers and will miss at least 4 more games


SANTA CLARA, Calif. (AP) — The San Francisco 49ers placed All-Pro running back Christian McCaffrey on injured reserve because of his lingering calf and Achilles tendon injuries.

The move made Saturday means McCaffrey will miss at least four more games after already sitting out the season opener. He is eligible to return for a Thursday night game in Seattle on Oct. 10.

McCaffrey got hurt early in training camp and missed four weeks of practice before returning to the field on a limited basis last week. He was a late scratch for the opener on Monday night against the Jets and now is sidelined again after experiencing pain following practice on Thursday.

McCaffrey led the NFL last season with 2,023 yards from scrimmage and was tied for the league lead with 21 touchdowns, winning AP Offensive Player of the Year.

The Niners made up for McCaffrey’s absence thanks to a strong performance from backup Jordan Mason, who had 28 carries for 147 yards and a touchdown in San Francisco’s 32-19 victory over the New York Jets. Mason is set to start again Sunday at Minnesota.

After missing 23 games because of injuries in his final two full seasons with Carolina, McCaffrey had been healthy the past two seasons.

He missed only one game combined in 2022-23 — a meaningless Week 18 game last season for San Francisco when he had a sore calf. His 798 combined touches from scrimmage in the regular season and playoffs were the third most for any player in a two-year span in the past 10 years.

Now San Francisco will likely rely heavily on Mason, a former undrafted free agent out of Georgia Tech who had 83 carries his first two seasons. He had at least 10 touches just twice before the season opener, when his 28 carries were the most by a 49ers player in a regular-season game since Frank Gore had 31 against Seattle on Oct. 30, 2011.

The Niners also have fourth-round rookie Isaac Guerendo and Patrick Taylor Jr. on the active roster. Guerendo played three offensive snaps with no touches in the opener. Taylor had 65 carries for Green Bay from 2021-23.

San Francisco also elevated safety Tracy Walker III from the practice squad for Sunday’s game against Minnesota.


The Canadian Press. All rights reserved.




Canada’s Newman, Arop secure third-place finishes at Diamond League track event


BRUSSELS – Canada walked away with some hardware at the Diamond League track and field competition Saturday.

Alysha Newman finished third in women’s pole vault, while Marco Arop did the same in the men’s 800-metre race.

Newman won a bronze medal in her event at the recent Paris Olympics. Arop grabbed silver at the same distance in France last month.

Australia’s Nina Kennedy, who captured gold at the Summer Games, again finished atop the podium. Sandi Morris of the United States was second.

Newman set a national record when she secured Canada’s first-ever pole vault medal, a bronze, at the Olympics with a height of 4.85 metres. The 30-year-old from London, Ont., cleared 4.80 metres on her second attempt Saturday, but was unable to conquer 4.88 metres in three attempts.

Arop, a 25-year-old from Edmonton, finished the men’s 800 metres with a time of one minute 43.25 seconds. Olympic gold medallist Emmanuel Wanyonyi of Kenya was first with a time of 1:42.70.

Djamel Sedjati, edged out by Arop for silver in Paris last month, was second in 1:42.87.

This report by The Canadian Press was first published Sept. 14, 2024.


Bologna prepares for Champions League debut with draw at Como while Juventus held


MILAN (AP) — Bologna’s preparations for its Champions League debut are not going well though it managed to spoil Como’s first Serie A home match in 21 years on Saturday.

Bologna came from two goals down to salvage a 2-2 draw, leaving it with three points from its opening four matches.

Bologna hosts Shakhtar Donetsk on Wednesday. Its only other appearance in Europe’s top competition was in 1964 in the preliminary round of the old European Cup.

AC Milan is also winless as it prepares for a Tuesday Champions League match against Liverpool. The Rossoneri hosted promoted Venezia later. Juventus drew at Empoli 0-0.

Como made a great start in the fifth minute when Patrick Cutrone attempted to roll the ball across the six-yard box but it took a huge deflection off Bologna defender Nicolò Casale for an own goal.

Bologna thought it was gifted a way back into the match on the stroke of halftime when referee Marco Piccinini signalled for a penalty following an Alberto Moreno handball, but he revoked his decision and instead gave a free kick because the handball was just outside the area.

Bologna improved after the break but found itself further behind when Cutrone raced onto a through ball and cut inside past a defender and fired into the far bottom corner.

Tommaso Pobega hit the post for Bologna, which finally pulled one back in the 76th through substitute Santiago Castro.

Another substitute helped the visitors snatch a point when Samuel Iling-Junior curled a fine strike into the top left corner in stoppage time.

Unbeaten sides

Juventus, and more surprisingly Empoli, are among six unbeaten sides.

Empoli held Monza and Bologna to draws either side of a shock 2-1 win at Roma. Juventus’ perfect start to the season was ruined by Roma in a goalless draw before the international break.

On Saturday, there were few clear-cut chances in Empoli, although home goalkeeper Devis Vásquez made spectacular saves to fingertip out a Federico Gatti header and deny Juventus forward Dusan Vlahovic in a one-on-one.

Empoli had a good opportunity in the 73rd minute following an Alberto Grassi one-two with Pietro Pellegri but the finish was straight at Mattia Perin.

The host could have won it right at the death but Gatti flew in with a great sliding block to keep out Emanuel Gyasi’s close-range effort.

Juventus hosts PSV Eindhoven in the Champions League on Tuesday.
