A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.
The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. The revelation led to an avalanche of writers expressing shock on social media that their work had been included without their permission and sharing their concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.
A CBC News analysis of the dataset, called Books3, identified thousands of Canadian authors and books in both official languages.
Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three quarters of CBC’s Canada Reads contenders and Scotiabank Giller Prize nominees are featured, along with over a third of all Governor General’s Literary Award finalists.
Authors shocked to find their books used to train AI without permission
Some of Canada’s most famous authors were shocked to find that their books have been used without their permission to train artificial intelligence software. The Writers’ Union of Canada says it is considering a lawsuit, but one law professor says it’s not clear if using the books to train AI is illegal.
Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.
“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.
Korman told CBC News he had read about the dataset and knew some of his books were in it.
“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”
Massive datasets like Books3 are used to train artificial intelligence models to interpret human language — as in read and write like us. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.
But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.
“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”
Canadian author ‘flattered and concerned’
Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.
“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.
Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.
Like Korman, Hayden Taylor is concerned about copyright violations of his work.
“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”
Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.
“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”
‘Unbelievably disrespectful’
CBC News also found one out of every six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.
“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.
Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.
“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”
According to Degen, the Books3 dataset is a violation of Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths, because it accessed entire works of art without prior approval by the artist.
“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.
He says WUC is in “deep research phase” and looking at all possible legal remedies, including launching a lawsuit.
Legality of dataset unclear, copyright expert says
Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable if the existence or use of Books3 is illegal under Canada’s copyright law.
“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”
In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”
She says what is essential to understand is that massive datasets like Books3 are mainly used by AI as data points to understand patterns in language. That’s not the same as authorship, Craig says.
“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”
Multiple U.S. lawsuits
The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.
Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset and it was removed last August.
The lawsuits mention tech companies OpenAI — the company behind ChatGPT — Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.
Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.
On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Facebook’s parent company, Meta. The ruling states that based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”
“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”
One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab which created and launched Books3 in October 2020.
In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”
The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”
Why Montreal writers want AI to stop stealing their work
Local writers, such as Heather O’Neill, Trevor Ferguson and Rosemary Sullivan, say they’re interested in participating in legal action against artificial intelligence companies for using their writing to train bots to mimic their writing styles.
Ottawa may review copyright law
Canada’s government is also pondering whether copyright law should be changed with respect to the challenges posed by AI.
This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”
“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.
Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.
“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”
After CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted the prose contained specific Indigenous words and cultural references and the work sounded “eerily like it could be me.”
He joked that AI should be renamed Artificially Indigenous.
“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”
METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3
To identify authors, CBC News used Python programming with regular expressions (RegEx) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were put through the ISBNdb worldwide database to retrieve their title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files could not be identified (4.5%). Upon inspection, 1,284 files were completely empty and 205 files were duplicates; all were excluded from this analysis.
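The ISBN extraction step described above can be sketched roughly as follows. This is a minimal illustration, not CBC News’s actual code: the pattern, the assumption that each code is preceded by the label “ISBN,” and the function names (extract_isbns, is_valid_isbn10, is_valid_isbn13) are all hypothetical. The checksum tests follow the standard ISBN-10 and ISBN-13 rules and are used here only to discard strings that look like ISBNs but are not.

```python
import re

# Hypothetical pattern: the label "ISBN" (optionally "-10" or "-13"), then a
# run of digits, hyphens and spaces, ending in a digit or X (ISBN-10 check).
ISBN_PATTERN = re.compile(
    r"ISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,16}[0-9Xx])", re.IGNORECASE
)


def is_valid_isbn10(code: str) -> bool:
    """Check an ISBN-10: weighted sum (weights 10..1) must be divisible by 11."""
    if len(code) != 10 or not code[:9].isdigit() or code[9] not in "0123456789X":
        return False
    digits = [int(c) for c in code[:9]] + [10 if code[9] == "X" else int(code[9])]
    return sum((10 - i) * d for i, d in enumerate(digits)) % 11 == 0


def is_valid_isbn13(code: str) -> bool:
    """Check an ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(code) != 13 or not code.isdigit():
        return False
    return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(code)) % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return the distinct, checksum-valid ISBNs found in text, in order."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        # Strip hyphens and spaces so the code can be checked and looked up.
        code = re.sub(r"[\s-]", "", match.group(1)).upper()
        if (is_valid_isbn10(code) or is_valid_isbn13(code)) and code not in found:
            found.append(code)
    return found


# Made-up sample text; the third code fails its checksum and is dropped.
sample = (
    "Copyright 2020\n"
    "ISBN 978-0-306-40615-7 (ebook)\n"
    "ISBN-10: 0-306-40615-2\n"
    "ISBN 123-4-567-89012-3\n"
)
print(extract_isbns(sample))
```

Each code returned this way could then be looked up in a bibliographic database such as ISBNdb, with the fallback of reading the author and title from the file itself when no valid code is found.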
NEW YORK (AP) — The U.S. syphilis epidemic slowed dramatically last year, gonorrhea cases fell and chlamydia cases remained below prepandemic levels, according to federal data released Tuesday.
The numbers represented some good news about sexually transmitted diseases, which experienced some alarming increases in past years due to declining condom use, inadequate sex education, and reduced testing and treatment when the COVID-19 pandemic hit.
Last year, cases of the most infectious stages of syphilis fell 10% from the year before — the first substantial decline in more than two decades. Gonorrhea cases dropped 7%, marking a second straight year of decline and bringing the number below what it was in 2019.
“I’m encouraged, and it’s been a long time since I felt that way” about the nation’s epidemic of sexually transmitted infections, said the CDC’s Dr. Jonathan Mermin. “Something is working.”
More than 2.4 million cases of syphilis, gonorrhea and chlamydia were diagnosed and reported last year — 1.6 million cases of chlamydia, 600,000 of gonorrhea, and more than 209,000 of syphilis.
Syphilis is a particular concern. For centuries, it was a common but feared infection that could deform the body and end in death. New cases plummeted in the U.S. starting in the 1940s when infection-fighting antibiotics became widely available, and they trended down for a half century after that. By 2002, however, cases began rising again, with men who have sex with other men being disproportionately affected.
The new report found cases of syphilis in their early, most infectious stages dropped 13% among gay and bisexual men. It was the first such drop since the agency began reporting data for that group in the mid-2000s.
However, there was a 12% increase in the rate of cases of unknown- or later-stage syphilis — a reflection of people infected years ago.
Cases of syphilis in newborns, passed on from infected mothers, also rose. There were nearly 4,000 cases, including 279 stillbirths and infant deaths.
“This means pregnant women are not being tested often enough,” said Dr. Jeffrey Klausner, a professor of medicine at the University of Southern California.
What caused some of the STD trends to improve? Several experts say one contributor is the growing use of an antibiotic as a “morning-after pill.” Studies have shown that taking doxycycline within 72 hours of unprotected sex cuts the risk of developing syphilis, gonorrhea and chlamydia.
In June, the CDC started recommending doxycycline as a morning-after pill, specifically for gay and bisexual men and transgender women who recently had an STD diagnosis. But health departments and organizations in some cities had been giving the pills to people for a couple of years.
Some experts believe that the 2022 mpox outbreak — which mainly hit gay and bisexual men — may have had a lingering effect on sexual behavior in 2023, or at least on people’s willingness to get tested when strange sores appeared.
Another factor may have been an increase in the number of health workers testing people for infections, doing contact tracing and connecting people to treatment. Congress gave $1.2 billion to expand the workforce over five years, including $600 million to states, cities and territories that get STD prevention funding from CDC.
Last year had the “most activity with that funding throughout the U.S.,” said David Harvey, executive director of the National Coalition of STD Directors.
However, Congress ended the funds early as a part of last year’s debt ceiling deal, cutting off $400 million. Some people already have lost their jobs, said a spokeswoman for Harvey’s organization.
Still, Harvey said he had reasons for optimism, including the growing use of doxycycline and a push for at-home STD test kits.
Also, there are reasons to think the next presidential administration could get behind STD prevention. In 2019, then-President Donald Trump announced a campaign to “eliminate” the U.S. HIV epidemic by 2030. (Federal health officials later clarified that the actual goal was a huge reduction in new infections — fewer than 3,000 a year.)
There were nearly 32,000 new HIV infections in 2022, the CDC estimates. But a boost in public health funding for HIV could also help bring down other sexually transmitted infections, experts said.
“When the government puts in resources, puts in money, we see declines in STDs,” Klausner said.
___
The Associated Press Health and Science Department receives support from the Howard Hughes Medical Institute’s Science and Educational Media Group. The AP is solely responsible for all content.
WASHINGTON (AP) — Scientists can’t know precisely when a volcano is about to erupt, but they can sometimes pick up telltale signs.
That happened two years ago with the world’s largest active volcano. About two months before Mauna Loa spewed rivers of glowing orange molten lava, geologists detected small earthquakes nearby and other signs, and they warned residents on Hawaii’s Big Island.
Now a study of the volcano’s lava confirms their timeline for when the molten rock below was on the move.
“Volcanoes are tricky because we don’t get to watch directly what’s happening inside – we have to look for other signs,” said Erik Klemetti Gonzalez, a volcano expert at Denison University, who was not involved in the study.
Upswelling ground and increased earthquake activity near the volcano resulted from magma rising from lower levels of Earth’s crust to fill chambers beneath the volcano, said Kendra Lynn, a research geologist at the Hawaiian Volcano Observatory and co-author of a new study in Nature Communications.
When pressure was high enough, the magma broke through brittle surface rock and became lava – and the eruption began in late November 2022. Later, researchers collected samples of volcanic rock for analysis.
The chemical makeup of certain crystals within the lava indicated that around 70 days before the eruption, large quantities of molten rock had moved from around 1.9 miles (3 kilometers) to 3 miles (5 kilometers) under the summit to a mile (2 kilometers) or less beneath, the study found. This matched the timeline the geologists had observed with other signs.
The last time Mauna Loa erupted was in 1984. Most of the U.S. volcanoes that scientists consider to be active are found in Hawaii, Alaska and on the West Coast.
Worldwide, around 585 volcanoes are considered active.
Scientists can’t predict eruptions, but they can make a “forecast,” said Ben Andrews, who heads the global volcano program at the Smithsonian Institution and who was not involved in the study.
Andrews compared volcano forecasts to weather forecasts – informed “probabilities” that an event will occur. And better data about the past behavior of specific volcanoes can help researchers fine-tune forecasts of future activity, experts say.
“We can look for similar patterns in the future and expect that there’s a higher probability of conditions for an eruption happening,” said Klemetti Gonzalez.
Waymo on Tuesday opened its robotaxi service to anyone who wants a ride around Los Angeles, marking another milestone in the evolution of self-driving car technology since the company began as a secret project at Google 15 years ago.
The expansion comes eight months after Waymo began offering rides in Los Angeles to a limited group of passengers chosen from a waiting list that had ballooned to more than 300,000 people. Now, anyone with the Waymo One smartphone app will be able to request a ride around an 80-square-mile (129-square-kilometer) territory spanning the second largest U.S. city.
After Waymo received approval from California regulators to charge for rides 15 months ago, the company initially chose to launch its operations in San Francisco before offering a limited service in Los Angeles.
Before deciding to compete against conventional ride-hailing pioneers Uber and Lyft in California, Waymo unleashed its robotaxis in Phoenix in 2020 and has been steadily extending the reach of its service in that Arizona city ever since.
Driverless rides are proving to be more than just a novelty. Waymo says it now transports more than 50,000 weekly passengers in its robotaxis, a volume of business that helped the company recently raise $5.6 billion from its corporate parent Alphabet and a list of other investors that included venture capital firm Andreessen Horowitz and financial management firm T. Rowe Price.
“Our service has matured quickly and our riders are embracing the many benefits of fully autonomous driving,” Waymo co-CEO Tekedra Mawakana said in a blog post.
Despite its inroads, Waymo is still believed to be losing money. Although Alphabet doesn’t disclose Waymo’s financial results, the robotaxi service is a major part of an “Other Bets” division that suffered an operating loss of $3.3 billion through the first nine months of this year, down from a setback of $4.2 billion at the same time last year.
But Waymo has come a long way since Google began working on self-driving cars in 2009 as part of project “Chauffeur.” Since its 2016 spinoff from Google, Waymo has established itself as the clear leader in a robotaxi industry that’s getting more congested.
Electric auto pioneer Tesla is aiming to launch a rival “Cybercab” service by 2026, although its CEO Elon Musk said he hopes the company can get the required regulatory clearances to operate in Texas and California by next year.
Tesla’s projected timeline for competing against Waymo has been met with skepticism because Musk has made unfulfilled promises about the company’s self-driving car technology for nearly a decade.
Meanwhile, Waymo’s robotaxis have driven more than 20 million fully autonomous miles and provided more than 2 million rides to passengers without encountering a serious accident that resulted in its operations being sidelined.
That safety record is a stark contrast to one of its early rivals, Cruise, a robotaxi service owned by General Motors. Cruise’s California license was suspended last year after one of its driverless cars in San Francisco dragged a jaywalking pedestrian who had been struck by a different car driven by a human.
Cruise is now trying to rebound by joining forces with Uber to make some of its services available next year in U.S. cities that still haven’t been announced. But Waymo also has forged a similar alliance with Uber to dispatch its robotaxis in Atlanta and Austin, Texas, next year.
Another robotaxi service, Amazon’s Zoox, is hoping to begin offering driverless rides to the general public in Las Vegas at some point next year before also launching in San Francisco.