With careers that have advanced both artificial intelligence and the tools used by other researchers, two members of the Northeastern University computer science community have been elected Fellows of the Association for Computing Machinery (ACM).
Kenneth Church, professor of the practice in the Khoury College of Computer Sciences and senior principal research scientist in The Institute for Experiential AI, began his career in computational linguistics and natural language processing.
“You’ve probably heard of the recent excitement about chatbots,” Church says, referring to the release of ChatGPT and the many questions it raised. “A lot of people think that all this happened instantly, overnight, but it wasn’t like that. I got involved in computational linguistics in the late 1970s, and I’ve been working at it ever since.”
One of Church’s early papers, which he says is his most cited, described “corpus-based methods” for building dictionaries and was written in collaboration with British lexicographer Patrick Hanks.
The dictionaries Church was interested in are built on corpora: large samples of text taken from real-world sources. One famous corpus, the Brown Corpus, included “a million words, consisting of 500 samples of 2,000 words each.” Each sample, Church says, was “chosen by a committee to be representative of the kind of language that people would be exposed to.”
Using samples like these, artificial intelligence programs like ChatGPT learn how often each word occurs and, in more advanced iterations, which words tend to appear in close proximity to other words, known as co-occurrences.
By “looking at all the co-occurrences of two words within a window,” Church says, “you could find that ‘doctor’ and ‘nurse’ do co-occur more often than chance.”
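The windowed counting Church describes can be sketched in a few lines of Python. This is a minimal illustration, not the exact method from the Church and Hanks paper or from any chatbot: the function name `cooccurrence_pmi`, the three-word window and the toy sentences are assumptions made for the example, and pointwise mutual information (PMI) is used here as one common way to score “more often than chance.”

```python
import math
from collections import Counter


def cooccurrence_pmi(tokens, window=3):
    """Score word pairs by how much more often they co-occur than chance predicts."""
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        # Pair each word with the words that follow it inside the window.
        for right in tokens[i + 1 : i + 1 + window]:
            # Count both directions so the table is symmetric.
            pair_counts[(left, right)] += 1
            pair_counts[(right, left)] += 1

    total = sum(pair_counts.values())
    row_totals = Counter()
    for (left, _), count in pair_counts.items():
        row_totals[left] += count

    # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    # from the co-occurrence table. A score above 0 means the pair appears
    # together more often than independent words would.
    return {
        (a, b): math.log2(count * total / (row_totals[a] * row_totals[b]))
        for (a, b), count in pair_counts.items()
    }


# Toy corpus (made up for illustration; real corpora hold millions of words).
text = (
    "the doctor and the nurse met the patient . "
    "the nurse called the doctor . "
    "the cat sat on the mat ."
)
scores = cooccurrence_pmi(text.split(), window=3)
print(scores[("doctor", "nurse")], scores[("doctor", "the")])
```

In this toy corpus, “doctor” sits near “the” more often in raw counts than it sits near “nurse,” but “the” is near everything, so the chance-corrected score ranks “doctor” and “nurse” higher.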
Church says that this is still “the way modern chatbots work. They’ve just processed more text than any [human] could read in their lifetime — or many, many lifetimes.”
Now Church is working on a project he calls Better Together that is aimed at providing the research community with another tool to identify papers relevant to their topics.
“The number of papers published doubles every nine years,” Church says. “So 90% of the literature was written since I was in graduate school.” With such a huge volume of material published every year, it can seem impossible for researchers to get a handle on their subject.
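The arithmetic behind that figure can be checked with a short calculation. The sketch below assumes smooth exponential growth with a fixed nine-year doubling time, which is a simplification; the function names are hypothetical. Under that assumption, the share of all papers published in the most recent t years is 1 − 2^(−t/9), so 90% of the literature corresponds to roughly the last 30 years.

```python
import math


def share_published_within(years, doubling_time=9.0):
    """Fraction of all papers published in the last `years` years,
    assuming the literature doubles every `doubling_time` years."""
    return 1.0 - 2.0 ** (-years / doubling_time)


def years_to_cover(share, doubling_time=9.0):
    """How far back one must go to cover `share` of the literature."""
    return doubling_time * math.log2(1.0 / (1.0 - share))


print(share_published_within(9))   # one doubling period back: half of everything
print(share_published_within(27))  # three doubling periods back
print(years_to_cover(0.90))        # years needed to reach 90%
```

Because Church entered the field in the late 1970s, more than 30 years ago, his 90% figure is consistent with the doubling rate he cites.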
“Most papers aren’t cited much, and most of the highly cited papers aren’t recent,” Church says, “because it takes the community a while to figure out what’s important.”
“A lot of recommendation systems recommend stuff that’s buzzword compliant,” but not heavily cited, he continues. This means that many research search engines wind up with a heavy recency bias.
“I’m more interested in stuff that is important, but maybe not very recent.” With Better Together, Church hopes he can provide a “diversity of perspectives that would be valuable.”
In addition to Church, the ACM has also named a Northeastern University graduate a Fellow. Natasha Noy, now a research scientist at Google, received her Ph.D. from Northeastern in 1997.
She leads a research team building Google’s Dataset Search, a search engine specifically designed to find datasets — of all kinds — that are publicly available on the web. “You can think of this as Google Scholar but for data,” Noy says.
“The majority of our users are students,” she continues. “We have data on pretty much everything and anything you’re interested in.”
Dataset Search works by looking at a webpage’s metadata: information about the webpage that isn’t immediately visible on its surface but is encoded in its HTML. This metadata describes the page’s contents and tells Google (and other entities) what that page holds: a blog, a storefront, a dataset, etc.
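To make the idea concrete, here is a minimal sketch of reading that kind of metadata from a page. It assumes the metadata is embedded as JSON-LD using the schema.org `Dataset` type, one widely used convention for describing datasets; the article does not specify the format, and this is not Google’s implementation. The class `JsonLdExtractor`, the function `find_datasets` and the sample page are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed metadata blocks.


def find_datasets(html):
    """Return metadata blocks that declare themselves as datasets."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        item for item in parser.items
        if isinstance(item, dict) and item.get("@type") == "Dataset"
    ]


# Made-up page: the visible content is just a heading, while the
# description of the dataset lives in the embedded metadata.
PAGE = """<html><head><title>City rainfall</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Hypothetical city rainfall, 2000-2020",
 "description": "Monthly rainfall totals (made-up example)."}
</script></head><body><h1>Rainfall</h1></body></html>"""

for dataset in find_datasets(PAGE):
    print(dataset["name"])
```

A page without such a block returns an empty list, which mirrors the early problem Noy describes: sites that published data without describing it in metadata could not be recognized as datasets.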
While most websites incorporate metadata in their code, “when we started the work,” she says, “there were only a handful of sites that were both publishing data and describing it [in their] metadata — even though by then we knew there were hundreds, if not thousands of sites publishing data[sets].”
“After Dataset Search came along there was a huge rise in sites on the web that were actually publishing that metadata.”
“Metadata is not just for Google,” she notes, “metadata is not just for Dataset Search. It’s part of this whole larger ecosystem.”
Prior to joining Google, Noy’s research was in the “Semantic Web,” she says, built on the idea that “we can actually have sites connect to each other, and we can understand the data better and build tools on top of that information.” Tools not unlike Dataset Search.
“I’ve been in the AI field my whole life,” Noy says. Now she’s interested in questions of responsibility. “Was this data that was acquired responsibly? Is it representing the world we want to represent? Is it safe?” she asks. Approaching these questions “from the point of view of data, I think, is an interesting way of trying to approach the responsibility angle.”
Noy also notes that she’s now on a prestigious list, which includes “many of the scientists that I looked up to for my whole career.”
On this year’s list of ACM Fellows are several winners of the Turing Award, sometimes called the “Nobel Prize of Computing.” “To be in a club with those people is a real big deal,” Church says, and he encourages budding computer scientists, who might also like to be in that club, to make connections in their community as they’re getting started.
Church joins nine other members of Northeastern’s faculty who are ACM Fellows: Gregory Abowd, Ricardo Baeza-Yates, Carla Brodley, Usama Fayyad, Matthias Felleisen, Kevin Fu, David Kaeli, Renée Miller and Mitchell Wand.
Noah Lloyd is a senior writer for Northeastern Global News and NGN Research.