AI ethics: Dr. Carissa Véliz on privacy, advertising and machine learning

Editor’s note: This is one of the occasional posts in which we speak to someone with something to say on topics that we find interesting. If you are seeing this and are not yet signed up to Deepnews, click the button below to start receiving our blog posts every week, and a Digest of quality news on an important subject every Friday.

By Christopher Brennan

Data is everywhere. We generate it all the time. “Data is the new oil” has become a hackneyed phrase. But instead of thinking about a large flood of “data,” there are important distinctions that can be made between different types of data. 

One of the most important is personal data, the subject of a new book from Dr. Carissa Véliz, an associate professor in the faculty of philosophy and the Institute for Ethics in AI at the University of Oxford. While data is often viewed from a business or even political angle, Véliz uses an ethical perspective to argue for ending the focus on personal data in fields such as advertising.

There are of course connections to Deepnews, which provides contextual data on articles through the quality score. AI ethics has also become a hot topic in the news recently, so I reached out to Dr. Véliz with a lot to talk about. Below is a transcript of our conversation, edited for length and clarity.

I was interested, because there are people talking about AI ethics from a lot of different backgrounds, how did you personally become interested in AI ethics?

Dr. Carissa Véliz. Photo by Fran Monks

I first got interested in the concept of privacy for personal reasons. I started researching the history of my family, and went to the archives in Spain, and there was a lot that I didn’t know about them. It got me wondering about what the privacy of the dead is. I went to philosophy and philosophy didn’t have much to say about it, and I was kind of disappointed, being a philosopher. Shortly thereafter Snowden revealed that we were being surveilled on a mass scale. I was already aware of the gap in the literature in philosophy and so I thought this is really important and somebody should work on it. That’s how I got started and then that led me to be interested in the ethics of AI because of course, the most common kind of AI at the moment is machine learning, which means a lot of data. There is a very tight connection between the ethics of AI and privacy. 

Many people are not super well versed in philosophy. Can you explain what ethics means?

Ethics can be defined in a lot of different ways. Basically it’s a part of philosophy that deals with how we should act, not how we actually do act, but how we should or how we ought to act. Also it asks what kind of a life we want, what kind of life should we have and how do we get there.

When it comes to data you make a big distinction on personal data versus other types of data. I know that you have guidelines for what personalized data is, what are those?

In some cases that distinction is quite gray but there are areas in which it’s less gray. For instance, non-personal data would be something like data about the quality of the air in your city, and that kind of data is really important to share widely. It can help to do research and to allow people to have ideas as to how to tackle some of the challenges of society. Personal data is data about you. It’s typically very, very sensitive data and it’s data that can be used to infer even more sensitive things. For instance, many people think that they don’t care about Facebook knowing what kind of music they like. They type in data about what they liked in particular, and it doesn’t seem very sensitive. But one of the characteristics of data is that it’s very hard to imagine the kind of inferences that can be made from it. When you share your taste about music, it doesn’t occur to you that some company or another might be using that data to infer your sexual orientation, your political tendencies, even your IQ. That seems like very sensitive information that you wouldn’t want just anyone to have.

You focus on the data economy and a desire to end the data economy. Why in your view does the data economy need to end?

This is the first book that calls for the end of the data economy, and the basic problem is that as long as we allow personal data to be bought and sold as a commodity we’re incentivizing two things that are really toxic in our society. The first one is for people to collect more personal data than they need, because it’s something so valuable that they can then sell out. And that puts us at great risk because personal data can be used for all kinds of bad purposes. If you make it so valuable and you make it so coveted and you have so much of it, just sloshing around, it makes it a lot more likely that sooner or later it will be misused. 

And the second incentive that is very toxic is for data to be sold to basically the highest bidder. The highest bidders many times are institutions or people that don’t have the best interest of data subjects at heart. So it could be authoritarian governments, it could be insurance companies, banks, prospective employers. It’s very easy to misuse data and it’s very hard to prove that data have been misused. For instance, you can apply for a job. There are two candidates, and both are equally well prepared for that job, equally suited and equally qualified. But one of them has a particular disease, or they’re thinking about having kids, or something like that. If that information is sold to data brokers, to prospective employers, it’s very easy to discriminate against people and they will never know. They would never be able to complain because they would never realize that they’re the victim of discrimination.

So I wanted to also talk about what replaces it. Do you think if people take up an ethical approach to data, there will be more of a distinction between good ethical data and bad ethical data in terms of what’s allowed to be used in instances such as ads?

I think there will be. In terms of ads, what will replace personalized ads is just contextual ads. Say you’re looking for running shoes. You do a search for running shoes. That’s the only thing we need to know to show you ads for running shoes. We don’t really need to know whether you’re gay or straight, where exactly you come from, whether you’re liberal or conservative and so on and so forth. We just need to know that you’re looking for shoes.

Your work on personalization also can be seen alongside another criticism, which is the focus of online systems on generating engagement. I’m wondering what’s your take on personalized data being the issue versus the issue being, using personalized data or not, a focus on engagement from some of these algorithms.

The only reason why platforms use engagement and want you to stay on the platform for as long as possible is precisely so they can harvest your personal data. The more you stay, the more you say, the more you click, the more personal data they have on you and the more they can monetize it. If, say, you were to pay $10 a month or a year to Twitter and that’s it, Twitter wouldn’t care much if you spent one hour on it or if you spent 10 hours on it a day, as long as you want to stay on the platform. Both things are intimately related.

Switching gears a bit, an issue that has come up in recent weeks at Google and other places is the idea of auditing algorithms, specifically the difficulty of large language models. What are the ethical concerns of un-auditable or non-transparent amounts of data used to develop machine learning algorithms?

It’s really concerning for a lot of reasons. One reason is that algorithms may be using data that we don’t want them to use or proxies of that data. For instance, we don’t want algorithms to decide what happens to a person based on their race or their gender, but it’s very easy for an algorithm to do so even if you take out that information. The algorithm might still be using a proxy for that like zip code or taste in music or different things. And that’s hugely problematic because basically we want people to have the same rights online as they had before the digital age. Before the digital age we had common ways of trying to limit possible discrimination on the basis of things like sex and race, and in the digital age it seems like we’ve lost our foothold. We don’t have as much control over that. That’s a huge problem.

But there’s a more general problem that I see and it isn’t really talked about enough, one I’ve actually been reticent to talk about because I’ve been wanting to write about it and I just haven’t had time: how algorithms are being tested on people. People are being used as guinea pigs. We have a lot to learn from medical ethics in this regard. So if you went to the hospital in the 1950s, you might be on an experiment without even knowing you were on the experiment. The doctor would just give you one drug or another and you would be part of research and you would never know about it. And as we developed a medical ethics code, we established that that’s not okay. If you want somebody to participate in a research experiment, you have to notify them, you have to ask for their consent, you have to compensate them in some way or another. There are all these safeguards to be able to make sure you protect people, and we don’t have that with algorithms at the moment. An algorithm gets out into the world, and it can have just as big an effect as a drug, because it can decide whether you’re going to jail or not or whether you can get a job or a loan, really important decisions for people.

And these algorithms are not being tested with a randomized control trial, like we ask for with drugs. They’re being tested on people in the real world. Many times we’ve realized that they were incredibly inaccurate and harming thousands of people, and we’ve learned about this years later once these algorithms have broken lives. One example is the Michigan unemployment agency, which used an algorithm for a few years that falsely accused people of fraud, people who were already unemployed, who were already in a very precarious situation. Some of these people got fined hundreds of thousands of dollars, they lost their families. And it turned out that the algorithm was inaccurate 93% of the time. This is totally unacceptable. We have to subject algorithms to randomized control trials, just like we do with drugs.

For that particular issue, how do you view information that’s publicly available on the internet, the texts of websites, things like that, things that can be easily scraped. Where do you think there is an ethical issue with using data that’s just out there and throwing it all in a model?

Before the internet it was very common to think that if information was public then by definition it couldn’t be private. The internet has shown us that was completely wrong. One reason is that how accessible information is matters a lot. Suppose you have a record on you that is very private, but that is held in a very small town, somewhere remote, and that is not digitized. The fact that it’s a public record doesn’t really affect your privacy too much, because not enough people can access it easily. When we post something online it makes it incredibly public. Of course, even then you have degrees of publicity, right? It’s different to have something about you on a blog post that nobody reads than to be on the front page of a newspaper. But one of the risks with information online is that even with a blog post that nobody reads, you never know when it can become viral, and it’s very unpredictable when something will become viral.

Furthermore, there’s another problem about how data can be aggregated in ways that make it much more sensitive than people realize. So again, like when you share what kind of music you like and where you live, and what you ate last week, it never occurs to you that somebody might be joining all the dots together and inferring things about your health or your life expectancy.

Back to what replaces the data economy. You mentioned something on a podcast about Google making a decision to go from a search that was perhaps more loyal to users’ needs to one that was more loyal to advertisers and less caring about the quality of the search engine. So, how do you get back to a point where companies, platforms, algorithms, etc. are focused on delivering something to users rather than monetizing them?

There are a couple of ways. One way is to introduce fiduciary duties. Fiduciary duties are the kind of duties that force, for instance, doctors to only act on behalf of their patients’ interests and not their own interest. A doctor cannot perform surgery on you because they want to practice their skills or they want to earn more money, or they want an extra data point for their research. They can only perform surgery on you if it’s the best thing for your health. 

In the same way, we can implement data fiduciary duties, such that the first and foremost duty of anyone who holds our personal data is to benefit us. A second way in which we can make sure that companies owe their loyalty to us is by us being the client. Perhaps the main reason why Google changed its loyalties, and it was conscious of this because the founders Page and Brin had written a paper in which they mentioned this topic in 1998, is that companies typically owe their loyalty to their clients. When the user stops being the client and the clients are advertisers and advertising companies, then it shouldn’t surprise us that we are not the priority for platforms like Google and Facebook. Another way to regain their loyalty is to become clients once again, and that means having a different kind of business model. A classic one would be just to directly pay for their services like we used to pay for newspapers or for other kinds of services.

There are some people who still pay for newspapers, but … (laughs)

Yeah, yeah. Sorry, I pay for my newspapers actually (laughs). We just got used to not paying for things, and that turned out to be incredibly toxic and it comes back to bite us.

I’m wondering if part of the way that things got mixed up between the idea of the quality of a service and the engagement generated by a service, and that relates to advertising, is that advertising you can measure more authoritatively or more easily than perhaps the overall quality of a service. In terms of that fiduciary duty and in that new business model, is there going to be a need for a way of measuring what is good and what is bad, which can be fairly subjective? 

That’s a really interesting point. This is something that I haven’t thought through carefully, and any kind of answer is going to be tentative. But one book that I really liked and that I read more or less recently is “The Tyranny of Metrics,” and it shows how the way we measure things can lead us astray. We end up focusing more on catering to the metrics rather than providing or coming up with a service or work or whatever it is that is good for society, good for people and meaningful. So the fact that something becomes more subjective isn’t necessarily negative.

But there are bigger issues with subjectivity, and oftentimes subjectivity favors those who already are in positions of power. Take, for example, subjectively looking at job candidates. If you offer a written test or something, then you may get a different top candidate than if you’re just looking at CVs and you just look at which university someone went to.

Of course, there’s not going to be an answer that fits all purposes for which we use data and all contexts. But, for instance, if a platform like Google was less able to measure the exact value of what it gives users, but users are happier, they use it just as much, and the data is completely protected, that wouldn’t be a negative thing.

You have talked before about what we would be giving up if we stopped using personalized data, and my impression is that you’d think it’s not a whole lot, but it’d be interesting to hear more about that.

Yeah, that’s right. I think that the benefits that we get, for instance, from personalized ads are very small, we can get them in other ways like contextual ads, and the disadvantages are huge. And it’s the same in most other fields. If we really benefit from personal data then we will still continue to use it. It’s just we won’t sell it.

One of the disadvantages that you’ve mentioned before is the “fractured public sphere” where people are in their own bubbles. In terms of putting the fractured public sphere back together, would getting rid of personalized data even really help?

It will help. It wouldn’t be a solution to every problem but it will help in the sense that, at the moment, part of what is making the public sphere so fractured is that we are seeing content that is personalized based on our data. So if we get rid of that, we minimize it a lot. Suddenly, we’re going to have a lot more common content. That, I think, leads to quality because when our common content is really bad, then you will have journalists and academics and ordinary people criticizing and pushing for further quality. The fact that we can all see the same thing isn’t a guarantee, but tends to favor higher quality content.