The Dataset

The first challenge we faced was to create a training set large enough to train the model. The training set is an essential part of a deep learning system: it is composed of a large number of examples that are fed into the program so that it can learn what to look for.

For image recognition, the process is pretty straightforward. The model is fed hundreds of thousands of images, each of them labeled. For instance, an animal dataset will contain vast quantities of images representing multiple versions of different animals, each with its label. The model ingests all of this, reduces each image to a set of small features, and uses these features to determine that a cat’s ear always has the same shape at its tip, and so on. Later, when presented with an image it has never seen before, the model draws on its accumulated knowledge of features to identify the animal or the object.
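To make this concrete, here is a minimal sketch of the same idea, using scikit-learn's tiny built-in digits dataset as a stand-in for a large labeled image corpus (the dataset and classifier choice are illustrative assumptions, not what any particular production system uses):

```python
# A minimal sketch of supervised image classification: the model sees
# labeled examples, learns discriminative features, then labels images
# it has never seen before.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()  # 8x8 grayscale images of digits, labels 0-9

# Hold out some images the model will never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)  # learn features from the labeled images

# Accuracy on images the model has never seen before:
print(f"accuracy on unseen images: {model.score(X_test, y_test):.2f}")
```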

Unfortunately, there was no such thing as a labeled training set for news articles. We had to build our own. To do that, we relied on two approaches.

We first organized clusters of articles based on their source, type, length, and various parameters such as the extent of vocabulary and readability index, drawing from a library of millions of articles. Then, to increase the precision of our model, we relied on humans to evaluate a smaller number of articles more carefully. Altogether, journalism students read and scored 10,000 news articles using a strict testing protocol (students were paid for the task). Each story was evaluated by three people to get a reliable result.
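As a rough illustration of these two approaches, the sketch below clusters articles on coarse features (length, vocabulary extent, a crude readability proxy) and reduces three human ratings per story to a single label. The features, toy articles, and the use of a median are our assumptions for the example, not the project's actual pipeline:

```python
# Sketch of the two approaches: (1) cluster articles on coarse features;
# (2) aggregate three independent human scores per story into one label.
import statistics
import numpy as np
from sklearn.cluster import KMeans

def features(text: str) -> list[float]:
    words = text.split()
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    return [
        len(words),                           # article length
        len(set(w.lower() for w in words)),   # extent of vocabulary
        len(words) / sentences,               # words per sentence (readability proxy)
    ]

articles = [  # stands in for a library of millions of articles
    "The council voted on the new budget. Debate lasted three hours.",
    "Cats! Cats everywhere. So many cats.",
]
X = np.array([features(a) for a in articles])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters.tolist())

# Each human-scored story gets three independent ratings; the median
# damps outliers and yields one label per article.
ratings = {"story-1": [4, 5, 4], "story-2": [2, 3, 2]}
labels = {sid: statistics.median(r) for sid, r in ratings.items()}
print("human labels:", labels)
```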

Measuring the Accuracy

To assess the performance of the model, we compared the scores returned by humans and by the machine under a variety of configurations. It turned out our model is right about 80 to 85 percent of the time when sorting articles by quality. That is not bad for a deep learning system: a sophisticated analysis of news articles is a challenging task for artificial intelligence. While perfectly clean datasets of images, for instance, will yield nearly 100 percent accuracy, deep learning has a difficult time dealing with fuzzy and subjective material such as news.
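Here is a minimal sketch of that kind of human/machine comparison. The scores and the 0.5 quality threshold are invented for the example:

```python
# Compare the machine's quality call against the human label for each
# article and report the agreement rate (all data here is illustrative).
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]        # 1 = high quality, per human raters
model_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.75]

threshold = 0.5  # assumed score above which the model says "high quality"
model_labels = [1 if s >= threshold else 0 for s in model_scores]

agreement = sum(h == m for h, m in zip(human_labels, model_labels)) / len(human_labels)
print(f"human/machine agreement: {agreement:.0%}")  # 75% on this toy data
```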

In the Future

A deep learning model is not a static thing. It requires endless refinement, not only in raw scoring performance but also in speed and in its ability to generalize, i.e., to return a consistent score across a wide variety of articles. In the future, we might take advantage of the ever-evolving field of deep learning, with new approaches, new tools, and new algorithms.