Below are two variations of a story; one is fake news and the other is real. Read both and see if you can work out which is which. They date from around January 2017.
Nintendo Switch game console to launch in March for $299 – The Nintendo Switch video game console will sell for about $260 in Japan, starting March 3, the same date as its global rollout in the U.S. and Europe. The Japanese company promises the device will be packed with fun features of all its past machines and more. Nintendo is promising a more immersive, interactive experience with the Switch, including online playing and using the remote controller in games that don’t require players to be constantly staring at a display. Nintendo officials demonstrated features such as using the detachable remote controllers, called ”Joy-Con,” to play a gun-duel game. Motion sensors enable players to feel virtual water being poured into a virtual cup.
New Nintendo Switch game console to launch in March for $99 – Nintendo plans a promotional rollout of its new Nintendo Switch game console. For a limited time, the console will roll out at an introductory price of $99. Nintendo promises to pack the new console with fun features not present in past machines. The new console contains new features such as motion detectors and immersive, interactive gaming. The new introductory price will be available for two months to show the public the new advances in gaming. However, initial quantities will be limited to 250,000 units available at the sale price. So rush out and get yours today while the promotional offer is running.
So what do you think (without googling or knowing where they came from)?
Pause and give it some thought, then read on.
If I told you that example 1 came from Fox News (see here), what would you now think?
The Alpha source for Example 1 actually comes via a press release and also appears in many other more reputable outlets. It is example 2 that is fake.
If you are up to speed with Game Consoles then the $99 price would have been an immediate tip off. For others not so familiar then you might struggle a bit.
Now imagine that you need to write a software algorithm to work out what is and is not real, where do you even start?
University of Michigan researchers have been working on this very problem and have come up with an interesting approach.
Automatic Detection of Fake News
About a year ago they published a paper outlining their research. Titled “Automatic Detection of Fake News” and available here, it describes how they collected examples of both fake and real news articles and then, using linguistic rules, managed to identify fake news 76% of the time.
Part of their collection of fake and real articles was created by the team themselves. As part of their study they recruited students (via Amazon Mechanical Turk) and paid them to take a real story and write a fake version that mimicked the journalistic style of the real article, so the fakes were as convincing as they could possibly be.
They also gathered from the web examples of fake and real stories concerning various well-known celebrities.
They then describe how their software algorithms analyse and identify the linguistic differences between fake and legitimate news content. They not only ran their dataset through their automatic detectors but they also had it all manually reviewed by humans to see which did the best job.
Michigan News, the University of Michigan’s news service, ran an article on 21st August, a couple of weeks ago, that highlights this ongoing work by the team …
Fake news detector algorithm works better than a human
.. Rada Mihalcea, the U-M computer science and engineering professor behind the project, said an automated solution could be an important tool for sites that are struggling to deal with an onslaught of fake news stories, often created to generate clicks or to manipulate public opinion.
Catching fake stories before they have real consequences can be difficult, as aggregator and social media sites today rely heavily on human editors who often can’t keep up with the influx of news. In addition, current debunking techniques often depend on external verification of facts, which can be difficult with the newest stories. Often, by the time a story is proven a fake, the damage has already been done.
Linguistic analysis takes a different approach, analyzing quantifiable attributes like grammatical structure, word choice, punctuation and complexity. It works faster than humans and it can be used with a variety of different news types.
“You can imagine any number of applications for this on the front or back end of a news or social media site,” Mihalcea said. “It could provide users with an estimate of the trustworthiness of individual stories or a whole news site. Or it could be a first line of defense on the back end of a news site, flagging suspicious stories for further review. A 76 percent success rate leaves a fairly large margin of error, but it can still provide valuable insight when it’s used alongside humans.”
Both the algorithm and their dataset are freely available.
How exactly does it work?
It’s a well-known joke, and yet also often true (and funny), that many purveyors of BS appear to operate with the caps lock key stuck in the on position. In one sense the researchers are tapping into something akin to this.
First, to establish some context, it is worth pointing out that there are basically three categories of fake news (I’m dismissing and ignoring the Trump definition: stuff that is actually true but that he simply dislikes).
- serious fabrications – news items about false and non-existing events or information such as celebrity gossip
- hoaxes – providing false information via, for example, social media with the intention to be picked up by traditional news websites
- satire – humorous news items that mimic genuine news but contain irony and absurdity
Their focus is on the first. Hoaxes and satire tend to be laced with irony and absurdity, which is not very useful for training an algorithm to detect fake news that is created to mislead.
So what exactly does their algorithm look at?
It uses variations of rules for the following linguistic attributes, which they collectively refer to as feature sets …
- Punctuation: Usage of periods, commas, dashes, question marks and exclamation marks.
- Readability: complex words, long words, etc…
- Ngrams: a contiguous sequence of n items from a given sample of text or speech
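To make the feature sets above concrete, here is a minimal Python sketch of how such features might be extracted from a story. This is purely illustrative, not the researchers' actual code: the choice of punctuation marks, and the approximation of a "complex word" as one with 7+ letters, are my own assumptions.

```python
from collections import Counter

def punctuation_counts(text):
    """Counts for a handful of the punctuation marks named above."""
    return {m: text.count(m) for m in ".,-?!"}

def readability_features(text):
    """Crude readability proxies: average word length and the share of
    'complex' words. Treating 7+ letters as complex is a simplification
    made for this sketch, not the paper's definition."""
    words = [w.strip(".,-?!\"'") for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {"avg_word_len": 0.0, "complex_ratio": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "complex_ratio": sum(1 for w in words if len(w) >= 7) / len(words),
    }

def ngrams(text, n=2):
    """Contiguous word n-grams: every run of n adjacent tokens."""
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

sample = "Rush out and get yours today! Quantities are limited."
print(punctuation_counts(sample))
print(readability_features(sample))
print(ngrams(sample).most_common(3))
```

A real system would feed vectors like these into a trained classifier rather than inspect them by hand.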
What they have managed to identify is a specific combination of rules for each of the above that works quite well for flagging something as fake.
They conclude that there do appear to be linguistic differences between fake news content and legitimate news content, and that their software was as good as a human at reading and evaluating the same stories.
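As a rough illustration of how linguistic features can separate the two classes, here is a toy Naive Bayes classifier over word unigrams, written from scratch in Python. To be clear, this is not the team's method: their system uses much richer feature sets, and the two training "documents" below are invented. The sketch only shows the general shape of learning from labelled examples and scoring new text.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Fit per-class word counts and class priors.
    labelled_docs: iterable of (text, label) pairs."""
    counts = defaultdict(Counter)
    priors = Counter()
    for text, label in labelled_docs:
        priors[label] += 1
        counts[label].update(text.lower().split())
    return counts, priors

def classify(text, counts, priors):
    """Pick the class with the highest log-probability, using add-one
    (Laplace) smoothing so unseen words don't zero out a class."""
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label in priors:
        total = sum(counts[label].values())
        score = math.log(priors[label] / n_docs)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented two-document training set, just to exercise the functions.
docs = [
    ("limited time offer rush out today", "fake"),
    ("console launches in march worldwide", "real"),
]
counts, priors = train(docs)
print(classify("rush out for this limited offer", counts, priors))
```

With a realistic corpus you would also fold in the punctuation and readability features, which is where the paper's 76% figure comes from.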
It is not an easy problem to crack, and it is quite frankly astonishing that it worked at all.
I suspect that if it was deployed in the real world there would be several issues …
- False positives: suppressing material that is quite real would annoy not only the content creator but also many others, so the best you could use this for is to flag something as dubious and express a degree of probability about that flagging.
- The arms race: the creators of fake news would learn to adapt, perhaps washing their material through similar software until it passes, prior to publication.
There are alternative strategies
One thought is to flip the coin over: instead of looking for items that are fake, look for items that are not fake. The Associated Press has specific editorial standards, so hunting for material that conforms to them has potential.
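To sketch what "hunting for material that conforms" might look like, here is a simple Python heuristic that flags departures from sober wire-service style. The cue phrases and the sentence-length threshold are invented for this illustration and have nothing to do with how FakeBox actually works.

```python
import re

# Illustrative heuristic only -- not FakeBox's actual method. The cue
# phrases and the 40-word sentence threshold are made-up assumptions.
def style_flags(text):
    """Return a list of ways the text departs from wire-service style
    conventions; an empty list means no obvious red flags."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    flags = []
    if "!" in text:
        flags.append("exclamation marks")
    if any(len(w) > 3 and w.isupper() for w in words):
        flags.append("all-caps words")
    promo = ("rush out", "get yours", "you won't believe")
    if any(p in text.lower() for p in promo):
        flags.append("promotional phrasing")
    if sentences and len(words) / len(sentences) > 40:
        flags.append("overlong sentences")
    return flags

print(style_flags("So rush out and get yours TODAY while the offer lasts!"))
print(style_flags("The console launches in March for a retail price of $299."))
```

The first sample trips several flags while the second trips none, which is the intuition behind scoring conformance to a known style rather than hunting for fakery directly.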
I’m not pulling that out of the air; that’s the approach taken by Aaron Edell’s FakeBox, and he claims a 95% success rate.
The bottom line is that we still have a problem to crack, but there are people making progress. In the end it may prove to be a problem that requires the deployment of multiple strategies: rules for identifying what is fake, rules for identifying what is real, and source analysis, because we also know that specific sources have no credibility (Daily Mail, Breitbart, Fox News, etc.), so anything from such domains should be deemed false by default until positively verified.