Twitter Political Index Launches, But Is It Actually Measuring "Voter Sentiment"?
BY Micah L. Sifry | Wednesday, August 1, 2012
Today, Twitter announced the launch of the "Twitter Political Index" in partnership with the social data analysis firm Topsy and pollsters The Mellman Group and North Star Opinion Research, and the twittering class swooned.
"Ignore the 'Twindex' at your peril," wrote Chris Cillizza of The Washington Post's "The Fix" blog. "This is a big deal: Twitter just launched a very powerful new political sentiment tool," tweeted Ben Smith, the editor of BuzzFeed.
"Twitter Will Gauge Voter Sentiment in New Venture" was the headline at National Journal, never mind that this is neither a measure of voters nor of sentiment.
Rather, the Twindex number for Obama or Romney purports to rank where attitudes expressed about those two men stand relative to all the other "sentiment" being expressed daily on Twitter, smoothed over a three-day cycle to damp the platform's daily fluctuations in topics. Says Adam Sharp, Twitter's lead for government, news and social innovation: "If a candidate's rating is 30, that would mean that 30% of tweets on all other subjects are more negative than the ones about the candidate, and 70% are more positive."
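Topsy has not published its methodology, but Sharp's description implies a simple percentile rule. Here is a minimal sketch of that interpretation in Python — the scoring of individual tweets is assumed to have happened already, and the numbers are invented for illustration:

```python
# Illustrative sketch of the percentile idea Sharp describes -- NOT Topsy's
# actual (unpublished) methodology. Assume each tweet has already been
# assigned a numeric sentiment score (higher = more positive).

def twindex_score(candidate_scores, all_other_scores):
    """Percentage of all other tweets whose sentiment is more negative
    than the average sentiment of tweets about the candidate."""
    avg = sum(candidate_scores) / len(candidate_scores)
    more_negative = sum(1 for s in all_other_scores if s < avg)
    return round(100 * more_negative / len(all_other_scores))

# Toy data: tweets about the candidate skew slightly negative
# compared to the platform as a whole.
candidate = [-0.4, -0.1, 0.2, -0.3]
everything_else = [-0.9, -0.5, -0.2, 0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(twindex_score(candidate, everything_else))  # 30
```

On this toy data the candidate scores a 30: three of the ten other tweets are more negative than the candidate's average, which is exactly the reading Sharp gives for a rating of 30.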
Look out, everyone, the web just invented another leaderboard. And now the conversation about politics in America is going to get even dumber, because we have a new magic number to chatter about. Obama is at 34, down 4 from yesterday. Romney is at 25, up 2. Sell Obama! Buy Romney!
Over on BuzzFeed, staff reporter Matt Buchanan breathlessly gave the details on how Twindex would supposedly enable anyone to directly "check the pulse of millions of actual people, simultaneously and directly, second by second."
The rough version of how it works: Topsy pores through every single tweet in real time, determines which ones are about Obama or Romney, and then assigns a sentiment score to each tweet based on its content. That is, whether it's positive or negative toward Obama or Romney, and just how positive or negative it is. Add all the data up together and you have something like a real-time approval score for Obama and Romney, determined by what tens of millions of people are saying, which Twitter is going to release daily at election.twitter.com.
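The pipeline Buchanan describes — filter the stream for candidate mentions, score each tweet, aggregate — can be sketched in a few lines. The lexicon and scoring rule below are invented for illustration; Topsy's real classifier is far more sophisticated (and unpublished):

```python
# Toy version of the described pipeline: filter tweets for candidate
# mentions, score each one, and collect the scores. The word lists and
# the additive scoring rule are illustrative assumptions only.

POSITIVE = {"great", "love", "win", "strong"}
NEGATIVE = {"fail", "bad", "lose", "weak"}

def sentiment(tweet):
    """Crude lexicon score: +1 per positive word, -1 per negative word."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def daily_scores(tweets, candidate):
    """Score every tweet that mentions the candidate."""
    mentions = [t for t in tweets if candidate.lower() in t.lower()]
    return [sentiment(t) for t in mentions]

tweets = [
    "Obama gave a great speech",
    "Obama will lose in November",
    "Romney looks strong in the polls",
]
print(daily_scores(tweets, "Obama"))  # [1, -1]
```

Note that this toy scorer would cheerfully rate "Good job, Obama!" as positive — exactly the sarcasm failure mode discussed below.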
… The Twitter Political Index leverages the growing science of sentiment analysis, in which computers — machines! — try to assess what the meaning or feeling of a piece of writing really is. Which isn't easy — particularly for "short form content like tweets, which often lack context," Topsy's chief scientist Rishab Ghosh tells me. But Topsy's algorithms agree with a randomly selected human 90 percent of the time on what a tweet means, validated over 30,000 tests. (It's hard to get to 100 percent because humans often disagree on what a tweet means. Also, machines are still terrible at picking out or assessing sarcasm or irony. So, "Good job, Obama!" with a link to an article about crappy job numbers might fly past the algorithm as a positive tweet.)
Well, that's quite an admission, isn't it? If Topsy's algorithms were right 90 percent of the time, that means they were wrong 10 percent of the time. Or, shall I say, they and the human scorer(s) of a tweet were in disagreement 10 percent of the time, and no one could agree what a tweet meant.
But what if both machine and human scorers are in agreement, but also mistaken about someone's sentiment? Stop and think. How often have you misunderstood something that you read in an email or tweet that was written by someone you know, let alone a stranger? It could be that Topsy's algorithms AND human readers are both validating mistaken impressions. It's for these kinds of reasons that serious academics studying the challenge of doing sentiment analysis insist that researchers publish their methodology in detail, something Topsy has yet to do. A request to Topsy for comment was unanswered as of the end of business today.
I put this question to Sharp, and he said Topsy's testing of its algorithms against the choices of multiple human scorers gave him confidence that most ambiguities would be ironed out of the index's ratings. He added, "At the end of the day, there is no such thing as perfection on this front. Nor do we hold out that the Twindex is perfection. It is an imperfect model."
What about the claim that Twindex is reflecting American political sentiment? Only about 30 percent of Twitter's user base is American, Sharp says, but "obviously, U.S. users are much more vocal about the campaign than non-U.S. users."
And how do we know that a tweet mentioning Obama or Romney was written by a voter? "We are essentially taking the hypothesis that someone who is actively tweeting about the election is more likely to be a voter than not," he answered. The fact that Obama's Twindex number seems to move in relatively close correlation to his approval numbers in the Gallup Poll, Sharp says, also suggests that the metric is reflecting something useful about public attitudes.
Marc Smith, founder of the Social Media Research Foundation, is less sanguine. "I think that this is more horse race stuff with limited methodological validity," he told me in an email this afternoon.
The idea that the metric claims to represent the percentage of tweets that are more or less negative than tweets about that person is hard to comprehend. I propose that simple alternatives would have more validity and comprehensibility. Raw counts of tweets with each candidate's name are a simple and easy-to-understand metric. A list of the ten terms most often associated with the candidates' names would also be informative (and better express the "sentiment" than a percentage score).
I continue to find "sentiment" metrics hard to believe, understand, and compare to others' metrics. I think it is good to see this kind of data in use and I appreciate the interest in social media as a novel source of insights. But I do think this data is oversold.
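The two alternatives Smith proposes are both trivial to compute. A minimal sketch, with invented tweets and an invented stopword list standing in for a real corpus:

```python
# Sketch of Smith's two proposed alternatives: raw mention counts per
# candidate, and the terms most often co-occurring with a candidate's
# name. The sample tweets and stopword list are illustrative only.

from collections import Counter

tweets = [
    "obama wins debate on economy",
    "romney attacks obama on economy",
    "romney plan for taxes",
]
STOPWORDS = {"on", "for", "the", "a"}

def mention_count(candidate):
    """Raw count of tweets containing the candidate's name."""
    return sum(candidate in t.split() for t in tweets)

def top_terms(candidate, n=10):
    """Most frequent non-stopword terms co-occurring with the name."""
    counts = Counter(
        w for t in tweets if candidate in t.split()
        for w in t.split() if w != candidate and w not in STOPWORDS
    )
    return counts.most_common(n)

print(mention_count("obama"))   # 2
print(top_terms("obama", 3))
```

Both outputs are self-explanatory in a way a percentile score is not, which is presumably Smith's point.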
Smith also warned that the Twindex could be gamed, even though the number of tweets being analyzed on a daily basis could number in the millions.
"As it stands, this metric does not recognize the fact that Twitter is factionalized. If people who tweet #tcot post negative things about Obama, is that news? Sub-groups with a partisan view can now ramp up negative content creation in an effort to move this metric. If we normalized these metrics based on which group the tweet came from, we could see when people who had not been negative become negative. Just measuring the volume of negative seems to miss the idea that determined interest groups can generate more negative on demand. This is different from seeing a shift among positive or neutral people to a negative stance."
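Smith's normalization idea can be sketched concretely: track mean sentiment per faction, and flag only the groups whose sentiment actually moved, so a burst of negativity from an already-negative group doesn't masquerade as a shift in opinion. The group names and scores below are invented for illustration:

```python
# Sketch of Smith's per-group normalization idea. Faction labels and
# sentiment values are invented; a real system would need some way of
# assigning tweeters to groups (itself a hard problem).

baseline = {"tcot": -0.6, "p2": 0.4, "unaffiliated": 0.1}   # yesterday
today    = {"tcot": -0.9, "p2": 0.4, "unaffiliated": -0.2}  # today

def shifts(baseline, today, threshold=0.2):
    """Flag groups whose mean sentiment moved by more than `threshold`."""
    return {g: today[g] - baseline[g]
            for g in baseline
            if abs(today[g] - baseline[g]) > threshold}

print(shifts(baseline, today))  # flags "tcot" and "unaffiliated"
```

In this toy data, the #tcot drop is the "more negative on demand" case Smith dismisses, while the unaffiliated drop is the genuinely newsworthy shift — the raw-volume Twindex would count them identically.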
To be fair, Sharp is careful to emphasize that this is a work in progress. "This is a young, three-hour-old experiment, an imperfect one that we'll keep trying to make stronger, based on a public data set that we'll invite others to take a stab at."
"None of us are claiming to have cracked the code."