Government Use of Social Data


“I know there was just a state-wide internet outage here; for the record, I had nothing to do with that.”

That was the opening line from Andrew Hallman, Deputy Director of CIA for Digital Innovation.  That level of self-awareness and an understanding of the biases typically held against the Intelligence Community would prove to be not just a central theme of this session, but also an effective way to challenge those listening to keep their minds open.

Hallman described his particular role as one whose inception was a natural extension of an agency-wide modernization in 2015. Much like brands and other organizations, CIA realized that it was not optimized to deal with the growing complexity of the global threat landscape. Hallman’s job was to come in and unify efforts and help digitally remaster the business of intelligence as the Intelligence Community had always known it.  An organization whose primary focus for many decades had been on nation states had to improve its effectiveness in addressing the growing diffusion of global power and increasingly complex threats that could emanate from anywhere.

The diffusion and fragmentation of the information environment has also challenged the standard methods of intelligence collection, while opening new and more transparent ways to monitor untoward activity. “There is an increasing abundance of propagation of extremist messaging we need to be mindful of,” says Hallman. “It’s one thing to know of a known terrorist in a media space–it’s another thing to be monitoring those would-be or not-yet-known terrorists and watch the evolution of radicalization and advancement of plotting as it happens.” Hallman speaks of taking the lessons from 70 years of social science research about how societal and political instability forms and, leveraging the growing sensory environment, monitoring the emerging conditions of change to anticipate crises that could impact national security and to give policy makers more time and space to formulate options for shaping more favorable outcomes.

Hallman did a lot of mythbusting as well. “There have been volumes written and plenty of cinematography about the CIA that is inaccurate,” he says. He clarifies that the CIA’s mission is not to monitor everyday social media activity for the sake of it, but to look for patterns that reveal emerging threats, including to the safety of innocent life. “If you’re not a bad guy, don’t worry about it. If you are a bad guy, well, then worry about it.” He stresses that good collection of data is fairly similar to how brands should most effectively collect data: the CIA does not assign any value judgment to individual behaviors in social data unless they reveal a move toward radicalization or violent behavior, such as jihadist messaging or plotting attacks.

Twitter has also been effective for the Intelligence Community in terms of providing real-time information from around the world. No one person or agency can have eyes and ears in all places at all times; Twitter users around the world help the Intelligence Community by tweeting news in real time, some of it before, during and after violent events. When users share what’s happening, the CIA is better able to understand the situation on the ground in places where it doesn’t have someone present.

Hallman makes an ask of the Big Boulder community: keep an open mind. There are plenty of misconceptions about the Intelligence Community, but he stresses that they–like the social media community–have a commitment to defending democratic ideals, free speech, respect for human dignity and safety of innocent lives. He notes that there are striking similarities between how the Intelligence Community collects and analyzes data and how brands and organizations do the same. The information may be different, the methods of collection may differ, but ultimately the CIA is trying to do what we all try to do with our collected social data: determine what story it tells and how then we should move forward.

Additional Takeaways:

The CIA focuses on foreign adversaries and the collection of human intelligence; the NSA focuses on the collection of signals intelligence and communication intercepts; the FBI focuses domestically on intelligence collection and law enforcement.

The CIA is seeking to integrate data at the data layer and find meaning in it to provide data-driven insights to policy makers and war fighters.

“Our identities are what the data says we are.” – Hallman

Hallman does not believe we will get to a point where we can precisely predict future behaviors, but does believe the Intelligence Community is getting better at forecasting what the conditions look like before events reach that point.

The Changing Interplay Between News, Government and Society

While we were enjoying Big Boulder 2016, a global event of far-reaching implications took place halfway across the world: Britain, against all polling predictions, voted to leave the European Union. Only five months later, the United States experienced our own upset with the presidential election, resulting in an outcome that even our best data analysts didn’t see coming. The aftermath of these events left a lot of people asking the same question: How were we all so wrong?

Deb Roy, Director of the Laboratory for Social Machines at the MIT Media Lab and Chief Media Scientist at Twitter, was already uniquely set up to study this very question. Using the collective power of the MIT Media Lab professionals, the team began collecting data in the months leading up to the election, attempting to determine the outcome. When those results were flagrantly different than anticipated, the team wanted to know why.

From August 2015 to US Election Day in November 2016, Roy’s team documented one billion tweets discussing the election specifically. They sought to map out what they called the “horse race of ideas,” filtering tweets through their deep learning network and separating them into topic classifications. Yes, the social media public was talking about the election online–but what do we talk about when we talk about elections? The deep learning network would “read” news sources across the political spectrum, from Huffington Post to Breitbart, and then “listen in” to Twitter users as they chatted about those same conversation topics.

As the tweets were mined, the network narrowed the list of topics down to 19 particular conversations ranging from gun control to immigration to education and race. Using these tweets, the MIT Media Lab constructed debate briefs for presidential debate moderators leading up to the election, allowing moderators to comb through and select questions that would matter most to the American people.
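To make "separating tweets into topic classifications" concrete, here is a minimal Python sketch. It is not the Media Lab's method: their network learned its topics from news text, whereas this stand-in uses hand-written keyword lists. The keyword sets, function names, and sample tweets are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword lists; a real system would learn these from news text.
TOPIC_KEYWORDS = {
    "gun control": {"gun", "guns", "firearm", "nra"},
    "immigration": {"immigration", "border", "visa", "refugee"},
    "education": {"school", "schools", "college", "tuition"},
}

def classify_tweet(text):
    """Return the set of topics whose keywords appear in the tweet."""
    words = {w.strip(".,!?#@\"'").lower() for w in text.split()}
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}

def topic_counts(tweets):
    """Count how many tweets touch each topic (a tweet may touch several)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(classify_tweet(tweet))
    return counts

# Made-up sample tweets, for illustration only.
sample = [
    "Tuition is out of control at every college",
    "Secure the border now!",
    "New gun laws and new border rules tonight",
]
counts = topic_counts(sample)
```

Counting topics per tweet rather than assigning a single label lets one tweet contribute to several conversations, which matches how election talk tends to overlap.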

The problem, they would eventually find, is that the topics that were the most important to American citizens were not necessarily the topics that were being discussed in common news coverage. Even more challenging was the discovery that users had a tendency to read only what they agreed with, following primarily their candidate of choice and, perhaps unwittingly, committing themselves to an insular “tribe,” as Roy called it.

With regard to media coverage, Roy made two important observations surrounding the role that journalists played in election influence. First, per a Pew study that year, more than 70% of Americans got their news from television. Whether the media loved or hated soon-to-be President Trump, he was the candidate who easily dominated television coverage across outlets, partisan or not. With 70% of potential voters taking in a 24-hour news cycle disproportionately covering one candidate, it’s not hard to understand how that candidate stayed top-of-mind for many.


Roy also noted that conversation is able to destroy brands and people, much like the conversation surrounding the very public snafus of brands like United Airlines or Pepsi…but the conversation is still centered around those brands or people. The old adage goes: there is no such thing as bad press.

The second major observation Roy discussed was the disparity between the topics covered by media outlets and the topics that social media users seemed to care about. For instance, in the weeks leading up to the Vice Presidential debate, over 30% of media coverage revolved around the VP candidates; on Twitter, only 3% of conversations concerned these candidates at all. Additionally, the data showed a major divergence between the topics discussed by journalists online and those discussed by the general public: while campaign finance was heavily covered by the media, users online seemed uninterested in those talking points, preferring to focus on conversations about race.
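The comparison behind figures like "30% of coverage versus 3% of conversation" is simple share arithmetic. The Python sketch below is an illustration, not the Media Lab's pipeline. Only the 30% and 3% values come from the talk; every other count, and the function names, are invented for the example.

```python
def topic_share(counts):
    """Convert raw topic counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {topic: 0.0 for topic in counts}
    return {topic: n / total for topic, n in counts.items()}

def share_gap(media_counts, twitter_counts):
    """Return (topic, media share minus Twitter share), largest gap first."""
    media = topic_share(media_counts)
    twitter = topic_share(twitter_counts)
    topics = set(media) | set(twitter)
    gaps = {t: media.get(t, 0.0) - twitter.get(t, 0.0) for t in topics}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Counts per 100 items. Only the VP figures (30 and 3) reflect the talk;
# the remaining numbers are made up to complete the example.
media_counts = {"vp candidates": 30, "campaign finance": 25, "race": 10, "other": 35}
twitter_counts = {"vp candidates": 3, "campaign finance": 5, "race": 32, "other": 60}

gaps = share_gap(media_counts, twitter_counts)
```

A positive gap means the media covered the topic more heavily than Twitter users discussed it; a negative gap means the reverse.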

So what does it all mean for the future of media coverage as it relates to the public?

First, we must start bridging the gap between journalists, pollsters, and the general constituency. Roy noted that 80% of journalists live in one of three major cities in America, not leaving much room for the topics that rural Americans care most about. “The data shows that people in rural Wisconsin don’t care about Russia,” he remarked. “They care about local issues that affect them.” Roy and the Media Lab are working on ways to show journalists and pollsters their own research bias through network data visualization. Additionally, they are beginning pilot programs to build networks of influencers in rural America whose voices are respected by citizens in those areas, and need to be heard.

Another point of interest is exposing social media users to their own “tribe,” and the information bias to which they leave themselves vulnerable. The Media Lab quietly released a Chrome extension in 2017 called FlipFeed, which allows Twitter users to see into the feed of a user completely unlike them. The user experience feels like ChatRoulette: a user can click “Flip My Feed” on their Twitter interface and the extension, using social listening tools to determine what tribe that user may belong to, “flips” to the feed of a Twitter user in a distinctly different tribe. The extension gives users a view into worlds they could otherwise ignore; ignoring them would only widen the gaps between groups of people and political ideologies.
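The "flip" step can be pictured with a few lines of Python. How FlipFeed actually infers tribes and chooses a feed is not described here, so this sketch assumes a tribe label already exists for every user. The `flip_feed` function, the user names, and the tribe labels are all hypothetical.

```python
import random

def flip_feed(user, tribe_of, rng=random):
    """Pick a user from a different tribe than `user`, or None if none exists.

    `tribe_of` maps user name to tribe label; producing those labels is the
    hard part and is assumed to have been done elsewhere.
    """
    own_tribe = tribe_of[user]
    candidates = sorted(u for u, tribe in tribe_of.items() if tribe != own_tribe)
    if not candidates:
        return None
    return rng.choice(candidates)

# Made-up users and tribe labels, for illustration only.
tribe_of = {
    "alice": "tribe_a",
    "bob": "tribe_a",
    "carol": "tribe_b",
    "dave": "tribe_c",
}
flipped = flip_feed("alice", tribe_of, random.Random(0))
```

Passing in a seeded `random.Random` makes the choice repeatable for testing; in interactive use the default module-level generator would give a different feed on each click.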

Finally, the question remains: how now should news be created and affected? Should journalists lead by following, listening to social media users and the topics that are important to them, and creating story topics from those conversations? Do we need to retire the old way of news creation in favor of listening to the data in front of our faces? Most importantly, what will this take and how will it affect publishers?

What we do know is that post-game analysis shows us the data we didn’t know existed, and we can’t afford to ignore it any longer.

Additional Takeaways:

  • Social conversation online is evolving so rapidly that data collectors have a hard time keeping track of which topics are relevant; the accuracy of a topic list tends to decrease drastically within a two-month period.
  • Polling data differed starkly from Twitter conversations as well: the Media Lab found that Twitter users were very interested in foreign policy, while polling data indicated the opposite.
  • The MIT Media Lab team found that if Twitter users followed only one of the 19 early primary candidates, those users also tended to lean toward voting for that candidate.
  • Surprisingly, most “Sanders tribes” also had two-way connections with “Trump tribes.”
  • Through looking at network data visualization, it was discovered that most journalists tended to follow Twitter users connected to nearly every tribe but the “Trump tribes.”
  • Citizens have begun to tire of “mainstream media agendas,” Roy notes. “There’s a real feeling that, my God, we’re being manipulated here.” When this happens, citizens simply create “new” media, giving credibility to previously discounted outlets and giving rise to “fake news” phenomena.
  • There were more people who followed both Trump and Clinton than just one or the other. Determining language and intent played a big part in parsing out this demographic: for instance, did they use the word “illegals” or “undocumented”?
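The last takeaway combines two simple operations: a set overlap between follower lists and a check on word choice. The Python sketch below shows both under stated assumptions. The account names are made up, the function names are hypothetical, and the Media Lab's actual analysis of language and intent was certainly richer than a two-word lookup.

```python
def follow_breakdown(trump_followers, clinton_followers):
    """Count accounts following both candidates versus only one."""
    both = trump_followers & clinton_followers
    return {
        "both": len(both),
        "trump_only": len(trump_followers - clinton_followers),
        "clinton_only": len(clinton_followers - trump_followers),
    }

def immigration_term(text):
    """Return the marker word a tweet uses, or None if it uses neither or both."""
    words = {w.strip(".,!?\"'#").lower() for w in text.split()}
    found = words & {"illegals", "undocumented"}
    return found.pop() if len(found) == 1 else None

# Made-up account names, for illustration only.
trump = {"a", "b", "c", "d"}
clinton = {"b", "c", "d", "e"}
breakdown = follow_breakdown(trump, clinton)
```

In this toy data the "both" group is larger than either single-candidate group, mirroring the finding above, which is why a second signal such as word choice is needed to tell the overlapping followers apart.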


Data Science and Social Data

A discussion with William Cukierski from Kaggle and Scott Hendrickson from Twitter. 

This discussion on data science kicked off with the topic of how the field has changed during the last 12 months. Scott started off by talking about the emergence of data science bootcamps and training programs that take people from other scientific disciplines and make them into data scientists. He discussed the pros and cons associated with this: the pro is that the attention on new training will raise the bar of what it means to be a data scientist, while the con is that the proliferation will make people question what it means to be a data scientist.

Will has noticed a shift in the last 12 months towards deep learning and representation learning. He described these practices as the creation of algorithms that let the computer understand more of the algorithm and develop its own learning without the need for constant human input and refinement. Using the example of a stump on stage, he pointed out that with traditional machine learning a human teaches the computer what the features of the stump are, but in deep learning the computer gets thousands of pictures of stumps and finds the commonalities to create the features itself. He then went on to talk about how deep learning could allow computers to understand the difference between a stump and a table once a picture of a stump with a cup on it was used as an input.

Will gave a real example of deep learning from Kaggle, where Microsoft held a contest. The challenge was to correctly identify whether a picture was of a cat or a dog, from a set of 15,000 pictures of cats and 15,000 pictures of dogs of varying breeds that Microsoft had picked out. Within one week someone had 99% accuracy using a deep-learning library and in time got closer to 100% accuracy.

Scott talked about the balance of human input in the analysis process. He mentioned a project he worked on where they analyzed conversations happening on Disqus around the topic of texting and driving. When they dug into the content, they found conversations originally about texting and driving had shifted to topics such as drunk driving, teens driving, fake IDs, buses, bikes and other topics. With human input, he said, natural language processing can learn more about human meaning from text, resulting in significant improvements in algorithms.

He also talked about the need for scientific literacy to help other people better understand what data science is and what is happening in the analysis. According to Scott, some data scientists like to just show results and let people think that they worked some kind of magic but he would prefer that they show their work and explain how they got the result to foster better understanding of the analysis and the profession.

Will agreed that there needs to be more explanation of what is being done, mentioning that there is a lot of smoke and mirrors around what data science is and what it can be used for. He mentioned that there is still great difficulty in finding the signal from a single Tweet in real-time that could change the world. But because of the way data scientists have gone back and looked at a Tweet that did change the world, people think that it is possible.

Conversations from the panel also covered the topic of the term “Big Data.” Will said that very few people are actually doing big data, mentioning companies like Twitter, Google, and “other web-scale companies.” He went on to say that big data was really more about the model than the amount of data, and defined it by the ability to create a model that got better as more data is inputted. Scott brought up how the term is poorly defined and while some people like to believe it is about the amount of data being consumed, he agreed with Will that it is about the model and added that big really applies to complexity rather than the amount of data.
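The idea of "a model that gets better as more data is inputted" can be seen even in a toy setting. The Python sketch below is an illustration only and does not come from the panel: it uses a one-dimensional nearest-neighbor classifier on synthetic, evenly spaced points, with a hypothetical decision boundary at 0.37. Accuracy rises as the number of training points grows.

```python
def nearest_label(x, train):
    """Predict the label of the training point closest to x (1-nearest-neighbor)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(n_train, n_test=1000, boundary=0.37):
    """Accuracy of 1-NN trained on n_train evenly spaced points in [0, 1].

    The true label of a point is simply whether it lies at or above `boundary`;
    the boundary value is an arbitrary choice for this synthetic example.
    """
    train = []
    for i in range(n_train):
        x = (i + 0.5) / n_train
        train.append((x, x >= boundary))
    test_points = [(j + 0.5) / n_test for j in range(n_test)]
    correct = sum(nearest_label(x, train) == (x >= boundary) for x in test_points)
    return correct / n_test

# Same model, increasing amounts of training data.
scores = {n: accuracy(n) for n in (2, 10, 100)}
```

The model itself never changes here; only the amount of data does, which is the sense in which the panel tied "big data" to the model rather than to raw volume.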

This led into a discussion on how to solve problems with data science. Will started by saying that at Kaggle they want to make sure that a person coming to them thinks about the problem they want to solve, whether it is a solvable problem, and whether they have the data to solve it. Only with those items can the challenge be put out to the crowd to try and use data science to solve a problem.

Scott talked about cycle times on projects and how to iterate in data science to get projects accomplished. He said it usually takes around 10-15 cycles of asking the questions, getting the data, putting it out to a larger group of stakeholders, trying to find the answer, and then iterating on those steps to finish a project.

The session wrapped with a conversation about experimentation and productization. Both speakers agreed that the role of the data scientist within the organization is to answer a business question and to productize their process, even if only for the internal customer. Because of this, experimentation happens within an existing business priority, as opposed to the data scientist selecting experiments to run at random.