“X” Marks the Spot

Eric Gundersen, Javier de la Torre, Sean Gorman, & Francesco D’Orazio discuss social data and geo.

We are at the point in social where there is a shift happening in the way we look at data. Literally. Data visualization is becoming more necessary and more powerful. This panel showcased how maps can simplify analysis and make findings more visual, and, more importantly, how they make it easier to separate the signal from the noise in social data.

Sean Gorman from Timbr explained that his company focuses on the backend of data analysis and visualization. Timbr is a platform for enabling algorithmic orchestration with social data, which he paraphrased as giving people the tools to structure and clean social data so it can be easily enriched and bound to other datasets. Among the location enrichments he mentioned were friend-of-friend triangulation, finding locations in the text of Tweets, Gnip’s Profile Geo, and other ways to append location to social activities. He also pointed out that standard dashboard analytics providers want to add maps but don’t have the resources to build them, so they use Timbr.

Sean spoke about how anomaly detection algorithms, along with other algorithms published on GitHub and elsewhere, can be pulled into Timbr and customized as part of the map-making process. That process includes live iteration from code to maps for quick visualizations.
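
To make the idea concrete, here is a minimal sketch (not Timbr’s actual code) of the kind of anomaly detection that might be applied to a stream of geotagged Tweets: flag hours whose Tweet volume deviates sharply from a rolling baseline. The file name, column names, window size, and threshold are all assumptions for illustration.

```python
# Hypothetical sketch, not Timbr's implementation: flag unusual spikes in
# hourly Tweet counts using a rolling z-score.
import pandas as pd

def flag_anomalies(counts: pd.Series, window: int = 24, threshold: float = 3.0) -> pd.Series:
    """Mark hours whose count is more than `threshold` standard deviations
    away from the rolling mean of the previous `window` hours."""
    rolling_mean = counts.rolling(window, min_periods=window).mean()
    rolling_std = counts.rolling(window, min_periods=window).std()
    z_scores = (counts - rolling_mean) / rolling_std
    return z_scores.abs() > threshold

# Assumed input: a file of Tweets with a 'created_at' timestamp column.
tweets = pd.read_json("geotagged_tweets.json")
tweets["created_at"] = pd.to_datetime(tweets["created_at"])
hourly_counts = tweets.set_index("created_at").resample("1H").size()

anomalous_hours = hourly_counts[flag_anomalies(hourly_counts)]
print(anomalous_hours)
```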

All of the panelists on stage are working to lower the barrier to building custom-tailored analytics for the questions people want to ask of social data.

Part of custom-tailored analytics is the visualization aspect, which several panelists touched on at different times. Eric from Mapbox showcased their product, a platform that lets designers and developers build custom maps into their apps. Companies such as Foursquare, Pinterest, and the Financial Times use Mapbox to display their data. Eric showed a number of maps made in conjunction with Gnip that plot 3 billion geotagged Tweets, and pointed out that the most amazing part of these visualizations is that there is no map behind the dots; the data itself creates the map.

Mobile Devices on Twitter in Washington DC

Eric pointed out the analysis that can come from these types of visualizations. You can see the economic disparity in cities by looking at the regions where people post from an iPhone versus an Android device. You can see country-level buying trends, such as heavy BlackBerry usage in Malaysia that shows up almost nowhere else.
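
As a rough illustration of how a dot map like this can be drawn with nothing behind the points, here is a minimal sketch (the file name, field names, and colors are assumptions, and this is not Mapbox’s pipeline): every geotagged Tweet becomes a tiny colored dot, and the dots alone trace out the city.

```python
# Minimal sketch with assumed inputs: plot geotagged Tweets as colored dots
# with no basemap underneath -- the points themselves form the map.
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("geotagged_tweets.csv")  # expects 'lon', 'lat', 'device' columns
colors = {"iPhone": "#e8443a", "Android": "#3ab06e", "BlackBerry": "#7a3ab0"}

fig, ax = plt.subplots(figsize=(12, 8), facecolor="black")
ax.set_facecolor("black")
for device, group in tweets.groupby("device"):
    ax.scatter(group["lon"], group["lat"], s=0.05, alpha=0.3, linewidths=0,
               c=colors.get(device, "white"), label=device)

ax.set_aspect("equal")
ax.axis("off")  # no gridlines, borders, or underlying map tiles
ax.legend(markerscale=50, loc="lower right")
plt.savefig("device_map.png", dpi=300, bbox_inches="tight")
```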

The panel also discussed how low opt-in rates for geolocation are an issue for creating great analyses. Only 1-2% of people opt in to share their location along with their Tweets. Those who do share their location end up speaking for a much larger population when the analysis is done through a map visualization, which creates a bias. Sean mentioned a preliminary study of 100,000 users showing that the portion of Twitter that shares location skews toward a higher proportion of African Americans, a lower age, a higher proportion of renters, and a smaller household size than the general population. They are expanding the study to 1 million users.

Javier from CartoDB showcased a few visualizations he made in conjunction with Twitter that show both place and time in a single visualization for millions of Tweets. He talked about how social activities happen in a place, which is important to see, but also at a time, which adds further context to analysis. These visualizations let you explore the connection of time and place to understand how an idea spreads on social; he used Beyoncé’s surprise album release to show how Twitter “explodes” with news. Ideally, for Javier, you shouldn’t have to be a designer or developer to tell these types of stories in powerful ways.

Sunrises vs Sunsets on Twitter

The panel also discussed how much conversation is dedicated to data analysis compared with how little is said about data visualization. Javier said that investing in analytics but not visualization is like having lots of power without any control.

Francesco from Pulsar talked about how they like to add another filter, audience intelligence, to their analysis. What people on Twitter are saying about Coca-Cola is interesting, he noted, but what moms are saying about Coca-Cola is more useful. Adding this extra layer gives the analysis valuable context.

He showed a map of people complaining about bad cell phone signal. A mobile carrier knows that when a network goes down it doesn’t happen in an instant but rather as a series of slow failures. Mapping the complaints lets the carrier see where the network is failing and do something about it, which is a great example of social data being used in an engineering case, not just for marketing or PR.

Conversation turned to geo outside of the USA and on platforms besides Twitter. Sean mentioned that 10% of Sina Weibo activities have location attached, likely because mobile phones and the internet emerged together in China and Weibo is a mobile-first app. This led to a discussion on how to incentivize people to share their location more. While issues like privacy were touched on, the panelists agreed that the biggest challenge in getting people to share location is the lack of a clear benefit to the consumer for doing so. Right now the conversation is focused on how the industry can benefit from the resulting geo data, not on how the user can benefit from sharing it.

Throughout the panel the panelists noted that there are lots of challenges in the social geo space, but also lots of opportunity. Javier said the next 12-18 months should be incredibly interesting, as he touched on ideas such as the ability to control zoom and speed on maps that have both time and place dimensions.

 

Driving Actionable Insights

Brian Melinat from Dell and Justin DeGraaf from Coca-Cola discuss non-marketing use cases for social data.

The next panel was hosted by Susan Etlinger of Altimeter Group and paired Brian Melinat from Dell with Justin DeGraaf of Coca-Cola as they discussed the use of social data throughout the enterprise. While the marketing department is often the initial group clamouring for social data in many businesses, both Dell and Coke have found valuable applications elsewhere in their respective organizations. Susan led a lively discussion that explored how social media data can be put to use in supply chain management, customer support, product development and other business units.

Brian shared that Dell’s use of social data tracks back to 2006 and the monitoring of a blog created by a customer who had a less than satisfactory experience. In addition to improving customer support by listening to social channels, the company also used social content to shape (and adjust) the launch of a recent ultrabook product. While the ultrabook’s specs and OS directly addressed the needs of a unique audience, Dell also learned that it was priced too high upon release. Literally overnight, they used social inputs to adjust the price of the device back to parity with a similar Windows-based machine, and audience sentiment trended positive immediately thereafter.

Justin talked about how Coke’s public affairs, Knowledge and Insights (K&I) and HR teams were all extremely interested in the use of social data to improve their business decisions. One example that he shared was how the K&I team was using social inputs to add more context and relevance to internal Coke performance reports, making them more “normal”. He went on to talk about how the packaging team learned that a seasonal can featuring the famous polar bears was generating customer confusion and frustration. Thanks to social data monitoring, they learned that the reversed red and white colorway on the can made people think that they were actually opening a can of Diet Coke versus the regular flavor. Lesson learned – you don’t want to mess with a fan base as loyal as Diet Coke’s!

To wrap, Susan asked them both what was on their “wish list” when it came to future uses of social data. Brian wanted to see stronger correlation between social identities and actual user emails, enabling Dell to create incredible new lead generation activities. It would also allow Dell to better monitor the full 360-degree view of the customer and their specific product needs. Justin spoke about wanting to find more explicit causal relationships with their investments in social. It can be difficult to measure the impact that social has on brand awareness when a product like Coke already has 100% awareness. Justin closed by challenging the audience to use more concrete use cases when demonstrating the business application of their products… eye candy isn’t enough. This is certainly something vendors should take to heart when pitching major brands like Coca-Cola and Dell in the future!

Social Data + Financial Markets

Leigh Drogen, Igor Gonta and Jon Najarian discuss the growing role of social data in financial markets.

In today’s world, news is breaking faster on social media than anywhere else. These outlets have become the first point of reference for marketers, brand managers, and crisis responders tracking major shifts in their industries – and those in the finance space are listening in too. When the NASDAQ takes a sudden dip, financial professionals need to know about it fast, monitor the conversation around the behavior, then make timely adjustments to maximize the opportunity or, in some cases, cut losses. Big Boulder’s social data and financial markets panel proved there is much excitement and anticipation around social data and its application to the finance world. Industry experts Leigh Drogen of Estimize, Igor Gonta of Market Prophit, and Jon Najarian of optionMONSTER.com provided lively conversation around the current and future prospects for the industry.

In some regards, the finance industry’s integration of social data into trading is still in its infancy. Drogen, who has pioneered many of the platforms for social finance, noted that the industry is moving into a period of structured platforms. Previously, financial professionals often struggled with vast amounts of disorganized data. Companies like Thomson Reuters and FactSet are taking social philosophies and bringing them to finance. Structured financial data means professionals can take a deeper look at the conversation and identify who is predicting a market change and who is reacting to it. By separating the predictors from the reactors, analysts remove the noise from the crowd, resulting in a signal… and a competitive edge. When identified in a timely manner (and in today’s world, that’s down to the second), it can create conditions for arbitrage opportunities and therefore profits.
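
As a toy illustration of the predictor-versus-reactor idea (not any panelist’s actual model), one could score each author by whether their posts about a ticker tend to land shortly before or shortly after known price moves. The timestamps, window, and scoring below are made up for the sketch.

```python
# Toy sketch only: classify authors as "predictors" or "reactors" by whether
# their posts cluster just before or just after significant price moves.
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inputs: (author, timestamp) pairs of posts mentioning a ticker,
# and timestamps of significant price moves for that ticker.
posts = [
    ("@alpha", datetime(2014, 6, 19, 9, 15)),
    ("@beta", datetime(2014, 6, 19, 10, 5)),
    ("@alpha", datetime(2014, 6, 20, 9, 40)),
]
price_moves = [datetime(2014, 6, 19, 10, 0), datetime(2014, 6, 20, 10, 0)]

LEAD_WINDOW = timedelta(hours=1)
scores = defaultdict(lambda: {"predict": 0, "react": 0})

for author, ts in posts:
    for move in price_moves:
        if move - LEAD_WINDOW <= ts < move:
            scores[author]["predict"] += 1  # posted shortly before the move
        elif move <= ts <= move + LEAD_WINDOW:
            scores[author]["react"] += 1    # posted shortly after the move

for author, score in scores.items():
    print(author, score)
```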

As the industry continues to grow and more data is collected and integrated, the financial markets should become more efficient, smoothing the reactionary curve as financial trading professionals arbitrage social data in all aspects of their business.

The Power of the Link

An interview with Mark Josephson from Bitly.

Bitly – what do they do, exactly? As Mark Josephson explained, Bitly takes long links and makes them shorter so they can travel better. The value of this service is validated by the company’s metrics. At present, Bitly works closely with more than 40,000 brands to optimize the performance of the links they share. It works with marketers to encode all of their assets so that the audiences they build correlate to unique profiles. Essentially, Bitly simplifies the ‘stuff’ that exists between the marketer and the customer. Bitly sees 9 billion clicks each month and boasts approximately 800 million unique profiles. In short, as Chris suggested, Bitly is “not link shorteners, [it is] link smarteners.”
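
For readers curious what “making links shorter” involves mechanically, here is a toy sketch of the general technique (a sequential ID encoded in base 62 and mapped back to the long URL). It is not Bitly’s actual algorithm, and the domain in the example is purely illustrative.

```python
# Toy link shortener: store the long URL and hand back a short base-62 code.
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters
_store = {}
_counter = 0

def shorten(long_url: str) -> str:
    """Assign the next sequential ID, encode it in base 62, and map it to the URL."""
    global _counter
    _counter += 1
    n, code = _counter, ""
    while n:
        n, rem = divmod(n, 62)
        code = ALPHABET[rem] + code
    _store[code] = long_url
    return "https://short.example/" + code  # illustrative domain only

def expand(short_url: str) -> str:
    """Resolve a short code back to the original link."""
    return _store[short_url.rsplit("/", 1)[-1]]

short = shorten("https://example.com/a/very/long/campaign/url?src=newsletter")
print(short, "->", expand(short))
```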

While Bitly does not have direct competition, Mark was not shy about acknowledging the importance of having competition. To this end, he believes apathy and ongoing improvement of the Bitly free product are two ‘competitive’ factors that drive change and innovation within the company. Mark is also excited by marketers, such as those at Chipotle and GE, who understand that their brand must consistently be part of their campaigns and intelligently leverage services offered by Bitly.

Driven by its success and demonstrated demand from existing clients, Bitly just this week launched the Bitly Certified Partner Program with six of its partners. Today, Bitly has 75 open APIs and is integrated into over 3,000 different applications, so it is no surprise that Bitly customers asked for recommendations on trusted partners. This new initiative “highlights platforms that integrate Bitly data and functionality while improving security” to allow Bitly customers to “control, measure and optimize” all of their assets.

Changing the World with Social Data (Part 2)

Twitter #DataGrants winners discuss the potential impact their research holds. 

We have already heard from three of the Twitter Data Grant winners about their plans for changing the world. Three more grant winners took the stage today with Stu Shulman to talk about their plans to use social data in innovative ways. This panel echoed much of the previous panel’s academic sentiment, primarily the importance of the data industry working with researchers to make sure the necessary datasets get into their hands so that innovation can thrive.

Today’s data grant winners included Kyonari Ohtake of Japan’s National Institute of Information and Communications Technology (NICT). His plans stem from the massive earthquake and tsunami that hit Japan in 2011: improving disaster response and communication at the government level and making sure the right information gets spread. Tijs van den Broek of the University of Twente has spent the last two years studying online communities and campaigns, such as the Movember campaign, and their real-world impacts. His research mostly centers on whether social media involvement amounts to “slacktivism” or actual engagement with social issues. Finally, James Reade from the University of Reading is an economist interested in how researchers can forecast what happens around sporting events, and how to predict and prevent unpleasant outcomes, such as riots or domestic abuse, that occur after an unfavorable end to a game.

These researchers seemed most interested in the sharing of processes and systems that the industry uses to digest and analyze such large amounts of social data. Although there is a massive amount of social data, and new ways to use it, they don’t necessarily want to reinvent the wheel with how to sift through it. The hope is to do this without losing any relevant data, and to ensure that they are able to differentiate the signal from the noise. In the case of disaster response situations, finding credible social data is critical, because rumors can quickly spread. In life or death scenarios, where people are relying on social data to stay informed about disasters, inaccurate information can be devastating.

The researchers also pointed out that the high volume of Twitter data grant proposals speaks to how broad the need for this data is, and to the need for an academic community, perhaps an extension of the Big Boulder Initiative, that can share insights and best practices and spread the academic value of social data. The panel expressed a hope for more federally funded, social-data-based grants to help make the use of social data more accessible, credible and inclusive.

Bridging the Gap between Data Science and Journalism

An interview with Chris Wiggins from The New York Times.

There’s a lot to be excited about at The New York Times these days: new digital properties, a shift towards unbundled apps, and better hygiene? Scientific hygiene, that is. According to Christopher Wiggins, associate professor of applied mathematics at Columbia University and Chief Data Scientist at The New York Times, “every single field eventually becomes computational” and “that spirit is now happening in journalism.” To Wiggins, good scientific hygiene requires two things:  sharing data sets and creating code that’s reusable and well enough documented that you can explain it to others “including yourself in six months when you’ve totally forgotten about it.” Added Wiggins, “if you believe in science” then you must also believe in reproducibility and “reproducibility breaks down if you only share four pages of prose.”

All of the buzz around data at the Times might feel out of place to some or like the latest, greatest fad to others but Wiggins assured the audience that the Times is getting serious about data in a big way. “Over the last few years, The New York Times has made a real investment in bespoke tracking solutions, making data science possible,” explained Wiggins. “All of the plumbing had to be done first before you could start doing interesting predictive analytics.”

One of the most visible byproducts of the Times’ dive into data is The Upshot – a new property that “presents news, analysis and data visualization about politics and policy.” Not only do The Upshot’s stories rely heavily on data to illuminate pressing issues, but they also invite readers to check out the code and the data sets behind the stories – placing a huge emphasis on open source statistical tools and replicability. According to Wiggins, this focus on transparency is a big deal: “There’s a huge difference between interacting with spatial temporal data (and the way you can interrogate the data) and simply working with counts. The Upshot is doing a great job with that.”

In fact, Wiggins is bullish on the data revolution taking place in newsrooms throughout the industry – pointing to Nate Silver’s FiveThirtyEight and Vox as early leaders in this new world. “It’s a very exciting time for journalism,” declared Wiggins. “Each new product opens up new opportunities and new questions to ask. And normal citizens are really interrogating the data, interrogating the code, and interrogating the interpretations.”

With so much focus on data and even machine learning in the newsroom, some in the audience were left wondering whether the value of the journalist is diminishing before our very eyes. As Wiggins tells it, the opposite is true: “We’re dealing with complex, messy data sets where you can easily fool yourself. The input of domain knowledge is the difference between ‘computer assisted reporting’ and data journalism, and data journalism is rightfully getting a lot of attention.”

Data Science and Social Data

A discussion with William Cukierski from Kaggle and Scott Hendrickson from Twitter. 

This discussion on data science kicked off with the topic of how the field has changed during the last 12 months. Scott started by talking about the emergence of data science bootcamps and training programs that take people from other scientific disciplines and turn them into data scientists. He discussed the pros and cons: the pro being that the attention on new training will raise the bar for what it means to be a data scientist, the con being that the proliferation will make people question what it means to be a data scientist.

Will has noticed a shift in the last 12 months toward deep learning and representation learning. He described these practices as building algorithms that let the computer develop its own representation of the data and learn without constant human input and refinement. Using the example of a stump on stage, he pointed out that with traditional machine learning a human teaches the computer what the features of a stump are, but with deep learning the computer gets thousands of pictures of stumps and finds the commonalities to create the features itself. He went on to talk about how deep learning could let a computer understand the difference between a stump and a table once a picture of a stump with a cup on it was used as an input.

Will gave a real example of deep learning from a Kaggle contest held by Microsoft. The challenge was to correctly identify whether a picture showed a cat or a dog, from a set of 15,000 pictures of cats and 15,000 pictures of dogs of varying breeds that Microsoft had picked out. Within a week someone had reached 99% accuracy using a deep-learning library, and over time got closer to 100% accuracy.
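
To ground the idea, here is a minimal sketch of the kind of deep-learning approach Will described, not the winning Kaggle entry: a small convolutional network that learns its own features from labeled cat and dog images. The directory layout, image size, and hyperparameters are assumptions.

```python
# Minimal illustrative sketch: a small CNN that learns features from images
# rather than relying on hand-engineered ones.
import tensorflow as tf
from tensorflow.keras import layers

train_ds = tf.keras.utils.image_dataset_from_directory(
    "cats_vs_dogs/train",   # assumed folder with 'cat/' and 'dog/' subfolders
    image_size=(128, 128),
    batch_size=32,
)

model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255),              # normalize pixel values
    layers.Conv2D(32, 3, activation="relu"),  # the network learns its own features...
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),  # ...layer by layer
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # cat vs. dog
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```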

Scott talked about the balance of human input in the analysis process. He mentioned a project he worked on analyzing conversations happening on Disqus around the topic of texting and driving. When they dug into the content, they found conversations that started out about texting and driving had shifted to topics such as drunk driving, teens driving, fake IDs, buses, bikes and more. With human input, he said, natural language processing can learn more about human meaning from text, resulting in significant improvements in the algorithms.
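
A small sketch of the kind of topic exploration Scott described (stand-in comments, not the actual Disqus data or pipeline): fit a simple topic model to comment text and print the top words per topic so a human can spot how the conversation drifts away from the original subject.

```python
# Illustrative only: discover topics in a handful of stand-in comments with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "texting while driving should carry a bigger fine",
    "my teen just got her license and I worry about her phone",
    "drunk driving checkpoints actually work in our town",
    "kids use fake IDs to buy alcohol before driving to parties",
    "I bike to work so I mostly worry about distracted drivers",
]  # stand-in data; a real run would use the full comment corpus

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {topic_idx}: {', '.join(top_terms)}")
```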

He also talked about the need for scientific literacy to help other people better understand what data science is and what is happening in the analysis. According to Scott, some data scientists like to just show results and let people think that they worked some kind of magic but he would prefer that they show their work and explain how they got the result to foster better understanding of the analysis and the profession.

Will agreed that there needs to be more explanation of what is being done, mentioning that there is a lot of smoke and mirrors around what data science is and what it can be used for. He mentioned that it is still very difficult to find, in real time, the signal from a single Tweet that could change the world. But because data scientists have gone back and looked at a Tweet that did change the world, people think that it is possible.

The panel also covered the term “Big Data.” Will said that very few people are actually doing big data, mentioning Twitter, Google, and “other web-scale companies.” He went on to say that big data is really more about the model than the amount of data, defining it by the ability to create a model that gets better as more data is added. Scott brought up how poorly defined the term is; while some people believe it is about the amount of data being consumed, he agreed with Will that it is about the model, adding that “big” really applies to complexity rather than the amount of data.
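
Will’s definition is easy to demonstrate in miniature: train the same model on growing slices of a dataset and watch held-out accuracy climb. This is just an illustrative sketch with synthetic data, not anyone’s production setup.

```python
# Illustrative only: a model that "gets better as more data is added."
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Accuracy on the held-out set typically improves as the training slice grows.
for n in (100, 1000, 5000, len(X_train)):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    accuracy = model.score(X_test, y_test)
    print(f"{n:>6} training rows -> held-out accuracy {accuracy:.3f}")
```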

This led to a discussion on how to solve problems with data science. Will started by saying that at Kaggle they want to make sure that a person coming to them has thought about the problem they want to solve, whether it is a solvable problem, and whether they have the data to solve it. Only with those pieces in place can the challenge be put out to the crowd to try to solve with data science.

Scott talked about cycle times on projects and how to iterate in data science to get projects done. He said it usually takes around 10-15 cycles of asking the questions, getting the data, putting it out to a larger group of stakeholders, trying to find the answer, and then iterating on those steps to finish a project.

The session wrapped with a conversation about experimentation and productization. Both speakers agreed that the role of the data scientist within the organization is to answer a business question and to productize the process, even if only for an internal customer. Because of this, experimentation happens within an existing business priority rather than the data scientist selecting experiments to run at random.

Data at Scale

Dmitriy Ryaboy from Twitter discusses the challenges of scale.

Twitter’s fail whale days may be a thing of the past but it’s still fun to reminisce. For Dmitriy Ryaboy, head of Twitter’s 40-person Data Platform team, the 2010 World Cup was a particularly memorable and trying experience: “We would all sit in this one room and watch the game on one monitor and all of the internal Twitter metrics on another screen. Everyone was praying that neither team would score a goal.” Fast forward four years and Dmitriy’s organization – responsible for building all of the tools that Twitter’s data scientists, product managers, and engineers use to collect data inside Twitter – has come a long way, supporting real-time data analysis and powering new data driven products and new ways to provide recommendations to Twitter users.

In describing his experience helping Twitter navigate and outgrow the fail whale years, Dmitriy highlighted two key challenges of scale. The first, managing more and more data, is “reasonably straightforward” and the more “predictable scale problem.” The second challenge – scaling human capital – took Dmitriy “by surprise” and is “way harder than scaling petabytes.” He added, “When you grow to dozens of teams and hundreds of people you need to create systems and services so that data is discoverable and you don’t need to find the right person to get what you need…From the beginning we wanted everyone in the company to be able to access the data that they need to do their job.”

On the technical side, Dmitriy cited Twitter’s shift from Ruby on Rails to JVM, along with his organization’s switch to Scala, as a huge enabler. Dmitriy made the change “so that our engineers were all using the same language” and could access data directly without having to learn specialized technical skills. He also stressed that when building tools for internal usage, “It’s OK to have a product that’s more complex and takes longer to figure out. The tradeoff is that you can answer really hard questions and our focus is on making things that are impossible become possible.” That same approach doesn’t work as well on the consumer side, Dmitriy noted: for small businesses or general users, the questions tend to be “less complicated but what people want is very high SLAs.”

Turning to human capital challenges, Dmitriy focuses on building a culture of continuous learning and hiring people who are really smart, rather than people who are experts at one specific technology “because that technology will likely be outdated within a year.” With Twitter University – the company’s in-house training organization – Twitter has “a very structured way of onboarding people and teaching people new skills.” Dmitriy also touted Twitter’s significant investment in open source technologies: “We use open source technologies, we open source our own stuff, and when we hire new people they are already trained on our technology.”

Dmitriy made one surprising – at least to this correspondent – revelation when asked about the scale of Twitter’s analytic platform: “99.9% of data is stuff other than Tweets and the follow graph. Data related to how people interact with the product, the results of A/B experiments, etc., absolutely dwarf things like the social graph and Tweets.”

At the close of the session, Dmitriy fielded a final question from the audience and left us with this: “Elasticsearch is one technology that I wish we used more of and figured out how to add to our stack.” To the members of the Big Boulder community, we can’t help but ask: who’s working on that?

Dachen Chu from Sina

An interview with Dachen Chu from Sina.

So who, and what, is Sina Weibo exactly? The session kicked off with a description of both Sina and Weibo. In Dachen’s words, and for comparison’s sake, think of Sina as China’s Yahoo and Weibo as China’s Twitter. Adoption of media tools, both social and traditional, has been relatively slow in China, and Weibo was one of the first tools that enabled people to interact through social media. In addition to posting updates, users also create object pages for things like music and movies, and use the e-commerce functionality to sell products.

Who is using Weibo? Close to 90% of the company’s user base is in China and the remaining percentage of users who aren’t based in China are most likely part of the global Chinese community. Weibo does get high-profile English language users, like celebrities, and also built an English language app. The company does have hundreds of global brands engaging on the platform, accessing the giant and growing Chinese market.

So where is Weibo going? Right now, the main focus is on increasing the number of users and user activity. China has a huge population; however, adoption of mobile, a driver of Weibo usage, is slower in China’s lower-tier cities. Partnering with Alibaba, Weibo is building e-commerce functionality and a social-commerce ecosystem. This will continue to play a big part in the company’s revenue growth, another current focus for Weibo. E-commerce is huge in China, driven by a combination of China’s role as a global manufacturer, cheap labor in logistics, and the lack of traditional big-box offline retailers.

Avid, Topic-Driven Communities

Ro Gupta from Disqus and Martin Remy from Automattic

An interview with Ro Gupta from Disqus and Martin Remy from Automattic.

This session saw Chris Moody interviewing Ro Gupta from Disqus and Martin Remy from Automattic about their respective commenting and blogging platforms. Chris’ questions looked into the evolution of their products, both from a technology perspective and in terms of how users interact with each other and with other platforms. The really exciting aspect of their responses was learning just how much reach both Disqus and Automattic have created.

Disqus has surpassed 1 billion monthly unique visitors, a sign that Ro’s audience has certainly moved beyond simple conversations and into organic communities. WordPress, on the other hand, is now the CMS behind 22% of the top 1 million sites on the web. Marty talked about how Automattic has really worked to “humanize” the relationship between users on WordPress.com and has a stronger mobile focus than other platforms.

A common theme that both Ro and Marty commented on in response to Chris’ questions was the idea of proactive content discovery. That is to say, both platforms are working hard to introduce users to new topics and discussion threads that they might not have otherwise found on their own. Part of this strategy is making it easier for users to take their conversations across platforms and mediums, and then back again. WordPress has the Publicize feature, which automatically generates a Tweet anytime a user publishes a new blog post. And Disqus has even taken some of its online discussion communities into the real world with in-person get-togethers focused on avid foodies and gamers.

Both Ro and Martin talked about their platforms’ continued push into international markets, and they don’t see this trend slowing. There is a huge opportunity for international growth for each: Disqus is already available in 51 languages, with 64% of its visitors originating outside of the US, while WordPress has an equally undeniable global presence and continues to add new countries to the list, with especially high interest in Brazil.

All in all, it is readily apparent that long-form data is far from dead, and each platform continues to prove the incredible level of engagement that comes from topic-focused audiences. Whether via comments or blog posts, both platforms continue to create strong stand-alone communities while also enabling users to discover exciting new areas of interest.