All of us who work with social data know the noise is ever increasing, and the signal-to-noise ratio is not moving in our favor. As the irrelevant baby picture or the 1,000th Trump meme pops up in our social feeds, that micro-dopamine hit of ingesting content from social media may not fire like it used to, and you lose interest. In fact, it is this increasing noise that can cause businesses to become disillusioned or skeptical that they can use social data to achieve their goals.
As these noise levels increase, it’s not hard to recognize that businesses need ever more powerful tools to find those sparkly signal-diamonds in the social rough. The basics of attributing sentiment scores and broad topic categories have outlived their usefulness as competitive advantages and no longer drive significant results on their own.
So, what is the next step for reaching an even higher level of value when tackling the signal to noise problem with your social data? After all, the more accurately you collect contextually relevant social data for your use case, the better your outcome and analysis will be.
You may have thought about writing some of your own rules on top of your dataset to improve its relevance and quality. We’ve all seen the complex Boolean search queries your teams may use in TweetDeck or similar tools. For example, perhaps you want to analyze all social posts about US President Donald Trump. You might create search queries for his handle(s), as well as keywords like “Donald Trump” and “President”. However, the latter keyword will bring back contextually irrelevant noise, compromising your dataset. You might then add heuristics on top to reduce the noise, such as: must mention the keyword “President” but NOT “Obama”. However, these rules soon spiral out of control, fail to scale, or give you misleading results.
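The brittleness of these hand-written rules is easy to demonstrate. Here is a minimal Python sketch of that kind of filter; the keywords, exception lists, and sample posts are all invented for illustration. Notice that the school-club post still sneaks through, so you would need yet another exception, and then another:

```python
# A minimal sketch of the kind of hand-written filter rules described
# above. The keywords and exception lists are illustrative only.

def is_relevant(post: str) -> bool:
    """Naive rule-based filter: keep posts that look like they are
    about Donald Trump, drop obvious false matches."""
    text = post.lower()
    include = ["donald trump", "@realdonaldtrump", "president"]
    exclude = ["obama", "president of the club"]  # exceptions pile up fast

    if not any(term in text for term in include):
        return False
    # Each newly discovered source of noise needs another exception...
    if any(term in text for term in exclude):
        return False
    return True

posts = [
    "President Trump signed a new order today",
    "Throwback photo of President Obama",
    "Donald Trump rally highlights",
    "Our school club elected a new president!",
]
signal = [p for p in posts if is_relevant(p)]
# The club post slips past both lists: rules like these never quite keep up.
```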
The term machine learning was coined in the 1950s, but it has only recently become so prevalent in the tech press that it may leave you feeling left behind. Put simply, machine learning gives the computer the ability to learn from the data provided and predict an outcome for incoming data, like predicting which social posts mentioning the keyword “President” are actually about Donald Trump versus someone else. With massive increases in computing power at ever cheaper costs, improved tool sets and a growing abundance of data, we are living in a golden age for machine learning.
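To make that “President” example concrete, here is a from-scratch Naive Bayes sketch in Python. In practice you would reach for an existing library (scikit-learn, for instance) rather than hand-rolling this, and the tiny training set below is purely illustrative; a real model needs the large labelled dataset discussed later:

```python
# Toy Naive Bayes text classifier: 1 = about Donald Trump (signal),
# 0 = some other "president" (noise). Training posts are invented.
import math
from collections import Counter

train = [
    ("president trump held a press conference", 1),
    ("donald trump tweeted about the economy", 1),
    ("president obama memoir is out", 0),
    ("our company president announced a merger", 0),
]

# Count word frequencies per class.
word_counts = {0: Counter(), 1: Counter()}
class_totals = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_totals[label] += 1

vocab = set(word_counts[0]) | set(word_counts[1])

def predict(text: str) -> int:
    """Return the class with the higher log-posterior score."""
    scores = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        score = math.log(class_totals[label] / len(train))
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("trump press conference scheduled tomorrow"))  # 1 (signal)
print(predict("obama memoir book tour"))                     # 0 (noise)
```

The model learns from the labelled examples which words co-occur with each class, instead of you hand-enumerating every exception.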
So, can machine learning solve all your complex signal to noise problems? Maybe. If you have enough data, patience and time – almost certainly. Here are the steps to get you started:
- Define the desired outcome clearly. Take samples from your dataset and examine them carefully to really understand what is signal and what is noise. Have your team manually tag, say, 20 samples with what they think is the desired outcome. This may sound silly, or you may think the desired outcome is obvious, but I promise you will find that your team (or even your customers) will differ in opinion at this step, especially with any noisy or complex dataset. Getting on the same page about what is actually signal and what is truly noise will save you a lot of pain down the road.
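One simple way to surface those differences of opinion is to have two people tag the same samples and measure how often they agree. A rough sketch, with invented labels (a proper study would use a chance-corrected metric such as Cohen’s kappa):

```python
# Two team members tag the same 10 samples: 1 = signal, 0 = noise.
# The labels below are invented for illustration.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]

matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement = matches / len(annotator_a)
print(f"Raw agreement: {agreement:.0%}")  # 70% - the definition needs work
```

Low agreement is the early warning sign: it means the team has not yet settled what “signal” means, and any model trained on those labels will inherit the confusion.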
- Evaluate if this is actually more than one problem. When you examine the data as part of step 1, does it seem like there are multiple different contributing factors that define noise? For example, are there certain types of accounts you don’t want in your dataset regardless of the content they contribute? If there are multiple different contributing factors, each sub-problem is usually better tackled separately. Define these problems here.
- Don’t re-invent the wheel. Next, understand if any of your problems have already been solved. For example, if you think accounts with profile pictures that aren’t of people are contributing to your noise, don’t go and build a face detection machine learning model, as face detection is a well-solved problem. Instead, use the face detection output from existing technologies, and test it as a feature in the machine learning model unique to your problem.
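In code, “use it as a feature” just means folding the off-the-shelf output into your model’s input vector. A sketch, where the `has_face` flag is assumed to come from some existing face detection API (hypothetical here) and the other fields are illustrative:

```python
# Combine an existing service's output with cheap profile signals into
# a feature vector for your own model. Field names are illustrative.

def build_features(account: dict) -> list:
    """Turn raw account attributes into numeric model features."""
    return [
        1 if account["has_face"] else 0,  # from off-the-shelf face detection
        account["follower_count"],
        account["posts_per_day"],
    ]

features = build_features(
    {"has_face": True, "follower_count": 340, "posts_per_day": 12}
)
```

Your model then learns how much weight, if any, the borrowed signal deserves for your specific noise problem.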
- Get technical. For any of your problems that are not already publicly solved, you now need to pick the best machine learning approach to apply. Unless you’re feeling completely wild, crazy and academic, you will likely not need to invent a new machine learning algorithm; you just need to find the most suitable one for your task and build a model unique to your problem. (You may also now need to gather a large labelled dataset using your clear desired-outcome definition, but that is a separate blog post!) Microsoft has a great cheat sheet to help you through this step:
We know how tempting it is to jump straight to step 4, especially for a team of smart engineers and scientists. At Twizoo, we’ve been solving signal to noise problems with machine learning for years, and have the battle wounds to show first-hand how diligently completing steps 1-4 may save you months of pain and ultimately drive higher precision and accuracy. If you want to talk more about your machine learning problem, or if you want to learn more about how Twizoo can help you mine social media for great user-generated content, feel free to reach out at madeline <at> twizoo <dot> com. See you at Big Boulder 2017!