A discussion with William Cukierski from Kaggle and Scott Hendrickson from Twitter.
This discussion on data science kicked off with the topic of how the field has changed during the last 12 months. Scott started off by talking about the emergence of data science bootcamps and training programs that take people from other scientific disciplines and turn them into data scientists. He discussed the pros and cons of this trend: on the plus side, the attention on new training will raise the bar of what it means to be a data scientist; on the minus side, the proliferation of programs will make people question what it means to be a data scientist.
Will has noticed a shift in the last 12 months towards deep learning and representation learning. He described these practices as the creation of algorithms that let the computer understand more of the algorithm and develop its own learning without the need for constant human input and refinement. Using the example of a stump on stage, he pointed out that with machine learning a human teaches the computer what the features of the stump are, but in deep learning the computer gets thousands of pictures of stumps and finds the commonalities to create the features itself. He then went on to talk about how deep learning could allow computers to understand the difference between a stump and a table once a picture of a stump with a cup on it was used as an input.
Will gave a real example of deep learning from Kaggle, where Microsoft held a contest. The challenge was to correctly identify whether a picture was of a cat or a dog, using a set of 15,000 pictures of cats and 15,000 pictures of dogs of varying breeds that Microsoft had picked out. Within one week someone had reached 99% accuracy using a deep learning library, and in time got closer to 100% accuracy.
Scott talked about the balance of human input in the analysis process. He mentioned a project he worked on where they analyzed conversations happening on Disqus around the topic of texting and driving. When they dug into the content, they found that conversations originally about texting and driving had shifted to topics such as drunk driving, teens driving, fake IDs, buses, bikes and more. With human input, he said, natural language processing can learn more about human meaning from text, resulting in significant improvements in algorithms.
He also talked about the need for scientific literacy to help other people better understand what data science is and what is happening in the analysis. According to Scott, some data scientists like to just show results and let people think that they worked some kind of magic, but he would prefer that they show their work and explain how they got the result, to foster better understanding of both the analysis and the profession.
Will agreed that there needs to be more explanation of what is being done, mentioning that there is a lot of smoke and mirrors around what data science is and what it can be used for. He mentioned that it is still very difficult to find, in real time, the signal from a single Tweet that could change the world. But because data scientists have gone back and looked at Tweets that did change the world, people think that it is possible.
Conversations from the panel also covered the topic of the term “Big Data.” Will said that very few people are actually doing big data, mentioning companies like Twitter, Google, and “other web-scale companies.” He went on to say that big data is really more about the model than the amount of data, and defined it by the ability to create a model that gets better as more data is fed in. Scott brought up how the term is poorly defined. While some people like to believe it is about the amount of data being consumed, he agreed with Will that it is about the model, and added that “big” really applies to complexity rather than the amount of data.
This led into a discussion on how to solve problems with data science. Will started by saying that at Kaggle they want to make sure that a person coming to them has thought about the problem they want to solve, whether it is a solvable problem, and whether they have the data to solve it. Only with those items in place can the challenge be put out to the crowd to try to solve it with data science.
Scott talked about cycle times on projects and how to iterate in data science to get projects accomplished. He said it usually takes around 10-15 cycles of asking the questions, getting the data, putting it out to a larger group of stakeholders, trying to find the answer, and then iterating on those steps to finish a project.
The session wrapped up with a conversation about experimentation and productization. Both speakers agreed that the role of the data scientist within the organization is to answer a business question and to productize their process, even if only for the internal customer. Because of this, experimentation happens within an existing business priority, as opposed to the data scientist selecting experiments to run at random.