"Zendesk's Answer Bot uses deep learning to understand customer queries, responding with relevant knowledge base articles that allow customers to self-serve. Research and development behind the ML models underpinning Answer Bot has been rewarding but punctuated with pivotal deviations from our charted course"
I recently had the opportunity to hear Zendesk data scientists Arwen Griffioen and Chris Hausler talk about their journey from product ideation to launch, starting with a traditional customer-based machine learning approach and ending with a single global deep learning model that serves tens of thousands of accounts.
This was a fantastic talk that gave a really good insight into how the Zendesk Machine Learning team works and what they value. Both Arwen and Chris have research backgrounds, which is always great to see. Arwen has a computer science background and finished her PhD on ecological modelling (using MLA) in 2015. Chris has a computational neuroscience background and finished his PhD in 2014.
Apparently the data team at Zendesk is like a team sport, with a real mix of talent: engineers, software developers and data scientists all working together towards a common goal. I love that any additional training (i.e., deep learning) is done as a team and includes everyone, regardless of their specific role. I’ve heard of data science teams where only the most senior people are allowed to upskill at work and then pass on the knowledge, while the rest have to do it in their own time, which is ludicrous. High-performance teams work when people are encouraged to grow, learn and develop new skills.
The talk began with the anatomy of a data product. I loved their iceberg analogy. While things may appear to be advancing smoothly (at least in press releases, conference talks and shareholder letters), the bulk of the time is really spent researching new methods, trying things that ultimately fail (life would be terribly boring if we had all the answers), designing, testing, and re-engineering.
To build the Answer Bot, the team started out with a fairly simple machine learning model. By simple I mean supervised learning using NLP on a software ticket, with a logistic regression classifier predicting the most relevant help document or article. The assumption was that this would be fairly accurate, performant, familiar and explainable. Because the industry, and therefore the context around tickets, varies widely across Zendesk’s clients, labelled data would need to be provided for each client. And because the Answer Bot learns on the job and improves with more data, you can’t really switch it on from the get-go. For this to work well the team needed to spend a lot of time preprocessing data.
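To make that first approach concrete, here is a minimal sketch of the idea: bag-of-words features from a ticket feeding a logistic (softmax) regression that picks a help article. This is my own toy illustration, not Zendesk's code. The tickets, article IDs and function names are all made up, the bias term is left out to keep it short, and a real system would use a proper library and far more data.

```python
import math
from collections import Counter

# Made-up labelled tickets for ONE hypothetical client: (ticket text, article id).
TRAIN = [
    ("i forgot my password and cannot log in", "reset-password"),
    ("how do i reset my password", "reset-password"),
    ("password reset email never arrived", "reset-password"),
    ("i want my money back for this order", "refund-policy"),
    ("how long does a refund take", "refund-policy"),
    ("refund has not arrived on my card", "refund-policy"),
]

def tokenize(text):
    return text.lower().split()

VOCAB = sorted({w for text, _ in TRAIN for w in tokenize(text)})
INDEX = {w: i for i, w in enumerate(VOCAB)}
LABELS = sorted({label for _, label in TRAIN})

def featurize(text):
    """Bag-of-words counts over the training vocabulary; unseen words are dropped."""
    vec = [0.0] * len(VOCAB)
    for word, count in Counter(tokenize(text)).items():
        if word in INDEX:
            vec[INDEX[word]] = float(count)
    return vec

def probabilities(x, weights):
    """Softmax over one linear score per article."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(epochs=200, lr=0.1):
    """Plain stochastic gradient descent on the cross-entropy loss."""
    weights = [[0.0] * len(VOCAB) for _ in LABELS]
    data = [(featurize(text), LABELS.index(label)) for text, label in TRAIN]
    for _ in range(epochs):
        for x, y in data:
            probs = probabilities(x, weights)
            for k, p in enumerate(probs):
                grad = p - (1.0 if k == y else 0.0)
                for i, x_i in enumerate(x):
                    weights[k][i] -= lr * grad * x_i
    return weights

def predict(text, weights):
    probs = probabilities(featurize(text), weights)
    best = max(range(len(LABELS)), key=lambda k: probs[k])
    return LABELS[best]

WEIGHTS = train()
print(predict("the password reset link is broken", WEIGHTS))
print(predict("still waiting on a refund", WEIGHTS))
```

Notice that everything here (vocabulary, labels, weights) belongs to a single client, which is exactly why this approach needs labelled data per client before it can be switched on.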
The team explored unsupervised classification, using both tickets and articles as inputs, which worked well, except that it would require ~100,000 different models (one for each client) and they take a really long time to train. Part of the reason is that the same words can have a very different meaning depending on the user, and different industries have different sets of words. For example "ticket" may mean an issued ticket, given so that someone can join a queue, or it could be something that is purchased, for example a movie ticket. Answering a question such as "what do I do if I lose my ticket" requires a good understanding of context. If you try to build a single model with all the words in the dictionary, you're going to run out of parameters pretty quickly.
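I don't know exactly which unsupervised method the team used, but one common label-free way to match tickets to articles is TF-IDF with cosine similarity, so here is a small sketch of that as a stand-in. The two knowledge bases, the article IDs and the function names are hypothetical. It shows the context problem nicely: the same question about a lost "ticket" should land on a different article depending on whose knowledge base you search.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, plus the smoothed IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_article(ticket, articles):
    """articles maps article id -> text; returns the id most similar to the ticket."""
    ids = list(articles)
    vectors, idf = tfidf_vectors([articles[i] for i in ids])
    # Words the knowledge base has never seen carry no weight in the query.
    query = {w: tf * idf[w] for w, tf in Counter(tokenize(ticket)).items() if w in idf}
    scores = {i: cosine(query, v) for i, v in zip(ids, vectors)}
    return max(scores, key=scores.get)

# Two made-up clients whose articles use "ticket" in different senses.
CINEMA = {
    "lost-movie-ticket": "lost your movie ticket show your booking email at the box office",
    "snack-menu": "popcorn drinks and snack prices at our cinemas",
}
HELPDESK = {
    "find-support-ticket": "find a lost support ticket by searching your request history",
    "change-plan": "upgrade or downgrade your subscription plan",
}

question = "what do i do if i lose my ticket"
print(best_article(question, CINEMA))
print(best_article(question, HELPDESK))
```

Each client gets its own vocabulary and weights here, which hints at why a per-client setup multiplies into a huge number of models.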
Pivoting to Deep Learning:
This happened quite a few months down the track and came partly out of their journal club. They essentially started from scratch, and this required loads of reading and retraining the whole team. There was a lot of uncertainty and not really knowing what they were doing, but with some knowledge that NLP problems work well with deep learning and that the more data you can throw at it the better. Zendesk has no shortage of data. After the talk I asked Arwen how much she and the team knew about deep learning before coming to Zendesk and her answer was “basically nothing” (I love this company!).
The team split into two groups and tackled various aspects of the problem. I wasn’t surprised to hear they use TensorFlow. I was really pleased to hear Chris say that problem solving is a creative process. That is the mark of a great researcher, and not something you can learn easily.
The initial perceptions of deep learning were that you could develop one robust model, that it would work well, and that the more data you threw at it the better it would work. This is one of my big worries about machine learning and deep learning. Weights are determined as if by magic, loss functions are calculated and "accurate" results are taken as gospel. From my experience with astronomy data, I can tell you right now that if you start with ALL THE CRAPPY DATA you can still get a good fit. After all, you just need to keep adding parameters (seven-dimensional string theory, anyone?). BUT... the result will inevitably be meaningless. Chris summed this up eloquently: "if you put shit in, you're going to get shit out"... or something to that effect. So this is where things get really exciting. This is where you have to go back and figure out each step of the miracle that is deep learning, exploring everything that’s going on and what could be implemented, whether the data introduces unintended biases (it turns out datasets with large numbers of tickets were artificially skewing things), and whether there are overfitting problems (hint: unless you have an underlying physical model there will almost always be overfitting problems).
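The talk didn't go into how the team dealt with the skew from ticket-heavy datasets, so treat the following purely as my own illustration of one common mitigation: capping how many tickets any single account can contribute to the training set. The account names, the cap of 50 and the function name are all made up.

```python
import random
from collections import Counter, defaultdict

def cap_per_account(tickets, max_per_account, seed=0):
    """Keep at most max_per_account randomly chosen tickets from each account."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_account = defaultdict(list)
    for account, text in tickets:
        by_account[account].append((account, text))
    balanced = []
    for rows in by_account.values():
        if len(rows) > max_per_account:
            rows = rng.sample(rows, max_per_account)
        balanced.extend(rows)
    return balanced

# Made-up data: one huge account and two small ones.
tickets = (
    [("big-retailer", f"ticket {i}") for i in range(1000)]
    + [("small-cinema", f"ticket {i}") for i in range(20)]
    + [("tiny-startup", f"ticket {i}") for i in range(5)]
)

balanced = cap_per_account(tickets, max_per_account=50)
print(Counter(account for account, _ in balanced))
```

Without the cap, the big account supplies more than 97% of the training examples. With it, the small accounts at least get a look in. Whether a hard cap, reweighting or something else is the right choice depends on the data.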
Of course the hard work paid off and it sounds like they’ve come up with a bloody good solution. The entire process took six months and it was a good year before the product was considered reliable enough for deployment. They spent quite a lot of time validating the model, developing reliable performance metrics, ensuring consistency, and taking the time to do proper human user testing. I was both surprised and pleased that Zendesk allowed the team to spend so much time researching. Since I’ve never worked at a tech company I’m not sure what would be considered normal, but my impression is that many data science teams are expected to get data analysis results out on pretty short timescales, regardless of data quality.
Lessons from the team:
ML products are really hard work.
“Vanilla” ML works really well. Logistic regression and random forests work really well.
Always start with the simplest model.
Deep learning isn’t magic.
When it finally works, it’s great.