Keyword Research Using Python, RAKE, and Support Chat Transcripts
Over at WordPress.com, our main avenue of customer support is live chat. We previously were on Olark, but have since built out a chat system that we call HappyChat (support folks are referred to as Happiness Engineers). There are a number of excellent features that the development team has built in, but an often underutilized one is chat tagging. If a user joins a chat and asks about a domain renewal, that chat might be tagged with “domains” or “domain-renewal”; we aren’t very strict on tagging, except in certain circumstances. We can pull out the data we need even if the tagging is a little fuzzy.
While I’ve been writing search marketing and SEO articles for our users for a while, I wanted to know exactly what our users were asking in chat, and what kinds of things we were spending chat time teaching them about. This is where the chat tagging comes in. Most of the Happiness Engineers are pretty diligent about tagging their chats, so it was pretty easy for me to get an SQL dump of all the transcripts tagged with “seo”. However, it ended up being roughly 18,000 lines of chat transcripts over the past few months.
Time for some Python magic.
I exported the chats out of my local MySQL instance into a CSV file. It needed some cleanup before I could really process it. I like Atom, but any decent text editor that supports regex will do. I needed to get this CSV, which held every detail about every chat (including many columns of diagnostic-type data), down to the bare messages. After a number of different regex find-and-replaces in Atom, I finally trimmed it down to about 15,000 lines of pure chat messaging, covering both the user and the Happiness Engineer side of the conversation.
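If you'd rather skip the editor round-trips, the same trimming can be sketched in Python with the standard `csv` module. This is not what I actually did (I used find-and-replace in Atom), and the column name `message` plus the sample rows are made up for illustration; swap in whatever your export calls that column.

```python
import csv
import io


def extract_messages(csv_file, column="message"):
    """Return the non-empty values of one column from an open CSV file.

    `column` is a hypothetical name; use the header from your own export.
    """
    reader = csv.DictReader(csv_file)
    messages = []
    for row in reader:
        text = (row.get(column) or "").strip()
        if text:  # skip rows with no message text
            messages.append(text)
    return messages


# Tiny in-memory sample standing in for a real export (made-up rows).
sample = io.StringIO(
    "chat_id,author,browser,message\n"
    "1,user,Firefox,How do I improve my SEO?\n"
    "1,he,Chrome,\"Sure, let's look at your site title first.\"\n"
    "2,user,Safari,\n"
)
messages = extract_messages(sample)
for line in messages:
    print(line)
```

For a real file, you would pass `open("chats.csv", newline="", encoding="utf-8")` instead of the in-memory sample, then write the messages out one per line.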
Ok, now it’s time for some Python magic.
There’s probably a better way to do this with R, but I already somewhat know Python, so that was my main tool of choice. I’d dabbled with nltk before, but all I needed for now was a simple keyword extraction with weighting of the terms. I wanted to know all the related terms, but I wanted to make sure I knew which were occurring frequently. After a bit more research, the RAKE package seemed to be exactly what I needed.
RAKE is pretty simple as far as these things go, but I came across a great tutorial here that walks you through doing an extraction. While you don’t need to get super fancy with it, I ended up adding a fair amount of functionality to allow me to use a text file as the input, and output a CSV with the scored keywords to make it easier to use. You can see what I ended up with here (feel free to fork and put your own spin on it).
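To give a feel for what RAKE does under the hood, here is a minimal from-scratch sketch of the idea: split text into candidate phrases at stop words and punctuation, score each word by its degree divided by its frequency, and sum those word scores per phrase. This is not the RAKE package's code or my script; the tiny stop list, the function names, and the sample text are all assumptions for illustration, and a real run would use a full stop word file.

```python
import csv
import io
import re
from collections import defaultdict

# Deliberately tiny, hypothetical stop list; real runs need a full one.
STOPWORDS = {"a", "an", "and", "the", "my", "i", "how", "do", "to", "is",
             "for", "of", "on", "in", "can", "you", "me", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stop words, broken at punctuation."""
    phrases = []
    for fragment in re.split(r'[.!?,;:()"\n\t]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def score_phrases(phrases):
    """Score phrases as the sum of each word's degree / frequency."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting itself
    word_score = {word: degree[word] / freq[word] for word in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def write_scores(scored, out_file):
    """Write (keyword, score) pairs as CSV with a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["keyword", "score"])
    writer.writerows(scored)


text = ("How do I improve my search engine ranking? "
        "Search engine optimization is confusing.")
scored = score_phrases(candidate_phrases(text))
buffer = io.StringIO()
write_scores(scored, buffer)
print(buffer.getvalue())
```

Longer phrases built from words that keep showing up together float to the top, which is why multi-word terms tend to outrank single words in the output.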
I ran my chats through the RAKE extraction, and on the first run, I got some okay results. A lot of expected keywords pulled out, but a lot of noise from URLs that were pasted into chat messages. So, I went back into Atom and used more regex to clean those out.
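The URL cleanup is also easy to do in Python instead of an editor. The pattern below is a rough sketch, not the exact regex I used in Atom: it assumes anything starting with `http://`, `https://`, or `www.` and running up to the next whitespace is a pasted link, which will also swallow punctuation stuck to the end of the URL.

```python
import re

# Rough URL matcher: a scheme or "www." followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_urls(text):
    """Remove pasted URLs, then collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(strip_urls("Check https://example.com/page?x=1 for tips"))
```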
My second pass came out quite a bit cleaner, but it was still a bit iffy on the words it was ranking highly. At that point, I discovered that Medelyan had included an optimization script in her repo for RAKE. I ran the `optimize_rake.py` script to determine the best settings for the keyword set, and the defaults I had been using were far different from what the optimization script eventually suggested. I changed the settings, and let RAKE loose again on the chat messages.
After an hour or so of processing, RAKE finally spit out a pretty solid list of ranked keywords.
As you can see, these are pretty on-point search marketing and SEO-related terms, which is exactly what I was looking for with this experiment. These are the things our users are actually asking us about, and not speculation or anecdotal evidence.
Your clients might not be using chat for customer interactions; it requires a fair amount of “babysitting” and isn’t necessarily useful for all industries. However, they’ve probably got a CRM system that customer interactions run through, or a contact form for pre-sales questions. They might record phone calls that could be transcribed by an outsourced transcription service. There are a million ways to get customer input and inquiries in a form that this technique can be used on. The silver bullet is that you’re not looking at all the same competitors and AdWords data that everyone else is; you’re gleaning keyword research directly from real, live customers. Using these weighted keyword lists as part of your research strategy will help you drive down click budgets, create better targeted content, and come up with better answers for those common customer questions.