We’re in the process of building out personal recommendations for folksemantic.com. The basis for the recommendations is user attention metadata. The data we use includes:
- Identity feeds – RSS feeds that users register that represent their interests. For example, their blog or their delicious account.
- Clicks – The articles that the user clicks on.
- Shares – The articles that the user shares with others.
- Comments – Articles that the user comments on.
- Time on page – Amount of time that a user spends on an article before moving on.
- Searches – Searches the user executes.
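The attention types above can be captured in a single event record. This is a minimal sketch of what such a record might look like; the class and field names here are illustrative assumptions, not the actual folksemantic schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttentionEvent:
    """One piece of user attention metadata (hypothetical shape)."""
    user_id: str
    article_id: str
    kind: str              # e.g. "identity_feed", "click", "share", "comment", "dwell", "search"
    occurred_at: datetime
    details: dict          # type-specific extras, e.g. {"recipients": 100} for a share

# Example: a user sharing an article with 100 people.
event = AttentionEvent("u1", "a42", "share", datetime(2009, 11, 23), {"recipients": 100})
```

Keeping the type-specific particulars in a `details` dict lets each attention type carry its own extra signal (share recipients, seconds on page, query terms) without changing the record shape.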
Some of our assumptions are:
- Semantic relatedness – The more semantically similar an article is to articles that a user has paid attention to, the more interesting it is likely to be to that user.
- Attention types – Different types of attention should be given different weights. For example, following a link to an article should not give it as much weight as writing the article.
- Attention details – The particulars of a given type of attention might make it more important than another attention of the same type. For example, if a person shares an article with 100 people, it might be reasonable to infer that it is more important than an article that they share with one person.
- Entry recency – The more recently an article has been added to the system, the more interesting it is likely to be to the user (they probably haven’t seen it before).
- Attention recency – The more recently a user has showed attention to an article, the more weight that should be given to it.
- Attention frequency – The more frequently a user has showed attention to an article, the more weight that should be given to it.
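The attention-type, recency, and frequency assumptions can be sketched together as a weighted sum with exponential decay. The type weights and half-life below are made-up illustrative values, not our production numbers.

```python
import math
from datetime import datetime, timedelta

# Illustrative weights only: writing/registering a feed counts more
# than a click, per the attention-types assumption above.
TYPE_WEIGHT = {"identity_feed": 1.0, "comment": 0.6, "share": 0.5, "click": 0.2}

def attention_weight(events, now, half_life_days=30.0):
    """Total attention a user has paid to one article.

    Each event contributes its type weight, discounted by an exponential
    recency decay; summing over events means more frequent attention
    also raises the total.
    """
    decay = math.log(2) / half_life_days
    total = 0.0
    for kind, when in events:
        age_days = (now - when).total_seconds() / 86400.0
        total += TYPE_WEIGHT[kind] * math.exp(-decay * age_days)
    return total

now = datetime(2009, 11, 23)
recent = [("click", now - timedelta(days=1))]
stale = [("click", now - timedelta(days=90))]
```

With a 30-day half-life, a click from yesterday outweighs one from three months ago, and three recent clicks outweigh one.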
Stating these assumptions reminds me of the difference between relevance and certainty. While an item that a user clicks on may be more relevant than a blog article they have written, it is harder to be certain of that. Our approach is to give the click less weight than the article.
Right now, we score articles using the formula:
score = (relevance) × (attention type) × (attention details) × (attention recency) × (article recency)
For every article a user has paid attention to, we score its 20 “related articles” using this formula, rank the scores, and cache the top 20 (that the user hasn’t already clicked on) to recommend to the user. There are obvious weaknesses to this approach, but we are going to start there and see where to go next.
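The score-rank-cache step above can be sketched as follows. The multiplicative score matches the formula in the post; the function names and candidate representation are assumptions for illustration.

```python
import heapq

def score(relevance, type_w, detail_w, attn_recency, article_recency):
    # The post's multiplicative formula:
    # (relevance)(attention type)(attention details)(attention recency)(article recency)
    return relevance * type_w * detail_w * attn_recency * article_recency

def top_recommendations(candidates, already_clicked, k=20):
    """Rank candidate (article_id, factors) pairs and return the top-k
    article ids the user hasn't already clicked on."""
    scored = (
        (score(*factors), article_id)
        for article_id, factors in candidates
        if article_id not in already_clicked
    )
    return [article_id for _, article_id in heapq.nlargest(k, scored)]

candidates = [
    ("a", (1.0, 1.0, 1.0, 1.0, 1.0)),
    ("b", (0.5, 1.0, 1.0, 1.0, 1.0)),
    ("c", (2.0, 1.0, 1.0, 1.0, 1.0)),  # highest score, but already clicked
]
recs = top_recommendations(candidates, already_clicked={"c"}, k=2)
```

Filtering out already-clicked articles before ranking keeps the cached list purely novel, per the entry-recency assumption.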
Possible Extensions / Improvements
We are considering:
- Collaborative Filtering, Bipartite Graph, and Discriminative Weight Algorithms – Putting into production the algorithm from our research that combines discriminative weights with a novel sparse-matrix clustering method.
- Modeling user interests separately – Modeling a user’s interests with multiple term vectors (one per interest) by extracting vectors from closely related (clustered) documents that the user has paid attention to.
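The multiple-term-vector idea above can be sketched with a toy bag-of-words model: given documents already grouped into interest clusters, build one aggregated term vector per cluster rather than a single vector for the whole user. The tokenizer and data shapes here are simplifying assumptions.

```python
from collections import Counter

def term_vector(doc):
    """Bag-of-words term vector for one document (toy whitespace tokenizer)."""
    return Counter(doc.lower().split())

def interest_vectors(clusters):
    """One aggregated term vector per cluster of closely related documents
    the user has paid attention to: each centroid models one interest."""
    profiles = []
    for docs in clusters:
        centroid = Counter()
        for doc in docs:
            centroid.update(term_vector(doc))
        profiles.append(centroid)
    return profiles

# Two pre-clustered interests: programming articles and hiking articles.
clusters = [
    ["python code tutorial", "code review tips"],
    ["hiking trail maps", "trail running gear"],
]
profiles = interest_vectors(clusters)
```

Scoring a candidate article against each interest vector separately, and taking the best match, would keep a strong fit with one interest from being diluted by the user’s unrelated interests.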
Posted on November 23rd, 2009 by joel
Filed under: recommender