
OER Recommender Released

Here is the updated OER Recommender White Paper.

Yesterday we released the OER Recommender system that I have worked on. There are still many things that could be added or tweaked, but it does something useful already so out the door it goes! I’m concerned that we are calling it a recommender, because the “recommendations” it currently provides are not specific to a user; they are related resources generated via a content-based approach. The proposal and plans are to make it a recommender based on user profiles. See my previous post for details about where it is intended to go.

The Problem. The William and Flora Hewlett Foundation, the National Science Digital Library, and other large organizations have made large investments in the development of electronic resources that can be used for teaching, learning, and research. They are keen on finding ways to increase the impact of their investment. One way to do that is to create tools that make it easier for people to find resources that are useful to them. One approach to solving this problem is search and browsing tools that help people find good resources, such as NSDL, OCW Finder, and of course Google. Recommender systems approach the problem by bringing resources to people’s attention without their having to go look for them specifically; Google’s AdSense technology and Amazon’s recommendations are two of the most common examples. One challenge with general search tools such as Google, and even with more focused ones such as NSDL’s main search, is that they cast too wide a net and you get back resources that do not match. On the other hand, while most collections of open courseware and digital libraries provide search, it is nice to be able to search across repositories at the same time to find the best resources wherever they reside.

The Vision. In creating the OER Recommender we set out to create a service that would help people find relevant open education resources. The first step in this was to create an automated process for clustering related resources. The hope is that presenting links to related resources while users are looking at a resource could help them stumble upon resources that are relevant to them even though they might not have been actively looking for them. We could tune this service by monitoring what resources people visit, share, rate, tag, and otherwise pay attention to. We could use this attention metadata to create profiles of what people’s interests are. The profiles could be used to push recommendations to people even when they are not browsing, perhaps via email or other means.

Exploration. Ever since learning about it, I have felt that the recommender would be one of the more interesting folksemantic tools to work on, because I initially planned to pursue a latent semantic analysis (LSA) approach. I had heard about LSA from my exposure to Andy Walker’s and Art Graesser’s work. I have also had a number of interesting discussions about clustering with Adele Cutler about her work on Random Forests. I got hold of the Handbook of Latent Semantic Analysis and explored the LSA online resources I could find as well as the lsa module for R. After spending significant time reading and playing, I came to the conclusion that it would likely be more complex to implement and more computationally expensive than I wanted. I also concluded that everything I would need to do to implement a standard Term Vector Model approach could be used later to implement an LSA approach. During my exploration, I came across a number of useful online resources and books, including Gerard Salton’s work.

An Explanation for My Mother. I’ve already been asked by Shelly to explain the implementation of the recommender in language that my mother would understand, so I’ll give that explanation first. We use the folksemantic feed harvester to gather information (metadata) about OERs into databases. For each pair of resources that are related enough, the recommender uses the titles, descriptions, and tags to calculate a score indicating how related they are. Recommended resources are the ones scored to be the most similar. The similarity of two resources is based on an automated analysis of the words in their metadata. Got that, Mom?

We use four phases to arrive at recommendations: (1) parse metadata, (2) calculate local term weights, (3) calculate global term weights, (4) calculate similarity scores.

Parse Metadata. Using a string tokenizer we break metadata text into terms. We throw away stop words (common words such as ‘and’, ‘is’, ‘of’ that do not add meaning). Next we convert terms into their stems using a common algorithm; for example ‘running’, ‘ran’, and ‘runs’ all get collapsed into ‘run’.
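To make the parse phase concrete, here is a minimal Python sketch. It is not the code OER Recommender runs: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for whatever stemming algorithm the system actually uses.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def naive_stem(term: str) -> str:
    """Crude suffix stripper (an assumption, standing in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if suffix == "s" and term.endswith("ss"):
            continue  # leave words like "class" alone
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            term = term[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "tell", "pass").
            if term[-1] == term[-2] and term[-1] not in "aeiouls":
                term = term[:-1]
            break
    return term


def parse_metadata(text: str) -> list[str]:
    """Tokenize, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(parse_metadata("Running and runs of the course"))
```

This sketch only strips regular suffixes, so an irregular form such as ‘ran’ passes through unchanged; collapsing it into ‘run’ would need a stemmer or dictionary that knows about irregular verbs.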

Calculate Local Term Weights. A local term weight is a measure of how important a word is for describing a document. The more frequently a word appears in a document, the more important it is… up to a point. Of course, where a word appears in a document is probably a good indicator of how important the word is as well. For example if a word appears in a title, it is probably more important than if it appears in the body. One more important factor to consider is document length (total number of terms in the document); all other things being equal, the longer a document is, the more times a term will appear in it. So we normalize term frequencies using the document length. One last issue relates to filler words.

To calculate local term weights OER Recommender uses a function that looks like this.

[Figure: Local Term Weights Function]

The x-axis represents term frequency and the y-axis represents the local term weight. As you can see, the more times a term appears in a document, the more weight it is given. However, after reaching a threshold, the weight plateaus. I think this is especially important in situations where document creators might be trying to game the system. The formula used is:

$$ltw_i = \frac{1}{1 + e^{0.0044 \cdot dlen} \cdot 0.7^{(ft_i - 1)}}$$

$ltw_i$ is the local term weight for a given term in a document.

$dlen$ is the total number of terms in the document.

$ft_i$ is the number of times a given term appears in a document.

Note that the constants .0044 and .7 in the equation determine the shape of the curve. I chose those values based on their use in the MeSH system. As explained there, the values should be tuned to your data set using an empirical approach. I played with the parameters some and the values they used seemed fine, so I adopted them.
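As a rough illustration, the local term weight can be written as a small Python function. This is a sketch under one assumption about how to read the formula: the exponential applies to .0044 times the document length, and .7 is raised to the power of the term count minus one. The parameter names `alpha` and `lam` are mine, not from the OER Recommender code.

```python
import math


def local_term_weight(term_count: int, doc_length: int,
                      alpha: float = 0.0044, lam: float = 0.7) -> float:
    """Local weight of a term that occurs term_count times in a document
    containing doc_length terms. alpha and lam shape the curve."""
    return 1.0 / (1.0 + math.exp(alpha * doc_length) * lam ** (term_count - 1))


# The weight rises with term frequency and levels off below 1.
for count in (1, 2, 5, 10, 20):
    print(count, round(local_term_weight(count, doc_length=100), 3))
```

Because the second term in the denominator shrinks toward zero as the count grows, the weight can approach 1 but never exceed it, which gives the plateau described above. Longer documents make that term larger, so the same count earns a smaller weight.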

Calculate Global Term Weights. Global term weights are a measure of how important a word is for distinguishing documents within a collection. The more documents a word appears in, the less value it has for characterizing documents. For example, the term ‘USU’ appears in every metadata record for resources in USU’s OpenCourseware. As a result, it has no value for characterizing or clustering resources. The term ‘USU’ becomes, in a sense, a stop word, similar to those thrown away during the parse phase. To calculate the global term weight for a term, we count the number of documents that it appears in as well as the total number of documents.

To calculate global term weights OER Recommender uses a function that looks like this.

[Figure: Global Term Weights Function]

The x-axis represents the number of documents that a term appears in. The y-axis represents the global weight assigned to the term. The formula used is:

$$gtw_i = \log\left(\frac{D}{df_i}\right)$$

$gtw_i$ is the global term weight for a term.

$D$ is the total number of documents.

$df_i$ is the number of documents that the term appears in.

See Mi Islita’s explanation of Term Vector Theory for details.
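The global term weight calculation might look like the following in Python. This is only a sketch: it assumes each document has already been parsed into a list of terms, and it uses the natural logarithm because the formula does not state a base.

```python
import math
from collections import Counter


def global_term_weights(documents: list[list[str]]) -> dict[str, float]:
    """Map each term to log(D / df), where D is the number of documents
    and df is the number of documents containing the term."""
    total = len(documents)
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    return {term: math.log(total / df) for term, df in doc_freq.items()}


# Made-up example collection, for illustration only.
docs = [["usu", "physics"], ["usu", "biology"], ["usu", "physics", "lab"]]
weights = global_term_weights(docs)
print(weights["usu"], weights["physics"], weights["biology"])
```

In this toy collection ‘usu’ appears in every document, so its weight is log(3/3) = 0 and it contributes nothing to any score, which is the stop-word effect described above.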

Calculate Similarity Scores. Once local term weights and global term weights are calculated, we are ready to create recommendations. To create recommendations for a document we first find all documents that share any terms with it. It turns out that this can be a very large number of documents (e.g. 40,000 in our system). For each pairing of the document being considered and a document with matching terms we calculate a similarity score. Because calculating 40,000 scores can take a long time (about 15 seconds in our current system), we shorten the list of candidates to 200. We do this by sorting the pairs according to the number of overlapping terms and keeping the 200 with the most overlap. To calculate the similarity score for a pair of documents, we sum over all of the terms that they share in common. The contribution from each term is a combination of the local term weight in the first document, the local term weight in the second document, and the global term weight. Because in OER Recommender each feed can be considered a separate collection, we calculate global term weights for each of the feeds and use them in calculating the similarity score.
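The shortlisting step can be sketched in Python with an inverted index from terms to documents. The names `build_index` and `shortlist` and the in-memory dictionaries are illustrative assumptions; the real system keeps this data in a database.

```python
from collections import Counter, defaultdict


def build_index(doc_terms: dict[str, set[str]]) -> dict[str, set[str]]:
    """Inverted index: term -> ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def shortlist(doc_id: str, doc_terms: dict[str, set[str]],
              index: dict[str, set[str]], limit: int = 200) -> list[str]:
    """Return up to `limit` other documents, most overlapping terms first."""
    overlap = Counter()
    for term in doc_terms[doc_id]:
        for other in index[term]:
            if other != doc_id:
                overlap[other] += 1
    return [other for other, _ in overlap.most_common(limit)]


# Made-up documents, for illustration only.
docs = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"x"}, "d": {"q"}}
print(shortlist("a", docs, build_index(docs)))
```

Counting overlapping terms is much cheaper than computing full similarity scores, which is why it works as a first-pass filter before the more expensive scoring.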

To calculate the contribution of an individual term to the similarity score, OER Recommender uses:

$$ss_{ti} = gtw_{1i} \cdot ltw_{1i} \cdot gtw_{2i} \cdot ltw_{2i}$$

$ss_{ti}$ is the contribution to the similarity score for a given term.

$gtw_{1i}$ is the global term weight from the feed that the first document is in.

$ltw_{1i}$ is the local term weight from the first document.

$gtw_{2i}$ is the global term weight from the feed that the second document is in.

$ltw_{2i}$ is the local term weight from the second document.

See NCBI’s explanation of how MeSH calculates Related Articles for details. Once OER Recommender has calculated similarity scores for the 200 documents, it sorts the documents by those scores and stores the top 10 in a database.
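Putting the pieces together, scoring and ranking could be sketched as follows. The `Doc` structure and function names are hypothetical; the sketch assumes local weights and per-feed global weights have already been computed and that every term in a document has a global weight in its own feed.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    local: dict[str, float]        # term -> local term weight in this document
    feed_global: dict[str, float]  # term -> global term weight in this document's feed


def similarity(a: Doc, b: Doc) -> float:
    """Sum the per-term contributions over the terms both documents share."""
    shared = a.local.keys() & b.local.keys()
    return sum(a.feed_global[t] * a.local[t] * b.feed_global[t] * b.local[t]
               for t in shared)


def recommend(target: Doc, candidates: dict[str, Doc],
              top_n: int = 10) -> list[tuple[str, float]]:
    """Score every shortlisted candidate and keep the top_n highest scores."""
    scored = [(doc_id, similarity(target, doc)) for doc_id, doc in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up weights, for illustration only.
feed1 = {"physics": 0.5, "lab": 1.0, "usu": 0.0}
feed2 = {"physics": 2.0, "lab": 1.0}
target = Doc({"physics": 0.5, "usu": 0.5}, feed1)
candidates = {
    "c1": Doc({"physics": 0.5, "lab": 0.4}, feed2),
    "c2": Doc({"usu": 0.5}, feed1),
}
print(recommend(target, candidates, top_n=1))
```

Here `c2` shares only the term ‘usu’ with the target, and since that term has a global weight of zero in its feed, the pair scores zero no matter how often the term appears.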

Displaying Recommendations. Now that the recommendations are stored in a database, displaying them to the user is straightforward. We have created a Greasemonkey script that requests recommendations for each web page that a user browses. If OER Recommender returns any, the script inserts HTML for the recommendations into the web page. The eduCommons team is building a Plone tool so that anyone running eduCommons can turn on recommendations. That way users won’t have to install the Greasemonkey script in order to see recommendations on the eduCommons website. The OER Recommender site publishes the XML format it returns so others can add recommendations to their websites.

Getting Resources into the Recommender. Currently OER Recommender only provides recommendations for URLs that it has metadata for, so in order to get recommendations, you need to register a metadata feed. If you have an OCW site that provides an RSS feed, you can do that at OCW Finder. If you have an OAI feed or an RSS feed for learning objects or other OERs, you can add your metadata by sending a feed title, URL, and display URL (home page of the repository) to oerrecommender AT cosl DOT usu DOT edu.

Future Directions. There are many things left to be done: (1) adapt recommendations by monitoring which recommended resources people click on and how long they stay at recommended resources, (2) provide links to more recommendations (so users can see more than the default 5), (3) provide a way for people to indicate when a recommendation doesn’t work, (4) provide a way for people to register and log in, so they can receive personalized recommendations (perhaps via email), (5) create a process whereby recommendations can be updated without going through all the documents in the database, (6) add functionality for retrieving recommendations for arbitrary web pages (recommending on demand), perhaps via a bookmarklet, (7) create whatever Greasemonkey-like add-on is supported on IE. We also plan to integrate the recommender with the aggregator and feed reader to provide recommended news items based on the attention people have previously paid to news articles.

Wow! I realize that this is a long post, but it is not a simple topic. Hopefully the explanation is useful.

4 Responses to “OER Recommender Released”

  1. Stephen gives mostly positive feedback but comments that

    I would prefer the feature to be browser-neutral. Much more web-based than browser-based.

    I realize that he might not have looked at the writeup linked to at the bottom of the OER Recommender home page. To clarify, an OER creator can make the recommendations browser-neutral by requesting the recommendation XML and inserting it into their page before it is served to the user. Right now, that will take a bit of work. I seem to remember a Javascript on Stephen’s website that you can use to insert content from an RSS feed into a web page. I would like to provide that, but wasn’t able to find it in my brief search. I could also enhance the recommender service to look at the referrer so that a URL wouldn’t have to be passed to it.

  2. Joel, sorry not to comment earlier, only just now got pointed back here. As I wrote on David’s blog when he posted about this, fantastic work! And thanks so much for these additional details, they make lots of sense. A few comments on the ‘future directions’ section:

    - re 7, IE greasemonkey support – you’ve likely already seen it, but http://www.reifysoft.com/turnabout.php seems in the ballpark, and with luck your scripts may also be some of the ones that work with Opera (cf. http://www.opera.com/support/tutorials/userjs/examples/#greasemonkey)

    - I think the javascript you refer to above might be http://feed2js.org/ (but maybe not). Anyways, I think it does what you describe. Stephen’s one used to be a ‘referrer script’, different thing I think. Anyways, I thought the comment was a bit nit picky, and as you say, there are some ways to work around it. Plus it gives people good motivations to use a modern web browser ;-)

    - If I had a vote, I’d put my hand up for #6, the ability to add recommendations for arbitrary web pages. The solution you’ve developed is great and works with the resources it is intended to (and the metadata it provides) but it would feel more like participatory culture to me if end users were also contributing new resources and recommendations.

    Anyways, lots more thoughts coming up to me but I’ll leave it at these short comments for now. Great work!

  3. The page for the referrers is: http://www.downes.ca/referrers.htm

    It’s an old page and I don’t support it any more. The links don’t work (I’ll probably patch it all up again one day). The code is here:

    http://www.downes.ca/code/referrersa.txt

    and here’s the Javascript:

    http://www.downes.ca/code/referrersjs.txt

    Creating a script that will insert Javascript code into a web page using a one-line Javascript is pretty trivial. If you send me the Javascript you want to put into the page (be clear about any variables) I can give you a script.

  4. [...] It recommends other useful resources related to the page you are viewing. I found the page explaining the approach and mathematics very insightful. It certainly gave me a few ideas on how I could put some Machine [...]
