<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>undesigned &#187; information retrieval</title>
	<atom:link href="http://www.joelduffin.com/blog/category/information-retrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.joelduffin.com/blog</link>
	<description>life is a rum go guv’nor, and that’s the truth</description>
	<lastBuildDate>Wed, 28 Jul 2010 04:12:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>OER Recommender Released</title>
		<link>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/</link>
		<comments>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/#comments</comments>
		<pubDate>Fri, 24 Aug 2007 04:15:06 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[folksemantic]]></category>
		<category><![CDATA[nsdl]]></category>
		<category><![CDATA[recommender]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=12</guid>
		<description><![CDATA[Here is the updated OER Recommender White Paper.
Yesterday we released the OER Recommender system that I have worked on.  There are still many things that could be added or tweaked, but it does something useful already so out the door it goes! I&#8217;m concerned that we are calling it a recommender as the &#8220;recommendations&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>Here is the updated <a title="OER Recommender White Paper" href="http://www.joelduffin.com/blog/wp-content/uploads/2008/01/recommender.pdf">OER Recommender White Paper</a>.</p>
<p>Yesterday we <a href="http://opencontent.org/blog/archives/367">released</a> the <a href="http://www.oerrecommender.org/">OER Recommender</a> system that I have worked on.  There are still many things that could be added or tweaked, but it does something useful already so out the door it goes! I&#8217;m concerned that we are calling it a <a href="http://en.wikipedia.org/wiki/Recommendation_system">recommender</a>, as the &#8220;recommendations&#8221; it currently provides are not specific to a user; they are related resources generated via a content-based approach. The proposal and plans are to make it a recommender based on user profiles. See <a href="http://www.joelduffin.com/blog/2007/05/09/implementing-a-recommender-system/">my previous post</a> for details about where it is intended to go.</p>
<p><strong>The Problem</strong>. The <a href="http://www.hewlett.org/" target="_blank">William and Flora Hewlett Foundation</a>, the <a href="http://nsdl.org/">National Science Digital Library</a>, and other large organizations have made large investments in the development of electronic resources that can be used for teaching, learning, and research. They are keen on finding ways to increase the impact of their investment. One way to do that is to create tools that make it easier for people to find resources that are useful to them. One approach to solving this problem is search and browsing tools that help people find good resources, such as NSDL, <a href="http://www.ocwfinder.org/">OCW Finder</a>, and of course Google. Recommender systems approach the problem by bringing resources to people&#8217;s attention without their having to go look for them specifically. Google&#8217;s AdSense technology and Amazon&#8217;s recommendations are two of the most common examples. One of the challenges with general search tools such as Google, and even more focused ones such as NSDL&#8217;s main search, is that they cast too wide a net and you get back resources that do not match. On the other hand, while most collections of open courseware and digital libraries provide search, it is nice to be able to search across repositories at the same time to find the best resources wherever they reside.</p>
<p><strong>The Vision</strong>. In creating the OER Recommender we set out to create a service that would help people find relevant open education resources. The first step in this was to create an automated process for clustering related resources. The hope is that presenting links to related resources to users while they are looking at a resource could help people stumble upon resources that are relevant to them even though they might not have been actively looking for them. We could tune this service by monitoring what resources people visit, share, rate, tag, and otherwise pay attention to. We could use this attention metadata to create profiles of what people&#8217;s interests are. The profiles could be used to push recommendations to people even when they are not browsing, perhaps via email or other means.</p>
<p><strong>Exploration</strong>. Since learning about it I have felt that the recommender would be one of the more interesting <a href="http://www.folksemantic.org/">folksemantic</a> tools to work on since I initially thought to pursue a <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">latent semantic analysis</a> approach. I had heard about LSA from my exposure to <a href="http://sitcogblog.blogspot.com/">Andy Walker&#8217;s</a> and <a href="http://home.autotutor.org/graesser/">Art Graesser&#8217;s</a> work. I have also had a number of interesting discussions about clustering with <a href="http://www.math.usu.edu/~adele/home.htm">Adele Cutler</a> about her work on <a href="http://www.math.usu.edu/~adele/forests/index.htm">Random Forests</a>. I got a hold of the <span class="sans"><a href="http://www.amazon.com/Handbook-Semantic-University-Institute-Cognitive/dp/0805854185">Handbook of Latent Semantic Analysis</a> and explored the <a href="http://del.icio.us/jduffin/lsa">LSA online resources I could find</a> as well as the <a href="http://cran.r-project.org/src/contrib/Descriptions/lsa.html">lsa module for R</a></span>. After spending significant time reading and playing I came to the conclusion that it would likely be more complex to implement and more computationally expensive than I wanted. I also concluded that everything that I would need to do to implement a standard <a href="http://en.wikipedia.org/wiki/Vector_space_model">Term Vector Model</a> approach could be used to later implement an LSA approach. During my exploration, I came across a number of <a href="http://del.icio.us/jduffin/recommender">useful online resources</a> and books including <a href="http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html">Gerard Salton&#8217;s</a> work.</p>
<p><strong>An Explanation for My Mother</strong>. I&#8217;ve already been asked by <a href="http://shelleylyn.blogspot.com/">Shelly</a> to explain the implementation of the recommender in language that my mother would understand, so I&#8217;ll give that explanation first. We use the folksemantic feed harvester to gather information (metadata) about OERs into databases. For each pair of resources that are related enough, the recommender uses the titles, description, and tags to calculate a score indicating how related they are. Recommended resources are the ones scored to be the most similar. The similarity of two resources is based on an automated analysis of the words in their metadata. Got that Mom?</p>
<p>We use four phases to arrive at recommendations: (1) parse metadata, (2) calculate local term weights, (3) calculate global term weights, (4) calculate similarity scores.</p>
<p><strong>Parse Metadata</strong>. Using a string tokenizer we break metadata text into terms. We throw away stop words (common words such as &#8216;and&#8217;, &#8216;is&#8217;, &#8216;of&#8217; that do not add meaning). Next we convert terms into their stems using a <a href="http://www.tartarus.org/~martin/PorterStemmer">common algorithm</a>; for example &#8216;running&#8217; and &#8216;runs&#8217; both get collapsed into &#8216;run&#8217; (a suffix-stripping stemmer like this one does not handle irregular forms such as &#8216;ran&#8217;).</p>
<p><strong>Calculate Local Term Weights</strong>. A local term weight is a measure of how important a word is for describing a document. The more frequently a word appears in a document, the more important it is&#8230; up to a point. Of course, where a word appears in a document is probably a good indicator of how important the word is as well. For example if a word appears in a title, it is probably more important than if it appears in the body. One more important factor to consider is document length (total number of terms in the document); all other things being equal, the longer a document is, the more times a term will appear in it. So we normalize term frequencies using the document length. One last issue relates to filler words.</p>
<p>To calculate local term weights OER Recommender uses a function that looks like this.</p>
<p><a title="Global Term Weights Function" href="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/gtw.jpg"><img src="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/gtw.jpg" alt="Global Term Weights Function" /></a></p>
<p>The x-axis represents term frequency and the y-axis represents the local term weight. As you can see, the more times a term appears in a document, the more weight it is given. However, after reaching a threshold, it plateaus. I think this is especially important in situations where document creators might be trying to game the system. The formula used is:</p>
<p><em>ltw<sub>i</sub> = 1 / (1 + e<sup>(0.0044 &#215; dlen)</sup> &#215; 0.7<sup>(ft<sub>i</sub> &#8722; 1)</sup>)</em></p>
<p><em>ltw<sub>i</sub></em> is the local term weight for a given term in a document.</p>
<p><em>dlen</em> is the total number of terms in the document.</p>
<p><em>ft</em><em><sub>i</sub></em> is the number of times a given term appears in a document.</p>
<p>Note that the constants .0044 and .7 in the equation determine the shape of the curve. I chose those values based on their use in the <a href="http://lhncbc.nlm.nih.gov/lhc/docs/published/2001/pub2001045.pdf">MeSH system</a>. As explained there, the values should be tuned to your data set using an empirical approach. I played with the parameters some and the values they used seemed fine, so I adopted them.</p>
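<p>To make the shape of that curve concrete, here is a small Python sketch of the local term weight formula above. The function and parameter names are my own for illustration; only the two constants come from the formula.</p>

```python
import math


def local_term_weight(term_count: int, doc_length: int,
                      length_rate: float = 0.0044,
                      freq_base: float = 0.7) -> float:
    """ltw = 1 / (1 + e^(length_rate * dlen) * freq_base^(ft - 1)).

    term_count is ft (occurrences of the term in the document) and
    doc_length is dlen (total number of terms in the document).
    """
    length_penalty = math.exp(length_rate * doc_length)
    return 1.0 / (1.0 + length_penalty * freq_base ** (term_count - 1))


# The weight rises with term frequency and then flattens out below 1.
for count in (1, 2, 5, 10, 20):
    print(count, round(local_term_weight(count, doc_length=100), 3))
```

<p>Because the weight can never exceed 1, repeating a term dozens of times buys almost nothing over repeating it a handful of times, which is the plateau described above.</p>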
<p><strong>Calculate Global Term Weights</strong>. Global term weights are a measure of how important a word is for distinguishing documents within a collection. The more documents a word appears in, the less value it has for characterizing documents. For example, the term &#8216;USU&#8217; appears in every metadata record for resources in USU&#8217;s OpenCourseware. As a result, it has no value for characterizing or clustering resources. The term &#8216;USU&#8217; becomes, in a sense, a stop word, similar to those thrown away during the parse phase. To calculate the global term weight for a term, we count the number of documents that it appears in as well as the total number of documents.</p>
<p>To calculate global term weights OER Recommender uses a function that looks like this.</p>
<p><a title="Local Term Weights Function" href="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/ltw.jpg"><img src="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/ltw.jpg" alt="Local Term Weights Function" /></a></p>
<p>The x-axis represents the number of documents that a term appears in. The y-axis represents the global weight assigned to the term. The formula used is:</p>
<p><em>gtw<sub>i</sub> = log (D/df<sub>i</sub>) </em></p>
<p><em>gtw<sub>i</sub></em> is the global term weight for a given term.</p>
<p><em>D</em> is the total number of documents.</p>
<p><em>df<sub>i</sub></em> is the number of documents that the term appears in.</p>
<p>See <a href="http://www.miislita.com/term-vector/term-vector-1.html">Mi Islita&#8217;s explanation of Term Vector Theory</a> for details.</p>
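<p>Here is a minimal Python sketch of the global term weight calculation, <em>gtw = log(D/df)</em>. The post does not say which logarithm base is used, so the natural log here is an assumption, and the three sample documents are made up.</p>

```python
import math
from collections import Counter


def global_term_weights(documents: list[list[str]]) -> dict[str, float]:
    """Return log(D / df) for every term in a collection of parsed documents."""
    total = len(documents)
    doc_freq = Counter()
    for terms in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(terms))
    return {term: math.log(total / df) for term, df in doc_freq.items()}


docs = [["usu", "algebra"], ["usu", "physic"], ["usu", "algebra", "matrix"]]
weights = global_term_weights(docs)
print(weights["usu"])  # 0.0, because "usu" appears in every document
print(weights["algebra"], weights["matrix"])
```

<p>A term that appears in every document gets a weight of zero, which is how &#8216;USU&#8217; ends up behaving like a stop word without being on the stop list.</p>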
<p><strong>Calculate Similarity Scores</strong>. Once local term weights and global term weights are calculated, we are ready to create recommendations. To create recommendations for a document we first find all documents that have any of the same terms in them. It turns out that this can be a very large number of documents (e.g. 40,000 in our system).  For each pairing of the document being considered and a document with matching terms we calculate a similarity score. Because calculating 40,000 scores can take a long time (about 15 seconds in our current system), we shorten the list to consider to 200. We do this by sorting the pairs according to the number of overlapping terms and keeping the 200 with the most overlap. To calculate the similarity score for a pair of documents, we sum over all of the terms that they share in common. The contribution from each term is a combination of the local term weight in the first document, the local term weight in the second document and the global term weight. Because in OER Recommender each feed can be considered a separate collection, we calculate global term weights for each of the feeds and use them in calculating the similarity score.</p>
<p>To calculate the contribution of an individual term to the similarity score, OER Recommender uses:</p>
<p><em>sst<sub>i</sub> = (gtw<sub>1i</sub>)(ltw<sub>1i</sub>)</em><em>(gtw<sub>2i</sub>)</em><em>(ltw<sub>2i</sub>)</em></p>
<p><em>sst<sub>i</sub></em> is the contribution to the similarity score for a given term.</p>
<p><em>gtw<sub>1i</sub></em> is the global term weight from the feed that the first document is in.</p>
<p><em>ltw<sub>1i</sub></em> is the local term weight from the first document.</p>
<p><em>gtw<sub>2i</sub></em> is the global term weight from the feed that the second document is in.</p>
<p><em>ltw<sub>2i</sub></em> is the local term weight from the second document.</p>
<p>See <a href="http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html">NCBI&#8217;s explanation of how MeSH calculates Related Articles</a> for details. Once OER Recommender has calculated similarity scores for the 200 documents, it sorts the documents by those scores and stores the top 10 in a database.</p>
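<p>The scoring steps can be sketched in Python roughly as follows. This is an illustration under assumptions, not the production code: documents are plain dicts with hypothetical keys <em>url</em>, <em>feed</em>, and <em>ltw</em> (term to local weight), <em>feed_gtw</em> maps a feed name to its global term weights, and the example URLs and numbers are invented.</p>

```python
def similarity(ltw1, gtw1, ltw2, gtw2):
    """Sum gtw1 * ltw1 * gtw2 * ltw2 over the terms two documents share."""
    shared = set(ltw1).intersection(ltw2)
    return sum(gtw1.get(t, 0.0) * ltw1[t] * gtw2.get(t, 0.0) * ltw2[t]
               for t in shared)


def recommend(target, candidates, feed_gtw, shortlist_size=200, top_n=10):
    """Shortlist candidates by term overlap, then rank them by similarity."""
    def overlap(doc):
        return len(set(target["ltw"]).intersection(doc["ltw"]))

    # Keep only documents sharing at least one term, most overlap first.
    overlapping = [doc for doc in candidates if overlap(doc) > 0]
    shortlist = sorted(overlapping, key=overlap, reverse=True)[:shortlist_size]

    target_gtw = feed_gtw[target["feed"]]
    scored = []
    for doc in shortlist:
        doc_gtw = feed_gtw[doc["feed"]]
        score = similarity(target["ltw"], target_gtw, doc["ltw"], doc_gtw)
        scored.append((score, doc["url"]))
    scored.sort(reverse=True)
    return scored[:top_n]


feed_gtw = {
    "usu": {"algebra": 1.0, "matrix": 2.0},
    "mit": {"algebra": 0.5, "matrix": 1.5, "poetry": 2.0},
}
target = {"url": "http://example.org/a", "feed": "usu",
          "ltw": {"algebra": 0.5, "matrix": 0.4}}
candidates = [
    {"url": "http://example.org/b", "feed": "mit",
     "ltw": {"algebra": 0.6, "matrix": 0.5}},
    {"url": "http://example.org/c", "feed": "mit", "ltw": {"poetry": 0.9}},
    {"url": "http://example.org/d", "feed": "usu", "ltw": {"algebra": 0.3}},
]
recs = recommend(target, candidates, feed_gtw)
print(recs)
```

<p>Shortlisting by raw overlap is cheap because it only needs set intersections; the more expensive weighted sum then runs on at most 200 pairs instead of tens of thousands.</p>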
<p><strong>Displaying Recommendations</strong>. Now that the recommendations are stored in a database, displaying them to the user is straightforward. We have created a <a href="https://addons.mozilla.org/en-US/firefox/addon/748">Greasemonkey script</a> that requests recommendations for each web page that a user browses. If OER Recommender returns any, the script inserts HTML for the recommendations into the web page. The eduCommons team is building a Plone tool so that anyone running eduCommons can turn on recommendations. That way users won&#8217;t have to install the Greasemonkey script in order to see recommendations on the eduCommons website. The OER Recommender site publishes the XML format it returns so others could add recommendations into their websites.</p>
<p><strong>Getting Resources into the Recommender</strong>. Currently OER Recommender only provides recommendations for URLs that it has metadata for, so in order to get recommendations, you need to register a metadata feed. If you have an OCW site that provides an RSS feed, you can do that at <a href="http://www.ocwfinder.org/">OCW Finder</a>. If you have an OAI feed or an RSS feed for learning objects or other OERs, you can add your metadata by sending a feed title, URL, and display URL (home page of the repository) to oerrecommender AT cosl DOT usu DOT edu.</p>
<p><strong>Future Directions</strong>. There are many things left to be done: (1) adapt recommendations by monitoring which recommended resources people click on and how long they stay at recommended resources, (2) provide links to more recommendations (so users can see more than the default 5), (3) provide a way for people to indicate when a recommendation doesn&#8217;t work, (4) provide a way for people to register and log in, so they can receive personalized recommendations (perhaps via email), (5) create a process whereby recommendations can be updated without going through all the documents in the database, (6) add functionality for retrieving recommendations for arbitrary web pages (recommending on demand), perhaps via a bookmarklet, (7) create whatever Greasemonkey-like add-on is supported on IE. We also plan to integrate the recommender with the aggregator and feed reader to provide recommended news items based on the attention people have previously paid to news articles.</p>
<p>Wow! I realize that this is a long post, but it is not a simple topic. Hopefully the explanation is useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>implementing a recommender system</title>
		<link>http://www.joelduffin.com/blog/2007/05/09/implementing-a-recommender-system/</link>
		<comments>http://www.joelduffin.com/blog/2007/05/09/implementing-a-recommender-system/#comments</comments>
		<pubDate>Thu, 10 May 2007 06:07:39 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[cosl]]></category>
		<category><![CDATA[folksemantic]]></category>
		<category><![CDATA[recommender]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=6</guid>
		<description><![CDATA[I&#8217;m going to start implementing a recommender system soon. It seems like recommender systems get a bad rap from many people. My experiences with them have not been so stellar either. I think my basic gripe is that they speak up when they shouldn&#8217;t, that is when they don&#8217;t really have anything good to recommend [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m going to start implementing a recommender system soon. It seems like recommender systems get a bad rap from many people. My experiences with them have not been so stellar either. I think my basic gripe is that they speak up when they shouldn&#8217;t, that is when they don&#8217;t really have anything good to recommend or at a time that I&#8217;m not prepared to listen. A good friend wouldn&#8217;t do that. On the other hand, Google might be considered a recommender system, and for the most part it does a great job. In fact, I&#8217;ve joked that instead of coming up with a fancy algorithm we should take the descriptors of the context in which we are trying to match, throw them into a google search and give back the top 5 search results <img src='http://www.joelduffin.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>I&#8217;m currently thinking about the problem in the context of <a href="http://www.ozmozr.com/">ozmozr</a> and the <a title="NSDL" href="http://nsdl.org/">NSDL</a>. To tackle the problem, I&#8217;ve broken it down into the following questions:</p>
<ul>
<li><span style="font-weight: bold">Who</span> &#8211; Who are we recommending things to? (people or groups)</li>
<li><span style="font-weight: bold">What</span> &#8211; What are we recommending? (groups, feeds, stories/web pages, tags)</li>
<li><span style="font-weight: bold">When</span> &#8211; When do we offer recommendations? (when searching, when visiting home page)</li>
<li><span style="font-weight: bold">When not</span> &#8211; When do we refrain from making recommendations? (when we don&#8217;t reach a certain threshold of certainty)</li>
<li><span style="font-weight: bold">Criteria</span> &#8211; What factors should we include when considering what recommendations to make? What weight should be given to those criteria?</li>
<li><span style="font-weight: bold">Algorithm</span> &#8211; What algorithm should we use to implement the recommender?</li>
<li><span style="font-weight: bold">Implementation</span> &#8211; How do we implement the recommender? (use R to do the analysis and store the results in the DB, present it via Rails)</li>
<li><span style="font-weight: bold">Priority</span> &#8211; What is the priority of implementation? Where will we get the biggest payoff for our efforts?</li>
</ul>
<p>The factors considered in the algorithm will depend on what is being recommended and who it is being recommended to.</p>
<h3>Recommending Stories to Users</h3>
<p>Our approach will combine content filtering (co), collaborative filtering (cl), and rational analysis (ra) (rules that make sense).</p>
<p>Factors to consider:</p>
<ul>
<li>Titles, tags, and contents of stories they have <span style="font-weight: bold">read, tagged, shared, voted, </span>or <span style="font-weight: bold">externally linked</span></li>
</ul>
<ul>
<li><span style="font-weight: bold">Feeds</span> they subscribe to</li>
<li><span style="font-weight: bold">Groups</span> they belong to</li>
<li>Stories that <span style="font-weight: bold">people like them</span> have read, tagged, or shared, and how recently they read, tagged or shared them</li>
<li>Whether or not they have <span style="font-weight: bold">read the story before</span></li>
<li>How <span style="font-weight: bold">recently</span> the story was published</li>
<li>How often the story was viewed</li>
</ul>
<h3>Algorithm:</h3>
<ol>
<li>(Rational) Get the set of stories published within the specified recency.</li>
<li>(Rational) Eliminate stories the user has read before.</li>
<li>(Rational) Eliminate stories from the feeds that the user subscribes to?</li>
<li>(Rational) Eliminate stories from the feeds subscribed to by groups that the user belongs to?</li>
<li>(Content) Get the set of stories similar to ones that the user has read, tagged, or shared before.</li>
<li>(Collaborative) Get the set of stories that similar users have read, tagged, or shared.</li>
<li>Store in a &#8220;user_story_recommendation&#8221; table (user_id, story_id, content_score, collaborative_score)</li>
<li>Display recommendations by querying the &#8220;user_story_recommendation&#8221; table for stories, excluding those eliminated in steps 2-4, ranked according to a combination of the content and collaborative scores and recency.</li>
</ol>
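<p>The &#8220;rational&#8221; filters in steps 1 to 4 could be sketched in Python like this. Everything here is hypothetical: the story dict keys (<em>id</em>, <em>feed_id</em>, <em>published</em>), the seven-day window, and the decision to treat steps 3 and 4 as settled even though they are still open questions above.</p>

```python
from datetime import datetime, timedelta


def candidate_stories(stories, read_ids, excluded_feed_ids, now,
                      max_age=timedelta(days=7)):
    """Apply steps 1 to 4: keep recent, unread stories from other feeds.

    excluded_feed_ids should hold the feeds the user subscribes to directly
    plus the feeds subscribed to by groups the user belongs to.
    """
    cutoff = now - max_age
    return [
        story for story in stories
        if story["published"] >= cutoff               # step 1: recent enough
        and story["id"] not in read_ids               # step 2: not yet read
        and story["feed_id"] not in excluded_feed_ids  # steps 3 and 4
    ]


now = datetime(2007, 5, 10)
stories = [
    {"id": 1, "feed_id": "a", "published": datetime(2007, 5, 9)},
    {"id": 2, "feed_id": "b", "published": datetime(2007, 5, 8)},
    {"id": 3, "feed_id": "c", "published": datetime(2007, 4, 1)},
    {"id": 4, "feed_id": "c", "published": datetime(2007, 5, 7)},
]
kept = candidate_stories(stories, read_ids={1}, excluded_feed_ids={"b"}, now=now)
print([story["id"] for story in kept])
```

<p>Running the cheap eliminations first keeps the content and collaborative scoring in steps 5 and 6 from doing work on stories that would never be shown.</p>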
<p>Proposed formula for composite score: rank = r ( A(cn) + B(cf) ), where cn = content score, cf = collaborative score, r = recency score, and A and B are arbitrary weights that allow us to tune the relative contribution of the content and collaborative filtering scores.</p>
<p>Should the recency factor decay linearly, logarithmically, or exponentially?</p>
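<p>To compare the options, here is a Python sketch of the composite score with three candidate recency functions. The specific decay shapes and their parameters (a 30-day horizon, a 7-day half-life) are assumptions picked for illustration, not decisions that have been made.</p>

```python
import math


def recency_linear(age_days, horizon_days=30.0):
    """Falls in a straight line from 1 to 0 at the horizon, then stays at 0."""
    return max(0.0, 1.0 - age_days / horizon_days)


def recency_exponential(age_days, half_life_days=7.0):
    """Halves every half_life_days and never quite reaches 0."""
    return 0.5 ** (age_days / half_life_days)


def recency_logarithmic(age_days):
    """Drops quickly at first, then very slowly for older stories."""
    return 1.0 / (1.0 + math.log1p(age_days))


def composite_rank(content_score, collaborative_score, recency, a=1.0, b=1.0):
    """rank = r * (A * cn + B * cf)."""
    return recency * (a * content_score + b * collaborative_score)


for age in (0, 1, 7, 30, 90):
    r_lin = recency_linear(age)
    r_exp = recency_exponential(age)
    r_log = recency_logarithmic(age)
    print(age, round(r_lin, 3), round(r_exp, 3), round(r_log, 3))
```

<p>Because recency multiplies the whole score, a linear decay removes old stories entirely once it hits zero, while the exponential and logarithmic versions still let a very strong match surface after a long time.</p>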
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2007/05/09/implementing-a-recommender-system/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
