<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>undesigned &#187; web development</title>
	<atom:link href="http://www.joelduffin.com/blog/category/web-development/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.joelduffin.com/blog</link>
	<description>life is a rum go guv’nor, and that’s the truth</description>
	<lastBuildDate>Thu, 30 Sep 2010 07:08:21 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Getting started with Hadoop on Amazon&#8217;s elastic mapreduce</title>
		<link>http://www.joelduffin.com/blog/2010/09/08/getting-started-with-hadoop-on-amazons-elastic-mapreduce/</link>
		<comments>http://www.joelduffin.com/blog/2010/09/08/getting-started-with-hadoop-on-amazons-elastic-mapreduce/#comments</comments>
		<pubDate>Wed, 08 Sep 2010 15:23:54 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[recommender]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=474</guid>
		<description><![CDATA[After playing with Hadoop a bit in the past, I&#8217;m now trying out some things on Amazon&#8217;s Elastic MapReduce.
I signed up for a new AWS account and ran their sample LogAnalyzer Job Flow using the AWS console. That was easy enough. Next I attempted to run the same sample from the command line using the [...]]]></description>
			<content:encoded><![CDATA[<p>After playing with Hadoop a bit in the past, I&#8217;m now trying out some things on <a href="http://aws.amazon.com/elasticmapreduce/">Amazon&#8217;s Elastic MapReduce</a>.</p>
<p>I signed up for a new AWS account and ran their <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2440">sample LogAnalyzer Job Flow</a> using the AWS console. That was easy enough. Next I attempted to run the same sample from the command line using the <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264">Amazon Elastic MapReduce Ruby Client</a>.</p>
<p><strong>Note</strong>: The <a href="http://github.com/tc/elastic-mapreduce-ruby/blob/master/README">Ruby Client README</a> turns out to be very helpful.</p>
<p>Next I downloaded the source and looked at it. It seems simple enough. I noticed that this sample uses a library called <a href="http://www.cascading.org/">Cascading</a>, which appears to be a way to simplify common job flow tasks.</p>
<p>After adding the elastic-mapreduce app to my path and setting up my credentials file, I ran:</p>
<pre>elastic-mapreduce --create --jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3n://elasticmapreduce/samples/cloudfront/input,-output,s3n://folksemantic.com/cloudfront/log-reports,-start,any,-end,2010-09-07-21,-timeBucket,300,-overallVolumeReport"</pre>
<p>It produced:</p>
<pre>INFO Exception Retriable invalid response returned from RunJobFlow: {"Error"=&gt;{"Details"=&gt;#&lt;SocketError: getaddrinfo: nodename nor servname provided, or not known&gt;, "Code"=&gt;"InternalFailure", "Type"=&gt;"Sender"}} while calling RunJobFlow on Amazon::Coral::ElasticMapReduceClient, retrying in 3.0 seconds.</pre>
<p>After some poking around, I realized that I had specified &#8220;west-1&#8221; as my region when it should have been &#8220;us-west-1&#8221;. My guess is that this made the client try to contact a non-existent server.</p>
<p>So now, my jobs started, but failed immediately. I logged into the AWS console and clicked on one of the failed job flows to see the reason for the failure (Last State Change Reason):</p>
<p><span>The given SSH key name was invalid</span></p>
<p>Googling found: <a href="http://developer.amazonwebservices.com/connect/message.jspa?messageID=166768">http://developer.amazonwebservices.com/connect/message.jspa?messageID=166768</a></p>
<p>That confused me at first, but then I went ahead and followed <a href="https://console.aws.amazon.com/ec2/home#c=EC2&amp;s=KeyPairs">the link</a> (while logged in) and did what it said to do. (Amazing how that works sometimes <img src='http://www.joelduffin.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> ) It prompted me to create a new key and to assign it a name.</p>
<p>After I had generated the key and put its name in the credentials.json, things worked like a charm. It turns out that if you run a job from scratch, it has to fire up an EC2 instance in order to run the job, and that can take a while. To avoid that start up time, you can run:</p>
<pre>elastic-mapreduce --create --alive --log-uri s3://my-example-bucket/logs</pre>
<p>As mentioned in the README.TXT</p>
<p>My next steps are to:</p>
<ol>
<li>Modify the job flow and run that job flow.</li>
<li>Run the job flow locally.</li>
<li>Debug the MapReduce portion of the job flow.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2010/09/08/getting-started-with-hadoop-on-amazons-elastic-mapreduce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solving aggregation problems</title>
		<link>http://www.joelduffin.com/blog/2010/09/02/hadoop-based-aggregation-and-recommendation/</link>
		<comments>http://www.joelduffin.com/blog/2010/09/02/hadoop-based-aggregation-and-recommendation/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 23:27:21 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[recommender]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=456</guid>
		<description><![CDATA[In Folksemantic, we run into the following problems:

Duplicate entries. Search and recommendation results that list multiple entries for the same resource.
Catalog pages. Search and recommendation results that link to catalog pages for resources (people would rather go directly to the resource, but the metadata providers want people to go to their catalog entry for the [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.folksemantic.com">Folksemantic</a>, we run into the following problems:</p>
<ul>
<li><strong>Duplicate entries</strong>. Search and recommendation results that list multiple entries for the same resource.</li>
<li><strong>Catalog pages</strong>. Search and recommendation results that link to catalog pages for resources (people would rather go directly to the resource, but the metadata providers want people to go to their catalog entry for the resource).</li>
<li><strong>Dead links</strong>. Results that link to resources that no longer exist.</li>
<li><strong>Urls without metadata</strong>. When someone shares a resource or inserts the recommender widget in a page for which we don&#8217;t have metadata, we need to be able to generate metadata.</li>
</ul>
<p>Duplicate entries show up because:</p>
<ul>
<li>Two feeds specify entries with the same permalink.</li>
<li>The same feed gets added twice (maybe in different formats for the same feed, e.g. RSS and Atom).</li>
<li>Multiple catalogs provide metadata for the same resource.</li>
</ul>
<h3>Dealing With Duplicate Feeds</h3>
<p><strong>Problem</strong>: In Folksemantic a user can enter the url of their blog and we will detect the feeds from the page and add them. We use the feeds to generate personal recommendations. The problem is that a blog typically has 3 or more feeds, all of which contain the same content, just provided in different formats (e.g. RSS, Atom, etc.). So we really don&#8217;t want to add and harvest all of the feeds.</p>
<p><strong>Solution 1</strong>: One approach is to try to detect the duplicate feed the first time we harvest it, not add its entries to the index, and then flag the feed as &#8220;duplicate&#8221; so that we don&#8217;t harvest it again. We would store in the feed the id of the feed it duplicates. One potential problem is filtered feeds: if someone registers a feed that has just the entries tagged a certain way (e.g. all of the entries tagged apple on the gizmodo feed) and the main feed is already registered, then all of the entries in the filtered feed duplicate entries in the main feed. The entries are duplicates, but the feed is not. If we want to use the filtered feed as a basis for making recommendations to the user, we don&#8217;t want to fall back to the main feed.</p>
<p><strong>Solution 2</strong>: Another approach to the problem is to just add the feed and harvest it, but then flag the entries as duplicates. Our thought about doing this is to store in each entry a list of all of the feeds that the entry belongs to. We need to verify that this won&#8217;t slow down our Lucene queries.</p>
<p>It seems that Solution 2 may be best, leaving it up to the app to avoid adding duplicate feeds (as Folksemantic currently does with the 4 feeds for the same blog).</p>
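<p>A minimal sketch of the bookkeeping behind Solution 2, written in Python purely for illustration. The function name, the feed ids, and the in-memory dictionary are all assumptions; the real index is Lucene, and nothing here reflects how Folksemantic actually stores entries.</p>
<pre>from collections import defaultdict

def index_entries(harvested):
    """Map each entry permalink to the list of feeds it was harvested from."""
    feeds_by_permalink = defaultdict(list)
    for feed_id, permalink in harvested:
        if feed_id not in feeds_by_permalink[permalink]:
            feeds_by_permalink[permalink].append(feed_id)
    return dict(feeds_by_permalink)

# Hypothetical harvest results as (feed id, entry permalink) pairs.
harvested = [
    ("blog-rss", "http://example.com/post-1"),
    ("blog-atom", "http://example.com/post-1"),
    ("blog-rss", "http://example.com/post-2"),
]
index = index_entries(harvested)

# Entries that belong to more than one feed are the duplicates.
duplicates = sorted(p for p, feeds in index.items() if len(feeds) != 1)
print(duplicates)</pre>
<p>Keeping the feed list on the entry means a filtered feed can still be used for personal recommendations even when every one of its entries also appears in the main feed.</p>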
<h3>Dealing with Catalog Entries</h3>
<p>A number of NSDL and other projects such as OER Commons have created large catalogs of online resources. Sometimes their metadata is harvested directly from the resource websites. Sometimes they enhance that metadata with new information. Sometimes they create metadata for resources that don&#8217;t provide their own metadata. The catalog websites often provide rating, discussion, and other valuable services, so they want people to come to their websites and use them. While these services are nice, when people are searching for resources they likely want to look at the resource first and make their own judgement if that is possible, and then read more about it if they are interested. I think this is because the cost of looking at an online resource is minimal (as compared to buying something or attending a course, for example). So the catalog issue leads to the following problems:</p>
<p><strong>Problem</strong>: When people see search results, they likely want to go directly to the resource instead of to a catalog page.</p>
<p><strong>Solution</strong>: When a catalog page is the only entry for a resource, that entry is flagged &#8220;primary&#8221;. As soon as we create an entry that goes directly to the resource, we flag that new entry primary, and the catalog entry as not primary; we also store the id of the catalog entry in the list of duplicate entries that we store in the new entry. When searching, by default return only primary entries unless the application explicitly requests all entries. Return a flag indicating that an entry has catalog entries. Provide an API for requesting catalog entries for a specific entry.</p>
<p><strong>Problem</strong>: In most cases, catalog metadata does not provide the url of the resource it is cataloging.</p>
<p><strong>Solution</strong>: Initially flag the entry as &#8220;primary&#8221; so it will show up in search results. Later, asynchronously crawl the catalog pages to find the url of the catalogued resource. Once the direct url is known, create a new entry for the resource and store the id of the catalog entry in the list of &#8220;related entries&#8221; that we store for the new entry. Flag the catalog entry as not primary and the new entry as primary. Copy the metadata from the catalog entry into the new entry. Use the resource domain as the key for the feed to add the new entry to. If the feed does not already exist, create one for it.</p>
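<p>The primary-flag handoff described above could look roughly like this. It is a Python sketch under stated assumptions: entries live in a plain dictionary keyed by integer id, and the field names (<em>primary</em>, <em>related_entries</em>) and both function names are invented for the example.</p>
<pre>def promote_direct_entry(entries, catalog_id, direct_url):
    """Create a primary entry for the resource and demote its catalog entry."""
    catalog = entries[catalog_id]
    new_id = max(entries) + 1
    entries[new_id] = {
        "url": direct_url,
        "primary": True,
        "metadata": dict(catalog["metadata"]),  # copied from the catalog entry
        "related_entries": [catalog_id],
    }
    catalog["primary"] = False
    return new_id

def search(entries, include_all=False):
    """Return entry ids, primary entries only unless include_all is set."""
    return [eid for eid in sorted(entries) if include_all or entries[eid]["primary"]]

# Hypothetical catalog entry that starts out as the only (primary) entry.
entries = {
    1: {"url": "http://catalog.example.org/item/42", "primary": True,
        "metadata": {"title": "Intro to Algebra"}, "related_entries": []},
}
new_id = promote_direct_entry(entries, 1, "http://resource.example.com/algebra")
print(search(entries))                    # primary entries only
print(search(entries, include_all=True))  # catalog entries included</pre>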
<p><strong>Problem</strong>: If there are multiple entries for a resource (catalog entries, etc.), which metadata should we use to calculate the recommendation for the resource?</p>
<p><strong>Solution</strong>: Options might be: (a) the metadata provided by the resource, (b) metadata generated by a crawl of the resource &#8211; I think this is bad because frequently the metadata is more descriptive than the page itself, (c) the first catalog entry found for the resource, (d) the largest set of metadata for the resource. My thought is to always use the largest set of metadata for the resource unless there are no catalog entries (as in the case where we crawl a website), in which case we must use the metadata generated by the crawl. To facilitate this approach: (1) for each entry, we store whether or not the metadata came from the resource itself, (2) whenever we detect a new catalog entry for a resource that already has an entry, we look to see if the metadata in the existing entry was copied from a catalog entry; if it was, we compare the size of the metadata from the two entries and update the existing entry with the new catalog entry metadata if it is larger. For the purpose of calculating recommendations it might make sense to use all of the metadata from all of the sources.</p>
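<p>Step (2) can be sketched in Python as follows. How to measure the &#8220;size&#8221; of a metadata set is not settled above, so the character count used here is only an assumption, as are the function names and the <em>from_resource</em> field.</p>
<pre>def metadata_size(metadata):
    """Crude size measure: total number of characters across all values."""
    return sum(len(str(value)) for value in metadata.values())

def apply_catalog_metadata(entry, catalog_metadata):
    """Replace copied metadata when a new catalog entry offers a larger set."""
    if entry["from_resource"]:
        return entry  # metadata supplied by the resource itself is left alone
    # max() returns the first argument on a tie, so existing metadata wins ties.
    best = max(entry["metadata"], catalog_metadata, key=metadata_size)
    entry["metadata"] = dict(best)
    return entry

# Hypothetical entry whose metadata was copied from an earlier catalog entry.
entry = {"from_resource": False, "metadata": {"title": "Algebra"}}
apply_catalog_metadata(entry, {"title": "Algebra", "description": "An introductory course"})
print(entry["metadata"])</pre>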
<p><strong>Problem</strong>: When a website requests recommendations for a url, normally we want to return non-catalog entries, but when a catalog requests recommendations for one of its urls, they likely want their own catalog entries back if they exist.</p>
<p><strong>Solution</strong>: When generating recommendations for a catalog, check whether each recommended entry has catalog entries belonging to the requesting catalog and, if so, recommend those catalog entries instead.</p>
<h3>Detecting and Handling Feed Entry Deletions</h3>
<p><strong>Problem</strong>: OAI has a way to tell you that an entry has been deleted, but RSS does not. How can you detect when an entry has been deleted, and what should you do when it is deleted?</p>
<p><strong>Solution</strong>: My thought is that this is just part of what our dead link handler does. It finds entries with dead links and flags them deleted or actually deletes them. When we re-index we remove items from the index that have been flagged deleted.</p>
<h3>Dealing with Dead Links</h3>
<p><strong>Problem</strong>: Many times the resources in our indexes get taken down or moved without notification (the source of the metadata doesn&#8217;t get updated or it doesn&#8217;t get updated for a while). What should we do in that situation?</p>
<p><strong>Solution</strong>: We will write a bot that flags entries with dead links as dead. Once entries are flagged dead they won&#8217;t show up in search or recommendation results. Should they still be used as the basis for recommendations? Probably not. Maybe we create another process that looks for the new location of the dead entries?</p>
<h3>Generating Metadata for a URL</h3>
<p><strong>Problem</strong>: When someone adds an entry but doesn&#8217;t provide metadata, we need to be able to generate metadata for the entry. We also need to know which feed to put it into.</p>
<p><strong>Solution</strong>: The application should give us a feed id, or a display url for the feed, along with the entry URL. If it does not send a feed id, we will look for a feed using the host portion of the entry permalink. If one does not exist, we will create one and specify the host as the display url for the feed, so that future entries for that host will always go into that feed.</p>
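<p>A rough Python sketch of that lookup. The function name and the dictionary of feeds are assumptions made for the example; only the rule itself (use the feed id if given, otherwise match on the host of the permalink, otherwise create a feed for that host) comes from the description above.</p>
<pre>from urllib.parse import urlparse

def find_or_create_feed(feeds, entry_url, feed_id=None):
    """Return the id of the feed an entry belongs to, creating one if needed."""
    if feed_id is not None:
        return feed_id
    host = urlparse(entry_url).netloc
    for existing_id, feed in feeds.items():
        if feed["display_url"] == host:
            return existing_id
    new_id = max(feeds, default=0) + 1
    feeds[new_id] = {"display_url": host}
    return new_id

feeds = {}
first = find_or_create_feed(feeds, "http://example.com/a/page.html")
second = find_or_create_feed(feeds, "http://example.com/b/other.html")
print(first == second)  # both entries land in the same host feed</pre>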
<h3>Generating Recommendations for Web Pages We Haven&#8217;t Indexed Yet</h3>
<p>This is similar to the previous issue. We want people to be able to add the OER Recommender widget to their pages and have it just start working, removing the requirement that they add their resources to our index before we can provide recommendations. We can analyze and provide recommendations in real time, but that tends to bury our server if it gets a bunch of requests for real time recommendations all at once.</p>
<p><strong>Problem</strong>: Provide recommendations for URLs that haven&#8217;t been indexed yet.</p>
<p><strong>Solution</strong>: When the recommendation is requested, add an entry for the URL, flagging it as needing to be scraped. Flag the feed as being not-recommendable. If we don&#8217;t have a domain feed for the URL, add a domain feed for the entry and specify it as the feed for the entry. Queue the feed for approval by site admins.</p>
<p>This brings up the issue of being able to narrow the scope of the space into which recommendations are made. Depending on the context, we want to consider different sets of items to recommend. For example, in folksemantic, for personal recommendations we let users add feeds they produce, but we don&#8217;t necessarily want to include their stuff in recommendations that we make to other people.</p>
<p><strong>Problem</strong>: Narrow the scope of the space that we recommend items from.</p>
<p><strong>Solution</strong>: Define recommendation tasks by specifying the aggregation of feeds that we are recommending from and the aggregation of feeds that we are recommending into. Store those ids in the recommendation table.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2010/09/02/hadoop-based-aggregation-and-recommendation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Configuring Apache and Tomcat to serve my java web application through port 80</title>
		<link>http://www.joelduffin.com/blog/2010/08/16/configuring-apache-and-tomcat-to-serve-my-java-web-application-through-port-80/</link>
		<comments>http://www.joelduffin.com/blog/2010/08/16/configuring-apache-and-tomcat-to-serve-my-java-web-application-through-port-80/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 14:15:33 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=445</guid>
		<description><![CDATA[Default Tomcat installations run on port 8080 so you get urls like:
http://mydomain.com:8080/lms/index.jsp
Some firewalls block port 8080 so I wanted my site to be available on port 80 so that it uses urls like:
http://mydomain.com/lms/index.jsp.
One option was to modify the Tomcat configuration to listen on port 80. However, I already have Apache installed and listening on port [...]]]></description>
			<content:encoded><![CDATA[<p>Default Tomcat installations run on port 8080 so you get urls like:</p>
<pre>http://mydomain.com:8080/lms/index.jsp</pre>
<p>Some firewalls block port 8080 so I wanted my site to be available on port 80 so that it uses urls like:</p>
<pre>http://mydomain.com/lms/index.jsp</pre>
<p>One option was to modify the Tomcat configuration to listen on port 80. However, I already have Apache installed and listening on port 80 (to serve other content) so I couldn&#8217;t do that. Instead I configured Apache to route requests for my web application to Tomcat. I&#8217;ve been through this process a number of times before, and it never seems to go smoothly, so I document it here. I am running Apache 2.2 and Tomcat 6 on 32 bit Ubuntu Linux.</p>
<h2>Overview</h2>
<p>The steps are:</p>
<ol>
<li>Download the Apache jk connector module (mod_jk.so).</li>
<li>Create Apache module configuration files for the jk connector (jk.load and jk.conf) and enable the module.</li>
<li>Create a workers.properties file to configure the Tomcat worker for the connector.</li>
<li>Define an AJP connector in your Tomcat configuration file (server.xml).</li>
<li>Assign urls to Tomcat in your Apache virtual hosts file.</li>
</ol>
<h2>Download the Apache jk connector module (mod_jk.so)</h2>
<p>Apache uses the jk connector module to talk to Tomcat. I downloaded it from a subdirectory of <a href="http://www.apache.org/dist/tomcat/tomcat-connectors/jk/binaries/">http://www.apache.org/dist/tomcat/tomcat-connectors/jk/binaries/</a>. I wasn&#8217;t sure which OS I was running and whether or not I was running a 32 bit version (i586 directory) or a 64 bit version (x86_64). To find this out, I ran:</p>
<pre>file /usr/bin/file
/usr/bin/file: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped</pre>
<p>So I downloaded: <a href="http://www.apache.org/dist/tomcat/tomcat-connectors/jk/binaries/linux/jk-1.2.28/i586/mod_jk-1.2.28-httpd-2.2.X.so">http://www.apache.org/dist/tomcat/tomcat-connectors/jk/binaries/linux/jk-1.2.28/i586/mod_jk-1.2.28-httpd-2.2.X.so</a></p>
<p>I chose that version because I am running Apache 2.2. There are different versions for different versions of Apache.</p>
<p>I put the file in my Apache modules directory (/usr/lib/apache2/modules/) and renamed it to mod_jk.so.</p>
<h2>Create Apache module configuration files for the jk connector (jk.load and jk.conf) and enable the module</h2>
<p>In order to get Apache to load and configure the jk connector module, I created jk.load and jk.conf files (in /etc/apache2/mods-available/) and then enabled them. jk.load just tells Apache where to find the module:</p>
<pre>LoadModule jk_module /usr/lib/apache2/modules/mod_jk.so</pre>
<p>jk.conf ties everything together by configuring the jk connector module:</p>
<pre># Where to find workers.properties
# Update this path to match your conf directory location (put workers.properties next to httpd.conf)
JkWorkersFile /etc/apache2/workers.properties

# Where to put jk shared memory
# Update this path to match your local state directory or logs directory
JkShmFile     /var/log/apache2/mod_jk.shm

# Where to put jk logs
# Update this path to match your logs directory location (put mod_jk.log next to access_log)
JkLogFile     /var/log/apache2/mod_jk.log

# Set the jk log level [debug/error/info]
JkLogLevel    info

# Select the timestamp log format
JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "</pre>
<p>For more information about the Apache jk connector module configuration, see the <a href="http://tomcat.apache.org/connectors-doc/webserver_howto/apache.html">Tomcat Connector &#8211; Apache Webserver HowTo</a>.</p>
<p>Initially I set the JkLogLevel to debug, so I could see any error messages, but then changed it to info once I had everything working.</p>
<p>After creating the files, I enabled the module using:</p>
<pre>sudo a2enmod jk</pre>
<p>That creates symlinks to the jk.load and jk.conf files in the mods-enabled directory where my Apache is configured to look for modules to load.</p>
<h2 style="font-size: 1.5em;">Define an AJP connector in your Tomcat configuration file (server.xml)</h2>
<p>AJP is an efficient protocol that Apache and Tomcat can be configured to use to talk to each other. I set up an AJP connector in my Tomcat configuration file (/etc/tomcat6/server.xml). The default configuration file has the connector defined but commented out, so I uncommented it:</p>
<pre style="font: normal normal normal 12px/18px Consolas, Monaco, 'Courier New', Courier, monospace;">&lt;!-- Define an AJP 1.3 Connector on port 8009 --&gt;
&lt;Connector port="8009" protocol="AJP/1.3" redirectPort="8443" /&gt;</pre>
<p>For details see the <a href="http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html">Tomcat AJP Connector documentation</a>.</p>
<h2 style="font-size: 1.5em;">Create a workers.properties file to configure the Tomcat ajp worker for the connector</h2>
<p>&#8220;A <strong>Tomcat worker </strong>is a Tomcat instance that is waiting to execute servlets or any other content on behalf of some web server&#8221;.</p>
<p><em>Note: this quote from the documentation is a bit curious, because, nowhere in the Tomcat configuration files do I tell Tomcat about the worker. I think that the worker is actually a process that the jk connector spawns</em>.</p>
<p>I configured a worker to listen to Apache requests by creating a workers.properties file in the same directory as the Apache configuration file (/etc/apache2/workers.properties).</p>
<pre style="font: normal normal normal 12px/18px Consolas, Monaco, 'Courier New', Courier, monospace;"># Define 1 real worker using ajp13
worker.list=worker1

# Set properties for worker1 (ajp13)
worker.worker1.type=ajp13
worker.worker1.host=localhost
worker.worker1.port=8009</pre>
<p>The jk connector knows how to talk to this worker, because the file name is specified in the Apache jk connector configuration file (/etc/apache2/mods-available/jk.conf). For more information see the <a href="http://tomcat.apache.org/connectors-doc/generic_howto/quick.html">Tomcat Connector Quick Start</a> or the <a href="http://tomcat.apache.org/connectors-doc/reference/workers.html">Tomcat Connector Reference Guide</a>.</p>
<h2>Assign urls to Tomcat in your Apache virtual hosts file</h2>
<p>After I configured a Tomcat worker to listen to AJP requests and configured Apache to use the jk connector module to talk to that worker, the last thing that was needed was to configure my web site&#8217;s virtual host (/etc/apache2/sites-available/default) to route urls to Tomcat:</p>
<pre>&lt;VirtualHost *:80&gt;
...
        JkMount /lms/* worker1
...
&lt;/VirtualHost&gt;</pre>
<p>Note that <em>worker1</em> is the name I gave to the worker I set up in the workers.properties file. Note that by using a * mask, I routed all requests (including static files) through Tomcat. Alternatively I could have configured only jsp requests to be routed to Tomcat, using:</p>
<pre>&lt;VirtualHost *:80&gt;
...
        JkMount /lms/*.jsp worker1
...
&lt;/VirtualHost&gt;</pre>
<div>I would then have needed to add Directory configurations to the virtual host telling Apache where to serve the static files from.</div>
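<p>To see why the choice of mask matters, here is a small Python illustration of which request paths each JkMount pattern would hand to Tomcat. It uses Python&#8217;s fnmatch module as a stand-in for the jk connector&#8217;s wildcard matching, which is an approximation and not the module&#8217;s actual code; the function name and the sample paths are made up for the example.</p>
<pre>from fnmatch import fnmatchcase

def routed_to_tomcat(path, mount_pattern):
    """Approximate a JkMount wildcard match with shell-style globbing."""
    return fnmatchcase(path, mount_pattern)

# Hypothetical request paths for the lms application.
paths = ["/lms/index.jsp", "/lms/css/style.css", "/lms/images/logo.png"]
for pattern in ("/lms/*", "/lms/*.jsp"):
    routed = [p for p in paths if routed_to_tomcat(p, pattern)]
    print(pattern, routed)</pre>
<p>With the /lms/* mask all three paths go to Tomcat. With /lms/*.jsp only index.jsp does, so Apache has to be told where to find the stylesheet and the image itself.</p>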
<h2>Restart Tomcat and Apache</h2>
<div>Of course, after I had done all of this, I had to restart Tomcat and Apache:</div>
<pre>sudo /etc/init.d/tomcat6 restart
sudo /etc/init.d/apache2 restart</pre>
<h2>Lifecycle</h2>
<p>To the best of my understanding, the relevant lifecycle is:</p>
<ol>
<li>When Tomcat starts up, it begins listening for AJP requests on port 8009 (because the connector is defined in /etc/tomcat6/server.xml).</li>
<li>When Apache starts up, it loads the jk connector module (because it is defined in /etc/apache2/mods-enabled/jk.load).</li>
<li>When Apache loads the jk connector, its configuration file (/etc/apache2/mods-enabled/jk.conf) tells it to send requests to the specified Tomcat worker and to use shared memory to do that.</li>
<li>It is not clear to me whether or not the Tomcat worker gets spawned when Apache starts up or on each request. I don&#8217;t see how it could get spawned when Tomcat starts up since Tomcat has no way of knowing about it.</li>
<li>Apache receives a request for a url that is mapped to Tomcat (in the virtual host file &#8211; /etc/apache2/sites-enabled/default).</li>
<li>Apache uses the jk connector module (mod_jk.so) to generate a request to send to Tomcat via a Tomcat worker.</li>
<li>The Tomcat worker communicates with Tomcat using the protocol (AJP) and port (8009) defined in the workers configuration file (/etc/apache2/workers.properties).</li>
<li>Tomcat processes the request and returns the response back through the worker and connector to Apache which returns it to the client.</li>
</ol>
<p>Kind of complicated, huh? And of course, this is just my best guess.</p>
<h2>Questions and problems</h2>
<div>The questions and problems I ran into this time through the process were:</div>
<div>
<ul>
<li>Not knowing which version of the JK connector to download</li>
<li>Forgetting I needed to configure the AJP connector in the Tomcat configuration file</li>
<li>Initially I routed only requests for JSP pages to Tomcat and so my stylesheets and images did not show up</li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2010/08/16/configuring-apache-and-tomcat-to-serve-my-java-web-application-through-port-80/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Compiling bcrypt-ruby gem for Windows</title>
		<link>http://www.joelduffin.com/blog/2009/05/11/compiling-bcrypt-for-windows/</link>
		<comments>http://www.joelduffin.com/blog/2009/05/11/compiling-bcrypt-for-windows/#comments</comments>
		<pubDate>Mon, 11 May 2009 21:40:18 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[rails]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=287</guid>
		<description><![CDATA[I am a heretic. I develop Rails apps on Windows. I own a Macbook, but I usually boot it into Windows. Justin thinks I should just move to developing on a Mac. I keep holding out. But periodically I try to install a gem that needs to compile native extensions for Windows, and [...]]]></description>
			<content:encoded><![CDATA[<p>I am a heretic. I develop Rails apps on Windows. I own a Macbook, but I usually boot it into Windows. <a href="http://justinball.com/">Justin</a> thinks I should just move to developing on a Mac. I keep holding out. But periodically I try to install a gem that needs to compile native extensions for Windows, and it fails. This just makes me mad. My latest encounter was with <a href="http://blog.codahale.com/2007/02/28/bcrypt-ruby-secure-password-hashing/">the bcrypt gem</a>. I did some googling and finally found a solution:</p>
<ol>
<li>Dug up my old copy of Visual Studio 6 CDs and installed the command line utilities. Apparently, not just any version will do; you have to be in sync with the version used to compile ruby.</li>
<li>Added VS6&#8217;s bin directories to my Windows path. Default install locations are:<br />
C:\Program Files\Microsoft Visual Studio\VC98\Bin;C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin;</p>
<p>Alternatively, you can get a command prompt in  C:\Program Files\Microsoft Visual Studio\VC98\Bin and run VCVARS32.BAT to add those directories to your path.</li>
<li>Added some typedefs and a function to C:\Program Files\Microsoft Visual Studio\VC98\Include\SYS\TYPES.H that are not available on Windows:
<pre>#ifndef _UINT_T_DEFINED
typedef unsigned char  u_int8_t;
typedef unsigned short u_int16_t;
typedef unsigned int   u_int32_t;
typedef unsigned __int64 u_int64_t;
#define _UINT_T_DEFINED
#endif</pre>
<pre>#ifndef snprintf
#define snprintf _snprintf
#endif</pre>
</li>
</ol>
<p>Now I can do:</p>
<pre>gem install bcrypt-ruby</pre>
<p>I get the &#8220;Building native extensions. This could take a while&#8230;&#8221; message and it works!</p>
<p>I&#8217;m happy once again. I&#8217;m satisfied. And I didn&#8217;t even have to change religions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2009/05/11/compiling-bcrypt-for-windows/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Multilingual Google search mashup</title>
		<link>http://www.joelduffin.com/blog/2008/04/21/multi-lingual-google-search-mashup/</link>
		<comments>http://www.joelduffin.com/blog/2008/04/21/multi-lingual-google-search-mashup/#comments</comments>
		<pubDate>Mon, 21 Apr 2008 23:08:31 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[multi-lingual]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=56</guid>
		<description><![CDATA[For some time I have envisioned a web browser that allows me to search and browse all of the web-pages of the world and view them in English. I figure there have got to be lots of cool things going on in the non-English speaking world that I would be interested in but I never hear [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.joelduffin.com/blog/wp-content/uploads/2008/04/sombrero.jpg"><img class="alignright alignnone size-medium wp-image-57" style="float: right;" title="multi-lingual Kedward" src="http://www.joelduffin.com/blog/wp-content/uploads/2008/04/sombrero.jpg" alt="" width="300" height="225" /></a>For sometime I have envisioned a web browser that allows me to search and browse all of the web-pages of the world and view them in English. I figure there have got to be lots of cool things going on in the non-English speaking world that I would be interested in but I never hear about them because I don&#8217;t speak those languages.</p>
<p>While attending the <a href="http://mtnwestrubyconf.org/">2008 Mountain West Ruby Conference</a> and needing something to hack on I decided to take a crack at the project. I already hacked <a href="http://translate.google.com/translate_t">Google Translate</a> for Send2Wiki so I figured it would be a snap to do for this project. My plan was to take the search text, run it through the translator for each of the languages to search, then pass the translated queries off to the Google search sites for each of the languages and then pass those pages through Google translate to get English versions of the pages. I <a href="http://googlesystem.blogspot.com/2007/05/google-multilingual-search.html">soon found</a> that Google has already done most of the work for me with their <a href="http://translate.google.com/translate_s">cross-language search</a>.</p>
<p>The only thing cross-language search doesn&#8217;t do for me is collate all of the language results into a single results page. You can only search for results in a single targeted language. Anyway, between sessions (a coder has always got to brag about how fast he can work right <img src='http://www.joelduffin.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  I threw together a <a href="http://www.toolsforsolving.com/">Multilingual Google Search Mashup</a> that does the job. As I put it together, a couple of things almost immediately stood out:</p>
<ul>
<li><strong>Wikipedia owns the top hit slot for many searches</strong>. Because their pages are essentially equivalent in the different languages, listing that entry for each of the languages isn&#8217;t especially useful.</li>
<li><strong>Interleaving search results is difficult</strong>. Rather than try to figure out an intelligent way to order the search results from the various languages in real time, I just give the first two from each language and provide a link for getting more. I&#8217;ve got ideas for interleaving results, but none of them are easy. Notice also that I haven&#8217;t included English in the search, which is probably where the most relevant pages will actually come from.</li>
</ul>
<p>These issues make me wonder if a different approach would be preferable. Perhaps Google could annotate search results with relevant pages in different languages. This also makes me think about Google&#8217;s search result ordering. Google search results appear to be deterministic (if you execute the same search twice, the same item will show up at the top of the list). While this may be what we have come to expect, my experience with writing <a href="http://www.oerrecommender.org/">OER Recommender</a> makes me believe that it isn&#8217;t necessarily the best or the fairest thing to do. When ranking pages it is often the case that the scores of the top two or even 10 pages are statistically indistinguishable. So why should the one that happens to have a .00000001% higher score always show up first? My approach with OER Recommender was to identify the stratum of &#8220;highest ranked pages&#8221; that are virtually indistinguishable and randomize their order. This seems fairer, since it is quite natural for users to click on the first item on a search results page, thus biasing it to become more and more popular.</p>
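<p>The randomizing idea can be sketched in a few lines of Python. This is only an illustration, not the code OER Recommender runs: the function name, the <code>(page, score)</code> tuples, and the fixed <code>tolerance</code> used as a stand-in for &#8220;statistically indistinguishable&#8221; are all assumptions, and the sketch shuffles every tier of near-equal scores rather than only the top one.</p>

```python
import random

def rank_with_ties_shuffled(scored_pages, tolerance=0.01, rng=None):
    """Sort (page, score) pairs by score, then shuffle each tier of near-equal scores.

    `tolerance` is a hypothetical cutoff standing in for a real statistical test.
    """
    rng = rng or random.Random()
    ordered = sorted(scored_pages, key=lambda pair: pair[1], reverse=True)
    result, tier = [], []
    for page in ordered:
        # Close the current tier once a score falls too far below the tier's best score.
        if tier and tier[0][1] - page[1] > tolerance:
            rng.shuffle(tier)
            result.extend(tier)
            tier = []
        tier.append(page)
    rng.shuffle(tier)
    result.extend(tier)
    return result

pages = [("a", 0.900), ("b", 0.899), ("c", 0.500)]
ranking = rank_with_ties_shuffled(pages)
# "a" and "b" may appear in either order; "c" always comes last.
```

<p>Each call can put a different member of the top tier first, so no single page collects all of the first-position clicks.</p>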
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2008/04/21/multi-lingual-google-search-mashup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Moved to slicehost</title>
		<link>http://www.joelduffin.com/blog/2008/04/21/moved-to-slicehost/</link>
		<comments>http://www.joelduffin.com/blog/2008/04/21/moved-to-slicehost/#comments</comments>
		<pubDate>Mon, 21 Apr 2008 16:37:56 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[web development]]></category>
		<category><![CDATA[folksemantic]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=53</guid>
		<description><![CDATA[I recently moved from hostgator to slicehost. I signed up with hostgator because it seemed to be a cheap place ($10/mo) to play with rails. It turns out that hostgator doesn&#8217;t really do rails (they offer it via cgi, not even fastcgi). They didn&#8217;t allow me to install things like the Send2Wiki perl module or [...]]]></description>
			<content:encoded><![CDATA[<p>I recently moved from hostgator to slicehost. I signed up with hostgator because it seemed to be a cheap place ($10/mo) to play with rails. It turns out that <a href="http://www.hostgator.com/">hostgator doesn&#8217;t really do rails</a> (they offer it via cgi, not even fastcgi). They didn&#8217;t allow me to install things like the <a href="http://www.mediawiki.org/wiki/Extension:Send2Wiki">Send2Wiki</a> perl module or give me access to execute scripts from PHP either.</p>
<p>With <a href="http://www.slicehost.com/">slicehost I get a VPS for $20 / mo</a>. I&#8217;ve never used a VPS before but have run my own web servers, so it seems worth a shot. In the process of getting it set up I&#8217;ve learned more about DNS, iptables, fastcgi, etc. than I knew before, so that is fun, but also a time drain. Anyway, now I&#8217;ve got the access I need to set up Send2Wiki and run rails apps.</p>
<p>After hearing <a href="http://www.justinball.com/2008/03/06/social-wordpress/">Justin rave about WPMU</a>, I thought I would give it a shot. WP normally installs easily, so I expected the same. It would have, except I wanted to install WPMU in a subdirectory. My Googling turned up warnings against doing so, but I shrugged them off, forged ahead, and tried to hack it to get it to work. Bad idea. It may be possible to do, and I got it to almost work, but it certainly doesn&#8217;t fall within the scope of the famous 5-minute install. So I&#8217;ve backed off and returned to the single-user version.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2008/04/21/moved-to-slicehost/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene and Multi-Lingual Updates to OER Recommender</title>
		<link>http://www.joelduffin.com/blog/2008/04/03/lucene-and-multi-lingual-updates-to-oer-recommender/</link>
		<comments>http://www.joelduffin.com/blog/2008/04/03/lucene-and-multi-lingual-updates-to-oer-recommender/#comments</comments>
		<pubDate>Thu, 03 Apr 2008 14:16:23 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[web development]]></category>
		<category><![CDATA[folksemantic]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[multi-lingual]]></category>
		<category><![CDATA[recommender]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/2008/04/03/lucene-and-multi-lingual-updates-to-oer-recommender/</guid>
		<description><![CDATA[Last week I posted an update to OER Recommender. The source for the project is posted in Google code projects: oerrecommender, recommenderd, and aggregatord. The biggest change was moving OER Recommender from my home-brewed indexing and recommendation engine to using the super fast, super easy, open source search engine Lucene. I made the move because [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I posted an update to <a href="http://www.oerrecommender.org/">OER Recommender</a>. The source for the project is posted in Google Code projects: <a href="http://code.google.com/p/oerrecommender/">oerrecommender</a>, <a href="http://code.google.com/p/recommenderd/">recommenderd</a>, and <a href="http://code.google.com/p/aggregatord/">aggregatord</a>. The biggest change was moving OER Recommender from my home-brewed indexing and recommendation engine to using the <a href="http://lucene.apache.org/">super fast, super easy, open source search engine Lucene</a>. I made the move because I had heard many good things about Lucene and wanted to explore using it. In addition, Lucene supports multiple languages nicely. Because the OER Recommender web app is written in Rails, I used the <a href="http://wiki.rubyonrails.org/rails/pages/Acts+as+Solr+Plugin">acts_as_solr</a> plugin, which depends on <a href="http://lucene.apache.org/solr/">Solr</a>, another Apache project which provides easy integration with Web applications. Here is a list of the changes I made:</p>
<ul>
<li><strong>Added Collections</strong>. The index now contains more than 90,000 records from <a href="http://www.oerrecommender.org/collections">over 100 collections and 26 languages</a>.</li>
<li><strong>Added Support for Harvesting via SQI/WSDL</strong>. Support for harvesting SQI/WSDL using <a href="http://ws.apache.org/axis/">Axis</a> was added in order to harvest <a href="http://www.merlot.org/">MERLOT</a> via <a href="http://ariadne.cs.kuleuven.be/SqiInterop/free/SQIImplementationsRegistry.jsp#Merlot">Ariadne</a>.</li>
<li><strong>Catalog Links</strong>. When providing recommendations, if we have both catalog links and direct links for resources, as in the case of <a href="http://www.oercommons.org/">OER Commons</a> and <a href="http://www.merlot.org/">MERLOT</a>, both are provided.</li>
<li><strong>OAI Set Discovery</strong>. In order to get the names of collections from OER Commons, the ability to discover the OAI Sets (collections) was added to the harvester.</li>
<li><strong>Lucene</strong>. The home brewed search and recommendation system was swapped out with Lucene. This makes for faster and better searching as well as faster indexing. The full range of <a href="http://lucene.apache.org/java/docs/queryparsersyntax.html">query syntax supported by Lucene</a> is now supported.</li>
<li><strong>Multi-Lingual</strong>. With Lucene in place OER Recommender can now support language-specific search and recommendation.</li>
<li><strong>Additional Metadata</strong>. Additional metadata was added to search and recommendation results: Descriptions, Authors, Date (metadata), Date (relevance was calculated).</li>
<li><strong>Home Page Cleanup</strong>. The home page was simplified by moving the <a href="http://www.oerrecommender.org/help/demo.html">Greasemonkey script and example resources</a> to a separate page.</li>
<li><strong>Search Results</strong>. Search results were modified to look similar to Google search results. In cases where the index contains both links to catalog pages and direct links for a resource, a Metadata link next to the title takes you to the catalog page. A &#8220;Related Resources&#8221; link is also provided next to each item in search results. This makes it easy to see recommendations.</li>
<li><strong>More Recommendations Page</strong>. The geek-friendly page was replaced with a page that looks essentially like the search results page. A link to the original page containing the details is included near the top of the page.</li>
<li><strong>Incremental Updates Support</strong>. The recommender was modified to support incrementally updating the indexes and recommendations without losing user data. It now runs every night, harvesting the collections, then indexing and creating recommendations for new records. Once a week it re-runs all recommendations so that recommendations can be created that point at new records.</li>
<li><strong>Time on Page</strong>. Average time on page tracking was added and used to adapt the recommendations algorithm.</li>
<li><strong>Localized Interface</strong>. The main web pages were translated into Spanish, French, German, Japanese, Dutch, Russian, and Chinese using <a href="http://www.google.com/translate_t">Google Translate</a> (if you speak one of those languages and want to help the translation, feel free to send me fixes). Localization is supported via the <a href="http://simple-localization.arkanis.de/">swell Simple Localization rails plugin</a>. The web app also auto-detects the language set in your web browser and sets that as the default search interface.</li>
</ul>
<p>More on implementation later&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2008/04/03/lucene-and-multi-lingual-updates-to-oer-recommender/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debugging browser incompatibilities</title>
		<link>http://www.joelduffin.com/blog/2007/12/13/debugging-browser-incompatibilities/</link>
		<comments>http://www.joelduffin.com/blog/2007/12/13/debugging-browser-incompatibilities/#comments</comments>
		<pubDate>Thu, 13 Dec 2007 15:27:29 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[interactive online math]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[nlvm]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/2007/12/13/debugging-browser-incompatibilities/</guid>
		<description><![CDATA[Every time we update the NLVM website, as we did a few weeks ago, we receive email from people who are no longer able to access the applets. Often the causes are somewhat mysterious. Some of the problems are caused by proxy and browser caching; some of the updated files arrive at people&#8217;s browsers [...]]]></description>
			<content:encoded><![CDATA[<p>Every time we update the <a href="http://nlvm.usu.edu/">NLVM</a> website, as we did a few weeks ago, we receive email from people who are no longer able to access the applets. Often the causes are somewhat mysterious. Some of the problems are caused by proxy and browser caching; some of the updated files arrive at people&#8217;s browsers and others do not. After a few days the problems seem to work themselves out.</p>
<p>Other times problems occur because we broke something <img src='http://www.joelduffin.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  and we didn&#8217;t catch it in our testing. This is aggravated by browser incompatibilities. Because old hardware and software tend to hang around schools longer than other places, strange things show up. Right now I&#8217;m trying to track down some of those types of issues.</p>
<p>A few years ago when I was more actively developing the NLVM I used to keep old machines around so I could test old browsers. One of the reasons multiple machines were needed was that I couldn&#8217;t find an easy way to run multiple versions of Internet Explorer on the same machine. Yesterday I googled to see if there is anything new out there to help with this issue. I was pleasantly surprised to find a utility by tredosoft that allows you to <a href="http://tredosoft.com/Multiple_IE">install multiple versions of IE on your PC</a>. Thanks tredosoft! Unfortunately after installing the multiple browsers, I&#8217;m still not able to see the reported problem even when running on the same browser.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2007/12/13/debugging-browser-incompatibilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>OER Recommender Released</title>
		<link>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/</link>
		<comments>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/#comments</comments>
		<pubDate>Fri, 24 Aug 2007 04:15:06 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[folksemantic]]></category>
		<category><![CDATA[nsdl]]></category>
		<category><![CDATA[recommender]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=12</guid>
		<description><![CDATA[Here is the updated OER Recommender White Paper.
Yesterday we released the OER Recommender system that I have worked on.  There are still many things that could be added or tweaked, but it does something useful already so out the door it goes! I&#8217;m concerned that we are calling it a recommender as the &#8220;recommendations&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>Here is the updated <a title="OER Recommender White Paper" href="http://www.joelduffin.com/blog/wp-content/uploads/2008/01/recommender.pdf">OER Recommender White Paper</a>.</p>
<p>Yesterday we <a href="http://opencontent.org/blog/archives/367">released</a> the <a href="http://www.oerrecommender.org/">OER Recommender</a> system that I have worked on. There are still many things that could be added or tweaked, but it does something useful already so out the door it goes! I&#8217;m concerned that we are calling it a <a href="http://en.wikipedia.org/wiki/Recommendation_system">recommender</a>, as the &#8220;recommendations&#8221; it currently provides are not specific to a user; they are related resources generated via a content-based approach. The proposal and plans are to make it a recommender based on user profiles. See <a href="http://www.joelduffin.com/blog/2007/05/09/implementing-a-recommender-system/">my previous post</a> for details about where it is intended to go.</p>
<p><strong>The Problem</strong>. The <a href="http://www.hewlett.org/" target="_blank">William and Flora Hewlett Foundation</a>, the <a href="http://nsdl.org/">National Science Digital Library</a>, and other large organizations have made large investments in the development of electronic resources that can be used for teaching, learning, and research. They are keen on finding ways to increase the impact of their investment. One way to do that is to create tools that make it easier for people to find resources that are useful to them. One approach to this problem is search and browsing tools that help people find good resources, such as NSDL, <a href="http://www.ocwfinder.org/">OCW Finder</a>, and of course Google. Recommender systems approach the problem by bringing resources to people&#8217;s attention without their having to go looking for them specifically; Google&#8217;s AdSense technology and Amazon&#8217;s recommendations are two of the most common examples. One challenge with general search tools such as Google, and even with more focused ones such as NSDL&#8217;s main search, is that they cast too wide a net and you get back resources that do not match. On the other hand, while most collections of open courseware and digital libraries provide search, it is nice to be able to search across repositories at the same time to find the best resources wherever they reside.</p>
<p><strong>The Vision</strong>. In creating the OER Recommender we set out to create a service that would help people find relevant open education resources. The first step in this was to create an automated process for clustering related resources. The hope is that presenting users with links to related resources while they are looking at a resource could help them stumble upon resources that are relevant to them even though they might not have been actively looking for them. We could tune this service by monitoring what resources people visit, share, rate, tag, and otherwise pay attention to. We could use this attention metadata to create profiles of what people&#8217;s interests are. The profiles could be used to push recommendations to people even when they are not browsing, perhaps via email or other means.</p>
<p><strong>Exploration</strong>. Ever since I learned about it, I have felt that the recommender would be one of the more interesting <a href="http://www.folksemantic.org/">folksemantic</a> tools to work on, because I initially thought to pursue a <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">latent semantic analysis</a> approach. I had heard about LSA from my exposure to <a href="http://sitcogblog.blogspot.com/">Andy Walker&#8217;s</a> and <a href="http://home.autotutor.org/graesser/">Art Graesser&#8217;s</a> work. I have also had a number of interesting discussions about clustering with <a href="http://www.math.usu.edu/~adele/home.htm">Adele Cutler</a> about her work on <a href="http://www.math.usu.edu/~adele/forests/index.htm">Random Forests</a>. I got a hold of the <span class="sans"><a href="http://www.amazon.com/Handbook-Semantic-University-Institute-Cognitive/dp/0805854185">Handbook of Latent Semantic Analysis</a> and explored the <a href="http://del.icio.us/jduffin/lsa">LSA online resources I could find</a> as well as the <a href="http://cran.r-project.org/src/contrib/Descriptions/lsa.html">lsa module for R</a></span>. After spending significant time reading and playing, I came to the conclusion that it would likely be more complex to implement and more computationally expensive than I wanted. I also concluded that everything that I would need to do to implement a standard <a href="http://en.wikipedia.org/wiki/Vector_space_model">Term Vector Model</a> approach could be used to later implement an LSA approach. During my exploration, I came across a number of <a href="http://del.icio.us/jduffin/recommender">useful online resources</a> and books, including <a href="http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html">Gerard Salton&#8217;s</a> work.</p>
<p><strong>An Explanation for My Mother</strong>. I&#8217;ve already been asked by <a href="http://shelleylyn.blogspot.com/">Shelly</a> to explain the implementation of the recommender in language that my mother would understand, so I&#8217;ll give that explanation first. We use the folksemantic feed harvester to gather information (metadata) about OERs into databases. For each pair of resources that are related enough, the recommender uses the titles, description, and tags to calculate a score indicating how related they are. Recommended resources are the ones scored to be the most similar. The similarity of two resources is based on an automated analysis of the words in their metadata. Got that Mom?</p>
<p>We use four phases to arrive at recommendations: (1) parse metadata, (2) calculate local term weights, (3) calculate global term weights, (4) calculate similarity scores.</p>
<p><strong>Parse Metadata</strong>. Using a string tokenizer we break metadata text into terms. We throw away stop words (common words such as &#8216;and&#8217;, &#8216;is&#8217;, &#8216;of&#8217; that do not add meaning). Next we convert terms into their stems using a <a href="http://www.tartarus.org/~martin/PorterStemmer">common algorithm</a>; for example, &#8216;running&#8217; and &#8216;runs&#8217; both get collapsed into &#8216;run&#8217;.</p>
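<p>A rough Python sketch of the parse phase follows. It is not the recommender&#8217;s actual code: the stop word list is a tiny illustrative sample, and <code>naive_stem</code> is a crude suffix stripper standing in for the Porter stemmer, which handles far more cases.</p>

```python
import re

# Illustrative sample only; a real stop word list is much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def naive_stem(term):
    """Crude suffix stripping; a hypothetical stand-in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            term = term[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(term) >= 2 and term[-1] == term[-2] and term[-1] not in "aeiouls":
                term = term[:-1]
            break
    return term

def parse_metadata(text):
    """Tokenize, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]

terms = parse_metadata("Running and runs of the tests")
# terms == ["run", "run", "test"]
```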
<p><strong>Calculate Local Term Weights</strong>. A local term weight is a measure of how important a word is for describing a document. The more frequently a word appears in a document, the more important it is&#8230; up to a point. Of course, where a word appears in a document is probably a good indicator of how important the word is as well. For example if a word appears in a title, it is probably more important than if it appears in the body. One more important factor to consider is document length (total number of terms in the document); all other things being equal, the longer a document is, the more times a term will appear in it. So we normalize term frequencies using the document length. One last issue relates to filler words.</p>
<p>To calculate local term weights OER Recommender uses a function that looks like this.</p>
<p><a title="Global Term Weights Function" href="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/gtw.jpg"><img src="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/gtw.jpg" alt="Global Term Weights Function" /></a></p>
<p>The x-axis represents term frequency and the y-axis represents the local term weight. As you can see, the more times a term appears in a document, the more weight it is given. However, after reaching a threshold, the weight plateaus. I think this is especially important in situations where document creators might be trying to game the system. The formula used is:</p>
<p><em>ltw<sub>i</sub> = 1 / (1 + e<sup>0.0044 &#215; dlen</sup> &#215; 0.7<sup>(ft<sub>i</sub> &#8211; 1)</sup>)</em></p>
<p><em>ltw<sub>i</sub></em> is the local term weight for a given term in a document.</p>
<p><em>dlen</em> is the total number of terms in the document.</p>
<p><em>ft</em><em><sub>i</sub></em> is the number of times a given term appears in a document.</p>
<p>Note that the constants .0044 and .7 in the equation determine the shape of the curve. I chose those values based on their use in the <a href="http://lhncbc.nlm.nih.gov/lhc/docs/published/2001/pub2001045.pdf">MeSH system</a>. As explained there, the values should be tuned to your data set using an empirical approach. I played with the parameters some and the values they used seemed fine, so I adopted them.</p>
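<p>For readers who prefer code, the local term weight formula above translates directly into Python. The function and constant names are my own for this sketch; only the formula and the two constants come from the post.</p>

```python
import math

ALPHA = 0.0044  # document-length penalty (the .0044 constant above)
LAMBDA = 0.7    # how quickly repeated occurrences saturate (the .7 constant above)

def local_term_weight(term_count, doc_length):
    """ltw = 1 / (1 + e^(ALPHA * dlen) * LAMBDA^(ft - 1))"""
    return 1.0 / (1.0 + math.exp(ALPHA * doc_length) * LAMBDA ** (term_count - 1))

# In a 100-term document the weight rises with term count, then levels off near 1.
weights = [local_term_weight(count, 100) for count in (1, 2, 5, 10, 50)]
```

<p>Holding the term count fixed and increasing <code>doc_length</code> lowers the weight, which is the length normalization described earlier.</p>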
<p><strong>Calculate Global Term Weights</strong>. Global term weights are a measure of how important a word is for distinguishing documents within a collection. The more documents a word appears in, the less value it has for characterizing documents. For example, the term &#8216;USU&#8217; appears in every metadata record for resources in USU&#8217;s OpenCourseware. As a result, it has no value for characterizing or clustering resources. The term &#8216;USU&#8217; becomes, in a sense, a stop word, similar to those thrown away during the parse phase. To calculate the global term weight for a term, we count the number of documents that it appears in as well as the total number of documents.</p>
<p>To calculate global term weights OER Recommender uses a function that looks like this.</p>
<p><a title="Local Term Weights Function" href="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/ltw.jpg"><img src="http://www.joelduffin.com/blog/wp-content/uploads/2007/08/ltw.jpg" alt="Local Term Weights Function" /></a></p>
<p>The x-axis represents the number of documents that a term appears in. The y-axis represents the global weight assigned to the term. The formula used is:</p>
<p><em>gtw<sub>i</sub> = log (D/df<sub>i</sub>) </em></p>
<p><em>gtw<sub>i</sub></em> is the global term weight for a given term.</p>
<p><em>D</em> is the total number of documents.</p>
<p><em>df<sub>i</sub></em> is the number of documents that the term appears in.</p>
<p>See <a href="http://www.miislita.com/term-vector/term-vector-1.html">Mi Islita&#8217;s explanation of Term Vector Theory</a> for details.</p>
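<p>Here is a small Python sketch of the global term weight calculation, reusing the &#8216;USU&#8217; example. The function name and the toy documents are made up for illustration, and the natural logarithm is an assumption, since the formula does not state a base.</p>

```python
import math
from collections import Counter

def global_term_weights(documents):
    """Map each term to log(D / df), given a list of per-document term lists."""
    total_docs = len(documents)
    doc_freq = Counter()
    for terms in documents:
        # Count each term once per document, however often it repeats there.
        doc_freq.update(set(terms))
    return {term: math.log(total_docs / df) for term, df in doc_freq.items()}

docs = [["usu", "math"], ["usu", "physics"], ["usu", "math", "math"]]
weights = global_term_weights(docs)
# "usu" appears in every document, so its weight is log(3/3) = 0.
```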
<p><strong>Calculate Similarity Scores</strong>. Once local term weights and global term weights are calculated, we are ready to create recommendations. To create recommendations for a document we first find all documents that share any terms with it. It turns out that this can be a very large number of documents (e.g. 40,000 in our system). For each pairing of the document being considered and a document with matching terms we calculate a similarity score. Because calculating 40,000 scores can take a long time (about 15 seconds in our current system), we shorten the list of candidates to 200. We do this by sorting the pairs according to the number of overlapping terms. To calculate the similarity score for a pair of documents, we sum over all of the terms that they share in common. The contribution from each term is a combination of the local term weight in the first document, the local term weight in the second document and the global term weight. Because in OER Recommender each feed can be considered a separate collection, we calculate global term weights for each of the feeds and use them in calculating the similarity score.</p>
<p>To calculate the contribution of an individual term to the similarity score, OER Recommender uses:</p>
<p><em>sst<sub>i</sub> = (gtw<sub>1i</sub>)(ltw<sub>1i</sub>)</em><em>(gtw<sub>2i</sub>)</em><em>(ltw<sub>2i</sub>)</em></p>
<p><em>sst<sub>i</sub></em> is the contribution to the similarity score for a given term.</p>
<p><em>gtw<sub>1i</sub></em> is the global term weight from the feed that the first document is in.</p>
<p><em>ltw<sub>1i</sub></em> is the local term weight from the first document.</p>
<p><em>gtw<sub>2i</sub></em> is the global term weight from the feed that the second document is in.</p>
<p><em>ltw<sub>2i</sub></em> is the local term weight from the second document.</p>
<p>See <a href="http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html">NCBI&#8217;s explanation of how MeSH calculates Related Articles</a> for details. Once OER Recommender has calculated similarity scores for the 200 documents, it sorts the documents by those scores and stores the top 10 in a database.</p>
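<p>Putting the pieces together, the scoring step might look roughly like the Python below. This is a sketch under assumptions, not the production code: the dictionary layouts (<code>docs</code>, <code>feed_of</code>, <code>gtw_by_feed</code>) and function names are hypothetical, while the shortlist of 200, the top 10, and the per-term product come from the description above.</p>

```python
def similarity(ltw_a, gtw_a, ltw_b, gtw_b):
    """Sum gtw1 * ltw1 * gtw2 * ltw2 over the terms two documents share."""
    shared = ltw_a.keys() & ltw_b.keys()
    return sum(gtw_a[t] * ltw_a[t] * gtw_b[t] * ltw_b[t] for t in shared)

def recommend(doc_id, docs, feed_of, gtw_by_feed, shortlist=200, keep=10):
    """docs: {doc_id: {term: ltw}}, feed_of: {doc_id: feed}, gtw_by_feed: {feed: {term: gtw}}."""
    target = docs[doc_id]
    # Candidates are all other documents sharing at least one term with the target.
    overlaps = {
        other: len(target.keys() & terms.keys())
        for other, terms in docs.items()
        if other != doc_id and target.keys() & terms.keys()
    }
    # Cheap pre-filter: keep only the candidates with the most overlapping terms.
    candidates = sorted(overlaps, key=overlaps.get, reverse=True)[:shortlist]
    target_gtw = gtw_by_feed[feed_of[doc_id]]
    scored = [
        (other, similarity(target, target_gtw, docs[other], gtw_by_feed[feed_of[other]]))
        for other in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]

docs = {
    "a": {"algebra": 0.5, "usu": 0.4},
    "b": {"algebra": 0.6, "usu": 0.4},
    "c": {"usu": 0.4},
    "d": {"poetry": 0.5},
}
feed_of = {"a": "f", "b": "f", "c": "f", "d": "f"}
gtw_by_feed = {"f": {"algebra": 0.7, "usu": 0.0, "poetry": 1.4}}
recommendations = recommend("a", docs, feed_of, gtw_by_feed)
```

<p>In this toy data &#8216;usu&#8217; has a global weight of zero, so document &#8220;c&#8221; is a candidate but contributes nothing to the score, while &#8220;d&#8221; shares no terms and is never considered.</p>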
<p><strong>Displaying Recommendations</strong>. Now that the recommendations are stored in a database, displaying them to the user is straightforward. We have created a <a href="https://addons.mozilla.org/en-US/firefox/addon/748">Greasemonkey script</a> that requests recommendations for each web page that a user browses. If OER Recommender returns any, the script inserts HTML for the recommendations into the web page. The eduCommons team is building a Plone tool so that anyone running eduCommons can turn on recommendations. That way users won&#8217;t have to install the Greasemonkey script in order to see recommendations on the eduCommons website. The OER Recommender site publishes the XML format it returns so others could add recommendations into their websites.</p>
<p><strong>Getting Resources into the Recommender</strong>. Currently OER Recommender only provides recommendations for URLs that it has metadata for, so in order to get recommendations, you need to register a metadata feed. If you have an OCW site that provides an RSS feed, you can do that at <a href="http://www.ocwfinder.org/">OCW Finder</a>. If you have an OAI feed or an RSS feed for learning objects or other OERs, you can add your metadata by sending a feed title, URL, and display URL (home page of the repository) to oerrecommender AT cosl DOT usu DOT edu.</p>
<p><strong>Future Directions</strong>. There are many things left to be done: (1) adapt recommendations by monitoring which recommended resources people click on and how long they stay at recommended resources, (2) provide links to more recommendations (so users can see more than the default 5), (3) provide a way for people to indicate when a recommendation doesn&#8217;t work, (4) provide a way for people to register and log in, so they can receive personalized recommendations (perhaps via email), (5) create a process whereby recommendations can be updated without going through all the documents in the database, (6) add functionality for retrieving recommendations for arbitrary web pages (recommending on demand), perhaps via a bookmarklet, (7) create whatever Greasemonkey-like add-on is supported on IE. We also plan to integrate the recommender with the aggregator and feed reader to provide recommended news items based on the attention people have previously paid to news articles.</p>
<p>Wow! I realize that this is a long post, but it is not a simple topic. Hopefully the explanation is useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2007/08/23/oer-recommender-released/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Scaling Rails (Debugging Ozmozr)</title>
		<link>http://www.joelduffin.com/blog/2007/08/23/scaling-rails-debugging-ozmozr/</link>
		<comments>http://www.joelduffin.com/blog/2007/08/23/scaling-rails-debugging-ozmozr/#comments</comments>
		<pubDate>Thu, 23 Aug 2007 22:28:02 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[web development]]></category>
		<category><![CDATA[aggregator]]></category>
		<category><![CDATA[cosl]]></category>
		<category><![CDATA[rails]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/?p=11</guid>
		<description><![CDATA[Justin and I have been debating whether or not we really believe that Rails can scale. As we talked about this issue, we realized that ozmozr is probably a good test case. We stopped working on ozmozr months ago, realizing that it needed additional work. We needed to move on to other projects we had [...]]]></description>
			<content:encoded><![CDATA[<p>Justin and I have been debating whether or not we really believe that Rails can scale. As we talked about this issue, we realized that <a href="http://www.ozmozr.com/">ozmozr</a> is probably a good test case. We stopped working on ozmozr months ago, knowing that it needed additional work, because we had to move on to other projects we had committed to. Since then, ozmozr&#8217;s uptime has steadily declined. I suspected that the problem was in the ugly queries that we were throwing its way, so today I checked. It turns out that so far our problems have nothing to do with Rails. They come from the amount of data we are working over, poorly designed queries, and Java daemon processes that are stealing all of the CPU.</p>
<p><strong>The database</strong>. I&#8217;ve checked the database and found that we have nearly 2 million RSS entries and about 350 thousand unique tags. We&#8217;ve indexed the entries so that we can access recent ones quickly, but we haven&#8217;t indexed them for fast searching by tag.</p>
<p><strong>Search was dog slow</strong>. Ozmozr&#8217;s search, which is visible from most pages, takes the entered terms and treats them as tags with which to search entries. Joining through the massive tag table to the even more massive entries table was taking forever (on the order of 45-60 seconds per query). We temporarily disabled entry searching until I poked around and found that we hadn&#8217;t created an index on tag names. Just putting an index on tag names reduced the query time to around 8 seconds. Much shorter, but still too long. So I reduced the complexity of the query. It may not be as powerful as it used to be, but it now executes in under two seconds.</p>
<p><strong>Aggregator&#8217;s 20 Java threads talking to postgres were stealing the CPU</strong>. We use a Java aggregator daemon that I wrote to harvest RSS feeds. We haven&#8217;t harvested in a while because of the problems we have experienced. When I fired it up today, it brought the site to its knees as soon as it started doing its work. By monitoring the CPU I saw that the aggregator daemon was stealing all of the CPU, leaving none for Rails to use to serve pages. By default the aggregator checks feeds every hour, and when doing this it can use up to 20 threads. Each thread creates a connection to postgres. It was actually postgres that was stealing all of the CPU, but only because of how the Java daemon was talking to it. My quick fix is to limit the aggregator to 1 thread. I initially designed the aggregator to use many threads because most of the time spent in harvesting is used up waiting for web servers to respond. I guess I need to figure out another approach. I&#8217;m going to look to see if we can throttle how much CPU the Java threads (and corresponding Postgres processes) use.</p>
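<p>One possible direction, sketched below and not something we have built: keep many threads for the slow network waits, but let only a couple of them talk to the database at any one time. This is a minimal sketch using a <code>java.util.concurrent.Semaphore</code>. The class name <code>ThrottledHarvester</code>, the <code>fetch</code> and <code>store</code> callbacks, and the limits of 20 and 2 are all hypothetical stand-ins, not the aggregator&#8217;s real code. It caps concurrent database work rather than CPU usage directly.</p>

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: many fetch threads, few concurrent database writers.
class ThrottledHarvester {
    private final int fetchThreads;
    private final Semaphore dbPermits;
    private final AtomicInteger activeStores = new AtomicInteger();
    private final AtomicInteger peakStores = new AtomicInteger();

    ThrottledHarvester(int fetchThreads, int maxDbWriters) {
        this.fetchThreads = fetchThreads;
        this.dbPermits = new Semaphore(maxDbWriters);
    }

    // Fetches every feed in parallel, but lets only maxDbWriters threads store at once.
    void harvest(List<String> feedUrls,
                 Function<String, String> fetch,
                 Consumer<String> store) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(fetchThreads);
        for (String url : feedUrls) {
            pool.submit(() -> {
                String body = fetch.apply(url);      // slow network wait, no database involved
                dbPermits.acquireUninterruptibly();  // wait for a free database slot
                try {
                    int now = activeStores.incrementAndGet();
                    peakStores.accumulateAndGet(now, Math::max);
                    store.accept(body);              // the only step that touches the database
                } finally {
                    activeStores.decrementAndGet();
                    dbPermits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Highest number of store calls that ever ran at the same time.
    int maxConcurrentStores() {
        return peakStores.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ThrottledHarvester harvester = new ThrottledHarvester(20, 2);
        List<String> feeds = List.of("feed-a", "feed-b", "feed-c", "feed-d");
        harvester.harvest(feeds,
            url -> "contents of " + url,                    // stand-in for an HTTP fetch
            body -> System.out.println("stored " + body));  // stand-in for a Postgres insert
        System.out.println("peak concurrent stores: " + harvester.maxConcurrentStores());
    }
}
```

<p>In a real daemon the <code>store</code> step would borrow a connection from a small pool instead of opening one per thread, so the number of Postgres backends stays at the writer limit.</p>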
<p><strong>Shrinking tag clouds are a fun idea but a pig to implement</strong>. Similar to how <a href="http://www.ocwfinder.org/">OCW Finder</a> allows you to filter your browsing by selecting tags, we initially implemented the same idea in oz via a shrinking tag cloud. The idea was that as soon as you clicked on a tag in a cloud, in addition to the items being filtered, the tags in the resulting tag cloud would be filtered as well (shrinking the cloud). It turns out that this kind of query is horrendously expensive, at least the way we implemented it. It also turns out that users don&#8217;t seem to understand what is going on. As a result, we ripped the shrinking tag clouds out&#8230; almost. I just found one instance that had been left in. It is gone now.</p>
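<p>To make the shrinking cloud concrete, here is a small in-memory sketch. It is not our SQL implementation, and the class name <code>ShrinkingTagCloud</code> and the sample tags are made up for illustration. Each entry is just its set of tags. Every click has to re-filter the entries and then re-count the tags on whatever is left, which hints at why this gets expensive over millions of entries.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory illustration of a shrinking tag cloud.
class ShrinkingTagCloud {
    // Returns the entries that carry every selected tag.
    static List<Set<String>> filter(List<Set<String>> entries, Set<String> selected) {
        List<Set<String>> matches = new ArrayList<>();
        for (Set<String> tags : entries) {
            if (tags.containsAll(selected)) {
                matches.add(tags);
            }
        }
        return matches;
    }

    // Counts the tags left on the filtered entries, leaving out the selected tags themselves.
    static Map<String, Integer> cloud(List<Set<String>> entries, Set<String> selected) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> tags : filter(entries, selected)) {
            for (String tag : tags) {
                if (!selected.contains(tag)) {
                    counts.merge(tag, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> entries = List.of(
            Set.of("rails", "scaling"),
            Set.of("rails", "postgres"),
            Set.of("java", "postgres"));
        // Full cloud, nothing selected yet.
        System.out.println(cloud(entries, Set.of()));        // {java=1, postgres=2, rails=2, scaling=1}
        // After clicking "rails" the cloud shrinks to tags that co-occur with it.
        System.out.println(cloud(entries, Set.of("rails"))); // {postgres=1, scaling=1}
    }
}
```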
<p>There are more ugly queries to look at, but I&#8217;ve gotten most queries down to under 4 seconds. That is still a long time, but at least the website doesn&#8217;t die. I&#8217;ll write more as I learn more.</p>
<p>As a note, I was able to easily identify the nasty queries by going into the postgres config file (/var/lib/pgsql/data/postgresql.conf) and setting the option log_min_duration_statement = 3000 (milliseconds). With this option set, every query that takes three seconds or longer is written to the postgres log file.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2007/08/23/scaling-rails-debugging-ozmozr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>D-Lib Math Tools DL article</title>
		<link>http://www.joelduffin.com/blog/2004/03/01/d-lib-math-tools-dl-article/</link>
		<comments>http://www.joelduffin.com/blog/2004/03/01/d-lib-math-tools-dl-article/#comments</comments>
		<pubDate>Mon, 01 Mar 2004 21:20:52 +0000</pubDate>
		<dc:creator>joel</dc:creator>
				<category><![CDATA[authoring tools]]></category>
		<category><![CDATA[interactive online math]]></category>
		<category><![CDATA[math education]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[enlvm]]></category>

		<guid isPermaLink="false">http://www.joelduffin.com/blog/2004/03/01/d-lib-math-tools-dl-article/</guid>
		<description><![CDATA[I just read an article by SRI researchers that reports on a user study of the Math Tools DL. As one of the participants in the study I was interested to see what they had to say. The basic structure of the report was to: (a) summarize the results, (b) propose representative personas, and (c) [...]]]></description>
			<content:encoded><![CDATA[<p>I just read <a href="http://web.archive.org/web/20041101193359/http://www.dlib.org/dlib/february04/shechtman/02shechtman.html">an article</a> by SRI researchers that reports on a user study of the <a href="http://www.mathforum.org/mathtools/">Math Tools DL</a>. As one of the participants in the study, I was interested to see what they had to say. The basic structure of the report was to: (a) summarize the results, (b) propose representative personas, and (c) propose a metaphor and a set of design principles. One thing that was not really touched on, and that I hope MTDL can become, is a place for people to <strong>DO</strong> stuff, not just find and talk about stuff. I&#8217;m working on a proposal for adding functionality to MTDL for using TADRIOLA to adapt existing lessons, activities, and mathlets and then sharing the derived works.</p>
<p><a name="more" href="http://web.archive.org/web/20041101193359/http://www.reusability.org/blogs/joel/archives/000478.html"></a><strong>Summary</strong></p>
<ul>
<li><em>Searching and Publishing</em> &#8211; People come to the MTDL to find and share resources</li>
<li><em>Overcoming Isolation</em> &#8211; People come to the MTDL to help overcome isolation</li>
<li><em>Discuss Development</em> &#8211; Talk with others about the development of math software</li>
</ul>
<p><strong>Personas</strong></p>
<ul>
<li>Teacher Developer</li>
<li>Professional Developer</li>
<li>Educational Researcher</li>
<li>Inexperienced Developer</li>
<li>Hobbyist Developer</li>
</ul>
<p><strong>Metaphor and Principles</strong></p>
<ul>
<li>Workshop metaphor</li>
<li>Design for multiple roles</li>
<li>Design for multiple levels of expertise</li>
<li>Provide activity indicators</li>
</ul>
<p>I like the workshop metaphor, though I think there may be better ones. I can&#8217;t really discern the implications of designing for different roles and different levels of expertise. I like the idea of activity indicators. I realize that this is an area of recent interest throughout the field.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joelduffin.com/blog/2004/03/01/d-lib-math-tools-dl-article/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

