
Solving aggregation problems

In Folksemantic, we run into the following problems:

  • Duplicate entries. Search and recommendation results that list multiple entries for the same resource.
  • Catalog pages. Search and recommendation results that link to catalog pages for resources (people would rather go directly to the resource, but the metadata providers want people to go to their catalog entry for the resource).
  • Dead links. Results that link to resources that no longer exist.
  • Urls without metadata. When someone shares a resource or inserts the recommender widget in a page for which we don’t have metadata, we need to be able to generate metadata.

Duplicate entries show up because:

  • Two feeds specify entries with the same permalink.
  • The same feed gets added twice (maybe different formats for the same feed, e.g. RSS and Atom).
  • Multiple catalogs provide metadata for the same resource.

Dealing with Duplicate Feeds

Problem: In Folksemantic a user can enter the url of their blog and we will detect the feeds on the page and add them. We use the feeds to generate personal recommendations. The problem is that a blog typically has 3 or more feeds, all of which contain the same content, just provided in different formats (e.g. RSS, Atom). We really don’t want to harvest and index all of those feeds.

Solution 1: One approach is to try to detect the duplicate feed the first time we harvest it: don’t add its entries to the index, and flag the feed as “duplicate” so that we don’t harvest it again, storing in the feed the id of the feed it duplicates. One potential problem with this: suppose someone registers a feed that contains just the entries tagged a certain way (e.g. all of the entries tagged apple on the gizmodo feed) and the main feed is already registered. All of the entries on the filtered feed duplicate entries in the main feed, so the entries are duplicates, but the feed itself is not. And if we want to use that feed as a basis for making recommendations to the user, we want to use the filtered feed, not the main feed.
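A minimal sketch of how Solution 1 might work, in Python. The feed and entry structures, the index API, and the overlap threshold are assumptions for illustration, not existing Folksemantic code.

# Sketch of Solution 1: detect a duplicate feed at first harvest by comparing its
# entry permalinks with feeds we have already indexed.

def find_duplicate_feed(new_entries, indexed_feeds, overlap_threshold=1.0):
    """Return the id of an already-indexed feed that contains all of the new
    feed's permalinks, or None if there is no such feed."""
    new_permalinks = {entry["permalink"] for entry in new_entries}
    for feed_id, permalinks in indexed_feeds.items():
        overlap = len(new_permalinks & permalinks) / max(len(new_permalinks), 1)
        if overlap >= overlap_threshold:
            return feed_id
    return None

def harvest_feed(feed, new_entries, indexed_feeds, index):
    duplicate_of = find_duplicate_feed(new_entries, indexed_feeds)
    if duplicate_of is not None:
        # Flag the feed so it is skipped on future harvests, and remember which
        # feed it duplicates. Note: a filtered feed (a subset of a main feed)
        # would also be caught here, which is the caveat described above.
        feed["duplicate"] = True
        feed["duplicate_of"] = duplicate_of
        return
    for entry in new_entries:
        index.add(entry)  # hypothetical index API
    indexed_feeds[feed["id"]] = {e["permalink"] for e in new_entries}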

Solution 2: Another approach is to just add the feed and harvest it, but flag its entries as duplicates. Our thought is to store in each entry a list of all of the feeds that the entry belongs to. We need to verify that this won’t slow down our Lucene queries.
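A sketch of how Solution 2’s bookkeeping might look, assuming entries carry a multi-valued feed_ids field (whether such a field slows down Lucene queries still needs to be verified):

# Sketch of Solution 2: harvest every feed, but keep a single entry record that
# lists all of the feeds it belongs to. Field names are illustrative assumptions.

def upsert_entry(entries_by_permalink, entry, feed_id):
    """Add a new entry, or attach another feed id if the entry already exists."""
    existing = entries_by_permalink.get(entry["permalink"])
    if existing is None:
        entry["feed_ids"] = [feed_id]
        entries_by_permalink[entry["permalink"]] = entry
        return entry
    if feed_id not in existing["feed_ids"]:
        existing["feed_ids"].append(feed_id)  # the entry is a duplicate within this feed
    return existing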

It seems that Solution 2 may be best, leaving it up to the app to avoid adding duplicate feeds (like the 4 feeds for the same blog in the Folksemantic case).

Dealing with Catalog Entries

A number of NSDL and other projects, such as OER Commons, have created large catalogs of online resources. Sometimes their metadata is harvested directly from the resource websites. Sometimes they enhance that metadata with new information. Sometimes they create metadata for resources that don’t provide their own. The catalog websites often provide services such as rating and discussion, and they want people to come to their websites and use those services. While these services are nice, when people are searching for resources they likely want to look at the resource first and make their own judgement (if that is possible), and then read more about it if they are interested. I think this is because the cost of looking at an online resource is minimal (compared to buying something or attending a course, for example). So the catalog issue leads to several problems:

Problem: When people see search results, they likely want to go directly to the resource instead of to a catalog page.

Solution: When a catalog page is the only entry for a resource, that entry is flagged “primary”. As soon as we create an entry that goes directly to the resource, we flag that new entry primary, and the catalog entry as not primary; we also store the id of the catalog entry in the list of duplicate entries that we store in the new entry. When searching, by default return only primary entries unless the application explicitly requests all entries. Return a flag indicating that an entry has catalog entries. Provide an API for requesting catalog entries for a specific entry.
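A rough sketch of that bookkeeping; the field names (primary, duplicate_entry_ids, has_catalog_entries) are assumptions used only for illustration:

# Sketch: when a direct entry is created for a resource that previously had only
# a catalog entry, swap the "primary" flag and record the relationship.

def promote_direct_entry(direct_entry, catalog_entry):
    catalog_entry["primary"] = False
    direct_entry["primary"] = True
    direct_entry.setdefault("duplicate_entry_ids", []).append(catalog_entry["id"])
    direct_entry["has_catalog_entries"] = True  # surfaced to applications with results

def search_results(matching_entries, include_all=False):
    """By default return only primary entries; applications may ask for all."""
    if include_all:
        return matching_entries
    return [e for e in matching_entries if e.get("primary")]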

Problem: In most cases, catalog metadata does not provide the url of the resource it is cataloging.

Solution: Initially flag the entry as “primary” so it will show up in search results. Later, asynchronously crawl the catalog pages to find the url of the catalogued resource. Once the direct url is known, create a new entry for the resource and store the id of the catalog entry in the list of “related entries” that we store for the new entry. Flag the catalog entry as not primary and the new entry as primary. Copy the metadata from the catalog entry into the new entry. Use the resource domain as the key for the feed to add the new entry to. If the feed does not already exist, create one for it.
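A sketch of the asynchronous crawl step described above; extract_resource_url() stands in for a hypothetical scraper, and the field and id conventions are assumptions:

# Sketch: crawl a catalog page for the catalogued resource's url, create a new
# primary entry that points directly at the resource, and file it under a feed
# keyed by the resource's domain.
from urllib.parse import urlparse

def process_catalog_entry(catalog_entry, feeds_by_domain, entries, extract_resource_url):
    resource_url = extract_resource_url(catalog_entry["permalink"])  # hypothetical crawler
    if resource_url is None:
        return None  # leave the catalog entry flagged primary for now

    domain = urlparse(resource_url).netloc
    feed = feeds_by_domain.get(domain)
    if feed is None:
        # No feed exists for this domain yet, so create one keyed by the domain.
        feed = {"id": "domain:" + domain, "display_url": domain}
        feeds_by_domain[domain] = feed

    new_entry = {
        "id": "entry:" + resource_url,                # illustrative id scheme
        "permalink": resource_url,
        "metadata": dict(catalog_entry["metadata"]),  # copy the catalog metadata
        "related_entry_ids": [catalog_entry["id"]],
        "primary": True,
        "feed_id": feed["id"],
    }
    catalog_entry["primary"] = False
    entries.append(new_entry)
    return new_entry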

Problem: If there are multiple entries (catalog entries, etc.) for a resource, which metadata should we use to calculate recommendations for the resource?

Solution: Options might be: (a) the metadata provided by the resource, (b) metadata generated by a crawl of the resource – I think this is bad because the metadata is frequently more descriptive than the page itself, (c) the first catalog entry found for the resource, or (d) the largest set of metadata for the resource. My thought is to always use the largest set of metadata for the resource unless there are no catalog entries (as in the case where we crawl a website), in which case we must use the metadata generated by the crawl. To facilitate this approach: (1) for each entry, we store whether or not the metadata came from the resource itself; (2) whenever we detect a new catalog entry for a resource that already has an entry, we check whether the metadata in the existing entry was copied from a catalog entry; if it was, we compare the size of the metadata from the two entries and update the entry with the new catalog metadata if it is larger. For the purpose of calculating recommendations it might make sense to use all of the metadata from all of the sources.
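A sketch of the “largest set of metadata wins” rule; measuring size as the total character count of the metadata values is an assumption:

# Sketch: keep whichever catalog metadata is largest, but never overwrite
# metadata that came from the resource itself.

def metadata_size(metadata):
    return sum(len(str(value)) for value in metadata.values())

def maybe_update_metadata(entry, new_catalog_metadata):
    """Replace the entry's metadata with a newly found catalog record's metadata
    only if the existing metadata was itself copied from a catalog entry and the
    new record is larger."""
    if entry.get("metadata_from_resource"):
        return entry
    if metadata_size(new_catalog_metadata) > metadata_size(entry["metadata"]):
        entry["metadata"] = dict(new_catalog_metadata)
    return entry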

Problem: When a website requests recommendations for a url, normally we want to return non-catalog entries; but when a catalog requests recommendations for one of its urls, it likely wants its own catalog entries back if they exist.

Solution: When generating recommendations for a catalog, check whether the recommended entries have catalog entries from that catalog, and recommend those catalog entries instead.
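A sketch of that swap; the recommender object and the catalog_entries field are assumptions:

# Sketch: when a catalog asks for recommendations, substitute its own catalog
# entries wherever a recommended resource has one.

def recommendations_for(entry, recommender, requesting_catalog_id=None):
    recommended = recommender.recommend(entry)  # hypothetical recommender
    if requesting_catalog_id is None:
        return recommended
    swapped = []
    for rec in recommended:
        own_versions = [c for c in rec.get("catalog_entries", [])
                        if c["catalog_id"] == requesting_catalog_id]
        swapped.append(own_versions[0] if own_versions else rec)
    return swapped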

Detecting and Handling Feed Entry Deletions

Problem: OAI has a way to tell you that an entry has been deleted, but RSS does not. How can you detect when an entry has been deleted, and what should you do when it is deleted?

Solution: My thought is that this is just part of what our dead link handler does. It finds entries with dead links and flags them as deleted (or actually deletes them). When we re-index, we remove items from the index that have been flagged deleted.

Dealing with Dead Links

Problem: Many times the resources in our indexes get taken down or moved without notification (the source of the metadata doesn’t get updated, or at least not for a while). What should we do in that situation?

Solution: We will write a bot that flags entries as dead. Once entries are dead they won’t show up in search or recommendation results. Should they still be used as the basis for recommendations? Probably not. Maybe we create another process that looks for the new location of dead entries?
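A sketch of what such a bot might look like; the status-code policy, timeout, and the deleted flag are assumptions:

# Sketch of the dead-link bot: re-check entry permalinks and flag entries whose
# resources no longer respond.
import requests

def check_links(entries, timeout=10):
    for entry in entries:
        try:
            response = requests.head(entry["permalink"],
                                     allow_redirects=True, timeout=timeout)
            dead = response.status_code >= 400
        except requests.RequestException:
            dead = True
        if dead:
            # Flagged entries drop out of search and recommendation results and
            # are removed from the index on the next re-index.
            entry["deleted"] = True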

Generating Metadata for a URL

Problem: When someone adds an entry but doesn’t provide metadata, we need to be able to generate metadata for the entry. We also need to know which feed to put it into.

Solution: The application should give us a feed id, or a display url for the feed, along with the entry url. If it does not send a feed id, we will look for a feed using the host portion of the entry permalink. If one does not exist, we will create one and use the host as the display url for the feed; that way future entries for that site will always go into that feed.
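A sketch of that lookup, assuming feeds are stored keyed both by id and by host:

# Sketch: prefer an explicit feed id; otherwise find or create a feed keyed by
# the host portion of the entry's permalink.
from urllib.parse import urlparse

def feed_for_entry(entry_url, feed_id, feeds_by_id, feeds_by_host):
    if feed_id is not None and feed_id in feeds_by_id:
        return feeds_by_id[feed_id]
    host = urlparse(entry_url).netloc
    feed = feeds_by_host.get(host)
    if feed is None:
        # Create a host-keyed feed so future entries for this site land in it too.
        feed = {"id": "host:" + host, "display_url": host}
        feeds_by_id[feed["id"]] = feed
        feeds_by_host[host] = feed
    return feed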

Generating Recommendations for Web Pages We Haven’t Indexed Yet

This is similar to the previous issue: we want people to be able to add the OER Recommender widget to their pages and have it just start working, removing the requirement that they add their resources to our index before we can provide recommendations. We can analyze and provide recommendations in real time, but that tends to bury our server if it gets a bunch of requests for real-time recommendations all at once.

Problem: Provide recommendations for URLs that haven’t been indexed yet.

Solution: When the recommendation is requested, add an entry for the URL, flagging it as needing to be scraped. Flag the feed as not recommendable. If we don’t have a domain feed for the URL, add a domain feed and specify it as the feed for the entry. Queue the feed for approval by site admins.
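A sketch of that flow; the flags and the approval queue are assumptions for illustration:

# Sketch: handle a recommendation request for a url we haven't indexed yet by
# creating a placeholder entry to scrape later, attached to a not-yet-recommendable
# domain feed that is queued for admin approval.
from urllib.parse import urlparse

def handle_unknown_url(url, feeds_by_domain, entries, approval_queue):
    domain = urlparse(url).netloc
    feed = feeds_by_domain.get(domain)
    if feed is None:
        feed = {"id": "domain:" + domain, "display_url": domain, "recommendable": False}
        feeds_by_domain[domain] = feed
        approval_queue.append(feed["id"])  # site admins approve the feed later
    entry = {"permalink": url, "feed_id": feed["id"], "needs_scrape": True}
    entries.append(entry)
    return entry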

This brings up the issue of being able to narrow the scope of the space into which recommendations are made. Depending on the context, we want to consider different sets of items to recommend. For example, in Folksemantic, for personal recommendations we let users add feeds they produce, but we don’t necessarily want to include their stuff in recommendations that we make to other people.

Problem: Narrow the scope of the space that we recommend items from.

Solution: Define recommendation tasks by specifying the aggregation of feeds that we are recommending from and the aggregation of feeds that we are recommending into. Store those ids in the recommendation table.
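A sketch of how a recommendation task might be represented; the names are assumptions:

# Sketch: a recommendation task scopes recommendations by two feed aggregations,
# the feeds we recommend items from and the feeds we recommend into.

def build_recommendation_task(task_id, from_feed_ids, into_feed_ids):
    return {"id": task_id,
            "from_feed_ids": set(from_feed_ids),   # feeds items may be recommended from
            "into_feed_ids": set(into_feed_ids)}   # feeds whose entries get recommendations

def candidates_for(task, entries):
    """Only entries belonging to the task's source feeds are eligible recommendations."""
    return [e for e in entries
            if task["from_feed_ids"] & set(e.get("feed_ids", []))]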
