Justin and I have been debating whether or not we really believe that Rails can scale. As we talked about this issue, we realized that ozmozr is probably a good test case. We stopped working on ozmozr months ago, realizing that it needed additional work. We needed to move on to other projects we had committed to. Over time, ozmozr’s up time has been less and less. I suspected that the problem was in the ugly queries that we were throwing its way. Well, today I checked and here is what I found. It turns out that so far our problems have nothing to do with Rails, rather they have to do with the amount of data we are working over, poorly designed queries, and Java daemon processes that are stealing all of the CPU.
The database. I’ve checked the database and found that we have nearly 2 million rss entries and about 350 thousand unique tags. We’ve indexed the entries so that we can access recent ones quickly. We haven’t indexed the entries for fast searching by tag.
Search was dog slow. Ozmozr’s search which is visible from most pages, takes entered terms and treats them as tags with which to search entries. Joining through the massive tag table to the even more massive entries table was taking forever (on the order of 45-60 seconds per query). We temporarily disabled entry searching until I poked around and found that we hadn’t created an index on tag names. Just putting an index on tag names reduced the query time to around 8 seconds. Much shorter, but still too long. So I reduced the complexity of the query. It may not be as powerful as it used to be, but now executes in under two seconds.
Aggregator’s 20 Java threads talking to postgres was stealing the CPU. We use a Java aggregator daemon that I wrote to harvest RSS feeds. We haven’t harvested in a while because of the problems we have experienced. When I fired it up today I found that it brought the site to its knees when it started doing its work. By monitoring the CPU I saw that the aggregator daemon was stealing all of the CPU leaving none for Rails to use to serve pages. By default the aggregator checks feeds every hour. When doing this it can use up to 20 threads. Each thread creates a connection to postgres. It was actually postgres that was stealing all of the CPU, but only because of how the Java daemon was talking to it. My quick fix is to limit the aggregator to 1 thread. I initially designed the aggregator to use many threads because it turns out that most of the time spent in harvesting is used up in waiting for web servers to respond. I guess I need to figure out another approach. I’m going to look to see if we can throttle how much CPU the Java threads (and corresponding Postgres processes) use.
Shrinking tag clouds are a fun idea but a pig to implement. Similar to how OCW Finder allows you to filter your browsing by selecting tags, we initially implemented the same idea in oz via a shrinking tag cloud. The idea was that as soon as you clicked on a tag in a cloud, in addition to the items being filtered, the tags in the resulting tag cloud would be filtered as well (shrinking the cloud). It turns out that this kind of query is horrendously expensive, at least the way we implemented it. It also turns out that users don’t seem to understand what is going on. As a result of this, we ripped the shrinking tag clouds out… almost. I just found an instance where it was left in. They are gone now.
There are more ugly queries to look at but I’ve gotten most queries down to under 4 seconds, which is still a long time, but at least the website doesn’t die. I’ll write more as I find it out.
As a note, the way I was able to easily identify the nasty queries was by going into the postgres config file (var/lib/pgsql/data/postgresql.conf) and setting the option log_min_duration_statement = 3000 (milliseconds). With this option set, every query that takes longer than three seconds is written to the postgres log file.
Posted on August 23rd, 2007 by joel
Filed under: web development
Leave a Reply