Using Apache logs to track 500m pageviews a month

So here's the deal. GateHouse Media is becoming a large newspaper/technology company. We have a few hundred web sites, and accurate stats on those sites are invaluable.

When I heard that accredited stat-tracking companies like HitBox and Omniture cost hundreds of thousands of dollars a year, I thought: holy geez, can I help out with this expense?

I did some exploration on the subject, and since they didn't [immediately] go along with it, here are my preliminary findings.

For the sake of simplicity, let's say that I only need to gather IP addresses and request URLs.

Oh, and I need the ability to analyze 500 million pageviews a month.

The Plan

We want maximum accuracy using a minimum of resources here.

This plan uses Apache request logging to gather & store information in the least complex, obtrusive, memory-intensive or CPU-intensive way I can think of.

Capture data using JavaScript. Build a fake request URL pointing at our statistics server (beacon-style), and load it into an <img> tag:

<img src="http://statdomain.com/images/blank.gif?page=url_of_this_page&other=more_info_we_can_track"/>
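One detail the tag glosses over is that the page URL has to be percent-encoded before it goes into the query string, or its own `?` and `&` characters will get mixed up with the beacon's parameters. Here is a minimal sketch of that encoding step, written in Python rather than the JavaScript that would actually run in the page. The `beacon_url` helper and the example page address are made up for illustration; the host and pixel path come from the tag above.

```python
from urllib.parse import urlencode

# Stats host and pixel path as used in the example <img> tag.
BEACON_BASE = "http://statdomain.com/images/blank.gif"


def beacon_url(page, **extra):
    """Build the beacon URL with every parameter value percent-encoded.

    Hypothetical helper, for illustration only.
    """
    params = {"page": page, **extra}
    return BEACON_BASE + "?" + urlencode(params)


# The page URL carries its own query string, so it must be encoded.
print(beacon_url("http://example.com/news/?id=7&sort=new", other="more info"))
```

In the browser, `encodeURIComponent` does the same job for each value.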

The visitor's browser will make the request, and our server will log it. Here's the kind of log entry you'll get:

10.1.1.105 - - [03/Jun/2008:10:32:59 -0400] "GET /lorem/ipsum/dolor/what/the/crap/flavia/ex/arbore/cadit/rit/try/this/out/ HTTP/1.1" 404 403
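Getting the IP address and request URL back out of a line like that takes one regular expression. What follows is a rough sketch, not a finished analyzer: the `parse_hit` name is made up, and the pattern only covers the leading fields that Apache's "common" and "combined" formats share.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Leading fields shared by Apache's "common" and "combined" log formats.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_hit(line):
    """Return (ip, path, params) for one access_log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    parts = urlsplit(m.group("url"))
    # parse_qs returns a list of values per key; keep the first one.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return m.group("ip"), parts.path, params


sample = ('10.1.1.105 - - [03/Jun/2008:10:32:59 -0400] '
          '"GET /lorem/ipsum/dolor/what/the/crap/flavia/ex/arbore/cadit/rit/try/this/out/ HTTP/1.1" 404 403')
print(parse_hit(sample))
```

For a beacon request, the tracked values (`page`, `other`, and so on) show up in the `params` dictionary, already percent-decoded.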

Configuration

I set up a prototype to fire test requests at, and I learned a few things about tweaking Apache along the way.

Because each pageview sends this server just one request (instead of the 5 to 30 that a normal web page triggers), and because those requests will be coming in at an astronomical rate, I want to make sure that connection turnover is as high as I can get it.

_disable KeepAlive so we aren't maintaining connections uselessly_

KeepAlive Off

_raise the number of connections (MaxClients) to prepare for the flurry of connections we want to accept; with prefork, ServerLimit has to be at least as high, or Apache caps MaxClients at ServerLimit_

<IfModule prefork.c>
...
MaxClients         1600
...
</IfModule>

_establish a combined access_log (usually this is already there, and just needs to be uncommented)_

CustomLog logs/access_log combined

_comment out other logs (you don't need gigs of '404 page not found' errors)_

# ErrorLog logs/error_log

Prototype Testing

At 500m pageviews a month, I projected that this server would be drummed with more than *30,000* hits a minute during peak times.
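As a back-of-the-envelope check on that figure, here is how the monthly total compares to the projected peak. The 30-day month is an assumption made for this sketch, and the arithmetic only shows how the two numbers relate; it is not the method used to arrive at the projection.

```python
MONTHLY_PAGEVIEWS = 500_000_000
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: a 30-day month

average_per_minute = MONTHLY_PAGEVIEWS / MINUTES_PER_MONTH

peak_per_minute = 30_000           # projected peak from the text
peak_per_second = peak_per_minute / 60
peak_to_average = peak_per_minute / average_per_minute

print(f"average: {average_per_minute:,.0f} hits/minute")
print(f"peak:    {peak_per_minute:,} hits/minute ({peak_per_second:.0f} per second)")
print(f"peak is {peak_to_average:.1f}x the average")
```

Spread evenly, 500 million pageviews works out to roughly 11,600 hits a minute, so a 30,000-a-minute peak assumes the busiest minutes run at about 2.6 times the average, or 500 requests a second.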

To replicate that load, I installed JMeter on 4 computers here and set up a test that would pound the test IP for 10 minutes, with 20,000 users on each computer requesting a dummy URL one time apiece.

With a total of 40,000 users requesting one URL each over a period of 10 minutes, you might think that you'd get an awesome data set, and the opportunity to present some rad analysis on your Apache config.

A Grinding Halt

Yeah, the computers couldn't reliably create that many users. They froze. I could only get about 900 fake users on each. On the bright side, 2700 users requesting one page apiece over a period of 1 minute 30 seconds worked without a hitch.
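To put that successful run next to the 30,000-hits-a-minute target, here is the quick arithmetic, using only the figures quoted in this post:

```python
TARGET_PER_MINUTE = 30_000      # projected peak from earlier in the post

users = 2_700                   # simulated users that completed cleanly
duration_minutes = 1.5          # 1 minute 30 seconds

achieved_per_minute = users / duration_minutes
share_of_target = achieved_per_minute / TARGET_PER_MINUTE

print(f"achieved: {achieved_per_minute:,.0f} requests/minute")
print(f"share of target: {share_of_target:.0%}")
```

That comes to 1,800 requests a minute, about 6% of the projected peak, so the clean run says little yet about how the server behaves under the real load.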

Now, I'm working on a better testing method. Any suggestions on how to drum up 40,000 views a minute in a measurable environment?