So here's the deal. GateHouse Media is becoming a large newspaper/technology company. We have a few hundred web sites, and accurate stats on those sites are invaluable.
When I heard that accredited stat tracking companies like HitBox and Omniture cost hundreds of thousands of dollars a year, I thought: holy geez, can I help out with this expense?
I did some exploration on the subject, and since they didn't [immediately] go along with it, here are my preliminary findings.
For the sake of simplicity, let's say that I only need to gather IP addresses and request URLs.
Oh, and the ability to analyze 500 million pageviews a month.
The Plan
We want maximum accuracy using a minimum of resources here.
This plan uses Apache request logging to gather & store information in the least complex, obtrusive, memory-intensive or CPU-intensive way I can think of.
Capture data using JavaScript. Compile a fake request URL to our statistics server (beacon-style), and load that into an <img> tag.
<img src="http://statdomain.com/images/blank.gif?page=url_of_this_page&other=more_info_we_can_track"/>
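Here's a minimal sketch of how the JavaScript side could build that URL. The function names `buildBeaconUrl` and `sendBeacon` are made up for illustration, and the `page` and `other` parameters simply mirror the example tag above; nothing here is a finished tracking script.

```javascript
// Hypothetical helper: build the beacon URL from a base image URL
// and a plain object of values we want to track.
function buildBeaconUrl(baseUrl, params) {
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(params[key]))
    .join("&");
  return query ? baseUrl + "?" + query : baseUrl;
}

// Hypothetical helper: fire the beacon by loading the URL into an image.
// The Image constructor only exists in browsers, so this is a no-op elsewhere.
function sendBeacon(baseUrl, params) {
  const url = buildBeaconUrl(baseUrl, params);
  if (typeof Image !== "undefined") {
    new Image().src = url;
  }
  return url;
}

// Usage in a page (skipped outside a browser):
if (typeof window !== "undefined") {
  sendBeacon("http://statdomain.com/images/blank.gif", {
    page: window.location.href,
    other: "more_info_we_can_track",
  });
}
```

Encoding each value with `encodeURIComponent` matters here, because the page URL being reported will usually contain its own `?`, `&` and `=` characters.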
Your browser will execute the request, and our server will log it. Here's what you'll get:
10.1.1.105 - - [03/Jun/2008:10:32:59 -0400] "GET /lorem/ipsum/dolor/what/the/crap/flavia/ex/arbore/cadit/rit/try/this/out/ HTTP/1.1" 404 403
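Since the IP address and the request URL are all I'm after, pulling them back out of a line like that is mostly one regular expression. The sketch below assumes the standard common/combined log layout shown above; `parseLogLine`, `trackedPage` and `countByUrl` are hypothetical names, and `trackedPage` assumes the beacon reports the page in a `page` query parameter, as in the example tag.

```javascript
// Leading fields of an Apache common/combined log line:
// client IP, identity, user, [timestamp], "METHOD URL PROTOCOL", status, bytes.
const LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)/;

// Returns the fields we care about, or null if the line doesn't match.
function parseLogLine(line) {
  const m = LOG_LINE.exec(line);
  if (m === null) return null;
  return { ip: m[1], time: m[2], method: m[3], url: m[4], status: Number(m[5]) };
}

// Assumption: the beacon carries the tracked page in a "page" parameter.
// Returns null when the request has no such parameter.
function trackedPage(hit) {
  return new URL(hit.url, "http://statdomain.com").searchParams.get("page");
}

// Tally hits per request URL, skipping lines that don't parse.
function countByUrl(lines) {
  const counts = new Map();
  for (const line of lines) {
    const hit = parseLogLine(line);
    if (hit === null) continue;
    counts.set(hit.url, (counts.get(hit.url) || 0) + 1);
  }
  return counts;
}
```

At 500 million lines a month you'd want to stream the log file rather than load it into memory, but the per-line work stays the same.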
Configuration
I set up a prototype to drum test requests against, and I learned a few things about tweaking Apache along the way.
Because each page view sends just one request to this server (instead of the 5-20 or 30 that you can expect from a normal web page), and because those requests will be coming in at an astronomical rate, I want to make sure that connection turnover is as high as I can get it.
_disable KeepAlive so we aren't maintaining connections uselessly_
KeepAlive Off
_raise the number of connections (MaxClients) to prepare for the flurry of connections we want to accept_
<IfModule prefork.c>
...
MaxClients 1600
...
</IfModule>
_establish a combined access_log (usually this is already there, and just needs to be uncommented)_
CustomLog logs/access_log combined
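_quiet the other logs (you don't need gigs of '404 page not found' errors); the error log is controlled by ErrorLog and LogLevel rather than CustomLog, so raise the level instead of commenting it out_
LogLevel crit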
Prototype Testing
At 500m pageviews a month, I projected that this server would be drummed with more than *30,000* hits a minute during peak times.
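For a sense of scale, here is the back-of-the-envelope arithmetic behind that number. It assumes a 30-day month and perfectly even traffic for the average; the 30,000 peak is my projection from above, not something the code derives.

```javascript
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200, assuming a 30-day month

// Average hits per minute if traffic were spread evenly over the month.
function averagePerMinute(pageviewsPerMonth) {
  return pageviewsPerMonth / MINUTES_PER_MONTH;
}

// How many times busier than average a given peak rate is.
function peakToAverage(peakPerMinute, pageviewsPerMonth) {
  return peakPerMinute / averagePerMinute(pageviewsPerMonth);
}

console.log(Math.round(averagePerMinute(500e6)));    // 11574
console.log(peakToAverage(30000, 500e6).toFixed(2)); // 2.59
```

So the flat average is about 11,600 hits a minute, and a 30,000-a-minute peak is roughly 2.6 times that.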
To replicate this load, I installed JMeter on 4 computers here and set up a test that would pound the test IP with requests for a dummy URL for 10 minutes with 20,000 users on each requesting the URL one time per user.
With a total of 40,000 users requesting one URL each over a period of 10 minutes, you might think that you'd get an awesome data set, and the opportunity to present some rad analysis on your Apache config.
A Grinding Halt
Yeah, the computers couldn't reliably create that many users. They froze. I could only get about 900 fake users on each. On the bright side, 2,700 users requesting one page apiece over a period of 1 minute 30 seconds worked without a hitch.
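To put that successful run next to the projected peak, here is the same kind of quick arithmetic. It assumes the 2,700 requests were spread evenly over the 90 seconds, which JMeter doesn't guarantee; `requestsPerMinute` is just an illustrative helper.

```javascript
// Convert a request count over some number of seconds into a per-minute rate.
function requestsPerMinute(requests, seconds) {
  return (requests / seconds) * 60;
}

const achievedRate = requestsPerMinute(2700, 90); // 1800 requests a minute
const projectedPeak = 30000;                      // peak projected earlier in this post

// Share of the projected peak that the test actually exercised.
const percentOfPeak = (achievedRate * 100) / projectedPeak;

console.log(achievedRate, percentOfPeak + "%"); // 1800 6%
```

In other words, the run that worked only exercised about 6% of the load I'm actually worried about.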
Now, I'm working on a better testing method. Any suggestions on how to drum up 40,000 views a minute in a measurable environment?