In all, there are:
- 637 MB of log data across 8 files (rounded, per `ls -lh`)
- 6,334,626 logged transactions
- 718.35883498192 seconds to run (timed with modified code from http://www.developerfusion.com/code/2058/determine-execution-time-in-php/)
- 2,376,693 legitimate entries (useful requests, though with some caution…)
- only 858 entries that did not match the regex (some appear to be malicious, and some only require a tweak to the expression; a sketch of the kind of pattern involved follows this list)
- 2,265,199 image requests before we stopped logging them in April of this year (this group also ended up with some legitimate entries)
- 361,410 HTTP status codes of 300 or greater (redirects and errors); many come from USU's IT department to keep us honest
- 422,823 search-engine bot requests
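
To give a sense of how these buckets were tallied, here is a minimal sketch of the kind of first-pass loop involved. It is not the actual script: it assumes Apache combined-format logs, and the regular expression, file paths, and classification rules are illustrative placeholders; the timing uses the same `microtime()` idea as the linked snippet.

```php
<?php
// Illustrative first-pass classification; pattern, paths, and rules are placeholders.
$start = microtime(true);

// Rough Apache combined log format: IP, identd, user, [date], "request", status, bytes, "referer", "agent"
$pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';

$counts = array('legit' => 0, 'nomatch' => 0, 'image' => 0, 'status300' => 0, 'bot' => 0);

foreach (glob('logs/access_log*') as $file) {              // the 8 log files
    $fh = fopen($file, 'r');
    while (($line = fgets($fh)) !== false) {
        if (!preg_match($pattern, trim($line), $m)) {
            $counts['nomatch']++;                          // entries the regex misses
            continue;
        }
        list(, $ip, , , , $request, $status, , , $agent) = $m;

        if (preg_match('/\.(gif|jpe?g|png|ico|css|js)(\s|\?)/i', $request)) {
            $counts['image']++;                            // image and other static requests
        } elseif ((int)$status >= 300) {
            $counts['status300']++;                        // redirects and errors
        } elseif (preg_match('/bot|crawler|spider|slurp/i', $agent)) {
            $counts['bot']++;                              // search-engine bots, by user agent
        } else {
            $counts['legit']++;                            // what's left: the interesting requests
        }
    }
    fclose($fh);
}

printf("Run time: %f seconds\n", microtime(true) - $start);
print_r($counts);
```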
So, for a first pass, these are my results. Don't be surprised when they change the next time I report on the pre-processing.
Next, I'll be tweaking the regular expressions so I don't misclassify requests, and identifying the IP addresses of bots/crawlers/spiders, potential attacks, and USU's IT scans so I can cull their transactions. Along with those, I'll also cull hits to our test sites and other virtual hosts. While these data may be useful for some purposes (and I would love to explore them all), I must focus on characterizing our users and weeding out any that are not relevant to the educational purpose.
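
That culling should boil down to a handful of rules applied per request. Something like the sketch below, where the IP prefix, host name, and function name are hypothetical stand-ins rather than the actual addresses and hosts I will filter on:

```php
<?php
// Hypothetical culling rules; the IP prefix and host name are placeholders, not real filters.
$cullIpPrefixes = array('192.0.2.');                   // e.g., known scanner / internal ranges
$cullHosts      = array('test.example.edu');           // test sites and other virtual hosts

function shouldCull($ip, $host, $agent, $cullIpPrefixes, $cullHosts) {
    foreach ($cullIpPrefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) {
            return true;                               // request came from a culled IP range
        }
    }
    if (in_array(strtolower($host), $cullHosts)) {
        return true;                                   // request hit a non-production virtual host
    }
    if (preg_match('/bot|crawler|spider/i', $agent)) {
        return true;                                   // crawler, by user-agent string
    }
    return false;
}
```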
After that, I'll start looking for user sessions and sticking it all into a database for queries.
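
The usual heuristic for that step is to treat requests from the same visitor (IP plus user agent) as one session until there is a long idle gap, commonly around 30 minutes, and then write each hit to the database tagged with its session. A rough sketch, with hypothetical table and column names and SQLite standing in for whatever database I end up using:

```php
<?php
// Sketch of sessionization: split a visitor's requests on a 30-minute idle gap.
// $requests is assumed sorted by timestamp; table/column names are hypothetical.
define('SESSION_GAP', 30 * 60);

$pdo = new PDO('sqlite:weblog.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS hits (session_id INTEGER, ip TEXT, ts INTEGER, request TEXT)');
$stmt = $pdo->prepare('INSERT INTO hits (session_id, ip, ts, request) VALUES (?, ?, ?, ?)');

$sessions = array();    // visitor key => current session id
$lastSeen = array();    // visitor key => timestamp of that visitor's last request
$nextId   = 1;

foreach ($requests as $r) {                            // each $r: array with ip, agent, ts, request
    $key = $r['ip'] . '|' . $r['agent'];
    if (!isset($lastSeen[$key]) || $r['ts'] - $lastSeen[$key] > SESSION_GAP) {
        $sessions[$key] = $nextId++;                   // new visitor, or idle too long: new session
    }
    $lastSeen[$key] = $r['ts'];
    $stmt->execute(array($sessions[$key], $r['ip'], $r['ts'], $r['request']));
}
```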
Well, how exciting!
Until next time…
