Friday, October 10, 2008

First look at WSL data

Well, I finally got a script put together to start picking through the web server log (WSL) data for my dissertation. The log extends from 30 August 2005 to 01 October 2008. This wasn't a perfect run, but not bad for basically a single regular expression match and a few if-then's for filtering (a sketch of the idea follows the list below).

In all, there are:

  • 637 MB of log data (rounding with the `ls -lh` command; across 8 files)

  • 6,334,626 logged transactions

  • 718.35883498192 seconds to run (thanks to modified code from http://www.developerfusion.com/code/2058/determine-execution-time-in-php/)

  • 2,376,693 legitimate entries (useful requests, with some caution…)

  • only 858 entries which did not match the reg-ex (some appear to be malicious, and some only require a tweak on the expression)

  • 2,265,199 image requests before we stopped logging them in April of this year (this group also ended up with some legitimate entries)

  • 361,410 HTTP error status codes (>= 300); many come from USU's IT department to keep us honest

  • 422,823 search engine bots
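For the curious, here is a minimal sketch of the kind of parse-and-classify pass that produces buckets like those above. This is not my actual script: the pattern assumes Apache's combined log format, the filter expressions are illustrative, and the timing is just a microtime() delta in the spirit of the developerfusion article linked above.

```php
<?php
// Sketch only: assumes Apache "combined" log format; the real script's
// pattern and filters differ.
$start = microtime(true);

$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
$counts  = array('legit' => 0, 'no_match' => 0, 'image' => 0, 'error' => 0, 'bot' => 0);

$fh = fopen('access.log', 'r');                 // one of the 8 log files
while (($line = fgets($fh)) !== false) {
    if (!preg_match($pattern, trim($line), $m)) {
        $counts['no_match']++;                  // candidates for a regex tweak (or malice)
        continue;
    }
    list(, $ip, $time, $method, $url, $status, $bytes, $referer, $agent) = $m;

    if (preg_match('/\.(gif|jpe?g|png|ico|css|js)(\?|$)/i', $url)) {
        $counts['image']++;                     // image/static requests
    } elseif ((int) $status >= 300) {
        $counts['error']++;                     // redirects and errors, incl. IT scans
    } elseif (preg_match('/bot|crawler|spider|slurp/i', $agent)) {
        $counts['bot']++;                       // search engine bots, by user-agent
    } else {
        $counts['legit']++;                     // "legitimate" entries, with some caution
    }
}
fclose($fh);

print_r($counts);
printf("%.2f seconds to run\n", microtime(true) - $start);
```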



So, for a first pass, these are my results. Don't be surprised when they change the next time I report on the pre-processing.
Next, I'll tweak the regular expressions so I don't misclassify requests, and I'll identify the IP addresses of bots/crawlers/spiders, potential attacks, and USU's IT scans so I can cull their transactions (a rough sketch of that culling follows). Along with those, I'll also cull hits to our test sites and other virtual hosts. While these data may be useful for some purposes (and I would love to explore them all), I must focus on characterizing our users and weeding out any that are not relevant to the educational purpose.
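For illustration, here is a rough sketch of one way to do that culling. Every IP prefix below is made up, and the reverse-DNS check is one common way to confirm that a self-proclaimed crawler is genuine:

```php
<?php
// Sketch of culling by IP prefix; the prefixes are hypothetical placeholders.
$cullPrefixes = array(
    '129.123.',    // e.g., a campus IT scanner range
    '66.249.',     // e.g., a crawler range
);

function shouldCull($ip, $prefixes) {
    foreach ($prefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) {
            return true;                        // matches a known range: cull it
        }
    }
    // Reverse DNS can confirm a self-proclaimed Googlebot; gethostbyaddr()
    // returns the IP unchanged when the lookup fails, which simply won't match.
    $host = gethostbyaddr($ip);
    return is_string($host) && preg_match('/\.googlebot\.com$/i', $host);
}
```

A check like this would slot into the classification loop sketched earlier, ahead of the other filters.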

After that, I'll start looking for user sessions and sticking it all into a database for queries; the sketch below shows the usual timeout heuristic.
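A new session starts when the same visitor has been idle for too long. A minimal sketch, assuming requests are sorted by time and keying visitors on IP plus user-agent (a login or cookie would be more reliable):

```php
<?php
// Sketch of timeout-based sessionization; the 30-minute threshold and the
// IP + user-agent visitor key are common conventions, not settled choices.
define('SESSION_TIMEOUT', 30 * 60);   // seconds

$lastSeen  = array();   // visitor key => timestamp of previous request
$sessionOf = array();   // visitor key => current session id
$nextId    = 1;

function sessionFor($ip, $agent, $timestamp) {
    global $lastSeen, $sessionOf, $nextId;
    $key = $ip . '|' . $agent;
    if (!isset($lastSeen[$key]) || $timestamp - $lastSeen[$key] > SESSION_TIMEOUT) {
        $sessionOf[$key] = $nextId++;  // first visit, or idle too long: new session
    }
    $lastSeen[$key] = $timestamp;
    return $sessionOf[$key];
}

// Each parsed request can then be written to the database with its session id.
```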
Well, how exciting!
Until next time…

Wednesday, October 8, 2008

A library perspective

This article is important because it shows some thinking along the lines of librarian use of large data sets; since I am working with the digital library community, these are good thoughts to be aware of.

A note on librarian thinking

I know many librarians, and I like to work with them. However, when it comes to thinking, they are different: they love to organize information so that people can find it later. The funny thing about a library is that for everyone to be able to find anything, the library has generally had to choose an organizational scheme which makes little sense to many non-librarians. So, when we interact with the library sciences we must understand the "need" to organize and retrieve the world's knowledge for safe keeping. Other than that, librarians are just like you and me… :)

Cummins, C. (2006). Below the surface: New tools—and savvy librarians—are turning the ILS into a gold mine for making more informed decisions. Library Journal, 131(1), 12-14.

The focus is on how the new library system tools are making data-driven decisions (D3) easier for librarians and management. However, along with these tools there needs to be a greater understanding of data sources and the interpretation of results.

As a researcher, I can see where this article could have benefited from a framework of D3 and how questions turn into answers. I see the Knowledge Discovery from Data/Databases (KDD) framework throughout this article, but I think the nOOb would have a hard time grokking the steps from data to answers.

There is one very nice quote about the utility and practical solutions provided by library DM. Speaking of how nice it is to analyze search strings (both successful and unsuccessful), "It's really like an ongoing, automatic version of usability testing" (p. 14).

Much of this article isn't what some would consider data mining. However, the KDD ideas of questions, data collection and selection, analysis, interpretation, and communication are all present.

Looking at the literature

As part of my DM Dissertation, I am working on a literature review from the ERIC database (via EBSCO). Thus far the keywords "data mining" bring up about 120 results. Er… just a second.

I just revisited EBSCOhost and am now getting a different number of results: 117 total on "data mining" and only 27 if I check the "Peer Reviewed" option. Having just found this, I imagine the missing results are some of the duplicate records I had found in the first set. Some have just been plain unusable results…

Well, I can see that I'll need to double-check the results and move forward. At any rate, the purpose of the "literature review" label is for summary, comments, and reference for the papers of impact on my review. I will also be adding in other labels so I can tell which papers speak about which aspects of Educational Data Mining.

Oh, I subscribed to the EBSCO RSS feed in Google Reader; I'll see if I can't get that exposed, shared, or linked to from here.

Tuesday, October 7, 2008

DM and the "churn" of business

Well, as you can see, my blogging frequency is really low... But that is about to change! This has been a very educational 5 months since the last post (more about that somewhere else).

Today's topic has to do with a concept from business which is relevant to educational data mining (EDM): Churn and Customer Relations Management.
The article pulled from today is:

Lejeune, M. A. P. M. (2001). Measuring the impact of data mining on churn management. Internet Research, 11(5), 375–387.

Churn

Lejeune does a nice job laying out the basics of the churn issue and why companies care about managing attrition: "Churn or customer attrition is defined as 'the annual turnover of the market base' (Strouse, 1999)" (p. 377). Electronic commerce is assumed to be the driving reason churn has become so great in the last few years. Lejeune mentions that having competition only "one click away" requires companies to have a multi-faceted marketing and management strategy to acquire and retain customers.

In the education world, there are several parallel conditions of churn: student attrition in higher education, class churn at the start of a semester, and turnover in the use of educational tools, to name a few. Alternative schools, classes, and tools are available everywhere (even in K-12) to students and teachers. However, the view of churn in education may not be entirely negative, either. For instance, an educator interested in the growth of their students would certainly be happy when a student is ready for a greater challenge than can currently be provided.

In the world of the Instructional Architect, churn happens more than we would like, but again, that isn't a bad thing. Preliminary data has shown that about 10% of our registrants actually return within six months of exposure. We hope that this represents "early adopters" and that more will come soon. However, as the IA is built for research and supporting teachers, if they find something different that works for them, then we are happy with that. The problem has been knowing where they go and why—a nice research topic for someone if they are interested…

Customer Relations Management (CRM)

Part of the reason for data mining the IA is to figure out what user segments exist so we can better address their needs. Business has shown that the cost of acquiring a customer is about equal to the cost of a winback (unless I missed something in Lejeune's paper). So, maintaining a good relationship with current customers becomes paramount. This relationship is generally best handled on a 1:1 basis—the holy grail of marketing.

Approaching that perfect marketing and retention strategy has become easier by mining the many bits of information that users provide, often without realizing they are helping (a nod to privacy concerns). With the ability to gather customer (a.k.a. user) information and almost immediately apply it to the current user experience, CRM has come a long way.

Another reference for CRM and Data Mining is:
Cooley, R. (2003). Mining customer relationship management (CRM) data. In N. Ye (Ed.), The handbook of data mining (pp. 597–616). Mahwah, NJ: Lawrence Erlbaum Associates.


DM vs. Statistics

Lejeune makes an argument that DM ≠ statistical analysis, and supports the idea that data mining has something more to offer than standard statistics and reporting. I would have to agree wholeheartedly. His two reasons are that the need is for timely analysis rather than historical facts, and that statistics traditionally find the more obvious variations, which may not be the most meaningful.

In education the t-test, ANOVA, and regression have held sway for many years. However, each of those tests begins to fall short when dealing with the amount and kind of data available in DM.

While most (myself included) would differentiate the two bodies of DM by the machine learning concept of supervision, Lejeune offers the following DM objectives:

Descriptive
increasing understanding of the data and their content (usually requiring unsupervised methods).

Predictive or Prescriptive
forecasting and devising, aimed at orienting the decision process (usually requiring supervised methods).



Privacy

This is a great concern in today's educational environment, and no less for the IA. As Lejeune recommends, we have already included the use of personal information in our privacy policy.

Data sources

The three typical sources for e-business DM are: (a) clickstreams, (b) cookies, and (c) customer registrations. The same is true for EDM. We are also seeking to tie these with surveys and interviews in the near future—perhaps we can find out where our less frequent visitors are going, if anywhere…

Wrap-up

Lejeune goes on to say,
"Descriptive data mining methods appear useful to understand differences and particularities of the various categories of clients. It allows customer segmentation, formation of homogeneous clusters or categories of customers characterized by a small variance within groups, and a high variance between groups. Based on these clusters, data mining methods have th ability to detect common features of the individuals belonging to the same cluster" (p. 382).

Currently in the IA we know we have users, but we don't know much about how they are using the tool, especially after the workshops. So, this characterization is very important to us at this time. Later, there will be many more questions to be answered, but we need a stake in the ground of our user groups, today.

He then describes some customer taxonomies that have been developed from data mining exercises, along with some sensitivity measures to see how responsive DM can be in churn management.
What I want to pull from this is that DM is helpful in education, just as it is in business. We may have different domains and considerations when it comes to use, but DM is still very applicable.

Saturday, May 17, 2008

Starting to dig

Educational Data Mining (EDM) has become my life... I am working on a PhD in Instructional Technology at Utah State University and my dissertation proposal is nearing completion. The current title:

"Web Usage Mining: Application to an Online Educational Digital Library Service"


I'll post a link to the defended proposal. But for now, the gist is:
  1. There is a ton of educational data out there that is just waiting to be mined.
  2. I have been working with the Instructional Architect for the past 5 years and have some experience with web metrics, but now it is time for some serious mining to characterize our users.
  3. We have some suspicions as to what user segments exist, but we don't know. Therefore I'll be using an unsupervised machine learning technique on our data (e.g., LCA, SOM).
  4. The data will come from our user database, web server logs, and Google Analytics (GA).
  5. The results of this research should be helpful to (a) the IA tool development, (b) the IA team's teacher professional development workshop, and (c) the digital library community.
On the blog, there will be a page of links and resources as I find interesting places, applications, and ideas about EDM.

It has been interesting how many people are interested in EDM at USU and elsewhere. Both Jamison Fargo and Yong Seog Kim are working with me on the methods and such. As it turns out, there are 3-4 other PhD students at USU who have the same kind of interest. We have had some great conversations and continue to help each other out. Working with the folks at the National Science Digital Library, we have had some good publications and experience talking about web metrics and user understanding.

Well, there it is for this post. Now on to the good resources...