Wednesday 11 July 2012


Data collection has ended, first analyses are ongoing


The data collection using Piwik has been successful. We used a Perl script to download a separate Piwik logfile per week (for security). This produced on average 4 MB of data per week (9 weeks in May-June). Here is a snippet of Perl code to download the Piwik data:

use LWP::UserAgent;
use HTTP::Request;

my $auth   = "XXXSECRETXXX";
my $period = "range";
my $date   = $date1 . "," . $date2;
my $url = "https://oururl/piwik/index.php?module=API"
        . "&method=Live.getLastVisitsDetails&format=XML"
        . "&idSite=" . $siteid
        . "&period=" . $period
        . "&date=" . $date
        . "&expanded=1&filter_limit=1000"
        . "&token_auth=" . $auth;

my $ua = LWP::UserAgent->new;
$ua->timeout(120);
my $request  = HTTP::Request->new( GET => $url );
my $response = $ua->request($request);
my $content  = $response->content();
print MYFILE $content;    # MYFILE is opened earlier in the script

Before we go into the more fancy analysis methods, we first study our data in a more traditional way. Again using Perl, we collected the Piwik data, linked it to the outcome of the progress test that was administered in May, and produced two comma-separated reports: one with data per student and one with data per session.

Per student we collected the number of sessions, the progress-test result, and the university, year, and program of every student who participated in the progress test. This allows us to count the number of students that used ProF.
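The per-student aggregation can be sketched as follows. This is a Python illustration of the logic our Perl script implements; the record layout and the sample values are made up for the example.

```python
# Hypothetical session records as produced by the Piwik log parser:
# (student id, university, year, program, progress-test result).
sessions = [
    ("s1", "UM", 2, "medicine", 7.5),
    ("s1", "UM", 2, "medicine", 7.5),
    ("s2", "UG", 1, "medicine", 5.0),
]

# Build one row per student: the session count plus the student's
# test result and background attributes.
per_student = {}
for sid, uni, year, prog, result in sessions:
    if sid not in per_student:
        per_student[sid] = {"sessions": 0, "university": uni,
                            "year": year, "program": prog, "result": result}
    per_student[sid]["sessions"] += 1

# The number of distinct students is the number of ProF users.
print(len(per_student))
```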

Per session we collected the number of pages, the duration of the visit, and the number of pages with specific settings (such as “cumulative score”, “longitudinal view”, etc.). For each session we also stored the test result and the student’s university, year, and program. This allows us to characterize the sessions and study distributions over them.
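Counting the pages of a session that were viewed with a specific setting boils down to inspecting the query parameters of the logged page URLs. A minimal Python sketch, where the “view” parameter is a hypothetical stand-in for ProF’s real setting parameters:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical page views of a single session.
pages = [
    "https://oururl/prof/page?view=cumulative",
    "https://oururl/prof/page?view=longitudinal",
    "https://oururl/prof/page?view=cumulative",
    "https://oururl/prof/page",
]

def count_setting(pages, setting):
    """Count how many pages of a session were viewed with a given setting."""
    n = 0
    for url in pages:
        qs = parse_qs(urlparse(url).query)
        if setting in qs.get("view", []):
            n += 1
    return n

# Total pages in the session and pages with the "cumulative" setting.
print(len(pages), count_setting(pages, "cumulative"))
```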

These two comma-separated files can be analyzed with many standard tools. We started using RapidMiner for analyzing the student-report data. Although it is quite possible to do all the filtering and aggregation we need, it is rather clumsy to change the filtering and collect the results in a report every time you want a different selection. Therefore we decided to switch to R. R is a script-based environment, and changing a script to get a different selection is much easier than editing RapidMiner’s interactive flowchart. It is true that RapidMiner offers a nice interactive module to inspect data visually, but the quality and flexibility of graphs in R is much better and, honestly, producing different graphs is just as easy.

The analysis shows that the usage of the ProF system is a little higher than we expected. It is also clearly visible that usage rises sharply when students are stimulated to use ProF, for instance because they have to include an analysis in their portfolio.

We also see differences in usage between students with an insufficient grade and students with sufficient or good grades: they differ not only in the number of times ProF is used (good students use ProF the most), but also in the number of details they study.

The Perl script that collects our data also produces XES files for process mining in ProM. Now that we have analyzed the data, we can decide how to filter and preprocess it in a way that is meaningful for ProM.
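The XES structure itself is plain XML: one trace per ProF session, one event per page view, with the standard `concept:name` and `time:timestamp` attributes that ProM expects. A minimal Python sketch of what the script emits (the session and event names here are made up):

```python
import xml.etree.ElementTree as ET

# An XES log: one trace per session, one event per page view.
log = ET.Element("log", {"xes.version": "1.0",
                         "xmlns": "http://www.xes-standard.org/"})
trace = ET.SubElement(log, "trace")
ET.SubElement(trace, "string", {"key": "concept:name", "value": "session-1"})

# Illustrative page views with their timestamps.
for name, ts in [("overview", "2012-05-10T10:00:00+02:00"),
                 ("cumulative score", "2012-05-10T10:01:30+02:00")]:
    event = ET.SubElement(trace, "event")
    ET.SubElement(event, "string", {"key": "concept:name", "value": name})
    ET.SubElement(event, "date", {"key": "time:timestamp", "value": ts})

xes = ET.tostring(log, encoding="unicode")
print(xes)
```

Writing the log this way keeps each session as a trace, so ProM can directly discover the navigation process from the page-view events.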