Wikipedia Page Hopping

Tue, May 17, 2011

I have realized that, like most users of Wikipedia, I do a lot of Wikipedia page-hopping [1]. Wikipedia is sort of addictive that way. You start reading about a piece of Flamenco music and twenty minutes later find yourself staring at the page about ETA, a Basque nationalist organization. So I decided to figure out exactly how I get lost in this huge web of interconnected articles. I use Chromium, and it stores its history in a SQLite3 database file. I wrote a small Ruby script that parses the history, splits it into chunks of articles accessed per day, and filters out only the Wikipedia links.

This is basically what I had to do:

  • Query the db for the last visit time and URLs.

  • Chromium (and Google Chrome) stores timestamps of page visits in a not-so-obvious format: the number of microseconds elapsed since Jan 01, 1601.

  • Splitting the URLs into per-day chunks involves calculating the number of microseconds in a day and grouping the URLs by it. Ruby’s Array#group_by is really handy here.

  • Analysis of the URLs involves filtering only the URLs that contain “wikipedia”.

  • There is a caveat here: redirects to Wikipedia from both Google and Facebook also contain the string “wikipedia” in their URLs, and these need to be filtered out.

The analysis of my Wikipedia history showed me some interesting things. For example, when I was reading Michael J. Arlen’s Passage to Ararat, I spent a lot of time on Wikipedia, hopping between pages about Armenian history and culture. This is what the list of Wikipedia pages on that day looks like:

When I was reading about Data warehousing, this is how the hopping happened:

I am still trying to make more sense of the links that I clicked away and the articles I read when I was page hopping.

The Ruby script that parses Chromium history and figures out the Wikipedia links is below:

#!/usr/bin/env ruby
# Ruby script to parse Chromium (or Google Chrome) history to identify Wikipedia pages read per day.
# usage: ./wikipedia_history.rb <location of Chromium history db>
# The Chromium history db can usually be found under ~/.config/chromium/Default

require 'rubygems'
require 'sqlite3'

# Chromium stores visit times as microseconds since Jan 01, 1601.
US_IN_A_DAY = 24 * 60 * 60 * 1000000
SITE = "wikipedia"

module ChromiumHP
  class DbConnection
    def initialize db_name
      @db_name = db_name
    end

    # Fetch every (last_visit_time, url) pair, oldest first.
    def urls_history
      db =
      db.execute("SELECT last_visit_time, url FROM urls ORDER BY last_visit_time;").map do |t, u|
        {:last_visit_time => t, :url => u}
      end
    end
  end

  class Parser
    def initialize db_name
      @db_name = db_name
    end

    # Split the history into chunks of URLs visited within the same
    # n-day window, by integer-dividing the visit timestamps.
    def chunks days
      @history ||= get_history
      parts = @history.group_by do |h|
        h[:last_visit_time] / (days * US_IN_A_DAY)
      end { |k, group| group }
    end

    def get_history
    end
  end

  class Analyzer
    # Keep only the Wikipedia URLs in each chunk, dropping Google and
    # Facebook redirects whose URLs also contain "wikipedia", then drop
    # empty chunks and sort the rest by size.
    def self.graph chunks do |c|
        c.find_all do |entry|
          url = entry[:url]
          url.include?(SITE) &&
            !url.include?("facebook") &&
            !url.include?("google")
      end.find_all do |c|
      end.sort_by do |c|
      end.map do |c| { |entry| entry[:url] }

history_loc = ARGV.first
abort "Error: Pass the chromium history location as parameter" if history_loc.nil?

daily_chunks = + "/History").chunks(1)
ChromiumHP::Analyzer.graph(daily_chunks).each do |entries|
  puts entries
  puts ""
end