
Protecting user privacy on the COPIM project

Looking back at how the COPIM project protected the privacy of our website users and rethought the typical technical model for gathering web analytics

Published on Aug 17, 2023

Header photo by Matthew Henry on Unsplash licensed under the Unsplash License.

The COPIM project ran from 2019 to 2023 and worked to create open publishing infrastructures driven by the core ethical values of being community-led, working openly with open access and open source licensing, and being anti-competitive. As an Open Source Software Developer on the project, I sought to keep technological ethics in mind when creating software and maintaining core project infrastructure such as the COPIM website.

In this blog post, I look back at how we protected the users of our website by rethinking the default mechanisms used for modern websites and by taking a considered approach to users’ personal data. I outline the ethical considerations we addressed and, for those interested, go into some technical detail on the mechanisms we used to: collect website analytics without handing the data to a third-party service; reduce the amount of data gathered about our users to the absolute minimum required; and set up an onion service mirror of the website that users can access through the Tor network for added data privacy and circumvention of censorship mechanisms.

1.0: web ethics

COPIM was a project based around open access licensing: the project created scholar-led, community-owned systems and infrastructures to enable open access book publishing by smaller presses in opposition to the monopolisation of open access publishing by commercial publishers and for-profit intermediaries. Alongside this strong focus on open access, the project also recognised the holistic and overlapping nature of open values. For example, we believed that supporting open infrastructures means supporting open source software and open standards for metadata.

With this holistic ethical thinking in mind, supporting community-led infrastructures also meant thinking about protecting the users of our community. Users of websites and other software infrastructures are surrounded by risks to their online privacy and threats to their personal data. State and corporate surveillance of people’s internet usage is a massive issue for privacy rights. In 2013, Edward Snowden exposed the USA’s PRISM surveillance programme and the US National Security Agency’s collaboration with tech companies including Microsoft, Facebook, Apple, and Google (Greenwald & MacAskill, 2013). In 2017, WikiLeaks published the Vault 7 documents revealing that the US Central Intelligence Agency was able to extract user data from various ‘smart’ technology products and from the iOS and Android smartphone operating systems (Burgess, 2018).

The limited scope of the COPIM project precluded us from tackling the massive surveillance issues raised above but what we were able to do was ensure that our user-facing infrastructure was as free from digital surveillance as we were able to make it. We wanted our web presence to demonstrate an alternative mode of engaging with users on the internet and to show, albeit in a small-scale way, that a different internet is possible: to, in the words of Cory Doctorow’s (Doctorow, 2023) upcoming book on interoperability and dismantling corporate hegemony online, “seize the means of computation”.

In line with COPIM’s ‘scaling small’ philosophy (Adema & Moore, 2021), we deliberately avoided using technical infrastructures provided by monopolistic corporate entities like Google or Microsoft and instead relied on nurturing connections and collaborations with providers of smaller open source products and services. Instead of keeping documents on a cloud-hosted Microsoft SharePoint site, we hosted our own Virtual Private Server and kept documents in a Nextcloud instance; instead of holding meetings over Zoom, we used BigBlueButton, Jitsi, and eduMEET; and instead of hosting our website with a cloud-hosted all-in-one provider like Squarespace, we built and hosted our own websites on our own server.

Hosting our own services gave us more control over how they worked, which in turn allowed us to experiment with our approach to hosting our website and gathering analytics.

2.0: website analytics

Unless explicitly disabled, virtually all websites gather some form of user analytics, either through web server logs or through a third-party service like Google Analytics or Adobe Analytics. These typically log various details of each user visiting the website, including their IP address for geographic location details, which operating system and web browser they used to access the site, how long they spent on the site, which pages they visited, and which site referred them. These data can be used to infer usage trends, build user profiles for marketing purposes, and identify ways to improve the site’s user experience.

In our case, we needed to retain some user analytics for reporting to the COPIM project’s funders and for improving the site’s UX, but wanted to do so in a way that was minimally invasive for our users. For starters, this meant not handing over our users’ data to a massive third-party corporation like Google or Adobe (especially one already implicated in working with state intelligence agencies). Instead we used GoAccess, an open source web log analyser that offers a browser-based dashboard for exploring analytics.

2.1: GoAccess

Our server runs a variety of applications in Docker containers managed through Docker Compose. One container runs an NGINX web server which we used to serve our website, our Nextcloud instance, and our Gitea instance. We reconfigured the NGINX container to save its logs to a directory on the host server so that other containers could access them. Following a model set out by Dimitrii Prisacari on GitHub, we then set up a new Docker container for GoAccess using a Dockerfile like the one below and pointed the application at the NGINX logs we wanted to analyse.

Dockerfile for building a GoAccess application container
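
As an illustration, a minimal sketch of such a Dockerfile might look something like the following; the base image, package source, and paths here are assumptions rather than the project’s exact configuration.

    # Sketch of a GoAccess container build: base image, paths, and options are
    # assumptions, not the exact Dockerfile used by the project
    FROM alpine:3.18

    # install GoAccess from the Alpine package repositories
    RUN apk add --no-cache goaccess tzdata

    # copy in the GoAccess configuration
    COPY goaccess.conf /etc/goaccess/goaccess.conf

    # copy in the manually downloaded GeoLite2 city database
    # (referenced via the geoip-database option in goaccess.conf)
    COPY GeoLite2-City.mmdb /etc/goaccess/GeoLite2-City.mmdb

    # the NGINX log directory is mounted into the container by Docker Compose;
    # GoAccess reads the logs and writes its dashboard as set out in goaccess.conf
    ENTRYPOINT ["goaccess", "--config-file=/etc/goaccess/goaccess.conf"]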

It’s worth noting that GoAccess can use a free GeoLite2 geolocation database to map IP addresses to geographic locations. The database is free but requires registration to download, so we manually download the GeoLite2-City.mmdb file and have the Dockerfile copy it into the appropriate directory in the GoAccess container.

Persistence is enabled in goaccess.conf so that parsed log data is retained even when the GoAccess Docker container is stopped. We then use an NGINX configuration file to serve the application through a custom subdomain so that the dashboard is available over the web.
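
As an illustration, the relevant parts of goaccess.conf might look something like this; the paths and log format shown are assumptions.

    # Sketch of the relevant goaccess.conf options; paths are assumptions
    log-file /srv/logs/access.log
    log-format COMBINED

    # keep parsed data in an on-disk database and reload it on start-up
    persist true
    restore true
    db-path /srv/goaccess-data/

    # write a real-time HTML dashboard for NGINX to serve on a subdomain
    real-time-html true
    output /srv/report/index.html

    # GeoLite2 database copied into the container by the Dockerfile
    geoip-database /etc/goaccess/GeoLite2-City.mmdb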

2.2: redacting user location

The most identifying personal data gathered by NGINX server logs is the user’s IP address, which can pinpoint a user’s location down to the city level. While it’s useful to have some idea of which continent our users are accessing the site from so we can assess our international reach, there’s no reason for us to retain the whole IP address and its granular location data.

A public IPv4 address uses 32 bits to store an address as four octets, i.e. four chunks of 8-bit data. A typical IPv4 address looks like this: 198.51.100.44. Zeroing the last octet, e.g. 198.51.100.0, means the address may now geolocate to a different city, district, state, or province; zeroing the last two octets, e.g. 198.51.0.0, means it may even geolocate to a different country.

Following some instructions by Matt Bagley (Bagley, 2018) on anonymising NGINX logs, we set up nginx.conf to zero the final octet of a user’s IP address in access_log. This section of nginx.conf looks like this:

    map $remote_addr $remote_addr_anon {
        # IPv4: capture the first three octets and zero the last
        ~(?P<ip>\d+\.\d+\.\d+)\.    $ip.0;
        # IPv6: keep only the first two groups of the address
        ~(?P<ip>[^:]+:[^:]+):       $ip::;
        # IP addresses to not anonymize (such as your server)
        127.0.0.1                   $remote_addr;
        ::1                         $remote_addr;
        default                     0.0.0.0;
    }

    log_format  main  '$remote_addr_anon - $remote_user [$time_local] "$request" '
        '$status $body_bytes_sent "$http_referer" '
        '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

This could also be extended to zero the last two octets for further anonymisation by replacing the IPv4 line above with:

        ~(?P<ip>\d+\.\d+)\.\d+\.    $ip.0.0;

Further to this, we fully anonymise all IP addresses in logs older than two months using Matt Bagley’s Bash script for anonymising logs. This script uses ipv6loganon, a Linux program for anonymising IPv4/IPv6 addresses in HTTP server log files. A crontab entry runs the script weekly against log files older than two months.
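
The idea, sketched out with assumed paths, file names, and retention period (the real implementation is Matt Bagley’s script), looks something like this:

    #!/bin/bash
    # Sketch of the idea only: paths, file names, and retention period are
    # assumptions, not the exact setup; see Matt Bagley's script for the original.
    # A crontab entry such as '0 3 * * 0 /usr/local/bin/anonymise-old-logs.sh'
    # would run this weekly.

    LOG_DIR=/var/log/nginx

    # find rotated, gzipped access logs last modified more than ~2 months ago
    find "$LOG_DIR" -name 'access.log-*.gz' -mtime +60 | while read -r log; do
        tmp="${log%.gz}.anon"
        # decompress, pass each line through ipv6loganon to anonymise the client
        # address, then recompress over the original file
        zcat "$log" | ipv6loganon > "$tmp" && gzip -c "$tmp" > "$log" && rm -f "$tmp"
    done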

Overall, these anonymisation processes allow us to strike a good balance between retaining some information for determining the geographical reach of our website and protecting the privacy of our website users.

3.0: onion service

This work on analytics helped to protect the privacy of our users, but ultimately we still had control over harvesting and retaining users’ data. In order to put more control into the hands of users, we set up an onion service mirror of our website to encourage users to access the site completely anonymously over the Tor network.

Tor (The Onion Router) is built on the principle of onion routing, which uses nested layers of encryption to route internet communications. Tor software directs a user’s internet traffic through a number of encrypted relays on the Tor network so that the endpoint the user is accessing cannot determine where the user’s traffic originated. For most users, the easiest way to access the Tor network is through Tor Browser, a free and open source web browser built on Mozilla Firefox that automatically starts Tor and routes traffic through a circuit on the Tor network. The image below shows a user accessing the COPIM website through Tor Browser and being routed through a circuit that goes via a bridge relay and then through two relay nodes in the Netherlands and Austria, thus obfuscating their original IP address. Whenever they restart Tor Browser, they will be given a different circuit.

An onion service is a Tor hidden service: a website that can only be accessed when using the Tor network. In contrast to the usual URL structure of the web, these sites have addresses consisting of a randomly generated string of letters and numbers followed by the top-level domain .onion, e.g. http://kfp2vjmzkxmogotmtck3x5tefvn7yi77tsgfb6b6bsjgb3kxycypemid.onion/.

Onion services have been used for sites offering services that are illegal under international copyright regimes, such as Sci-Hub and Z-Library, shadow libraries that provide access to closed-access scholarly articles and academic monographs. But onion services can also be used to create mirrors of standard websites for those who need added privacy when accessing them. Twitter, for example, set up an onion service mirror in 2022, though this has not been maintained (Robertson, 2022). In his piece on onion services in a scholarly communications context, Kevin Sanders (Sanders, 2018) outlines the benefits for users accessing sites from oppressive geopolitical contexts:

“Onion services offer not only enhanced privacy for users, but also help to circumvent censorship. Some governments and regimes routinely deny access to clear-net websites deemed obscene or a threat to national security. Providing an onion service of the repository not only protects those that may suffer enhanced digital surveillance for challenging social constructs or social relations (which can have a severely chilling effect on intellectual freedom), but also on entire geographical areas that are locked out of accessing publicly accessible content on the clear-net.”

We wanted to provide this enhanced privacy and circumvention of censorship for users coming to the COPIM website, especially for those in countries where they would otherwise be unable to access the clearnet version of the site. We wanted a parallel onion service version of the site that would offer this kind of protection to anyone who wanted it without affecting the experience of standard web users.

3.1: technical details

Alec Muffett’s Enterprise Onion Toolkit (EOTK) is a great tool for setting up an onion mirror of an existing website but didn’t work for our Docker Compose environment. Instead we based our configuration on Onionize, a Docker container configuration for creating Tor onion services. This configuration (which you can see starting at line 172 in the docker-compose.yml file for my own server) involves creating a ‘Faraday’ network that cannot access the internet but that can be accessed by specified Docker containers, an NGINX container which serves the website through the Faraday network, and an Onionize container which automatically exposes selected Docker containers as onion services (in this case exposing only the onion-NGINX container).
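
A minimal sketch of this arrangement in docker-compose.yml might look something like the following; the Onionize image name and its options are placeholders, so consult the Onionize documentation and the linked docker-compose.yml for working values.

    # Sketch of the arrangement only: the onionize image name and its options are
    # placeholders, not the project's exact configuration
    version: "3"

    networks:
      # the 'Faraday' network is internal-only, with no route out to the internet
      faraday:
        internal: true
      default: {}

    services:
      onion-nginx:
        image: nginx:alpine
        volumes:
          # static copy of the site with links rewritten to the .onion address
          - ./site-onion:/usr/share/nginx/html:ro
        networks:
          - faraday              # reachable only over the Faraday network

      onionize:
        image: onionize/onionize # placeholder image name (assumption)
        networks:
          - faraday              # can reach onion-nginx in order to expose it
          - default              # needs outbound access to join the Tor network
        volumes:
          # assumption: Onionize discovers which containers to expose via the Docker socket
          - /var/run/docker.sock:/var/run/docker.sock:ro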

We also add an Onion-Location meta element to the site’s HTML in order to automatically suggest the onion service mirror to users accessing the site through Tor Browser.
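
In the page’s <head>, using the onion address shown earlier, this looks something like:

    <!-- Onion-Location meta element; Tor Browser reads this and offers the
         ".onion available" prompt shown in the image below -->
    <meta http-equiv="onion-location"
          content="http://kfp2vjmzkxmogotmtck3x5tefvn7yi77tsgfb6b6bsjgb3kxycypemid.onion/">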

Tor Browser automatically suggesting the onion service version of the site to a user accessing the clearnet version of a site

In order to implement the Onion-Location meta element, we have a script (see below) that runs regularly on the server to retrieve the automatically generated onion address from the onion-NGINX container, insert the new onion address into the HTML, and create a copy of the site in which all absolute links that specify the full www.copim.ac.uk domain are replaced with the onion service address.

Copies the COPIM Hugo site, finds the correct onion address, replaces the Onion-Location meta element in the original, and replaces the base_url with the onion service address in the onion mirror.
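
A sketch of the idea, with assumed container names, paths, and a hypothetical way of retrieving the onion address, might look like this:

    #!/bin/bash
    # Sketch of the idea only: container names, paths, and the way the onion
    # address is retrieved are assumptions, not the project's exact script
    set -euo pipefail

    CLEARNET_DIR=/srv/copim-site       # Hugo output served on the clearnet (assumed path)
    ONION_DIR=/srv/copim-site-onion    # copy served over the Faraday network (assumed path)

    # retrieve the automatically generated onion address (assumption: it is
    # available in a hostname file inside the onion-NGINX container)
    ONION_ADDR=$(docker exec onion-nginx cat /etc/onion/hostname)

    # refresh the onion mirror from the clearnet build
    rsync -a --delete "$CLEARNET_DIR/" "$ONION_DIR/"

    # point the clearnet site's Onion-Location meta element at the current address
    find "$CLEARNET_DIR" -name '*.html' -exec \
        sed -i "s|http://[a-z2-7]\{56\}\.onion|http://$ONION_ADDR|g" {} +

    # rewrite absolute links in the mirror so that they stay on the onion service
    find "$ONION_DIR" -name '*.html' -exec \
        sed -i "s|https://www.copim.ac.uk|http://$ONION_ADDR|g" {} +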

For good measure, we also include an onion symbol in the footer of the site which links to the current onion address.

footer of the COPIM website

4.0: conclusion

Through these few technical configurations, those of us working on the COPIM project attempted to practise good web ethics and protect the privacy of our users. It’s relatively easy for an experienced server administrator to set up their own web analytics platform that isn’t Google or Adobe, to redact IP addresses in server log files, and to set up an onion service mirror of a website. I’ve adopted these same models for protecting user privacy on my own website (the full configuration for which is available on GitHub) and on the website for the Centre for Postdigital Cultures at Coventry University. Hopefully, by openly sharing these technical details, we can demonstrate that there are alternative ways of engaging with our users on the internet that show more respect for them than passing their data to corporate third-party services, and ultimately that a different internet is possible.
