Cookie-less tracking

Under GDPR, consent for analytics tracking on digital assets must be given explicitly. This default opt-out causes a drop in measured traffic, which blurs reports and weakens decisions based on web analytics. A way to recover the missing traffic information was needed.

Objective:

  • Develop a GDPR-compliant, consent-agnostic tool to recover traffic.

  • Seamlessly integrate the new data into the existing web analytics data pipeline.

Result:

  • ~95% of traffic recovered and integrated into the data flow.

Overview

Building engine

Used anonymized server logs and AWS (S3, Lambda, Step Functions) to parse logs every 5 minutes and send them to Google Analytics for seamless integration.

Testing, improvement and dashboard

Ran testing and improvement loops, then built Looker dashboards to present the data to stakeholders.

Defining Architecture

Worked with various teams (IT, Legal) and external providers (lawyer, data privacy) to design and validate a tool compliant with regulation (GDPR, ePrivacy, EU-US data transfer).

Skills

Detailed Methodology

Getting the data

It was an exciting project that went beyond the usual data analysis, as it also involved a bit of data science, data engineering, project management and legal analysis.

We were able to recover ~95% of all traffic (IP addresses had to be cropped for anonymization, which entailed a loss of cardinality).

Further development could extract more information from the logs, such as more precise interactions with the page.

Server access logs record the resources served to a website's users, along with the timestamp, IP address and user agent. This information is required by networking protocols to deliver the resources the user requested. It is therefore treated like "level 1" functional cookies and does not require consent. The data is stored outside the user's hardware and can thus be used as long as the IP address is anonymized.

Two log formats were taken into account: Common Log Format (cf. here) and the WordPress HTTP request log format (cf. here).
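As an illustration of what reading such a log involves, here is a minimal sketch of a Common Log Format parser. It is not the project's actual code: the function name `parse_clf_line` and the choice to also accept the optional referrer and user-agent fields of the "combined" variant are assumptions made for the example.

```python
import re
from datetime import datetime

# Common Log Format, optionally followed by the "combined" referrer and
# user-agent fields. The pattern is an illustrative assumption, not the
# project's parser.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 10/Oct/2023:13:55:36 +0000
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" size means no body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('192.0.2.0 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
print(parse_clf_line(sample)["path"])
```

Lines that do not match (for example malformed requests) return `None`, so a caller can count and skip them rather than fail the whole batch.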

Various anonymization functions. Source: ARTICLE 29 DATA PROTECTION WORKING PARTY, Opinion 05/2014 on Anonymisation Techniques

Defined anonymisation function

The Article 29 Data Protection Working Party (the predecessor of the European Data Protection Board) issued recommendations (such as here) on data anonymization: the re-identification risk should be as low as possible on these 3 factors:

  • Singling out: which corresponds to the possibility to isolate some or all records which identify an individual in the dataset;

  • Linkability: which is the ability to link, at least, two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases).

  • Inference: which is the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.

Source: ARTICLE 29 DATA PROTECTION WORKING PARTY, Opinion 05/2014 on Anonymisation Techniques

Multiple anonymization techniques can be considered, with varying effectiveness against each risk factor. A composition of these functions was therefore considered and validated as our anonymisation function.
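The exact composition that was validated is not detailed here. As a hedged sketch of one building block, the IP cropping mentioned in the conclusion could look like the following. The function name `crop_ip` and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions for illustration, not the values validated by the legal review.

```python
import ipaddress


def crop_ip(ip, v4_prefix=24, v6_prefix=48):
    """Zero the host part of an IP address so it no longer singles out a device.

    The prefix lengths are illustrative assumptions.
    """
    address = ipaddress.ip_address(ip)
    prefix = v4_prefix if address.version == 4 else v6_prefix
    # strict=False lets us build the enclosing network from a host address.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


print(crop_ip("203.0.113.57"))           # 203.0.113.0
print(crop_ip("2001:db8:abcd:1234::1"))  # 2001:db8:abcd::
```

Cropping merges every visitor behind the same prefix into one address, which is the loss of cardinality that keeps the recovery rate below 100%.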

Defined Architecture

One way to integrate this data seamlessly and easily into the existing web analytics data flow is to send it to Google Analytics using the Google Measurement Protocol (cf. here). The advantage is that Google Analytics processes the recovered traffic into sessions and hits the same way it does for regular traffic.

We therefore needed to build an engine, triggered by the arrival of new logs, that transforms server access logs into HTTP requests accepted by Google Analytics. This transformation must respect the Measurement Protocol and session time-outs.
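To make the transformation concrete, here is a minimal sketch that turns one parsed, anonymized log record into a Measurement Protocol pageview payload. It assumes the Universal Analytics version of the protocol (the `/collect` endpoint with `v=1`), which the text does not state explicitly. The record layout, the function name `build_pageview_payload` and the way the client ID is derived from the cropped IP and user agent are all hypothetical.

```python
import uuid
from urllib.parse import urlencode

# Universal Analytics Measurement Protocol endpoint (assumed version).
COLLECT_ENDPOINT = "https://www.google-analytics.com/collect"


def build_pageview_payload(record, tracking_id, hostname, now):
    """Encode one log record as a Measurement Protocol pageview hit.

    `record` is a hypothetical dict with keys ip, user_agent, path, time.
    """
    # Assumption: a stable pseudo client ID derived from anonymized fields,
    # so that hits from the same visitor end up in the same session.
    visitor_key = f'{record["ip"]}|{record["user_agent"]}'
    client_id = uuid.uuid5(uuid.NAMESPACE_URL, visitor_key)
    # Queue time tells Google Analytics how long ago the hit occurred.
    queue_time_ms = max(0, int((now - record["time"]).total_seconds() * 1000))
    params = {
        "v": "1",                    # protocol version
        "tid": tracking_id,          # property ID
        "cid": str(client_id),       # client ID
        "t": "pageview",             # hit type
        "dh": hostname,              # document host
        "dp": record["path"],        # document path
        "ua": record["user_agent"],  # user agent override
        "uip": record["ip"],         # IP override (already cropped)
        "aip": "1",                  # ask GA to anonymize the IP as well
        "qt": str(queue_time_ms),    # queue time in milliseconds
    }
    return urlencode(params)
```

The returned string would be sent as the body of a POST request to `COLLECT_ENDPOINT`. Because the queue time is computed from the log timestamp, the delay between the page view and the processing run affects how reliably hits are attributed to the right moment.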

Built Engine

This tool was developed using AWS.

Server access logs were stored in S3 buckets.

As the transformation was done using only Python scripts, several Lambdas, each representing a specific step of the architecture, were used and orchestrated with a Step Function.

The Step Function was triggered when new logs were dropped in the S3 bucket. The trigger passed a list of filenames to be retrieved by the Lambda.

As the order in which requests are received matters to Google Analytics, the Step Function must finish executing before the next drop of logs. Efficiency therefore had to be considered during development.
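The first Lambda in such a chain could be sketched as follows. This is not the project's code: the event shape (`bucket` and `keys`), the handler name and the optional `s3_client` argument (added so the function can be exercised without AWS credentials) are assumptions for the example.

```python
def handler(event, context, s3_client=None):
    """Hypothetical Lambda entry point: read the log files listed in the event.

    Assumed event shape: {"bucket": "<name>", "keys": ["<file>", ...]}.
    """
    if s3_client is None:
        import boto3  # provided by the AWS Lambda Python runtime
        s3_client = boto3.client("s3")

    lines = []
    for key in event["keys"]:
        response = s3_client.get_object(Bucket=event["bucket"], Key=key)
        body = response["Body"].read().decode("utf-8", errors="replace")
        # Skip blank lines so later steps only see actual log entries.
        lines.extend(line for line in body.splitlines() if line.strip())

    return {"line_count": len(lines), "lines": lines}
```

Returning the raw lines keeps the sketch short. Since Step Functions limits the size of the payload passed between states, a real state machine would more likely pass references (for example S3 keys of intermediate files) from one Lambda to the next.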

Integration, testing & Improvements

To integrate with existing traffic and deduplicate it (consented traffic also generates access logs), we used a pixel that was filtered out in Google Analytics. The presence of the pixel in a visitor's server log traffic indicated consented, duplicate traffic to remove.

The pixel integration was built with GTM and a custom JavaScript tag. It was inserted (and thus requested from the server) when analytics cookies were accepted.

Testing highlighted the need for fast execution and helped set the right frequency for log drops (trading off execution time and engine start-up time against log quantity).

As the new data was already integrated into the data pipeline, further integration was straightforward: the new data already had the same structure as the existing data. We only needed to switch data sources in Looker Studio and set specific filters to split regular traffic from recovered traffic.
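The deduplication rule can be illustrated on parsed log records. The project applied the filter in Google Analytics; the sketch below applies the same idea earlier, before sending, purely to show the logic. The pixel path, the record layout and the use of (cropped IP, user agent) as the visitor key are all hypothetical.

```python
from collections import defaultdict

PIXEL_PATH = "/consent-pixel.gif"  # hypothetical path of the consent pixel


def drop_consented_traffic(records, pixel_path=PIXEL_PATH):
    """Keep only records from visitors who never requested the consent pixel.

    `records` is a list of dicts with keys ip, user_agent, path, time.
    """
    by_visitor = defaultdict(list)
    for record in records:
        by_visitor[(record["ip"], record["user_agent"])].append(record)

    recovered = []
    for visitor_records in by_visitor.values():
        # A pixel request means the visitor accepted analytics cookies, so
        # the regular tag already reported this traffic.
        if any(r["path"].startswith(pixel_path) for r in visitor_records):
            continue
        recovered.extend(visitor_records)

    # Restore chronological order, since hit order matters downstream.
    return sorted(recovered, key=lambda r: r["time"])
```

With cropped IP addresses, two visitors sharing a prefix and a user agent collapse into one key, so a rule like this can discard some non-consented traffic along with the consented duplicate.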

Conclusion

Example of a server access log following the Common Log Format.
