The digital fingerprint has been used to identify users, deliver personalized advertising, optimize the performance of web pages and even for statistical purposes. It also lends itself to being used as one element in the process of identifying digital users. In this paper we explore some of the mechanisms used to take advantage of the digital fingerprint in identification processes, as well as the risks they may present.
Due to the close relationship between people and their digital devices, device identification can be very useful for tasks where it is necessary to identify the user, such as sending personalized advertising, or as an authentication factor in sensitive services such as digital banking or medical services. The identification of a device, or identification by digital fingerprint, has been performed for more than a decade with varying degrees of success. In particular, two broad categories can be established for this identification: identification by device fingerprinting and identification by web browsing.
Device fingerprinting focuses on the device used to access the internet, such as a tablet, a personal computer or a cell phone. In particular, the latter is the most interesting due to the high penetration of mobile telephony (it is estimated that in Latin America alone there were 343 million mobile internet users in 2019 [GSM2020]). In other words, being able to identify a cell phone user makes it possible to identify more than half of the region’s population.
On the other hand, identification by web browsing is the most widely used by different web services in order to know when a user returns to their web page without having to rely on cookies [RID2020], which can be blocked by the browser or deleted by the user. This type of identification takes advantage of the information exchange that takes place when accessing a web page, which is used for example to optimize the presentation of webpage content on different devices.
These two methods are not mutually exclusive, and they are very useful in processes that seek to guarantee that the person accessing a service is who they say they are, such as banking services. That is, by establishing that the user not only presents the ordinary authentication credentials, such as a username and password, a photo for facial biometrics or a PIN, but also does so from a known device, the level of trust in the transaction is raised, which promotes a smoother user experience and a higher level of security for the digital ecosystem. However, their indiscriminate use can affect the privacy of users, especially on sites where this information is collected and shared with third parties without disclosing this process to users. Below we discuss the two types of identification, followed by some of the risks associated with these processes.
Digital device fingerprint
A fundamental principle of any identification process, and therefore of identification by digital fingerprint, is that only through the combination of attributes is it possible to accurately identify a person or device. Among the elements that we can consider in a device are its hardware characteristics, which can allow us, for example, to establish that a phone is an iPhone and not a Samsung or another brand by the type of processor it uses. It is also possible to identify the device by the software it has installed. The most immediate example follows from the one above: we can easily identify whether the operating system is Android or iOS. Additionally, mobile devices have sensors that measure different user variables, such as the way they walk or type. While these measurements can help the identification process, they fall under behavioral biometrics, which will not be covered in this document.
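The principle that only a combination of attributes identifies a device can be illustrated with a minimal Python sketch. The device list, the attribute choices and the helper `unique_fraction` are hypothetical and exist only for illustration; they are not taken from any of the cited works.

```python
from collections import Counter

# Hypothetical toy population: each tuple is (brand, os_version, screen_width).
devices = [
    ("iPhone", "iOS 16", 1170),
    ("iPhone", "iOS 16", 1284),
    ("iPhone", "iOS 17", 1170),
    ("Samsung", "Android 13", 1080),
    ("Samsung", "Android 13", 1440),
    ("Samsung", "Android 14", 1080),
]

def unique_fraction(records, indexes):
    """Fraction of records whose values at `indexes` are shared with no other record."""
    keys = [tuple(record[i] for i in indexes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

brand_only = unique_fraction(devices, [0])         # brand alone
brand_and_os = unique_fraction(devices, [0, 1])    # brand + OS version
all_attributes = unique_fraction(devices, [0, 1, 2])  # brand + OS + screen width
```

In this toy population the brand alone singles out no device, the brand together with the operating system version singles out a third of them, and the three attributes together single out every device.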
An important quality that an identification process should have to guarantee its success is that it should seem effortless for the user. For example, in [KHO2018] the authors seek to characterize a device by means of its implementation of the TCP protocol, which is used to connect to web pages served over HTTP. In this case, all that is required is that the user connects to the digital service, i.e., there is no extra effort for the user. However, this solution achieved only 75% accuracy, which suggests that other complementary methods are required. Some of these methods can be based on the variations of the different sensors that a device has, such as microphones, accelerometers, gyroscopes and GPS, among others. This, however, requires a characterization of different devices, which can be cumbersome and, as in the case of TCP, yields varying levels of accuracy.
Web digital fingerprint
Web fingerprinting is the most common form of identification, and unfortunately also the one most commonly used to abuse and violate the privacy of web users [RID2020]. In [ACA2013], web fingerprinting is classified into four categories: i) JavaScript-based, ii) plugin-based, iii) extension-based and iv) header-based.
JavaScript-based identification is possible because of the information that is exchanged in order to optimize the display of web pages. Attributes such as screen resolution, language, time zone, installed fonts, browser or operating system are used to identify a user. Because this type of identification is so successful, many websites follow this procedure, storing the information and even selling it without the user’s knowledge. Unlike with cookies, which the user can delete, every time a user visits a site they can be recognized as the visitor of another site that stores the same information.
In plugin-based identification, plugin APIs are used because they are allowed to provide additional information about the user. According to [ACA2013], through these APIs it is possible to obtain information such as the kernel version of the device, or whether the user has more than one screen connected.
In the third category, the extensions that users install in their browsers are used to obtain user information. Paradoxically, installing some of the extensions meant to block advertising and prevent identification can itself become another identification attribute.
Finally, the headers exchanged in communications, or IP addresses, can be used as an element of identification. However, the effectiveness of this approach may be limited when proxies or NAT are used. The use of DNS for identification could also be included in this category. [KLE2019] proposes a method that leverages caching in the address resolution process to identify a client that visits a web page belonging to the party interested in making the identification. This is achieved by associating many IP addresses with a domain, where each IP address responds differently to the JavaScript code running on the client.
Because the browser saves address resolutions to avoid connecting to the DNS server frequently, it is possible to identify users as they navigate and connect to the domain of the entity that seeks to identify them. This is possible as long as the DNS cache is not deleted.
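Returning to the JavaScript-based category, the following minimal Python sketch shows how a server might turn a set of collected attributes (screen resolution, language, time zone, fonts, browser, operating system) into an identifier that can be compared across visits and across sites. The attribute names and values, the function `fingerprint_id` and the choice of SHA-256 over a canonical JSON serialization are assumptions made for illustration; the cited works do not prescribe this construction.

```python
import hashlib
import json

def fingerprint_id(attributes: dict) -> str:
    """Return a stable hex identifier for a set of browser attributes."""
    # Sort keys so that the same attributes always serialize identically,
    # regardless of the order in which they were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical attributes reported by a script on a first site.
visit_site_a = {
    "screen": "1920x1080",
    "language": "es-CO",
    "timezone": "America/Bogota",
    "fonts": ["Arial", "Calibri", "Verdana"],
    "browser": "Firefox",
    "os": "Windows",
}
# The same browser seen by a second site: same values, collected in another order.
visit_site_b = dict(reversed(list(visit_site_a.items())))
# A different browser that differs in a single attribute.
other_browser = {**visit_site_a, "timezone": "Europe/Madrid"}

same_visitor = fingerprint_id(visit_site_a) == fingerprint_id(visit_site_b)
different_visitor = fingerprint_id(visit_site_a) != fingerprint_id(other_browser)
```

Two sites that store this identifier can link the same visitor without any cookie. An exact hash is a deliberate simplification: a change in any single attribute produces a different identifier, so a practical system would likely compare attributes individually instead.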
As can be seen, these techniques work better as more of the device’s attributes are stored, and they are useful for authentication processes in which real users can access the system in a seamless manner, while those who are not trusted enough go through a more stringent process. On the other hand, some attributes are more significant than others because they are shared with a smaller number of people. Additionally, the use of tools that seek to anonymize the user sometimes gives rise to contradictory data that ends up helping the identification process. As a result, these identification techniques are frequently used, often without users having consented to, or even knowing, how they are being identified.
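Both ideas in this paragraph, that rarer attribute values are more significant and that trusted users can follow a smoother path, can be sketched together in Python. The weight of an attribute is taken here as its surprisal, -log2(p), where p is the share of the population presenting the same value. The population shares, the attribute names, the 0.8 threshold and the functions `match_score` and `required_step` are all hypothetical; for simplicity, one share is assumed per attribute rather than per value.

```python
import math

# Hypothetical share of the user population presenting the known device's value
# for each attribute (smaller share = rarer = more identifying).
POPULATION_SHARE = {
    "os": 0.40,
    "language": 0.10,
    "timezone": 0.05,
    "fonts": 0.001,
}

def surprisal_bits(share: float) -> float:
    """Bits of identifying information carried by a value seen in `share` of users."""
    return -math.log2(share)

def match_score(known: dict, observed: dict) -> float:
    """Fraction of the known device's identifying bits that the observed device matches."""
    total = sum(surprisal_bits(POPULATION_SHARE[name]) for name in known)
    matched = sum(
        surprisal_bits(POPULATION_SHARE[name])
        for name, value in known.items()
        if observed.get(name) == value
    )
    return matched / total

def required_step(score: float, threshold: float = 0.8) -> str:
    """Hypothetical policy: seamless access at or above the threshold, extra checks below."""
    return "seamless" if score >= threshold else "step-up"

known_device = {"os": "Android", "language": "es-CO",
                "timezone": "America/Bogota", "fonts": "set-17"}
same_fonts_new_os = {**known_device, "os": "iOS"}        # common attribute changed
different_fonts = {**known_device, "fonts": "set-99"}    # rare attribute changed
```

Under these assumed shares, a mismatch in the common operating system attribute still leaves the score above the threshold, whereas a mismatch in the rare font set drops it below and would trigger the more stringent process.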
Risks in digital trace analysis
As mentioned in [RID2020], identification processes are not completely transparent when browsing the Internet. An emblematic case is reCAPTCHA, which accumulates user information without any warning. It is not the only case, however: many web pages use JavaScript-based identification, and the information collected is exchanged and sold by actors invisible to users. The fact that information from different domains can be integrated makes it possible to tie together the identity of a user who visits a bank, a medical page and a newspaper, allowing inferences that can be harmful to individuals. Furthermore, because users are not aware of the information being collected, they cannot exercise their right to habeas data.
Therefore, for these types of tools to be used successfully, users need to know what type of information is being collected and what use is being made of it. Otherwise, a negative perception of these tools, which as we have mentioned can be very useful for reducing fraud in the digital world, can make them impractical and lead the general population to reject them.
Conclusions
Identification by means of the digital trace has very interesting advantages for digital transactions because it can be used as an additional identification factor to secure digital services, while limiting the burden of authentication on users in whom there is a high degree of confidence. The fact that people use the same devices to connect to digital services, together with the repetition of patterns, can help improve the user experience in authentication processes. However, the use of these tools must be tied to good communication with users, so that there is clarity about what is done with their information and with whom it is shared, while allowing them to access, modify or delete their personal data.
Diego Pacheco-Páramo
Translated by: Anasol Monguí
Bibliography
[GSM2020] La economía móvil en América Latina 2020. GSM Association.
[RID2020] Privacidad y navegación por internet. D. Pacheco-Páramo, 2020. ReconoSER ID. https://reconoserid.com/privacidad-y-navegacion-por-internet/
[KHO2018] Device Fingerprinting for Authentication. Z. Khodzaev et al. ELECO 2018.
[ACA2013] FPDetective: Dusting the Web for Fingerprinters. G. Acar et al. CCS’13, November 4–8, 2013.
[KLE2019] DNS Cache-Based User Tracking. A. Klein and B. Pinkas. Network and Distributed Systems Security (NDSS) Symposium 2019, 24-27 February 2019, San Diego, CA, USA.