Open data sources: Their limitations and consumption strategies

KYC (Know Your Customer) processes allow companies to verify their customers' data, establish their risk level, and generate enough trust in the digital ecosystem for these users to access all types of digital services. A major shortcoming of institutions that perform KYC processes is that they tend to rely solely on the information provided by the customer to create their records and estimate their risk profile. For successful KYC processes and robust risk models, it is now necessary to complement the customer's data with open data sources maintained by governmental and private organizations. In this article, we analyze the importance of using open sources to gain a complete view of the client and a robust risk estimate.

One of the great challenges of KYC processes is to have a full view of customers in order to identify, in a timely manner, any possible risk associated with money laundering activities and with the use of or access to different banking services.

In this scenario, a major shortcoming of institutions performing KYC processes is relying solely on information provided by the client to create their records and estimate the client's risk profile. In many cases this information cannot be fully validated, or is omitted altogether. Even assuming that the information provided is true, such a limited set of information generates a high number of false positives (i.e., non-risky clients identified as risky) in the risk models [MIKKELSEN2019], and as a consequence, a large number of cases must be reviewed unnecessarily by the company's risk experts. This raises costs, makes low-risk clients uncomfortable because of the additional scrutiny, and significantly reduces the effectiveness of anti-money laundering efforts [MIKKELSEN2019]. To achieve robust KYC processes and risk models, it is necessary to incorporate other sources of information that are “reliable and independent” [SPIJKERS2020].

These external sources are generally databases built by national and international governmental institutions to facilitate access to citizen information. In most cases the intended users of these sources are not institutions dedicated to KYC processes but citizens themselves, who use them to obtain, for example, background certificates, credit histories, or statements of judicial proceedings. Fewer in number are the sources built through joint and localized efforts of governments to fight terrorist financing and money laundering. The Specially Designated Nationals and Blocked Persons List (SDN List), the United Nations Security Council Consolidated List, and the FBI’s Most Wanted List are some examples of these efforts.

These sources are commonly called “Open Data Sources” because they can be accessed directly through the internet. With the necessary authorizations from a citizen, a third-party entity can use these sources to strengthen its KYC processes. Since the “Panama Papers” case [BBC2016], many regulators around the world have promoted the use of open sources for KYC processes. In Latin American countries, for example, local regulations now commonly make the use of lists such as the OFAC SDN List mandatory. This information helps organizations quickly compare a potential customer against the entities and individuals sanctioned by the US. Similarly, it is now common to be required to check whether a user is a Politically Exposed Person (PEP), that is, an individual who, by virtue of their public prominence and functions, requires special care in a financial institution’s KYC due diligence. A PEP can potentially use their political privilege for personal gain [SPIJKERS2020]. The solutions provided by ReconoSER ID make it possible to check whether the user appears in the Panama Papers or is a PEP according to different legislations.
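
To make the list comparison concrete, the following is a minimal sketch of screening a customer name against a watchlist. It assumes the list has already been downloaded and reduced to plain names; the `normalize` and `screen` helpers, the `WATCHLIST` entries, and the 0.85 threshold are hypothetical choices for illustration, not the method of any particular provider.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and strip accents, punctuation, and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def screen(customer_name, watchlist, threshold=0.85):
    """Return (listed_name, score) pairs whose similarity meets the threshold."""
    target = normalize(customer_name)
    hits = []
    for listed in watchlist:
        score = SequenceMatcher(None, target, normalize(listed)).ratio()
        if score >= threshold:
            hits.append((listed, round(score, 2)))
    # Strongest candidates first, so a reviewer sees them at the top.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical entries for illustration; not taken from any real list.
WATCHLIST = ["Juan Pérez Gómez", "ACME Trading Ltd.", "María Fernanda Ruiz"]

print(screen("juan perez gomez", WATCHLIST))
print(screen("Ana Torres", WATCHLIST))
```

A fuzzy threshold trades missed matches against false positives, so in practice a hit would typically be confirmed with a second attribute such as a document number or date of birth before a case is escalated.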

Although the use of external data is on the rise, as of 2019 a surprisingly low number of institutions used additional data sources to enrich customer data and get a more complete view [AIDAR2019]. A notable example for the LATAM region is the Canadian regulation PCMLTFA (Proceeds of Crime (Money Laundering) and Terrorist Financing Act), which was updated in 2019. This regulation authorizes customer identification through external sources and even allows the use of identification data obtained by another entity to support accurate customer identification. There is a clear trend in new regulations toward a more holistic approach to KYC, whether through the use of technology, the use of open external data sources, or a combination of both.

Consumption Strategies 

The use of open data sources poses a set of technological challenges in today’s onboarding processes, where approval for financial products and services is expected within minutes. Making these processes scalable requires automation that still ensures an acceptable user experience. In other words, querying, extracting, integrating, and processing open sources requires fast and effective technological mechanisms.

The difficulty lies in the fact that open sources are available on the Internet without standard protocols or consumption strategies. It is therefore necessary to perform independent query and extraction processes for each source before the results can be integrated. Among the technologies used today are web scraping and robotic process automation (RPA).
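
One way to picture the per-source work followed by integration is a small adapter per source that maps its raw output to a shared record. The sketch below is an assumption-laden illustration: the `Finding` schema, both raw layouts, and every field name are hypothetical and do not describe the format of any real list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Common record produced by every source adapter (hypothetical schema)."""
    source: str
    full_name: str
    category: str


def from_sanctions_line(line: str) -> Finding:
    # Hypothetical raw layout: "LAST, first;program"
    name_part, program = line.split(";")
    last, first = [part.strip() for part in name_part.split(",")]
    return Finding("sanctions", f"{first} {last}".title(), program.strip())


def from_pep_record(record: dict) -> Finding:
    # Hypothetical raw layout: a dict with "nombre" and "cargo" keys.
    return Finding("pep", record["nombre"].strip().title(), record["cargo"])


# One adapter per source; supporting a new source means adding one entry.
ADAPTERS = {"sanctions": from_sanctions_line, "pep": from_pep_record}


def integrate(raw_by_source: dict) -> list:
    """Convert the raw items of each source into the common Finding record."""
    findings = []
    for source, raw_items in raw_by_source.items():
        adapter = ADAPTERS[source]
        findings.extend(adapter(item) for item in raw_items)
    return findings


raw = {
    "sanctions": ["DOE, jane;EXAMPLE-PROGRAM"],
    "pep": [{"nombre": " john roe ", "cargo": "Senator"}],
}
for finding in integrate(raw):
    print(finding)
```

Keeping the source-specific parsing inside each adapter means that a change in one website's layout only affects that adapter, while the downstream risk model keeps consuming the same record.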

Web Scraping 

Web scraping refers to the automatic extraction of data from a website. The information is collected and then exported into the format most useful for integration tasks. Because websites are built in many different ways, each source requires its own extraction process.

Starting from a URL, a web scraper loads the HTML, CSS, and JavaScript code that make up the page. It then extracts either all the data on the page or only the specific data required. This extraction step requires some knowledge of how the source is structured.
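
The extraction step can be sketched with Python's standard `html.parser` module. To keep the example self-contained, the page is an inline string (`SAMPLE_HTML`, a made-up structure) rather than a download; in a real scraper the HTML would first be fetched from the source's URL, and the `TableScraper` class and column names are hypothetical.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._row:
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Made-up page structure; every real source needs its own parsing rules.
SAMPLE_HTML = """
<html><body>
<table id="records">
  <tr><th>Name</th><th>Document</th><th>Status</th></tr>
  <tr><td>Jane Doe</td><td>12345</td><td>No records</td></tr>
  <tr><td>John Roe</td><td>67890</td><td>Open case</td></tr>
</table>
</body></html>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
scraper.close()

records = [dict(zip(("name", "document", "status"), row)) for row in scraper.rows]
print(records)
```

The parser only works because it knows the data lives in table cells, which is the "knowledge of the source structure" mentioned above; a redesign of the page would break it until the rules are updated.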

Robotic process automation (RPA)

In simple terms, RPA is a set of technological tools used to automate business processes. With RPA tools, an enterprise can configure software, usually called a “bot,” to capture and interpret applications in order to process a transaction, manipulate data, trigger responses, and communicate with other digital systems. RPA scenarios range from something as simple as generating an automatic response to an email to deploying thousands of bots, each programmed to automate a job such as querying open data sources.

Limitations of open data source queries 

While the benefits of consulting open data sources in KYC processes are well known, there are a number of limitations associated with their use:

  • Independence of the source: it must be guaranteed that the sources are constructed by impartial entities independent of third-party interests. 
  • Quality of the source: the source must provide reliable information, without errors that could lead to misinterpretations. 
  • Availability: as these are external sources that depend on third parties, 24/7 availability cannot be guaranteed. Therefore, KYC processes should not rely on a single source.
  • Data protection: each country's regulations on the processing of personal data must be complied with [PACHECO2020]. In Colombia, for example, the Superintendencia de Industria y Comercio (SIC) published in 2019 a guide on the processing of personal data for e-commerce purposes [SIC2019]. Consulting open data sources requires the citizen's approval and a clear description of what information will be used and how.
  • Technological regulation: in some countries, automatic data extraction with web scrapers can be considered a crime. The vaguely written US Computer Fraud and Abuse Act, for example, can make it a crime to access certain types of information programmatically [LAM2020]. Paradoxically, its use is considered justified for AML compliance purposes. Data collection is costly and complicated, but it is also an important tool for uncovering risks, and technologies that automate it are required to meet the needs of today’s KYC processes.
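
As an illustration of the availability limitation listed above, the following sketch tries several sources in order and falls back to the next one when a source cannot be reached. The `SourceUnavailable` exception, the `query_with_fallback` function, and both source functions are hypothetical stand-ins for real source clients.

```python
import time


class SourceUnavailable(Exception):
    """Raised by a source client when the external site cannot be reached."""


def query_with_fallback(document_id, sources, retries=2, delay_seconds=0.0):
    """Query each source in order, retrying before moving on to the next one."""
    failures = []
    for name, fetch in sources:
        for attempt in range(1, retries + 1):
            try:
                result = fetch(document_id)
                return {"source": name, "result": result, "failures": failures}
            except SourceUnavailable as exc:
                # Keep a trace of every failed attempt for later auditing.
                failures.append((name, attempt, str(exc)))
                time.sleep(delay_seconds)
    return {"source": None, "result": None, "failures": failures}


def primary_source(document_id):
    # Hypothetical source that is currently down.
    raise SourceUnavailable("timeout")


def secondary_source(document_id):
    # Hypothetical backup source that answers normally.
    return {"document_id": document_id, "records": []}


outcome = query_with_fallback(
    "12345",
    [("primary", primary_source), ("secondary", secondary_source)],
)
print(outcome["source"], len(outcome["failures"]))
```

Recording which source finally answered, and which ones failed, lets a risk analyst judge how much weight the result deserves when the preferred source was unavailable.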

Conclusions 

The use of open data sources has become a “must do” in KYC processes. Regulators in different countries have established that these sources are necessary to correctly determine a client’s potential risk. However, their use is not yet widespread and requires automated consumption strategies. At a technological level, it is common to use techniques such as web scraping, robotic process automation, and data integration strategies. Among the most relevant limitations of open data sources are the impartiality and independence of the source, its availability, and compliance with regulations on the processing of personal data.

Rubén Manrique 

Bibliography 

[LAM2020] Lam Thuy Vo (2020). Tecnología y Sociedad ‘Web scrapping’, la técnica para denunciar injusticias y que es delito a la vez. (https://www.technologyreview.es/s/12963/web-scrapping-la-tecnica-para-denunciar-injusticias-y-que-es-delito-la-vez). 

[PACHECO2020] Diego Pacheco (2020) Tratamiento de datos personales y sensibles en Colombia. https://reconoserid.com/tratamiento-de-datos-personales-y-sensibles-en-colombia/ 

[SPIJKERS2020] Nicolas Spijkers. (2020) PEPs en el mundo y en Colombia: ¿Cuál es el debido KYC? https://reconoserid.com/peps-en-el-mundo-y-en-colombia-cual-es-el-debido-kyc/. 

[SIC2019] Superintendencia de Industria y Comercio (2019). Guía sobre el tratamiento de datos personales para fines de comercio electrónico.

[MIKKELSEN2019] Daniel Mikkelsen, Azra Pravdic, and Bryan Richardson (2019). Flushing out the money launderers with better customer risk-rating models 

[BBC2016] BBC, 2016. ¿Qué son los Panamá Papers? https://www.bbc.com/mundo/video_fotos/2016/04/160404_video_panama_papers_investigacion_cof 

[AIDAR2019] Aidar Orunkhanov, 2019. The Power of External Data to Power Your KYC Program. https://www.tamr.com/blog/the-power-of-external-data-to-power-your-kyc-program/