The Pandora Papers have rocked the world. Since news organisations began publishing their explosive contents on October 3, the giant leak has dominated headlines and raised questions about the financial propriety of some of the world’s most powerful people.
Everyone from former UK prime minister Tony Blair to the King of Jordan has been dragged into a murky world of offshore finance, with stunning allegations being uncovered daily. And not for the first time, calls have been made to crack down on offshore financial products and institutions, and to instigate a fairer tax regime.
The Pandora Papers revelations came from an unfathomably large tranche of documents: 2.94 terabytes of data in all, spanning 11.9 million records and documents dating back to the 1970s. But how do you handle a massive leak of such size securely, when documents come in all sizes and formats, some dating back five decades?
The organisation behind the Pandora Papers leak, the International Consortium of Investigative Journalists (ICIJ), has spent the best part of a year coordinating simultaneous reporting from 150 different media outlets in 117 countries. And it has taken a lot of technical infrastructure to bring these stories of financial dealings to light. “We had data from 14 different offshore providers,” says Delphine Reuter, a Belgian data journalist and researcher at the ICIJ. Work began on analysing the data in November 2020.
“The first challenge for us was to get the data,” explains Pierre Romera, chief technology officer at the ICIJ. “We exchanged for weeks and months with the sources, and at a point we had to find a way to get the data.” Initially, the ICIJ brokered a deal with its sources that would allow them to send the data remotely without needing to travel, but as the size of the document dump grew, so did the challenges in ensuring it all could be sent to a secure server. Some members of the ICIJ team met directly with sources and collected huge hard drives containing the documents.
But the sheer size of the leak was still tricky to navigate. “They’re massive,” Romera says. Analysing such a volume of data isn’t a job for Excel or existing database management programs. “You can’t just go at it with classic tools. There’s nothing in the market for journalists that can ingest so much data.” Worse, four million of the files were PDFs – notoriously bad to interrogate. “PDFs are horrible to extract information from,” says Reuter. And they weren’t ordinary PDFs either: seemingly unrelated documents were scanned together into single PDF files without rhyme or reason. “You might have copies or emails or registers of directors within the information we were interested in,” she adds.
However, the ICIJ has had practice in parsing huge troves of information. The Panama Papers, which in 2016 uncovered the rogue offshore finance industry through 11.5 million leaked documents totalling 2.6 terabytes of data, gave the coalition of investigative journalists a set of best practices for handling all that data. “We created our own tools and technology to extract the text and make it searchable,” says Romera. That task fell to a team including Bruno Thomas, senior developer at the ICIJ, to prepare the data to be accessible for scores of reporters worldwide.
The ICIJ used two self-developed technologies in combination to comb through the documents. One, Extract, is able to share the computational load of extracting information between multiple servers. “When you have millions of documents, Extract is able to tell a server to look at one document and another server to look at another,” Romera says. Extract is part of a larger ICIJ project, called Datashare, which is a data structuring tool. “Everyone has to use Datashare to explore the documents,” says Reuter. “They can download documents to their own machine, but they have to use Datashare to search the documents because it’s not doable to go through 11.9 million documents without the system.”
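Extract’s actual implementation is the ICIJ’s own, but the core idea – sharing millions of documents out across many servers so extraction runs in parallel – can be sketched in a few lines of Python. All names here are illustrative, not taken from Extract’s code:

```python
from itertools import cycle

def assign_documents(documents, servers):
    """Round-robin a list of document paths across extraction servers.

    A simplified sketch of what a distributed extractor like ICIJ's
    Extract does: each server receives a share of the corpus, so one
    server looks at one document while another looks at the next.
    """
    assignments = {server: [] for server in servers}
    for doc, server in zip(documents, cycle(servers)):
        assignments[server].append(doc)
    return assignments

# Hypothetical corpus and server names, purely for illustration.
docs = [f"doc_{i}.pdf" for i in range(7)]
plan = assign_documents(docs, ["server-a", "server-b", "server-c"])
```

In practice a real system would also need retries, load balancing by document size and failure recovery; round-robin is just the simplest way to show the division of labour.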
Datashare was vital because just four per cent of the 11.9 million files the ICIJ received as part of the Pandora Papers were ‘structured’ – that is, organised in table-based file formats such as spreadsheets and CSV files. Those structured files are far easier to handle and interrogate. Emails, PDFs and Word documents are more difficult to search for data. Images, of which there were 2.9 million, are even more complicated to analyse computationally. Datashare parses all the documents, running scanned PDF files through optical character recognition (OCR) using Tesseract, an open-source system. Apache’s Tika Java framework was used to extract text from all the documents. “Tika can handle 50 or more different [types of] documents,” says Thomas. The data Tika extracts is then ultimately accessed through Datashare by the end user.
Without some kind of structure, the 600 partner journalists that the ICIJ worked with on the Pandora Papers would struggle to identify newsworthy nuggets of information contained within the millions of files they had access to. “The first step is to get the data and make it searchable,” says Romera.
The ICIJ tries to make it easier by offering them access to Datashare, but also by directing them to the newsworthy stories in each country at the beginning of the project. The team at the ICIJ developed a ‘country list’ – a tally of how many times countries or people of interest appear in the documents. People of interest are then identified by country, and partners are contacted and told there is a list of people connected to their country.
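A minimal stand-in for that country list is a frequency count over the extracted text. This sketch assumes the text has already been extracted and that simple substring matching is good enough, which real entity matching would not be:

```python
from collections import Counter

def country_list(documents, countries):
    """Tally how often each country of interest appears across a corpus.

    A minimal sketch of the ICIJ's 'country list' idea: count mentions
    so partners in each country can be pointed at relevant documents.
    """
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for country in countries:
            counts[country] += lowered.count(country.lower())
    return counts

# Toy documents, purely for illustration.
docs = [
    "Company registered in Panama via a Jordan trust",
    "Jordan trust deed, schedule of assets",
]
tally = country_list(docs, ["Panama", "Jordan"])
```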
One of the ways Datashare manages to pull out those lists of names is through batch searches. The ICIJ has developed a tool that allows people wanting to interrogate the documents to supply a list of names or different queries in CSV format that are cross-checked against the metadata in the documents themselves. “That’s incredibly helpful, because then the information is already structured, and you can export the results in CSV into any spreadsheet software and go through the results,” says Reuter. The ICIJ also uses machine learning to try to classify documents into broad clusters, helping differentiate, for instance, between documents related to the creation of a company, a personal letter, or a duplicate of other documents.
“Graph databases excel at spotting data relationships at scale,” says Emil Eifrem, CEO of Neo4j, a graph technology company whose products are used by the ICIJ. Instead of breaking up data artificially, graph databases more closely mimic the way humans think about information. “Once that data model is coded in a scalable architecture, a graph database is matchless at mining connections in huge and complex datasets,” Eifrem says.
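In Neo4j that “mining connections” question is asked in its Cypher query language; the same idea can be shown in plain Python as a shortest-path search over an adjacency list. The entity names below are invented for the example:

```python
from collections import deque

def connection_path(graph, start, goal):
    """Breadth-first search for the shortest chain linking two entities.

    Graph databases like Neo4j answer 'how is X connected to Y?'
    natively; this sketch shows the underlying idea on a toy
    adjacency-list graph of officers, companies and intermediaries.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection found

# Hypothetical entities, purely for illustration.
graph = {
    "Officer A": ["Shell Co"],
    "Shell Co": ["Officer A", "Law Firm"],
    "Law Firm": ["Shell Co", "Trust B"],
    "Trust B": ["Law Firm"],
}
```

A dedicated graph database does this at the scale of millions of nodes, with indexing and query planning a hand-rolled search cannot match.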
Sorting and interrogating the data was “much harder than the Panama or Paradise Papers,” says Romera. Although the dataset is of a similar size to those two leaks, the individual documents are significantly bigger in page count – around ten times bigger – than those in the Panama Papers. “The system we used until now to search into the documents was not powerful enough to handle such a massive amount of big documents,” says Romera. As a result, the ICIJ had to improve the configuration of its servers, and the way its search tools operated, to handle these new files. “There were huge 10,000-page PDF files,” says Thomas. “We had to cut those PDF files into pages, gather those pages into logical forms, and then we had to extract the data – like beneficial owners and their nationalities from unstructured data.”
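The regrouping step Thomas describes – pages cut out of a huge PDF, then gathered back into logical documents – can be sketched as a simple scan with a document-boundary heuristic. The predicate used here (a page header in capitals) is an invented stand-in for whatever rules the ICIJ actually applies:

```python
def split_into_documents(pages, is_first_page):
    """Group a flat list of pages back into logical documents.

    Sketch of regrouping pages from a huge multi-document PDF:
    start a new document whenever a caller-supplied heuristic says
    a page looks like the first page of a fresh document.
    """
    documents, current = [], []
    for page in pages:
        if is_first_page(page) and current:
            documents.append(current)
            current = []
        current.append(page)
    if current:
        documents.append(current)
    return documents

# Toy pages; the all-caps-header rule is purely illustrative.
pages = [
    "CERTIFICATE OF INCORPORATION of Shell Co",
    "page 2: registered agent details",
    "REGISTER OF DIRECTORS of Shell Co",
    "page 2: director appointments",
    "page 3: director resignations",
]
docs = split_into_documents(
    pages, lambda p: p.startswith(("CERTIFICATE", "REGISTER"))
)
```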
In addition, the Pandora Papers included a broader range of file types and formats, which the machine learning systems the ICIJ previously used had to be trained on before they could parse, identify and sort the documents. “It’s now able to read very specific financial documents and very specific PDFs,” says Romera.
The 600 or so partner journalists then interrogated the data by accessing the ICIJ files through a secure authentication platform. Contact with the ICIJ relies on PGP-encrypted email, and multi-factor authentication is required to access the servers – of which there are up to 60 running, a number that can expand to 80 when indexing files. SSL client certificates were also a must-have for partner journalists. “Sometimes it can be hard for partners to just connect to our servers,” admits Romera. Once connected, however, media partners can perform their own analysis: a data-sharing API allows data scientists working for them to mine the Pandora Papers documents using their own scripts or machine learning tools.
“We have to be ready for anything all the time,” says Romera. “It can turn you paranoid, because there’s so much at stake here.”
And for good reason: the ICIJ believes the servers hosting the Pandora Papers have been targeted at least twice since it and its partners began approaching the politicians and businesspeople named in the documents for comment in the week before publication. “As soon as we started to send comment papers, we started to have attacks on the servers,” says Romera. On October 1, the ICIJ website withstood a distributed denial of service (DDoS) attack that bombarded it with six million requests a minute, Romera says. Another suspected attack occurred on October 3, when the servers started showing unusual behaviour; this is currently under investigation. “When the server’s thought to be crazy, the priority is to fix it, not to find someone in the system,” says Romera. “We’re investigating to know if we had an intrusion.”
It also reinforces the importance of the ICIJ’s standard operating procedure: withdrawing partner access to the documents within a few weeks of the first stories breaking, and requiring partners to restate their interest in order to regain access, so that bad actors cannot find a way in through insecure third-party accounts.