In late 2024 I realized that anti-money laundering (AML) laws effectively require AI tools for compliance. This blog post explains the nature of the problem, with a very specific example, and what I built to solve it. I won't be turning this into a commercial product, but I learned a tremendous amount about the future of compliance by building a solution. If you want to truly understand a problem, you should be able to put forward a solution. I did, and the approach is applicable to many other areas of law. The result is a complex data tool with a simple interface: https://screencomply.com/aml_prototype/.

The Problem Of Politically Exposed Persons

AML is not possible to do well without access to huge amounts of structured data that simply doesn't exist. For example, Canada's federal AML law requires thousands of registered companies, like banks and securities dealers, to screen for politically exposed persons (PEPs), a defined term that is impossible to fully comply with at the moment. AML professionals scoff at the idea that the law is impossible to comply with, but I've never met anyone who could point me to a government database of every mayor in Canada. There is no such list. There isn't even a list of municipalities in Canada. Even at the provincial level, Ontario does not maintain a list of mayors, or even an up-to-date list of municipalities. The people who designed the AML law must have assumed that vendors would somehow build this, but how would they keep it up to date? By manually reviewing thousands and thousands of municipal websites every week?

Beyond current mayors and reeves, the law also covers anyone who held those positions within the last five years, and many other positions, such as deputy ministers, a layer of bureaucrats who work in the weeds and are not always publicly listed. Where positions are listed, the data is often inaccurate. There are many thousands of PEPs in Ontario alone, and the Ontario government doesn't provide a list of who they are, so people can't properly comply with the federal law. And at the federal level there are many more PEPs, including judges, certain military officers, and key people in political parties that have a certain level of support in Parliament.

In other words, the government has made a rule that can't be complied with using any resource the government itself provides, and that is poorly handled by vendors. So how do people comply with the law? They rely on asking customers. Most regulated companies have a checkbox that customers must tick to certify that they are not a PEP. But how many customers even understand the checkbox? How many people become mayor of some small town and never notify their bank that they are suddenly a candidate for high-risk treatment by the anti-money laundering department?

The Solution: Step 1 Is Crawling

Here's the prototype: https://screencomply.com/aml_prototype/.

The solution to the problem of PEPs is a crawler that works through thousands of websites across Canada automatically and in an intelligent fashion. The approach I picked was a crawler guided by an LLM: for each page it reads, the AI is asked to identify which links are most likely to lead to a page listing politically exposed persons, with different questions depending on the nature of the website.
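Before the LLM is asked anything, each page has to be reduced to a clean list of candidate links. The following is a minimal sketch (in Python, though my system is written in PHP) of that preparation step: extract the links, resolve them to absolute URLs, keep only same-domain links, and drop URLs already visited. All names here are illustrative, not from the real system.

```python
# Sketch of the per-page link-preparation step that feeds the LLM prompt.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def candidate_links(page_url, html, seen_urls):
    parser = LinkExtractor()
    parser.feed(html)
    domain = urlparse(page_url).netloc
    out = []
    for href in parser.links:
        absolute = urljoin(page_url, href)    # convert relative links
        if urlparse(absolute).netloc != domain:
            continue                          # stay on the same domain
        if absolute in seen_urls or absolute in out:
            continue                          # skip already-visited pages
        out.append(absolute)
    return out
```

The cleaned list is what gets injected into the prompt below as the set of links the model chooses from.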

Here's one of the prompts that the crawler uses (with dynamic injection of some data from the crawling process):

$prompt = 'The Task:

----------

This website is for '.$title.'.

You are looking to identify which links are most likely to lead to the target that you are looking for, and then identify which URLs are likely to lead to the target based on the information that is given below. You will be returning an array of links inside a field named bestURLs, and you will explain your thinking in a field called explanation.

Some tips for finding the best links:

The best link is almost always on the same domain as the pages. The only exception is when the website is moving to a new site because there has been a reorganization that involves a change of domain name going forward.

The best link is the one that is likely to lead to a page that helps to find the target.

There is often a page that lists key staff members.

Target:

-------------------'."\n".$target;

if($valid_n != '-1' && $valid_n != 'many' && !empty($valid_n))

$prompt .= "There are this many people that you are looking for (i.e. the quantity of people that match the criteria of the target above): $valid_n\n";

$prompt .='

Once You Know Who You Are Looking For:

-------------------------------------

You will also tell me which of the following URLs are most likely to contain data that would be helpful to know more about the targets:

'.implode("\n",$links).'##

Skip List:

----------

You should avoid visiting a URL if it is already known. So if you see a URL in the list below, skip it:

'.implode("\n",$seen_URLs).'

Example Output:

---------------

You will provide me with a JSON output that corresponds to this example:

{

"bestURLs":["https://example.org/example/example2.html","https://example.org/abc/example-specific-listing-page.html"],

"explanation":"This is where you will write a two sentence description of who you think the target is based on the available information"

}

Provide the absolute URL for each returned link in bestURLs. The links must be on the same domain as the pages so far captured.

Provide the bestURLs in the order of most likely to contain the target to least likely. Always convert to absolute URLs, using your best guess of the URL to use based on what is provided in this prompt.

The pages known so far:

------------------------------------

'.$txt_pages.'

Site Description:

---------

The website is: '.$description_of_site.'.

JSON only:

---------

Only answer using JSON as the response, in the format of the above example that has the field "bestURLs". Provide at most 50 bestURLs, all as absolute links (not relative URLs).';
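The crawler then has to consume the model's JSON reply and decide what to visit next. Here is a minimal sketch (in Python, the actual system is PHP) of that consumption step: parse the reply, keep only well-formed same-domain absolute URLs, skip anything already seen, and queue the survivors in the model's ranked order. Everything except the "bestURLs" field name is illustrative.

```python
# Sketch of turning the LLM's JSON reply into a crawl queue.
import json
from urllib.parse import urlparse

def queue_from_reply(reply_text, site_domain, seen_urls, limit=50):
    reply = json.loads(reply_text)
    queued = []
    for url in reply.get("bestURLs", [])[:limit]:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue                      # reject relative or malformed URLs
        if parts.netloc != site_domain:
            continue                      # enforce the same-domain rule
        if url in seen_urls or url in queued:
            continue                      # the skip list from the prompt
        queued.append(url)                # preserve the model's ranking
    return queued
```

Validating the reply in code, rather than trusting the model to follow the prompt, matters in practice: models do occasionally return relative or off-domain URLs despite the instructions.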

The Solution: Step 2 Is Parsing

As the crawler goes through a site, it checks whether it has found the PEPs it is looking for. The output of the crawling process is a set of JSON files, which are then evaluated by various rules to judge whether the extracted people are the right ones, and to validate their PEP status against the definition. Each site results in a JSON file like this:

{

"entities": [

{

"name": "Paul Kortenaar",

"url_IDs": [

"499634",

"505264",

"505265"

],

"positions": [

[

"Chief Executive Officer of Ontario Science Centre",

"",

""

]

]

}

],

"sources": {

"499634": {

"title": "Media Room | Ontario Science Centre",

"site_title": "Ontario Science Centre",

"country": "Canada",

"subnational": "Ontario",

"type": "Agencies",

"titles": [

"Chief Executive Officer",

"CEO",

"President",

"Executive Director",

"Chairman",

"Chair",

"Chairperson",

"Chairwoman",

"Chair Of The Board",

"Director"

],

"ts_processed": 1733254226,

"ts_accessed": 1730982164,

"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/53ffc2e917d0e5f63fb7a57fa4c67d00\/20241107\/1730982138\/top.jpg",

"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/53ffc2e917d0e5f63fb7a57fa4c67d00\/20241107\/1730982138\/fullpage.png",

"page_md5": {

"title": "2b985fe2b83208de3aac072923ff21dd",

"txt": "380260b0522322fdd7ce8d143e013683",

"html": "2e03d3aaa1324a9685f7d101fd7836b4"

},

"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/media-room\/"

},

"505264": {

"title": "Who We Are",

"site_title": "Ontario Science Centre",

"country": "Canada",

"subnational": "Ontario",

"type": "Agencies",

"titles": [

"Chief Executive Officer",

"CEO",

"President",

"Executive Director",

"Chairman",

"Chair",

"Chairperson",

"Chairwoman",

"Chair Of The Board",

"Director"

],

"ts_processed": 1733254226,

"ts_accessed": 1731311253,

"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/77d44bd760827b5e1372c4c603b579dd\/20241111\/1731311207\/top.jpg",

"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/77d44bd760827b5e1372c4c603b579dd\/20241111\/1731311207\/fullpage.png",

"page_md5": {

"title": "c146e0986e92bff2312ab97cc807ec53",

"txt": "29c3fb9346d5789e2d4073a142b8f641",

"html": "eb80c11a69b957a5e40f04906e710f6e"

},

"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/ceo-plus-board-of-trustees\/ceo\/"

},

"505265": {

"title": "CEO + Board of Trustees | Ontario Science Centre",

"site_title": "Ontario Science Centre",

"country": "Canada",

"subnational": "Ontario",

"type": "Agencies",

"titles": [

"Chief Executive Officer",

"CEO",

"President",

"Executive Director",

"Chairman",

"Chair",

"Chairperson",

"Chairwoman",

"Chair Of The Board",

"Director"

],

"ts_processed": 1733254226,

"ts_accessed": 1731311238,

"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/866f218dbabdd12b5d1f2694638a0a1f\/20241111\/1731311207\/top.jpg",

"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/866f218dbabdd12b5d1f2694638a0a1f\/20241111\/1731311207\/fullpage.png",

"page_md5": {

"title": "503295e2925b48fe7870cf339540fb31",

"txt": "925942a2071d628aa79080ce8974bf57",

"html": "01e7facca3679cc4e4ba24c6dfed99bb"

},

"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/ceo-plus-board-of-trustees\/"

}

}

}
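One of the simplest validation rules over a per-site file like the one above is to keep an entity only if at least one of its extracted positions contains a qualifying title from the source pages it appeared on. This Python sketch shows that one rule, not the full rule set from the real system:

```python
# Sketch of a single validation pass over a per-site JSON structure.
def validate_entities(site_data):
    valid = []
    for entity in site_data["entities"]:
        # Gather the qualifying titles from every source page that
        # mentioned this entity.
        titles = set()
        for url_id in entity["url_IDs"]:
            titles.update(site_data["sources"][url_id]["titles"])
        # Flatten the position groups, dropping empty strings.
        positions = [p for group in entity["positions"] for p in group if p]
        # Keep the entity if any position contains any qualifying title.
        if any(t.lower() in p.lower() for p in positions for t in titles):
            valid.append(entity["name"])
    return valid
```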

The Solution: Step 3 Is Searching

Once all the data is assembled, it's filtered and compiled into a large static file that's ready to be searched. Serving a static file keeps the system easy and cheap to scale, and because the data is so simple, searches are almost instantaneous. The system works surprisingly well for identifying PEPs! It took a lot of trial and error, but these methods can probably identify 95%+ of PEPs in Ontario. The remaining people simply aren't on the Internet, because the organizations they work for don't publish personnel listings. But in 2025 that is uncommon, and it's quite feasible to have a system that regularly crawls these websites, extracts the right information, and surfaces it for companies that are trying to comply with the law.

Reflections On Challenges

1. I spent a long time trying to get Google Gemini to work properly. My previous AI experiments had all been with OpenAI's services, which work better. I thought Gemini would be cheaper or nearly free, but because of the rate limits on the free tier, running my tens of thousands of LLM queries cost a few hundred bucks. Much of that was wasted on bad prompts and bad crawling methods. The system I ended up with is very cheap to run, but only after a fair bit of trial and error.

2. Large context windows are really important for applications like this, but even the largest one available at the time (Google's Gemini) wasn't big enough. One hybrid method I tried was to extract the text of the page first, along with a list of links, and then ask the model to pick from the links. This works for 90%+ of websites, but some sites have important text that isn't visible unless the user clicks something (think dropdowns), so the system I ended up writing uses the full HTML of each page. In some cases this requires stitching together a couple of LLM runs over different subsections of the page to ensure the entire page is processed. This is a tricky part that's easy to get wrong. I expect that in a couple of years this won't be necessary, because context windows will be large enough to handle any reasonably-sized HTML.

3. LLMs require significant guidance. I tried many prompts, but in the end realized I couldn't write a universal prompt that could process any government page. Instead I had to use different prompts, selected based on my manual classification of the site: court pages get one prompt, government agencies another, and municipalities another. I tagged these manually, but it would also be possible to use an LLM to classify them as a first step. This is an important step that I didn't appreciate when I first tried my hand at this problem, thinking it would be easy to get the LLM to do what I wanted. Just like an intern, an LLM needs significant guidance to get to the right answer.

4. Certain jobs, such as military officers and embassy officials, are inherently difficult to classify. These positions are often not listed online in the terms the AML law uses, and this may be an inherently fuzzy part of applying LLMs here. Other jobs are easier, but still require a manual list of titles and anti-titles to produce the right answer, which often comes down to ranking several potential titles and deciding which one is best. For example, below is the file for Ontario Agencies that my system uses (the agencies are already loaded by the crawler process; the code next to each name identifies it, with the human-readable name included for convenience/debugging):

Country:Canada

Subnational:Ontario

Type:Agencies

Titles: Chief Executive Officer, CEO, President, Executive Director, Chairman, Chair, Chairperson, Chairwoman, Chair Of The Board, Director

Anti-Titles: Assistant Director, Deputy CEO, Deputy Chief Executive Officer, Deputy President, Vice President, Chief Administrative Officer, CAO, Board Member

Valid Entities Filter: There is only one head of this provincial agency, and it is the person with the senior-most title, which is most likely the Chief Executive Officer or President (if one of these titles is part of a position held by someone, then that person is the head of the agency and you should select them).

Page Filter: A government agency of the provincial level, which is established by law to carry out a legislated purpose, often, but not always, with the name of the jurisdiction as part of the name of the agency.

Valid Entities Limit: 1

user_ID: 105

788 Building Ontario Fund

787 Centralized Supply Chain Ontario (Supply Ontario)

786 Committee on the Status of Species at Risk in Ontario

783 Ontario Labour Relations Board

778 Higher Education Quality Council of Ontario

776 Intellectual Property Ontario

773 Law Enforcement Complaints Agency

772 McMichael Canadian Art Collection

769 Niagara Escarpment Commission

768 Office of the Employer Adviser

766 Office of the Worker Adviser

765 Ontario Creates

764 Ontario Food Terminal Board

763 Ontario Heritage Trust

761 Ontario Parks Board of Directors

760 Ontario Police Arbitration and Adjudication Commission

759 Ontario Public Service Pension Board (Ontario Pension Board)

755 Post-Secondary Education Quality Assessment Board

754 Provincial Schools Authority

752 Walkerton Clean Water Centre

751 Workplace Safety and Insurance Appeals Tribunal

750 Workplace Safety and Insurance Board

729 Venture Ontario

727 St. Lawrence Park Commission

726 Skilled Trades Ontario

725 Science North

724 Royal Ontario Museum

723 Ontario Arts Council

721 Ornge

720 Ontario Trillium Foundation

718 Ontario Securities Commission

716 Ontario Northland Transportation Commission

713 Ontario Health

711 Ontario Financing Authority

710 Ontario Energy Board

708 Ontario Educational Communications Authority

707 Ontario Clean Water Agency

706 Ontario Agency for Health Protection and Promotion

704 Niagara Parks Commission

702 Metrolinx

701 Legal Aid Ontario

699 Invest Ontario

698 Independent Electricity System Operator

696 Forest Renewal Trust

695 Financial Services Regulatory Authority of Ontario

693 Education Quality and Accountability Office

692 Ontario Science Centre

691 Algonquin Forestry Authority

690 Alcohol and Gaming Commission of Ontario

689 Agricultural Research Institute of Ontario

688 Agricorp

684 Liquor Control Board of Ontario

683 iGaming Ontario

967 Trillium Gift Of Life Network

968 HealthForceOntario

969 eHealth Ontario

970 Office Of The Fairness Commissioner

971 Human Rights Legal Support Centre

972 Ontario Racing

966 Toronto Islands Residential Community Trust Corporation
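The title/anti-title ranking that a file like this drives can be sketched in a few lines of Python. The key detail is that anti-titles must be checked first, because a naive substring match would otherwise treat "Vice President" as a match for "President". The title and anti-title lists mirror the Ontario Agencies file above; the function itself is illustrative:

```python
# Sketch of title/anti-title ranking for a "Valid Entities Limit: 1" config.
TITLES = ["Chief Executive Officer", "CEO", "President", "Executive Director",
          "Chairman", "Chair", "Chairperson", "Chairwoman",
          "Chair Of The Board", "Director"]
ANTI_TITLES = ["Assistant Director", "Deputy CEO",
               "Deputy Chief Executive Officer", "Deputy President",
               "Vice President", "Chief Administrative Officer", "CAO",
               "Board Member"]

def rank_candidates(candidates, limit=1):
    scored = []
    for name, position in candidates:
        pos = position.lower()
        if any(a.lower() in pos for a in ANTI_TITLES):
            continue                      # reject e.g. "Vice President"
        # Score by the first (most senior) title appearing in the position.
        for rank, title in enumerate(TITLES):
            if title.lower() in pos:
                scored.append((rank, name))
                break
    scored.sort()
    return [name for _, name in scored[:limit]]
```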

Conclusion

LLMs can be used, with appropriate guidance and careful prompting, to extract structured information about government employees, which can then be fed into AML compliance software to flag people as politically exposed persons. The same method should be applicable in any other jurisdiction, so long as there are websites that list employees. But it requires a fair bit of guidance, so it needs an expert who understands the titles and jobs of the jurisdiction and can carefully read the relevant legislation and regulations. This problem proved to be a lot harder than I expected, and what started as an interesting weekend experiment ended up consuming probably 100 hours of my time. I consider that far more worthwhile than any university course I could have taken, and the outcome was a deeper understanding of the challenges of AML, and of the future of AI-led compliance.