In late 2024 I realized that anti-money laundering (AML) laws require AI tools for compliance. This blog post explains the nature of the problem, with a very specific example, and what I built to solve it. I won't be turning this into a commercial product, but I learned a tremendous amount about the future of compliance by building a solution. If you want to truly understand a problem, you should be able to put forward a solution, and I did. Read on to see how, because the approach is applicable to many other areas of law. The solution I made is a complex data tool with a simple interface: https://screencomply.com/aml_prototype/.
The Problem Of Politically Exposed Persons
AML is not possible to do well without access to huge amounts of structured data that simply doesn't exist. For example, Canada's federal AML law requires thousands of registered companies like banks and securities dealers to look for politically exposed persons (PEPs), a defined term whose requirements are impossible to fully satisfy at the moment. AML professionals scoff at the idea that the law is impossible to comply with, but I've never met anyone who could point me to a government database of every mayor in Canada. There is no such list. There isn't even a list of municipalities in Canada. Even at the provincial level, Ontario does not maintain a list of mayors or even an up-to-date list of municipalities. The people who designed the AML law must have thought that vendors would somehow build this, but how would they keep it up to date? Manually review thousands and thousands of municipal websites on a weekly basis?
Beyond current mayors and reeves, the law also covers anyone who has held those offices within the last five years, and many other positions, such as deputy ministers, a level of bureaucrat deep enough in the weeds that they are not always publicly listed. Where positions are listed, the data is often inaccurate. There are many thousands of PEPs in Ontario alone, and the Ontario government doesn't provide a list of who they are, so people can't comply with the federal law properly. And at the federal level, there are many, many PEPs, including judges, certain military officers, and key people in political parties that have a certain level of support in Parliament.
In other words, the government has made a rule that can't be complied with using any resource the government itself provides, and that is poorly handled by vendors. So how do people comply with the law? They rely on asking customers. Most regulated companies have a box that they ask customers to check, certifying that they are not a PEP. But how many customers even understand the checkbox? How many people become the mayor of some small town and never notify their bank that they are suddenly a candidate for being treated as high risk by the anti-money laundering department?
The Solution: Step 1 Is Crawling
Here's the prototype: https://screencomply.com/aml_prototype/.
The solution to the problem of PEPs is a crawler that works through thousands of websites across Canada in an intelligent, automated fashion. The approach I picked was to have the crawler guided by an LLM: for each page it reads, the AI is asked to identify which links on the page are most likely to lead to a page that lists politically exposed persons, with different questions depending on the nature of the website.
Here's one of the prompts that the crawler uses (with dynamic injection of some data from the crawling process):
$prompt = 'The Task:
----------
This website is for '.$title.'.
You are looking to identify which links are most likely to lead to the target that you are looking for, and then identify which URLs are likely to lead to the target based on the information that is given below. You will be returning an array of links inside a field named bestURLs, and you will explain your thinking in a field called explanation.
Some tips for finding the best links:
The best link is almost always on the same domain as the pages. The only exception is when the website is moving to a new site because there has been a reorganization that involves a change of domain name going forward.
The best link is the one that is likely to lead to a page that helps to find the target.
There is often a page that lists key staff members.
Target:
-------------------'."\n".$target;
if($valid_n != '-1' && $valid_n != 'many' && !empty($valid_n) )
$prompt .= "There are this many people that are you are looking for (i.e. the quantity of people that match the criteria of the target above): $valid_n\n";
$prompt .='
Once You Know Who You Are Looking For:
-------------------------------------
You will also tell me which of the following URLs are most likely to contain data that would be helpful to know more about the targets:
'.implode("\n",$links).'##
Skip List:
----------
You should avoid visiting a URL if it is already known. So if you see a URL in the list below, skip it:
'.implode("\n",$seen_URLs).'
Example Output:
---------------
You will provide me with a JSON output that corresponds to this example:
{
"bestURLs":["https://example.org/example/example2.html","https://example.org/abc/example-specific-listing-page.html"],
"explanation":"This is where you will write a two sentence description of who you think the target is based on the available information"
}
Provide the absolute URL for each returned link in bestURLs. The links must be on the same domain as the pages so far captured.
Provide the bestURLs in the order of most likely to contain the target to least likely. Always convert to absolute URLs, using your best guess of the URL to use based on what is provided in this prompt.
The pages known so far:
------------------------------------
'.$txt_pages.'
Site Description:
---------
The website is: '.$description_of_site.'.
JSON only:
---------
Only answer using JSON as the response, in the format of the above example that has the field "bestURLs". Provide at most 50 bestURLs, all as absolute links (not relative URLs).';
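For completeness, here is roughly what happens with the model's answer. This is a simplified sketch rather than the production code: call_llm() stands in for the actual Gemini API call, and $site_host and $crawl_queue are illustrative names.
// Simplified sketch of consuming the model's JSON answer.
// call_llm() is a stand-in for the real API call; $site_host and
// $crawl_queue are illustrative names, not the production code.
$response = call_llm($prompt);            // model returns a JSON string
$parsed = json_decode($response, true);
if (is_array($parsed) && !empty($parsed['bestURLs'])) {
    foreach ($parsed['bestURLs'] as $url) {
        if (in_array($url, $seen_URLs, true)) {
            continue;                     // already crawled, skip it
        }
        if (parse_url($url, PHP_URL_HOST) !== $site_host) {
            continue;                     // enforce the same-domain rule from the prompt
        }
        $crawl_queue[] = $url;            // visit on a later pass
        $seen_URLs[] = $url;
    }
}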
The Solution: Step 2 Is Parsing
As the crawler goes through a site, it checks whether it has found the PEPs it is looking for. The output of the crawling process is a set of JSON files, which are then evaluated by various rules to judge whether the extracted people are the right ones and to validate their PEP status against the legal definition. Each site results in a JSON file like this:
{
"entities": [
{
"name": "Paul Kortenaar",
"url_IDs": [
"499634",
"505264",
"505265"
],
"positions": [
[
"Chief Executive Officer of Ontario Science Centre",
"",
""
]
]
}
],
"sources": {
"499634": {
"title": "Media Room | Ontario Science Centre",
"site_title": "Ontario Science Centre",
"country": "Canada",
"subnational": "Ontario",
"type": "Agencies",
"titles": [
"Chief Executive Officer",
"CEO",
"President",
"Executive Director",
"Chairman",
"Chair",
"Chairperson",
"Chairwoman",
"Chair Of The Board",
"Director"
],
"ts_processed": 1733254226,
"ts_accessed": 1730982164,
"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/53ffc2e917d0e5f63fb7a57fa4c67d00\/20241107\/1730982138\/top.jpg",
"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/53ffc2e917d0e5f63fb7a57fa4c67d00\/20241107\/1730982138\/fullpage.png",
"page_md5": {
"title": "2b985fe2b83208de3aac072923ff21dd",
"txt": "380260b0522322fdd7ce8d143e013683",
"html": "2e03d3aaa1324a9685f7d101fd7836b4"
},
"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/media-room\/"
},
"505264": {
"title": "Who We Are",
"site_title": "Ontario Science Centre",
"country": "Canada",
"subnational": "Ontario",
"type": "Agencies",
"titles": [
"Chief Executive Officer",
"CEO",
"President",
"Executive Director",
"Chairman",
"Chair",
"Chairperson",
"Chairwoman",
"Chair Of The Board",
"Director"
],
"ts_processed": 1733254226,
"ts_accessed": 1731311253,
"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/77d44bd760827b5e1372c4c603b579dd\/20241111\/1731311207\/top.jpg",
"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/77d44bd760827b5e1372c4c603b579dd\/20241111\/1731311207\/fullpage.png",
"page_md5": {
"title": "c146e0986e92bff2312ab97cc807ec53",
"txt": "29c3fb9346d5789e2d4073a142b8f641",
"html": "eb80c11a69b957a5e40f04906e710f6e"
},
"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/ceo-plus-board-of-trustees\/ceo\/"
},
"505265": {
"title": "CEO + Board of Trustees | Ontario Science Centre",
"site_title": "Ontario Science Centre",
"country": "Canada",
"subnational": "Ontario",
"type": "Agencies",
"titles": [
"Chief Executive Officer",
"CEO",
"President",
"Executive Director",
"Chairman",
"Chair",
"Chairperson",
"Chairwoman",
"Chair Of The Board",
"Director"
],
"ts_processed": 1733254226,
"ts_accessed": 1731311238,
"topjpg": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/866f218dbabdd12b5d1f2694638a0a1f\/20241111\/1731311207\/top.jpg",
"fullpagepng": "https:\/\/screencomplystorage.nyc3.cdn.digitaloceanspaces.com\/captures\/866f218dbabdd12b5d1f2694638a0a1f\/20241111\/1731311207\/fullpage.png",
"page_md5": {
"title": "503295e2925b48fe7870cf339540fb31",
"txt": "925942a2071d628aa79080ce8974bf57",
"html": "01e7facca3679cc4e4ba24c6dfed99bb"
},
"url": "https:\/\/www.ontariosciencecentre.ca\/about-us\/ceo-plus-board-of-trustees\/"
}
}
}
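To give a sense of the rule pass over these files, here is a minimal sketch (the helper and variable names are illustrative, not the production code). The $titles and $anti_titles arrays come from a category file like the Ontario Agencies one shown further below; anti-titles are checked first so that, for example, a "Deputy Chief Executive Officer" doesn't slip through on a substring match against "Chief Executive Officer".
// Rough sketch of the rule pass over one crawler output file.
// $json_path, $titles and $anti_titles are assumed to be loaded already.
$data = json_decode(file_get_contents($json_path), true);
$candidates = [];

foreach ($data['entities'] as $entity) {
    foreach ($entity['positions'] as $position_set) {
        foreach (array_filter($position_set) as $position) {
            if (matches_any($position, $anti_titles)) {
                continue; // e.g. "Deputy Chief Executive Officer" is explicitly excluded
            }
            if (matches_any($position, $titles)) {
                $candidates[] = ['name' => $entity['name'], 'position' => $position];
            }
        }
    }
}

function matches_any(string $position, array $list): bool {
    foreach ($list as $needle) {
        if (stripos($position, $needle) !== false) {
            return true;
        }
    }
    return false;
}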
The Solution: Step 3 Is Searching
Once all the data is assembled, it's filtered and compiled into a large file that's ready to be searched. Serving a static file keeps the system easy and cheap to scale, and because the data is so simple, a search is almost instantaneous. The actual system works surprisingly well for identifying PEPs! It took a lot of trial and error, but it is possible to identify probably 95%+ of PEPs in Ontario using these methods. The remaining people simply aren't findable on the Internet, because the organizations they work for don't publish personnel listings. But in 2025 that is uncommon, and it's quite feasible to have a system that regularly crawls these websites, extracts the right information, and then surfaces it for companies that are trying to comply with the law.
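As a rough illustration (the file name and record structure here are assumptions for the sketch, not the production format), the search amounts to loading one compiled JSON file and scanning it in memory:
// Illustrative only: the file name and record structure are assumptions
// for this sketch, not the production format.
$peps = json_decode(file_get_contents('compiled_peps.json'), true);

function search_peps(array $peps, string $query): array {
    $hits = [];
    foreach ($peps as $pep) {
        // A real screening system also needs fuzzy and alias matching;
        // a case-insensitive substring scan shows the shape of the lookup.
        if (stripos($pep['name'], $query) !== false) {
            $hits[] = $pep;
        }
    }
    return $hits;
}

$results = search_peps($peps, 'Kortenaar');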
Reflections On Challenges
1. I spent a long time trying to get Google Gemini to work properly. My previous AI experiments have all been with OpenAI's services, which work better. I thought Gemini would be cheap or nearly free, but the free tier's rate limits made it unusable at this scale, so running my tens of thousands of LLM queries cost a few hundred bucks. Much of that was wasted on bad prompts and bad crawling methods. The actual system that I built is very cheap to run, but only after a fair bit of trial and error.
2. Large context windows are really important for applications like this, but even the largest one available at the time (Google's Gemini) wasn't big enough. One hybrid method I tried was to extract the text of the page first, along with a list of links, and then ask the system to pick from the links. This works for 90%+ of websites, but some sites have important text that isn't visible to the user unless they click something (think dropdowns), so the actual system I ended up writing uses the full HTML of the pages. In some cases this requires stitching together a couple of LLM runs over different subsections of a page to ensure the whole page is processed. This is a very tricky part that's easy to get wrong. I expect that in a couple of years this won't be necessary, because context windows will be large enough to handle any reasonably-sized HTML.
3. LLMs require significant guidance. I tried many prompts but in the end realized I couldn't write a universal prompt that could process any government page; instead I had to use different prompts selected according to my manual classification of each site. For example, court pages have one prompt, government agencies another, and municipalities another. I tagged these manually, but it's also possible to use an LLM to classify them as a first step. This is an important step that I didn't appreciate when I first tried my hand at this problem, thinking it would be easy to get the LLM to do what I wanted. Just like an intern, the LLM needs significant guidance to get to the right answer.
4. Certain jobs, such as military officers and embassy officials, are inherently difficult to classify. These positions are often not listed online with the information the AML law expects, and this may be an inherently fuzzy part of applying LLMs to the problem. Other jobs are easier, but still require a manual list of titles and anti-titles to produce the right answer, which often depends on ranking several potential titles and deciding which one is best. For example, below is the file for Ontario Agencies that my system uses (the agencies are already loaded by the crawler process; the code next to each agency identifies it, with the human-readable name included for convenience/debugging), followed by a short sketch of the ranking logic:
Country:Canada
Subnational:Ontario
Type:Agencies
Titles: Chief Executive Officer, CEO, President, Executive Director, Chairman, Chair, Chairperson, Chairwoman, Chair Of The Board, Director
Anti-Titles: Assistant Director, Deputy CEO, Deputy Chief Executive Officer, Deputy President, Vice President, Chief Administrative Officer, CAO, Board Member
Valid Entities Filter: There is only one head of this provincial agency, and it is the person with the senior-most title, which is most likely the Chief Executive Officer or President (if these titles are parts of positions held by someone then this person is the head of the agency that you should select).
Page Filter: A government agency of the provincial level, which is established by law to carry out a legislated purpose, often, but not always, with the name of the jurisdiction as part of the name of the agency.
Valid Entities Limit: 1
user_ID: 105
788 Building Ontario Fund
787 Centralized Supply Chain Ontario (Supply Ontario)
786 Committee on the Status of Species at Risk in Ontario
783 Ontario Labour Relations Board
778 Higher Education Quality Council of Ontario
776 Intellectual Property Ontario
773 Law Enforcement Complaints Agency
772 McMichael Canadian Art Collection
769 Niagara Escarpment Commission
768 Office of the Employer Adviser
766 Office of the Worker Adviser
765 Ontario Creates
764 Ontario Food Terminal Board
763 Ontario Heritage Trust
761 Ontario Parks Board of Directors
760 Ontario Police Arbitration and Adjudication Commission
759 Ontario Public Service Pension Board (Ontario Pension Board)
755 Post-Secondary Education Quality Assessment Board
754 Provincial Schools Authority
752 Walkerton Clean Water Centre
751 Workplace Safety and Insurance Appeals Tribunal
750 Workplace Safety and Insurance Board
729 Venture Ontario
727 St. Lawrence Park Commission
726 Skilled Trades Ontario
725 Science North
724 Royal Ontario Museum
723 Ontario Arts Council
721 Ornge
720 Ontario Trillium Foundation
718 Ontario Securities Commission
716 Ontario Northland Transportation Commission
713 Ontario Health
711 Ontario Financing Authority
710 Ontario Energy Board
708 Ontario Educational Communications Authority
707 Ontario Clean Water Agency
706 Ontario Agency for Health Protection and Promotion
704 Niagara Parks Commission
702 Metrolinx
701 Legal Aid Ontario
699 Invest Ontario
698 Independent Electricity System Operator
696 Forest Renewal Trust
695 Financial Services Regulatory Authority of Ontario
693 Education Quality and Accountability Office
692 Ontario Science Centre
691 Algonquin Forestry Authority
690 Alcohol and Gaming Commission of Ontario
689 Agricultural Research Institute of Ontario
688 Agricorp
684 Liquor Control Board of Ontario
683 iGaming Ontario
967 Trillium Gift Of Life Network
968 HealthForceOntario
969 eHealth Ontario
970 Office Of The Fairness Commissioner
971 Human Rights Legal Support Centre
972 Ontario Racing
966 Toronto Islands Residential Community Trust Corporation
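To make the ranking concrete, here is a minimal sketch of how the ordered Titles list above can be used to pick the single head of an agency when several people match. The sample candidates and helper names are illustrative, not the production code.
// Illustrative ranking pass for one agency, using the ordered Titles
// list from the file above; "Valid Entities Limit: 1" means only the
// single most senior match is kept. The candidate list is made up.
$titles = ['Chief Executive Officer', 'CEO', 'President', 'Executive Director',
           'Chairman', 'Chair', 'Chairperson', 'Chairwoman', 'Chair Of The Board', 'Director'];

$candidates = [
    ['name' => 'Example Board Chair', 'position' => 'Chair of the Board'],
    ['name' => 'Paul Kortenaar', 'position' => 'Chief Executive Officer of Ontario Science Centre'],
];

function title_rank(string $position, array $titles): int {
    foreach ($titles as $rank => $title) {
        if (stripos($position, $title) !== false) {
            return $rank;                 // lower rank = more senior title
        }
    }
    return PHP_INT_MAX;                   // no recognized title at all
}

usort($candidates, function (array $a, array $b) use ($titles) {
    return title_rank($a['position'], $titles) <=> title_rank($b['position'], $titles);
});

$head_of_agency = array_slice($candidates, 0, 1);   // the CEO outranks the Chair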
Conclusion
LLMs can be used, with appropriate guidance and careful prompting, to extract structured information about government employees, which can then be fed into AML compliance software to flag people as politically exposed persons. The same method should be applicable in any other jurisdiction, so long as there are websites that list employees. But it requires a fair bit of guidance, which means it requires an expert who understands the titles and jobs of the jurisdiction and can carefully read the relevant legislation and regulations. This problem proved to be a lot harder than I expected, and what started off as an interesting weekend experiment ended up consuming probably 100 hours of my time. I consider that much more worthwhile than any university course I could have taken, and the outcome was a deeper understanding of the challenges of AML and of the future of AI-led compliance.