โš ๏ธLarge suppression files

How to handle massive suppression files appropriately

We recommend against uploading unnecessarily large suppression files directly into your platform.

Many advertisers now have suppression files that have grown to 1 GB, 2 GB, or even larger. The outcome of uploading files this large is unpredictable and depends on many factors, including the robustness of your hardware.

Instead, we recommend running a script or tool outside of the platform.

Use a tool that can output a file of only the matches between your contact list and the suppression file, and then upload that output file as a suppression list instead.

The free version of ChatGPT (GPT-3.5) is well suited to generating a script for this type of task, as shown below.

Here's the prompt I used on the free version of ChatGPT to generate a script that runs on my MacBook Pro and processes a large suppression file of MD5 hashes:

I need to create a python script that can run on osx which takes a plain text file and converts each line to an md5sum, then compares each newly md5sumed line of that file to each line of a second plain text file which already consists of md5sums, I then need the matching lines from the first file to be saved to a file in the lines original format before they were converted to md5 for the match. The pre-existing md5sums file is 812MB and over 23 million lines, so this script needs to handle large file sizes while completing quickly

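import hashlib
import sys

def main():
    # get the filenames from the command line arguments
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <plain_text_file> <md5_file>")
    file1 = sys.argv[1]
    file2 = sys.argv[2]
    output_file = "matching_lines.txt"

    # read the md5sums from file2 and store them in a set for fast lookups
    md5sums = set()
    with open(file2, "r") as f:
        for line in f:
            md5sums.add(line.strip().lower())

    # loop over each line in file1 and compare its md5sum to the md5sums in the set
    with open(file1, "r") as f, open(output_file, "w") as out:
        for line in f:
            # strip the trailing newline first, otherwise it becomes part of the hash
            # (assumption: if the suppression list hashed lowercased addresses,
            # hash value.lower() here instead)
            value = line.strip()
            if not value:
                continue
            md5 = hashlib.md5(value.encode()).hexdigest()
            if md5 in md5sums:
                # write the line in its original, unhashed form
                out.write(value + "\n")

if __name__ == "__main__":
    main()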

The above script was tested on millions of records, completed in a few seconds, and produced accurate output.
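
The script above keeps every suppression hash in memory, which for a file of 23 million lines can take a few gigabytes of RAM. If that is a concern, one possible variation is to index the contact list instead and stream the suppression file line by line. The following is only a sketch under the assumption that the contact list is the smaller of the two files; the function names md5_hex and find_suppressed are made up for illustration and are not part of any platform or library.

```python
import hashlib


def md5_hex(text):
    """Return the lowercase hex MD5 digest of text (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_suppressed(contacts_path, md5_path, output_path):
    """Write contacts whose MD5 appears in md5_path; return how many matched."""
    # Index the (assumed smaller) contact list by hash, keeping the original text.
    by_hash = {}
    with open(contacts_path, "r", encoding="utf-8") as contacts:
        for line in contacts:
            value = line.strip()
            if value:
                by_hash[md5_hex(value)] = value

    # Stream the large hash file one line at a time; it is never fully in memory.
    matched = 0
    with open(md5_path, "r", encoding="utf-8") as hashes, \
            open(output_path, "w", encoding="utf-8") as out:
        for line in hashes:
            # pop() ensures a contact is written once even if its hash repeats.
            value = by_hash.pop(line.strip().lower(), None)
            if value is not None:
                out.write(value + "\n")
                matched += 1
    return matched
```

With this approach memory use grows with the size of the contact list rather than the suppression file. Note that matches are written in the order they appear in the suppression file, and duplicate contacts are written only once.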

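Here's another example: an awk one-liner, run in the macOS Terminal, that compares two files of email addresses in plain, unhashed format. It prints every row of contacts.csv whose first column also appears in the first column of suppression.csv: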

awk -F, 'FNR==NR {a[$1]; next}; $1 in a' suppression.csv contacts.csv
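
If awk is not available, roughly the same comparison can be sketched in Python. This is an illustration rather than an exact equivalent: the function name matching_contacts is hypothetical, and unlike the one-liner above it trims whitespace, ignores letter case, understands quoted CSV fields, and writes the result to a file instead of printing it.

```python
import csv


def matching_contacts(suppression_csv, contacts_csv, output_csv):
    """Copy rows of contacts_csv whose first column appears in suppression_csv."""
    # Collect the first column of every suppression row, trimmed and lowercased.
    with open(suppression_csv, newline="", encoding="utf-8") as f:
        suppressed = {row[0].strip().lower() for row in csv.reader(f) if row}

    kept = 0
    with open(contacts_csv, newline="", encoding="utf-8") as src, \
            open(output_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Compare on the normalized first column, but write the row unchanged.
            if row and row[0].strip().lower() in suppressed:
                writer.writerow(row)
                kept += 1
    return kept
```

The output file can then be uploaded as the suppression list in place of the full file.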

To make this process more efficient when several contact lists are involved, add them to a segment and export the segment as a single file.

If you need any help with this topic, feel free to open a support ticket.
