Harvesting Addresses

Have you ever wished you could create a list of all of the email addresses that you have communicated with over the years? Perhaps you’re building an email distribution list or need to notify everyone that you’re changing your address. Whatever the reason, the task is very simple if you are using an email client that stores mail in a text format.

In my case, I’m using Apple Mail, which stores each message in its own .emlx file in an mbox-like format that is, indeed, plain text.

To grab our addresses, all we need to do is locate each mail message by its emlx extension and then grep it for the three headers that contain addresses: From, To, and Cc. Quoting the '*.emlx' pattern stops the shell from expanding it before find sees it, and the -print0 and -0 options keep folder names with spaces in one piece on the way to grep. The -h option tells grep to leave the file name off each matching line. The output goes into a file called output.txt in our home folder.

find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^From:' >~/output.txt
find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^To:' >>~/output.txt
find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^Cc:' >>~/output.txt
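
If walking the mail folder three times feels slow, the three searches can probably be folded into one. The following is only a sketch: it builds a throwaway folder with two made-up messages (the demo_dir and headers names and the sample addresses are my own inventions) so that it can be tried without touching real mail.

```shell
# Build a throwaway mail folder so the sketch is safe to run anywhere.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/My Mailbox"
printf 'From: Alice <alice@example.com>\nTo: bob@example.com\nSubject: hi\n\nbody\n' \
    > "$demo_dir/My Mailbox/1.emlx"
printf 'From: carol@example.com\nCc: Dave <dave@example.com>\n\nbody\n' \
    > "$demo_dir/My Mailbox/2.emlx"

# One pass over the files: -E enables the (From|To|Cc) alternation,
# -h drops the file name prefix from each matching line.
headers=$(find "$demo_dir" -name '*.emlx' -print0 |
    xargs -0 grep -hE '^(From|To|Cc):' | LC_ALL=C sort)
printf '%s\n' "$headers"

# Clean up the throwaway folder.
rm -rf "$demo_dir"
```

The sort is there only to make the demo output predictable; the real pipeline does not need it.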

Since grep outputs the complete line that contains the string we were looking for, we now need to use Perl to iterate through our output file and extract anything that looks like an email address, while ignoring the bits and bobs we have no use for.

perl -wne 'while(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/g){print "$&\n"}' ~/output.txt >~/output2.txt

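Now that we have a file with lots and lots of email addresses, let’s go ahead and make everything lowercase, which will make our next step of finding only unique addresses easier. Note that the paths are spelled out with $HOME here, because not every shell expands ~ when it follows an equals sign inside an argument.

dd if="$HOME/output2.txt" of="$HOME/output3.txt" conv=lcase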
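
If dd feels like an odd tool for changing case, tr should do the same job. The sketch below runs both over two made-up addresses so the results can be compared; the mixed, via_dd, and via_tr variable names are mine.

```shell
# Two made-up addresses with mixed case.
mixed='Alice.Smith@Example.com
BOB@EXAMPLE.ORG'

# dd reads standard input when no if= is given; its transfer statistics
# go to standard error, so we silence them here.
via_dd=$(printf '%s\n' "$mixed" | dd conv=lcase 2>/dev/null)

# tr maps every uppercase character to its lowercase counterpart.
via_tr=$(printf '%s\n' "$mixed" | tr '[:upper:]' '[:lower:]')

printf '%s\n' "$via_dd"
# prints:
#   alice.smith@example.com
#   bob@example.org
```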

Now the last step in this process is to make sure there are only unique addresses in our file.

awk '!x[$0]++' ~/output3.txt >~/output.txt
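
That awk one-liner is terse, so here is a small sketch of what it does, run on a made-up list (the dupes and unique variable names are mine). The array x counts how often each line has been seen; the expression is true only on the first sighting, and awk prints a line whenever the expression is true.

```shell
# A made-up list with two repeated addresses.
dupes='alice@example.com
bob@example.org
alice@example.com
carol@example.net
bob@example.org'

# x[$0]++ yields 0 the first time a line appears, so !x[$0]++ is true
# exactly once per distinct line and awk prints it then.
unique=$(printf '%s\n' "$dupes" | awk '!x[$0]++')
printf '%s\n' "$unique"
# prints:
#   alice@example.com
#   bob@example.org
#   carol@example.net
```

Unlike sort -u, this keeps the addresses in the order they were first seen.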

After the script completes, we are left with a file that contains every unique address found in the messages you have sent and received over the years. I use this script regularly as a way of adding folks to our company mailing list. If you do this, it is imperative that you comply with the CAN-SPAM Act by making it easy for recipients of your messages to opt out, preferably with a single click or reply.

Following is the complete script with a few extra lines to take care of cleaning up the temporary files we create along the way.

find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^From:' >~/output.txt
find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^To:' >>~/output.txt
find ~/Library/Mail -name '*.emlx' -print0 | xargs -0 grep -h '^Cc:' >>~/output.txt
perl -wne 'while(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/g){print "$&\n"}' ~/output.txt >~/output2.txt
rm ~/output.txt
dd if="$HOME/output2.txt" of="$HOME/output3.txt" conv=lcase
rm ~/output2.txt
awk '!x[$0]++' ~/output3.txt >~/output.txt
rm ~/output3.txt

Until next time – GEEK OUT!

