Reconnecting – 10 Years of Email

I’m really bad about keeping connections. I’ll work with someone on a project for a few days, weeks, months, or years and somehow forget to put them in my contacts. I won’t connect with them on LinkedIn. Basically, I do a crummy job of maintaining relationships with former associates. Despite this, I’ve managed to end up with about 1,900 connections on LinkedIn. No doubt this is more a result of the projects I’ve worked on than of my ability to make connections.

What I do well, however, is keep all my email. I have for 10 years. Recently I wrote a tool that extracts the inbound and outbound email messages from my PST files, including everyone each message was sent to. The result was a database that I could use to find out what my email world was like. More importantly, it allowed me to mine my email for the people I had communicated with but wasn’t connected to on LinkedIn. Along the way, I learned a few things.


By The Numbers

I thought that some of the numbers might be interesting:

Total rows (messages × recipients): 668,429
Messages: 375,317
Conversations: 142,118
Rows I sent (messages × recipients): 114,438
Messages I sent: 75,009
Conversations where I said something: 29,741
People I talked to*: 10,114
People I emailed more than three times*: 3,108

One caveat about the number of people I talked to: a person often appeared multiple times because they changed their display name, I used their email address instead, they got married, they changed organizations, and so on. Even after some cleanup I ended up with a large number of duplicates in the list of people I talked to. This was fine for my purposes, but it represents a serious data challenge.

I’ve not finished playing with the data but I know there are some gaps and other issues with it – and I know I’ve spoken to a lot of people over the years.

The Process

The process started with keeping all my mail for the last 10 years and letting Outlook autoarchive it into PSTs as my mailbox got too full. Over the years I’ve ended up with 12 archive folders. I didn’t do a great job of keeping regular intervals between them or a standard maximum size, but having 12 makes each file manageable.

The utility I wrote loads a PST into Outlook, then enumerates all of the folders in the PST and writes out every message. It writes a separate line in a CSV for each recipient of each message. Each line includes the conversation ID, the sender, the subject, and the time the message was sent, in addition to the recipient. I had the utility write out individual rows for each person because I was most interested in aggregating by the folks I sent to.
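For readers who want to try something similar, here is a minimal Python sketch of the one-row-per-recipient layout described above. It is not the original utility: reading messages out of Outlook is left out entirely, and the `write_recipient_rows` function, the dictionary keys, and the sample data are all hypothetical stand-ins.

```python
import csv
import io


def write_recipient_rows(messages, out):
    """Write one CSV row per (message, recipient) pair; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["conversation_id", "sender", "subject", "sent_on", "recipient"])
    rows = 0
    for message in messages:
        # A message sent to N people produces N rows, so it can later be
        # aggregated by recipient.
        for recipient in message["recipients"]:
            writer.writerow([
                message["conversation_id"],
                message["sender"],
                message["subject"],
                message["sent_on"],
                recipient,
            ])
            rows += 1
    return rows


# Hypothetical sample data standing in for messages read from a PST.
sample_messages = [
    {
        "conversation_id": "C1",
        "sender": "Me",
        "subject": "Project kickoff",
        "sent_on": "2004-03-01T09:15:00",
        "recipients": ["Jane Doe", "Bob Smith"],
    },
    {
        "conversation_id": "C1",
        "sender": "Jane Doe",
        "subject": "RE: Project kickoff",
        "sent_on": "2004-03-01T10:02:00",
        "recipients": ["Me"],
    },
]

buffer = io.StringIO()
rows_written = write_recipient_rows(sample_messages, buffer)
print(rows_written)  # 3 rows: two recipients on the first message, one on the second
```

This layout is why the row totals are larger than the message totals: every extra recipient adds a row.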

The CSV files that I generated were loaded into Access, and there was a fair amount of cleanup I had to do. First, I removed any single quotes at the beginning and end of the To email addresses. Because I was trying to aggregate by email address, having some addresses wrapped in single quotes and some not proved problematic.
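The quote cleanup was done in Access, but the same idea can be sketched in a few lines of Python. The `strip_wrapping_quotes` name and the sample values are hypothetical, and this version also trims surrounding whitespace, which is an assumption on my part.

```python
def strip_wrapping_quotes(value):
    """Remove single quotes (and stray whitespace) from both ends of a value."""
    # str.strip("'") only touches the ends, so an apostrophe inside a name
    # such as O'Brien is left alone.
    return value.strip().strip("'").strip()


print(strip_wrapping_quotes("'jane@example.com'"))  # jane@example.com
print(strip_wrapping_quotes("Pat O'Brien"))         # Pat O'Brien
```

After this step, quoted and unquoted versions of the same value collapse into one key for aggregation.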

I didn’t end up getting the email address; I really only got the display name. I would have preferred to get both, but it wasn’t obvious how to do this, and for my purposes I was focused on the name.

The other key cleanup was to mark the rows where I was the sender, because one of my major goals was to filter by what I had sent. This was a bit more problematic because many of the rows had a blank sender, some had my email address, and still more had my X.400 email address from Exchange. Ultimately I set a flag in each row indicating whether I sent the message or not.
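A rough Python sketch of that flagging step might look like the following. The addresses in `MY_IDENTITIES` are made up, and treating a blank sender as "sent by me" is an assumption exposed through the hypothetical `blank_means_me` parameter; how blank senders should be resolved depends on where they come from in your own data.

```python
# Hypothetical identities; substitute your own SMTP and X.400 addresses.
MY_IDENTITIES = {
    identity.casefold()
    for identity in (
        "me@example.com",
        "c=US;a= ;p=Example;o=Exchange;s=Doe;g=Me;",
    )
}


def sent_by_me(sender, blank_means_me=True):
    """Return True if the sender value looks like one of my own identities."""
    normalized = (sender or "").strip().casefold()
    if not normalized:
        # Assumption: rows with no sender are treated as my own sent mail.
        return blank_means_me
    return normalized in MY_IDENTITIES


print(sent_by_me("Me@Example.com"))    # True
print(sent_by_me("jane@example.com"))  # False
print(sent_by_me(""))                  # True, under the blank-sender assumption
```

Storing the result as a flag on each row keeps later queries simple, since they only need to filter on one column.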

From there, I created a query for Top Talkers. This query isolated those people who were the recipient of more than three of my messages. I used this as a proxy for whether I had a real conversation with them or not.
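The Top Talkers query lived in Access, but the aggregation itself can be sketched in Python with `collections.Counter`. The `top_talkers` function, the row keys, and the sample rows are hypothetical, and the cutoff is left as a parameter so it can be tuned.

```python
from collections import Counter


def top_talkers(rows, min_messages):
    """Return (recipient, count) pairs for people I sent at least min_messages rows to."""
    counts = Counter(row["recipient"] for row in rows if row["sent_by_me"])
    # most_common() sorts by count, highest first.
    return [(name, n) for name, n in counts.most_common() if n >= min_messages]


# Hypothetical rows in the one-row-per-recipient layout.
sample_rows = [
    {"recipient": "Jane Doe", "sent_by_me": True},
    {"recipient": "Jane Doe", "sent_by_me": True},
    {"recipient": "Jane Doe", "sent_by_me": True},
    {"recipient": "Bob Smith", "sent_by_me": True},
    {"recipient": "Me", "sent_by_me": False},
    {"recipient": "Me", "sent_by_me": False},
]

# A deliberately small cutoff for this tiny sample.
result = top_talkers(sample_rows, min_messages=2)
print(result)  # [('Jane Doe', 3)]
```

Note that this counts by whatever string identifies the recipient, so the duplicate-name problem mentioned earlier will split one person's count across several entries.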

If you are interested in the tooling, you can send me an email and I’ll respond when I can.

Picking a Threshold

One of the challenges was figuring out how many messages I’d need to send someone before I’d say that we had a meaningful conversation. Complicating this was that the same person occurred in the data multiple ways (as I mentioned above). So if I traded emails with someone 100 times, would they remember me and want to be connected? What about fifty? How about five?

Unfortunately, there weren’t any clear answers. I ultimately decided on three simply to limit what I was looking at. I figured that folks who got three or fewer messages from me probably weren’t strong enough connections.

My general threshold for LinkedIn is that I’d have to be willing to introduce one of my connections to someone else. Not that I’d recommend them, but that I could say I knew some aspect of them. I figured that at three messages or below I wouldn’t be able to do that.


I’ll have to admit that my mind needed some prompting on more than a few of the people. I’d go through the list of people I had talked to, and if I remembered them I’d search LinkedIn for them. Quite a few were already connections, so that was great. Some weren’t, so I sent them invitations.

For those I didn’t remember, I searched the database for the subjects of the messages I had sent them. That generally helped me work out how I knew them pretty quickly, and then I could search LinkedIn to find them. Later in the process I switched to using Outlook to search for folks, but only because flipping between tabs in Access wasn’t worth it. I could keep Outlook on one screen, Access on another, and LinkedIn on a third.

Robotic Speed

At some point in the process, LinkedIn noticed the amount of activity I was generating and started flagging me as a potential robot. So I started having to enter CAPTCHA codes for each connection I’d try to make and occasionally provide the contact’s email address. While I was flattered that LinkedIn thought I was a robot, I was disappointed that once I hit the threshold I had to verify on every request. Still, it was somewhat fun to be working at a speed that made it think I was a bot.

Delegation (Lack of)

One might ask why I didn’t delegate some of the data management functions to my assistant. How hard can it be to match names in email to names on LinkedIn? Well, it turns out it was a lot harder than you would think.

I’ve already mentioned the problem of needing to know how I knew someone in order to select the right person from the list LinkedIn returned. There were other challenges too, including deciding, when I didn’t find someone immediately, whether they were important enough to track down. Finally, there were some folks I didn’t feel like connecting with. I wouldn’t recommend them to anyone, so being connected didn’t make sense.


The whole point of the exercise for me was to reconnect with folks I had managed to become disconnected from. My experience says that folks accept LinkedIn invitations over a handful of days, but even though I only finished sending the invitations over the last two days, I’ve already connected with 75 people I had previously dropped. (Since I drafted this, and in the weeks that have passed, the number looks to be in excess of 150 new connections.) That for me is a big win, and potentially worth the effort to reconnect.

These folks will get a yearly update from me when the time rolls around again this coming summer. Until then, I know that if I need to find that person I worked with years ago on something, I can be reasonably certain that I can find them now.

If we’ve worked together and I didn’t send you a LinkedIn connection request – feel free to send one to me.
