Mentor Matching Using Social Graphs: Crack the Code Index Match 2018

I am a part of Crack the Code, a movement to build a more equitable, inclusive, and diverse workforce in progressive data and technology. They run an annual survey to match new friends and mentoring pairs.

My job was to take answers to all kinds of questions about the advice people wanted to give and get, and to find the best set of matched pairs. I did that by building a social network (also known as a graph) representing every possible connection and then finding the set of pairs with the best total match. The graph method made it easy to weight the network by different kinds of matches (skills, identities, and locations).

This notebook documents the project. If you want to know more about the choices I made or you see something you might have done differently, feel free to email me.

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from uszipcode import ZipcodeSearchEngine

#import the survey responses and drop rows with no name
responses = pd.read_excel('/Users/rikiconrey/Documents/community/index_match/index-match-2018-responses.xlsx')
responses = responses[responses['First Name'].notnull()]

Graph Based on Mentoring Interests

Most of the survey consisted of questions about what people wanted to give and get. For instance, there was a battery of options under

Industries I've worked in or am interested in:

with options for

  • Getting advice, and
  • Giving advice

I started by building the graph in a way that ensured each pair had a getter and a giver rather than two getters or two givers.

In [2]:
#these columns contain all the matching data
flags = responses.iloc[:,5:32]
#we want a connection only when one person wants to give advice and the other wants to get it,
#not when two givers (or two getters) both have 1s
givers = pd.DataFrame()
for column in flags:
    givers[column] = flags[column].str.contains('Giving advice')*1
givers = givers.fillna(0)
#the getter flags, built the same way as the giver flags
getters = pd.DataFrame()
for column in flags:
    getters[column] = flags[column].str.contains('Getting advice')*1
getters = getters.fillna(0)

getters.head(5)
Out[2]:
Columns (27 total, truncated as pandas prints them): Industries I've worked in or am interested in: [Advocacy organizations] [Software engineering] [Consulting] [Databases] [the Democratic party] [Data Science] [the Resistance] [Labor Unions] · Career experiences that I'd like to discuss: [Managing a team] [Project or Product Management] · ... · Technical skills that I'd like to discuss: [SQL] [Polling / survey research] [Predictive modeling] [R] [Experimental design] [Python] [Data architecture] [GIS/Mapping] [Software engineering] [AWS]

0  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2  1 0 1 1 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 27 columns

So now I have an N (people) × M (interests) matrix for givers and another for getters. Taking the dot product of one with the transpose of the other gives, for each pair of people, a count of the topics where a giver matches a getter.
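
For example, here's a minimal sketch of that dot product on made-up data (three people, two topics; none of these numbers come from the survey):

import numpy as np

givers_toy  = np.array([[1, 0],   # person 0 offers advice on topic 0
                        [0, 0],   # person 1 offers nothing
                        [0, 1]])  # person 2 offers advice on topic 1
getters_toy = np.array([[0, 1],   # person 0 wants advice on topic 1
                        [1, 0],   # person 1 wants advice on topic 0
                        [0, 0]])  # person 2 wants nothing

#entry [i, j] counts topics where person i gives what person j wants
print(np.dot(givers_toy, getters_toy.T))
#[[0 1 0]
# [0 0 0]
# [1 0 0]]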

Next, the survey team asked about some identities that people might want to use in the match. Three of these came up often enough to use in matching: women or gender non-conforming people, people of color, and LGBT people.

Some free responses we might want to include in new community programs were:

  • parents,
  • people over 35,
  • economically disadvantaged people,
  • working class people,
  • people who live in rural places or conservative states,
  • immigrants, and
  • people with disabilities.

In [3]:
#these affinity groups had enough responses to match on
responses['Woman or gender non-conforming'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('Woman or gender non-conforming')*1
responses['Person of color'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('Person of color')*1
responses['LGBT'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('LGBT')*1

affinity_flags = responses[['Woman or gender non-conforming','Person of color','LGBT']]
affinity_flags = affinity_flags.fillna(0)

affinity_flags.head(5)
Out[3]:
Woman or gender non-conforming Person of color LGBT
0 0 0 0
1 1 0 0
2 0 0 0
3 0 0 0
4 0 0 0

Now I have an N × 3 matrix of the identities people endorsed. The dot product of this matrix with its own transpose counts, for each pair of people, how many of these identities they share.
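
The same toy-data trick shows what that self-product looks like (again, invented numbers, not survey data):

affinity_toy = np.array([[1, 0, 1],   # person 0 endorsed the first and third identities
                         [1, 0, 0],   # person 1 endorsed the first
                         [0, 1, 0]])  # person 2 endorsed the second

#entry [i, j] counts identities persons i and j both endorsed
print(np.dot(affinity_toy, affinity_toy.T))
#[[2 1 0]
# [1 1 0]
# [0 0 1]]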

Finally, we asked some questions about where people were located and tried to honor the desires of those who wanted to be matched locally.

In [4]:
#recode the survey question to a 0/1 "wants a local match" flag
responses['local'] = responses['Would you prefer to meet in person?']
responses.loc[responses.local ==
                   'Yes, I prefer to meet face-to-face somewhere convenient to us both','local'] = '1'
responses.loc[responses.local != '1','local'] = '0'
responses.local = responses.local.astype('int')

#pad zips back out to five digits (they were read in as integers, which drops leading zeros)
#some people gave us additional zips, but there wasn't a ton of systematic variance there
responses['padzip'] = responses["What's your zip code?"].astype('int').astype('str').str.pad(5, 'left','0')

#look up each zip code and attach the state to the file
search = ZipcodeSearchEngine()
zipcode = [search.by_zipcode(z) for z in responses['padzip']]

nm = list(zipcode[0].keys())
zipcode = pd.DataFrame(zipcode)
zipcode.columns = nm
responses = pd.concat([responses, zipcode['State']], axis=1)

#we're going to assume that people within the same state are local to each other
#recoding DC, MD, and VA to DMV so the DC metro area counts as one "state"
indmv = [state in ['DC','MD','VA'] for state in responses['State']]
responses.loc[indmv,'State'] = 'DMV'

I chose to treat any two people in the same state as local to each other because of where this sample happened to be concentrated.

In [5]:
#1 where two people live in the same state, 0 otherwise
locationmatches = np.matrix([[(i==j)*1 for j in responses['State']] for i in responses['State']])

#this is an N X N matrix of location matches
locationmatches[0:4,:]
Out[5]:
matrix([[1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1]])

With these tools, I could create the graph from the givers and the getters, the location matches, and the affinities.

In [6]:
#I want the graph undirected, so I symmetrize by summing matches in both directions
give = np.dot(givers, np.transpose(getters))
receive = np.dot(getters, np.transpose(givers))
bothways = give+receive
#we had some records with no interests.
#since everyone at least responded, they must be at least a little interested
bothways = bothways + 1

#upweight pairs that share an endorsed identity
affinity_upvote = np.dot(affinity_flags, np.transpose(affinity_flags))
bothways = bothways + affinity_upvote

#now I'm going to upweight every pair a little by location match
bothways = bothways + locationmatches
#and then zero out pairs where one person really wants a local match and the other isn't local
maskmatches = np.ones(bothways.shape)
#start with zeros for the local-only folks
maskmatches[responses['local']==1,:] = 0
maskmatches[:,responses['local']==1] = 0
#and write back over it with the location matches
maskmatches = maskmatches + locationmatches
maskmatches[maskmatches==2] = 1

bothways = np.multiply(bothways, maskmatches)

Finding the Best Pairs

Finding the best pairs at this point was simply a matter of applying an existing algorithm, the blossom algorithm, which finds a maximum-weight matching in a general graph.
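
As a minimal illustration of what the matching returns, here's a toy graph (invented weights, nothing to do with the survey):

import networkx as nx

toy = nx.Graph()
toy.add_weighted_edges_from([(0, 1, 3), (1, 2, 5), (2, 3, 4), (0, 3, 1)])

#the matching pairs 0 with 1 and 2 with 3 (total weight 3 + 4 = 7),
#which beats keeping only the single heaviest edge (1, 2) at weight 5
print(nx.max_weight_matching(toy))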

In [7]:
######################
##Pair graph
G = nx.Graph(bothways)
G.remove_edges_from(G.selfloop_edges())

#the graph is not bipartite, so I'm using a blossom algorithm to find the match
matches=nx.max_weight_matching(G)

We would be done, but there are orphans.

In [8]:
#pos = nx.kamada_kawai_layout(G)
#plot just the matched edges
plt.figure(figsize=(3,3))
nx.draw_spring(G, edgelist=matches, node_color='#089000', node_size=100)
plt.axis('off')
plt.show()

That's because of the location restriction. These are good matches, but not everyone can match because we don't have two people in every state.

To fix it, I simply ran the matching twice more, suppressing the location restriction and the previous matches. I had to run it twice because I had an odd number of people, and I still had a straggler after round 2.

In [9]:
###############
#Backup matches: rebuild the weights without the location restriction
bothways = give+receive
#we had some records with no interests.
#since everyone at least responded, they must be at least a little interested
bothways = bothways + 1

#upweight pairs that share an endorsed identity, as before
affinity_upvote = np.dot(affinity_flags, np.transpose(affinity_flags))
bothways = bothways + affinity_upvote

#remove the existing matches so everyone gets a new one
G = nx.Graph(bothways)
G.remove_edges_from(G.selfloop_edges())
G.remove_edges_from(matches)

matches2 = nx.max_weight_matching(G)

###############
#third round: drop the second-round matches and match again
G.remove_edges_from(matches2)

matches3 = nx.max_weight_matching(G)

Getting the Results

A handy way to get the matches from all three rounds into a single representation is to rebuild the full graph, tag each matched edge with the round it came from, and then dump all the links into a single file.

In [10]:
G = nx.Graph(bothways)

#tag each matched edge with the round ("level") it came from
for match in matches:
    G.edges[match]['level'] = 1
for match in matches2:
    G.edges[match]['level'] = 2
for match in matches3:
    G.edges[match]['level'] = 3

#dump the edge list to a dataframe and keep only the matched edges
match = nx.to_pandas_edgelist(G)
match['level'] = match['level'].fillna(0)
match = match[match.level>0]

match.head(5)
Out[10]:
     level  source  target  weight
96     3.0       0      96      11
123    2.0       0     123      12
219    1.0       0     219      14
262    3.0       1      12       1
361    1.0       1     111       2

The Payoff

There's still some work to do to tidy these data, but you're not here for munging. This is the network.

What I like about this is that a "matching" in the formal graph-theory sense is a set of totally disconnected pairs. We could do that, but it doesn't really serve the use case.

What we're trying to do is make introductions that make sense. By making multiple matches, we knit the pairs together into a connected network. That means that if I missed something, the network can support me by offering referral opportunities.
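
If you want to check that claim directly, here's a quick sketch (assuming G holds only the matched edges, as built in the plotting cell below):

#sanity check: how connected did three rounds of matching leave us?
components = list(nx.connected_components(G))
print(len(components), 'component(s); the largest has', max(len(c) for c in components), 'people')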

In the chart below, darker lines are matches from earlier iterations. First iteration is black; second iteration is blue; third iteration is gray.

The thickness of each line reflects the strength of the relationship under all the weighting I did up front.

In [11]:
G = nx.from_pandas_edgelist(match, source='source', target='target', edge_attr=['weight','level'])
cols = ['black','blue','grey']
#scale each line's width by match strength and color it by round
weights = []
levels = []
for u, v, data in G.edges(data=True):
    weights.append(data['weight']/10)
    levels.append(cols[int(data['level']-1)])
pos = nx.spring_layout(G)
plt.figure(figsize=(12,12))
nx.draw_networkx_nodes(G, pos, node_color='#089000', node_size=100)
plt.axis('off')
nx.draw_networkx_edges(G, pos=pos, width=weights, edge_color=levels, arrows=False)
plt.show()

I did a bunch of QC on this analysis, but I did it all in R because I don't speak Python. However, the graph itself shows that at least the weighting stuff is working. The earlier matches (black) are thicker than the later matches (gray).

Finally, the last smidge of munging to create the match list.

In [12]:
#flip the edges so each match shows up from both people's perspectives
flipped = match[['level','target','source','weight']]
flipped.columns = match.columns
match = pd.concat([match, flipped], axis=0)
#make it wide: one row per person, one column per match level
#each (person, level) cell holds a single target, so the mean just returns that id
match = pd.crosstab(index=match['source'], columns=match['level'], values=match['target'], aggfunc=np.mean)

If I just dump out 0-indexed ids, the operations people will make mistakes. I'm adding the email addresses too so they can check their own work.

In [13]:
#look up each match's email address by row id and attach it
match1_email = responses['Email?'][match.iloc[:,0]].reset_index()
match2_email = responses['Email?'][match.iloc[:,1]].reset_index()
match3_email = responses['Email?'][match.iloc[:,2]].reset_index()
match['match1_email'] = match1_email.iloc[:,1]
match['match2_email'] = match2_email.iloc[:,1]
match['match3_email'] = match3_email.iloc[:,1]
match['source_email'] = responses['Email?']
In [14]:
#at last! Everyone has at least one match
match.to_csv('/Users/rikiconrey/Documents/community/index_match/matches.csv')