Mentor Matching Using Social Graphs: Crack the Code Index Match 2018

I am a part of Crack the Code, a movement to build a more equitable, inclusive, and diverse workforce in progressive data and technology. They run an annual survey to match new friends and mentoring pairs.

My job was to take answers to all kinds of questions about the advice people wanted to give and get, and to find the best set of matched pairs. I did that by building a social network (also known as a graph) representing every possible connection and then finding the set of pairs with the best total match. The graph method made it easy to weight the network by different kinds of matches (skills, identities, and locations).

This notebook documents the project. If you want to know more about the choices I made or you see something you might have done differently, feel free to email me.

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from uszipcode import ZipcodeSearchEngine

#import the survey responses and drop rows with no name
responses = pd.read_excel('/Users/rikiconrey/Documents/community/index_match/index-match-2018-responses.xlsx')
responses = responses[responses['First Name'].notnull()]

Graph Based on Mentoring Interests

Most of the survey consisted of questions about what people wanted to give and get. For instance, there was a battery of options under

Industries I've worked in or am interested in:

with options for

  • Getting advice, and
  • Giving advice

I started by building the graph in a way that ensured each pair had a getter and a giver rather than two getters or two givers.

In [2]:
#these columns contain all the matching data
flags = responses.iloc[:,5:32]
#we want a connection only when one person wants to give advice and the other wants to get it,
#not when two givers (or two getters) both have 1s
givers = pd.DataFrame()
for column in flags:
    givers[column] = flags[column].str.contains('Giving advice')*1
givers = givers.fillna(0)
#the getter flags, built the same way as the giver flags
getters = pd.DataFrame()
for column in flags:
    getters[column] = flags[column].str.contains('Getting advice')*1
getters = getters.fillna(0)

getters.head(5)
Out[2]:
Columns (27 total, truncated as pandas prints them): Industries I've worked in or am interested in: [Advocacy organizations] [Software engineering] [Consulting] [Databases] [the Democratic party] [Data Science] [the Resistance] [Labor Unions] · Career experiences that I'd like to discuss: [Managing a team] [Project or Product Management] · ... · Technical skills that I'd like to discuss: [SQL] [Polling / survey research] [Predictive modeling] [R] [Experimental design] [Python] [Data architecture] [GIS/Mapping] [Software engineering] [AWS]

0  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2  1 0 1 1 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4  0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 27 columns

So now I have an N (people) × M (interests) matrix for givers and another for getters. Taking the dot product of one with the transpose of the other gives, for each pair of people, a count of the topics where a giver matches a getter.
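
For example, here's a minimal sketch of that dot product on made-up data (three people, two topics; none of these numbers come from the survey):

import numpy as np

givers_toy  = np.array([[1, 0],   # person 0 offers advice on topic 0
                        [0, 0],   # person 1 offers nothing
                        [0, 1]])  # person 2 offers advice on topic 1
getters_toy = np.array([[0, 1],   # person 0 wants advice on topic 1
                        [1, 0],   # person 1 wants advice on topic 0
                        [0, 0]])  # person 2 wants nothing

#entry [i, j] counts topics where person i gives what person j wants
print(np.dot(givers_toy, getters_toy.T))
#[[0 1 0]
# [0 0 0]
# [1 0 0]]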

Next, the survey team asked about some identities that people might want to use in the match. Three of these came up often enough to use in matching: women or gender non-conforming people, people of color, and LGBT people.

Some free responses we might want to include in new community programs were:

  • parents,
  • people over 35,
  • economically disadvantaged people,
  • working class people,
  • people who live in rural places or conservative states,
  • immigrants, and
  • people with disabilities.

In [3]:
#these affinity groups had enough responses to match on
responses['Woman or gender non-conforming'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('Woman or gender non-conforming')*1
responses['Person of color'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('Person of color')*1
responses['LGBT'] = responses["Are you interested in matching based on mutual experiences?"].str.contains('LGBT')*1

affinity_flags = responses[['Woman or gender non-conforming','Person of color','LGBT']]
affinity_flags = affinity_flags.fillna(0)

affinity_flags.head(5)
Out[3]:
Woman or gender non-conforming Person of color LGBT
0 0 0 0
1 1 0 0
2 0 0 0
3 0 0 0
4 0 0 0

Now I have an N × 3 matrix of the identities people endorsed. The dot product of this matrix with its own transpose counts, for each pair of people, how many of these identities they share.
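
The same toy-data trick shows what that self-product looks like (again, invented numbers, not survey data):

affinity_toy = np.array([[1, 0, 1],   # person 0 endorsed the first and third identities
                         [1, 0, 0],   # person 1 endorsed the first
                         [0, 1, 0]])  # person 2 endorsed the second

#entry [i, j] counts identities persons i and j both endorsed
print(np.dot(affinity_toy, affinity_toy.T))
#[[2 1 0]
# [1 1 0]
# [0 0 1]]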

Finally, we asked some questions about where people were located and tried to honor the desires of those who wanted to be matched locally.

In [4]:
#recode the survey question to a 0/1 "wants a local match" flag
responses['local'] = responses['Would you prefer to meet in person?']
responses.loc[responses.local ==
                   'Yes, I prefer to meet face-to-face somewhere convenient to us both','local'] = '1'
responses.loc[responses.local != '1','local'] = '0'
responses.local = responses.local.astype('int')

#pad zips back out to five digits (they were read in as integers, which drops leading zeros)
#some people gave us additional zips, but there wasn't a ton of systematic variance there
responses['padzip'] = responses["What's your zip code?"].astype('int').astype('str').str.pad(5, 'left','0')

#look up each zip code and attach the state to the file
search = ZipcodeSearchEngine()
zipcode = [search.by_zipcode(z) for z in responses['padzip']]

nm = list(zipcode[0].keys())
zipcode = pd.DataFrame(zipcode)
zipcode.columns = nm
responses = pd.concat([responses, zipcode['State']], axis=1)

#we're going to assume that people within the same state are local to each other
#recoding DC, MD, and VA to DMV so the DC metro area counts as one "state"
indmv = [state in ['DC','MD','VA'] for state in responses['State']]
responses.loc[indmv,'State'] = 'DMV'

I chose to treat any two people in the same state as local to each other because of where this sample happened to be concentrated.

In [5]:
#1 where two people live in the same state, 0 otherwise
locationmatches = np.matrix([[(i==j)*1 for j in responses['State']] for i in responses['State']])

#this is an N X N matrix of location matches
locationmatches[0:4,:]
Out[5]:
matrix([[1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1],
        [1, 1, 1, ..., 0, 0, 1]])

With these tools, I could create the graph from the givers and the getters, the location matches, and the affinities.

In [6]:
#I want the graph undirected, so I symmetrize by summing matches in both directions
give = np.dot(givers, np.transpose(getters))
receive = np.dot(getters, np.transpose(givers))
bothways = give+receive
#we had some records with no interests.
#since everyone at least responded, they must be at least a little interested
bothways = bothways + 1

#upweight pairs that share an endorsed identity
affinity_upvote = np.dot(affinity_flags, np.transpose(affinity_flags))
bothways = bothways + affinity_upvote

#now I'm going to upweight every pair a little by location match
bothways = bothways + locationmatches
#and then zero out pairs where one person really wants a local match and the other isn't local
maskmatches = np.ones(bothways.shape)
#start with zeros for the local-only folks
maskmatches[responses['local']==1,:] = 0
maskmatches[:,responses['local']==1] = 0
#and write back over it with the location matches
maskmatches = maskmatches + locationmatches
maskmatches[maskmatches==2] = 1

bothways = np.multiply(bothways, maskmatches)

Finding the Best Pairs

Finding the best pairs at this point was simply a matter of applying an existing algorithm, the blossom algorithm, which finds a maximum-weight matching in a general graph.
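
As a minimal illustration of what the matching returns, here's a toy graph (invented weights, nothing to do with the survey):

import networkx as nx

toy = nx.Graph()
toy.add_weighted_edges_from([(0, 1, 3), (1, 2, 5), (2, 3, 4), (0, 3, 1)])

#the matching pairs 0 with 1 and 2 with 3 (total weight 3 + 4 = 7),
#which beats keeping only the single heaviest edge (1, 2) at weight 5
print(nx.max_weight_matching(toy))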

In [7]:
######################
##Pair graph
G = nx.Graph(bothways)
G.remove_edges_from(G.selfloop_edges())

#the graph is not bipartite, so I'm using a blossom algorithm to find the match
matches=nx.max_weight_matching(G)

We would be done, but there are orphans.

In [8]:
#pos = nx.kamada_kawai_layout(G)
#plot just the matched edges
plt.figure(figsize=(3,3))
nx.draw_spring(G, edgelist=matches, node_color='#089000', node_size=100)
plt.axis('off')
plt.show()

That's because of the location restriction. These are good matches, but not everyone can match because we don't have two people in every state.

To fix it, I simply ran the matching twice more, suppressing the location restriction and the previous matches. I had to run it twice because I had an odd number of people, and I still had a straggler after round 2.

In [9]:
###############
#Backup matches: rebuild the weights without the location restriction
bothways = give+receive
#we had some records with no interests.
#since everyone at least responded, they must be at least a little interested
bothways = bothways + 1

#upweight pairs that share an endorsed identity, as before
affinity_upvote = np.dot(affinity_flags, np.transpose(affinity_flags))
bothways = bothways + affinity_upvote

#remove the existing matches so everyone gets a new one
G = nx.Graph(bothways)
G.remove_edges_from(G.selfloop_edges())
G.remove_edges_from(matches)

matches2 = nx.max_weight_matching(G)

###############
#third round: drop the second-round matches and match again
G.remove_edges_from(matches2)

matches3 = nx.max_weight_matching(G)

Getting the Results

A handy way to get the matches from all three rounds into a single representation is to rebuild the full graph, tag each matched edge with the round it came from, and then dump all the links into a single file.

In [10]:
G = nx.Graph(bothways)

#tag each matched edge with the round ("level") it came from
for match in matches:
    G.edges[match]['level'] = 1
for match in matches2:
    G.edges[match]['level'] = 2
for match in matches3:
    G.edges[match]['level'] = 3

#dump the edge list to a dataframe and keep only the matched edges
match = nx.to_pandas_edgelist(G)
match['level'] = match['level'].fillna(0)
match = match[match.level>0]

match.head(5)
Out[10]:
     level  source  target  weight
96     3.0       0      96      11
123    2.0       0     123      12
219    1.0       0     219      14
262    3.0       1      12       1
361    1.0       1     111       2

The Payoff

There's still some work to do to tidy these data, but you're not here for munging. This is the network.

What I like about this is that a "matching" in the formal graph-theory sense is a set of totally disconnected pairs. We could do that, but it doesn't really serve the use case.

What we're trying to do is make introductions that make sense. By making multiple matches, we knit the pairs together into a connected network. That means that if I missed something, the network can support me by offering referral opportunities.
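
If you want to check that claim directly, here's a quick sketch (assuming G holds only the matched edges, as built in the plotting cell below):

#sanity check: how connected did three rounds of matching leave us?
components = list(nx.connected_components(G))
print(len(components), 'component(s); the largest has', max(len(c) for c in components), 'people')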

In the chart below, darker lines are matches from earlier iterations. First iteration is black; second iteration is blue; third iteration is gray.

The thickness of each line reflects the strength of the relationship under all the weighting I did up front.

In [11]:
G = nx.from_pandas_edgelist(match, source='source', target='target', edge_attr=['weight','level'])
cols = ['black','blue','grey']
#scale each line's width by match strength and color it by round
weights = []
levels = []
for u, v, data in G.edges(data=True):
    weights.append(data['weight']/10)
    levels.append(cols[int(data['level']-1)])
pos = nx.spring_layout(G)
plt.figure(figsize=(12,12))
nx.draw_networkx_nodes(G, pos, node_color='#089000', node_size=100)
plt.axis('off')
nx.draw_networkx_edges(G, pos=pos, width=weights, edge_color=levels, arrows=False)
plt.show()

I did a bunch of QC on this analysis, but I did it all in R because I don't speak Python. However, the graph itself shows that at least the weighting stuff is working. The earlier matches (black) are thicker than the later matches (gray).

Finally, the last smidge of munging to create the match list.

In [12]:
#flip the edges so each match shows up from both people's perspectives
flipped = match[['level','target','source','weight']]
flipped.columns = match.columns
match = pd.concat([match, flipped], axis=0)
#make it wide: one row per person, one column per match level
#each (person, level) cell holds a single target, so the mean just returns that id
match = pd.crosstab(index=match['source'], columns=match['level'], values=match['target'], aggfunc=np.mean)

If I just dump out 0-indexed ids, the operations people will make mistakes. I'm adding the email addresses too so they can check their own work.

In [13]:
#look up each match's email address by row id and attach it
match1_email = responses['Email?'][match.iloc[:,0]].reset_index()
match2_email = responses['Email?'][match.iloc[:,1]].reset_index()
match3_email = responses['Email?'][match.iloc[:,2]].reset_index()
match['match1_email'] = match1_email.iloc[:,1]
match['match2_email'] = match2_email.iloc[:,1]
match['match3_email'] = match3_email.iloc[:,1]
match['source_email'] = responses['Email?']
In [14]:
#at last! Everyone has at least one match
match.to_csv('/Users/rikiconrey/Documents/community/index_match/matches.csv')