I met a female computer scientist for the first time in college. She was my professor for my summer lab research! Before that, I had never actually met a female engineer or computer scientist. As I continued my education, I began to notice how few women professors there were in my college's computer science/engineering department. If my school's department looks like this, what do other CS departments in the US look like? Specifically, how does the representation of women on CS faculties compare to that of men in the US?

I gathered samples from 7 university CS departments, described below.

## Data Source:

If I repeated this analysis, I would aim for a larger sample size.

## Web Scraping

I wrote the functions below to scrape data from MIT, Stanford, and Cal. The other schools' data were scraped by my teammates.

I initialized the gender column to be all male and then changed it to female where appropriate for each school. There may be some bias here, as I manually judged whether each individual is male or female based on the picture provided by the university.

    from requests import get
    from bs4 import BeautifulSoup
    import re
    import pandas as pd
    import urllib.request
    import numpy as np


    def lst_data(website: str, tag: str, attrs_key: str, attrs_txt: str):
        # Fetch the page and return every `tag` element whose
        # `attrs_key` attribute matches the regex `attrs_txt`
        response = get(website)
        html = BeautifulSoup(response.text, 'html.parser')
        name_data = html.find_all(tag, attrs={attrs_key: re.compile(attrs_txt)})
        return name_data


    # names = [first name, last name] in the faculty listing
    def index_values(names, name_data):
        # Return the index in name_data of the first element matching each name
        name_str = [str(x) for x in name_data]
        lst = []
        for name in names:
            new_list = [i for i, x in enumerate(name_str) if re.search(name, x)]
            lst.append(new_list[0])
        return lst


    # initialize all as male and change to female accordingly
    def make_df(name_lst, school, female_lst):
        df = pd.DataFrame({'Name': name_lst, 'School': school, 'Gender': 'male'})
        df.index = df['Name']
        df.loc[female_lst, 'Gender'] = 'female'
        df = df.reset_index(drop=True)
        return df

The following is an example of how I scraped faculty names from Stanford.

    name_data = lst_data('https://cs.stanford.edu/directory/faculty', 'a', 'href', '^http')

    # Returns index values [8, 67]. Use to index name_data
    index_values(['Maneesh Agrawala', 'Matei Zaharia'], name_data)

    # Slice end is exclusive, so 8:68 covers indices 8 through 67
    lst = []
    for faculty in name_data[8:68]:
        lst.append(faculty.text)

    # female faculty names
    female_lst = ['Jeannette Bohg', 'Emma Brunskill', 'Chelsea Finn', 'Monica Lam', 'Karen Liu', 'Dorsa Sadigh',
                  'Caroline Trippel', 'Jennifer Widom', 'Mary Wootters']

    stanford_df = make_df(lst, 'Stanford', female_lst)

After collecting and cleaning the data, I computed the proportion of female faculty members within the CS department of each school. The rest of the code to process the data is linked in my GitHub at the end of the article.
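
As a rough illustration of that step, here is a minimal sketch of how per-school proportions could be computed with pandas. The rows are made-up placeholders in the same `Name` / `School` / `Gender` shape that `make_df` returns, not real faculty data, and the variable names `faculty_df` and `female_prop` are my own assumptions. In practice the per-school frames would first be stacked together, for example with `pd.concat`.

```python
import pandas as pd

# Hypothetical rows in the same shape that make_df returns (NOT real faculty data)
faculty_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'School': ['School X'] * 4 + ['School Y'] * 2,
    'Gender': ['female', 'male', 'male', 'male', 'female', 'male'],
})

# Mark each row True/False for 'female', then average within each school.
# The mean of a boolean column is the fraction of True values.
female_prop = (
    faculty_df['Gender'].eq('female')
    .groupby(faculty_df['School'])
    .mean()
)
print(female_prop)
```

The resulting Series has one proportion per school, which is the kind of sample the hypothesis test takes as input.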

## Hypothesis Test: 1-Sample T-Test

**Why a 1-sample t-test:** The sample size is below 30 (only 7 schools), and the population standard deviation is unknown.

**Sample:** The proportion of female faculty members within the CS department of each school (Figure 1).

**Null Hypothesis:** p = 0.5. We are testing whether the percentage of female CS faculty is equal to the percentage of male CS faculty, which would put female CS faculty at 50%.

**Significance Level (alpha):** 5%, meaning there is a 5% chance that we reject the null hypothesis when it is actually true.

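    from scipy.stats import ttest_1samp

    tset, pval = ttest_1samp(x, 0.5)  # x = sample
    print('t-statistic:', tset)
    print('pval:', pval)

    if pval < 0.05:  # alpha value is 0.05 or 5%
        print("Reject")
    else:
        # A non-significant result means we fail to reject the null,
        # not that the null is proven true
        print("Fail to reject")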
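
For intuition about what `ttest_1samp` computes, here is a minimal sketch of the t-statistic worked out by hand using only the standard library. The proportions are made-up placeholders, not the data from this article, and the helper name `one_sample_t` is hypothetical. The critical value 2.447 is the standard two-sided 5% t-table entry for 6 degrees of freedom.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return (mean - mu0) / se, n - 1


# Hypothetical proportions for seven schools (NOT the article's data)
hypothetical_props = [0.15, 0.18, 0.20, 0.22, 0.17, 0.25, 0.19]
t_stat, dof = one_sample_t(hypothetical_props, 0.5)

# Two-sided 5% critical value for 6 degrees of freedom, from a standard t table
T_CRIT_DF6 = 2.447
reject_null = abs(t_stat) > T_CRIT_DF6
print(f"t = {t_stat:.2f}, df = {dof}, reject null: {reject_null}")
```

With seven schools there are 6 degrees of freedom, so an absolute t-statistic above roughly 2.447 leads to rejecting the null at the 5% level. A sample mean far below 0.5 with little spread between schools produces a large negative t-statistic.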

Running the test on our sample (Figure 1) gives a p-value of 2.82e-07. In other words, if the null hypothesis were true, the probability of observing sample data at least as extreme as ours would be 0.0000282%. Since the p-value is less than the 5% significance level, we reject the null hypothesis. The test suggests that the percentage of women in CS faculty positions is not equal to the percentage of men in these positions.

Why is that? What do the gender demographics of qualified candidates for CS faculty positions look like?