An analysis of the gender and topic diversity of the 2017 SXSW Interactive speaker and session lineup
How diverse is SXSW, really? With my growing curiosity about data, and after a year of events showing that the world is not as diverse or accepting a place as I had naively thought, I decided to apply a critical eye and a bit of analytics to this year’s SXSW Interactive lineup.
Used Selenium and Python to scrape all session and speaker pages from the SXSW schedule. Collected the following about speakers: URL, name, title, company, bio. Collected the following about sessions: title, speaker URLs, description.
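A rough sketch of what the speaker scrape could look like. The CSS selectors, the Speaker class, and the function names are all hypothetical, since the actual page markup and scraper code are not shown here; only the collected fields (url, name, title, company, bio) come from the notes above.

```python
from dataclasses import dataclass

# Hypothetical CSS selectors: the real SXSW page markup is not given in the text.
SPEAKER_SELECTORS = {
    "name": "h1.speaker-name",
    "title": ".speaker-title",
    "company": ".speaker-company",
    "bio": ".speaker-bio",
}


@dataclass
class Speaker:
    url: str
    name: str
    title: str
    company: str
    bio: str


def scrape_speaker(driver, url):
    """Load one speaker page and pull out the fields listed above."""
    driver.get(url)
    fields = {}
    for field, selector in SPEAKER_SELECTORS.items():
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        matches = driver.find_elements("css selector", selector)
        # Missing fields become empty strings instead of raising.
        fields[field] = matches[0].text.strip() if matches else ""
    return Speaker(url=url, **fields)


def scrape_all(speaker_urls):
    """Open a browser and scrape every URL (needs selenium and a Chrome driver)."""
    # Imported here so the parsing code above loads even without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        return [scrape_speaker(driver, url) for url in speaker_urls]
    finally:
        driver.quit()
```

Session pages could be handled the same way, with a second selector map for title, speaker URLs, and description.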
Data munging –
Ran a word frequency analysis on bios and positions to find the words most useful for an inference algorithm that classifies each speaker’s gender and position level.
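For illustration, a word frequency pass like this can be done with the standard library alone. The stopword list and the sample positions below are made up for the example, not taken from the scraped data.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; the original list is not given).
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to", "with"}


def word_frequencies(texts):
    """Count how often each non-stopword appears across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# Made-up sample positions, standing in for the scraped speaker titles.
sample_positions = [
    "Founder and CEO",
    "Director of Marketing",
    "CEO",
    "Senior Director of Product",
]
position_counts = word_frequencies(sample_positions)
top_words = position_counts.most_common(3)
```

The most frequent words (here "ceo" and "director") are the natural candidates for classification rules.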
Wrote an inference algorithm for position level. Categorized about 600 of 3,200 manually.
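The actual rules are not listed here, so the following is only a guess at the general shape: ordered keyword rules applied to the position text, with anything unmatched left for manual categorization. The level names and keywords are hypothetical.

```python
import re

# Hypothetical rules, checked in order. "vp" comes first so that
# "Vice President" is not caught by the "president" keyword below it.
LEVEL_RULES = [
    ("vp", ["vice president", "vp"]),
    ("executive", ["ceo", "cto", "coo", "cmo", "chief", "founder", "president"]),
    ("director", ["director", "head of"]),
    ("manager", ["manager"]),
]


def classify_position(title):
    """Return the first matching level, or None if the title needs manual review."""
    text = title.lower()
    for level, keywords in LEVEL_RULES:
        for keyword in keywords:
            # Word boundaries keep "cto" from matching inside "director".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return level
    return None
```

Titles that return None are the ones that would end up in the manual pile.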
Used a combination of the genderizer Python library (https://pypi.python.org/pypi/genderizer/0.1.2.3) and gender inference to determine gender. Looked up and filled in 103 manually. This resulted in 92% of speakers being categorized by gender programmatically.
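The notes do not spell out how the gender inference step worked. One plausible approach, assumed here purely for illustration, is to count gendered pronouns in each bio and leave ties or pronoun-free bios for manual lookup. The word lists and function name are hypothetical, and the genderizer library call is deliberately left out.

```python
import re

# Assumed pronoun lists; the words actually used are not given in the text.
FEMALE_WORDS = {"she", "her", "hers", "herself"}
MALE_WORDS = {"he", "him", "his", "himself"}


def infer_gender_from_bio(bio):
    """Guess gender from pronoun counts; return None when the bio is inconclusive."""
    words = re.findall(r"[a-z]+", bio.lower())
    female = sum(w in FEMALE_WORDS for w in words)
    male = sum(w in MALE_WORDS for w in words)
    if female > male:
        return "female"
    if male > female:
        return "male"
    return None  # no pronouns or a tie: fall back to a name lookup or manual entry
```

A bio written in the first person gives no signal this way, which is one reason a name-based library and a manual pass would still be needed.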
Ran text analytics on session descriptions to determine each session’s topic: ran a word frequency analysis, created a boolean for each of the most popular topics, and categorized sessions accordingly.
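As a sketch of the boolean-per-topic idea: once the popular topics are known from the word frequencies, each session description can be checked against a keyword set per topic. The topics and keywords below are invented for the example and are not the ones used in the analysis.

```python
import re

# Hypothetical topics and keywords, standing in for the ones found by word frequency.
TOPIC_KEYWORDS = {
    "vr": {"vr", "virtual"},
    "ai": {"ai", "artificial", "machine"},
    "health": {"health", "healthcare", "medical"},
}


def topic_flags(description):
    """Return one boolean per topic: True if any of its keywords appears."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {topic: bool(words & keywords) for topic, keywords in TOPIC_KEYWORDS.items()}


flags = topic_flags("How virtual reality and AI are changing healthcare.")
```

Because the flags are independent booleans, a single session can fall under several topics at once.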