Jupyter at Bryn Mawr College |
|||
Public notebooks: /services/public/dblank / Experiments |
This is a Jupyter notebook using Python3 to show how to explore the data mentioned in:
Still No African-Americans Taking the AP Computer Science Exam in Nine States by Liana Heitin and based on the analysis by Barbara Ericson, Director of Computing Outreach and a senior research scientist at Georgia Tech.
This post was mentioned in Facebook's CSEdForum by Mark Guzdial; check out his Computing Education Blog.
You can replicate (and try variations of) this notebook here:
Or by downloading it (upper right-hand corner) and uploading it to your own Jupyter notebook server:
This notebook was created:
Please feel free to copy, explore, and share! If you make corrections or additions, send me email and I'll update this notebook.
In addition to using the Jupyter notebook, we assume that you have the following installed:
In addition, we will install these lesser-known libraries:
!pip install openpyxl --upgrade --user
!pip install metakernel --upgrade --user
And now we can:
import openpyxl ## used in opening Excel spreadsheets
import metakernel
metakernel.register_ipython_magics() ## used for its %download magic
As the article above points out, you can get the data from this site:
Here we show downloading one, after getting the link from above:
%download https://secure-media.collegeboard.org/digitalServices/pdf/research/2015/2015-Top-200-Colleges.xlsx
This is an interesting experiment showing the difficulty in getting and analyzing data. Note:
Be on the lookout for inconsistent data types. For example, in the following spreadsheet, the year format changes from being an integer to being a string inexplicably in 1997.
Also note:
I have no idea what the limits to using this data are. Use at your own risk.
First we download the data file:
%download http://media.collegeboard.com/digitalServices/misc/ap/pennsylvania-summary-2015.xlsx
Then we use the openpyxl library to read in the data, and show the worksheet names. (I don't know this library, but after creating the Python object, you can type a dot followed by a TAB to show the methods. Or read the sparse docs.)
workbook = openpyxl.load_workbook('pennsylvania-summary-2015.xlsx')
print(workbook.get_sheet_names())
Using trial and error, we write a function that takes a workbook, finds the right sheet, and returns the data. This may perhaps work on any state's summary spreadsheet.
def get_annual_participation(workbook):
worksheet = workbook.get_sheet_by_name("Annual Participation")
data = {}
row_count = 0
for row in worksheet.iter_rows():
year = row[0].value
if (type(year) == str and year.isdigit()) or type(year) == int:
year = str(row[0].value)
students = int(row[2].value)
data[year] = students
row_count += 1
data_keys = sorted(data.keys())
data_values = [data[k] for k in data_keys]
return data_keys, data_values
We call the function to get the data returned as [keys, values] where keys are the years (as strings) and values are the students taking the exam.
data = get_annual_participation(workbook)
For this demo, I'll use the standard library matplotlib
because it is very common. I don't particularly like the interface, but it is flexible if not easy to use. I also use the standard method (in the sciences, especially Biology and Physics) of aliasing matplotlib.pyplot
to be plt
.
%matplotlib inline
import matplotlib.pyplot as plt
We can now plot our labels/data with this call to plt.plot()
:
plt.plot(*data);
This is all students, by year, taking the an AP exam in Pennsylvania.
What about in just a specific field, like Computer science? The data doesn't appear to be year in an annual breakdown, but we can get it for this year.
The field/topic names are this in the spreadsheet:
['ART: \nHISTORY', 'BIOLOGY', 'CALCULUS AB', 'CALCULUS BC', 'CHEMISTRY', 'CHINESE LANG. \n& CULTURE', 'COMPUTER \nSCIENCE A', 'ECONOMICS: \nMACRO', 'ECONOMICS: \nMICRO', 'ENGLISH LANG-\n&COMPOSITION', 'ENGLISH LIT-\n&COMPOSITION', 'ENVIRONMENTAL \nSCIENCE', 'EUROPEAN \nHISTORY', 'FRENCH LANG. & CULTURE', 'GERMAN LANG. & CULTURE', 'GOVERNMENT &\nPOLITICS COMP.', 'GOVERNMENT &\nPOLITICS U.S.', 'HUMAN\nGEOGRAPHY', 'ITALIAN LANG. & CULTURE', 'JAPANESE LANG. & CULTURE', 'LATIN\n', 'MUSIC THEORY', 'PHYSICS 1', 'PHYSICS 2', 'PHYSICS C:\nELEC. & MAGNET.', 'PHYSICS C:\nMECHANICS', 'PSYCHOLOGY', 'SEMINAR', 'SPANISH LANG. & CULTURE', 'SPANISH LIT &\nCULTURE', 'STATISTICS', 'STUDIO ART: \n2-D DESIGN', 'STUDIO ART: \n3-D DESIGN', 'STUDIO ART: \nDRAWING', 'US HISTORY', 'WORLD HISTORY']
def get_ethnic_breakdown_by_topic(workbook, topic):
worksheet = workbook.get_sheet_by_name("All")
data = []
ethnicity = []
last_ethnic_group = None
row_count = 0
topics = None
for row in worksheet.iter_rows():
ethnic_group = row[1]
ap_score = row[2]
if row_count < 5:
pass
elif row_count == 5: ## get topics, remove newlines
topics = [item.value for item in row]
else:
if ethnic_group.value is not None and not ethnic_group.value.startswith(" "):
if ethnic_group.value.strip() != last_ethnic_group:
last_ethnic_group = ethnic_group.value.strip()
ethnicity.append(last_ethnic_group)
if ap_score.value == "T":
col = topics.index(topic)
data.append(row[col].value)
row_count += 1
## The sum of all but last should equal last:
assert sum(data[:-1]) == data[-1]
results = data[:-1]
return [v.strip().title() for v in ethnicity[:len(results)]], results
data = get_ethnic_breakdown_by_topic(workbook, "COMPUTER \nSCIENCE A")
data
Now, let's turn that into a bar chart:
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(111)
width = .35
index = np.arange(len(data[1]))
bars = ax.bar(index, data[1]);
xTickMarks = data[0]
ax.set_xticks(index+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=90, fontsize=10);
That code is way too complicated for me to produce a bar chart! I'm looking for another solution for me and students.
Hope this was useful!