Dec 02, 2013
 

In my CS-135 Programming for Non-CS Majors class, one of the primary objectives for the students is to learn to work with collections of data in files. I’m always happy when this requires manipulations that can’t be performed with other tools that the students are comfortable with — thus motivating the need to learn to code.

This afternoon in class, students were working in groups on their final projects. Two groups came up against some problems in getting their data into a format that could be easily processed in Python. Both cases involved data that was only available in the form of PDF files.

The old standby of selecting text and pasting it into Excel did not provide nice columns of information. Our second attempt was to export the data as text.

Case 1

In the first case, we got text data that looked like:

Biology 306 N/A 306
Biotechnology 80 26 106
Business Administration 748 N/A 748
Chemistry 141 N/A 141
Communication 245 N/A 245
Communication Sciences & Disorders 218 N/A 218
Community Health 158 N/A 158
Computer Science 116 N/A 116
Criminal Justice 445 N/A 445
Early Childhood Education 80 19 99
Early Childhood Education, Non-Licensure 26 N/A 26

This looked promising. We’ve dealt with space-delimited, one-record-per-line data files in class before: you read a line at a time and use Python’s string split method to turn it into a list. But wait! The first item is a variable number of words separated by spaces. That makes for some messy lists, all of different lengths:

['Communication', '245', 'N/A', '245']
['Communication', 'Sciences', '&', 'Disorders', '218', 'N/A', '218']
['Community', 'Health', '158', 'N/A', '158']
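
To see the problem for yourself, here is a small sketch that splits three of the sample lines and reports how many items each list ends up with. The variable names (sample_lines, fields_per_line, lengths) are just ones I picked for illustration.

```python
# Three lines copied from the exported text above.
sample_lines = [
    "Communication 245 N/A 245",
    "Communication Sciences & Disorders 218 N/A 218",
    "Community Health 158 N/A 158",
]

# split() with no arguments breaks each line on runs of whitespace.
fields_per_line = [line.split() for line in sample_lines]
lengths = [len(fields) for fields in fields_per_line]

for fields in fields_per_line:
    print(len(fields), fields)
```

The lengths differ only because of the department name; the three counts are always the last three items.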

Here’s the solution: Python lists can be indexed from the end using negative indices. So, we can definitely get at the last three values (numbers of majors — undergraduate, graduate, and total). Assuming a list in a variable department, they are at positions department[-3], department[-2], and department[-1] respectively.

But, what about the department name, which may be in multiple list items? Well, we can get it as a sub-list, using list slicing: department[:-3] yields:

['Communication']
['Communication', 'Sciences', '&', 'Disorders']
['Community', 'Health']
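
As a quick check, the negative indices and the slice can be tried on one of the longer lines. This is only a sketch; the names undergraduate, graduate, total, and name_parts are my own labels for the pieces.

```python
department = "Communication Sciences & Disorders 218 N/A 218".split()

# The last three items are always the counts, however long the name is.
undergraduate = department[-3]
graduate = department[-2]
total = department[-1]

# Everything before the last three items belongs to the department name.
name_parts = department[:-3]

print(name_parts)
print(undergraduate, graduate, total)
```

Note that the counts are still strings at this point, so they would need converting before doing any arithmetic with them.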

All that’s left is to concatenate them together into a single string:

name = ''
for item in department[:-3]:
    name = name + item + ' '
name = name.strip()  # drop the trailing space left by the loop
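
Putting the pieces together, here is one way the whole of Case 1 might look as a function. It is a sketch rather than the code from the gist: parse_department_line and to_count are hypothetical helpers, ' '.join(...) is a built-in alternative to the concatenation loop, and turning 'N/A' into None is my own assumption about how a missing count should be represented.

```python
def parse_department_line(line):
    """Split one line into (name, undergraduate, graduate, total).

    Hypothetical helper for illustration; not taken from the gist.
    """
    fields = line.split()
    # Join the name words back together with single spaces.
    name = " ".join(fields[:-3])

    def to_count(text):
        # Assumption: 'N/A' means no value was reported, so use None.
        return None if text == "N/A" else int(text)

    undergraduate, graduate, total = (to_count(field) for field in fields[-3:])
    return name, undergraduate, graduate, total


record = parse_department_line("Early Childhood Education, Non-Licensure 26 N/A 26")
print(record)
```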

Full code is here: https://gist.github.com/kwurst/7761789

Case 2

In the second case, we got text data that looked like:

Boston    00350000    4368    65.9    15.2    0.8    2.1    15.9    0.1
Boston Collegiate Charter (District)    04490000    34    67.6    32.4    0.0    0.0    0.0    0.0
Boston Day and Evening Academy Charter (District)    04240000    162    13.0    55.6    0.0    6.8    24.7    0.0
Boston Green Academy
Horace Mann Charter School
(District)    04110000    72    70.8    26.4    0.0    1.4    1.4    0.0
Boston Preparatory Charter Public (District)    04160000    27    74.1    11.1    0.0    3.7    11.1    0.0
Bourne    00360000    145    90.3    4.8    0.0    2.1    2.8    0.0
Braintree    00400000    369    95.1    3.3    0.3    0.3    1.1    0.0

This could be handled the same way, except that some of the district names ended up broken across multiple lines. (I’m not sure why this happened, and it turned out that exporting the data in a different way fixed the problem. But I’d already found a solution, so I’m going to document it here…)

Working from the assumption that the district org code always starts with a zero (I know — not a good assumption, but it works in this case…), the solution involves checking for lines with no zero in them and concatenating them together. Then you can treat the lines as in Case 1.

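for line in f:
    while '0' not in line:  # no org code yet, so this is only part of a name
        next_line = f.readline()
        if not next_line:  # end of file: stop rather than loop forever
            break
        line = line + next_line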
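
Here is a self-contained sketch of the same idea that can be run without the original file. It is a variation on the loop above, not the code from the gist: instead of calling readline() inside the for loop, it accumulates fragments until a '0' shows up. The helpers merged_lines and parse_district_line are hypothetical, io.StringIO stands in for the exported text file, and the slice positions assume every record ends with the org code followed by seven numeric columns, as in the sample data.

```python
import io


def merged_lines(f):
    """Yield complete records, joining name fragments onto the line that follows.

    Assumption (same as in the post): a record is complete once it contains a '0'.
    """
    pending = ""
    for line in f:
        pending = (pending + " " + line.strip()).strip()
        if "0" in pending:
            yield pending
            pending = ""
    # A fragment left over at the end of the file is never completed, so it is dropped.


def parse_district_line(line):
    """Hypothetical helper: split a record into (name, org_code, values)."""
    fields = line.split()
    # The last eight fields are the org code followed by seven numeric columns.
    name = " ".join(fields[:-8])
    org_code = fields[-8]  # kept as a string to preserve the leading zeros
    values = fields[-7:]   # left as strings here; convert as needed
    return name, org_code, values


# Sample lines copied from the data above, including the broken record.
sample = io.StringIO(
    "Boston Collegiate Charter (District)    04490000    34    67.6    32.4    0.0    0.0    0.0    0.0\n"
    "Boston Green Academy\n"
    "Horace Mann Charter School\n"
    "(District)    04110000    72    70.8    26.4    0.0    1.4    1.4    0.0\n"
    "Bourne    00360000    145    90.3    4.8    0.0    2.1    2.8    0.0\n"
)

records = [parse_district_line(line) for line in merged_lines(sample)]
for record in records:
    print(record)
```

Keeping the org code as a string matters here: converting it to a number would throw away the leading zeros.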

Full code is here: https://gist.github.com/kwurst/7761789