This week marks the completion of Y Combinator for Bayes Impact! As our Fall 2014 Fellowship ramps up (250+ applicants!), we wanted to write a blog post illustrating exactly how we can use data to understand public services better. We have been exploring Seattle’s Police Report data and will walk you through, step by step, the questions we asked, the data we analyzed, the conclusions we drew, and the new questions we want to follow up on. The Seattle Police data can be acquired here – you can download the entire dataset by clicking on export. As with all of our blog posts, we have made the entire code base available on GitHub.
Understanding the Data
This data source contains information about suspicious and criminal events, specifically where they took place and when they happened. The above map shows how the city of Seattle is broken up into five precincts and smaller police beats. For the uninitiated, a beat is the territory and time that a police officer patrols. We know the events recorded take place within the city of Seattle, but we don’t know the time range they took place in.
Tip 1: Before analyzing the data, we should understand when the events happened and whether the system that records the data, also known as the data generating mechanism, is biased toward a particular period of time.
A natural first question is whether criminal activity varies by day of the week. If the data recorded here did not include certain weekends, we could mistakenly conclude that criminal activity on Saturdays and Sundays is low.
First we will check how many Mondays, Tuesdays, etc. have data recorded in the dataset:
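The coverage check described above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from the post's repository; the function name `dates_per_weekday` and the sample timestamps are made up for the example.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def dates_per_weekday(timestamps):
    """Count how many distinct calendar dates fall on each weekday."""
    seen = defaultdict(set)
    for ts in timestamps:
        # weekday() returns 0 for Monday through 6 for Sunday
        seen[DAY_NAMES[ts.weekday()]].add(ts.date())
    return {day: len(dates) for day, dates in seen.items()}


# Made-up timestamps purely for illustration (not taken from the Seattle data).
sample = [
    datetime(2014, 3, 3, 9, 15),   # a Monday
    datetime(2014, 3, 3, 22, 40),  # same Monday, so the date is counted once
    datetime(2014, 3, 10, 1, 5),   # the following Monday
    datetime(2014, 3, 4, 13, 0),   # a Tuesday
]
coverage = dates_per_weekday(sample)
print(coverage)
```

If the counts per weekday come out roughly equal, no day of the week is over- or under-represented in the recording period.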
Great! Looks like there is no bias towards any particular day of week. Knowing this, let’s go ahead and just get a plot of what types of crimes are recorded in this data:
There are lots of different types of crimes here, some very similar to each other and some very different. A categorical variable that takes on many different values complicates analysis; specifically, it can create a problem of high dimensionality, where there are too many possible combinations of values to model reliably.
Tip 2: We can simplify large categorical variables by binning them into a few major categories.
We solve this problem by defining a simpler category for crime type, which can be “minor”, “serious” or “violent”. Let’s ignore events that we would classify as minor and dig into our question of whether different days of the week influence criminal activity:
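A rough sketch of the binning and the weekday breakdown might look like this. The post does not list which event types go into which bin, so the `SEVERITY` mapping below is a hypothetical guess for illustration only, as are the function names and the sample events.

```python
from collections import Counter

# Hypothetical mapping: the assignment of event types to bins is an assumption.
SEVERITY = {
    "TRAFFIC RELATED CALLS": "minor",
    "SUSPICIOUS CIRCUMSTANCES": "minor",
    "BURGLARY": "serious",
    "NARCOTICS COMPLAINTS": "serious",
    "ASSAULTS": "violent",
    "ROBBERY": "violent",
}


def bin_event(event_type):
    """Map a raw event type to a coarse bin; unknown types are flagged."""
    return SEVERITY.get(event_type.strip().upper(), "unmapped")


def weekday_counts(events, keep=("serious", "violent")):
    """Count events per (weekday, bin), keeping only the bins in `keep`.

    `events` is an iterable of (weekday_name, event_type) pairs.
    """
    counts = Counter()
    for weekday, event_type in events:
        severity = bin_event(event_type)
        if severity in keep:
            counts[(weekday, severity)] += 1
    return counts


# Made-up events purely for illustration.
sample_events = [
    ("Tuesday", "Burglary"),
    ("Tuesday", "Traffic Related Calls"),  # minor, so it is dropped
    ("Saturday", "Assaults"),
    ("Tuesday", "narcotics complaints"),
]
counts = weekday_counts(sample_events)
print(counts)
```

Returning "unmapped" for unknown types, rather than silently defaulting to a bin, makes gaps in the mapping visible before they distort the counts.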
Here are some odd results: there is a large amount of variation across the weekdays for “serious” crimes, and Tuesdays seem to have the largest number of them. I would have expected these to fall mostly on a Friday or Saturday, as they do for “violent” crimes. Let’s probe further: perhaps some police beats have high amounts of serious crime on Tuesday and others have low amounts, but the average is high for Tuesday.
So What Happens on Tuesdays?
With all the variations between type of criminal activity, day of week and police beat, let’s formally quantify the relationships using a model. Out of the days that did have a crime, let’s quantify the probabilities that the crime was either minor, serious, or violent.
Tip 3: We can use a multinomial glm, a specific type of regression, to predict the probability of a categorical value, in this case the chance that the crime type is either minor, serious, or violent.
This equation says the odds of the crime type can be modeled by the day of week, the police beat, and an interaction term for day of week * police beat. The first term says that different days of the week influence the odds of a particular crime type, and that this time effect is the same for all police beats. The second term says that the police beat increases or decreases the overall odds of a specific crime type, but that this beat effect does not change over time. Lastly, the interaction term says that the effect of the day of week can vary according to the police beat. This final term supplements the first two because its effect changes with both time and location, whereas the first term changes only over time and the second only over location.
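One common way to write a model of this shape is a baseline-category multinomial logit. The form below is a sketch under assumptions: the symbols, the choice of “minor” as the reference category, and the indexing are ours and may differ from the original equation. Here d is the day of week, b is the police beat, and k is a non-baseline crime type.

```latex
\log \frac{P(\text{Crime type} = k \mid d, b)}{P(\text{Crime type} = \text{minor} \mid d, b)}
  = \beta_{0,k}
  + \beta^{\text{day}}_{k,d}
  + \beta^{\text{beat}}_{k,b}
  + \beta^{\text{day} \times \text{beat}}_{k,d,b},
\qquad k \in \{\text{serious}, \text{violent}\}
```

The three non-intercept terms correspond, in order, to the day-of-week effect, the beat effect, and the day-by-beat interaction described above.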
An abbreviated output of the model is summarized in the chart below, where each horizontal panel is a different police beat and shows the changing probabilities for different days of week and type of crime:
From this model we actually see that the general Tuesday effect on P(Crime Type = Serious) is negligible. However, there are interactions between beats and Tuesday that are positive and significant; this implies that there is a Tuesday effect that applies to some beats but not to others. For example, in police beat M2 the interaction term pushes its Tuesday effect up by a large amount, β=0.06. Its Wednesday effect is even stronger and dominates the Tuesday effect; this phenomenon also happens in K1, K2, and W1, even though we saw earlier that serious crimes happened most often on Tuesday. Note that in N3 it is true that serious crimes have the highest probability on Tuesday. One final interesting thing to note is how large the gap is between P(Crime Type = Serious) and P(Crime Type = Minor) in beat W1, but how narrow the gap is for a downtown beat like M3.
Which Beats Are Poorly Staffed?
We’ve explored the time component of this data; let’s now explore the geographic component. First we want to see where the hotspots for criminal/suspicious activity are:
It is not surprising that the downtown area has a high density, but the University District and Ballard also have high density, as does Rainier Ave S. The University District has a very different type of population than the downtown area, so what types of crimes are being addressed there?
In the University District, most calls are for Suspicious Circumstances. In the downtown area most police activity is for traffic, which is to be expected. When it comes to more serious or even violent crimes, are police in the downtown area bogged down by the traffic problems they need to deal with? Crimes can be time sensitive: for robberies, for example, the probability of recovering stolen goods decreases the longer the problem is unresolved. We want police beats to be staffed so that they can handle these time-sensitive issues efficiently; for example, we hope that robberies in the downtown area can be solved as quickly as anywhere else. To study this, let us look at the time it takes to clear injuries and threats:
For injuries, the downtown area does comparatively well. On average the downtown beats (M, K, D, and E) clear the issue in under 2 hours, whereas the University District and Queen Anne take 2.3 hours. When it comes to threats, we see the opposite: downtown beats take 2.2 hours, while the University District and Queen Anne take less than 2 hours.
To get a broader sense of whether or not resources are well balanced, we can look at summary statistics by crime and location for other beats. We want to see how efficient beats are at handling specific problems, so we’ll calculate the mean time until cleared for each crime type and beat. For a specific crime type, like Homicides, we would hope that the mean time until cleared is roughly the same across all the beats. To get a measurement of variation across the beats we’ll use the variance of the means:
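The variance-of-means calculation described above can be sketched as below. This is an illustration under assumptions: records are taken to be (crime type, beat, hours until cleared) tuples, the function name is hypothetical, the sample variance is used (the post does not say whether sample or population variance was computed), and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean, variance


def variance_of_beat_means(records):
    """Return {crime_type: sample variance of the per-beat mean clearance times}.

    `records` is an iterable of (crime_type, beat, hours_to_clear) tuples.
    """
    times = defaultdict(lambda: defaultdict(list))
    for crime_type, beat, hours in records:
        times[crime_type][beat].append(hours)

    result = {}
    for crime_type, by_beat in times.items():
        beat_means = [mean(hours) for hours in by_beat.values()]
        if len(beat_means) >= 2:  # a variance needs at least two beats
            result[crime_type] = variance(beat_means)
    return result


# Made-up clearance times (hours) purely for illustration.
sample = [
    ("ROBBERY", "M3", 1.0), ("ROBBERY", "M3", 3.0),  # beat mean 2.0
    ("ROBBERY", "W1", 4.0), ("ROBBERY", "W1", 4.0),  # beat mean 4.0
    ("PROPERTY DAMAGE", "M3", 1.0),
    ("PROPERTY DAMAGE", "W1", 1.0),
]
between = variance_of_beat_means(sample)
print(between)
```

Taking the square root of each value gives a standard deviation in hours, which is the scale used when speaking of "plus or minus" a number of hours across beats.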
Looking at this measurement of variation, we see that homicides can vary by plus or minus 6 hours (two standard deviations) from their mean time until cleared (this particular variation is expected because homicides are so rare). Prostitution, narcotics complaints, prowler, weapon calls, injury, assaults and robberies all vary by at least plus or minus 1 hour, a large variation across beats. Property damage varies by only 30 minutes, so we can reasonably claim that the police beats are equally efficient in handling those issues. The variation across the beats could be caused by two things:
1. Certain beats will naturally be more challenging or less challenging, for example because of the city layout or population density, causing wide variation.
2. The beats have an uneven distribution of police resources, for example not enough officers trained to handle certain types of crimes, like narcotics.
Tip 4: To abstract away the naturally occurring variance mentioned in (1), we can study the variance within each beat, as opposed to the variance between beats shown above.
Acknowledging (1) from above, we should feel comfortable that police beats can have different means, say 2 hours in beat A and 4 hours in beat B, and still be perfectly well staffed for clearing a crime, say assaults. To add more detail, suppose that the clearance times are 2 plus or minus 1 hour in A (standard deviation 30 minutes) and 4 plus or minus 2 hours in B (standard deviation 1 hour). What we may want to look for is consistency in results within each beat, not consistency across the beats. The clearance time in beat B has a much larger variance, which implies inconsistent results, but again it could be reasonable for the variance to be that high if B is naturally more challenging to work in. We can use the coefficient of variation (CV), which is the standard deviation normalized by the mean, to argue that the two beats are comparable. In this scenario the coefficient of variation is equal between the beats at 0.25 (0.5/2 for A and 1/4 for B).
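The arithmetic in this example can be checked in a few lines. The helper name `coefficient_of_variation` is ours; the inputs are the means and standard deviations stated in the paragraph above.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean (mean must be positive)."""
    if mean <= 0:
        raise ValueError("the CV is only meaningful for a positive mean")
    return std_dev / mean


# Beat with a 2-hour mean and a 30-minute (0.5 hour) standard deviation.
cv_two_hour_beat = coefficient_of_variation(0.5, 2.0)
# Beat with a 4-hour mean and a 1-hour standard deviation.
cv_four_hour_beat = coefficient_of_variation(1.0, 4.0)

print(cv_two_hour_beat, cv_four_hour_beat)  # 0.25 0.25
```

Both beats come out at 0.25, so relative to their own means they are equally consistent, even though the slower beat has the larger raw variance.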
Using this measurement, we hope that the CV for each crime type and beat is low and that the CV is roughly equal across the beats for any particular crime type:
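Computing the CV for every crime type and beat from raw records could be sketched like this. As before, the record layout, the function name, the beat labels "X" and "Y" and the numbers are illustrative assumptions, not the post's actual code or data.

```python
from collections import defaultdict
from statistics import mean, stdev


def cv_by_type_and_beat(records):
    """Return {(crime_type, beat): CV of clearance times within that group}.

    `records` is an iterable of (crime_type, beat, hours_to_clear) tuples.
    Groups with fewer than two records or a non-positive mean are skipped.
    """
    groups = defaultdict(list)
    for crime_type, beat, hours in records:
        groups[(crime_type, beat)].append(hours)

    result = {}
    for key, hours in groups.items():
        if len(hours) >= 2 and mean(hours) > 0:
            result[key] = stdev(hours) / mean(hours)
    return result


# Made-up clearance times (hours) for two hypothetical beats.
sample = [
    ("NARCOTICS", "X", 0.5), ("NARCOTICS", "X", 1.5),  # wide spread around 1.0
    ("NARCOTICS", "Y", 1.8), ("NARCOTICS", "Y", 2.2),  # tight spread around 2.0
    ("NARCOTICS", "Z", 3.0),                           # single record, skipped
]
within = cv_by_type_and_beat(sample)
print(within)
```

In this toy data beat X has the lower mean but the higher CV, which is the pattern to look for when asking whether a beat handles a crime type inconsistently.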
There are a few things that stand out when we combine these two variance plots. Earlier we said that robberies have a large variance across police beats, but once we normalize by the mean within each beat, the coefficient of variation looks similar across the board. This implies the variation in robberies across the beats is not unreasonable. The only beat that stands out for robberies is district Q, which has a mean clearance time of 1.5 hours but a standard deviation of 1.7 hours! More police resources could be assigned here to make the resolution time more consistent.
Another panel that stands out is the Narcotics panel; again in district Q we see that the mean clearance time is 0.88 hours but the standard deviation is 1.24 hours (CV ≈ 1.41). In district B the problems are handled more consistently, with a mean of 1.9 hours and a standard deviation of 1.37 hours (CV ≈ 0.72). The difference in CV, which accounts for differences in difficulty between beats, implies an imbalance in police resources that could be improved through better resource allocation. This insight does not seem farfetched, as narcotics problems require officers with specific training and there could easily be an undersupply of these types of officers, making resource scheduling very difficult.
Police departments serve a variety of neighborhoods and face the challenge of knowing how to distribute resources and, even more difficult, of building a system that can keep up with changing demand. Using data and models we can come up with computer systems that help with scheduling and staffing. In Seattle the distribution of crime types differs by area: some areas are dominated by very minor issues, while areas like downtown are plagued with more serious and violent crimes. These different distributions have an effect on the amount of time it takes to clear an issue; if the city can staff according to these distributions, we may see a decrease in the variation among police beats.
Furthermore, if we can quantify the variation across physical locations and time, the Seattle Police Department can optimize the deployment of officers. We saw through the multinomial glm that the distribution of crime type in the downtown police beats is different from that in other areas – there is a smaller proportion of minor crimes and a larger proportion of serious crimes, meaning the same number of people downtown will naturally have a higher demand for law enforcement resources than people living outside of downtown. One opportunity might be to reassign officers from area W1, which has a larger proportion of minor crimes, to M3, which is located nearby but is weighed down by more serious crimes. We could also consider reallocating the resources to M3 on Monday and Tuesday, when serious crimes are most abundant, then restoring them to W1 on Wednesday, when violent crimes are most abundant there. With more detailed information about events happening in Seattle, we could even develop staffing schedules for days with sports games (see how assaults spiked on Super Bowl day!).
Enjoyed the read and looking for more? If you’re a data scientist looking to do more with your skills or a social good organization looking to do more with your data, reach out to us. Bayes Impact is bringing data science solutions to social problems — let’s solve the world’s toughest problems together.
Follow the discussion on Hacker News!