Our team members, Jim Haines and Jonathan Licht, chose to analyze the relationship between drug usage and crime rates in the United States. We considered two datasets. The first contains usage rates of alcohol, tobacco, cocaine, and marijuana by state for the years 2002 to 2018. The second contains crime rates from 1960 to 2019. The crimes included are three property crimes (burglary, larceny, and motor vehicle theft) and four violent crimes (assault, murder, rape, and robbery). We cleaned the datasets by removing unnecessary columns, such as the raw total number of users of each drug. We considered only the usage rate of each drug to account for differences in population between states.
We initially wanted to see whether there is a relationship between drug use and crime rates and, if so, whether drug usage rates can be used to effectively predict future crime rates. In the end, we want to determine which set of drug use factors best predicts the rate of a specific crime.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_percentage_error
from more_itertools import powerset
drugs = pd.read_csv("drugs.csv")
crimes = pd.read_csv("state_crime.csv")
Below, find the tables we just imported.
drugs.head()
| | State | Year | Population.12-17 | Population.18-25 | Population.26+ | Totals.Alcohol.Use Disorder Past Year.12-17 | Totals.Alcohol.Use Disorder Past Year.18-25 | Totals.Alcohol.Use Disorder Past Year.26+ | Rates.Alcohol.Use Disorder Past Year.12-17 | Rates.Alcohol.Use Disorder Past Year.18-25 | ... | Totals.Marijuana.Used Past Year.26+ | Rates.Marijuana.Used Past Year.12-17 | Rates.Marijuana.Used Past Year.18-25 | Rates.Marijuana.Used Past Year.26+ | Totals.Tobacco.Use Past Month.12-17 | Totals.Tobacco.Use Past Month.18-25 | Totals.Tobacco.Use Past Month.26+ | Rates.Tobacco.Use Past Month.12-17 | Rates.Tobacco.Use Past Month.18-25 | Rates.Tobacco.Use Past Month.26+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 2002 | 380805 | 499453 | 2812905 | 18 | 68 | 138 | 0.048336 | 0.136490 | ... | 141 | 0.127535 | 0.237880 | 0.050275 | 63 | 226 | 930 | 0.166578 | 0.451976 | 0.330659 |
1 | Alaska | 2002 | 69400 | 62791 | 368460 | 4 | 12 | 27 | 0.061479 | 0.187891 | ... | 46 | 0.188730 | 0.389026 | 0.124566 | 11 | 30 | 112 | 0.163918 | 0.484270 | 0.304220 |
2 | Arizona | 2002 | 485521 | 602265 | 3329482 | 36 | 117 | 258 | 0.073819 | 0.193626 | ... | 215 | 0.169646 | 0.275435 | 0.064640 | 73 | 240 | 1032 | 0.151071 | 0.397968 | 0.309969 |
3 | Arkansas | 2002 | 232986 | 302029 | 1687337 | 14 | 53 | 101 | 0.061457 | 0.175913 | ... | 104 | 0.157567 | 0.288856 | 0.061510 | 46 | 169 | 660 | 0.195714 | 0.558846 | 0.391210 |
4 | California | 2002 | 3140739 | 3919577 | 21392421 | 173 | 581 | 1298 | 0.055109 | 0.148312 | ... | 1670 | 0.141067 | 0.282887 | 0.078068 | 290 | 1377 | 4721 | 0.092235 | 0.351353 | 0.220699 |
5 rows × 53 columns
crimes.head()
| | State | Year | Data.Population | Data.Rates.Property.All | Data.Rates.Property.Burglary | Data.Rates.Property.Larceny | Data.Rates.Property.Motor | Data.Rates.Violent.All | Data.Rates.Violent.Assault | Data.Rates.Violent.Murder | ... | Data.Rates.Violent.Robbery | Data.Totals.Property.All | Data.Totals.Property.Burglary | Data.Totals.Property.Larceny | Data.Totals.Property.Motor | Data.Totals.Violent.All | Data.Totals.Violent.Assault | Data.Totals.Violent.Murder | Data.Totals.Violent.Rape | Data.Totals.Violent.Robbery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1960 | 3266740 | 1035.4 | 355.9 | 592.1 | 87.3 | 186.6 | 138.1 | 12.4 | ... | 27.5 | 33823 | 11626 | 19344 | 2853 | 6097 | 4512 | 406 | 281 | 898 |
1 | Alabama | 1961 | 3302000 | 985.5 | 339.3 | 569.4 | 76.8 | 168.5 | 128.9 | 12.9 | ... | 19.1 | 32541 | 11205 | 18801 | 2535 | 5564 | 4255 | 427 | 252 | 630 |
2 | Alabama | 1962 | 3358000 | 1067.0 | 349.1 | 634.5 | 83.4 | 157.3 | 119.0 | 9.4 | ... | 22.5 | 35829 | 11722 | 21306 | 2801 | 5283 | 3995 | 316 | 218 | 754 |
3 | Alabama | 1963 | 3347000 | 1150.9 | 376.9 | 683.4 | 90.6 | 182.7 | 142.1 | 10.2 | ... | 24.7 | 38521 | 12614 | 22874 | 3033 | 6115 | 4755 | 340 | 192 | 828 |
4 | Alabama | 1964 | 3407000 | 1358.7 | 466.6 | 784.1 | 108.0 | 213.1 | 163.0 | 9.3 | ... | 29.1 | 46290 | 15898 | 26713 | 3679 | 7260 | 5555 | 316 | 397 | 992 |
5 rows × 21 columns
Below are the dtypes for each table. Pandas interpreted all of them correctly.
drugs.dtypes
State    object
Year    int64
Population.12-17    int64
Population.18-25    int64
Population.26+    int64
Totals.Alcohol.Use Disorder Past Year.12-17    int64
Totals.Alcohol.Use Disorder Past Year.18-25    int64
Totals.Alcohol.Use Disorder Past Year.26+    int64
Rates.Alcohol.Use Disorder Past Year.12-17    float64
Rates.Alcohol.Use Disorder Past Year.18-25    float64
Rates.Alcohol.Use Disorder Past Year.26+    float64
Totals.Alcohol.Use Past Month.12-17    int64
Totals.Alcohol.Use Past Month.18-25    int64
Totals.Alcohol.Use Past Month.26+    int64
Rates.Alcohol.Use Past Month.12-17    float64
Rates.Alcohol.Use Past Month.18-25    float64
Rates.Alcohol.Use Past Month.26+    float64
Totals.Tobacco.Cigarette Past Month.12-17    int64
Totals.Tobacco.Cigarette Past Month.18-25    int64
Totals.Tobacco.Cigarette Past Month.26+    int64
Rates.Tobacco.Cigarette Past Month.12-17    float64
Rates.Tobacco.Cigarette Past Month.18-25    float64
Rates.Tobacco.Cigarette Past Month.26+    float64
Totals.Illicit Drugs.Cocaine Used Past Year.12-17    int64
Totals.Illicit Drugs.Cocaine Used Past Year.18-25    int64
Totals.Illicit Drugs.Cocaine Used Past Year.26+    int64
Rates.Illicit Drugs.Cocaine Used Past Year.12-17    float64
Rates.Illicit Drugs.Cocaine Used Past Year.18-25    float64
Rates.Illicit Drugs.Cocaine Used Past Year.26+    float64
Totals.Marijuana.New Users.12-17    int64
Totals.Marijuana.New Users.18-25    int64
Totals.Marijuana.New Users.26+    int64
Rates.Marijuana.New Users.12-17    float64
Rates.Marijuana.New Users.18-25    float64
Rates.Marijuana.New Users.26+    float64
Totals.Marijuana.Used Past Month.12-17    int64
Totals.Marijuana.Used Past Month.18-25    int64
Totals.Marijuana.Used Past Month.26+    int64
Rates.Marijuana.Used Past Month.12-17    float64
Rates.Marijuana.Used Past Month.18-25    float64
Rates.Marijuana.Used Past Month.26+    float64
Totals.Marijuana.Used Past Year.12-17    int64
Totals.Marijuana.Used Past Year.18-25    int64
Totals.Marijuana.Used Past Year.26+    int64
Rates.Marijuana.Used Past Year.12-17    float64
Rates.Marijuana.Used Past Year.18-25    float64
Rates.Marijuana.Used Past Year.26+    float64
Totals.Tobacco.Use Past Month.12-17    int64
Totals.Tobacco.Use Past Month.18-25    int64
Totals.Tobacco.Use Past Month.26+    int64
Rates.Tobacco.Use Past Month.12-17    float64
Rates.Tobacco.Use Past Month.18-25    float64
Rates.Tobacco.Use Past Month.26+    float64
dtype: object
crimes.dtypes
State    object
Year    int64
Data.Population    int64
Data.Rates.Property.All    float64
Data.Rates.Property.Burglary    float64
Data.Rates.Property.Larceny    float64
Data.Rates.Property.Motor    float64
Data.Rates.Violent.All    float64
Data.Rates.Violent.Assault    float64
Data.Rates.Violent.Murder    float64
Data.Rates.Violent.Rape    float64
Data.Rates.Violent.Robbery    float64
Data.Totals.Property.All    int64
Data.Totals.Property.Burglary    int64
Data.Totals.Property.Larceny    int64
Data.Totals.Property.Motor    int64
Data.Totals.Violent.All    int64
Data.Totals.Violent.Assault    int64
Data.Totals.Violent.Murder    int64
Data.Totals.Violent.Rape    int64
Data.Totals.Violent.Robbery    int64
dtype: object
def get_columns(dataframe, key, num):
    """Keep the first `num` columns plus every later column whose name contains `key`."""
    to_be_dropped = []
    for col in list(dataframe.columns)[num:]:  # never drop the first `num` columns
        if key not in col:
            to_be_dropped.append(col)
    return dataframe.drop(to_be_dropped, axis=1)

def get_not_columns(dataframe, key, num):
    """Keep the first `num` columns plus every later column whose name does NOT contain `key`."""
    to_be_dropped = []
    for col in list(dataframe.columns)[num:]:  # never drop the first `num` columns
        if key in col:
            to_be_dropped.append(col)
    return dataframe.drop(to_be_dropped, axis=1)

def graph_drug(df, cols, labels, ylim):
    """Plot one line per state for each of three columns, side by side."""
    fig, axes = plt.subplots(1, 3)
    fig.set_figheight(5)
    fig.set_figwidth(15)

    def subgraph(col, ax, label, ylim):
        df.set_index("Year").groupby("State")[col].plot.line(
            ylim=(0, ylim), alpha=0.7, fontsize=14, ax=ax)
        ax.set_xlabel("Year", fontsize=15)
        ax.set_ylabel("Fraction", fontsize=15)
        ax.set_title(label, fontsize=15)
        ax.tick_params(labelrotation=45)

    for i in range(3):
        subgraph(cols[i], axes[i], labels[i], ylim)
    fig.tight_layout()

def capitalize(string):
    """Upper-case the first character and leave the rest unchanged."""
    return str.upper(string[0]) + string[1:]

def regress(df, labels, xs, y):
    """Scatter y against each of three x columns and overlay the fitted line."""
    fig, axes = plt.subplots(1, 3)
    fig.set_figheight(5)
    fig.set_figwidth(15)

    def subplot(label, x, ax):
        res = stats.linregress(df[x], df[y])
        title = "Slope: " + str(np.round(res.slope, 3))
        df.plot.scatter(x=x, y=y, ax=ax)
        ax.plot(df[x], res.intercept + res.slope * df[x], color="red")
        ax.set_xlabel(label, fontsize=12)
        ax.set_ylabel(capitalize(y), fontsize=12)
        ax.set_title(title, fontsize=15)
        ax.set_xlim((-2, 2))
        ax.set_ylim((-2, 2))

    for i in range(3):
        subplot(labels[i], xs[i], axes[i])
    fig.tight_layout()

def graph_coefs(df, xs, title):
    """Plot the regression slope of every crime on each column in xs."""
    ys = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]
    # blue-ish for property, red-ish for violent
    colors = ["lightsteelblue", "deepskyblue", "royalblue", "darksalmon", "red", "darkorange", "firebrick"]
    # we then get the coefficients for every combination
    all_coefs = pd.DataFrame({"Crime": ys})
    all_coefs.set_index("Crime", inplace=True)
    for x in xs:
        coefs = []
        for y in ys:
            res = stats.linregress(df[x], df[y])
            coefs.append(res.slope)
        all_coefs[x] = coefs
    all_coefs = all_coefs.transpose()
    all_coefs.reset_index(inplace=True)
    # we plot those results here
    fig, ax = plt.subplots()
    for i in range(len(ys)):
        all_coefs.plot.scatter(x="index", y=ys[i], ax=ax, marker='o', color=colors[i], s=100, alpha=0.65)
    plt.title(title)
    plt.legend(ys)
    ax.set_ylabel("Correlation Coefficients")
    ax.set_ylim((-1, 1))
    fig.set_figwidth(15)
    fig.set_figheight(6)

def get_scaled_state(df, state):
    """Return one state's rows with every column after State and Year z-scored."""
    s = df[df["State"] == state]
    return pd.concat([s.iloc[:, :2], stats.zscore(s.iloc[:, 2:])], axis=1)
To make the data more usable, we created functions that return a copy of the section of either DataFrame that is relevant to our current analysis. One example is excluding the raw population values from the table, because these values are not very useful. Our tables already conform to tidy data guidelines, so we did not have to change much to achieve this. We decided that State and Year together are the best index for the data in most situations. This ensures there are no repeated index values and allows us to clearly see changes both by state and over time.
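As a minimal sketch of the column selection described above, the example below uses a made-up miniature table. The names `toy` and `keep_matching` are hypothetical and exist only for illustration; the values are invented. It keeps the two identifier columns plus any later column whose name contains a key, then checks that each (State, Year) pair appears only once.

```python
import pandas as pd

# Toy frame with invented values; the column names mimic the drugs table.
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Year": [2002, 2002],
    "Totals.Alcohol.Use Past Month.18-25": [226, 30],
    "Rates.Alcohol.Use Past Month.18-25": [0.45, 0.48],
    "Rates.Tobacco.Use Past Month.18-25": [0.40, 0.41],
})

def keep_matching(df, key, num):
    """Keep the first `num` columns plus every later column containing `key`."""
    protected = list(df.columns[:num])
    matching = [col for col in df.columns[num:] if key in col]
    return df[protected + matching]

rates_only = keep_matching(toy, "Rates", 2)
print(list(rates_only.columns))

# With State and Year as the index, every row should have a unique key.
keys_unique = toy.set_index(["State", "Year"]).index.is_unique
print("unique (State, Year) keys:", keys_unique)
```

Selecting by substring keeps the helper short, but it is case-sensitive, so "Rates" and "rates" are treated as different keys.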
# get just percentages, not raw user numbers
drugs_pct = get_columns(drugs, "Rates", 2)
# rename columns
drugs_pct.rename(columns={
"Rates.Alcohol.Use Disorder Past Year.12-17": "alcoholism past year:12-17",
"Rates.Alcohol.Use Disorder Past Year.18-25": "alcoholism past year:18-25",
"Rates.Alcohol.Use Disorder Past Year.26+": "alcoholism past year:26+",
"Rates.Alcohol.Use Past Month.12-17": "alcohol used past month:12-17",
"Rates.Alcohol.Use Past Month.18-25": "alcohol used past month:18-25",
"Rates.Alcohol.Use Past Month.26+": "alcohol used past month:26+",
"Rates.Illicit Drugs.Cocaine Used Past Year.12-17": "cocaine used past year:12-17",
"Rates.Illicit Drugs.Cocaine Used Past Year.18-25": "cocaine used past year:18-25",
"Rates.Illicit Drugs.Cocaine Used Past Year.26+": "cocaine used past year:26+",
"Rates.Marijuana.Used Past Month.12-17": "marijuana used past month:12-17",
"Rates.Marijuana.Used Past Month.18-25": "marijuana used past month:18-25",
"Rates.Marijuana.Used Past Month.26+": "marijuana used past month:26+",
"Rates.Marijuana.Used Past Year.12-17": "marijuana used past year:12-17",
"Rates.Marijuana.Used Past Year.18-25": "marijuana used past year:18-25",
"Rates.Marijuana.Used Past Year.26+": "marijuana used past year:26+",
"Rates.Tobacco.Use Past Month.12-17": "tobacco used past month:12-17",
"Rates.Tobacco.Use Past Month.18-25": "tobacco used past month:18-25",
"Rates.Tobacco.Use Past Month.26+": "tobacco used past month:26+",
}, inplace=True)
# get alcohol related usage rates
alcohol_pct = get_columns(drugs_pct, "alcohol", 2)
# get tobacco related usage rates
tobacco_pct = get_not_columns(get_columns(drugs_pct, "tobacco", 2), "Cigarette", 2)
# get cocaine related usage rates
cocaine_pct = get_columns(drugs_pct, "cocaine", 2)
# get marijuana related usage rates
marijuana_pct = get_not_columns(get_columns(drugs_pct, "marijuana", 2), "New Users", 2)
# rename columns
crimes.rename(columns={
"Data.Rates.Property.Burglary": "burglary",
"Data.Rates.Property.Larceny": "larceny",
"Data.Rates.Property.Motor": "motor",
"Data.Rates.Violent.Assault": "assault",
"Data.Rates.Violent.Murder": "murder",
"Data.Rates.Violent.Rape": "rape",
"Data.Rates.Violent.Robbery": "robbery"
},inplace=True)
crimes_pct = get_not_columns(crimes, "Data", 2)
In this section we analyze alcohol use and alcoholism in the United States by age group, over time, and by state.
In the cell below, we establish the range of alcohol use disorder rates (labeled "alcoholism" in our column names). This summary statistic offers preliminary insight into which states may have high or low usage rates for other drugs. It also raises questions about the states with the most extreme values. For example, what about North Dakota in 2003 made its rate so high, and has it changed since then? The following calculations use the 18 to 25 age range.
print('State with highest alcohol use in the past year rate from 2002 to 2018:',
str(alcohol_pct.set_index(["State", "Year"])["alcoholism past year:18-25"].idxmax()) +
';', alcohol_pct.set_index(["State", "Year"])["alcoholism past year:18-25"].max())
print('State with lowest alcohol use in the past year rate from 2002 to 2018:',
str(alcohol_pct.set_index(["State", "Year"])["alcoholism past year:18-25"].idxmin()) +
';', alcohol_pct.set_index(["State", "Year"])["alcoholism past year:18-25"].min())
State with highest alcohol use in the past year rate from 2002 to 2018: ('North Dakota', 2003); 0.272941
State with lowest alcohol use in the past year rate from 2002 to 2018: ('Florida', 2018); 0.071218
In the cell below, we find the mean past-year alcohol use disorder rate among ages 18 to 25, across all states from 2002 to 2018, to be about 15.12%. This summary statistic is a measure of central tendency that can be used as a reference point for comparing states. It allows us to judge, for example, whether a state might be considered a "heavy drinking state."
alcohol_pct['alcoholism past year:18-25'].mean()
0.15118001384083027
cols = ["alcoholism past year:12-17", "alcoholism past year:18-25", "alcoholism past year:26+"]
labels = ["Alcoholism: Ages 12 - 17", "Alcoholism: Ages 18 - 25", "Alcoholism: Ages 26+"]
graph_drug(alcohol_pct, cols, labels, 0.3)
The plots above show past-year alcoholism rates for each state from 2002 to 2018, where each panel is a different age group and each line is a state. As we can see, alcoholism varies greatly by age range. For ages 12 to 17, the alcohol use disorder rate has dropped in all states. In the 18 to 25 age range, the rate varies greatly by state but is still decreasing over time. Finally, for the 26+ age range, the rate has stayed fairly consistent over this period.
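One way to check a statement such as "the rate has dropped in all states" numerically is to fit a straight line to each state's series and look at the sign of the slope. The sketch below does this on synthetic data. The `demo` table, its values, and the helper `yearly_slope` are invented for illustration; the same idea could presumably be applied to a column of `alcohol_pct`.

```python
import numpy as np
import pandas as pd

# Synthetic example data (not from our datasets): two states, five years.
demo = pd.DataFrame({
    "State": ["A"] * 5 + ["B"] * 5,
    "Year": list(range(2002, 2007)) * 2,
    "rate": [0.10, 0.09, 0.08, 0.07, 0.06,   # steadily falling
             0.05, 0.05, 0.06, 0.06, 0.07],  # slowly rising
})

def yearly_slope(group):
    # Least-squares slope of rate against year, in rate units per year.
    return np.polyfit(group["Year"], group["rate"], 1)[0]

slopes = pd.Series({state: yearly_slope(g) for state, g in demo.groupby("State")})
print(slopes)
print("Share of states with a falling rate:", (slopes < 0).mean())
```

A linear slope is a coarse summary: it would not distinguish a steady decline from a drop followed by a plateau, so it complements the plots rather than replacing them.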
In this section we analyze tobacco use in the United States by age group, over time, and by state.
cols = ["tobacco used past month:12-17", "tobacco used past month:18-25", "tobacco used past month:26+"]
labels = ["Tobacco Use in the Past Month: Ages 12-17", "Tobacco Use in the Past Month: Ages 18-25", "Tobacco Use in the Past Month: Ages 26+"]
graph_drug(tobacco_pct, cols, labels, 0.65)
In the plots above, we provide a similar overview of past-month tobacco use across age ranges for all states. We see a significant nationwide decrease in tobacco use in the 12 to 17 age range. One possible factor in this decrease is the introduction of vape devices, which offer an alternative way to consume nicotine, the addictive ingredient in tobacco. The 18 to 25 age range has also seen a decrease across most states but has the highest rates overall among the three age groups. This is likely due to the historic prevalence of nicotine use among teenagers, which has also been affected by the introduction of vape devices. Finally, the 26+ age range has maintained steady rates over time but shows greater variance by state.
In this section we analyze cocaine use in the United States by age group, over time, and by state.
cols = ["cocaine used past year:12-17", "cocaine used past year:18-25", "cocaine used past year:26+"]
labels = ["Cocaine Use in the Past Year: Ages 12-17", "Cocaine Use in the Past Year: Ages 18-25", "Cocaine Use in the Past Year: Ages 26+"]
graph_drug(cocaine_pct, cols, labels, 0.13)
Cocaine has the lowest usage rates of the four drugs in our analysis. Past-year cocaine use in the 12 to 17 range has decreased since 2002, although usage rates were never very high in this age group. The 18 to 25 range has by far the highest rates and the greatest spread in usage rates between states. It is interesting to note an uptick in usage for this group around 2012-2014. The 26+ range has lower rates than the 18 to 25 range. The oldest group has one outlier with much higher usage rates, the District of Columbia, which peaked in 2006 with a past-year usage rate of 0.0525 (please note that the District of Columbia is treated as a state in this dataset). The general trend has remained fairly constant over time.
In this section we analyze marijuana use by age group, over time, and by state.
cols = ["marijuana used past year:12-17", "marijuana used past year:18-25", "marijuana used past year:26+"]
labels = ["Marijuana Use in the Past Year: Ages 12-17", "Marijuana Use in the Past Year: Ages 18-25", "Marijuana Use in the Past Year: Ages 26+"]
graph_drug(marijuana_pct, cols, labels, 0.55)
Marijuana is the only drug we analyzed with an upward trend in multiple age ranges. Usage in the 12 to 17 age range has remained similar over time. For ages 18 to 25, there is a slight upward trend overall, with a high degree of variation between states. For ages 26+, there is a clear upward trend in all states, although they are increasing at different rates. These increases are likely due to the recent focus on legalizing medical and, in some cases, recreational marijuana use.
Below, we bring the crimes dataset into our analysis, which will help us draw meaningful conclusions from both datasets. This dataset has the same unit of observation as our initial dataset: a state in a certain year. It contains a total population column, followed by data on two types of crime: property crimes and violent crimes. For each of these two categories it has an aggregate column (a total property crime rate and a total violent crime rate), as well as a column for each type of crime within the category. Each entry is the rate at which that crime was committed in that state in that year.
#this is the df that just has the rates and within the relevant years
crimes_pct = crimes_pct[(crimes_pct.Year >= 2002) & (crimes_pct.Year<=2018)].iloc[:, :12]
crimes_pct.head()
| | State | Year | burglary | larceny | motor | assault | murder | rape | robbery |
---|---|---|---|---|---|---|---|---|---|
42 | Alabama | 2002 | 950.6 | 2767.0 | 310.1 | 268.0 | 6.8 | 37.2 | 133.1 |
43 | Alabama | 2003 | 960.2 | 2754.1 | 332.1 | 251.7 | 6.6 | 36.8 | 134.1 |
44 | Alabama | 2004 | 987.0 | 2732.4 | 309.9 | 249.4 | 5.6 | 38.5 | 133.5 |
45 | Alabama | 2005 | 955.8 | 2656.0 | 289.0 | 248.3 | 8.2 | 34.4 | 141.7 |
46 | Alabama | 2006 | 973.7 | 2640.8 | 326.5 | 227.5 | 8.3 | 35.8 | 153.6 |
fig, axes = plt.subplots(3, 3)
fig.set_figwidth(14)
fig.set_figheight(14)
crimes_list = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]
starter = crimes_pct.set_index("Year").groupby("State")
counter = 0
# fill the 3x3 grid row by row; only the first seven panels are used
for i in range(3):
    for j in range(3):
        if counter < 7:
            starter[crimes_list[counter]].plot.line(ylabel="Offenses per 100,000 People", ax=axes[i][j])
            axes[i][j].set_title(capitalize(crimes_list[counter]))
        counter += 1
# remove the two unused panels
fig.delaxes(axes[2][1])
fig.delaxes(axes[2][2])
fig.tight_layout()
In the graphs above, we display the rate of each crime for each state from 2002 to 2018. These graphs are meant to provide an overview, but we can observe an overall decrease across most crimes during this time, with the exception of assault, murder, and rape. The rates of these three crimes remain fairly constant, but some states have actually seen an increase in rape. This increase could be due to many factors, but in recent years, coming forward to report rape has seemingly become a more viable option for victims. An increase in reporting may account for some of the increase we see for this crime.
Some of the graphs above show obvious outliers. First, Washington D.C. has had by far the largest increase in larceny since 2006, while most other states have had declining rates. Additionally, Washington D.C. had a murder rate above that of every state during this period. Its murder rate has decreased overall but remains far above the average of the other states. The outlier in the rape category is Alaska: it is one of the only states with an increase in rape rates, and its rates were already above average. Finally, the most obvious outlier in the robbery category is, again, Washington D.C. Its robbery rate is decreasing but remains far higher than those of the other states.
Washington D.C. is an outlier in many categories. This could be because its population density is much higher than that of any state. Crime rates are usually higher in urban environments, and almost all of Washington D.C. is urban, so it makes sense that its rates are higher.
Now, we will create a DataFrame consisting of drug usage rates merged with the relevant crime data. We will use this new DataFrame to observe correlations between drug usage rates and crime rates at both a countrywide and a state level.
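Because both tables are keyed by State and Year, a merge like this can be sanity-checked with two options of `DataFrame.merge`: `validate`, which raises an error if a key repeats, and `indicator`, which records where each row came from. The sketch below uses made-up miniature tables (`drug_demo` and `crime_demo` are hypothetical names with invented values), not our actual data.

```python
import pandas as pd

# Made-up miniature versions of the two tables.
drug_demo = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "Year": [2002, 2003, 2002],
    "alcohol used past month:18-25": [0.45, 0.46, 0.48],
})
crime_demo = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Year": [2002, 2003, 2002, 2001],
    "burglary": [950.6, 960.2, 600.0, 610.0],
})

# validate= raises MergeError if a (State, Year) pair repeats on either side.
merged_demo = drug_demo.merge(
    crime_demo, on=["State", "Year"], how="inner", validate="one_to_one"
)
print(merged_demo.shape)

# An outer merge with indicator=True shows which rows an inner join would drop.
audit = drug_demo.merge(crime_demo, on=["State", "Year"], how="outer", indicator=True)
print(audit["_merge"].value_counts())
```

In this toy case the inner join keeps three rows, and the audit shows one crime row (Alaska, 2001) with no matching drug row.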
features = list(alcohol_pct.iloc[:, 5:].columns
.append(cocaine_pct.iloc[:, 1:].columns)
.append(tobacco_pct.iloc[:, 2:].columns)
.append(marijuana_pct.iloc[:, 5:].columns))
merged = drugs_pct.merge(crimes_pct, on = ["State", "Year"], how="inner")
all_combos = list(powerset(features))[7099:] #less than 9 features is never optimal
In the cell below, we use a K Nearest Neighbors model to determine which set of drug use features best predicts a certain crime rate at a countrywide level. To do this, we created a powerset of the drug use features and ran our model for each set of at least 9 features against each type of crime. We settled on 9 as the minimum after experimenting with different combinations. To have the cell run in a reasonable amount of time, we eliminated the combinations of features that we knew would never be selected.
Additionally, we chose the value of K to be 2. Once again, we experimented with different values and found that 2 was always the best, so we fixed K at 2 rather than looping over more values than necessary.
Using the mean absolute percentage error, we found the combination of features that scored best for each crime.
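The slice offset used to skip the small feature sets can be checked with a short count. The sketch below assumes the feature list has 13 entries (three age groups for each of the four drugs, plus Year), which is what the column slices appear to select; that count is our reading of the code, not something the cell prints. It relies on the fact that a powerset built from `itertools.combinations` yields subsets in order of increasing size, which is how `more_itertools.powerset` is defined.

```python
from itertools import chain, combinations
from math import comb

n_features = 13  # assumption: 3 alcohol + Year + 3 cocaine + 3 tobacco + 3 marijuana
min_size = 9

# Number of subsets with fewer than `min_size` elements.
n_small = sum(comb(n_features, k) for k in range(min_size))
n_kept = 2 ** n_features - n_small
print(n_small, n_kept)

# Enumerate the powerset in the same order and check the boundary directly.
subsets = list(chain.from_iterable(
    combinations(range(n_features), r) for r in range(n_features + 1)
))
print(len(subsets[n_small - 1]), len(subsets[n_small]))
```

Under that assumption, 7,099 subsets have fewer than 9 features, so slicing from index 7099 leaves the 1,093 subsets with 9 or more.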
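When a nearest-neighbors model is scored on the same rows it was fitted on, each row counts as one of its own neighbors, which tends to favor small values of K. A common alternative is to hold out part of the data and score on that. The sketch below illustrates the difference on synthetic data; the arrays `X` and `y`, the split size, and the list of K values are all assumptions for illustration, and this is not the procedure used in our cell.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT our datasets): 300 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 500 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=20, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for k in (1, 2, 5, 10):
    # The pipeline fits the scaler on the training rows only.
    model = Pipeline([
        ("scale", StandardScaler()),
        ("knn", KNeighborsRegressor(n_neighbors=k)),
    ])
    model.fit(X_train, y_train)
    scores[k] = {
        "train": mean_absolute_percentage_error(y_train, model.predict(X_train)),
        "test": mean_absolute_percentage_error(y_test, model.predict(X_test)),
    }
    print(f"k={k:2d}  train MAPE={scores[k]['train']:.3f}  test MAPE={scores[k]['test']:.3f}")
```

With K=1 the training error is zero by construction, since every row is its own nearest neighbor, so the training score alone cannot be used to compare values of K.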
crimes = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]
for crime in crimes:
    final_mape = np.inf
    for features in all_combos:
        model = KNeighborsRegressor(n_neighbors=2)  # the best value is always 2
        x_train = merged[list(features)]
        y_train = merged[crime]
        # standardize the features before computing distances
        scaler = StandardScaler()
        scaler.fit(x_train)
        x_train_sc = scaler.transform(x_train)
        model.fit(x_train_sc, y_train)
        y_pred = model.predict(x_train_sc)
        mape = mean_absolute_percentage_error(y_train, y_pred)
        # keep the feature set with the lowest error so far
        if mape < final_mape:
            final_mape = mape
            final_features = list(features)
    print("crime:", crime)
    print("mape:", np.round(final_mape*100, 2), "%")
    print("# of features:", len(final_features))
    print("features:", final_features)
    print("---------------------------------------------------------------------------------")
crime: burglary
mape: 5.47 %
# of features: 12
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:18-25', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: larceny
mape: 3.7 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: motor
mape: 7.75 %
# of features: 10
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: assault
mape: 7.49 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: murder
mape: 9.64 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: rape
mape: 7.08 %
# of features: 9
features: ['alcohol used past month:12-17', 'alcohol used past month:26+', 'Year', 'cocaine used past year:18-25', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: robbery
mape: 8.31 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
Burglary was found to be best predicted by 12 features; the only feature left out is marijuana use in the past year for ages 12-17.
Larceny, assault, murder, and robbery were all found to be best predicted by 11 features, although the features included differ. All of them included alcohol use in the past month for all three age groups, cocaine use in the past year for the 12 to 17 and 26+ age groups, and the year. They have different mixes of tobacco and marijuana features.
Motor vehicle theft was best predicted by 10 features. It had all alcohol categories and the same two cocaine categories as the 11-feature crimes above. It also had the two older age groups for both tobacco and marijuana.
Rape had the fewest features, at only 9. It is the only crime that does not include all three age groups for alcohol. It still has cocaine and marijuana use for the two older age groups, and the youngest and oldest age groups for tobacco.
It is notable that all three age groups of the alcohol use feature are present for every crime except rape, which has only two. In addition, there are at least two age groups of cocaine features in every model. These two drugs are the ones we predicted would have the most influence on crime rates.
The fact that rape has only two categories of alcohol use is also an interesting finding. Rape is arguably an outlier in our set of crimes in that it is sexual in nature. In general, the uniqueness of this crime relative to the other crimes we considered makes it harder to predict. We believe that the motivators of a sexual crime are not necessarily similar to those of non-sexual crimes, which may explain why so few features are used for that crime.
Now that we have looked at the country as a whole, we will dive deeper into our home state: Louisiana. Below, we point out some more specific correlations between certain crimes and certain drugs in Louisiana. We will also look at how these correlations differ between age groups.
merged = drugs_pct.merge(crimes_pct, on=["State", "Year"], how="inner")
la_sc = get_scaled_state(merged, "Louisiana")
labels = ["Alcohol Use in Past Month: Ages 12-17", "Alcohol Use in Past Month: Ages 18-25", "Alcohol Use in Past Month: Ages 26+"]
xs = ["alcohol used past month:12-17", "alcohol used past month:18-25", "alcohol used past month:26+"]
y = "burglary"
regress(la_sc, labels, xs, y)
In the plots above, we used our merged DataFrame to get correlation values between past-month alcohol usage rates and burglary rates across our three age ranges in Louisiana. We can see that the 12 to 17 age range has a correlation coefficient of 0.767, while the 18 to 25 range has the largest positive correlation coefficient of the three age ranges: 0.882. This value represents a strong positive relationship between alcohol use in this age group and the burglary rate in Louisiana. The 26+ age range has a negative correlation coefficient, indicating an inverse relationship in Louisiana.
labels = ["Cocaine Use in Past Year: Ages 12-17", "Cocaine Use in Past Year: Ages 18-25", "Cocaine Use in Past Year: Ages 26+"]
xs = ["cocaine used past year:12-17", "cocaine used past year:18-25", "cocaine used past year:26+"]
y = "larceny"
regress(la_sc, labels, xs, y)
The 12 to 17 age range has the highest correlation coefficient, 0.892, indicating a strong positive relationship between past-year cocaine use among 12 to 17 year olds and larceny in Louisiana. It is important to note that cocaine usage rates for this age group are far lower than for the other two ranges. This means that while there may be a positive relationship, there are fewer cocaine users from which to draw this conclusion. Larceny is the lowest level of theft-related crime and indicates simple theft as opposed to robbery or burglary, so it makes sense that it is correlated with a younger age group. The next two graphs have similar coefficients, with the 26+ group being slightly higher. The lower coefficients in the two older age groups mean that people in these age groups are not necessarily more likely to commit larceny if they have used cocaine in the past year.
After observing correlations between different drug usage rates and different crimes, we were able to go a step further and look at changes in coefficients (betas) between age groups. Here we calculate the betas for all crimes and for different drugs.
An important thing to note is that because the data is scaled and each coefficient is calculated in a single variable regression, the coefficients are the same number as the correlation coefficients. This is not always the case, but here it does hold true.
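The reason is that the slope of a single-variable least-squares fit equals the correlation coefficient multiplied by the ratio of the standard deviations of y and x. After z-scoring, both standard deviations are 1, so the slope reduces to the correlation coefficient. The sketch below checks this numerically on made-up values (the arrays `x` and `y` are synthetic, not taken from our data).

```python
import numpy as np
from scipy import stats

# Synthetic values standing in for one usage-rate column and one crime column.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.2, scale=0.05, size=17)
y = 900 + 3000 * x + rng.normal(scale=60, size=17)

# Scale both variables to mean 0 and standard deviation 1.
zx, zy = stats.zscore(x), stats.zscore(y)

res = stats.linregress(zx, zy)   # regression on the scaled data
r = np.corrcoef(x, y)[0, 1]      # correlation of the unscaled data

print("slope on scaled data:", res.slope)
print("correlation coefficient:", r)
print("intercept on scaled data:", res.intercept)
```

The two printed numbers agree up to floating-point error, and the intercept is essentially zero because both scaled variables have mean zero.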
xs = ["alcohol used past month:12-17", "alcohol used past month:18-25", "alcohol used past month:26+"]
graph_coefs(la_sc, xs, "Alcohol Use in the Past Month and All Crimes in Louisiana")
In the graph above, we have three age groups on the x axis and the coefficients between alcohol use in the past month and the seven types of crime listed in the legend. We can see an obvious trend in all but one crime: alcohol use in older age groups is less correlated with crime rates. The one crime that does not show any real decrease is rape. This is arguably the most serious and vulgar crime of those we have analyzed, which may explain why it does not behave like the others.
xs = ['tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+']
graph_coefs(la_sc, xs, "Tobacco Use in the Past Month and All Crimes in Louisiana")
This graph is very similar to the last graph. We find that the coefficients are generally positive in younger age groups and then decrease for the oldest age group. Once again, rape behaves very differently from the other crimes especially in the two younger age groups.
xs = ['cocaine used past year:12-17', 'cocaine used past year:18-25', 'cocaine used past year:26+']
graph_coefs(la_sc, xs, "Cocaine Use in the Past Year and All Crimes in Louisiana")
This graph is slightly different than the previous two graphs. Instead of the betas decreasing in the older age groups we see that the coefficients remain high. However, it is similar to the previous graphs because rape is once again an outlier in 2 of the 3 age groups.
xs = ['marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
graph_coefs(la_sc, xs, "Marijuana Use in the Past Year and All Crimes in Louisiana")
This graph is surprisingly different from the previous three graphs. In the two older age groups, almost all crimes are actually negatively correlated with marijuana use. Rape, again, is the obvious outlier in this graph.
We recognize that our dataset is not well suited to a predictive model because the unit of observation is a state in a particular year. Each number in our tables is already an aggregation of many individual observations. It would be more helpful to predict outcomes for individuals rather than the average crime rate of a state. Unfortunately, we don't have the data to make individual predictions.
Our analysis of the country, while interesting, might not say much about any one specific drug. As mentioned above, these numbers are already averages over many observations. Despite this, we think that determining which set of features is the best predictor may give some insight into the relationship between specific drugs and specific crimes.
Furthermore, many states may cancel each other out, making the model far from optimal. This is why we decided to examine one state, Louisiana, to see what these relationships look like on a smaller scale. This view of the problem revealed some stronger relationships between drugs and crime. We can't say that certain drugs and certain crimes are always correlated; it depends highly on the age group and the state.
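To illustrate how pooling states can hide or even reverse a relationship that holds within each state, the sketch below uses two invented "states" whose values are made up purely for demonstration. Within each one, usage and crime rise together perfectly, yet the pooled correlation is negative because the two states sit at very different levels.

```python
import numpy as np

# Made-up numbers: within each state, usage (x) and crime (y) rise together.
state_a_x, state_a_y = np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])
state_b_x, state_b_y = np.array([11.0, 12.0, 13.0]), np.array([0.0, 1.0, 2.0])

# Correlation within each state.
r_a = np.corrcoef(state_a_x, state_a_y)[0, 1]
r_b = np.corrcoef(state_b_x, state_b_y)[0, 1]

# Correlation after stacking both states into one sample.
r_pooled = np.corrcoef(
    np.concatenate([state_a_x, state_b_x]),
    np.concatenate([state_a_y, state_b_y]),
)[0, 1]

print("within A:", r_a)
print("within B:", r_b)
print("pooled:  ", r_pooled)
```

This is a deliberately extreme toy case; real state data would be noisier, but the same mechanism is one reason a single-state view can show stronger relationships than the pooled data.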
Finally, it is important to note that we cannot infer any causation from this analysis. The data is not following the same people through different age groups or following people that do drugs and commit crimes. The data is simply a measurement of rates in different states.