The Relationship Between Drug Use and Crime Rates¶

Jonathan Licht and Jim Haines¶

https://jlicht27.github.io/¶

Overview¶

In this project, we (Jim Haines and Jonathan Licht) analyze the relationship between drug usage and crime rates in the United States. We considered two datasets. The first contains usage rates of alcohol, tobacco, cocaine, and marijuana by state for the years 2002 to 2018. The second contains crime rates from 1960 to 2019. It covers three property crimes (burglary, larceny, and motor vehicle theft) and four violent crimes (assault, murder, rape, and robbery). We cleaned the datasets by removing unnecessary columns, such as the raw number of users of each drug, and considered only the usage rates of each drug to account for differences in population between states.

We initially wanted to see whether there is a relationship between drug use and crime rates and, if so, whether drug usage rates can be used to effectively predict future crime rates. In the end, we want to determine which set of drug use factors best predicts the rate of a specific crime.

Libraries and Importing¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_percentage_error
from more_itertools import powerset

drugs = pd.read_csv("drugs.csv")
crimes = pd.read_csv("state_crime.csv")

Below, find the tables we just imported.

In [2]:
drugs.head()
Out[2]:
State Year Population.12-17 Population.18-25 Population.26+ Totals.Alcohol.Use Disorder Past Year.12-17 Totals.Alcohol.Use Disorder Past Year.18-25 Totals.Alcohol.Use Disorder Past Year.26+ Rates.Alcohol.Use Disorder Past Year.12-17 Rates.Alcohol.Use Disorder Past Year.18-25 ... Totals.Marijuana.Used Past Year.26+ Rates.Marijuana.Used Past Year.12-17 Rates.Marijuana.Used Past Year.18-25 Rates.Marijuana.Used Past Year.26+ Totals.Tobacco.Use Past Month.12-17 Totals.Tobacco.Use Past Month.18-25 Totals.Tobacco.Use Past Month.26+ Rates.Tobacco.Use Past Month.12-17 Rates.Tobacco.Use Past Month.18-25 Rates.Tobacco.Use Past Month.26+
0 Alabama 2002 380805 499453 2812905 18 68 138 0.048336 0.136490 ... 141 0.127535 0.237880 0.050275 63 226 930 0.166578 0.451976 0.330659
1 Alaska 2002 69400 62791 368460 4 12 27 0.061479 0.187891 ... 46 0.188730 0.389026 0.124566 11 30 112 0.163918 0.484270 0.304220
2 Arizona 2002 485521 602265 3329482 36 117 258 0.073819 0.193626 ... 215 0.169646 0.275435 0.064640 73 240 1032 0.151071 0.397968 0.309969
3 Arkansas 2002 232986 302029 1687337 14 53 101 0.061457 0.175913 ... 104 0.157567 0.288856 0.061510 46 169 660 0.195714 0.558846 0.391210
4 California 2002 3140739 3919577 21392421 173 581 1298 0.055109 0.148312 ... 1670 0.141067 0.282887 0.078068 290 1377 4721 0.092235 0.351353 0.220699

5 rows × 53 columns

In [3]:
crimes.head()
Out[3]:
State Year Data.Population Data.Rates.Property.All Data.Rates.Property.Burglary Data.Rates.Property.Larceny Data.Rates.Property.Motor Data.Rates.Violent.All Data.Rates.Violent.Assault Data.Rates.Violent.Murder ... Data.Rates.Violent.Robbery Data.Totals.Property.All Data.Totals.Property.Burglary Data.Totals.Property.Larceny Data.Totals.Property.Motor Data.Totals.Violent.All Data.Totals.Violent.Assault Data.Totals.Violent.Murder Data.Totals.Violent.Rape Data.Totals.Violent.Robbery
0 Alabama 1960 3266740 1035.4 355.9 592.1 87.3 186.6 138.1 12.4 ... 27.5 33823 11626 19344 2853 6097 4512 406 281 898
1 Alabama 1961 3302000 985.5 339.3 569.4 76.8 168.5 128.9 12.9 ... 19.1 32541 11205 18801 2535 5564 4255 427 252 630
2 Alabama 1962 3358000 1067.0 349.1 634.5 83.4 157.3 119.0 9.4 ... 22.5 35829 11722 21306 2801 5283 3995 316 218 754
3 Alabama 1963 3347000 1150.9 376.9 683.4 90.6 182.7 142.1 10.2 ... 24.7 38521 12614 22874 3033 6115 4755 340 192 828
4 Alabama 1964 3407000 1358.7 466.6 784.1 108.0 213.1 163.0 9.3 ... 29.1 46290 15898 26713 3679 7260 5555 316 397 992

5 rows × 21 columns

Below are the dtypes for each table. Pandas correctly interpreted all of them.

In [4]:
drugs.dtypes
Out[4]:
State                                                 object
Year                                                   int64
Population.12-17                                       int64
Population.18-25                                       int64
Population.26+                                         int64
Totals.Alcohol.Use Disorder Past Year.12-17            int64
Totals.Alcohol.Use Disorder Past Year.18-25            int64
Totals.Alcohol.Use Disorder Past Year.26+              int64
Rates.Alcohol.Use Disorder Past Year.12-17           float64
Rates.Alcohol.Use Disorder Past Year.18-25           float64
Rates.Alcohol.Use Disorder Past Year.26+             float64
Totals.Alcohol.Use Past Month.12-17                    int64
Totals.Alcohol.Use Past Month.18-25                    int64
Totals.Alcohol.Use Past Month.26+                      int64
Rates.Alcohol.Use Past Month.12-17                   float64
Rates.Alcohol.Use Past Month.18-25                   float64
Rates.Alcohol.Use Past Month.26+                     float64
Totals.Tobacco.Cigarette Past Month.12-17              int64
Totals.Tobacco.Cigarette Past Month.18-25              int64
Totals.Tobacco.Cigarette Past Month.26+                int64
Rates.Tobacco.Cigarette Past Month.12-17             float64
Rates.Tobacco.Cigarette Past Month.18-25             float64
Rates.Tobacco.Cigarette Past Month.26+               float64
Totals.Illicit Drugs.Cocaine Used Past Year.12-17      int64
Totals.Illicit Drugs.Cocaine Used Past Year.18-25      int64
Totals.Illicit Drugs.Cocaine Used Past Year.26+        int64
Rates.Illicit Drugs.Cocaine Used Past Year.12-17     float64
Rates.Illicit Drugs.Cocaine Used Past Year.18-25     float64
Rates.Illicit Drugs.Cocaine Used Past Year.26+       float64
Totals.Marijuana.New Users.12-17                       int64
Totals.Marijuana.New Users.18-25                       int64
Totals.Marijuana.New Users.26+                         int64
Rates.Marijuana.New Users.12-17                      float64
Rates.Marijuana.New Users.18-25                      float64
Rates.Marijuana.New Users.26+                        float64
Totals.Marijuana.Used Past Month.12-17                 int64
Totals.Marijuana.Used Past Month.18-25                 int64
Totals.Marijuana.Used Past Month.26+                   int64
Rates.Marijuana.Used Past Month.12-17                float64
Rates.Marijuana.Used Past Month.18-25                float64
Rates.Marijuana.Used Past Month.26+                  float64
Totals.Marijuana.Used Past Year.12-17                  int64
Totals.Marijuana.Used Past Year.18-25                  int64
Totals.Marijuana.Used Past Year.26+                    int64
Rates.Marijuana.Used Past Year.12-17                 float64
Rates.Marijuana.Used Past Year.18-25                 float64
Rates.Marijuana.Used Past Year.26+                   float64
Totals.Tobacco.Use Past Month.12-17                    int64
Totals.Tobacco.Use Past Month.18-25                    int64
Totals.Tobacco.Use Past Month.26+                      int64
Rates.Tobacco.Use Past Month.12-17                   float64
Rates.Tobacco.Use Past Month.18-25                   float64
Rates.Tobacco.Use Past Month.26+                     float64
dtype: object
In [5]:
crimes.dtypes
Out[5]:
State                             object
Year                               int64
Data.Population                    int64
Data.Rates.Property.All          float64
Data.Rates.Property.Burglary     float64
Data.Rates.Property.Larceny      float64
Data.Rates.Property.Motor        float64
Data.Rates.Violent.All           float64
Data.Rates.Violent.Assault       float64
Data.Rates.Violent.Murder        float64
Data.Rates.Violent.Rape          float64
Data.Rates.Violent.Robbery       float64
Data.Totals.Property.All           int64
Data.Totals.Property.Burglary      int64
Data.Totals.Property.Larceny       int64
Data.Totals.Property.Motor         int64
Data.Totals.Violent.All            int64
Data.Totals.Violent.Assault        int64
Data.Totals.Violent.Murder         int64
Data.Totals.Violent.Rape           int64
Data.Totals.Violent.Robbery        int64
dtype: object

Some useful functions¶

In [6]:
def get_columns(dataframe, key, num):
    to_be_dropped = []

    for col in list(dataframe.columns)[num:]: # never drop certain columns
        if key not in col:
            to_be_dropped.append(col)
                    
    return dataframe.drop(to_be_dropped, axis=1)

def get_not_columns(dataframe, key, num):
    to_be_dropped = []

    for col in list(dataframe.columns)[num:]: # never drop certain columns
        if key in col:
            to_be_dropped.append(col)
                    
    return dataframe.drop(to_be_dropped, axis=1)
In [7]:
def graph_drug(df, cols, labels, ylim):
    
    fig, axes = plt.subplots(1,3)

    fig.set_figheight(5)
    fig.set_figwidth(15)
    
    def subgraph(col, ax, label, ylim):
        df.set_index("Year").groupby("State")[col].plot.line(
            ylim=(0, ylim), alpha=0.7, fontsize=14, ax=ax);

        ax.set_xlabel("Year", fontsize=15);
        ax.set_ylabel("Fraction", fontsize=15);
        ax.set_title(label, fontsize=15);
        ax.tick_params(labelrotation=45)
        
    for i in range(3):
        subgraph(cols[i], axes[i], labels[i], ylim)

    return fig.tight_layout()
In [8]:
def capitalize(string):
    return str.upper(string[0]) + string[1:]
In [9]:
def regress(df, labels, xs, y):
    
    fig, axes = plt.subplots(1,3)
    fig.set_figheight(5)
    fig.set_figwidth(15)
    
    def subplot(label, x, ax):
        res = stats.linregress(df[x], df[y])
        title = "Slope: " + str(np.round(res.slope, 3))
        
        df.plot.scatter(x=x, y=y, ax=ax);
        ax.plot(df[x], res.intercept + res.slope*df[x], color="red")
        
        ax.set_xlabel(label, fontsize=12);
        ax.set_ylabel(capitalize(y), fontsize=12);
        ax.set_title(title, fontsize=15);
        ax.set_xlim((-2,2))
        ax.set_ylim((-2,2))
        
        
    for i in range(3):
        subplot(labels[i], xs[i], axes[i])
        
    return fig.tight_layout()
In [10]:
def graph_coefs(df, xs, title):
    ys = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]

    #blue-ish for property, red-ish for violent
    colors = ["lightsteelblue", "deepskyblue", "royalblue", "darksalmon", "red", "darkorange", "firebrick"]

    # we then get the regression slope for every combination; when df holds
    # z-scored columns, each slope equals the Pearson correlation coefficient
    all_coefs = pd.DataFrame({"Crime": ys})
    all_coefs.set_index("Crime", inplace=True)
    for x in xs:
        coefs = []
        for y in ys:
            res = stats.linregress(df[x], df[y])
            coefs.append(res.slope)
        all_coefs[x] = coefs
    all_coefs = all_coefs.transpose()
    all_coefs.reset_index(inplace=True)

    # we plot those results here
    fig, ax = plt.subplots()

    for i in range(len(ys)):
        all_coefs.plot.scatter(x="index", y=ys[i], ax=ax, marker='o', color=colors[i], s=100, alpha=0.65)
    
    plt.title(title)
    plt.legend(ys)
    ax.set_ylabel("Correlation Coefficients");
    ax.set_ylim((-1,1))
    fig.set_figwidth(15)
    fig.set_figheight(6)
In [11]:
def get_scaled_state(df, state):
    s = df[df["State"] == state]
    return pd.concat([s.iloc[:, :2], stats.zscore(s.iloc[:, 2:])], axis=1)

Data Preprocessing¶

To make the data more usable, we created helper functions that return a copy of only the section of either DataFrame that is relevant to the current analysis. For example, we exclude the raw population values because they are not useful for comparing rates. Both tables already conform to tidy data guidelines, so we did not have to change much. We decided that State and Year together are the best index for the data in most situations: the pair has no repeated values and lets us clearly see changes both by state and over time.
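
The column selection described above can be sketched on a toy frame. This is only an illustration: `keep_matching` is a hypothetical helper (not one of the notebook's functions), the values are made up, and the column names merely imitate the drug dataset's naming scheme.

```python
import pandas as pd

# Toy frame with made-up values; the column names imitate the drug dataset.
toy = pd.DataFrame({
    "State": ["A", "B"],
    "Year": [2002, 2002],
    "Population.12-17": [1000, 2000],
    "Totals.Alcohol.Use Past Month.12-17": [100, 300],
    "Rates.Alcohol.Use Past Month.12-17": [0.10, 0.15],
})

def keep_matching(df, key, n_id_cols=2):
    """Keep the first n_id_cols columns plus every later column whose name contains key."""
    id_cols = list(df.columns[:n_id_cols])
    matches = [c for c in df.columns[n_id_cols:] if key in c]
    return df[id_cols + matches]

# Keep only the rate columns, then index by (State, Year).
rates_only = keep_matching(toy, "Rates").set_index(["State", "Year"])
print(rates_only)
```

Indexing by the (State, Year) pair gives one row per observation, which is why the pair works as an index without repeats.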

In [12]:
# get just percentages, not raw user numbers
drugs_pct = get_columns(drugs, "Rates", 2)
In [13]:
# rename columns
drugs_pct.rename(columns={
    "Rates.Alcohol.Use Disorder Past Year.12-17": "alcoholism past year:12-17",
    "Rates.Alcohol.Use Disorder Past Year.18-25": "alcoholism past year:18-25",
    "Rates.Alcohol.Use Disorder Past Year.26+": "alcoholism past year:26+",
    "Rates.Alcohol.Use Past Month.12-17": "alcohol used past month:12-17",
    "Rates.Alcohol.Use Past Month.18-25": "alcohol used past month:18-25",
    "Rates.Alcohol.Use Past Month.26+": "alcohol used past month:26+",
    
    "Rates.Illicit Drugs.Cocaine Used Past Year.12-17": "cocaine used past year:12-17",
    "Rates.Illicit Drugs.Cocaine Used Past Year.18-25": "cocaine used past year:18-25",
    "Rates.Illicit Drugs.Cocaine Used Past Year.26+": "cocaine used past year:26+",
    
    "Rates.Marijuana.Used Past Month.12-17": "marijuana used past month:12-17",
    "Rates.Marijuana.Used Past Month.18-25": "marijuana used past month:18-25",
    "Rates.Marijuana.Used Past Month.26+": "marijuana used past month:26+",
    "Rates.Marijuana.Used Past Year.12-17": "marijuana used past year:12-17",
    "Rates.Marijuana.Used Past Year.18-25": "marijuana used past year:18-25",
    "Rates.Marijuana.Used Past Year.26+": "marijuana used past year:26+",
    
    "Rates.Tobacco.Use Past Month.12-17": "tobacco used past month:12-17",
    "Rates.Tobacco.Use Past Month.18-25": "tobacco used past month:18-25",
    "Rates.Tobacco.Use Past Month.26+": "tobacco used past month:26+",
}, inplace=True)
In [14]:
# get alcohol related usage rates
alcohol_pct = get_columns(drugs_pct, "alcohol", 2)

# get tobacco related usage rates
tobacco_pct = get_not_columns(get_columns(drugs_pct, "tobacco", 2), "Cigarette", 2)

# get cocaine related usage rates
cocaine_pct = get_columns(drugs_pct, "cocaine", 2)

# get marijuana related usage rates
marijuana_pct = get_not_columns(get_columns(drugs_pct, "marijuana", 2), "New Users", 2)
In [15]:
# rename columns
crimes.rename(columns={
    "Data.Rates.Property.Burglary": "burglary",
    "Data.Rates.Property.Larceny": "larceny",
    "Data.Rates.Property.Motor": "motor",
    "Data.Rates.Violent.Assault": "assault",
    "Data.Rates.Violent.Murder": "murder",
    "Data.Rates.Violent.Rape": "rape",
    "Data.Rates.Violent.Robbery": "robbery"
},inplace=True)
In [16]:
crimes_pct = get_not_columns(crimes, "Data", 2)

Drugs¶

Alcohol¶

In this section we analyze alcohol use and alcoholism in the United States by age group, over time, and by state.

In the cell below, we establish the range of alcohol use disorder ("alcoholism") rates in the past year. This summary statistic offers a preliminary insight into which states may have high or low usage rates for other drugs. It also raises questions about the states with the most extreme values. For example, what about North Dakota in 2003 made its rate so high, and has it changed since then? The following calculations use the 18 to 25 age range.

In [17]:
# the "alcoholism" columns hold the alcohol use disorder rates
alcoholism_18_25 = alcohol_pct.set_index(["State", "Year"])["alcoholism past year:18-25"]

print('State with highest alcohol use disorder rate in the past year from 2002 to 2018:', 
      str(alcoholism_18_25.idxmax()) + ';', alcoholism_18_25.max())

print('State with lowest alcohol use disorder rate in the past year from 2002 to 2018:', 
      str(alcoholism_18_25.idxmin()) + ';', alcoholism_18_25.min())
State with highest alcohol use disorder rate in the past year from 2002 to 2018: ('North Dakota', 2003); 0.272941
State with lowest alcohol use disorder rate in the past year from 2002 to 2018: ('Florida', 2018); 0.071218

In the cell below, we find the mean alcohol use disorder rate in the past year for ages 18 to 25, across all states from 2002 to 2018, to be about 15.12%. This summary statistic is a measure of central tendency and can be used as a reference point for comparing states. For example, it helps us judge whether a state might be considered a "heavy drinking state."

In [18]:
alcohol_pct['alcoholism past year:18-25'].mean()
Out[18]:
0.15118001384083027
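
As a rough sketch of how such a reference point could be used, the snippet below compares each state's mean rate with the all-state mean. The states and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rates for three hypothetical states over two years.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C", "C"],
    "Year": [2002, 2003] * 3,
    "rate": [0.10, 0.12, 0.20, 0.22, 0.15, 0.17],
})

national_mean = toy["rate"].mean()
state_means = toy.groupby("State")["rate"].mean()

# Positive values mark states that sit above the all-state mean.
difference = (state_means - national_mean).sort_values(ascending=False)
print(difference)
```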
In [19]:
cols = ["alcoholism past year:12-17", "alcoholism past year:18-25", "alcoholism past year:26+"]
labels = ["Alcoholism: Ages 12 - 17", "Alcoholism: Ages 18 - 25", "Alcoholism: Ages 26+"]

graph_drug(alcohol_pct, cols, labels, 0.3)

The subplots above show the rate of alcoholism in the past year for each state from 2002 to 2018, where each graph is a different age group and each line is a state. As we can see, alcoholism varies greatly by age range. For ages 12 to 17, the alcohol use disorder rate has dropped in all states. In the 18 to 25 age range, the rate varies greatly by state; however, it is still decreasing over time. Finally, for the 26+ age range, the rate has stayed fairly consistent over this time period.

Tobacco¶

In this section we analyze tobacco use in the United States by age group, over time, and by state.

In [20]:
cols = ["tobacco used past month:12-17", "tobacco used past month:18-25", "tobacco used past month:26+"]
labels = ["Tobacco Use in the Past Month: Ages 12-17", "Tobacco Use in the Past Month: Ages 18-25", "Tobacco Use in the Past Month: Ages 26+"]

graph_drug(tobacco_pct, cols, labels, 0.65)

In the subplots above, we provide a similar overview of tobacco use in the past month across age ranges for all states. We see a significant nationwide decrease in tobacco use in the 12 to 17 age range. One possible factor in this decrease is the introduction of vape devices, which offer an alternative way to consume nicotine, the addictive ingredient in tobacco. The 18 to 25 age range has also seen a decrease across most states but has the highest rates overall among the three age groups. This is likely due to the historic prevalence of nicotine use among teenagers, which has also been affected by the introduction of vape devices. Finally, the 26+ age range has maintained steady rates over time but shows greater variance by state.

Cocaine¶

In this section we analyze cocaine use in the United States by age group, over time, and by state.

In [21]:
cols = ["cocaine used past year:12-17", "cocaine used past year:18-25", "cocaine used past year:26+"]
labels = ["Cocaine Use in the Past Year: Ages 12-17", "Cocaine Use in the Past Year: Ages 18-25", "Cocaine Use in the Past Year: Ages 26+"]

graph_drug(cocaine_pct, cols, labels, 0.13)

Cocaine has the lowest usage rates of the four drugs in our analysis. Cocaine use in the past year in the 12 to 17 range has decreased since 2002, although usage rates were never very high in this age group. The 18 to 25 range has by far the highest rates and the greatest spread in usage rates between states. It is interesting to note an uptick in usage for this group around 2012-2014. The 26+ range has lower rates than the 18 to 25 range. This oldest group has one outlier with much higher usage rates, the District of Columbia, which peaked in 2006 with a past-year usage rate of 0.0525 (note that the District of Columbia is treated as a state in this dataset). The general trend has remained fairly constant over time.

Marijuana¶

In this section we analyze marijuana use by age group, over time, and by state.

In [22]:
cols = ["marijuana used past year:12-17", "marijuana used past year:18-25", "marijuana used past year:26+"]
labels = ["Marijuana Use in the Past Year: Ages 12-17", "Marijuana Use in the Past Year: Ages 18-25", "Marijuana Use in the Past Year: Ages 26+"]

graph_drug(marijuana_pct, cols, labels, 0.55)

Marijuana is the only drug we analyzed with an upward trend in multiple age ranges. The 12 to 17 age range has remained similar over time. For ages 18 to 25, there is a slight upward trend overall with a high degree of variation between states. For ages 26+, there is a clear upward trend in all states, although the rate of increase differs by state. These increases are likely due to the recent focus on legalization of medical and, in some cases, recreational marijuana use.

Crimes¶

Below, we bring the crimes dataset into our analysis so that we can draw conclusions from both datasets together. This dataset has the same unit of observation as our initial dataset: a state in a certain year. It contains a total population column, followed by data on two types of crime: property crimes and violent crimes. For each type, it has an aggregate column (a total property crime rate and a total violent crime rate) as well as a column for each crime within the category. Each entry is the rate at which that crime was committed in that state in that year.
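
The rate columns appear to be expressed as offenses per 100,000 residents. As a quick check under that assumption, the sketch below recomputes three rates from the totals in the first row of `crimes.head()` (Alabama, 1960). The helper name `rate_per_100k` is ours and is not part of the dataset or the notebook.

```python
# Values copied from the first row of crimes.head() (Alabama, 1960).
population = 3266740
totals = {"property_all": 33823, "burglary": 11626, "violent_all": 6097}

def rate_per_100k(count, population):
    """Offenses per 100,000 residents (assumed unit of the rate columns)."""
    return count / population * 100_000

rates = {name: round(rate_per_100k(count, population), 1)
         for name, count in totals.items()}
print(rates)  # {'property_all': 1035.4, 'burglary': 355.9, 'violent_all': 186.6}
```

The recomputed values match the `Data.Rates` columns in that row (1035.4, 355.9, and 186.6), which supports the per-100,000 reading.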

In [23]:
#this is the df that just has the rates and within the relevant years
crimes_pct = crimes_pct[(crimes_pct.Year >= 2002) & (crimes_pct.Year<=2018)].iloc[:, :12]
crimes_pct.head()
Out[23]:
State Year burglary larceny motor assault murder rape robbery
42 Alabama 2002 950.6 2767.0 310.1 268.0 6.8 37.2 133.1
43 Alabama 2003 960.2 2754.1 332.1 251.7 6.6 36.8 134.1
44 Alabama 2004 987.0 2732.4 309.9 249.4 5.6 38.5 133.5
45 Alabama 2005 955.8 2656.0 289.0 248.3 8.2 34.4 141.7
46 Alabama 2006 973.7 2640.8 326.5 227.5 8.3 35.8 153.6
In [24]:
fig, axes = plt.subplots(3,3)
fig.set_figwidth(14)
fig.set_figheight(14)


crimes_list = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]
starter = crimes_pct.set_index("Year").groupby("State")

counter = 0
for i in range(3):
    for j in range(3):
        if counter < 7:
            starter[crimes_list[counter]].plot.line(ylabel="Offenses per 100,000 People", ax = axes[i][j])
            axes[i][j].set_title(capitalize(crimes_list[counter]))
            counter += 1
            
fig.delaxes(axes[2][1])
fig.delaxes(axes[2][2])
fig.tight_layout()

In the graphs above, we display the rate of each crime for each state from 2002 to 2018. The graphs are meant to provide an overview, but we can observe an overall decrease across most crimes during this time, with the exception of assault, murder, and rape. The rates of these three crimes remain fairly constant, but some states have actually seen an increase in rape. This increase could be due to many factors, but in recent years, coming forward to report rape has seemingly become a much more viable option for victims. An increase in reporting may account for some of the increase we see for this crime.

Some of the graphs above present obvious outliers. First of all, Washington D.C. has had by far the largest increase in larceny since 2006, while most other states have had declining rates since then. Additionally, Washington D.C. had a murder rate above the rest of the states during this time period. It has seen an overall decrease but still remains far above the average of the other states. The outlier in the rape category is Alaska. It is one of the only states with an increase in rape rates, and its rates were already above average. Finally, the most obvious outlier in the robbery category is, again, Washington D.C. Its robbery rate is decreasing but remains far above those of the other states.

Washington D.C. is an outlier in many categories. This could be because its population density is much higher than that of any state. Rates of these crimes are usually higher in urban environments, and almost all of Washington D.C. is urban, so it makes sense that the rates there are higher.

Drug/Alcohol Use and Crimes¶

Now, we will create a DataFrame consisting of the drug usage rates merged with the relevant crime data. We will use this new DataFrame to observe relationships between drug usage rates and crime rates at both the national and the state level.

Country Level¶

In [25]:
features = list(alcohol_pct.iloc[:, 5:].columns
                .append(cocaine_pct.iloc[:, 1:].columns)
                .append(tobacco_pct.iloc[:, 2:].columns)
                .append(marijuana_pct.iloc[:, 5:].columns))

merged = drugs_pct.merge(crimes_pct, on = ["State", "Year"], how="inner")

all_combos = list(powerset(features))[7099:] #less than 9 features is never optimal
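
To illustrate what the inner merge on `State` and `Year` does, here is a minimal sketch on made-up data. The frames `drugs_toy` and `crimes_toy` and their values are hypothetical and only mimic the shape of the real tables.

```python
import pandas as pd

# Made-up rows: the drug data covers 2002-2003, the crime data covers 2001-2002.
drugs_toy = pd.DataFrame({
    "State": ["A", "A", "B"],
    "Year": [2002, 2003, 2002],
    "alcohol": [0.50, 0.52, 0.40],
})
crimes_toy = pd.DataFrame({
    "State": ["A", "A", "B"],
    "Year": [2001, 2002, 2002],
    "burglary": [900.0, 880.0, 700.0],
})

# An inner join keeps only the (State, Year) pairs present in both frames.
merged_toy = drugs_toy.merge(crimes_toy, on=["State", "Year"], how="inner")
print(merged_toy)
```

Only the 2002 rows survive, since that is the only year both toy frames share. In the same way, the real merge keeps only state-years that appear in both datasets.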

KNN Model¶

In the cell below, we use a K Nearest Neighbors model to determine which set of drug use features best predicts a given crime rate at the national level. To do this, we created a powerset of the drug use features and ran our model on each set of at least 9 features against each type of crime. We settled on 9 as the minimum after experimenting with different combinations. To let the cell run in a reasonable amount of time, we eliminated the combinations of features that we knew would never be selected.
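
The slice offset 7099 used when building `all_combos` can be checked with a little counting. The sketch below assumes the feature list holds 13 entries (the 12 drug use rates plus `Year`) and re-implements `powerset` with the standard `itertools` recipe, which yields subsets in order of increasing size.

```python
from itertools import chain, combinations
from math import comb

def powerset(items):
    """All subsets of items, smallest first (the standard itertools recipe)."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

n_features = 13  # assumed: 12 drug use rates plus Year
subsets = list(powerset(range(n_features)))

# Subsets with fewer than 9 elements come first, so their count is the offset.
n_small = sum(comb(n_features, k) for k in range(9))

print(len(subsets), n_small)    # 8192 7099
print(len(subsets[n_small]))    # 9
print(len(subsets) - n_small)   # 1093
```

Under that assumption, skipping the first 7099 subsets leaves the 1093 combinations with at least 9 features.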

Additionally, we chose the value of K to be 2. Once again, we experimented with different values and found that 2 always gave the best result, so we fixed K at 2 rather than loop over more values than necessary.

Using the mean absolute percentage error (MAPE), we found the combination of features that scored best for each crime.
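
For reference, the mean absolute percentage error averages the absolute errors relative to the true values. The sketch below is a plain-Python version for illustration, with made-up numbers. It is not the scikit-learn implementation, although, like `mean_absolute_percentage_error`, it returns a fraction rather than a percentage.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up crime rates and predictions, for illustration only.
actual = [1000.0, 800.0, 500.0]
predicted = [950.0, 840.0, 500.0]

score = mape(actual, predicted)
print(round(score * 100, 2), "%")  # 3.33 %
```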

In [26]:
crimes = ["burglary", "larceny", "motor", "assault", "murder", "rape", "robbery"]
for crime in crimes:
    final_mape = np.inf
    for features in all_combos:

        model = KNeighborsRegressor(n_neighbors=2) # the best value is always 2
        x_train = merged[list(features)]
        y_train = merged[crime]
        
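        # standardize the features so that each contributes equally to the distances
        scaler = StandardScaler()
        x_train_sc = scaler.fit_transform(x_train)
        
        model.fit(x_train_sc, y_train)

        # the error is measured on the same rows the model was fit on
        y_pred = model.predict(x_train_sc)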
        mape = mean_absolute_percentage_error(y_train, y_pred)

        if mape < final_mape:
            final_mape = mape
            final_features = list(features)

    print("crime:", crime)
    print("mape:", np.round(final_mape*100, 2), "%")
    print("# of features:", len(final_features))
    print("features:", final_features)
    print("---------------------------------------------------------------------------------")
crime: burglary
mape: 5.47 %
# of features: 12
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:18-25', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: larceny
mape: 3.7 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: motor
mape: 7.75 %
# of features: 10
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: assault
mape: 7.49 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: murder
mape: 9.64 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: rape
mape: 7.08 %
# of features: 9
features: ['alcohol used past month:12-17', 'alcohol used past month:26+', 'Year', 'cocaine used past year:18-25', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
crime: robbery
mape: 8.31 %
# of features: 11
features: ['alcohol used past month:12-17', 'alcohol used past month:18-25', 'alcohol used past month:26+', 'Year', 'cocaine used past year:12-17', 'cocaine used past year:26+', 'tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+', 'marijuana used past year:18-25', 'marijuana used past year:26+']
---------------------------------------------------------------------------------
  • Burglary was found to be best predicted by 12 features; the only feature left out is marijuana use in the past year for ages 12-17.

  • Larceny, assault, murder, and robbery were all found to be best predicted by 11 features, although they differ in features included. All of them included alcohol use in the past month for all three age groups, cocaine use in the past year for the 12 to 17 age group and the 26+ age group, and the year in the list of features. They have different mixes of tobacco features and marijuana features.

  • Theft of a motor vehicle was best predicted by 10 features. It had all alcohol categories and the same 2 cocaine categories as the 11-feature crimes above. It also had the two older age groups for both tobacco and marijuana.

  • Rape had the fewest factors at only 9. It is the only crime to not consider all 3 age groups for alcohol. It still has cocaine and marijuana use for the older 2 age groups, and the youngest and oldest age group for tobacco.

It is notable that all 3 age groups of the alcohol use feature are present for every crime except rape, which has only 2. In addition, at least 2 age groups of the cocaine feature appear in every model. These 2 drugs are the ones we predicted would have the most influence on crime rates.

The fact that rape has only 2 categories of alcohol use is also an interesting finding. Rape is arguably an outlier in our set of crimes in that it is sexual in nature. In general, the uniqueness of this crime relative to the other crimes we considered makes it harder to predict. We believe that the motivators of a sexual crime are not necessarily similar to those of non-sexual crimes, which may explain why so few features are used for that crime.
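
One caveat when reading the scores above: the cell measures MAPE on the same rows the model was fit on, and with K = 2 each row is normally one of its own two nearest neighbors. The sketch below uses a hand-written 2-nearest-neighbor regressor on made-up, deliberately noisy one-feature data (not the notebook's model or data) to show how a leave-one-out error can differ from the training error. The function names are ours.

```python
def two_nn_predict(x_train, y_train, x_query, skip=None):
    """Average the targets of the 2 nearest training points (optionally skipping one index)."""
    candidates = [i for i in range(len(x_train)) if i != skip]
    nearest = sorted(candidates, key=lambda i: abs(x_train[i] - x_query))[:2]
    return sum(y_train[i] for i in nearest) / 2

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up one-feature data with no tied distances.
x = [0.0, 1.0, 2.5, 4.5, 7.0, 10.0]
y = [10.0, 20.0, 10.0, 20.0, 10.0, 20.0]

# Training error: each point counts itself as one of its neighbors.
train_pred = [two_nn_predict(x, y, xi) for xi in x]
# Leave-one-out error: each point is predicted from the other points only.
loo_pred = [two_nn_predict(x, y, xi, skip=i) for i, xi in enumerate(x)]

train_mape = mape(y, train_pred)
loo_mape = mape(y, loo_pred)
print(round(train_mape, 3), round(loo_mape, 3))  # 0.375 0.625
```

On this toy data the leave-one-out error is noticeably higher than the training error. Whether the same holds for the crime models would need to be checked with held-out data.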

Louisiana¶

Now that we have looked at the country as a whole, we will dive deeper into our home state: Louisiana. Below we will point out some more specific correlations between certain crimes and certain drugs in Louisiana. We will also look at how these correlations differ between age groups.

In [27]:
merged = drugs_pct.merge(crimes_pct, on=["State", "Year"], how="inner")
In [28]:
la_sc = get_scaled_state(merged, "Louisiana")

Alcohol Use in the Past Month and Burglary, in Louisiana¶

In [29]:
labels = ["Alcohol Use in Past Month: Ages 12-17", "Alcohol Use in Past Month: Ages 18-25", "Alcohol Use in Past Month: Ages 26+"]
xs = ["alcohol used past month:12-17", "alcohol used past month:18-25", "alcohol used past month:26+"]
y = "burglary"

regress(la_sc, labels, xs, y)

In the subplots above, we used our merged DataFrame to compute correlation coefficients between the alcohol usage rate in the past month and the burglary rate for each of our 3 age ranges in Louisiana. The 12 to 17 age range has a correlation coefficient of 0.767, while the 18 to 25 age range has the largest positive correlation coefficient of the three: 0.882. This value represents a strong positive relationship between alcohol use among Louisiana residents in this age group and the burglary rate. The 26+ age range has a negative correlation coefficient, indicating an inverse relationship in Louisiana.

Cocaine Use in the Past Year and Larceny, in Louisiana¶

In [30]:
labels = ["Cocaine Use in Past Year: Ages 12-17", "Cocaine Use in Past Year: Ages 18-25", "Cocaine Use in Past Year: Ages 26+"]
xs = ["cocaine used past year:12-17", "cocaine used past year:18-25", "cocaine used past year:26+"]
y = "larceny"

regress(la_sc, labels, xs, y)

The 12 to 17 age range has the highest correlation coefficient, 0.892, indicating a strong positive relationship between cocaine use in the past year and larceny among 12 to 17 year olds in Louisiana. It is important to note that the cocaine usage rates for this age group are far lower than for the other 2 ranges. This means that while there may be a positive relationship, there are fewer cocaine users from which to draw this conclusion. Larceny is the lowest level of theft-related crime and indicates simple theft as opposed to robbery or burglary, so it makes sense that it is correlated with a younger age group. The other 2 graphs have similar coefficients, with the 26+ group being slightly higher. The lower coefficients in the older two age groups mean that people in these age groups are not necessarily more likely to commit larceny if they have used cocaine in the past year.

Correlation Coefficients Visualized¶

After observing correlations between different drug usage rates and different crimes, we were able to go a step further and look at changes in coefficients (betas) between different age groups. Here we calculate the betas for all crimes and for different drugs.

An important thing to note is that because the data is scaled and each coefficient is calculated in a single variable regression, the coefficients are the same number as the correlation coefficients. This is not always the case, but here it does hold true.
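
To see why this holds, recall that in a single-variable least-squares regression the slope equals the correlation coefficient multiplied by the ratio of the standard deviations of y and x. Once both variables are standardized, that ratio is 1, so the slope and the correlation coefficient coincide. The sketch below checks this on synthetic data; the variable names, the sample size of 17, and the use of `np.polyfit` are assumptions made for illustration and are not taken from the helper functions used in this notebook.

```python
import numpy as np

# Synthetic example data (NOT the Louisiana table): 17 made-up observations.
rng = np.random.default_rng(0)
x = rng.normal(loc=20.0, scale=4.0, size=17)
y = 3.0 * x + rng.normal(scale=10.0, size=17)


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    return (values - values.mean()) / values.std()


zx, zy = standardize(x), standardize(y)

# Single-variable least-squares fit on the standardized data.
scaled_slope, scaled_intercept = np.polyfit(zx, zy, deg=1)

# Pearson correlation coefficient of the original (unscaled) data.
r = np.corrcoef(x, y)[0, 1]

# Slope on the unscaled data, for comparison: it equals r * std(y) / std(x).
raw_slope = np.polyfit(x, y, deg=1)[0]

print(f"slope on scaled data: {scaled_slope:.4f}")
print(f"correlation r:        {r:.4f}")
print(f"slope on raw data:    {raw_slope:.4f}")
```

The first two printed values agree, while the raw slope generally does not equal r, which is why the equality only holds here because the data is scaled.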

In [31]:
xs = ["alcohol used past month:12-17", "alcohol used past month:18-25", "alcohol used past month:26+"]
graph_coefs(la_sc, xs, "Alcohol Use in the Past Month and All Crimes in Louisiana")

In the graph above we have three age groups on the x-axis and the coefficient values between alcohol use in the past month and each of the 7 types of crime. We can see an obvious trend in all but one crime: alcohol use in older age groups is less correlated with committing crimes. The one crime that has not seen any real decrease is rape. This is arguably the most serious and vulgar crime of those we have analyzed, which may explain why it does not behave like the others.

In [32]:
xs = ['tobacco used past month:12-17', 'tobacco used past month:18-25', 'tobacco used past month:26+']
graph_coefs(la_sc, xs, "Tobacco Use in the Past Month and All Crimes in Louisiana")

This graph is very similar to the last graph. We find that the coefficients are generally positive in younger age groups and then decrease for the oldest age group. Once again, rape behaves very differently from the other crimes especially in the two younger age groups.

In [33]:
xs = ['cocaine used past year:12-17', 'cocaine used past year:18-25', 'cocaine used past year:26+']
graph_coefs(la_sc, xs, "Cocaine Use in the Past Year and All Crimes in Louisiana")

This graph is slightly different than the previous two graphs. Instead of the betas decreasing in the older age groups we see that the coefficients remain high. However, it is similar to the previous graphs because rape is once again an outlier in 2 of the 3 age groups.

In [34]:
xs = ['marijuana used past year:12-17', 'marijuana used past year:18-25', 'marijuana used past year:26+']
graph_coefs(la_sc, xs, "Marijuana Use in the Past Year and All Crimes in Louisiana")

This graph is surprisingly different from the previous 3 graphs. In the 2 older age groups, almost all crimes are actually negatively correlated with marijuana use. Rape, again, is the obvious outlier in this graph.

Conclusion¶

We recognize our data set is not conducive to a predictive model because the unit of observation is a state in a particular year. Each of the numbers in our table is already a huge aggregation of many statistics. It would be more helpful to predict the outcome of an individual rather than the average crime rate of a state. Unfortunately, we don't have the data to make individual predictions.

Our analysis of the country, while interesting, might not say that much about any one specific drug. As mentioned above, these numbers are already averages of many other numbers. Despite this, we think determining which set of features is the best predictor may give some insight into the relationship between specific drugs and specific crimes.

Furthermore, many states may cancel each other out, making the model far from optimal. This is why we decided to examine one state, Louisiana, to see what these relationships looked like on a smaller scale. This view of the problem revealed some stronger relationships between drugs and crime. We can't say certain drugs and certain crimes are always correlated. It depends highly on the age group and state.
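
The cancelling effect can be illustrated with a small synthetic example. In the sketch below, two made-up states have strong relationships between a usage rate and a crime rate, but in opposite directions, and pooling their rows leaves almost no overall correlation. All numbers here are invented for illustration and are not drawn from our data sets.

```python
import numpy as np

# Synthetic data for two hypothetical states over 17 years (NOT real data).
rng = np.random.default_rng(1)
usage = np.linspace(5.0, 15.0, 17)  # the same usage rates in both states

# State A: crime rises with usage. State B: crime falls with usage.
crime_a = 100.0 + 4.0 * usage + rng.normal(scale=3.0, size=usage.size)
crime_b = 180.0 - 4.0 * usage + rng.normal(scale=3.0, size=usage.size)

# Correlation within each state.
r_a = np.corrcoef(usage, crime_a)[0, 1]
r_b = np.corrcoef(usage, crime_b)[0, 1]

# Correlation after stacking both states into one pooled sample.
pooled_usage = np.concatenate([usage, usage])
pooled_crime = np.concatenate([crime_a, crime_b])
r_pooled = np.corrcoef(pooled_usage, pooled_crime)[0, 1]

print(f"state A: {r_a:.3f}, state B: {r_b:.3f}, pooled: {r_pooled:.3f}")
```

Each state on its own shows a strong relationship, while the pooled correlation is close to zero, which is the kind of masking that motivated looking at a single state.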

Finally, it is important to note that we cannot infer any causation from this analysis. The data is not following the same people through different age groups or following people that do drugs and commit crimes. The data is simply a measurement of rates in different states.