# Dataset Description
*heart_disease_train_dataset.csv* contains 250 training samples, each with 6 features and 1 target label. The first row is the header row containing the column names, for a total of 251 rows and 7 columns.

Below is the list of column names, their possible values, and the information that they represent:

Column Name                            | Variable Type | Remarks 
---------------------------------------|---------------|--------
resting_blood_pressure                 | Categorical   | 0 = normal ; 1 = elevated ; 2 = hypertension
serum_cholesterol                      | Categorical   | 0 = normal ; 1 = mild-risk ; 2 = high-risk
diabetes                               | Binary        | 0 = no ; 1 = yes
left_ventricular_hypertrophy           | Binary        | 0 = no ; 1 = yes
ST_slope_anomaly                       | Binary        | 0 = no ; 1 = yes
myocardial_defect                      | Binary        | 0 = no ; 1 = yes
heart_disease                          | Binary        | 0 = no ; 1 = yes

*heart_disease_test_dataset.csv* contains 50 test samples, each with the same 6 features and 1 target label. The first row is also the header row containing the column names, for a total of 51 rows and 7 columns. Your task is to predict the 50 target labels with a *Naive Bayes Classifier*, using conditional probabilities calculated from the 250 training samples, and then check your accuracy against the actual target labels.


## Mount Google Drive
Download and save a copy of the Lab2 notebook and CSV files (*lab2.ipynb*, *heart_disease_train_dataset.csv*, *heart_disease_test_dataset.csv*) to your Google Drive, ensuring that all three files are in the same folder.

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


Modify the path in the *%cd* command to match the location in your Google Drive where you saved the notebook and CSV files. Double-check the output of the *%ls* command to confirm that you are in the correct working directory; it should list *lab2.ipynb*, *heart_disease_train_dataset.csv*, and *heart_disease_test_dataset.csv*.

In [2]:
%cd "/content/drive/My Drive/HKUST CSE IA/COMP2211 Exploring Artificial Intelligence Spring 2022/preparation/Lab2"
%ls

/content/drive/My Drive/HKUST CSE IA/COMP2211 Exploring Artificial Intelligence Spring 2022/preparation/Lab2
heart_disease_dataset.xlsx              lab2.ipynb
heart_disease_features_description.txt  lab2_sol.ipynb
heart_disease_test_dataset.csv          Naive-Bayes-From-Scratch.ipynb
heart_disease_train_dataset.csv         TODO.txt
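
As an optional cross-check of the *%ls* output, the sketch below tests for the three required files with Python's `pathlib`. The helper name `missing_files` is hypothetical and not part of the lab; it assumes the working directory has already been set with *%cd*.

```python
from pathlib import Path

# The three files this lab expects to find side by side.
REQUIRED_FILES = [
    "lab2.ipynb",
    "heart_disease_train_dataset.csv",
    "heart_disease_test_dataset.csv",
]

def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the names from `names` that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in names if not (folder / name).is_file()]

# An empty list means all three files are in the current working directory.
print(missing_files())
```

If the list is not empty, the path passed to *%cd* probably points at a different folder.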


## Load Datasets
Load the train and test datasets from Google Drive into NumPy arrays using *loadtxt()*.
To read *.csv* files, we specify a comma as the delimiter and skip the header row with *skiprows=1*.

In [3]:
import numpy as np
train = np.loadtxt("heart_disease_train_dataset.csv", delimiter=',', skiprows=1)
test = np.loadtxt("heart_disease_test_dataset.csv", delimiter=',', skiprows=1)
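
To illustrate what `delimiter` and `skiprows` do, here is a minimal sketch that parses a made-up three-row file held in memory. The rows are invented for illustration; only the header matches the column names listed in the dataset description.

```python
import io
import numpy as np

# A made-up three-row file with the same header as the lab datasets.
toy_csv = io.StringIO(
    "resting_blood_pressure,serum_cholesterol,diabetes,"
    "left_ventricular_hypertrophy,ST_slope_anomaly,myocardial_defect,heart_disease\n"
    "2,1,0,1,0,1,1\n"
    "0,0,0,0,1,0,0\n"
    "1,2,1,0,0,0,1\n"
)
toy = np.loadtxt(toy_csv, delimiter=",", skiprows=1)

print(toy.shape)  # (rows, columns), not counting the header row
print(toy.dtype)  # loadtxt returns floating-point values by default
```

By the same logic, `train.shape` and `test.shape` should be `(250, 7)` and `(50, 7)` if the files match the description above.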

# Naive Bayes Classifier

## Task 1: Relative Frequencies
First of all, calculate the relative frequency of each feature value given the target label (the likelihoods), as well as the relative frequency of each target label (the prior probabilities). Since our goal is only classification, we don't need the denominator (the marginal probability), which is the same for every label.
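
To see why the denominator can be dropped, write Bayes' rule under the naive independence assumption, with $B_i$ denoting a target label and $e_1, \dots, e_d$ the observed feature values (here $d = 6$):

$$P(B_i \mid e_1, \dots, e_d) = \frac{P(B_i) \prod_{n=1}^{d} P(e_n \mid B_i)}{P(e_1, \dots, e_d)}$$

The denominator does not depend on $B_i$, so it scales the scores of both labels by the same positive factor and leaves their ranking unchanged:

$$\arg\max_{B_i} P(B_i \mid e_1, \dots, e_d) = \arg\max_{B_i} P(B_i) \prod_{n=1}^{d} P(e_n \mid B_i)$$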

Optional: Fill in the table below to help keep track of the calculations.

|               | Resting Blood Pressure |     |    | / | Serum Cholesterol |     |    | / | Diabetes |     |    | / | Left Ventricular Hypertrophy |     |    | / | ST Slope Anomaly |     |    | / | Myocardial Defect |     |    | / | Heart Disease |     |    |
|---------------|------------------------|-----|----|---|-------------------|-----|----|---|----------|-----|----|---|------------------------------|-----|----|---|------------------|-----|----|---|-------------------|-----|----|---|---------------|-----|----|
| Heart Disease |                        | Yes | No | / |                   | Yes | No | / |          | Yes | No | / |                              | Yes | No | / |                  | Yes | No | / |                   | Yes | No | / |               | Yes | No |
|               | Hypertension           |     |    | / | High-Risk         |     |    | / | Yes      |     |    | / | Yes                          |     |    | / | Yes              |     |    | / | Yes               |     |    | / | -             |     |    |
|               | Elevated               |     |    | / | Mild-Risk         |     |    | / | No       |     |    | / | No                           |     |    | / | No               |     |    | / | No                |     |    | / | -             | -   | -  |
|               | Normal                 |     |    | / | Normal            |     |    | / | -        | -   | -  | / | -                            | -   | -  | / | -                | -   | -  | / | -                 | -   | -  | / | -             | -   | -  |


In [4]:
train_features = train[:, :-1] # All except the last column.
train_labels = train[:, -1] # Only the last column.

num_heart_disease_yes = 0 # Count of (heart_disease = yes)
num_heart_disease_no = 0 # Count of (heart_disease = no)

num_resting_blood_pressure_hypertension_yes = 0 # Count of (resting_blood_pressure = hypertension | heart_disease = yes)
num_resting_blood_pressure_elevated_yes = 0 # Count of (resting_blood_pressure = elevated | heart_disease = yes)
num_resting_blood_pressure_normal_yes = 0 # Count of (resting_blood_pressure = normal | heart_disease = yes)
num_resting_blood_pressure_hypertension_no = 0 # Count of (resting_blood_pressure = hypertension | heart_disease = no)
num_resting_blood_pressure_elevated_no = 0 # Count of (resting_blood_pressure = elevated | heart_disease = no)
num_resting_blood_pressure_normal_no = 0 # Count of (resting_blood_pressure = normal | heart_disease = no)

num_serum_cholesterol_highrisk_yes = 0 # Count of (serum_cholesterol = high-risk | heart_disease = yes)
num_serum_cholesterol_mildrisk_yes = 0 # Count of (serum_cholesterol = mild-risk | heart_disease = yes)
num_serum_cholesterol_normal_yes = 0 # Count of (serum_cholesterol = normal | heart_disease = yes)
num_serum_cholesterol_highrisk_no = 0 # Count of (serum_cholesterol = high-risk | heart_disease = no)
num_serum_cholesterol_mildrisk_no = 0 # Count of (serum_cholesterol = mild-risk | heart_disease = no)
num_serum_cholesterol_normal_no = 0 # Count of (serum_cholesterol = normal | heart_disease = no)

num_diabetes_yes_yes = 0 # Count of (diabetes = yes | heart_disease = yes)
num_diabetes_no_yes = 0 # Count of (diabetes = no | heart_disease = yes)
num_diabetes_yes_no = 0 # Count of (diabetes = yes | heart_disease = no)
num_diabetes_no_no = 0 # Count of (diabetes = no | heart_disease = no)

num_left_ventricular_hypertrophy_yes_yes = 0 # Count of (left_ventricular_hypertrophy = yes | heart_disease = yes)
num_left_ventricular_hypertrophy_no_yes = 0 # Count of (left_ventricular_hypertrophy = no | heart_disease = yes)
num_left_ventricular_hypertrophy_yes_no = 0 # Count of (left_ventricular_hypertrophy = yes | heart_disease = no)
num_left_ventricular_hypertrophy_no_no = 0 # Count of (left_ventricular_hypertrophy = no | heart_disease = no)

num_ST_slope_anomaly_yes_yes = 0 # Count of (ST_slope_anomaly = yes | heart_disease = yes)
num_ST_slope_anomaly_no_yes = 0 # Count of (ST_slope_anomaly = no | heart_disease = yes)
num_ST_slope_anomaly_yes_no = 0 # Count of (ST_slope_anomaly = yes | heart_disease = no)
num_ST_slope_anomaly_no_no = 0 # Count of (ST_slope_anomaly = no | heart_disease = no)

num_myocardial_defect_yes_yes = 0 # Count of (myocardial_defect = yes | heart_disease = yes)
num_myocardial_defect_no_yes = 0 # Count of (myocardial_defect = no | heart_disease = yes)
num_myocardial_defect_yes_no = 0 # Count of (myocardial_defect = yes | heart_disease = no)
num_myocardial_defect_no_no = 0 # Count of (myocardial_defect = no | heart_disease = no)

for row in range(train_features.shape[0]):
  if train_labels[row] == 1:
    num_heart_disease_yes += 1

    if train_features[row, 0] == 2:
      num_resting_blood_pressure_hypertension_yes += 1
    elif train_features[row, 0] == 1:
      num_resting_blood_pressure_elevated_yes += 1
    else:
      num_resting_blood_pressure_normal_yes += 1
    
    if train_features[row, 1] == 2:
      num_serum_cholesterol_highrisk_yes += 1
    elif train_features[row, 1] == 1:
      num_serum_cholesterol_mildrisk_yes += 1
    else:
      num_serum_cholesterol_normal_yes += 1
    
    if train_features[row, 2] == 1:
      num_diabetes_yes_yes += 1
    else:
      num_diabetes_no_yes += 1

    if train_features[row, 3] == 1:
      num_left_ventricular_hypertrophy_yes_yes += 1
    else:
      num_left_ventricular_hypertrophy_no_yes += 1

    if train_features[row, 4] == 1:
      num_ST_slope_anomaly_yes_yes += 1
    else:
      num_ST_slope_anomaly_no_yes += 1

    if train_features[row, 5] == 1:
      num_myocardial_defect_yes_yes += 1
    else:
      num_myocardial_defect_no_yes += 1

  else:
    num_heart_disease_no += 1

    if train_features[row, 0] == 2:
      num_resting_blood_pressure_hypertension_no += 1
    elif train_features[row, 0] == 1:
      num_resting_blood_pressure_elevated_no += 1
    else:
      num_resting_blood_pressure_normal_no += 1
    
    if train_features[row, 1] == 2:
      num_serum_cholesterol_highrisk_no += 1
    elif train_features[row, 1] == 1:
      num_serum_cholesterol_mildrisk_no += 1
    else:
      num_serum_cholesterol_normal_no += 1
    
    if train_features[row, 2] == 1:
      num_diabetes_yes_no += 1
    else:
      num_diabetes_no_no += 1

    if train_features[row, 3] == 1:
      num_left_ventricular_hypertrophy_yes_no += 1
    else:
      num_left_ventricular_hypertrophy_no_no += 1

    if train_features[row, 4] == 1:
      num_ST_slope_anomaly_yes_no += 1
    else:
      num_ST_slope_anomaly_no_no += 1

    if train_features[row, 5] == 1:
      num_myocardial_defect_yes_no += 1
    else:
      num_myocardial_defect_no_no += 1

heart_disease_yes = num_heart_disease_yes/train.shape[0] # P(heart_disease = yes)
heart_disease_no = num_heart_disease_no/train.shape[0] # P(heart_disease = no)

resting_blood_pressure_hypertension_yes = num_resting_blood_pressure_hypertension_yes/num_heart_disease_yes # P(resting_blood_pressure = hypertension | heart_disease = yes)
resting_blood_pressure_elevated_yes = num_resting_blood_pressure_elevated_yes/num_heart_disease_yes # P(resting_blood_pressure = elevated | heart_disease = yes)
resting_blood_pressure_normal_yes = num_resting_blood_pressure_normal_yes/num_heart_disease_yes # P(resting_blood_pressure = normal | heart_disease = yes)
resting_blood_pressure_hypertension_no = num_resting_blood_pressure_hypertension_no/num_heart_disease_no # P(resting_blood_pressure = hypertension | heart_disease = no)
resting_blood_pressure_elevated_no = num_resting_blood_pressure_elevated_no/num_heart_disease_no # P(resting_blood_pressure = elevated | heart_disease = no)
resting_blood_pressure_normal_no = num_resting_blood_pressure_normal_no/num_heart_disease_no # P(resting_blood_pressure = normal | heart_disease = no)

serum_cholesterol_highrisk_yes = num_serum_cholesterol_highrisk_yes/num_heart_disease_yes # P(serum_cholesterol = high-risk | heart_disease = yes)
serum_cholesterol_mildrisk_yes = num_serum_cholesterol_mildrisk_yes/num_heart_disease_yes # P(serum_cholesterol = mild-risk | heart_disease = yes)
serum_cholesterol_normal_yes = num_serum_cholesterol_normal_yes/num_heart_disease_yes # P(serum_cholesterol = normal | heart_disease = yes)
serum_cholesterol_highrisk_no = num_serum_cholesterol_highrisk_no/num_heart_disease_no # P(serum_cholesterol = high-risk | heart_disease = no)
serum_cholesterol_mildrisk_no = num_serum_cholesterol_mildrisk_no/num_heart_disease_no # P(serum_cholesterol = mild-risk | heart_disease = no)
serum_cholesterol_normal_no = num_serum_cholesterol_normal_no/num_heart_disease_no # P(serum_cholesterol = normal | heart_disease = no)

diabetes_yes_yes = num_diabetes_yes_yes/num_heart_disease_yes # P(diabetes = yes | heart_disease = yes)
diabetes_no_yes = num_diabetes_no_yes/num_heart_disease_yes # P(diabetes = no | heart_disease = yes)
diabetes_yes_no = num_diabetes_yes_no/num_heart_disease_no # P(diabetes = yes | heart_disease = no)
diabetes_no_no = num_diabetes_no_no/num_heart_disease_no # P(diabetes = no | heart_disease = no)

left_ventricular_hypertrophy_yes_yes = num_left_ventricular_hypertrophy_yes_yes/num_heart_disease_yes # P(left_ventricular_hypertrophy = yes | heart_disease = yes)
left_ventricular_hypertrophy_no_yes = num_left_ventricular_hypertrophy_no_yes/num_heart_disease_yes # P(left_ventricular_hypertrophy = no | heart_disease = yes)
left_ventricular_hypertrophy_yes_no = num_left_ventricular_hypertrophy_yes_no/num_heart_disease_no # P(left_ventricular_hypertrophy = yes | heart_disease = no)
left_ventricular_hypertrophy_no_no = num_left_ventricular_hypertrophy_no_no/num_heart_disease_no # P(left_ventricular_hypertrophy = no | heart_disease = no)

ST_slope_anomaly_yes_yes = num_ST_slope_anomaly_yes_yes/num_heart_disease_yes # P(ST_slope_anomaly = yes | heart_disease = yes)
ST_slope_anomaly_no_yes = num_ST_slope_anomaly_no_yes/num_heart_disease_yes # P(ST_slope_anomaly = no | heart_disease = yes)
ST_slope_anomaly_yes_no = num_ST_slope_anomaly_yes_no/num_heart_disease_no # P(ST_slope_anomaly = yes | heart_disease = no)
ST_slope_anomaly_no_no = num_ST_slope_anomaly_no_no/num_heart_disease_no # P(ST_slope_anomaly = no | heart_disease = no)

myocardial_defect_yes_yes = num_myocardial_defect_yes_yes/num_heart_disease_yes # P(myocardial_defect = yes | heart_disease = yes)
myocardial_defect_no_yes = num_myocardial_defect_no_yes/num_heart_disease_yes # P(myocardial_defect = no | heart_disease = yes)
myocardial_defect_yes_no = num_myocardial_defect_yes_no/num_heart_disease_no # P(myocardial_defect = yes | heart_disease = no)
myocardial_defect_no_no = num_myocardial_defect_no_no/num_heart_disease_no # P(myocardial_defect = no | heart_disease = no)
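
The same counting can be written more compactly with boolean masks. The following is a sketch on a made-up feature column, not a replacement for the cell above; the function name `conditional_frequencies` and the toy arrays are assumptions introduced for illustration.

```python
import numpy as np

def conditional_frequencies(feature_column, labels, num_values):
    """Return a (2, num_values) array whose entry [c, v] is the relative
    frequency of feature value v among the samples with label c."""
    table = np.zeros((2, num_values))
    for c in (0, 1):
        in_class = feature_column[labels == c]  # feature values of samples with label c
        for v in range(num_values):
            table[c, v] = np.sum(in_class == v) / in_class.shape[0]
    return table

# Made-up data: one 3-valued feature and a binary label for 8 samples.
toy_feature = np.array([2, 1, 0, 2, 2, 0, 1, 0])
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

table = conditional_frequencies(toy_feature, toy_labels, num_values=3)
print(table[1])  # frequencies of values 0, 1, 2 given label 1
print(table[0])  # frequencies of values 0, 1, 2 given label 0
```

Each row of the table sums to 1, which is a quick sanity check that also applies to the hand-written likelihoods. One difference from the cell above: this sketch counts exact matches only, whereas the `else` branches in the loop assign every remaining value to the "normal" or "no" bucket.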

## Task 2: Prediction
Now that we have the prior probabilities of heart disease and the likelihoods of each feature, the next step is to use the *Naive Bayes Classifier* to predict the target labels of *heart_disease_test_dataset.csv*.
To avoid floating-point underflow, we will use the **sum-of-log-probabilities** version of the *Naive Bayes* formula: the predicted label is the one with the highest sum-of-log-probabilities score.

$B_{NB} = \arg\max_{B_i}\left(\log P(B_i) + \sum_{n=1}^{d} \log P(e_n \mid B_i)\right)$
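
The underflow concern can be demonstrated with a short sketch. The 400 terms of 0.1 are hypothetical and far more than this lab's 7 factors (1 prior and 6 likelihoods); they are chosen only to make the effect visible in 64-bit floats.

```python
import numpy as np

# 400 hypothetical likelihood terms of 0.1 each.
terms = np.full(400, 0.1)

direct_product = np.prod(terms)    # true value 1e-400 is below the float64 range
log_score = np.sum(np.log(terms))  # 400 * ln(0.1), an ordinary finite number

print(direct_product)  # 0.0
print(log_score)
```

Once the product collapses to 0.0 for every label, the labels can no longer be ranked, whereas the log scores remain comparable.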

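Note: The choice of log base does not matter. By the change-of-base formula, $\log_a x = \frac{\log_b x}{\log_b a}$, switching between bases greater than 1 multiplies every score by the same positive constant, so the argmax is unchanged.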
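
As a quick check of this note, the sketch below scores one made-up sample in three different bases. The factor values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical prior and likelihood factors for one sample;
# the first row is label 0 and the second row is label 1.
factors = np.array([
    [0.5, 0.4, 0.7],
    [0.5, 0.5, 0.1],
])

scores_ln = np.log(factors).sum(axis=1)       # natural log
scores_log2 = np.log2(factors).sum(axis=1)    # base 2
scores_log10 = np.log10(factors).sum(axis=1)  # base 10

# All three bases pick the same label.
print(np.argmax(scores_ln), np.argmax(scores_log2), np.argmax(scores_log10))
```

The base-2 scores equal the natural-log scores divided by $\ln 2$, which is the change-of-base formula applied to every term.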

In [5]:
test_features = test[:, :-1] # All except the last column.
test_labels = test[:, -1] # Only the last column.
predict_labels = np.zeros_like(test_labels) # Create a numpy array of zeros with the same shape as test_labels. 

log_heart_disease_yes = np.log(heart_disease_yes) # log_e of P(heart_disease = yes)
log_heart_disease_no = np.log(heart_disease_no) # log_e of P(heart_disease = no)

log_resting_blood_pressure_hypertension_yes = np.log(resting_blood_pressure_hypertension_yes) # log_e of P(resting_blood_pressure = hypertension | heart_disease = yes)
log_resting_blood_pressure_elevated_yes = np.log(resting_blood_pressure_elevated_yes) # log_e of P(resting_blood_pressure = elevated | heart_disease = yes)
log_resting_blood_pressure_normal_yes = np.log(resting_blood_pressure_normal_yes) # log_e of P(resting_blood_pressure = normal | heart_disease = yes)
log_resting_blood_pressure_hypertension_no = np.log(resting_blood_pressure_hypertension_no) # log_e of P(resting_blood_pressure = hypertension | heart_disease = no)
log_resting_blood_pressure_elevated_no = np.log(resting_blood_pressure_elevated_no) # log_e of P(resting_blood_pressure = elevated | heart_disease = no)
log_resting_blood_pressure_normal_no = np.log(resting_blood_pressure_normal_no) # log_e of P(resting_blood_pressure = normal | heart_disease = no)

log_serum_cholesterol_highrisk_yes = np.log(serum_cholesterol_highrisk_yes) # log_e of P(serum_cholesterol = high-risk | heart_disease = yes)
log_serum_cholesterol_mildrisk_yes = np.log(serum_cholesterol_mildrisk_yes) # log_e of P(serum_cholesterol = mild-risk | heart_disease = yes)
log_serum_cholesterol_normal_yes = np.log(serum_cholesterol_normal_yes) # log_e of P(serum_cholesterol = normal | heart_disease = yes)
log_serum_cholesterol_highrisk_no = np.log(serum_cholesterol_highrisk_no) # log_e of P(serum_cholesterol = high-risk | heart_disease = no)
log_serum_cholesterol_mildrisk_no = np.log(serum_cholesterol_mildrisk_no) # log_e of P(serum_cholesterol = mild-risk | heart_disease = no)
log_serum_cholesterol_normal_no = np.log(serum_cholesterol_normal_no) # log_e of P(serum_cholesterol = normal | heart_disease = no)

log_diabetes_yes_yes = np.log(diabetes_yes_yes) # log_e of P(diabetes = yes | heart_disease = yes)
log_diabetes_no_yes = np.log(diabetes_no_yes) # log_e of P(diabetes = no | heart_disease = yes)
log_diabetes_yes_no = np.log(diabetes_yes_no) # log_e of P(diabetes = yes | heart_disease = no)
log_diabetes_no_no = np.log(diabetes_no_no) # log_e of P(diabetes = no | heart_disease = no)

log_left_ventricular_hypertrophy_yes_yes = np.log(left_ventricular_hypertrophy_yes_yes) # log_e of P(left_ventricular_hypertrophy = yes | heart_disease = yes)
log_left_ventricular_hypertrophy_no_yes = np.log(left_ventricular_hypertrophy_no_yes) # log_e of P(left_ventricular_hypertrophy = no | heart_disease = yes)
log_left_ventricular_hypertrophy_yes_no = np.log(left_ventricular_hypertrophy_yes_no) # log_e of P(left_ventricular_hypertrophy = yes | heart_disease = no)
log_left_ventricular_hypertrophy_no_no = np.log(left_ventricular_hypertrophy_no_no) # log_e of P(left_ventricular_hypertrophy = no | heart_disease = no)

log_ST_slope_anomaly_yes_yes = np.log(ST_slope_anomaly_yes_yes) # log_e of P(ST_slope_anomaly = yes | heart_disease = yes)
log_ST_slope_anomaly_no_yes = np.log(ST_slope_anomaly_no_yes) # log_e of P(ST_slope_anomaly = no | heart_disease = yes)
log_ST_slope_anomaly_yes_no = np.log(ST_slope_anomaly_yes_no) # log_e of P(ST_slope_anomaly = yes | heart_disease = no)
log_ST_slope_anomaly_no_no = np.log(ST_slope_anomaly_no_no) # log_e of P(ST_slope_anomaly = no | heart_disease = no)

log_myocardial_defect_yes_yes = np.log(myocardial_defect_yes_yes) # log_e of P(myocardial_defect = yes | heart_disease = yes)
log_myocardial_defect_no_yes = np.log(myocardial_defect_no_yes) # log_e of P(myocardial_defect = no | heart_disease = yes)
log_myocardial_defect_yes_no = np.log(myocardial_defect_yes_no) # log_e of P(myocardial_defect = yes | heart_disease = no)
log_myocardial_defect_no_no = np.log(myocardial_defect_no_no) # log_e of P(myocardial_defect = no | heart_disease = no)

for row in range(test_features.shape[0]):
  predict_yes = log_heart_disease_yes # log_e of P(heart_disease = yes)
  predict_no = log_heart_disease_no # log_e of P(heart_disease = no)

  if test_features[row, 0] == 2:
    predict_yes += log_resting_blood_pressure_hypertension_yes # log_e of P(resting_blood_pressure = hypertension | heart_disease = yes)
    predict_no += log_resting_blood_pressure_hypertension_no # log_e of P(resting_blood_pressure = hypertension | heart_disease = no)
  elif test_features[row, 0] == 1:
    predict_yes += log_resting_blood_pressure_elevated_yes # log_e of P(resting_blood_pressure = elevated | heart_disease = yes)
    predict_no += log_resting_blood_pressure_elevated_no # log_e of P(resting_blood_pressure = elevated | heart_disease = no)
  else:
    predict_yes += log_resting_blood_pressure_normal_yes # log_e of P(resting_blood_pressure = normal | heart_disease = yes)
    predict_no += log_resting_blood_pressure_normal_no # log_e of P(resting_blood_pressure = normal | heart_disease = no)

  if test_features[row, 1] == 2:
    predict_yes += log_serum_cholesterol_highrisk_yes # log_e of P(serum_cholesterol = high-risk | heart_disease = yes)
    predict_no += log_serum_cholesterol_highrisk_no # log_e of P(serum_cholesterol = high-risk | heart_disease = no)
  elif test_features[row, 1] == 1:
    predict_yes += log_serum_cholesterol_mildrisk_yes # log_e of P(serum_cholesterol = mild-risk | heart_disease = yes)
    predict_no += log_serum_cholesterol_mildrisk_no # log_e of P(serum_cholesterol = mild-risk | heart_disease = no)
  else:
    predict_yes += log_serum_cholesterol_normal_yes # log_e of P(serum_cholesterol = normal | heart_disease = yes)
    predict_no += log_serum_cholesterol_normal_no # log_e of P(serum_cholesterol = normal | heart_disease = no)

  if test_features[row, 2] == 1:
    predict_yes += log_diabetes_yes_yes # log_e of P(diabetes = yes | heart_disease = yes)
    predict_no += log_diabetes_yes_no # log_e of P(diabetes = yes | heart_disease = no)
  else:
    predict_yes += log_diabetes_no_yes # log_e of P(diabetes = no | heart_disease = yes)
    predict_no += log_diabetes_no_no # log_e of P(diabetes = no | heart_disease = no)

  if test_features[row, 3] == 1:
    predict_yes += log_left_ventricular_hypertrophy_yes_yes # log_e of P(left_ventricular_hypertrophy = yes | heart_disease = yes)
    predict_no += log_left_ventricular_hypertrophy_yes_no # log_e of P(left_ventricular_hypertrophy = yes | heart_disease = no)
  else:
    predict_yes += log_left_ventricular_hypertrophy_no_yes # log_e of P(left_ventricular_hypertrophy = no | heart_disease = yes)
    predict_no += log_left_ventricular_hypertrophy_no_no # log_e of P(left_ventricular_hypertrophy = no | heart_disease = no)

  if test_features[row, 4] == 1:
    predict_yes += log_ST_slope_anomaly_yes_yes # log_e of P(ST_slope_anomaly = yes | heart_disease = yes)
    predict_no += log_ST_slope_anomaly_yes_no # log_e of P(ST_slope_anomaly = yes | heart_disease = no)
  else:
    predict_yes += log_ST_slope_anomaly_no_yes # log_e of P(ST_slope_anomaly = no | heart_disease = yes)
    predict_no += log_ST_slope_anomaly_no_no # log_e of P(ST_slope_anomaly = no | heart_disease = no)

  if test_features[row, 5] == 1:
    predict_yes += log_myocardial_defect_yes_yes # log_e of P(myocardial_defect = yes | heart_disease = yes)
    predict_no += log_myocardial_defect_yes_no # log_e of P(myocardial_defect = yes | heart_disease = no)
  else:
    predict_yes += log_myocardial_defect_no_yes # log_e of P(myocardial_defect = no | heart_disease = yes)
    predict_no += log_myocardial_defect_no_no # log_e of P(myocardial_defect = no | heart_disease = no)

  # The predicted label has the highest sum-of-log-probabilities score.
  if predict_yes > predict_no:
    predict_labels[row] = 1
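
For comparison, the per-row `if`/`else` chain can be replaced by table lookups. This is a sketch under stated assumptions: the prior, the two log-likelihood tables, and the function name `predict` are made up for illustration and only cover two features rather than the lab's six.

```python
import numpy as np

# Hypothetical log-prior: index 0 is label 0, index 1 is label 1.
log_prior = np.log(np.array([0.5, 0.5]))

# Hypothetical tables: log_tables[f][c, v] = log P(feature f = v | label c).
log_tables = [
    np.log(np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])),  # a 3-valued feature
    np.log(np.array([[0.8, 0.2], [0.4, 0.6]])),            # a binary feature
]

def predict(features):
    """Return the label with the highest sum-of-log-probabilities score per row."""
    features = features.astype(int)  # loadtxt yields floats; indices must be ints
    scores = np.tile(log_prior, (features.shape[0], 1))  # shape (n_samples, 2)
    for f, table in enumerate(log_tables):
        # Look up the log-likelihood of each sample's value under both labels.
        scores += table[:, features[:, f]].T
    return np.argmax(scores, axis=1)

toy_features = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 0.0]])
print(predict(toy_features))  # [0 1 0]
```

On a tie, `np.argmax` returns the first maximum, which is label 0; this matches the strict `predict_yes > predict_no` comparison in the cell above.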

## Test Accuracy
Now we compare our predictions with the actual target labels of *heart_disease_test_dataset.csv*. At the very least, we should achieve an accuracy significantly higher than 50%, which is what guessing all 0s or all 1s scores here because the 50 test labels are evenly split.

In [6]:
num_match = 0
for i in range(predict_labels.shape[0]):
  if predict_labels[i] == test_labels[i]:
    num_match += 1
accuracy_score = num_match/predict_labels.shape[0]

print(accuracy_score)
print(predict_labels)
print(test_labels)

0.74
[1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 0.
 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 1. 0.]
[1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0.
 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0.
 0. 0.]
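
The counting loop can also be expressed as the mean of an element-wise comparison. The sketch below re-enters the two arrays printed above under the hypothetical names `printed_predictions` and `printed_labels`, so it runs on its own and does not overwrite the notebook's variables.

```python
import numpy as np

# Predictions and actual labels copied from the printed output above.
printed_predictions = np.array([
    1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
    1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
    1, 0,
], dtype=float)
printed_labels = np.array([
    1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
    1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
    0, 0,
], dtype=float)

# The comparison yields True/False per sample; the mean is the fraction of matches.
accuracy = np.mean(printed_predictions == printed_labels)
all_ones_baseline = np.mean(printed_labels == 1)

print(accuracy)           # 0.74, matching the printed accuracy_score
print(all_ones_baseline)  # 0.5
```

In the notebook itself, the equivalent one-liner would be `np.mean(predict_labels == test_labels)`.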


# Unmount Google Drive

In [7]:
drive.flush_and_unmount()