Heart disease is the leading cause of death globally. According to the World Health Organization, an estimated 17.9 million people died from cardiovascular diseases in 2019, representing 32% of all global deaths. Of these deaths, 85% were due to heart attack and stroke.
https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
In this assignment, you are going to use a small heart disease dataset to build a K-Nearest Neighbors Classifier for Heart Disease Detection. In addition, you are going to conduct D-fold Cross-Validation to find the optimal K for the dataset.
The following tasks are given to you to practice your skills in building an AI model using K-Nearest Neighbors.
Each test case listed below is worth 1%:
- standardize_dataset() correctly standardizes the input array.
- calculate_euclidean_distance() calculates correct distances for a small dataset.
- calculate_euclidean_distance() calculates correct distances for a large dataset.
- find_k_nearest_neighbor_labels() finds correct nearest neighbor labels for small odd-numbered k values.
- find_k_nearest_neighbor_labels() finds correct nearest neighbor labels for large odd-numbered k values.
- predict() returns the correct majority vote based on nearest neighbor labels for odd-numbered k values.
- generate_confusion_matrix() calculates correct True Positive, True Negative, False Positive, and False Negative values.
- calculate_accuracy_score() calculates the correct accuracy score given arbitrary y_predict and y_actual.
- calculate_MCC_score() calculates the correct MCC score given arbitrary y_predict and y_actual.
- generate_folds() generates the correct combination of training and testing folds for a small dataset and a small d value.
- generate_folds() generates the correct combination of training and testing folds for a large dataset and a large d value.
- cross_validate() calculates the correct validation scores for a small dataset and a small d value.
- cross_validate() calculates the correct validation scores for a large dataset and a large d value.
- validate_best_k() returns the correct best k from k_list based on the average of validation scores.
Deadline: 23:59:00 on 26 March 2022 (Saturday).
Create a single zip file that contains only the pa1_tasks.py file, NOT a folder containing it. Submit the zip file to ZINC. ZINC usage instructions can be found here.
Notes:
It is required that your submission can be run successfully in our online autograder ZINC. If we cannot even run your work, it won't be graded. Therefore, for parts that you cannot finish, just put in a dummy implementation so that your whole program can still run and ZINC can grade the other parts that you have done. Empty implementations can look like:
def SomeFunctionICannotFinishRightNow():
    return 0

def SomeFunctionICannotFinishRightNowButIWantOtherPartsGraded():
    pass
Make sure you actually upload the correct version of your source files - we only grade what you upload. Some students in the past submitted an empty file or a wrong file, which is worth zero marks, so you must double-check the file you have submitted.
There will be a penalty of -1 point (out of a maximum of 100 points) for every minute you are late. For instance, since the deadline of Assignment 1 is 23:59:00 on March 26th, if you submit your solution at 1:00:00 on March 27th, there will be a penalty of -61 points for your assignment. However, the lowest grade you may get for an assignment is zero: any negative score after the deduction of the late penalty (and any other penalties) will be reset to zero.
Q: My code doesn't work, there is an error/bug, here is the code, can you help me fix it?
A: As the assignment is a major course assessment, to be fair, you are expected to work on it on your own, and we should not finish the tasks for you. We are happy to help with explanations and advice, but we will not directly debug the code for you.
Q: Are we allowed to use external libraries (e.g., scikit-learn) to implement this assignment?
A: In this assignment, we will only be using NumPy, and you are NOT allowed to import extra external libraries (i.e., no scikit-learn).
The goal of this assignment is to implement kNN from scratch (we use NumPy only because Python doesn't have native support for arrays), and to give you hands-on experience of what is going on underneath scikit-learn.
Q: Are we allowed to use Python standard libraries (e.g., from collections import defaultdict)?
A: Yes, Python standard libraries are allowed. Please refer to the official Python 3 documentation for a comprehensive list of the modules included in the standard library.
Q: If ZINC says that I have achieved "Total Score 1/1", does that mean I have passed the assignment and obtained full marks?
A: No, not necessarily. We will regrade your submitted assignment file using another set of test cases, so you may get different marks if you do not pass some of the test cases during the re-grading performed after the submission deadline. Please check your code more thoroughly.
Q: Are there any test cases with verified output that we can use to check whether our code works?
A: Such test cases will be on ZINC by this Saturday (12 March).
However, they will only be simple test cases.
Detailed grading test cases will remain hidden and will only activate 100 minutes after the deadline.
Designing and performing your own test cases and debugging are part of the Intended Learning Outcomes of PA1.
We highly recommend comparing your code output with that of the equivalent scikit-learn modules (see the sketch after the links below).
The test cases revealed before the deadline will only verify that your code was successfully submitted and can run on ZINC, but will NOT cover any grading/correctness cases.
Refer to these links: KNeighborsClassifier, MCC, GridSearchCV
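For local sanity checks only (remember that scikit-learn must NOT be imported in the pa1_tasks.py you submit), a comparison might look like the sketch below; the toy data is made up purely to show the pattern.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef

# Toy data, invented purely for illustration.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.5, 1.5], [8.5, 8.5]])
y_test = np.array([0, 1])

# Reference predictions to compare against your own predict() output.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
reference_pred = clf.predict(X_test)

# Compare this reference MCC with your calculate_MCC_score().
print(reference_pred, matthews_corrcoef(y_test, reference_pred))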
Q: Divide by zero error during Matthews Correlation Coefficient (MCC) calculation
A: The MCC denominator will only be zero in two cases: when all the predicted labels are of one value, or when all the actual labels are of one value (i.e., all 0s or all 1s).
If all the actual labels are of one value, most probably you have set too large a D for D-Fold Cross-Validation and were unlucky enough to generate a test fold with all 0s or all 1s.
If all the predicted labels are of one value, then something is wrong with your KNNClassifier prediction code.
Of course, there might also be bugs elsewhere in your code.
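For reference, the MCC numerator is TP*TN - FP*FN and the denominator is sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), so the denominator vanishes exactly when one of those four marginal sums is zero. An illustrative counts-based sketch (not necessarily the exact zero-case behaviour PA1 expects) could look like:

import numpy as np

# Illustrative MCC from confusion-matrix counts; not the required PA1 code.
def mcc_from_counts(tp, tn, fp, fn):
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The denominator is zero exactly when all predicted labels or all
    # actual labels share a single value (one of the four sums is zero).
    if denominator == 0:
        return 0.0  # a common convention; check the PA1 spec for the required behaviour
    return numerator / denominator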
Q: 401 Unauthorised Error in PA1 Setup
A: Make sure that you are using the same username/password as the one you use to log in to the course website. To use Google Drive instead, you can copy the "Mount Google Drive" code from Lab 2.
Q: Where do we put the standardize_dataset() function?
A: standardize_dataset() is a standalone function. Figuring out when and where to use it is part of the Optional Task. Hint: Lecture Notes.
Q: Should we standardize categorical variables as well?
A: For simplicity, you are expected to standardize every feature column, including categorical and binary variables. This is so that you can use NumPy broadcasting and don't have to check each column individually.
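As an illustration of that broadcasting idea (an assumption-laden sketch, not necessarily the exact behaviour standardize_dataset() must implement), column-wise z-score standardization can be written as:

import numpy as np

# Illustrative column-wise z-score standardization via broadcasting.
# X is assumed to be a 2-D array of shape (n_samples, n_features).
def standardize(X):
    mean = X.mean(axis=0)      # per-column means, shape (n_features,)
    std = X.std(axis=0)        # per-column standard deviations
    # Note: a constant column would give std == 0 and divide by zero.
    return (X - mean) / std    # broadcasting applies them column by column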