Heart disease is the leading cause of death globally. According to the World Health Organization, an estimated 17.9 million people died from cardiovascular diseases in 2019, representing 32% of all global deaths. Of these deaths, 85% were due to heart attack and stroke.
https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
In this assignment, you are going to use a small heart disease dataset to build a K-Nearest Neighbors Classifier for Heart Disease Detection. In addition, you are going to conduct D-fold Cross-Validation to find the optimal K for the dataset.
The following tasks are given to you to practice your skills in building an AI model using K-Nearest Neighbors.
Each test case listed below is worth 1%:
- standardize_dataset() correctly standardizes the input array.
- calculate_euclidean_distance() calculates correct distances for a small dataset.
- calculate_euclidean_distance() calculates correct distances for a large dataset.
- find_k_nearest_neighbor_labels() finds correct nearest neighbor labels for small odd-numbered k values.
- find_k_nearest_neighbor_labels() finds correct nearest neighbor labels for large odd-numbered k values.
- predict() returns the correct majority vote based on nearest neighbor labels for odd-numbered k values.
- generate_confusion_matrix() calculates correct True Positive, True Negative, False Positive, and False Negative values.
- calculate_accuracy_score() calculates the correct accuracy score given arbitrary y_predict and y_actual.
- calculate_MCC_score() calculates the correct MCC score given arbitrary y_predict and y_actual.
- generate_folds() generates the correct combination of training and testing folds for a small dataset and a small d value.
- generate_folds() generates the correct combination of training and testing folds for a large dataset and a large d value.
- cross_validate() calculates the correct validation scores for a small dataset and a small d value.
- cross_validate() calculates the correct validation scores for a large dataset and a large d value.
- validate_best_k() returns the correct best k from k_list based on the average of validation scores.
Deadline: 23:59:00 on 26 March 2022 (Saturday).
Create a single zip file that contains only the pa1_tasks.py file, NOT a folder containing it. Submit the zip file to ZINC. ZINC usage instructions can be found here.
Notes:
It is required that your submission can be run successfully in our online autograder ZINC. If we cannot even run your work, it won't be graded. Therefore, for parts that you cannot finish, just put in a dummy implementation so that your whole program can still run and ZINC can grade the other parts that you have done. Empty implementations can look like:
def SomeFunctionICannotFinishRightNow():
    return 0

def SomeFunctionICannotFinishRightNowButIWantOtherPartsGraded():
    pass
Make sure you actually upload the correct version of your source files - we only grade what you upload. Some students in the past submitted an empty file or a wrong file, which is worth zero marks, so you must double-check the file you have submitted.
There will be a penalty of -1 point (out of a maximum of 100 points) for every minute you are late. For instance, since the deadline of Assignment 1 is 23:59:00 on March 26th, if you submit your solution at 1:00:00 on March 27th, there will be a penalty of -61 points for your assignment. However, the lowest grade you may get for an assignment is zero: any negative score after the deduction of the late penalty (and any other penalties) will be reset to zero.
Q: My code doesn't work, there is an error/bug, here is the code, can you help me fix it?
A: As the assignment is a major course assessment, to be fair, you are expected to work on it on your own, and we should not finish the tasks for you. We are happy to help with explanations and advice, but we will not directly debug the code for you.
Q: Are we allowed to use external libraries (e.g., scikit-learn) to implement this assignment?
A: In this assignment, we will only be using NumPy, and you are NOT allowed to import extra external libraries (i.e., no scikit-learn).
The goal of this assignment is to implement kNN from scratch (we use NumPy only because Python doesn't have native support for arrays), and to give you hands-on experience of what is going on underneath scikit-learn.
Q: Are we allowed to use Python standard libraries (e.g., from collections import defaultdict)?
A: Yes, Python standard libraries are allowed. Please refer to the official Python 3 documentation for a comprehensive list of the modules included in the standard library.
Q: If ZINC says that I have achieved "Total Score 1/1", does that mean I have passed the assignment and obtained full marks?
A: No, not necessarily. We will regrade your submitted assignment file using another set of test cases, so you may get different marks if you do not pass some of the test cases during the re-grading performed after the submission deadline. Please check your code more thoroughly.
Q: Are there any test cases with verified output that we can use to check whether our code works?
A: Such test cases will be on ZINC by this Saturday (12 March).
However, they will only be simple test cases.
Detailed grading test cases will remain hidden and will only activate 100 minutes after the deadline.
Designing and performing your own test cases and debugging are part of the Intended Learning Outcomes of PA1.
We highly recommend comparing your code output with that of the equivalent scikit-learn modules (see the sketch after the links below).
The test cases revealed before the deadline will only verify that your code was successfully submitted and can run on ZINC, but will NOT cover any grading/correctness cases.
Refer to these links: KNeighborsClassifier, MCC, GridSearchCV
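For local sanity checks only (remember that scikit-learn must NOT be imported in the pa1_tasks.py you submit), a comparison might look like the sketch below; the toy data is made up purely to show the pattern.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef

# Toy data, invented purely for illustration.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.5, 1.5], [8.5, 8.5]])
y_test = np.array([0, 1])

# Reference predictions to compare against your own predict() output.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
reference_pred = clf.predict(X_test)

# Compare this reference MCC with your calculate_MCC_score().
print(reference_pred, matthews_corrcoef(y_test, reference_pred))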
Q: Divide by zero error during Matthews Correlation Coefficient (MCC) calculation
A: The MCC denominator will only be zero in two cases: when all the predicted labels are of one value, or when all the actual labels are of one value (i.e., all 0s or all 1s).
If all the actual labels are of one value, most probably you have set too large a D for D-Fold Cross-Validation and were unlucky enough to generate a test fold with all 0s or all 1s.
If all the predicted labels are of one value, then something is wrong with your KNNClassifier prediction code.
Of course, there might also be bugs elsewhere in your code.
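For reference, the MCC numerator is TP*TN - FP*FN and the denominator is sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), so the denominator vanishes exactly when one of those four marginal sums is zero. An illustrative counts-based sketch (not necessarily the exact zero-case behaviour PA1 expects) could look like:

import numpy as np

# Illustrative MCC from confusion-matrix counts; not the required PA1 code.
def mcc_from_counts(tp, tn, fp, fn):
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The denominator is zero exactly when all predicted labels or all
    # actual labels share a single value (one of the four sums is zero).
    if denominator == 0:
        return 0.0  # a common convention; check the PA1 spec for the required behaviour
    return numerator / denominator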
Q: 401 Unauthorised Error in PA1 Setup
A: Make sure that you are using the same username/password as the one you use to log in to the course website. To use Google Drive instead, you can copy the "Mount Google Drive" code from Lab 2.
Q: Where do we put the standardize_dataset() function?
A: standardize_dataset() is a standalone function. Figuring out when and where to use it is part of the Optional Task. Hint: Lecture Notes.
Q: Should we standardize categorical variables as well?
A: For simplicity, you are expected to standardize every feature column, including categorical and binary variables. This is so that you can use NumPy broadcasting and don't have to check each column individually.
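As an illustration of that broadcasting idea (an assumption-laden sketch, not necessarily the exact behaviour standardize_dataset() must implement), column-wise z-score standardization can be written as:

import numpy as np

# Illustrative column-wise z-score standardization via broadcasting.
# X is assumed to be a 2-D array of shape (n_samples, n_features).
def standardize(X):
    mean = X.mean(axis=0)      # per-column means, shape (n_features,)
    std = X.std(axis=0)        # per-column standard deviations
    # Note: a constant column would give std == 0 and divide by zero.
    return (X - mean) / std    # broadcasting applies them column by column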