COMP 2211 Lab 3: K-Nearest Neighbors

COMP 2211 Exploring Artificial Intelligence

Lab 3 K-Nearest Neighbors

Review

This part of the lab is a review of K-Nearest Neighbors (KNN). It aims to refresh your memory of what you have learned in class.

K-Nearest Neighbors

Computation Steps
Standardization
Outlier
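As a quick refresher, the steps above can be sketched in a few lines of numpy. This is only a minimal sketch, assuming Euclidean distance and a simple majority vote; the function name knn_predict and the toy data are made up for illustration and are not taken from the lab notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict a label for each test row by majority vote among its k nearest training rows."""
    # Standardize every attribute using statistics from the training set only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant attributes
    X_train_s = (X_train - mean) / std
    X_test_s = (X_test - mean) / std

    predictions = []
    for x in X_test_s:
        # Euclidean distance from this test point to every training point
        dists = np.sqrt(((X_train_s - x) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # Majority vote (on a tie, the smallest label wins)
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (made up): columns are [age, hours per week]
X_train = np.array([[25, 20], [30, 25], [28, 22],
                    [50, 60], [55, 65], [48, 58]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[27, 21], [52, 62]], dtype=float)

y_pred = knn_predict(X_train, y_train, X_test, k=3)
print(y_pred)
```

Note that the test points are standardized with the mean and standard deviation of the training set, so both sets end up on the same scale.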

Please download the notebook by right-clicking the link and selecting "Save link as", then open it using Google Colab. You should see the following if you open the notebook successfully.

End of Review

Introduction

People are always striving to make money for a living. The income status of a person can be impactful.
http://archive.ics.uci.edu/ml/datasets/Adult

In this lab, we are going to use a fraction of the 1994 Census database to build a K-Nearest Neighbors classifier for predicting whether a person makes over $50K a year.

End of Introduction

Lab Work

Several lab tasks are given for you to practice your skills in processing data and building an AI model using KNN. Please download the notebook, the dataset to be used, and the prediction file for answer checking, then open the notebook using Google Colab. You should see the following if you open the notebook successfully.

End of Lab Work

Submission & Grading

This is an odd-numbered lab, so there is no need to submit anything. Have fun playing with the notebooks! ;)

End of Submission & Grading

Frequently Asked Questions

UPDATES on lab3_tasks.ipynb (please download the latest version):
- Error when using the metrics module:
  add the line: from sklearn import metrics
- Self-checking of the predicted result:
  change the line to: TA_y_pred = pd.read_csv('y_pred.csv', header=None).to_numpy()
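To see why header=None matters, here is a small sketch. It assumes the prediction file has one value per line and no header row; the values, the in-memory stand-in for y_pred.csv, and the y_pred array are all made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for y_pred.csv: one prediction per line, no header row (made-up values)
csv_text = "0\n1\n1\n0\n"

# Default behaviour: the first line is consumed as the column name
with_header = pd.read_csv(io.StringIO(csv_text))
# header=None: every line is kept as data
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(with_header), len(without_header))

TA_y_pred = without_header.to_numpy()  # 2-D array with a single column
y_pred = np.array([0, 1, 0, 0])        # hypothetical predictions from your own model

# Flatten before comparing, so a column vector and a 1-D array
# are compared element by element instead of being broadcast
agreement = np.mean(TA_y_pred.ravel() == y_pred)
print(agreement)
```

Without header=None, the first prediction is silently dropped and the two arrays no longer have the same length.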

Q: Do we perform standardization on all attributes or just the attributes with large values?
A: We perform it on all attributes in order to transform them onto a comparable scale, so that all attributes can contribute equally to the model prediction. If we only standardized the attributes with large ranges, they would still be likely to have scales different from the unstandardized ones.
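A small numpy sketch of this point, using made-up rows (the column meanings are only for illustration): standardizing every column gives each one mean 0 and standard deviation 1, while standardizing only the large-range column leaves the others on their original scales.

```python
import numpy as np

# Made-up rows: [age, capital gain, hours per week]
X = np.array([[25,     0, 40],
              [38,  5000, 50],
              [52, 15000, 45],
              [45,     0, 60]], dtype=float)

# Standardize all attributes: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # each column is centred at 0
print(X_std.std(axis=0).round(6))   # each column has unit spread

# Standardize only the large-range attribute (column 1)
X_partial = X.copy()
X_partial[:, 1] = X_std[:, 1]
# Columns 0 and 2 keep their original, larger spread
print(X_partial.std(axis=0).round(2))
```

In the partially standardized matrix, age and hours per week now dominate the distance instead of capital gain, so the attributes are still not comparable.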

Q: lab3_tasks.ipynb states that "the attributes with text values need to be converted into float type to create the vector representation". May I confirm whether we need to transform the datatype in the pandas dataframe from str to float or to int?
A: To int during the transformation on the dataframe, and to float in the later numpy operations. Eventually we need to convert the data into a float-type matrix, but for the encoding of str values in the pandas dataframe itself, the int datatype is expected.
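The two steps might look like the sketch below. It uses pandas category codes as one possible way to get integer codes, which is not necessarily the method the notebook expects; the dataframe contents are made up.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one numeric and one text attribute
df = pd.DataFrame({
    "age": [25, 38, 52],
    "marital-status": ["Never-married", "Divorced", "Married-civ-spouse"],
})

# Step 1 (pandas): encode each distinct string as an integer code.
# Categories are ordered alphabetically, so Divorced -> 0,
# Married-civ-spouse -> 1, Never-married -> 2.
df["marital-status"] = df["marital-status"].astype("category").cat.codes
print(df.dtypes)

# Step 2 (numpy): convert the whole table into a float matrix
X = df.to_numpy(dtype=float)
print(X.dtype, X.shape)
```

After step 1 the column holds integers; only the final matrix used for the distance computation is float.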

Q: Can you give us some basic introduction to pandas?
A: I will give an introduction to some basic pandas functions during Lab 3. Please come to the tutorials if you wish to know more :)

Q: Regarding encoding the categorical attributes into integers, isn't the distance between the encoded values inconsistent with the attribute values themselves? E.g. for the marital status attribute, never-married (encoded to 0), divorced (encoded to 1) and married-civ-spouse (encoded to 2) don't have a consistent linear relationship with their encoded values.
A: Using LabelEncoder directly does indeed have that limitation. You can also try binarising followed by standardisation, but that creates many more new attributes, which leads the original single attribute to contribute with a larger weight to the model prediction.
You can read more at:
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets
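The trade-off can be seen in a small sketch. It assumes pandas category codes for the integer encoding and pd.get_dummies for the binarised (one-hot) version; neither is necessarily what the notebook uses, and the integer codes here follow alphabetical order rather than the 0/1/2 mapping in the question.

```python
import numpy as np
import pandas as pd

status = pd.Series(["Never-married", "Divorced", "Married-civ-spouse"],
                   name="marital-status")

# Integer encoding: one column, but the gaps between codes are arbitrary
codes = status.astype("category").cat.codes.to_numpy(dtype=float)
label_dist = np.abs(codes[:, None] - codes[None, :])
print(label_dist)

# One-hot encoding: one new column per category
one_hot = pd.get_dummies(status).to_numpy(dtype=float)
diff = one_hot[:, None, :] - one_hot[None, :, :]
one_hot_dist = np.sqrt((diff ** 2).sum(axis=2))
print(one_hot.shape)
print(one_hot_dist)
```

With integer codes, some pairs of categories are twice as far apart as others for no meaningful reason. With one-hot columns, every pair of different categories is equally far apart, but the single attribute has turned into three.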

This list is incomplete; you can help by expanding it.

End of Frequently Asked Questions. Don't hesitate to ask ;)

Page maintained by

Ms. Chung Tsz Ting
Email: ttchungac@connect.ust.hk