This section is meant to provide a discussion on the kth Nearest Neighbor (kNN) algorithm and clustering using K-means. Python version for kNN is discussed in the video and instructions for both Java and Python are mentioned in the slides. Plotviz is used for generating 3D visualizations.
We discuss simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concept of training and testing sets are introduced with training set pre-labelled.
Todo
The slides or videos are going to be updated
Slides (23 pages): https://iu.app.box.com/s/i9et3dxnhr3qt5gn14bg
Files:
kNN.py
kNN_Driver.py
DatingTesting2.txt
clusterFinal-M3-C3Dating-ReClustered.pviz
DatingRating-OriginalLabels.pviz
clusterFinal-M30-C28.pviz
This lesson considers the Python k Nearest Neighbor code found on the web associated with a book by Harrington on Machine Learning. There are two data sets. First we consider a set of 4 2D vectors divided into two categories (clusters) and use k=3 Nearest Neighbor algorithm to classify 3 test points. Second we consider a 3D dataset that has already been classified and show how to normalize. In this lesson we just use Matplotlib to give 2D plots.
Todo
The slides or videos are going to be updated
The lesson modifies the online code to allow it to produce files readable by PlotViz. We visualize already classified 3D set and rotate in 3D.
Todo
The slides or videos are going to be updated
Video: 7:41: Recommender System IV 3: Visualization: https://youtu.be/fLtH-ZI1Jqk
The lesson goes through an example of using k NN classification algorithm by dividing dataset into 2 subsets. One is training set with initial classification; the other is test point to be classified by k=3 NN using training set. The code records fraction of points with a different classification from that input. One can experiment with different sizes of the two subsets. The Python implementation of algorithm is analyzed in detail.
Todo
The slides or videos are going to be updated
Video: 10:06: Recommender System IV 4: Testing k’th Nearest Neighbor Algorithms: https://youtu.be/zLaPGMIQ9So
We use example of recommender system to discuss clustering. The details of methods are not discussed but k-means based clustering methods are used and their results examined in Plotviz. The original labelling is compared to clustering results and extension to 28 clusters given. General issues in clustering are discussed including local optima, the use of annealing to avoid this and value of heuristic algorithms.
Todo
The slides or videos are going to be updated
Slides (35 slides): https://iu.app.box.com/s/70qn6d61oln9b50jqobl
Files:
Fungi_LSU_3_15_to_3_26_zeroidx.pviz
DatingRating-OriginalLabels.pviz
clusterFinal-M30-C28.pviz
clusterFinal-M3-C3Dating-ReClustered.pviz
We introduce the k means algorithm in a gentle fashion and describes its key features including dangers of local minima. A simple example from Wikipedia is examined.
Todo
The slides or videos are going to be updated
Video: 9:06: Kmeans Clustering: https://youtu.be/3KTNJ0Okrqs
Plotviz is used to examine and compare the original classification with an ‘’optimal’’ clustering into 3 clusters using a fancy deterministic annealing method that is similar to k means. The new clustering has centers marked.
Todo
The slides or videos are going to be updated
Video: 9:00: Clustering of Recommender System Example: https://youtu.be/yl_KZ86NT-A
The previous division into 3 clusters is compared into a clustering into 28 separate clusters that are naturally smaller in size and divide 3D space covered by 1000 points into compact geometrically local regions.
Todo
The slides or videos are going to be updated
Video: 8:13: Clustering into more than 3 Clusters: https://youtu.be/JWZmh48l0cw
This lesson introduces some general principles. First many important processes are ‘’just’’ optimization problems. Most such problems are rife with local optima. The key idea behind annealing to avoid local optima is described. The pervasive greedy optimization method is described.
Todo
The slides or videos are going to be updated
Video: 9:23: Local Optima in Clustering: https://youtu.be/Zmq8O_axCmc
The two different applications of clustering are described. First find geometrically distinct regions and secondly divide spaces into geometrically compact regions that may have no ‘’thin air’’ between them. Generalizations such as mixture models and latent factor methods are just mentioned. The important distinction between applications in vector spaces and those where only inter-point distances are defined is described. Examples are then given using PlotViz from 2D clustering of a mass spectrometry example and the results of clustering genomic data mapped into 3D with Multi Dimensional Scaling MDS.
Todo
The slides or videos are going to be updated
Video: 8:33: Clustering in General: https://youtu.be/JejNZhBxjRU
Some remarks are given on heuristics; why are they so important why getting exact answers is often not so important?
Todo
The slides or videos are going to be updated
Video: 5:54: Heuristics: https://youtu.be/KT22YuX8ZMY