We use the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and can sometimes make different choices among comparable solutions. We discuss the "hill" between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the parallel maps run sequentially. This simple 2-map version can be generalized to scalable parallelism. Python is used to calculate PageRank from a web linkage matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit concludes with a calculation of PageRank for general web pages by extracting the secret from Google.
We use the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and can sometimes make different choices among comparable solutions. We discuss the "hill" between different solutions and the rationale for running K-means many times and choosing the best answer.
Todo
The slides or videos are going to be updated
Slides (47 pages): https://iu.app.box.com/s/ltgbehfjwvgh40l5d3w8
Files:
We use the K-means Python code in the SciPy package to show real code for clustering and apply it to a set of 85 two-dimensional vectors, officially sets of weights and heights to be clustered to find T-shirt sizes. We run through Python code with Matplotlib displays to divide the data into 2-5 clusters. Then we discuss Python code that generates 4 clusters of varying sizes centered at the corners of a square in two dimensions. We give a more formal statement of the K-means algorithm than before and make the definition consistent with the code in SciPy.
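The K-means iteration discussed in this lesson can be sketched in plain Python as below. This is a minimal illustration of the usual assign-then-recompute loop, not the SciPy implementation used in the videos. The function names, the random initialization from data points, and the weight/height values are all assumptions made up for the example; they are not the 85 vectors from the lesson.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    # Index of the center closest to point p.
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def centroid(members):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(c) / len(members) for c in zip(*members))

def kmeans(points, k, n_iter=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:  # no center moved, so we have converged
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels

def sum_sq_error(points, centers):
    # One possible quality measure: total squared distance to the nearest center.
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

# Hypothetical (weight in kg, height in cm) data: two made-up groups, 85 points.
rng = random.Random(1)
data = [(rng.gauss(55, 3), rng.gauss(160, 3)) for _ in range(40)]
data += [(rng.gauss(85, 3), rng.gauss(185, 3)) for _ in range(45)]

centers, labels = kmeans(data, k=2)
for c in sorted(centers):
    print("center: weight %.1f, height %.1f" % c)
print("sum of squared errors: %.1f" % sum_sq_error(data, centers))
```

Changing `k` from 2 up to 5 gives the same kind of experiment as the T-shirt sizing in the lesson, although the actual numbers there come from the lesson's own data.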
Todo
The slides or videos are going to be updated
Video: 11:42: Kmeans I: https://youtu.be/I79ISV6XBbE
We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster sizes: small, large, and very large. We emphasize that SciPy always does 20 independent K-means runs and takes the best result, an approach to avoiding local minima. We allow this number of independent runs to be changed, and in particular set it to 1 to generate more interesting, erratic results. We define changes in our new K-means code, which also allows two measures of quality. The slides give many results of clustering into 2, 4, 6, and 8 clusters (there were only 4 real clusters). We show that the very small case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and a hill between them. The lesson has both discussion of already produced results in the slides and interactive use of Python for new runs.
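The idea of running K-means several times and keeping the best answer can be sketched as follows. This is a plain-Python illustration, not the modified SciPy code from the lesson. In SciPy itself the number of runs is, to our knowledge, the `iter` argument of `scipy.cluster.vq.kmeans`. The helper names, the cluster spread, and the use of 400 points instead of 1000 (to keep the run short) are assumptions for this example.

```python
import random

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def sum_sq_error(points, centers):
    # Quality measure: total squared distance from each point to its nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_once(points, k, seed, n_iter=30):
    # A single K-means run from a random start chosen by `seed`.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        new_centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[j]  # keep the old center if a cluster is empty
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def best_of(points, k, n_runs, first_seed=0):
    # Run K-means n_runs times and keep the solution with the lowest error.
    best = None
    for s in range(first_seed, first_seed + n_runs):
        centers = kmeans_once(points, k, seed=s)
        err = sum_sq_error(points, centers)
        if best is None or err < best[0]:
            best = (err, centers)
    return best

# Hypothetical data: 4 clusters centered at the corners of a square.
rng = random.Random(42)
corners = [(0, 0), (0, 10), (10, 0), (10, 10)]
data = [(rng.gauss(cx, 1.5), rng.gauss(cy, 1.5))
        for cx, cy in corners for _ in range(100)]

err_1, centers_1 = best_of(data, k=4, n_runs=1)
err_20, centers_20 = best_of(data, k=4, n_runs=20)
print("1 run:   error %.1f" % err_1)
print("20 runs: error %.1f" % err_20)
```

Because the 20 runs include the single run, the best-of-20 error can never be worse. A single run may settle in a local minimum on the wrong side of the hill, which is what makes the one-run case erratic.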
Todo
The slides or videos are going to be updated
Video 1: 11:54: Kmeans II: https://youtu.be/Srgq9VDg4C8
Video 2: 9:59: Kmeans III: https://youtu.be/rjyAXjA_mOk
Video 3: 8:38: Kmeans IV: https://youtu.be/N6QKyrhNVAc
We describe the basic architecture of MapReduce and give a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given.
Todo
The slides or videos are going to be updated
Slides (16 pages): https://iu.app.box.com/s/hqykdx1bquez7ers3d1j
This introduction uses an analogy to making fruit punch by slicing and blending fruit to illustrate MapReduce. The formal structure of MapReduce and Iterative MapReduce is presented with parallel data flowing from disks through multiple Map and Reduce phases to be inspected by the user.
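The fruit punch analogy can be turned into a small sketch of the map, shuffle, and reduce phases. This is a single-process illustration in plain Python, not a real MapReduce runtime. The function names and the baskets of fruit are made up for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the mapper to every input record and collect the (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # Group all values that share a key, as the runtime does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the reducer once per key to combine its values.
    return {key: reducer(key, values) for key, values in groups.items()}

def slice_fruit(basket):
    # Mapper: "slice" each basket into one (fruit, 1) pair per piece of fruit.
    return [(fruit, 1) for fruit in basket]

def blend(fruit, slices):
    # Reducer: "blend" all slices of one fruit into a single count.
    return sum(slices)

# Hypothetical input: each basket would live on a different disk or node.
baskets = [["apple", "orange"], ["orange", "mango", "orange"], ["apple"]]

punch = reduce_phase(shuffle(map_phase(baskets, slice_fruit)), blend)
print(punch)  # {'apple': 2, 'orange': 3, 'mango': 1}
```

In a real system each basket would be handled by a separate map task in parallel, and the shuffle would move data between machines. Here everything runs in one loop so the data flow is easy to follow.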
Todo
The slides or videos are going to be updated
Video: 9:46: MapReduce Introduction: https://youtu.be/67qFY64aj7g
This defines 4 types of MapReduce and the Map Collective model of Qiu. The Iterative MapReduce model from Indiana University called Twister is described and a few performance measurements on Microsoft Azure are presented.
Todo
The slides or videos are going to be updated
Video 1: 11:16: MapReduce II: https://youtu.be/lo4movzSyVw
Video 2: 9:13: MapReduce III: https://youtu.be/wnanWncQBow
We modify the SciPy K-means code to support a MapReduce execution style and run it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the parallel maps run sequentially. We stress that this simple 2-map version can be generalized to scalable parallelism.
Todo
The slides or videos are going to be updated
Slides (9 pages): https://iu.app.box.com/s/zc9pckhyehn0cog8wy19
Files:
We modify the SciPy K-means code to support a MapReduce execution style and run it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the parallel maps run sequentially. We stress that this simple 2-map version can be generalized to scalable parallelism.
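A K-means iteration in MapReduce style might look like the sketch below. It is written in plain Python for illustration and is not the modified SciPy code from the slides. The function names, the way the data is split into partitions, and the six sample points are assumptions. As in the lesson, the two maps run one after the other.

```python
def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def nearest(p, centers):
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans_map(partition, centers):
    # Map: for one partition of the data, accumulate per-cluster
    # partial sums [sum_x, sum_y, count] using the current centers.
    partial = [[0.0, 0.0, 0] for _ in centers]
    for p in partition:
        j = nearest(p, centers)
        partial[j][0] += p[0]
        partial[j][1] += p[1]
        partial[j][2] += 1
    return partial

def kmeans_reduce(partials, centers):
    # Reduce: add the partial sums from all maps and form the new centers.
    new_centers = []
    for j, old in enumerate(centers):
        sx = sum(part[j][0] for part in partials)
        sy = sum(part[j][1] for part in partials)
        n = sum(part[j][2] for part in partials)
        new_centers.append((sx / n, sy / n) if n else old)
    return new_centers

def mapreduce_kmeans(points, centers, n_maps=2, n_iter=20):
    # Split the points into n_maps partitions (round-robin, an arbitrary choice).
    partitions = [points[i::n_maps] for i in range(n_maps)]
    for _ in range(n_iter):
        # These calls are independent, so a parallel runtime could run them
        # at the same time; here they simply run sequentially.
        partials = [kmeans_map(part, centers) for part in partitions]
        new_centers = kmeans_reduce(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Hypothetical data and starting centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
start = [(1, 1), (8, 8)]

result = mapreduce_kmeans(points, start, n_maps=2)
for cx, cy in result:
    print("(%.3f, %.3f)" % (cx, cy))
# prints (1.333, 1.333) then (8.333, 8.333)
```

Only the small per-cluster sums travel from the maps to the reduce, not the points themselves. That is why increasing `n_maps` and spreading the partitions over many nodes scales well.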
Todo
The slides or videos are going to be updated
Video 1: 9:00: Kmeans Python I: https://youtu.be/2El1oL3gKpQ
Video 2: 7:18: Kmeans Python II: https://youtu.be/LLrTWWdE3T0
We use Python to calculate PageRank from a web linkage matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit concludes with a calculation of PageRank for general web pages by extracting the secret from Google.
Todo
The slides or videos are going to be updated
Slides (19 pages): https://iu.app.box.com/s/gwq1qp0kmwbvilo0kjqq
Files:
We take two simple matrices for 6 and 8 web sites respectively to illustrate the calculation of PageRank.
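One common formulation of the PageRank calculation is power iteration on the link structure, sketched below in plain Python. The 6-site link graph is invented for this example and is not one of the matrices from the slides. The damping factor of 0.85 is the conventional choice and is assumed here. The sketch also assumes every site has at least one outgoing link.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=500):
    # links[i] is the list of sites that site i links to.
    n = len(links)
    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = [(1.0 - damping) / n] * n
        for i, targets in links.items():
            share = damping * rank[i] / len(targets)  # site i splits its rank evenly
            for j in targets:
                new_rank[j] += share
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:  # stop when the vector no longer changes
            break
    return rank

# Hypothetical web of 6 sites: site -> sites it links to.
links = {
    0: [1, 2],
    1: [2],
    2: [0],
    3: [2, 4],
    4: [3, 5],
    5: [4],
}

ranks = pagerank(links)
for site, r in enumerate(ranks):
    print("site %d: %.4f" % (site, r))
```

Each pass multiplies the rank vector by the damped link matrix, so the vector converges to that matrix's leading eigenvector. In this made-up graph site 2 ends up with the highest rank because three other sites link to it.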
Todo
The slides or videos are going to be updated
Video 1: 9:18: PageRank I: https://youtu.be/rLWUvvcHrCQ
Video 2: 9:57: PageRank II: https://youtu.be/UzQRukCFQv8
This tiny lesson presents Python code that finds the PageRank that Google calculates for any page on the web.
Todo
The slides or videos are going to be updated
Video: 9:57: PageRank III: https://youtu.be/8L_72bRLQVk