Syllabus¶

Warning

This Syllabus will be modified. A final Syllabus will be available at the beginning of class. The class will be enhanced with SIGNIFICANT programming activities. An option of taking the class without programming is available, but the maximum grade is limited to an A- in that case.

Errata
Section 1 - Introduction
- Unit 1.1 - Course Introduction
- Lesson 1
- Lesson 2 - Overall Introduction
- Lesson 3 - Course Topics I
- Lesson 4 - Course Topics II
- Lesson 5 - Course Topics III
- Unit 1.2 - Course Motivation
- Slides
- Lesson 1 - Introduction
- Lesson 2: Data Deluge
- Lesson 3 - Jobs
- Lesson 4 - Industrial Trends
- Lesson 5 - Digital Disruption of Old Favorites
- Lesson 6 - Computing Model: Industry adopted clouds which are attractive for data analytics
- Lesson 7 - Research Model: 4th Paradigm; From Theory to Data driven science?
- Lesson 8 - Data Science Process
- Lesson 9 - Physics-Informatics Looking for Higgs Particle with Large Hadron Collider LHC
- Lesson 10 - Recommender Systems I
- Lesson 11 - Recommender Systems II
- Lesson 12 - Web Search and Information Retrieval
- Lesson 13 - Cloud Application in Research
- Lesson 14 - Parallel Computing and MapReduce
- Lesson 15 - Data Science Education
- Lesson 16 - Conclusions
- Resources
Section 2 - Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
- Unit 3 - Part I: Data Science generics and Commercial Data Deluge
- Slides
- Lesson 1 - What is X-Informatics and its Motto
- Lesson 2 - Jobs
- Lesson 3 - Data Deluge ~~ General Structure
- Lesson 4 - Data Science ~~ Process
- Lesson 5 - Data Deluge ~~ Internet
- Lesson 6 - Data Deluge ~~ Business I
- Lesson 7 - Data Deluge ~~ Business II
- Lesson 8 - Data Deluge ~~ Business III
- Resources
- Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology
- Unit Overview
- Slides
- Lesson 1 - Science & Research I
- Lesson 2 - Science & Research II
- Lesson 3 - Implications for Scientific Method
- Lesson 4 - Long Tail of Science
- Lesson 5 - Internet of Things
- Resources
- Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics
- Unit Overview
- Slides
- Lesson 1 - Clouds
- Lesson 2 - Features of Data Deluge I
- Lesson 3 - Features of Data Deluge II
- Lesson 4 - Data Science Process
- Lesson 5 - Data Analytics I
- Lesson 6 - Data Analytics II
- Resources
Section 3 - Health Informatics Case Study
- Unit 6 - X-Informatics Case Study: Health Informatics
- Unit Overview
- Slides:
- Lesson 1 - Big Data and Health
- Lesson 2 - Status of Healthcare Today
- Lesson 3 - Telemedicine (Virtual Health)
- Lesson 4 - Big Data and Healthcare Industry
- Lesson 5 - Medical Big Data in the Clouds
- Lesson 6 - Medical image Big Data
- Lesson 7 - Clouds and Health
- Lesson 8 - McKinsey Report on the big-data revolution in US health care
- Lesson 9 - Microsoft Report on Big Data in Health
- Lesson 10 - EU Report on Redesigning health in Europe for 2020
- Lesson 11 - Medicine and the Internet of Things
- Lesson 12 - Extrapolating to 2032
- Lesson 13 - Genomics, Proteomics and Information Visualization I
- Lesson 14 - Genomics, Proteomics and Information Visualization II
- Lesson 15 - Genomics, Proteomics and Information Visualization III
- Resources
- Slides
Section 4 - Sports Case Study
Unit 7 - Sports Informatics I : Sabermetrics (Basic)
- Unit Overview
- Slides
- Lesson 1 - Introduction and Sabermetrics (Baseball Informatics) Lesson
- Lesson 2 - Basic Sabermetrics
- Lesson 3 - Wins Above Replacement
- Resources
Unit 8 - Sports Informatics II : Sabermetrics (Advanced)
- Unit Overview
- Slides
- Lesson 1 - Pitching Clustering
- Lesson 2 - Pitcher Quality
- Lesson 3 - PITCHf/X
- Lesson 4 - Other Video Data Gathering in Baseball
- Resources
Unit 9 - Sports Informatics III : Other Sports
- Unit Overview
- Slides
- Lesson 1 - Wearables
- Lesson 2 - Soccer and the Olympics
- Lesson 3 - Spatial Visualization in NFL and NBA
- Lesson 4 - Tennis and Horse Racing
- Resources
Section 5 - Technology Training - Python & FutureSystems (will be updated)
Unit 10 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
- Unit Overview
- Lesson 1 - Introduction
- Pycharm
- Python in 45 minutes
- Lesson 3 - Numpy 1
- Lesson 4 - Numpy 2
- Lesson 5 - Numpy 3
- Lesson 6 - Matplotlib 1
- Lesson 7 - Matplotlib 2
- Lesson 8 - Scipy 1
- Lesson 9 - Scipy 2
Unit 11 - Using FutureSystems (Please do not do yet)
- Unit Overview
- Lesson 1 - FutureSystems Overview
- Lesson 2 - Creating Portal Account
- Lesson 3 - Upload an OpenId
- Lesson 4 - SSH Key Generation using ssh-keygen command
- Lesson 5 - Shell Access via SSH
- Lesson 6 - Advanced SSH
- Lesson 7 - SSH Key Generation via putty (Windows user only)
- Lesson 8 - Using FS - Creating VM using Cloudmesh and running IPython
Section 6 - Physics Case Study
Unit 12 - I: Looking for Higgs Particles, Bumps in Histograms, Experiments and Accelerators
- Unit Overview
- Slides
- Files
- Lesson 1 - Looking for Higgs Particle and Counting Introduction I
- Lesson 2 - Looking for Higgs Particle and Counting Introduction II
- Lesson 3 - Physics-Informatics Looking for Higgs Particle Experiments
- Lesson 4 - Accelerator Picture Gallery of Big Science
- Resources
Unit 13 - II: Looking for Higgs Particles: Python Event Counting for Signal and Background
- Unit Overview
- Slides
- Files
- Lesson 1 - Physics Use Case II 1: Class Software
- Lesson 2 - Physics Use Case II 2: Event Counting
- Lesson 3 - Physics Use Case II 3: With Python examples of Signal plus Background
- Lesson 4 - Physics Use Case II 4: Change shape of background & num of Higgs Particles
- Resources
Unit 14 - III: Looking for Higgs Particles: Random Variables, Physics and Normal Distributions
- Unit Overview
- Slides
- Lesson 1 - Statistics Overview and Fundamental Idea: Random Variables
- Lesson 2 - Physics and Random Variables I
- Lesson 3 - Physics and Random Variables II
- Lesson 4 - Statistics of Events with Normal Distributions
- Lesson 5 - Gaussian Distributions
- Lesson 6 - Using Statistics
- Resources
Unit 15 - IV: Looking for Higgs Particles: Random Numbers, Distributions and Central Limit Theorem
- Unit Overview
- Slides
- Files
- Lesson 1 - Generators and Seeds I
- Lesson 2 - Generators and Seeds II
- Lesson 3 - Binomial Distribution
- Lesson 4 - Accept-Reject
- Lesson 5 - Monte Carlo Method
- Lesson 6 - Poisson Distribution
- Lesson 7 - Central Limit Theorem
- Lesson 8 - Interpretation of Probability: Bayes v. Frequency
- Resources
Section 7 - Big Data Use Cases Survey
Unit 16 - Overview of NIST Big Data Public Working Group (NBD-PWG) Process and Results
- Unit Overview
- Slides
- Lesson 1 - Introduction to NIST Big Data Public Working Group (NBD-PWG) Process
- Lesson 2 - Definitions and Taxonomies Subgroup
- Lesson 3 - Reference Architecture Subgroup
- Lesson 4 - Security and Privacy Subgroup
- Lesson 5 - Technology Roadmap Subgroup
- Lesson 6 - Requirements and Use Case Subgroup Introduction I
- Lesson 7 - Requirements and Use Case Subgroup Introduction II
- Lesson 8 - Requirements and Use Case Subgroup Introduction III
- Resources
Unit 17 - 51 Big Data Use Cases
- Unit Overview
- Slides
- Lesson 1 - Government Use Cases I
- Lesson 2 - Government Use Cases II
- Lesson 3 - Commercial Use Cases I
- Lesson 4 - Commercial Use Cases II
- Lesson 5 - Commercial Use Cases III
- Lesson 6 - Defense Use Cases I
- Lesson 7 - Defense Use Cases II
- Lesson 8 - Healthcare and Life Science Use Cases I
- Lesson 9 - Healthcare and Life Science Use Cases II
- Lesson 10 - Healthcare and Life Science Use Cases III
- Lesson 11 - Deep Learning and Social Networks Use Cases
- Lesson 12 - Research Ecosystem Use Cases
- Lesson 13 - Astronomy and Physics Use Cases I
- Lesson 14 - Astronomy and Physics Use Cases II
- Lesson 15 - Environment, Earth and Polar Science Use Cases I
- Lesson 16 - Environment, Earth and Polar Science Use Cases II
- Lesson 17 - Energy Use Case
- Resources
Unit 18 - Features of 51 Big Data Use Cases
- Slides
- Lesson 1 - Summary of Use Case Classification I
- Lesson 2 - Summary of Use Case Classification II
- Lesson 3 - Summary of Use Case Classification III
- Lesson 4 - Database(SQL) Use Case Classification
- Lesson 5 - NoSQL Use Case Classification
- Lesson 6 - Use Case Classifications I
- Lesson 7 - Use Case Classifications II Part 1
- Lesson 8 - Use Case Classifications II Part 2
- Lesson 9 - Use Case Classifications III Part 1
- Lesson 10 - Use Case Classifications III Part 2
- Resources
Section 8 - Technology Training - Plotviz
Unit 19 - Using Plotviz Software for Displaying Point Distributions in 3D
- Slides
- Files
- Lesson 1 - Motivation and Introduction to use
- Lesson 2 - Example of Use I: Cube and Structured Dataset
- Lesson 3 - Example of Use II: Proteomics and Synchronized Rotation
- Lesson 4 - Example of Use III: More Features and larger Proteomics Sample
- Lesson 5 - Example of Use IV: Tools and Examples
- Lesson 6 - Example of Use V: Final Examples
- Resources
Section 9 - e-Commerce and LifeStyle Case Study
Unit 20 - Recommender Systems: Introduction
- Slides
- Lesson 1 - Recommender Systems as an Optimization Problem
- Lesson 2 - Recommender Systems Introduction
- Lesson 3 - Kaggle Competitions
- Lesson 4 - Examples of Recommender Systems
- Lesson 5 - Netflix on Recommender Systems I
- Lesson 6 - Netflix on Recommender Systems II
- Lesson 7 - Consumer Data Science
- Resources
Unit 21 - Recommender Systems: Examples and Algorithms
- Slides
- Lesson 1 - Recap and Examples of Recommender Systems
- Lesson 2 - Examples of Recommender Systems
- Lesson 3 - Recommender Systems in Yahoo Use Case Example I
- Lesson 4 - Recommender Systems in Yahoo Use Case Example II
- Lesson 5 - Recommender Systems in Yahoo Use Case Example III: Particular Module
- Lesson 6 - User-based nearest-neighbor collaborative filtering I
- Lesson 7 - User-based nearest-neighbor collaborative filtering II
- Lesson 8 - Vector Space Formulation of Recommender Systems
- Resources
Unit 22 - Item-based Collaborative Filtering and its Technologies
- Slides
- Lesson 1 - Item-based Collaborative Filtering I
- Lesson 2 - Item-based Collaborative Filtering II
- Lesson 3 - k Nearest Neighbors and High Dimensional Spaces
Section 10 - Technology Training - kNN & Clustering
- Unit 23 - Recommender Systems - K-Nearest Neighbors (Python & Java Track)
- Slides
- Files
- Lesson 1 - Python k’th Nearest Neighbor Algorithms I
- Lesson 2 - Python k’th Nearest Neighbor Algorithms II
- Lesson 3 - 3D Visualization
- Lesson 4 - Testing k’th Nearest Neighbor Algorithms
Unit 24 - Clustering and heuristic methods
- Slides
- Files
- Lesson 1 - Kmeans Clustering
- Lesson 2 - Clustering of Recommender System Example
- Lesson 3 - Clustering of Recommender Example into more than 3 Clusters
- Lesson 4 - Local Optima in Clustering
- Lesson 5 - Clustering in General
- Lesson 6 - Heuristics
- Resources
Section 11 - Cloud Computing Technology for Big Data Applications & Analytics (will be updated)
Unit 25 - Parallel Computing: Overview of Basic Principles with familiar Examples
- Slides
- Lesson 1 - Decomposition I
- Lesson 2 - Decomposition II
- Lesson 3 - Decomposition III
- Lesson 4 - Parallel Computing in Society I
- Lesson 5 - Parallel Computing in Society II
- Lesson 6 - Parallel Processing for Hadrian’s Wall
- Resources
Unit 26 - Cloud Computing Technology Part I: Introduction
- Slides
- Lesson 1 - Cyberinfrastructure for E-MoreOrLessAnything
- Lesson 2 - What is Cloud Computing: Introduction
- Lesson 3 - What and Why is Cloud Computing: Several Other Views I
- Lesson 4 - What and Why is Cloud Computing: Several Other Views II
- Lesson 5 - What and Why is Cloud Computing: Several Other Views III
- Lesson 6 - Gartner’s Emerging Technology Landscape for Clouds and Big Data
- Lesson 7 - Simple Examples of use of Cloud Computing
- Lesson 8 - Value of Cloud Computing
- Resources
Unit 27 - Cloud Computing Technology Part II: Software and Systems
- Slides
- Lesson 1 - What is Cloud Computing
- Lesson 2 - Introduction to Cloud Software Architecture: IaaS and PaaS I
- Lesson 3 - Introduction to Cloud Software Architecture: IaaS and PaaS II
- Lesson 4 - Using the HPC-ABDS Software Stack
- Resources
Unit 28 - Cloud Computing Technology Part III: Architectures, Applications and Systems
- Slides
- Lesson 1 - Cloud (Data Center) Architectures I
- Lesson 2 - Cloud (Data Center) Architectures II
- Lesson 3 - Analysis of Major Cloud Providers
- Lesson 4 - Commercial Cloud Storage Trends
- Lesson 5 - Cloud Applications I
- Lesson 6 - Cloud Applications II
- Lesson 7 - Science Clouds
- Lesson 8 - Security
- Lesson 9 - Comments on Fault Tolerance and Synchronicity Constraints
- Resources
Unit 29 - Cloud Computing Technology Part IV: Data Systems
- Slides
- Lesson 1 - The 10 Interaction scenarios (access patterns) I
- Lesson 2 - The 10 Interaction scenarios - Science Examples
- Lesson 3 - Remaining general access patterns
- Lesson 4 - Data in the Cloud
- Lesson 5 - Applications Processing Big Data
- Resources
Section 12 - Web Search and Text Mining and their technologies
Unit 30 - Web Search and Text Mining I
- Slides
- Lesson 1 - Web and Document/Text Search: The Problem
- Lesson 2 - Information Retrieval leading to Web Search
- Lesson 3 - History behind Web Search
- Lesson 4 - Key Fundamental Principles behind Web Search
- Lesson 5 - Information Retrieval (Web Search) Components
- Lesson 6 - Search Engines
- Lesson 7 - Boolean and Vector Space Models
- Lesson 8 - Web crawling and Document Preparation
- Lesson 9 - Indices
- Lesson 10 - TF-IDF and Probabilistic Models
- Resources
Unit 31 - Web Search and Text Mining II
- Slides
- Lesson 1 - Data Analytics for Web Search
- Lesson 2 - Link Structure Analysis including PageRank I
- Lesson 3 - Link Structure Analysis including PageRank II
- Lesson 4 - Web Advertising and Search
- Lesson 5 - Clustering and Topic Models
- Resources
Section 13 - Technology for Big Data Applications and Analytics
Unit 32 - Technology for X-Informatics: K-means (Python & Java Track)
- Slides
- Files
- Lesson 1 - K-means in Python
- Lesson 2 - Analysis of 4 Artificial Clusters I
- Lesson 3 - Analysis of 4 Artificial Clusters II
- Lesson 4 - Analysis of 4 Artificial Clusters III
Unit 33 - Technology for X-Informatics: MapReduce
- Slides
- Lesson 1 - Introduction
- Lesson 2 - Advanced Topics I
- Lesson 3 - Advanced Topics II
- Unit 34 - Technology: Kmeans and MapReduce Parallelism
- Slides
- Files
- Lesson 1 - MapReduce Kmeans in Python I
- Lesson 2 - MapReduce Kmeans in Python II
Unit 35 - Technology: PageRank (Python & Java Track)
- Slides
- Files
- Lesson 1 - Calculate PageRank from Web Linkage Matrix I
- Lesson 2 - Calculate PageRank from Web Linkage Matrix II
- Lesson 3 - Calculate PageRank of a real page
Section 14 - Sensors Case Study
Unit 36 - Case Study: Sensors
- Slides
- Lesson 1 - Internet of Things
- Lesson 2 - Robotics and IOT Expectations
- Lesson 3 - Industrial Internet of Things I
- Lesson 4 - Industrial Internet of Things II
- Lesson 5 - Sensor Clouds
- Lesson 6 - Earth/Environment/Polar Science data gathered by Sensors
- Lesson 7 - Ubiquitous/Smart Cities
- Lesson 8 - U-Korea (U=Ubiquitous)
- Lesson 9 - Smart Grid
- Resources
Section 15 - Radar Case Study
- Unit 37 - Case Study: Radar
- Slides
- Lesson 1 - Introduction
- Lesson 2 - Remote Sensing
- Lesson 3 - Ice Sheet Science
- Lesson 4 - Global Climate Change
- Lesson 5 - Radio Overview
- Lesson 6 - Radio Informatics

Errata ¶

Note

You may find that some videos may have a different lesson, section or unit number. Please ignore this. In case the content does not correspond to the title, please let us know.

Section 1 - Introduction ¶

This section gives a technical overview of the course followed by a broad motivation for the course.

The course overview covers its content and structure. It presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons.

The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The overview ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, then units describing applications (X = Physics, e-Commerce, Web Search and Text Mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications). The course uses either Python or Java and there are Side MOOCs discussing Python and Java tracks.

The course motivation starts with striking examples of the data deluge from research, business and the consumer. The growing number of jobs in data science is highlighted. We describe industry trends in both clouds and big data. Then the cloud computing model developed at amazing speed by industry is introduced. The 4 paradigms of scientific research are described with the growing importance of the data-oriented version. We cover 3 major X-informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. We comment on data science education and the benefits of using MOOCs.

Unit 1.1 - Course Introduction ¶

Lesson 1 ¶

We provide a short introduction to the course covering its content and structure. It presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons. The discussion of course mechanics is followed by a list of all the units offered.

Video: https://youtu.be/CRYz3iTJxRQ

Video with cc: https://www.youtube.com/watch?v=WZxnCa9Ltoc

Lesson 2 - Overall Introduction ¶

This course gives an overview of big data from a use case (application) point of view, noting that big data in field X drives the concept of X-Informatics. It covers applications, algorithms and infrastructure/technology (cloud computing). We provide a short overview of the Syllabus.

Video: https://youtu.be/Gpivfx4v5eY

Video with cc: https://www.youtube.com/watch?v=aqgDnu5fRMM

Lesson 3 - Course Topics I ¶

Discussion of some of the available units:

Motivation: Big Data and the Cloud; Centerpieces of the Future Economy
Introduction: What is Big Data, Data Analytics and X-Informatics
Python for Big Data Applications and Analytics: NumPy, SciPy, MatPlotlib
Using FutureGrid for Big Data Applications and Analytics Course
X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV.

Video: http://youtu.be/9NgG-AUOpYQ

Lesson 4 - Course Topics II ¶

Discussion of some more of the available units:

X-Informatics Use Cases: Big Data Use Cases Survey
Using Plotviz Software for Displaying Point Distributions in 3D
X-Informatics Use Case: e-Commerce and Lifestyle with recommender systems
Technology Recommender Systems - K-Nearest Neighbors, Clustering and heuristic methods
Parallel Computing Overview and familiar examples
Cloud Technology for Big Data Applications & Analytics

Video: http://youtu.be/pxuyjeLQc54

Lesson 5 - Course Topics III ¶

Discussion of the remainder of the available units:

X-Informatics Use Case: Web Search and Text Mining and their technologies
Technology for X-Informatics: PageRank
Technology for X-Informatics: Kmeans
Technology for X-Informatics: MapReduce
Technology for X-Informatics: Kmeans and MapReduce Parallelism
X-Informatics Use Case: Sports
X-Informatics Use Case: Health
X-Informatics Use Case: Sensors
X-Informatics Use Case: Radar for Remote Sensing.

Video: http://youtu.be/rT4thK_i5ig

Unit 1.2 - Course Motivation ¶

We motivate the study of X-informatics by describing data science and clouds. We start with striking examples of the data deluge from research, business and the consumer. The growing number of jobs in data science is highlighted. We describe industry trends in both clouds and big data.

We introduce the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with the growing importance of the data-oriented version. We cover 3 major X-informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. We comment on data science education and the benefits of using MOOCs.

Slides ¶

https://iu.box.com/s/muldo1qkcdlpdeiog3zo

Lesson 1 - Introduction ¶

This presents an overview of the talk and some trends in computing, data and jobs. Gartner’s emerging technology hype cycle shows many areas of Clouds and Big Data. We highlight 6 issues of importance: economic imperative, computing model, research model, opportunities in advancing computing, opportunities in X-Informatics, and data science education.

Video: http://youtu.be/kyJxstTivoI

Lesson 2: Data Deluge ¶

We give some amazing statistics for total storage; uploaded video and uploaded photos; the social media interactions every minute; aspects of the business big data tidal wave; monitors of aircraft engines; the science research data sizes from particle physics to astronomy and earth science; genes sequenced; and finally the long tail of science. The next slide emphasizes applications using algorithms on clouds. This leads to the rallying cry “Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science” with a catalog of the many values of X: “Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness”.

Video: http://youtu.be/sVNV0NxlQ6A

Lesson 3 - Jobs ¶

Jobs abound in clouds and data science. There are documented shortages in data science and computer science, and the major tech companies advertise for new talent.

Video: http://youtu.be/h9u7YeKkHHU

Lesson 4 - Industrial Trends ¶

Trends include the growing importance of mobile devices and the comparative decrease in desktop access, the export of internet content, the change in dominant client operating systems, the use of social media, and thriving Chinese internet companies.

Video: http://youtu.be/EIRIPDYN5nM

Lesson 5 - Digital Disruption of Old Favorites ¶

Not everything goes up. The rise of the Internet has led to declines in some traditional areas, including shopping malls and postal services.

Video: http://youtu.be/RxGopRuMWOE

Lesson 6 - Computing Model: Industry adopted clouds which are attractive for data analytics ¶

Clouds and Big Data are transformational on a 2-5 year time scale. Already Amazon AWS is a lucrative business with almost $4B in revenue. We describe the nature of cloud centers with economies of scale and give examples of the importance of virtualization in server consolidation. Then key characteristics of clouds are reviewed, with expected high growth in Infrastructure, Platform and Software as a Service.

Video: http://youtu.be/NBZPQqXKbiw

Lesson 7 - Research Model: 4th Paradigm; From Theory to Data driven science?¶

We introduce the 4 paradigms of scientific research with a focus on the new fourth, data-driven methodology.

Video: http://youtu.be/2ke459BRBhw

Lesson 8 - Data Science Process ¶

We introduce the DIKW (data to information to knowledge to wisdom) paradigm. Data flows through cloud services, being transformed and emerging as new information that is input into other transformations.

Video: http://youtu.be/j9ytOaBoe2k

Lesson 9 - Physics-Informatics Looking for Higgs Particle with Large Hadron Collider LHC ¶

We look at an important particle physics example where the Large Hadron Collider has observed the Higgs Boson. The lecturer shows this discovery as a bump in a histogram, something that so amazed him 50 years ago that he got a PhD in this field. He left the field partly due to the incredible size of author lists on papers.

Video: http://youtu.be/qUB0q4AOavY

Lesson 10 - Recommender Systems I ¶

Many important applications involve matching users, web pages, jobs, movies, books, events etc. These are all optimization problems, with recommender systems being one important way of performing this optimization. We go through the example of Netflix ~~ everything is a recommendation ~~ and muse about the power of viewing all sorts of things as items in a bag or, more abstractly, some space with funny properties.

Video: http://youtu.be/Aj5k0Sa7XGQ

Lesson 11 - Recommender Systems II ¶

Continuation of Lesson 10 - Part 2

Video: http://youtu.be/VHS7il5OdjM

Lesson 12 - Web Search and Information Retrieval ¶

This course also looks at Web Search, and here we give an overview of the data analytics for web search and of PageRank as a method of ranking the web pages returned, and use material from Yahoo on the subtle algorithms for dynamic personalized choice of material for web pages.

Video: http://youtu.be/i9gR9PdVXUU
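
As a rough illustration of the PageRank idea mentioned above, the following is a minimal sketch in plain Python, not the code used in the course. The three-page link graph `web`, the function name `pagerank`, the damping factor of 0.85 and the fixed iteration count are all assumptions chosen for the example. It repeatedly passes each page's rank along its outgoing links until the values settle.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate PageRank for a dict of page -> list of linked pages.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its damped rank equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page C ends up ranked above page B, because C is linked from both A and B while B is linked only from A.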

Lesson 13 - Cloud Application in Research ¶

We describe scientific applications and how they map onto clouds, supercomputers, grids and high throughput systems. The lecturer likes the cloud use of the Internet of Things and gives examples.

Video: http://youtu.be/C19-5WQH2TU

Lesson 14 - Parallel Computing and MapReduce ¶

We define MapReduce and give a homely example from fruit blending.

Video: http://youtu.be/BbW1PFNnKrE
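
The MapReduce pattern can be sketched in a few lines of plain Python. This is only an illustrative, single-process sketch and does not reproduce the fruit blending example from the lecture. The basket of fruit and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions made for the example. It shows the three stages: map emits key-value pairs, shuffle groups the values by key, and reduce combines each group.

```python
from collections import defaultdict


def map_phase(fruits):
    # Map: emit a (key, value) pair for every input record.
    return [(fruit, 1) for fruit in fruits]


def shuffle(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine the values of each key into a single result.
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input: a basket of fruit to be counted by kind.
basket = ["apple", "banana", "apple", "orange", "banana", "apple"]
counts = reduce_phase(shuffle(map_phase(basket)))
print(counts)
```

In a real MapReduce system the map and reduce steps run in parallel on many machines, and the shuffle moves data between them; here everything runs sequentially to keep the structure visible.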

Lesson 15 - Data Science Education ¶

We discuss one reason you are taking this course ~~ Data Science as an educational initiative and aspects of its Indiana University implementation. Then general features of online education are discussed, with clear growth spearheaded by MOOCs, where we use this course and others as examples. The lecturer stresses the choice between one class to 100,000 students or 2,000 classes to 50 students and an online library of MOOC lessons. In olden days he suggested a “hermit’s cage virtual university” ~~ gurus in isolated caves putting together exciting curricula outside the traditional university model. Grading and mentoring models and important online tools are discussed. Clouds have MOOCs describing them and MOOCs are stored in clouds; a pleasing symmetry.

Video: http://youtu.be/x2LuiX8DYLs

Lesson 16 - Conclusions ¶

The conclusions highlight clouds, data-intensive methodology, employment, data science and MOOCs, and never forget the Big Data ecosystem in one sentence: “Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science”.

Video: http://youtu.be/C0GszJg-MjE

Resources ¶

http://www.gartner.com/technology/home.jsp and many web links
Meeker/Wu May 29 2013 Internet Trends D11 Conference http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013
http://cs.metrostate.edu/~sbd/slides/Sun.pdf
Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, Bill Franks Wiley ISBN: 978-1-118-20878-6
Bill Ruh http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
http://www.genome.gov/sequencingcosts/
CSTI General Assembly 2012, Washington, D.C., USA Technical Activities Coordinating Committee (TACC) Meeting, Data Management, Cloud Computing and the Long Tail of Science October 2012 Dennis Gannon
http://www.microsoft.com/en-us/news/features/2012/mar12/03-05CloudComputingJobs.aspx
http://www.mckinsey.com/mgi/publications/big_data/index.asp
Tom Davenport http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf
http://research.microsoft.com/pubs/78813/AJ18_EN.pdf
http://www.google.com/green/pdfs/google-green-computing.pdf
http://www.wired.com/wired/issue/16-07
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Jeff Hammerbacher http://berkeleydatascience.files.wordpress.com/2012/01/20120117berkeley1.pdf
http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf
http://www.interactions.org/cms/?pid=1032811
http://www.quantumdiaries.org/2012/09/07/why-particle-detectors-need-a-trigger/atlasmgg/
http://www.sciencedirect.com/science/article/pii/S037026931200857X
http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial
http://www.ifi.uzh.ch/ce/teaching/spring2012/16-Recommender-Systems_Slides.pdf
http://en.wikipedia.org/wiki/PageRank
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
https://sites.google.com/site/opensourceiotcloud/
http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/
http://blog.coursera.org/post/49750392396/on-the-topic-of-boredom
http://x-informatics.appspot.com/course
http://iucloudsummerschool.appspot.com/preview
https://www.youtube.com/watch?v=M3jcSCA9_hM

Section 2 - Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?¶

The course introduction starts with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. The first unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline are covered. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

In the next unit, we continue the discussion of the data deluge with a focus on scientific research. We take a first peek at data from the Large Hadron Collider, considered later as physics Informatics, and give some biology examples. We discuss the implication of data for the scientific method, which is changing with the data-intensive methodology joining observation, theory and simulation as basic methods. Two broad classes of data are the long tail of sciences: many users with individually modest data adding up to a lot; and a myriad of Internet connected devices ~~ the Internet of Things.

We give an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft with new centers holding up to a million servers. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing with a comparison to supercomputing. Features of the data deluge are discussed with a salutary example where more data did better than more thought. Then comes Data science and one part of it ~~ data analytics ~~ the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

Unit 3 - Part I: Data Science generics and Commercial Data Deluge ¶

We start with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. This unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Then we discuss data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

Slides ¶

https://iu.box.com/s/rmnw3soy81kc82a5qzow

Lesson 1 - What is X-Informatics and its Motto ¶

This lesson discusses trends that are driven by and accompany Big data. We give some key terms including data, information, knowledge, wisdom, data analytics and data science. We introduce the motto of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics. We list many values of X that can be defined in various activities across the world.

Video: http://youtu.be/AKkyWF95Fp4

Lesson 2 - Jobs ¶

Big data is especially important as there are so many related jobs. We illustrate this for both cloud computing and data science from reports by Microsoft and the McKinsey Institute respectively. We show a plot from LinkedIn showing the rapid increase in the number of data science and analytics jobs as a function of time.

Video: http://youtu.be/pRlfEigUJAc

Lesson 3 - Data Deluge ~~ General Structure ¶

We look at some broad features of the data deluge starting with the size of data in various areas, especially in science research. We give examples from the real world of the importance of big data and illustrate how it is integrated into an enterprise IT architecture. We give some views as to what characterizes Big data and why data science is a science that is needed to interpret all the data.

Video: http://youtu.be/mPJ9twAFRQU

Lesson 4 - Data Science ~~ Process ¶

We stress the DIKW pipeline: Data becomes information that becomes knowledge and then wisdom, policy and decisions. This pipeline is illustrated with Google maps and we show how complex the ecosystem of data, transformations (filters) and its derived forms is.

Video: http://youtu.be/ydH34L-z0Rk

Lesson 5 - Data Deluge ~~ Internet ¶

We give examples of Big data from the Internet with Tweets, uploaded photos and an illustration of the vitality and size of many commodity applications.

Video: http://youtu.be/rtuq5y2Bx2g

Lesson 6 - Data Deluge ~~ Business I ¶

We give examples including the Big data that enables wind farms, city transportation, telephone operations, machines with health monitors, and the banking, manufacturing and retail industries, both online and offline in shopping malls. We give examples from eBay showing how analytics allows them to refine and improve the customer experience.

Video: http://youtu.be/PJz38t6yn_s

Lesson 7 - Data Deluge ~~ Business II ¶

Continuation of Lesson 6 - Part 2

Video: http://youtu.be/fESm-2Vox9M

Lesson 8 - Data Deluge ~~ Business III ¶

Continuation of Lesson 6 - Part 3

Video: http://youtu.be/fcvn-IxPO00

Resources ¶

http://www.microsoft.com/en-us/news/features/2012/mar12/03-05CloudComputingJobs.aspx
http://www.mckinsey.com/mgi/publications/big_data/index.asp
Tom Davenport http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
Anjul Bhambhri http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
Jeff Hammerbacher http://berkeleydatascience.files.wordpress.com/2012/01/20120117berkeley1.pdf
http://www.economist.com/node/15579717
http://cs.metrostate.edu/~sbd/slides/Sun.pdf
http://jess3.com/geosocial-universe-2/
Bill Ruh http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
http://www.hsph.harvard.edu/ncb2011/files/ncb2011-z03-rodriguez.pptx
Hugh Williams http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html

Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology ¶

Unit Overview ¶

We continue the discussion of the data deluge with a focus on scientific research. We take a first peek at data from the Large Hadron Collider, considered later as physics informatics, and give some biology examples. We discuss the implication of data for the scientific method, which is changing with the data-intensive methodology joining observation, theory and simulation as basic methods. We discuss the long tail of sciences: many users with individually modest data adding up to a lot. The last lesson emphasizes how everyday devices ~~ the Internet of Things ~~ are being used to create a wealth of data.

Slides ¶

https://iu.box.com/s/e73lyv9sx7xcaqymb2n6

Lesson 1 - Science & Research I ¶

We look into more big data examples with a focus on science and research. We give examples from astronomy, genomics, radiology, particle physics and the discovery of the Higgs particle (covered in more detail in later lessons), and the European Bioinformatics Institute, and contrast them with Facebook and Walmart.

Video: http://youtu.be/u1h6bAkuWQ8

Lesson 2 - Science & Research II ¶

Continuation of Lesson 1 - Part 2

Video: http://youtu.be/_JfcUg2cheg

Lesson 3 - Implications for Scientific Method ¶

We discuss the emergence of a new fourth methodology for scientific research based on data driven inquiry. We contrast this with the third methodology ~~ computation or simulation based discovery ~~ which itself emerged some 25 years ago.

Video: http://youtu.be/srEbOAmU_g8

Lesson 4 - Long Tail of Science ¶

There is big science such as particle physics, where a single experiment has 3000 people collaborating. Then there are individual investigators who don’t generate a lot of data each, but together they add up to Big data.

Video: http://youtu.be/dwzEKEGYhqE

Lesson 5 - Internet of Things ¶

A final category of Big data comes from the Internet of Things, where lots of small devices ~~ smart phones, web cams, video games ~~ collect and disseminate data and are controlled and coordinated in the cloud.

Video: http://youtu.be/K2anbyxX48w

Resources ¶

http://www.economist.com/node/15579717
Geoffrey Fox and Dennis Gannon Using Clouds for Technical Computing To be published in Proceedings of HPC 2012 Conference at Cetraro, Italy June 28 2012
http://grids.ucs.indiana.edu/ptliupages/publications/Clouds_Technical_Computing_FoxGannonv2.pdf
http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf
http://www.genome.gov/sequencingcosts/
http://www.quantumdiaries.org/2012/09/07/why-particle-detectors-need-a-trigger/atlasmgg
http://salsahpc.indiana.edu/dlib/articles/00001935/
http://en.wikipedia.org/wiki/Simple_linear_regression
http://www.ebi.ac.uk/Information/Brochures/
http://www.wired.com/wired/issue/16-07
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
CSTI General Assembly 2012, Washington, D.C., USA Technical Activities Coordinating Committee (TACC) Meeting, Data Management, Cloud Computing and the Long Tail of Science October 2012 Dennis Gannon https://sites.google.com/site/opensourceiotcloud/

Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics ¶

Unit Overview ¶

We discuss features of the data deluge with a salutary example where more data did better than more thought. We introduce data science and one part of it ~~ data analytics ~~ the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

Slides ¶

https://iu.box.com/s/38z9ryldgi3b8dgcbuan

Lesson 1 - Clouds ¶

We describe cloud data centers with their staggering size with up to a million servers in a single data center and centers built modularly from shipping containers full of racks. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing and a comparison to supercomputing.

Video: http://youtu.be/8RBzooC_2Fw

Lesson 2 - Features of Data Deluge I ¶

Data, information, intelligence algorithms, infrastructure, data structure, semantics and knowledge are related. The semantic web and Big data are compared. We give an example where “More data usually beats better algorithms”. We discuss examples of intelligent big data and list 8 different types of data deluge.

Video: http://youtu.be/FMktnTQGyrw

Lesson 3 - Features of Data Deluge II ¶

Continuation of Lesson 2 - Part 2

Video: http://youtu.be/QNVZobXHiZw

Lesson 4 - Data Science Process ¶

We describe and critique one view of the work of a data scientist. Then we discuss and contrast 7 views of the process needed to speed data through the DIKW pipeline.

Note

You may find that some videos may have a different lesson, section or unit number. Please ignore this. In case the content does not correspond to the title, please let us know.

Video: http://youtu.be/lpQ-Q9ZidR4

Lesson 5 - Data Analytics I ¶

We stress the importance of data analytics giving examples from several fields. We note that better analytics is as important as better computing and storage capability.

Video: http://youtu.be/RPVojR8jrb8

Lesson 6 - Data Analytics II ¶

Continuation of Lesson 5 - Part 2

Link to the slide: http://archive2.cra.org/ccc/files/docs/nitrdsymposium/keyes.pdf

High Performance Computing in Science and Engineering: the Tree and the Fruit

Video: http://youtu.be/wOSgywqdJDY

Resources ¶

CSTI General Assembly 2012, Washington, D.C., USA Technical Activities Coordinating Committee (TACC) Meeting, Data Management, Cloud Computing and the Long Tail of Science October 2012 Dennis Gannon
Dan Reed, Roger Barga, Dennis Gannon, Rich Wolski http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf
http://www.datacenterknowledge.com/archives/2011/05/10/uptime-institute-the-average-pue-is-1-8/
http://loosebolts.wordpress.com/2008/12/02/our-vision-for-generation-4-modular-data-centers-one-way-of-getting-it-just-right/
http://www.mediafire.com/file/zzqna34282frr2f/koomeydatacenterelectuse2011finalversion.pdf
Bina Ramamurthy http://www.cse.buffalo.edu/~bina/cse487/fall2011/
Jeff Hammerbacher http://berkeleydatascience.files.wordpress.com/2012/01/20120117berkeley1.pdf
Jeff Hammerbacher http://berkeleydatascience.files.wordpress.com/2012/01/20120119berkeley.pdf
Anjul Bhambhri http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
http://cs.metrostate.edu/~sbd/slides/Sun.pdf
Hugh Williams http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
Tom Davenport http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
http://www.mckinsey.com/mgi/publications/big_data/index.asp
http://cra.org/ccc/docs/nitrdsymposium/pdfs/keyes.pdf

Section 3 - Health Informatics Case Study ¶

This section starts by discussing general aspects of Big Data and Health including data sizes and different areas including genomics, EBI, radiology and the Quantified Self movement. We review the current state of health care and trends associated with it, including increased use of Telemedicine. We summarize an industry survey by GE and Accenture and an impressive exemplar Cloud-based medicine system from Potsdam. We give some details of big data in medicine. Some remarks on Cloud computing and Health focus on security and privacy issues.

We survey an April 2013 McKinsey report on the Big Data revolution in US health care; a Microsoft report in this area and a European Union report on how Big Data will allow patient centered care in the future. Examples are given of the Internet of Things, which will have great impact on health including wearables. A study looks at 4 scenarios for healthcare in 2032. Two are positive, one middle of the road and one negative. The final topic is Genomics, Proteomics and Information Visualization.

Unit 6 - X-Informatics Case Study: Health Informatics ¶

Unit Overview ¶

Slides:¶

https://iu.app.box.com/s/4v7omhmfpzd4y1bkpy9iab6o4jyephoa

Lesson 1 - Big Data and Health ¶

This lesson starts with general aspects of Big Data and Health, including a list of subareas where Big data is important. Data sizes are given in radiology, genomics, personalized medicine, and the Quantified Self movement, with sizes and access to the European Bioinformatics Institute.

Video: http://youtu.be/i7volfOVAmY

Lesson 2 - Status of Healthcare Today ¶

This lesson covers trends in the costs and types of healthcare, with low cost genomes and an aging population. It also touches on social media and the government Brain initiative.

Video: http://youtu.be/tAT3pux4zeg

Lesson 3 - Telemedicine (Virtual Health)¶

This describes increasing use of telemedicine and how we tried and failed to do this in 1994.

Video: http://youtu.be/4JbGim9FFXg

Lesson 4 - Big Data and Healthcare Industry ¶

Summary of an industry survey by GE and Accenture.

Video: http://youtu.be/wgK9JIUiWpQ

Lesson 5 - Medical Big Data in the Clouds ¶

An impressive exemplar Cloud-based medicine system from Potsdam.

Video: http://youtu.be/-D9mEdM62uY

Lesson 6 - Medical image Big Data ¶

Video: http://youtu.be/aaNplveyKf0

Lesson 7 - Clouds and Health ¶

Video: http://youtu.be/9Whkl_UPS5g

Lesson 8 - McKinsey Report on the big-data revolution in US health care ¶

This lesson covers 9 aspects of the McKinsey report. These are the convergence of multiple positive changes has created a tipping point for innovation; Primary data pools are at the heart of the big data revolution in healthcare; Big data is changing the paradigm: these are the value pathways; Applying early successes at scale could reduce US healthcare costs by $300 billion to $450 billion; Most new big-data applications target consumers and providers across pathways; Innovations are weighted towards influencing individual decision-making levers; Big data innovations use a range of public, acquired, and proprietary data types; Organizations implementing a big data transformation should provide the leadership required for the associated cultural transformation; Companies must develop a range of big data capabilities.

Video: http://youtu.be/bBoHzRjMEmY

Lesson 9 - Microsoft Report on Big Data in Health ¶

This lesson identifies data sources as Clinical Data, Pharma & Life Science Data, Patient & Consumer Data, Claims & Cost Data and Correlational Data. Three approaches are Live data feed, Advanced analytics and Social analytics.

Video: http://youtu.be/PjffvVgj1PE

Lesson 10 - EU Report on Redesigning health in Europe for 2020 ¶

This lesson summarizes an EU Report on Redesigning health in Europe for 2020. The power of data is seen as a lever for change in My Data, My decisions; Liberate the data; Connect up everything; Revolutionize health; and Include Everyone removing the current correlation between health and wealth.

Video: http://youtu.be/9mbt_ZSs0iw

Lesson 11 - Medicine and the Internet of Things ¶

The Internet of Things will have great impact on health including telemedicine and wearables. Examples are given.

Video: http://youtu.be/QGRfWlvw584

Lesson 12 - Extrapolating to 2032 ¶

A study looks at 4 scenarios for healthcare in 2032. Two are positive, one middle of the road and one negative.

Video: http://youtu.be/Qel4gmBxy8U

Lesson 13 - Genomics, Proteomics and Information Visualization I ¶

A study of an Azure application with an Excel frontend and a cloud BLAST backend starts this lesson. This is followed by a big data analysis of personal genomics and an analysis of a typical DNA sequencing analytics pipeline. The Protein Sequence Universe is defined and used to motivate Multidimensional Scaling (MDS). Sammon’s method is defined and its use illustrated by a metagenomics example. Subtleties in the use of MDS include a monotonic mapping of the dissimilarity function. The application to the COG Proteomics dataset is discussed. We note that the MDS approach is related to the well-known chisq method, and some aspects of nonlinear minimization of chisq (least squares) are discussed.

Video: http://youtu.be/r1yENstaAUE
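The link between MDS and a chisq-style sum of squares can be made concrete with a minimal sketch of the Sammon stress. The function names and the sample points are illustrative assumptions, not the code used in the lesson, and the sketch assumes all original points are distinct so no original distance is zero.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sammon_stress(original_points, embedded_points):
    """Sammon stress: weighted squared mismatch between original and embedded distances.

    Each pair contributes (d_original - d_embedded)^2 / d_original, and the sum is
    normalized by the total of the original distances. Assumes distinct original points.
    """
    total_original = 0.0
    weighted_error = 0.0
    for i, j in combinations(range(len(original_points)), 2):
        d_original = euclidean(original_points[i], original_points[j])
        d_embedded = euclidean(embedded_points[i], embedded_points[j])
        total_original += d_original
        weighted_error += (d_original - d_embedded) ** 2 / d_original
    return weighted_error / total_original


# Hypothetical data: three collinear points in 3D and two candidate 1D embeddings.
points_3d = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
step = math.sqrt(3.0)
faithful = [(0.0,), (step,), (2.0 * step,)]   # preserves every distance
collapsed = [(0.0,), (0.0,), (0.0,)]          # maps every point to the origin

print(sammon_stress(points_3d, faithful))
print(sammon_stress(points_3d, collapsed))
```

An MDS implementation would minimize this quantity over the embedded coordinates with a nonlinear least squares method; the sketch only evaluates it for fixed embeddings.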

Lesson 14 - Genomics, Proteomics and Information Visualization II ¶

This lesson continues the discussion of the COG Protein Universe introduced in the last lesson. It is shown how Proteomics clusters are clearly seen in the Universe browser. This motivates a side remark on different clustering methods applied to metagenomics. Then we discuss the Generative Topographic Map GTM method that can be used in dimension reduction when original data is in a metric space and is in this case faster than MDS as GTM computational complexity scales like N not N squared as seen in MDS.

Examples are given of GTM including an application to topic models in Information Retrieval. Indiana University has developed a deterministic annealing improvement of GTM. 3 separate clusterings are projected for visualization and show very different structure emphasizing the importance of visualizing results of data analytics. The final slide shows an application of MDS to generate and visualize phylogenetic trees.

Video: http://youtu.be/_F1Eo6bfN0w

Lesson 15 - Genomics, Proteomics and Information Visualization III ¶

Video: http://youtu.be/R1svGGKipkc

Resources ¶

Slides ¶

https://iu.app.box.com/s/4v7omhmfpzd4y1bkpy9iab6o4jyephoa

Section 4 - Sports Case Study ¶

Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Unit 7 - Sports Informatics I : Sabermetrics (Basic)¶

Unit Overview ¶

This unit discusses baseball starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports like basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than the many-player phenomena in other sports. We discuss the Performance-Dollar relationship including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.

Slides ¶

https://iu.box.com/s/trsxko7icktb7htqfickfsws0cqmvt2j

Lesson 1 - Introduction and Sabermetrics (Baseball Informatics) Lesson ¶

Introduction to all Sports Informatics, Moneyball and the 2002-2003 Oakland Athletics, Diamond Dollars economic model of baseball, Performance - Dollar relationship, Value of a Win.

Video: http://youtu.be/oviNJ-_fLto

Lesson 2 - Basic Sabermetrics ¶

Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data, OPS, wOBA, ERA, ERC, FIP, UZR.

Video: http://youtu.be/-5JYfQXC2ew
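Two of the statistics named in this lesson, OPS and ERA, are simple enough to sketch from basic counting data. The code below uses the standard definitions (OPS is on-base percentage plus slugging percentage; ERA is earned runs per nine innings). The season totals are invented for illustration and do not describe any real player.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP: times on base divided by the plate appearances that count toward OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def earned_run_average(earned_runs, innings_pitched):
    """ERA: earned runs allowed per nine innings pitched."""
    return 9.0 * earned_runs / innings_pitched


# Hypothetical batter: 100 singles, 30 doubles, 5 triples, 15 home runs in 500 at bats.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5, at_bats=500, sacrifice_flies=5)
slg = slugging_percentage(singles=100, doubles=30, triples=5, home_runs=15, at_bats=500)
ops = obp + slg

# Hypothetical pitcher: 70 earned runs in 210 innings.
era = earned_run_average(earned_runs=70, innings_pitched=210)

print(f"OBP={obp:.3f} SLG={slg:.3f} OPS={ops:.3f} ERA={era:.2f}")
```

Measures such as wOBA, FIP and UZR need league-dependent weights or fielding data, so they are left out of this sketch.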

Lesson 3 - Wins Above Replacement ¶

Wins Above Replacement (WAR), Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another Sabermetrics Example, Summary of Sabermetrics.

Video: http://youtu.be/V5uzUS6jdHw

Resources ¶

Unit 8 - Sports Informatics II : Sabermetrics (Advanced)¶

Unit Overview ¶

This unit discusses ‘advanced sabermetrics’ covering advances possible from using video from PITCHf/X, FIELDf/X, HITf/X, COMMANDf/X and MLBAM.

Slides ¶

https://iu.box.com/s/o2kikemoh2580ohzt2pn3y3jps4f7wr3

Lesson 1 - Pitching Clustering ¶

A Big Data Pitcher Clustering method introduced by Vince Gennaro, Data from Blog and video at 2013 SABR conference.

Video: http://youtu.be/I06_AOKyB20

Lesson 2 - Pitcher Quality ¶

Results of optimizing match ups, Data from video at 2013 SABR conference.

Video: http://youtu.be/vAPJx8as4_0

Lesson 3 - PITCHf/X ¶

Examples of use of PITCHf/X.

Video: http://youtu.be/JN1-sCa9Bjs

Lesson 4 - Other Video Data Gathering in Baseball ¶

FIELDf/X, MLBAM, HITf/X, COMMANDf/X.

Video: http://youtu.be/zGGThkkIJg8

Resources ¶

Unit 9 - Sports Informatics III : Other Sports ¶

Unit Overview ¶

We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Slides ¶

https://iu.box.com/s/ho0ktliih8cj0oyl929axwwu6083e8ck

Lesson 1 - Wearables ¶

Consumer Sports, Stake Holders, and Multiple Factors.

Video: http://youtu.be/1UzvNHzFCFQ

Lesson 2 - Soccer and the Olympics ¶

Soccer, Tracking Players and Balls, Olympics.

Video: http://youtu.be/01mlZ2KBkzE

Lesson 3 - Spatial Visualization in NFL and NBA ¶

NFL, NBA, and Spatial Visualization.

Video: http://youtu.be/Q0Pt97BwRlo

Lesson 4 - Tennis and Horse Racing ¶

Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.

Video: http://youtu.be/EuXrtfHG3cY

Resources ¶

Section 5 - Technology Training - Python & FutureSystems (will be updated)¶

This section is meant to give an overview of the Python tools needed for this course.

These are really powerful tools which every data scientist who wishes to use python must know.

NumPy: This is a popular library on top of which many other libraries (like pandas and SciPy) are built. It provides a way of vectorizing data. This helps to organize data in a more intuitive fashion and also lets us use the various matrix operations that are popular in the machine learning community.

Matplotlib: This is a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.

SciPy: SciPy is a library built on top of NumPy and has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.
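As a small illustration of the vectorized style that NumPy provides, the sketch below applies one arithmetic operation to a whole array and performs a matrix-vector product without writing explicit loops. The array contents are made-up sample values.

```python
import numpy as np

# Vectorized arithmetic: the division is applied to every element at once.
heights_cm = np.array([170.0, 182.5, 165.0, 190.0])  # hypothetical measurements
heights_m = heights_cm / 100.0

# A matrix operation of the kind common in machine learning code.
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector  # matrix-vector multiplication

print(heights_m)
print(product)
```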

Unit 10 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib ¶

Unit Overview ¶

This unit is meant to give an overview of the Python tools needed for this course. These are really powerful tools which every data scientist who wishes to use Python must know.

Lesson 1 - Introduction ¶

This lesson is meant to give an overview of the Python tools needed for this course. These are really powerful tools which every data scientist who wishes to use Python must know. It covers NumPy, Matplotlib, and SciPy.

Pycharm ¶

PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with Git.

Video: https://youtu.be/X8ZpbZweJcw

Python in 45 minutes ¶

Here is an introductory video about the Python programming language that we found on the internet. Naturally there are many alternatives to this video, but the video is probably a good start. It also uses PyCharm which we recommend.

https://www.youtube.com/watch?v=N4mEzFDjqtA

How much Python you want to understand is up to you. While it is good to know classes and inheritance, you may be able to get through this class without using them. However, we do recommend that you learn them.

Lesson 3 - Numpy 1 ¶

Video: http://youtu.be/mN_JpGO9Y6s

Lesson 4 - Numpy 2 ¶

Continuation of Lesson 3 - Part 2

Video: http://youtu.be/7QfW7AT7UNU

Lesson 5 - Numpy 3 ¶

Continuation of Lesson 3 - Part 3

Video: http://youtu.be/Ccb67Q5gpsk

Lesson 6 - Matplotlib 1 ¶

Matplotlib: This is a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.

Video: http://youtu.be/3UOvB5OmtYE

Lesson 7 - Matplotlib 2 ¶

Continuation of Lesson 6 - Part 2

Video: http://youtu.be/9ONSnsN4hcg

Lesson 8 - Scipy 1 ¶

SciPy: SciPy is a library built on top of NumPy and has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.

Video: http://youtu.be/lpC6Mn-09jY

Lesson 9 - Scipy 2 ¶

Continuation of Lesson 8 - Part 2

Video: http://youtu.be/-XKBz7qCUqw

Unit 11 - Using FutureSystems (Please do not do yet)¶

Unit Overview ¶

This unit gives an overview of FutureSystems and how to use it for the Big Data course. It also covers creating a FutureSystems account, uploading an OpenID and SSH key, instantiating and logging into a virtual machine, and accessing IPython. At the end we discuss running Python and Java on the virtual machine.

Lesson 1 - FutureSystems Overview ¶

In this video we introduce FutureSystems in terms of its services and features.

FirstProgram.java: http://openedx.scholargrid.org:18010/c4x/SoIC/INFO-I-523/asset/FirstProgram.java

Video: http://youtu.be/RibpNSyd4qg

Lesson 2 - Creating Portal Account ¶

This lesson explains how to create a portal account, which is the first step in gaining access to FutureSystems.

See Lesson 4 and 7 for SSH key generation on Linux, OSX or Windows.

Video: http://youtu.be/X6zeVEALzTk

Lesson 3 - Upload an OpenId ¶

This lesson explains how to upload and use OpenID to easily log into the FutureSystems portal.

Video: http://youtu.be/rZzpCYWDEpI

Lesson 4 - SSH Key Generation using ssh-keygen command ¶

SSH keys are used to identify user accounts in most systems including FutureSystems. This lesson walks you through generating an SSH key via ssh-keygen command line tool.

Video: http://youtu.be/pQb2VV1zNIc

Lesson 5 - Shell Access via SSH ¶

This lesson explains how to access FutureSystems resources via an SSH terminal with your registered SSH key.

Video: http://youtu.be/aJDXfvOrzRE

Lesson 6 - Advanced SSH ¶

This lesson shows you how to write an SSH ‘config’ file for advanced settings.

Video: http://youtu.be/eYanElmtqMo

Lesson 7 - SSH Key Generation via putty (Windows user only)¶

This lesson is for Windows users.

You will learn how to create an SSH key using PuTTYgen, add the public key to your FutureSystems portal, and then log in using the PuTTY SSH client.

Video: http://youtu.be/irmVJKwWQCU

Lesson 8 - Using FS - Creating VM using Cloudmesh and running IPython ¶

This lesson explains how to log into FutureSystems and use our customized shell and menu options, which simplify management of the VMs for the upcoming lessons.

Instruction is at: http://cloudmesh.github.io/introduction_to_cloud_computing/class/cm-mooc/cm-mooc.html

Video: http://youtu.be/nbZbJxheLwc

Section 6 - Physics Case Study ¶

This section starts by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper and remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various shapes of signal, various backgrounds and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation as to why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random Numbers with their Generators and Seeds lead to a discussion of the Binomial and Poisson Distributions and of Monte-Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

Unit 12 - I: Looking for Higgs Particles, Bumps in Histograms, Experiments and Accelerators ¶

Unit Overview ¶

This unit is devoted to Python and Java experiments looking at histograms of Higgs Boson production with various forms of shape of signal and various background and with various event totals. The lectures use Python but use of Java is described.

Slides ¶

https://iu.app.box.com/s/6uz4ofnnd9usv75cab71

Files ¶

HiggsClassI-Sloping.py

Lesson 1 - Looking for Higgs Particle and Counting Introduction I ¶

We return to the particle case with slides used in the introduction and stress that particles are often manifested as bumps in histograms, and that those bumps need to be large enough to stand out from the background in a statistically significant fashion.

Video: http://youtu.be/VQAupoFUWTg

Lesson 2 - Looking for Higgs Particle and Counting Introduction II ¶

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

Video: http://youtu.be/UAMzmOgjj7I

Lesson 3 - Physics-Informatics Looking for Higgs Particle Experiments ¶

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

Video: http://youtu.be/BW12d780qT8

Lesson 4 - Accelerator Picture Gallery of Big Science ¶

This lesson gives a small picture gallery of accelerators. It shows accelerators, detection chambers and magnets in tunnels, and a large underground laboratory used for experiments where you need to be shielded from backgrounds like cosmic rays.

Video: http://youtu.be/WLJIxWWMYi8

Resources ¶

Unit 13 - II: Looking for Higgs Particles: Python Event Counting for Signal and Background ¶

Unit Overview ¶

This unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of shape of signal and various background and with various event totals.

Slides ¶

https://iu.app.box.com/s/77iw9brrugz2pjoq6fw1

Files ¶

Lesson 1 - Physics Use Case II 1: Class Software ¶

We discuss how this unit uses Java and Python on either a backend server (FutureGrid) or a local client. We point out a useful book on Python for data analysis. This builds on the technology training in Section 5.

Video: http://youtu.be/tOFJEUM-Vww

Lesson 2 - Physics Use Case II 2: Event Counting ¶

We define ‘event counting’ data collection environments. We discuss the Python and Java code to generate events according to a particular scenario (the important idea of Monte Carlo data). Here the scenario is a sloping background plus either a Higgs particle generated similarly to the LHC observation or one observed with better resolution (smaller measurement error).

Video: http://youtu.be/h8-szCeFugQ
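The scenario in this lesson, a sloping background plus a Higgs-like bump, can be sketched with the Python standard library alone. This is not the class file HiggsClassI-Sloping.py; the mass range, peak position, resolution, event totals and function names are all assumptions chosen for illustration.

```python
import random

# All numbers below are illustrative assumptions, not values from the lectures.
MASS_LOW, MASS_HIGH = 110.0, 140.0    # histogram range in GeV
HIGGS_MASS, HIGGS_WIDTH = 126.0, 2.0  # peak position and measurement resolution
N_BINS = 15


def generate_events(n_background, n_signal, rng):
    """Return a list of simulated mass values (Monte Carlo data)."""
    # Sloping background: a triangular distribution with its mode at the low edge
    # has a density that falls linearly from MASS_LOW to MASS_HIGH.
    background = [rng.triangular(MASS_LOW, MASS_HIGH, MASS_LOW)
                  for _ in range(n_background)]
    # Signal: a Gaussian bump whose width models the measurement error.
    signal = [rng.gauss(HIGGS_MASS, HIGGS_WIDTH) for _ in range(n_signal)]
    return background + signal


def histogram(values, low, high, n_bins):
    """Count values per equal-width bin; values outside the range are dropped."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        index = int((value - low) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


rng = random.Random(2015)  # fixed seed so the run is repeatable
events = generate_events(n_background=1000, n_signal=100, rng=rng)
counts = histogram(events, MASS_LOW, MASS_HIGH, N_BINS)

bin_width = (MASS_HIGH - MASS_LOW) / N_BINS
for i, count in enumerate(counts):
    # Crude text histogram: one '#' per four events in the bin.
    print(f"{MASS_LOW + i * bin_width:6.1f} GeV {'#' * (count // 4)}")
```

Lowering `HIGGS_WIDTH` mimics the better-resolution case, and changing the two event totals shows how easily the bump is lost in the background.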

Lesson 3 - Physics Use Case II 3: With Python examples of Signal plus Background ¶

This lesson uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and the measurement resolution for the Higgs.

Video: http://youtu.be/bl2f0tAzLj4

Lesson 4 - Physics Use Case II 4: Change shape of background & num of Higgs Particles ¶

This lesson continues the examination of Monte Carlo data, looking at the effect of changes in the number of Higgs particles produced and in the shape of the background.

Video: http://youtu.be/bw3fd5cfQhk

Resources ¶

Python for Data Analysis: Agile Tools for Real World Data By Wes McKinney, Publisher: O’Reilly Media, Released: October 2012, Pages: 472.
http://jwork.org/scavis/api/
https://en.wikipedia.org/wiki/DataMelt

Unit 14 - III: Looking for Higgs Particles: Random Variables, Physics and Normal Distributions ¶

Unit Overview ¶

We introduce random variables and some simple principles of statistics and explain why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is currently not available in this unit.

Slides ¶

https://iu.app.box.com/s/bcyze7h8knj6kvhyr05y

HiggsClassIII.py

Lesson 1 - Statistics Overview and Fundamental Idea: Random Variables ¶

We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.

Video: http://youtu.be/0oZzALLzYBM

Lesson 2 - Physics and Random Variables I ¶

We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what's going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

Video: http://youtu.be/Tn3GBxgplxg

Lesson 3 - Physics and Random Variables II ¶

Video: http://youtu.be/qWEjp0OtvdA

Lesson 4 - Statistics of Events with Normal Distributions ¶

We introduce Poisson and Binomial distributions and define independent identically distributed (IID) random variables. We give the law of large numbers defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.

Video: http://youtu.be/LMBtpWOOQLo
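
As a rough illustration of the counting errors described in this lesson, the following minimal sketch repeats a simple counting experiment many times and compares the observed spread of the counts with the square root of the mean count. The function names, the number of trials and the probability are assumptions chosen for illustration; they are not taken from the class files.

```python
import random
import statistics


def counting_experiment(n_trials, p, rng):
    """Count how many of n_trials independent events are selected, each with probability p."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)


def repeat_experiment(n_repeats, n_trials, p, seed=1234):
    """Repeat the counting experiment n_repeats times with a fixed seed."""
    rng = random.Random(seed)
    return [counting_experiment(n_trials, p, rng) for _ in range(n_repeats)]


if __name__ == "__main__":
    # Hypothetical numbers: 2000 events per experiment, 5% selected on average.
    counts = repeat_experiment(n_repeats=500, n_trials=2000, p=0.05)
    mean = statistics.mean(counts)
    spread = statistics.stdev(counts)
    print(f"mean count      = {mean:.1f}")
    print(f"observed spread = {spread:.2f}")
    print(f"sqrt(mean)      = {mean ** 0.5:.2f}")
```

When the selection probability is small, the observed spread should be close to the square root of the mean count, which is the usual rule of thumb for the error on a counted number of events.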

Lesson 5 - Gaussian Distributions ¶

We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.

Video: http://youtu.be/LWIbPa-P5W0

Lesson 6 - Using Statistics ¶

We discuss the significance of a standard deviation and, with a Python example, the role of biases and insufficient statistics in producing incorrect answers.

Video: http://youtu.be/n4jlUrGwgic
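
To make the significance of a standard deviation concrete, this short sketch computes, for a Gaussian distribution, the probability of lying within k standard deviations of the mean and the one-sided probability of a fluctuation beyond k standard deviations. The function names are illustrative assumptions; only the standard `math` module is used.

```python
import math


def prob_within(k_sigma):
    """Probability that a Gaussian variable lies within k_sigma standard deviations of its mean."""
    return math.erf(k_sigma / math.sqrt(2.0))


def prob_beyond_one_sided(k_sigma):
    """Probability that a Gaussian variable fluctuates above the mean by more than k_sigma."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        print(f"{k} sigma: within = {prob_within(k):.7f}, "
              f"one-sided tail = {prob_beyond_one_sided(k):.3e}")
```

About 68% of values fall within one standard deviation and about 95% within two, so a one or two standard deviation effect is expected to occur quite often by chance.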

Resources ¶

Unit 15 - IV: Looking for Higgs Particles: Random Numbers, Distributions and Central Limit Theorem ¶

Unit Overview ¶

We discuss Random Numbers with their Generators and Seeds. The unit introduces the Binomial and Poisson Distributions. Monte Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes law conclude the discussion. Python and Java (for students; not reviewed in class) examples and Physics applications are given.

Slides ¶

https://iu.app.box.com/s/me7738igixwzc9h9qwe1

Files ¶

HiggsClassIII.py

Lesson 1 - Generators and Seeds I ¶

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed, which specifies how generation starts.

Video: http://youtu.be/76jbRphjRWo

Lesson 2 - Generators and Seeds II ¶

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed, which specifies how generation starts.

Video: http://youtu.be/9QY5qkQj2Ag
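
The role of the seed can be sketched with Python's standard `random` module, whose generator is a Mersenne Twister. The helper name `draw_sequence` and the seed values are assumptions made for this illustration.

```python
import random


def draw_sequence(seed, n=5):
    """Return n pseudorandom numbers in [0, 1) from a generator started with the given seed."""
    rng = random.Random(seed)  # independent generator, so the global state is untouched
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    first = draw_sequence(42)
    second = draw_sequence(42)
    other = draw_sequence(43)
    print("same seed gives same sequence:     ", first == second)
    print("different seed, different sequence:", first != other)
```

Fixing the seed makes a simulation reproducible, while changing it gives a statistically independent repetition of the same experiment.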

Lesson 3 - Binomial Distribution ¶

We define the binomial distribution and give LHC data as an example of where this distribution is valid.

Video: http://youtu.be/DPd-eVI_twQ
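
A minimal sketch of the binomial distribution follows: the probability of exactly k successes in n independent trials, each succeeding with probability p. The numbers used (20 trials, probability 0.1) are hypothetical and only stand in for, say, events that pass a selection cut.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binomial_mean(n, p):
    """Mean number of successes, computed directly from the probabilities."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))


if __name__ == "__main__":
    n, p = 20, 0.1  # hypothetical values for illustration
    for k in range(6):
        print(f"P({k} successes) = {binomial_pmf(k, n, p):.4f}")
    print(f"mean = {binomial_mean(n, p):.4f} (expected n*p = {n * p})")
```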

Lesson 4 - Accept-Reject ¶

We introduce an advanced method, accept/reject, for generating random variables with arbitrary distributions.

Video: http://youtu.be/GfshkKMKCj8
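
The accept/reject idea can be sketched as follows: draw a candidate x uniformly in the allowed range, draw a second uniform number between 0 and an upper bound on the density, and keep x only if the second number falls below the density at x. The target shape x squared on [0, 1] and all names here are assumptions chosen to keep the example small.

```python
import random


def accept_reject(density, x_min, x_max, density_max, n_samples, seed=0):
    """Generate n_samples values distributed according to an (unnormalized) density.

    density_max must be an upper bound on density(x) over [x_min, x_max].
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform(x_min, x_max)       # candidate value
        u = rng.uniform(0.0, density_max)   # uniform height under the bounding box
        if u < density(x):                  # accept with probability density(x) / density_max
            samples.append(x)
    return samples


if __name__ == "__main__":
    # Hypothetical target shape: density proportional to x**2 on [0, 1].
    values = accept_reject(lambda x: x * x, 0.0, 1.0, 1.0, n_samples=20000)
    mean = sum(values) / len(values)
    print(f"sample mean = {mean:.3f} (exact mean of this shape is 0.75)")
```

The method needs only the ability to evaluate the density, at the cost of discarding rejected candidates; here about two thirds of the candidates are rejected.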

Lesson 5 - Monte Carlo Method ¶

We define the Monte Carlo method, which in the typical case uses the accept/reject method to sample a distribution.

Video: http://youtu.be/kIQ-BTyDfOQ
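
As a simple sketch of a Monte Carlo calculation built on accepting or rejecting random points, the example below estimates pi from the fraction of uniformly thrown points that land inside a quarter circle. This is a standard textbook illustration and not necessarily the example used in the lesson; the function name is an assumption.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_points):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:  # point accepted: inside the quarter circle of radius 1
            accepted += 1
    return 4.0 * accepted / n_points


if __name__ == "__main__":
    for n in (100, 10000, 1000000):
        print(f"{n:>8} points: pi is approximately {estimate_pi(n):.4f}")
```

The statistical error of such an estimate falls only as one over the square root of the number of points, which is the same counting error discussed in the previous unit.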

Lesson 6 - Poisson Distribution ¶

We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.

Video: http://youtu.be/WFvgsVo-k4s
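
The step from the Binomial to the Poisson distribution can be checked numerically: holding the mean n*p fixed while n grows, the binomial probabilities approach the Poisson probabilities. The sketch below uses a hypothetical mean of 3 and illustrative function names.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected number of events is mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def max_difference(n, mean, k_max=15):
    """Largest difference between binomial(n, mean/n) and Poisson(mean) for k = 0..k_max."""
    p = mean / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
               for k in range(min(k_max, n) + 1))


if __name__ == "__main__":
    for n in (10, 100, 10000):
        print(f"n = {n:>5}: largest difference = {max_difference(n, 3.0):.2e}")
```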

Lesson 7 - Central Limit Theorem ¶

We introduce the Central Limit Theorem and give examples from Wikipedia.

Video: http://youtu.be/ZO53iKlPn7c
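
A small numerical sketch of the Central Limit Theorem: averages of several uniform random numbers are far from uniform and already look close to Gaussian. The check used here is the fraction of averages lying within one standard deviation of the mean, which is about 0.68 for a Gaussian. The choice of 12 terms and the function names are assumptions.

```python
import random


def sample_means(n_terms, n_repeats, seed=2024):
    """Return n_repeats averages, each over n_terms uniform random numbers in [0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_terms)) / n_terms
            for _ in range(n_repeats)]


def fraction_within_one_sigma(values, mu, sigma):
    """Fraction of values lying within one standard deviation sigma of the mean mu."""
    inside = sum(1 for v in values if abs(v - mu) < sigma)
    return inside / len(values)


if __name__ == "__main__":
    n_terms = 12
    means = sample_means(n_terms, n_repeats=20000)
    # A uniform number on [0, 1) has mean 1/2 and variance 1/12,
    # so the average of n_terms of them has variance 1 / (12 * n_terms).
    sigma = (1.0 / (12.0 * n_terms)) ** 0.5
    frac = fraction_within_one_sigma(means, 0.5, sigma)
    print(f"fraction within one sigma = {frac:.3f} (Gaussian value is about 0.683)")
```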

Lesson 8 - Interpretation of Probability: Bayes v. Frequency ¶

This lesson describes the difference between the Bayes and frequency views of probability. Bayes's law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.

Video: http://youtu.be/jzDkExAQI9M
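
The accumulation of information from several channels can be sketched with repeated applications of Bayes's law, where the posterior from one channel becomes the prior for the next. The sketch assumes the channels are independent given the hypothesis, and the likelihood values are invented purely for illustration; they are not measured Higgs numbers.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after one observation, from Bayes's law."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    return likelihood_if_true * prior / evidence


def combine_channels(prior, channels):
    """Apply Bayes's law once per channel, assuming channels are independent given the hypothesis."""
    posterior = prior
    for likelihood_if_true, likelihood_if_false in channels:
        posterior = bayes_update(posterior, likelihood_if_true, likelihood_if_false)
    return posterior


if __name__ == "__main__":
    # Hypothetical likelihoods of the observed data in two channels,
    # given that the signal exists (first number) or does not (second number).
    channels = [(0.8, 0.2), (0.8, 0.2)]
    print(f"after one channel:  {combine_channels(0.5, channels[:1]):.3f}")
    print(f"after two channels: {combine_channels(0.5, channels):.3f}")
```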

Resources ¶

https://en.wikipedia.org/wiki/Pseudorandom_number_generator
https://en.wikipedia.org/wiki/Mersenne_Twister
https://en.wikipedia.org/wiki/Mersenne_prime
CMS-PAS-HIG-12-041 Updated results on the new boson discovered in the search for the standard model Higgs boson in the ZZ to 4 leptons channel in pp collisions at sqrt(s) = 7 and 8 TeV http://cds.cern.ch/record/1494488?ln=en
https://en.wikipedia.org/wiki/Poisson_distribution
https://en.wikipedia.org/wiki/Central_limit_theorem
http://jwork.org/scavis/api/
https://en.wikipedia.org/wiki/DataMelt

Section 7 - Big Data Use Cases Survey ¶

This section covers 51 values of X and an overall study of Big Data that emerged from a NIST (National Institute of Standards and Technology) study of Big Data. The section covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: Definitions and Taxonomies Subgroup, Reference Architecture Subgroup, Security and Privacy Subgroup, Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The 51 use cases collected in this process are briefly discussed with a classification of the source of parallelism and the high and low level computational structure. We describe the key features of this classification.

Unit 16 - Overview of NIST Big Data Public Working Group (NBD-PWG) Process and Results ¶

Unit Overview ¶

This unit covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: Definitions and Taxonomies Subgroup, Reference Architecture Subgroup, Security and Privacy Subgroup, Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The work of the latter is continued in the next two units.

Slides ¶

https://iu.app.box.com/s/bgr7lyaz7uazcarangqd

Lesson 1 - Introduction to NIST Big Data Public Working Group (NBD-PWG) Process ¶

The focus of the NBD-PWG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, secure reference architectures, and a technology roadmap. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables to enable big data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters while allowing value-added from big data service providers and flow of data between the stakeholders in a cohesive and secure manner.

Video: http://youtu.be/ofRfHBKpyvg

Lesson 2 - Definitions and Taxonomies Subgroup ¶

The focus is to gain a better understanding of the principles of Big Data. It is important to develop a consensus-based common language and vocabulary of terms used in Big Data across stakeholders from industry, academia, and government. In addition, it is also critical to identify essential actors with their roles and responsibilities, and to subdivide them into components and sub-components according to how they interact with and relate to each other and to their similarities and differences.

For Definitions: Compile terms used by all stakeholders regarding the meaning of Big Data from various standards bodies, domain applications, and diversified operational environments. For Taxonomies: Identify key actors with their roles and responsibilities from all stakeholders, and categorize them into components and subcomponents based on their similarities and differences. In particular, Data Science and Big Data terms are discussed.

Video: http://youtu.be/sGshHN-DdbE

Lesson 3 - Reference Architecture Subgroup ¶

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus-based approach to orchestrating vendor-neutral, technology- and infrastructure-agnostic analytics tools and computing environments. The goal is to enable Big Data stakeholders to pick and choose technology-agnostic analytics tools for processing and visualization in any computing platform and cluster while allowing value-added from Big Data service providers and the flow of the data between the stakeholders in a cohesive and secure manner. Results include a reference architecture with well-defined components and linkage as well as several exemplars.

Video: http://youtu.be/JV596ZH36YA

Lesson 4 - Security and Privacy Subgroup ¶

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture to handle security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifying which key organizations are working on these standards. The Top Ten Big Data Security and Privacy Challenges from the CSA (Cloud Security Alliance) BDWG are studied. Specialized use cases include Retail/Marketing, Modern Day Consumerism, Nielsen Homescan, Web Traffic Analysis, Healthcare, Health Information Exchange, Genetic Privacy, Pharma Clinical Trial Data Sharing, Cyber-security, Government, Military and Education.

Video: http://youtu.be/Gbk0LaWE3lM

Lesson 5 - Technology Roadmap Subgroup ¶

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus vision with recommendations on how Big Data should move forward by performing a good gap analysis through the materials gathered from all other NBD subgroups. This includes setting standardization and adoption priorities through an understanding of what standards are available or under development as part of the recommendations. Tasks are to gather input from NBD subgroups and study the taxonomies for the actors' roles and responsibilities, use cases and requirements, and secure reference architecture; gain an understanding of what standards are available or under development for Big Data; perform a thorough gap analysis and document the findings; identify what possible barriers may delay or prevent adoption of Big Data; and document vision and recommendations.

Video: http://youtu.be/GCc9yfErmd0

Lesson 6 - Requirements and Use Case Subgroup Introduction I ¶

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the "essence" of use cases (not done yet); and work with Reference Architecture to validate requirements and reference architecture by explicitly implementing some patterns based on use cases. The progress of gathering use cases (discussed in the next two units) and of requirements systemization is discussed.

Video: http://youtu.be/sztqNXJ9P6c

Lesson 7 - Requirements and Use Case Subgroup Introduction II ¶

Video: http://youtu.be/0sbfIqHUauI

Lesson 8 - Requirements and Use Case Subgroup Introduction III ¶

Video: http://youtu.be/u59559nqjiY

Resources ¶

NIST Big Data Public Working Group (NBD-PWG) Process https://www.nist.gov/el/cyber-physical-systems/big-data-pwg
Big Data Definitions: http://dx.doi.org/10.6028/NIST.SP.1500-1
Big Data Taxonomies: http://dx.doi.org/10.6028/NIST.SP.1500-2
Big Data Use Cases and Requirements: http://dx.doi.org/10.6028/NIST.SP.1500-3
Big Data Security and Privacy: http://dx.doi.org/10.6028/NIST.SP.1500-4
Big Data Architecture White Paper Survey: http://dx.doi.org/10.6028/NIST.SP.1500-5
Big Data Reference Architecture: http://dx.doi.org/10.6028/NIST.SP.1500-6
Big Data Standards Roadmap: http://dx.doi.org/10.6028/NIST.SP.1500-7

Some of the links below may be outdated. Please notify us of any outdated links and let us know the new ones.

DCGSA Standard Cloud: https://www.youtube.com/watch?v=l4Qii7T8zeg
On line 51 Use Cases http://bigdatawg.nist.gov/usecases.php
Summary of Requirements Subgroup http://bigdatawg.nist.gov/_uploadfiles/M0245_v5_6066621242.docx
Use Case 6 Mendeley http://mendeley.com , http://dev.mendeley.com
Use Case 7 Netflix http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutoria
Use Case 8 Search http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013, http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html, http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws, http://www.slideshare.net/beechung/recommender-systems-tutorialpart1intro, http://www.worldwidewebsize.com/
Use Case 9 IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System provided by Cloud Service Providers (CSPs) and Cloud Brokerage Service Providers (CBSPs) http://www.disasterrecovery.org/
Use Case 11 and Use Case 12 Simulation driven Materials Genomics https://www.materialsproject.org/
Use Case 13 Large Scale Geospatial Analysis and Visualization http://www.opengeospatial.org/standards, http://geojson.org/ , http://earth-info.nga.mil/publications/specs/printed/CADRG/cadrg.html
Use Case 14 Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance http://www.militaryaerospace.com/topics/m/video/79088650/persistent-surveillance-relies-on-extracting-relevant-data-points-and-connecting-the-dots.htm, http://www.defencetalk.com/wide-area-persistent-surveillance-revolutionizes-tactical-isr-45745/
Use Case 15 Intelligence Data Processing and Analysis http://www.afcea-aberdeen.org/files/presentations/AFCEAAberdeen_DCGSA_COLWells_PS.pdf, http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012_T14_SmithEtAl_HorizontalIntegrationOfWarfighterIntel.pdf, http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf, https://www.youtube.com/watch?v=l4Qii7T8zeg, http://dcgsa.apg.army.mil/
Use Case 16 Electronic Medical Record (EMR) Data: Regenstrief Institute , Logical observation identifiers names and codes , Indiana Health Information Exchange , Institute of Medicine Learning Healthcare System
Use Case 17 Pathology Imaging/digital pathology; https://web.cci.emory.edu/confluence/display/PAIS , https://web.cci.emory.edu/confluence/display/HadoopGIS
Use Case 19 Genome in a Bottle Consortium: www.genomeinabottle.org
Use Case 20 Comparative analysis for metagenomes and genomes http://img.jgi.doe.gov/
Use Case 25 Biodiversity and LifeWatch
Use Case 26 Deep Learning: Recent popular press coverage of deep learning technology: http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html , http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html , http://www.wired.com/2013/06/andrew_ng/ ; A recent research paper on HPC for Deep Learning: http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf ; Widely-used tutorials and references for Deep Learning: http://ufldl.stanford.edu/wiki/index.php/Main_Page , http://deeplearning.net/
Use Case 27 Organizing large-scale, unstructured collections of consumer photos http://vision.soic.indiana.edu/projects/disco/
Use Case 28 Truthy: Information diffusion research from Twitter Data http://truthy.indiana.edu/ , http://cnets.indiana.edu/groups/nan/truthy/ , http://cnets.indiana.edu/groups/nan/despic/
Use Case 30 CINET: Cyberinfrastructure for Network (Graph) Science and Analytics http://cinet.vbi.vt.edu/cinet_new/
Use Case 31 NIST Information Access Division analytic technology performance measurement, evaluations, and standards http://www.nist.gov/itl/iad/
Use Case 32 DataNet Federation Consortium DFC: The DataNet Federation Consortium , iRODS
Use Case 33 The ‘Discinnet process’, metadata < - > big data global experiment http://www.discinnet.org/
Use Case 34 Semantic Graph-search on Scientific Chemical and Text-based Data http://www.eurekalert.org/pub_releases/2013-07/aiop-ffm071813.php , http://xpdb.nist.gov/chemblast/pdb.pl
Use Case 35 Light source beamlines http://www-als.lbl.gov/ , https://www1.aps.anl.gov/
Use Case 36 CRTS survey , CSS survey ; For an overview of the classification challenges, see, e.g., http://arxiv.org/abs/1209.1681
Use Case 37 DOE Extreme Data from Cosmological Sky Survey and Simulations http://www.lsst.org/lsst/ , http://www.nersc.gov/ , http://www.nersc.gov/assets/Uploads/HabibcosmosimV2.pdf
Use Case 38 Large Survey Data for Cosmology http://desi.lbl.gov/ , http://www.darkenergysurvey.org/
Use Case 39 Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf , http://www.es.net/assets/pubs_presos/High-throughput-lessons-from-the-LHC-experience.Johnston.TNC2013.pdf
Use Case 40 Belle II High Energy Physics Experiment http://belle2.kek.jp/
Use Case 41 EISCAT 3D incoherent scatter radar system https://www.eiscat3d.se/
Use Case 42 ENVRI, Common Operations of Environmental Research Infrastructure, ENVRI Project website , ENVRI Reference Model , ENVRI deliverable D3.2 : Analysis of common requirements of Environmental Research Infrastructures , ICOS , Euro - Argo , EISCAT 3D , LifeWatch , EPOS , EMSO
Use Case 43 Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets https://www.cresis.ku.edu/
Use Case 44 UAVSAR Data Processing, Data Product Delivery, and Data Services http://uavsar.jpl.nasa.gov/ , http://www.asf.alaska.edu/program/sdc , http://geo-gateway.org/main.html
Use Case 47 Atmospheric Turbulence - Event Discovery and Predictive Analytics http://oceanworld.tamu.edu/resources/oceanography-book/teleconnections.htm , http://www.forbes.com/sites/toddwoody/2012/03/21/meet-the-scientists-mining-big-data-to-predict-the-weather/
Use Case 48 Climate Studies using the Community Earth System Model at DOE's NERSC center http://www-pcmdi.llnl.gov/ , http://www.nersc.gov/ , http://science.energy.gov/ber/research/cesd/ , http://www2.cisl.ucar.edu/
Use Case 50 DOE-BER AmeriFlux and FLUXNET Networks http://ameriflux.lbl.gov/ , http://www.fluxdata.org/default.aspx
Use Case 51 Consumption forecasting in Smart Grids http://smartgrid.usc.edu/, http://ganges.usc.edu/wiki/Smart_Grid, https://www.ladwp.com/ladwp/faces/ladwp/aboutus/a-power/a-p-smartgridla?_afrLoop=157401916661989&_afrWindowMode=0&_afrWindowId=null#%40%3F_afrWindowId%3Dnull%26_afrLoop%3D157401916661989%26_afrWindowMode%3D0%26_adf.ctrl-state%3Db7yulr4rl_17, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6475927

Unit 17 - 51 Big Data Use Cases ¶

Unit Overview ¶

This unit consists of one or more slides for each of the 51 use cases; typically the additional slides (when there is more than one) contain pictures. Each of the use cases is identified with its source of parallelism and its high and low level computational structure. As each new classification topic is introduced we briefly discuss it, but a full discussion of the topics is given in the following unit.

Slides ¶

https://iu.app.box.com/s/cvki350s0a12o404a524

Lesson 1 - Government Use Cases I ¶

This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).

Video: http://youtu.be/gCqBFYDDzSQ

Lesson 2 - Government Use Cases II ¶

Video: http://youtu.be/y0nIed-Nxjw

Lesson 3 - Commercial Use Cases I ¶

This covers Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

Video: http://youtu.be/P1iuViI-AKc

Lesson 4 - Commercial Use Cases II ¶

Video: http://youtu.be/epFH4w_Q9lc

Lesson 5 - Commercial Use Cases III ¶

Video: http://youtu.be/j5kWjL4y7Bo

Lesson 6 - Defense Use Cases I ¶

This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance and Intelligence Data Processing and Analysis.

Video: http://youtu.be/8hXG7dinhjg

Lesson 7 - Defense Use Cases II ¶

Video: http://youtu.be/MplyAfmuxko

Lesson 8 - Healthcare and Life Science Use Cases I ¶

This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

Video: http://youtu.be/jVARCWVeYxQ

Lesson 9 - Healthcare and Life Science Use Cases II ¶

Video: http://youtu.be/y9zJzrH4P8k

Lesson 10 - Healthcare and Life Science Use Cases III ¶

Video: http://youtu.be/eU5emeI3AmM

Lesson 12 - Research Ecosystem Use Cases ¶

DataNet Federation Consortium DFC; The ‘Discinnet process’, metadata < - > big data global experiment; Semantic Graph-search on Scientific Chemical and Text-based Data and Light source beamlines.

Video: http://youtu.be/pZ6JucTCKcw

Lesson 13 - Astronomy and Physics Use Cases I ¶

This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.

Video: http://youtu.be/rWqkF-b3Kwk

Lesson 14 - Astronomy and Physics Use Cases II ¶

Video: http://youtu.be/RxLCB6yLmpk

Lesson 15 - Environment, Earth and Polar Science Use Cases I ¶

EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, Data Product Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE’s NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area and DOE-BER AmeriFlux and FLUXNET Networks.

Video: http://youtu.be/u2zTIGwsJwU

Lesson 16 - Environment, Earth and Polar Science Use Cases II ¶

Video: http://youtu.be/sH3B3gXuJ7E

Lesson 17 - Energy Use Case ¶

This covers Consumption forecasting in Smart Grids.

Video: http://youtu.be/ttmVypmgWmw

Resources ¶

DCGSA Standard Cloud: https://www.youtube.com/watch?v=l4Qii7T8zeg
NIST Big Data Public Working Group (NBD-PWG) Process http://bigdatawg.nist.gov/home.php
On line 51 Use Cases http://bigdatawg.nist.gov/usecases.php
Summary of Requirements Subgroup http://bigdatawg.nist.gov/_uploadfiles/M0245_v5_6066621242.docx
Use Case 6 Mendeley http://mendeley.com , http://dev.mendeley.com
Use Case 7 Netflix http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutoria
Use Case 8 Search http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013 , http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html , http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws , http://www.slideshare.net/beechung/recommender-systems-tutorialpart1intro , http://www.worldwidewebsize.com/
Use Case 9 IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System provided by Cloud Service Providers (CSPs) and Cloud Brokerage Service Providers (CBSPs) http://www.disasterrecovery.org/
Use Case 11 and Use Case 12 Simulation driven Materials Genomics https://www.materialsproject.org/
Use Case 13 Large Scale Geospatial Analysis and Visualization http://www.opengeospatial.org/standards , http://geojson.org/ , http://earth-info.nga.mil/publications/specs/printed/CADRG/cadrg.html
Use Case 14 Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance http://www.militaryaerospace.com/topics/m/video/79088650/persistent-surveillance-relies-on-extracting-relevant-data-points-and-connecting-the-dots.htm , http://www.defencetalk.com/wide-area-persistent-surveillance-revolutionizes-tactical-isr-45745/
Use Case 15 Intelligence Data Processing and Analysis http://www.afcea-aberdeen.org/files/presentations/AFCEAAberdeen_DCGSA_COLWells_PS.pdf , http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012_T14_SmithEtAl_HorizontalIntegrationOfWarfighterIntel.pdf , http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf , https://www.youtube.com/watch?v=l4Qii7T8zeg , http://dcgsa.apg.army.mil/
Use Case 16 Electronic Medical Record (EMR) Data: Regenstrief Institute , Logical observation identifiers names and codes , Indiana Health Information Exchange , Institute of Medicine Learning Healthcare System
Use Case 17 Pathology Imaging/digital pathology; https://web.cci.emory.edu/confluence/display/PAIS , https://web.cci.emory.edu/confluence/display/HadoopGIS
Use Case 19 Genome in a Bottle Consortium: www.genomeinabottle.org
Use Case 20 Comparative analysis for metagenomes and genomes http://img.jgi.doe.gov/
Use Case 25 Biodiversity and LifeWatch
Use Case 26 Deep Learning: Recent popular press coverage of deep learning technology: http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html , http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html , http://www.wired.com/2013/06/andrew_ng/ ; A recent research paper on HPC for Deep Learning: http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf Widely-used tutorials and references for Deep Learning: http://ufldl.stanford.edu/wiki/index.php/Main_Page , http://deeplearning.net/
Use Case 27 Organizing large-scale, unstructured collections of consumer photos http://vision.soic.indiana.edu/projects/disco/
Use Case 28 Truthy: Information diffusion research from Twitter Data http://truthy.indiana.edu/ , http://cnets.indiana.edu/groups/nan/truthy/ , http://cnets.indiana.edu/groups/nan/despic/
Use Case 30 CINET: Cyberinfrastructure for Network (Graph) Science and Analytics http://cinet.vbi.vt.edu/cinet_new/
Use Case 31 NIST Information Access Division analytic technology performance measurement, evaluations, and standards http://www.nist.gov/itl/iad/
Use Case 32 DataNet Federation Consortium DFC: The DataNet Federation Consortium , iRODS
Use Case 33 The ‘Discinnet process’, metadata < - > big data global experiment http://www.discinnet.org/
Use Case 34 Semantic Graph-search on Scientific Chemical and Text-based Data http://www.eurekalert.org/pub_releases/2013-07/aiop-ffm071813.php , http://xpdb.nist.gov/chemblast/pdb.pl
Use Case 35 Light source beamlines http://www-als.lbl.gov/ , https://www1.aps.anl.gov/
Use Case 36 CRTS survey , CSS survey ; For an overview of the classification challenges, see, e.g., http://arxiv.org/abs/1209.1681
Use Case 37 DOE Extreme Data from Cosmological Sky Survey and Simulations http://www.lsst.org/lsst/ , http://www.nersc.gov/ , http://www.nersc.gov/assets/Uploads/HabibcosmosimV2.pdf
Use Case 38 Large Survey Data for Cosmology http://desi.lbl.gov/ , http://www.darkenergysurvey.org/
Use Case 39 Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf , http://www.es.net/assets/pubs_presos/High-throughput-lessons-from-the-LHC-experience.Johnston.TNC2013.pdf
Use Case 40 Belle II High Energy Physics Experiment http://belle2.kek.jp/
Use Case 41 EISCAT 3D incoherent scatter radar system https://www.eiscat3d.se/
Use Case 42 ENVRI, Common Operations of Environmental Research Infrastructure, ENVRI Project website , ENVRI Reference Model , ENVRI deliverable D3.2 : Analysis of common requirements of Environmental Research Infrastructures , ICOS , Euro - Argo , EISCAT 3D , LifeWatch , EPOS , EMSO
Use Case 43 Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets https://www.cresis.ku.edu/
Use Case 44 UAVSAR Data Processing, Data Product Delivery, and Data Services http://uavsar.jpl.nasa.gov/ , http://www.asf.alaska.edu/program/sdc , http://geo-gateway.org/main.html
Use Case 47 Atmospheric Turbulence - Event Discovery and Predictive Analytics http://oceanworld.tamu.edu/resources/oceanography-book/teleconnections.htm , http://www.forbes.com/sites/toddwoody/2012/03/21/meet-the-scientists-mining-big-data-to-predict-the-weather/
Use Case 48 Climate Studies using the Community Earth System Model at DOE's NERSC center http://www-pcmdi.llnl.gov/ , http://www.nersc.gov/ , http://science.energy.gov/ber/research/cesd/ , http://www2.cisl.ucar.edu/
Use Case 50 DOE-BER AmeriFlux and FLUXNET Networks http://ameriflux.lbl.gov/ , http://www.fluxdata.org/default.aspx
Use Case 51 Consumption forecasting in Smart Grids http://smartgrid.usc.edu/ , http://ganges.usc.edu/wiki/Smart_Grid , https://www.ladwp.com/ladwp/faces/ladwp/aboutus/a-power/a-p-smartgridla?_afrLoop=157401916661989&_afrWindowMode=0&_afrWindowId=null#%40%3F_afrWindowId%3Dnull%26_afrLoop%3D157401916661989%26_afrWindowMode%3D0%26_adf.ctrl-state%3Db7yulr4rl_17 , http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6475927

Unit 18 - Features of 51 Big Data Use Cases ¶

This unit discusses the categories used to classify the 51 use-cases. These categories include concepts used for parallelism and low and high level computational structure. The first lesson is an introduction to all categories and the further lessons give details of particular categories.

Slides ¶

https://iu.app.box.com/s/azpn47brv4o46iij9xvb

Lesson 1 - Summary of Use Case Classification I ¶

This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (Iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.

Video: http://youtu.be/dfgH6YvHCGE

Lesson 2 - Summary of Use Case Classification II ¶

Video: http://youtu.be/TjHus5-HaMQ

Lesson 3 - Summary of Use Case Classification III ¶

Video: http://youtu.be/EbuNBbt4rQc

Lesson 4 - Database (SQL) Use Case Classification ¶

This discusses the classic (SQL) database approach to data handling with Search & Query and Index features. Comparisons are made to NoSQL approaches.

Video: http://youtu.be/8QDcUWjA9Ok

Lesson 5 - NoSQL Use Case Classification ¶

This discusses NoSQL (compared with SQL in the previous lesson) with HDFS, Hadoop and HBase. The Apache Big Data stack is introduced and further details of the comparison with SQL are given.

Video: http://youtu.be/aJ127gkHQUs

Lesson 6 - Use Case Classifications I ¶

This discusses a subset of use case features: GIS, Sensors, and the support of data analysis and fusion by streaming data between filters.

Video: http://youtu.be/STAoaS1T2bM

Lesson 7 - Use Case Classifications II Part 1 ¶

This discusses a subset of use case features: Pleasingly parallel, MRStat, Data Assimilation, Crowd sourcing, Agents, data fusion and agents, EGO and security.

Video: http://youtu.be/_tJRzG-jS4A

Lesson 8 - Use Case Classifications II Part 2 ¶

This discusses a subset of use case features: Pleasingly parallel, MRStat, Data Assimilation, Crowd sourcing, Agents, data fusion and agents, EGO and security.

Video: http://youtu.be/5iHdzMNviZo

Lesson 9 - Use Case Classifications III Part 1 ¶

This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.

Video: http://youtu.be/tITbuwCRVzs

Lesson 10 - Use Case Classifications III Part 2 ¶

Video: http://youtu.be/0zaXWo8A4Co

Resources ¶

See previous section

Section 8 - Technology Training - Plotviz ¶

We introduce Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can “see” structure in data. Although most Big Data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D. We give several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, we describe the download and software dependencies of Plotviz.

Unit 19 - Using Plotviz Software for Displaying Point Distributions in 3D ¶

We introduce Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can “see” structure in data. Although most Big Data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D. We give several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, we describe the download and software dependencies of Plotviz.

Slides ¶

https://iu.app.box.com/s/jypomnrz755xgps5e6iw

Files ¶

Lesson 1 - Motivation and Introduction to use ¶

The motivation of Plotviz is that the human eye is very good at pattern recognition and can “see” structure in data. Although most Big Data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D, and one can then check analyses like clustering and/or see structure missed in a computer analysis. The motivation section shows some Cheminformatics examples. The use of Plotviz starts on slide 4 with a discussion of the input file, which is either simple text or a rich XML syntax in which more features (like colors) can be specified. Plotviz deals with points and their classification (clustering). Next the protein sequence browser in 3D shows the basic structure of the Plotviz interface. The next two slides explain the core 3D and 2D manipulations respectively. Note all files used in examples are available to students.

Video: http://youtu.be/4aQlCmQ1jfY

Lesson 2 - Example of Use I: Cube and Structured Dataset ¶

We start with a simple plot of 8 points – the corners of a cube in 3 dimensions – showing basic operations such as size/color/labels and the Legend of points. The second example shows a dataset (coming from GTM dimension reduction) with significant structure. This has .pviz and .txt versions that are compared.

Video: http://youtu.be/nCTT5mI_j_Q

Lesson 3 - Example of Use II: Proteomics and Synchronized Rotation ¶

This starts with an examination of a sample of Protein Universe Browser showing how one uses Plotviz to look at different features of this set of Protein sequences projected to 3D. Then we show how to compare two datasets with synchronized rotation of a dataset clustered in 2 different ways; this dataset comes from k Nearest Neighbor discussion.

Video: http://youtu.be/lDbIhnLrNkk

Lesson 4 - Example of Use III: More Features and larger Proteomics Sample ¶

This starts by describing use of Labels and Glyphs and the Default mode in Plotviz. Then we illustrate sophisticated use of these ideas to view a large Proteomics dataset.

Video: http://youtu.be/KBkUW_QNSvs

Lesson 5 - Example of Use IV: Tools and Examples ¶

This lesson starts by describing the Plotviz tools and then sets up two examples – Oil Flow and Trading – described in PowerPoint. It finishes with the Plotviz viewing of Oil Flow data.

Video: http://youtu.be/zp_709imR40

Lesson 6 - Example of Use V: Final Examples ¶

This starts with Plotviz looking at the Trading example introduced in the previous lesson and then examines solvent data. It finishes with two large biology examples with 446K and 100K points, each with over 100 clusters. We finish with remarks on the Plotviz software structure and how to download it. We also remind you that a picture is worth 1000 words.

Video: http://youtu.be/FKoCfTJ_cDM

Resources ¶

Download files from http://salsahpc.indiana.edu/pviz3/

Section 9 - e-Commerce and LifeStyle Case Study ¶

Recommender systems operate under the hood of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs. Kaggle competitions help improve the success of the Netflix and other recommender systems. Attention is paid to models that are used to compare how changes to the systems affect their overall performance. It is interesting that the humble ranking has become such a dominant driver of the world’s economy. More examples of recommender systems are given from Google News, Retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites.

The formulation of recommendations in terms of points in a space or bag is given where bags of item properties, user properties, rankings and users are useful. Detail is given on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items. Items are viewed as points in a space of users in item-based collaborative filtering. The Cosine Similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed. A simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions are given. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concept of a training and a testing set is introduced with the training set pre-labelled. Recommender systems are used to discuss clustering, with k-means based clustering methods used and their results examined in Plotviz. The original labelling is compared to clustering results and an extension to 28 clusters is given. General issues in clustering are discussed including local optima, the use of annealing to avoid them and the value of heuristic algorithms.

Unit 20 - Recommender Systems: Introduction ¶

We introduce Recommender systems as an optimization technology used in a variety of applications and contexts online. They operate in the background of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs, to the benefit of both.

There follows an exploration of the Kaggle competition site, other recommender systems and Netflix, as well as competitions held to improve the success of the Netflix recommender system. Finally attention is paid to models that are used to compare how changes to the systems affect their overall performance. It is interesting how the humble ranking has become such a dominant driver of the world’s economy.

Slides ¶

https://iu.app.box.com/s/v2coa6mxql12iax4yc8f

Lesson 1 - Recommender Systems as an Optimization Problem ¶

We define a set of general recommender systems as matching of items to people or perhaps collections of items to collections of people where items can be other people, products in a store, movies, jobs, events, web pages etc. We present this as “yet another optimization problem”.

https://youtu.be/rymBt1kdyVU

Lesson 2 - Recommender Systems Introduction ¶

We give a general discussion of recommender systems and point out that they are particularly valuable in the long tail of items (to be recommended) that aren’t commonly known. We pose them as a rating system and relate them to information retrieval rating systems. We can contrast recommender systems based on user profile and context; the most familiar collaborative filtering of others’ rankings; item properties; knowledge; and hybrid cases mixing some or all of these.

https://youtu.be/KbjBKrzFYKg

Lesson 3 - Kaggle Competitions ¶

We look at Kaggle competitions with examples from web site. In particular we discuss an Irvine class project involving ranking jokes.

https://youtu.be/DFH7GPrbsJA

Lesson 4 - Examples of Recommender Systems ¶

We go through a list of 9 recommender systems from the same Irvine class.

https://youtu.be/1Eh1epQj-EQ

Lesson 5 - Netflix on Recommender Systems I ¶

This is Part 1.

We summarize some interesting points from a tutorial from Netflix for whom “everything is a recommendation”. Rankings are given in multiple categories and categories that reflect user interests are especially important. Criteria used include explicit user preferences, implicit based on ratings and hybrid methods as well as freshness and diversity. Netflix tries to explain the rationale of its recommendations. We give some data on Netflix operations and some methods used in its recommender systems. We describe the famous Netflix Kaggle competition to improve its rating system. The analogy to maximizing click through rate is given and the objectives of optimization are given.

https://youtu.be/tXsU5RRAD-w

Lesson 6 - Netflix on Recommender Systems II ¶

This is Part 2 of “Netflix on Recommender Systems”

https://youtu.be/GnAol5aGuEo

Lesson 7 - Consumer Data Science ¶

Here we go through Netflix’s methodology in letting data speak for itself in optimizing the recommender engine. An example is given on choosing self-produced movies. A/B testing is discussed with examples showing how testing does allow optimizing of sophisticated criteria. This lesson is concluded by comments on Netflix technology and the full spectrum of issues that are involved including user interface, data, A/B testing, systems and architectures. We comment on optimizing for a household rather than optimizing for individuals in a household.

https://youtu.be/B8cjaOQ57LI

Resources ¶

Unit 21 - Recommender Systems: Examples and Algorithms ¶

We continue the discussion of recommender systems and their use in e-commerce. More examples are given from Google News, Retail stores and in depth Yahoo! covering the multi-faceted criteria used in deciding recommendations on web sites. Then the formulation of recommendations in terms of points in a space or bag is given.

Here bags of item properties, user properties, rankings and users are useful. Then we go into detail on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items.

Slides ¶

https://iu.app.box.com/s/pqa1xpk7g4jnr7k2xlbe

Lesson 1 - Recap and Examples of Recommender Systems ¶

We start with a quick recap of recommender systems from previous unit; what they are with brief examples.

https://youtu.be/dcdm5AfGZ64

Lesson 2 - Examples of Recommender Systems ¶

We give 2 examples in more detail: namely Google News and Markdown in Retail.

https://youtu.be/og07mH9fU0M

Lesson 3 - Recommender Systems in Yahoo Use Case Example I ¶

This is Part 1.

We describe in greatest detail the methods used to optimize Yahoo web sites. There are two lessons discussing the general approach and a third lesson examines a particular personalized Yahoo page with its different components. We point out the different criteria that must be blended in making decisions; these criteria include analysis of what the user does after a particular page is clicked; is the user satisfied and can that be quantified by purchase decisions etc. We need to choose articles, ads, modules, movies, users, updates, etc. to optimize metrics such as relevance score, CTR, revenue and engagement. These lessons stress that even though we have big data, the recommender data is sparse. We discuss the approach that involves both batch (offline) and on-line (real time) components.

https://youtu.be/FBn7HpGFNvg

Lesson 4 - Recommender Systems in Yahoo Use Case Example II ¶

This is Part 2 of “Recommender Systems in Yahoo Use Case Example”

https://youtu.be/VS2Y4lAiP5A

Lesson 5 - Recommender Systems in Yahoo Use Case Example III: Particular Module ¶

This is Part 3 of “Recommender Systems in Yahoo Use Case Example”

https://youtu.be/HrRJWEF8EfU

Lesson 6 - User-based nearest-neighbor collaborative filtering I ¶

This is Part 1.

Collaborative filtering is a core approach to recommender systems. There is user-based and item-based collaborative filtering and here we discuss the user-based case. Here similarities in user rankings allow one to predict their interests, and typically this is quantified by the Pearson correlation, used to statistically quantify correlations between users.

https://youtu.be/lsf_AE-8dSk
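
The user-based approach described in this lesson can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the lecture: the ratings table, the user and item names, and the helper functions `pearson` and `predict` are all hypothetical. The sketch assumes the correlation is computed over co-rated items only and that the prediction is the target user's mean rating plus a similarity-weighted average of the neighbors' deviations from their own means.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4, "item4": 4},
    "user1": {"item1": 3, "item2": 1, "item3": 2, "item4": 3, "item5": 3},
    "user2": {"item1": 4, "item2": 3, "item3": 4, "item4": 3, "item5": 5},
    "user3": {"item1": 3, "item2": 3, "item3": 1, "item4": 5, "item5": 4},
    "user4": {"item1": 1, "item2": 5, "item3": 5, "item4": 2, "item5": 1},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = mean(a[i] for i in common)
    mean_b = mean(b[i] for i in common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    target = ratings[user]
    neighbors = []
    for other, other_ratings in ratings.items():
        if other != user and item in other_ratings:
            sim = pearson(target, other_ratings)
            if sim > 0:  # keep positively correlated users only
                neighbors.append((sim, other))
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    base = mean(target.values())
    if not top:
        return base
    weighted = sum(
        sim * (ratings[o][item] - mean(ratings[o].values())) for sim, o in top
    )
    return base + weighted / sum(sim for sim, _ in top)


print(round(pearson(ratings["alice"], ratings["user1"]), 3))
print(round(predict(ratings, "alice", "item5"), 2))
```

Restricting the neighborhood to positively correlated users and to `k=2` is a design choice made for this sketch; other neighborhood sizes and weighting schemes are equally possible.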

Lesson 7 - User-based nearest-neighbor collaborative filtering II ¶

This is Part 2 of “User-based nearest-neighbor collaborative filtering”

https://youtu.be/U7-qeX2ItPk

Lesson 8 - Vector Space Formulation of Recommender Systems ¶

We go through recommender systems thinking of them as formulated in a funny vector space. This suggests using clustering to make recommendations.

https://youtu.be/IlQUZOXlaSU

Resources ¶

http://pages.cs.wisc.edu/~beechung/icml11-tutorial/

Unit 22 - Item-based Collaborative Filtering and its Technologies ¶

We move on to item-based collaborative filtering where items are viewed as points in a space of users. The Cosine Similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed.

Slides ¶

https://iu.app.box.com/s/fvrwds7zd65m79a7uur3

Lesson 1 - Item-based Collaborative Filtering I ¶

This is Part 1.

We covered user-based collaborative filtering in the previous unit. Here we start by discussing memory-based real time and model-based offline (batch) approaches. Now we look at item-based collaborative filtering where items are viewed in the space of users and the cosine measure is used to quantify distances. We discuss optimizations and how batch processing can help. We discuss different Likert ranking scales and issues with new items that do not have a significant number of rankings.

https://youtu.be/25sBgh3HwxY
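
As a rough sketch of the item-based view, each item can be stored as a vector with one entry per user, and the cosine measure then compares two items. The item names, the rating values and the helper functions `cosine` and `most_similar` below are hypothetical, and treating a missing rating as 0 is an assumption made here for simplicity.

```python
from math import sqrt

# Hypothetical item vectors: position i holds user i's rating (0 = not rated)
items = {
    "A": [5, 4, 0, 1],
    "B": [4, 5, 1, 0],
    "C": [0, 1, 5, 4],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(name, items):
    """Rank all other items by cosine similarity to the named item."""
    scores = [
        (other, cosine(items[name], vec))
        for other, vec in items.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(most_similar("A", items))
```

Because item-item similarities change slowly, a table like the one returned by `most_similar` is a natural candidate for the offline (batch) computation mentioned above.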

Lesson 2 - Item-based Collaborative Filtering II ¶

This is Part 2 of “Item-based Collaborative Filtering”

https://youtu.be/SM8EJdAa4mw

Lesson 3 - k Nearest Neighbors and High Dimensional Spaces ¶

We define the k Nearest Neighbor algorithm and present the Python software but do not use it. We give examples from Wikipedia and describe performance issues. This algorithm illustrates the curse of dimensionality. If items were real vectors in a low dimensional space, there would be faster solution methods.

https://youtu.be/2NqUsDGQDy8

Section 10 - Technology Training - kNN & Clustering ¶

This section provides a discussion of the k Nearest Neighbors (kNN) algorithm and clustering using k-means. The Python version of kNN is discussed in the video and instructions for both Java and Python are given in the slides. Plotviz is used for generating 3D visualizations.

Unit 23 - Recommender Systems - K-Nearest Neighbors (Python & Java Track)¶

We discuss simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concept of training and testing sets is introduced with the training set pre-labelled.

Slides ¶

https://iu.app.box.com/s/i9et3dxnhr3qt5gn14bg

Files ¶

Lesson 1 - Python k’th Nearest Neighbor Algorithms I ¶

This is Part 1.

This lesson considers the Python k Nearest Neighbor code found on the web associated with a book by Harrington on Machine Learning. There are two data sets. First we consider a set of 4 2D vectors divided into two categories (clusters) and use k=3 Nearest Neighbor algorithm to classify 3 test points. Second we consider a 3D dataset that has already been classified and show how to normalize. In this lesson we just use Matplotlib to give 2D plots.

https://youtu.be/o16L0EqsQ_g
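
For orientation, a k=3 Nearest Neighbor classifier over four labelled 2D vectors can be sketched as below. This is not the code from the book used in the lesson; the function name `knn_classify`, the training points and the test points are hypothetical values chosen for illustration, and Euclidean distance with a simple majority vote is assumed.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(point, train_points, train_labels, k=3):
    """Label a point by majority vote among its k nearest training points."""
    order = sorted(
        range(len(train_points)), key=lambda i: dist(point, train_points[i])
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: four 2D vectors in two categories
train_points = [(1.0, 1.2), (0.9, 1.0), (0.1, 0.0), (0.0, 0.2)]
train_labels = ["A", "A", "B", "B"]

for test_point in [(0.2, 0.1), (1.1, 0.9)]:
    print(test_point, knn_classify(test_point, train_points, train_labels))
```

Sorting all training points for every query is the simplest possible approach; it is also the source of the performance issues raised in the previous unit.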

Lesson 2 - Python k’th Nearest Neighbor Algorithms II ¶

This is Part 2 of “Python k’th Nearest Neighbor Algorithms”.

https://youtu.be/JK5p24mnTjs

Lesson 3 - 3D Visualization ¶

The lesson modifies the online code to allow it to produce files readable by Plotviz. We visualize the already classified 3D set and rotate it in 3D.

https://youtu.be/fLtH-ZI1Jqk

Lesson 4 - Testing k’th Nearest Neighbor Algorithms ¶

The lesson goes through an example of using the kNN classification algorithm by dividing the dataset into 2 subsets. One is the training set with an initial classification; the other is the set of test points to be classified by k=3 NN using the training set. The code records the fraction of points with a different classification from that input. One can experiment with different sizes of the two subsets. The Python implementation of the algorithm is analyzed in detail.

https://youtu.be/zLaPGMIQ9So
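
The testing procedure can be sketched as follows, under stated assumptions: features are rescaled to [0, 1] by min-max normalization, the labelled data is shuffled and split into a test subset and a training subset, and the error rate is the fraction of test points whose kNN label differs from the input label. The synthetic 3D data, the helper names and the 25% test fraction are hypothetical and are not taken from the lesson's code.

```python
import random
from collections import Counter
from math import dist


def normalize(points):
    """Min-max rescale every feature to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        tuple((x - lo) / span for x, lo, span in zip(p, lows, spans))
        for p in points
    ]


def knn_classify(point, train_points, train_labels, k=3):
    order = sorted(
        range(len(train_points)), key=lambda i: dist(point, train_points[i])
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def holdout_error_rate(points, labels, test_fraction=0.25, k=3, seed=0):
    """Fraction of held-out points whose kNN label differs from the given one."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    n_test = int(len(points) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    train_points = [points[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    errors = sum(
        knn_classify(points[i], train_points, train_labels, k) != labels[i]
        for i in test_idx
    )
    return errors / n_test


# Hypothetical 3D data: two well separated groups with very different scales
rng = random.Random(1)
points, labels = [], []
for label, (cx, cy, cz) in (("A", (0, 0, 0)), ("B", (10, 10, 1000))):
    for _ in range(20):
        points.append(
            (
                cx + rng.uniform(-1, 1),
                cy + rng.uniform(-1, 1),
                cz + rng.uniform(-100, 100),
            )
        )
        labels.append(label)

normalized = normalize(points)
print(holdout_error_rate(normalized, labels))
```

Without the normalization step the third feature, whose values are about 100 times larger, would dominate the distance calculation.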

Unit 24 - Clustering and heuristic methods ¶

We use example of recommender system to discuss clustering. The details of methods are not discussed but k-means based clustering methods are used and their results examined in Plotviz. The original labelling is compared to clustering results and extension to 28 clusters given. General issues in clustering are discussed including local optima, the use of annealing to avoid this and value of heuristic algorithms.

Slides ¶

https://iu.app.box.com/s/70qn6d61oln9b50jqobl

Files ¶

Lesson 1 - Kmeans Clustering ¶

We introduce the k-means algorithm in a gentle fashion and describe its key features including the dangers of local minima. A simple example from Wikipedia is examined.

https://youtu.be/3KTNJ0Okrqs
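
A minimal sketch of the basic k-means iteration is given below: assign each point to its nearest center, move each center to the mean of its points, and repeat until the assignment stops changing. The function name `kmeans`, the six 2D points and the deliberately poor initial centers are hypothetical. The initial centers are passed in explicitly because the final answer can depend on them, which is how local minima arise.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center-update steps until nothing changes."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        new_assignment = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each center to the mean of its members
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical data: two obvious groups, with both initial centers in one group
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, [(0, 0), (1, 0)])
print(centers)
print(assignment)
```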

Lesson 2 - Clustering of Recommender System Example ¶

Plotviz is used to examine and compare the original classification with an “optimal” clustering into 3 clusters using a fancy deterministic annealing method that is similar to k-means. The new clustering has centers marked.

https://youtu.be/yl_KZ86NT-A

Lesson 3 - Clustering of Recommender Example into more than 3 Clusters ¶

The previous division into 3 clusters is compared with a clustering into 28 separate clusters that are naturally smaller in size and divide the 3D space covered by 1000 points into compact geometrically local regions.

https://youtu.be/JWZmh48l0cw

Lesson 4 - Local Optima in Clustering ¶

This lesson introduces some general principles. First many important processes are “just” optimization problems. Most such problems are rife with local optima. The key idea behind annealing to avoid local optima is described. The pervasive greedy optimization method is described.

https://youtu.be/Zmq8O_axCmc

Lesson 5 - Clustering in General ¶

The two different applications of clustering are described: first to find geometrically distinct regions and second to divide spaces into geometrically compact regions that may have no “thin air” between them. Generalizations such as mixture models and latent factor methods are just mentioned. The important distinction between applications in vector spaces and those where only inter-point distances are defined is described. Examples are then given using Plotviz from 2D clustering of a mass spectrometry example and the results of clustering genomic data mapped into 3D with Multi Dimensional Scaling (MDS).

https://youtu.be/JejNZhBxjRU

Lesson 6 - Heuristics ¶

Some remarks are given on heuristics: why are they so important, and why is getting exact answers often not so important?

https://youtu.be/KT22YuX8ZMY

Resources ¶

Section 11 - Cloud Computing Technology for Big Data Applications & Analytics (will be updated)¶

We describe the central role of Parallel computing in Clouds and Big Data, which is decomposed into lots of “Little data” running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition. Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing are introduced. This includes virtualization and the important “as a Service” components and we go through several different definitions of cloud computing.

Gartner’s Technology Landscape includes hype cycle and priority matrix and covers clouds and Big Data. Two simple examples of the value of clouds for enterprise applications are given with a review of different views as to nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). Features in Grid and cloud computing and data are treated. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack explaining how they are used.

Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models are discussed followed by the Cloud Industry stakeholders with a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in cloud follow. We describe the way users and data interact with a cloud system. The Big Data Processing from an application perspective with commercial examples including eBay concludes section after a discussion of data system architectures.

Unit 25 - Parallel Computing: Overview of Basic Principles with familiar Examples ¶

Slides ¶

https://iu.app.box.com/s/nau0rsr39kyej240s4yz

Lesson 1 - Decomposition I ¶

This is Part 1.

We describe why parallel computing is essential with Big Data and distinguish parallelism over users from that over the data in a problem. The general ideas behind data decomposition are given followed by a few often whimsical examples dreamed up 30 years ago in the early heady days of parallel computing. These include scientific simulations, defense outside missile attack and computer chess. The basic problem of parallel computing – efficient coordination of separate tasks processing different data parts – is described with MPI and MapReduce as two approaches. The challenges of data decomposition in irregular problems are noted.

https://youtu.be/R-wHQW2YuRE
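
The idea of data decomposition can be illustrated with a small sketch: split the data into nearly equal parts, let separate workers process their parts independently (a map step), and combine the partial results (a reduce step). This is neither MPI nor Hadoop MapReduce; it only mimics the pattern with Python's standard `concurrent.futures` thread pool, and the function names and the sum-of-squares task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def partial_sum_of_squares(part):
    """Work done by one worker on its own part of the data (map step)."""
    return sum(x * x for x in part)


def parallel_sum_of_squares(data, n_workers=4):
    parts = decompose(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, parts))
    return sum(partials)  # combine the partial results (reduce step)


print(decompose(list(range(10)), 4))
print(parallel_sum_of_squares(list(range(10))))
```

Threads are used here only to keep the sketch short and runnable anywhere; for CPU-bound work one would normally use separate processes or MPI ranks, and for irregular problems equal-sized chunks no longer guarantee equal work.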

Lesson 2 - Decomposition II ¶

This is Part 2 of “Decomposition”.

https://youtu.be/iIi9wdvlwCM

Lesson 3 - Decomposition III ¶

This is Part 3 of “Decomposition”.

https://youtu.be/F0aeeLeTD9I

Lesson 4 - Parallel Computing in Society I ¶

This is Part 1.

This lesson from the past notes that one can view society as an approach to parallel linkage of people. The largest example given is that of the construction of a long wall such as that (Hadrian’s wall) between England and Scotland. Different approaches to parallelism are given with formulae for the speed up and efficiency. The concepts of grain size (size of problem tackled by an individual processor) and coordination overhead are exemplified. This example also illustrates Amdahl’s law and the relation between data and processor topology. The lesson concludes with other examples from nature including collections of neurons (the brain) and ants.

https://youtu.be/8rtjoe8AeJw
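
The speed up and efficiency mentioned above are commonly defined as S(N) = T(1)/T(N) and S(N)/N, and Amdahl's law bounds the speed up by 1/(s + (1 - s)/N) when a fraction s of the work is sequential. The sketch below uses these standard textbook forms, which may differ in notation from the lecture slides; the function names and the timings for the wall-building example are hypothetical.

```python
def speedup(time_one_worker, time_n_workers):
    """S(N) = T(1) / T(N)."""
    return time_one_worker / time_n_workers


def efficiency(speedup_value, n_workers):
    """Fraction of the ideal N-fold speed up actually achieved."""
    return speedup_value / n_workers


def amdahl_speedup(serial_fraction, n_workers):
    """Upper bound on speed up when part of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


# Hypothetical wall: 100 days for one mason, 20 days for 8 masons
s = speedup(100, 20)
print(s, efficiency(s, 8))

# With 10% of the work inherently sequential, adding masons gives diminishing returns
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```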

Lesson 5 - Parallel Computing in Society II ¶

This is Part 2 of “Parallel Computing in Society”.

https://youtu.be/7sCgH_TTPGk

Lesson 6 - Parallel Processing for Hadrian’s Wall ¶

This lesson returns to Hadrian’s wall and uses it to illustrate advanced issues in parallel computing. First we describe the basic SPMD – Single Program Multiple Data – model. Then irregular but homogeneous and heterogeneous problems are discussed. Static and dynamic load balancing is needed. Inner parallelism (as in vector instructions or the multiple fingers of masons) and outer parallelism (typical data parallelism) are demonstrated. Parallel I/O for Hadrian’s wall is followed by a slide summarizing this quaint comparison between Big Data parallelism and the construction of a large wall.

https://youtu.be/ZD2AQ08cy8I

Resources ¶

Solving Problems in Concurrent Processors-Volume 1, with M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Prentice Hall, March 1988.
Parallel Computing Works!, with P. Messina, R. Williams, Morgan Kaufmann (1994). http://www.netlib.org/utk/lsi/pcwLSI/text/
The Sourcebook of Parallel Computing book edited by Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon, and Andy White, Morgan Kaufmann, November 2002.
Geoffrey Fox Computational Sciences and Parallelism to appear in Encyclopedia on Parallel Computing edited by David Padua and published by Springer. http://grids.ucs.indiana.edu/ptliupages/publications/SpringerEncyclopedia_Fox.pdf

Unit 26 - Cloud Computing Technology Part I: Introduction ¶

We discuss Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing. This includes virtualization and the important “as a Service” components and we go through several different definitions of cloud computing. Gartner’s Technology Landscape includes hype cycle and priority matrix and covers clouds and Big Data. The unit concludes with two simple examples of the value of clouds for enterprise applications. Gartner also has specific predictions for cloud computing growth areas.

Slides ¶

https://iu.app.box.com/s/p3lztuu9kv240pdm66141or9b8p1uvzb

Lesson 1 - Cyberinfrastructure for E-MoreOrLessAnything ¶

This introduction describes Cyberinfrastructure or e-infrastructure and its role in solving the electronic implementation of any problem where e-moreorlessanything is another term for moreorlessanything-Informatics and generalizes early discussion of e-Science and e-Business.

https://youtu.be/gHz0cu195ZM

Lesson 2 - What is Cloud Computing: Introduction ¶

Cloud Computing is introduced with an operational definition involving virtualization and efficient large data centers that can rent computers in an elastic fashion. The role of services is essential – it underlies capabilities being offered in the cloud. The four basic aaS’s – Software (SaaS), Platform (PaaS), Infrastructure (IaaS) and Network (NaaS) – are introduced with Research aaS and other capabilities (for example Sensors aaS, discussed later) being built on top of these.

https://youtu.be/Od_mYXRs5As

Lesson 3 - What and Why is Cloud Computing: Several Other Views I ¶

This is Part 1.

This lesson contains 5 slides with diverse comments on “what is cloud computing” from the web.

https://youtu.be/5VeqMjXKU_Y

Lesson 4 - What and Why is Cloud Computing: Several Other Views II ¶

This is Part 2 of “What and Why is Cloud Computing: Several Other Views”.

https://youtu.be/J963LR0PS_g

Lesson 5 - What and Why is Cloud Computing: Several Other Views III ¶

This is Part 3 of “What and Why is Cloud Computing: Several Other Views”.

https://youtu.be/_ryLXUnOAzo

Lesson 6 - Gartner’s Emerging Technology Landscape for Clouds and Big Data ¶

This lesson gives Gartner’s projections around futures of cloud and Big data. We start with a review of hype charts and then go into detailed Gartner analyses of the Cloud and Big data areas. Big data itself is at the top of the hype and by definition predictions of doom are emerging. Before too much excitement sets in, note that spinach is above clouds and Big data in Google trends.

https://youtu.be/N7aEtU1mUwc

Lesson 7 - Simple Examples of use of Cloud Computing ¶

This short lesson gives two examples of rather straightforward commercial applications of cloud computing. One is server consolidation for multiple Microsoft database applications and the second is the benefits of scale comparing gmail to multiple smaller installations. It ends with some fiscal comments.

https://youtu.be/VCctCP6BKEo

Lesson 8 - Value of Cloud Computing ¶

Some comments on fiscal value of cloud computing.

https://youtu.be/HM1dZCxdsaA

Resources ¶

Unit 27 - Cloud Computing Technology Part II: Software and Systems ¶

We cover different views as to the nature of architecture and application for Cloud Computing. Then we discuss cloud software, starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack, explaining how they are used.

Slides ¶

https://iu.app.box.com/s/k61o0ff1w6jkn5zmpaaiw02yth4v4alh

Lesson 1 - What is Cloud Computing ¶

This lesson gives some general remarks on cloud systems from an architecture and application perspective.

https://youtu.be/h3Rpb0Eyj1c

Lesson 2 - Introduction to Cloud Software Architecture: IaaS and PaaS I ¶

We discuss cloud software, starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies.

https://youtu.be/1AnyJYyh490

Lesson 3 - Introduction to Cloud Software Architecture: IaaS and PaaS II ¶

https://youtu.be/hVpFAUHcAd4

Lesson 4 - Using the HPC-ABDS Software Stack ¶

Using the HPC-ABDS Software Stack.

https://youtu.be/JuTQdRW78Pg

Resources ¶

Unit 28 - Cloud Computing Technology Part III: Architectures, Applications and Systems ¶

We start with a discussion of Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models. We summarize a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in cloud follow.

Slides ¶

https://iu.app.box.com/s/0bn57opwe56t0rx4k18bswupfwj7culv

Lesson 1 - Cloud (Data Center) Architectures I ¶

This is Part 1.

Some remarks on what it takes to build (in software) a cloud ecosystem, and why clouds are the data center of the future are followed by pictures and discussions of several data centers from Microsoft (mainly) and Google. The role of containers is stressed as part of modular data centers that trade scalability for fault tolerance. Sizes of cloud centers and supercomputers are discussed as is “green” computing.

https://youtu.be/j0P32DmQjI8

Lesson 2 - Cloud (Data Center) Architectures II ¶

This is Part 2 of “Cloud (Data Center) Architectures”.

https://youtu.be/3HAGqz34AB4

Lesson 3 - Analysis of Major Cloud Providers ¶

Gartner 2014 Analysis of leading cloud providers.

https://youtu.be/Tu8hE1SeT28

Lesson 4 - Commercial Cloud Storage Trends ¶

Use of Dropbox, iCloud, Box etc.

https://youtu.be/i5OI6R526kM

Lesson 5 - Cloud Applications I ¶

This is Part 1.

This short lesson discusses the need for security and issues in its implementation. Clouds trade scalability for greater possibility of faults but here clouds offer good support for recovery from faults. We discuss both storage and program fault tolerance noting that parallel computing is especially sensitive to faults as a fault in one task will impact all other tasks in the parallel job.

https://youtu.be/nkeSOMTGbbo

Lesson 6 - Cloud Applications II ¶

This is Part 2 of “Cloud Applications”.

https://youtu.be/ORd3aBhc2Rc

Lesson 7 - Science Clouds ¶

Science Applications and Internet of Things.

https://youtu.be/2PDvpZluyvs

Lesson 8 - Security ¶

This short lesson discusses the need for security and issues in its implementation.

https://youtu.be/NojXG3fbrEo

Lesson 9 - Comments on Fault Tolerance and Synchronicity Constraints ¶

Clouds gain scalability at the cost of a greater possibility of faults, but they also offer good support for recovery from faults. We discuss both storage and program fault tolerance, noting that parallel computing is especially sensitive to faults because a fault in one task will impact all other tasks in the parallel job.

https://youtu.be/OMZiSiN7dlU

Resources ¶

http://www.slideshare.net/woorung/trend-and-future-of-cloud-computing
http://www.eweek.com/c/a/Cloud-Computing/AWS-Innovation-Means-Cloud-Domination-307831
CSTI General Assembly 2012, Washington, D.C., USA Technical Activities Coordinating Committee (TACC) Meeting, Data Management, Cloud Computing and the Long Tail of Science October 2012 Dennis Gannon.
http://research.microsoft.com/en-us/um/redmond/events/cloudfutures2012/tuesday/Keynote_OpportunitiesAndChallenges_Yousef_Khalidi.pdf
http://www.datacenterknowledge.com/archives/2011/05/10/uptime-institute-the-average-pue-is-1-8/
https://loosebolts.wordpress.com/2008/12/02/our-vision-for-generation-4-modular-data-centers-one-way-of-getting-it-just-right/
http://www.mediafire.com/file/zzqna34282frr2f/koomeydatacenterelectuse2011finalversion.pdf
http://www.slideshare.net/JensNimis/cloud-computing-tutorial-jens-nimis
http://www.slideshare.net/botchagalupe/introduction-to-clouds-cloud-camp-columbus
http://www.venus-c.eu/Pages/Home.aspx
Geoffrey Fox and Dennis Gannon Using Clouds for Technical Computing To be published in Proceedings of HPC 2012 Conference at Cetraro, Italy June 28 2012 http://grids.ucs.indiana.edu/ptliupages/publications/Clouds_Technical_Computing_FoxGannonv2.pdf
https://berkeleydatascience.files.wordpress.com/2012/01/20120119berkeley.pdf
Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, Bill Franks Wiley ISBN: 978-1-118-20878-6
Anjul Bhambhri, VP of Big Data, IBM http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
Conquering Big Data with the Oracle Information Model, Helen Sun, Oracle
Hugh Williams VP Experience, Search & Platforms, eBay http://businessinnovation.berkeley.edu/fisher-cio-leadership-program/
Dennis Gannon, Scientific Computing Environments, http://www.nitrd.gov/nitrdgroups/images/7/73/D_Gannon_2025_scientific_computing_environments.pdf
http://searchcloudcomputing.techtarget.com/feature/Cloud-computing-experts-forecast-the-market-climate-in-2014
http://www.kpcb.com/internet-trends

Unit 29 - Cloud Computing Technology Part IV: Data Systems ¶

We describe the way users and data interact with a cloud system. The unit concludes with the treatment of data in the cloud from an architecture perspective and Big Data Processing from an application perspective with commercial examples including eBay.

Slides ¶

https://iu.app.box.com/s/ftfpybxm8jzjepzp409vgair1fttv3m1

Lesson 1 - The 10 Interaction scenarios (access patterns) I ¶

The next 3 lessons describe the way users and data interact with the system.

https://youtu.be/vB4rCNri_P0

Lesson 2 - The 10 Interaction scenarios - Science Examples ¶

This lesson describes the way users and data interact with the system for some science examples.

https://youtu.be/cFX1PQpiSbk

Lesson 3 - Remaining general access patterns ¶

This lesson describes the way users and data interact with the system for the final set of examples.

https://youtu.be/-dtE9zXB-I0

Lesson 4 - Data in the Cloud ¶

Databases, File systems, Object Stores and NOSQL are discussed and compared. The way to build a modern data repository in the cloud is introduced.

https://youtu.be/HdtIOnk3qX4

Lesson 5 - Applications Processing Big Data ¶

This lesson collects remarks on Big data processing from several sources: Berkeley, Teradata, IBM, Oracle and eBay with architectures and application opportunities.

https://youtu.be/d6A2m4GR-hw

Resources ¶

Section 12 - Web Search and Text Mining and their technologies ¶

This section starts with an overview of data mining and puts our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history with libraries, catalogs and concordances. An overview of web search is given describing the continued evolution of search engines and the relation to the field of Information Retrieval. The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distribution completes this unit.

We start with an overview of the different steps (data analytics) in web search and then go through key steps in detail, starting with document preparation. An inverted index is described and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the section.

Unit 30 - Web Search and Text Mining I ¶

The unit starts with the web with its size, shape (coming from the mutual linkage of pages by URLs) and universal power laws for the number of pages with a particular number of URLs linking out of or into a page. Information retrieval is introduced and compared to web search. A comparison is given between semantic searches as in databases and the full text search that is the basis of web search. The origin of web search in libraries, catalogs and concordances is summarized. The DIKW (Data, Information, Knowledge, Wisdom) model for web search is discussed. Then features of documents, collections and the important Bag of Words representation are covered. Queries are presented in the context of an Information Retrieval architecture. The method of judging quality of results including recall, precision and diversity is described. A timeline for the evolution of search engines is given.

Boolean and Vector Space models for queries, including the cosine similarity, are introduced. Web Crawlers are discussed and then the steps needed to analyze data from the Web and produce a set of terms. Building and accessing an inverted index is followed by the importance of term specificity and how it is captured in TF-IDF. We note how frequencies are converted into belief and relevance.

Slides ¶

https://iu.app.box.com/s/qo7itbtcxp2b58syz3jg

Lesson 1 - Web and Document/Text Search: The Problem ¶

This lesson starts with the web with its size, shape (coming from the mutual linkage of pages by URLs) and universal power laws for the number of pages with a particular number of URLs linking out of or into a page.

https://youtu.be/T12BccKe8p4

Lesson 2 - Information Retrieval leading to Web Search ¶

Information retrieval is introduced. A comparison is given between semantic searches as in databases and the full text search that is the basis of web search. The ACM classification illustrates the potential complexity of ontologies. Some differences between web search and information retrieval are given.

https://youtu.be/KtWhk2cdRa4

Lesson 3 - History behind Web Search ¶

The origin of web search in libraries, catalogs and concordances is summarized.

https://youtu.be/J7D61uH5gVM

Lesson 4 - Key Fundamental Principles behind Web Search ¶

This lesson describes the DIKW – Data Information Knowledge Wisdom – model for web search. Then it discusses documents, collections and the important Bag of Words representation.

https://youtu.be/yPFi6xFnDHE

Lesson 5 - Information Retrieval (Web Search) Components ¶

This lesson describes queries in the context of an Information Retrieval architecture. The method of judging quality of results including recall, precision and diversity is described.

https://youtu.be/EGsnonXgb3Y
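
As a minimal sketch of two of the quality measures named in this lesson, the following computes precision and recall for a single query. The document IDs and the helper name `precision_recall` are hypothetical and chosen only for illustration; diversity is not covered here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]   # what the search engine returned
relevant_docs = ["d2", "d4", "d5"]          # what a human judged relevant

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). Returning more documents typically raises recall while lowering precision.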

Lesson 6 - Search Engines ¶

This short lesson describes a timeline for the evolution of search engines. The first web search approaches were built directly on Information Retrieval, but in 1998 the field changed when Google was founded and showed the importance of URL structure as exemplified by PageRank.

https://youtu.be/kBV-99N6f7k

Lesson 7 - Boolean and Vector Space Models ¶

This lesson describes the Boolean and Vector Space models for query including the cosine similarity.

https://youtu.be/JzGBA0OhsIk
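
The two query models can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code: the three tiny documents, the whitespace tokenizer and the function names are all assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())


def boolean_and(query_terms, doc_bag):
    """Boolean AND query: True only if every query term occurs in the document."""
    return all(term in doc_bag for term in query_terms)


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * bag_b[term] for term, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection.
docs = {
    "doc1": "cloud computing for big data",
    "doc2": "web search and information retrieval",
    "doc3": "big data analytics for web search",
}
query = bag_of_words("web search")
bags = {name: bag_of_words(text) for name, text in docs.items()}

# Boolean model: a document either matches or it does not.
matches = [name for name, bag in bags.items() if boolean_and(query, bag)]

# Vector space model: every document gets a score, so results can be ranked.
ranking = sorted(bags, key=lambda name: cosine_similarity(query, bags[name]), reverse=True)
print(matches)
print(ranking)
```

The Boolean query returns an unordered set of matching documents, whereas the cosine similarity orders all documents, with shorter documents that contain the query terms scoring higher because their vectors point more nearly in the query's direction.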

Lesson 8 - Web crawling and Document Preparation ¶

This lesson describes a Web Crawler and then the steps needed to analyze data from the Web and produce a set of terms.

https://youtu.be/Wv-r-PJ9lro

Lesson 9 - Indices ¶

This lesson describes both building and accessing an inverted index. It describes how phrases are treated and gives details of query structure from some early logs.

https://youtu.be/NY2SmrHoBVM
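
A minimal sketch of building and accessing an inverted index is given below. The documents and function names are hypothetical. The sketch stores only document IDs; handling phrases, as discussed in the lesson, would additionally require storing word positions, which is omitted here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def lookup_all(index, terms):
    """Return documents containing every term (intersection of posting lists)."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical mini-collection keyed by document ID.
docs = {
    1: "cloud computing for big data",
    2: "web search and information retrieval",
    3: "big data analytics for web search",
}
index = build_inverted_index(docs)

print(index["web"])
print(lookup_all(index, ["big", "data"]))
```

The index is built once over the whole collection, after which a query only touches the posting lists of its own terms rather than scanning every document.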

Lesson 10 - TF-IDF and Probabilistic Models ¶

This lesson describes the importance of term specificity and how it is captured in TF-IDF. It notes how frequencies are converted into belief and relevance.

https://youtu.be/9P_HUmpselU
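
To illustrate how term specificity enters the weighting, here is a small sketch using one common TF-IDF variant, weight = tf × log(N / df). This particular formula, the toy documents and the function name are assumptions; the lecture may use a different normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    bags = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(bags)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / df[term]) for term, tf in bag.items()}
        for doc_id, bag in bags.items()
    }


# Hypothetical mini-collection.
docs = {
    "doc1": "the cloud stores the data",
    "doc2": "the web search engine",
    "doc3": "the data deluge",
}
weights = tf_idf(docs)
for term, weight in sorted(weights["doc1"].items()):
    print(f"{term}: {weight:.3f}")
```

In this toy collection "the" occurs in every document, so its weight is zero even though it is the most frequent term in doc1, while "cloud", which occurs in only one document, receives the largest weight.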

Resources ¶

http://saedsayad.com/data_mining_map.htm
http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html
The Web Graph: an Overview Jean-Loup Guillaume and Matthieu Latapy https://hal.archives-ouvertes.fr/file/index/docid/54458/filename/webgraph.pdf
Constructing a reliable Web graph with information on browsing behavior, Yiqun Liu, Yufei Xue, Danqing Xu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru http://www.sciencedirect.com/science/article/pii/S0167923612001844
http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Unit 31 - Web Search and Text Mining II ¶

We start with an overview of the different steps (data analytics) in web search. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. Issues in web advertising and search follow. This leads to the emerging field of computational advertising. The use of clustering and topic models completes the unit, with Google News as an example.

Slides ¶

https://iu.app.box.com/s/iuzc1qfep748z1o2kgx2

Lesson 1 - Data Analytics for Web Search ¶

This short lesson describes the different steps needed in web search including: Get the digital data (from web or from scanning); Crawl web; Preprocess data to get searchable things (words, positions); Form Inverted Index mapping words to documents; Rank relevance of documents with potentially sophisticated techniques; and integrate technology to support advertising and ways to allow or stop pages artificially enhancing relevance.

https://youtu.be/ugyycKBjaBQ

Lesson 2 - Link Structure Analysis including PageRank I ¶

This is Part 1.

The value of links and the concepts of Hubs and Authorities are discussed. This leads to the definition of PageRank with examples. Extensions of PageRank viewed as a reputation are discussed with journal rankings and university department rankings as examples. There are many extensions of these ideas which are not discussed here, although topic models are covered briefly in a later lesson.

https://youtu.be/1oXdopVxqfI

Lesson 3 - Link Structure Analysis including PageRank II ¶

This is Part 2 of “Link Structure Analysis including PageRank”.

https://youtu.be/OCn-gCTxvrU

Lesson 4 - Web Advertising and Search ¶

Internet and mobile advertising is growing fast and can be personalized more than advertising in traditional media. There are several advertising types (sponsored search, contextual ads and display ads) and different pricing models: cost per viewing, cost per click and cost per action. This leads to the emerging field of computational advertising.

https://youtu.be/GgkmG0NzQvg

Lesson 5 - Clustering and Topic Models ¶

We discuss briefly approaches to defining groups of documents. We illustrate this for Google News and give an example that this can give different answers from word-based analyses. We mention some work at Indiana University on a Latent Semantic Indexing model.

https://youtu.be/95cHMyZ-TUs

Resources ¶

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws
https://en.wikipedia.org/wiki/PageRank
http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html
Meeker/Wu May 29 2013 Internet Trends D11 Conference http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013

Section 13 - Technology for Big Data Applications and Analytics ¶

We use the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters of distinct centers and various choices for sizes, using Matplotlib for visualization. We show results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the “hill” between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel but here the “parallel” maps run sequentially. This simple 2 map version can be generalized to scalable parallelism. Python is used to calculate PageRank from a Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The section concludes with a calculation of PageRank for general web pages by extracting the secret from Google.

Unit 32 - Technology for X-Informatics: K-means (Python & Java Track)¶

Slides ¶

https://iu.app.box.com/s/ltgbehfjwvgh40l5d3w8

Files ¶

Lesson 1 - K-means in Python ¶

We use the K-means Python code in the SciPy package to show real code for clustering and apply it to a set of 85 two-dimensional vectors (officially sets of weights and heights to be clustered to find T-shirt sizes). We run through Python code with Matplotlib displays to divide the data into 2-5 clusters. Then we discuss Python code to generate 4 clusters of varying sizes, centered at the corners of a square in two dimensions. We state the K-means algorithm more formally than before and make the definition consistent with the code in SciPy.

https://youtu.be/I79ISV6XBbE
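
The following is a minimal sketch of the second experiment described above, using `scipy.cluster.vq.kmeans` and `vq`. The square's side length, the spread and the number of points per cluster are assumed values, not the ones used in the lecture, and the plotting step with Matplotlib is left out.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)

# Four hypothetical clusters centered at the corners of a square.
centers = np.array([[0.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 4.0]])
points_per_cluster = 250
spread = 0.3  # assumed standard deviation; larger values make the clusters overlap
data = np.vstack(
    [c + spread * rng.standard_normal((points_per_cluster, 2)) for c in centers]
)

# kmeans returns the best codebook (centroids) it found and the mean
# distance of the points from their nearest centroid (the distortion).
codebook, distortion = kmeans(data, 4, iter=20)

# vq assigns each point to its nearest centroid.
labels, distances = vq(data, codebook)

print(codebook)
print(f"distortion={distortion:.3f}")
```

Because K-means starts from randomly chosen points, the centroids found can differ from run to run and need not coincide with the four true centers, which is the behavior the following lessons examine.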

Lesson 2 - Analysis of 4 Artificial Clusters I ¶

This is Part 1.

We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster sizes: “small”, “large” and “very large”. We emphasize that SciPy by default does 20 independent K-means runs and takes the best result, an approach to avoiding local minima. We allow this number of independent runs to be changed and in particular set to 1 to generate more interesting erratic results. We define changes in our new K-means code, which also allows two measures of quality. The slides give many results of clustering into 2, 4, 6 and 8 clusters (there were only 4 real clusters). We show that the “very small” case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and a hill between them. The lesson includes both discussion of already produced results in the slides and interactive use of Python for new runs.

https://youtu.be/Srgq9VDg4C8
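
The effect of the number of independent runs can be sketched as follows. This is an illustration under assumed data parameters, not the modified K-means code used in the lesson: each call with `iter=1` performs a single run from a random start, and keeping the lowest distortion over several such calls mimics what a larger `iter` does internally.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)

# Same hypothetical layout as before: 4 clusters at the corners of a square.
centers = np.array([[0.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 4.0]])
data = np.vstack([c + 0.3 * rng.standard_normal((250, 2)) for c in centers])

# A single K-means run may stop in a local minimum, so the distortion
# it reports can vary considerably between calls.
single_run_distortions = [kmeans(data, 4, iter=1)[1] for _ in range(20)]

best = min(single_run_distortions)
worst = max(single_run_distortions)
print(f"best={best:.3f} worst={worst:.3f}")
```

A large gap between the best and worst values indicates that some runs ended in a different minimum, separated from the best one by the “hill” discussed in the lesson.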

Lesson 3 - Analysis of 4 Artificial Clusters II ¶

This is Part 2 of “Analysis of 4 Artificial Clusters”.

https://youtu.be/rjyAXjA_mOk

Lesson 4 - Analysis of 4 Artificial Clusters III ¶

This is Part 3 of “Analysis of 4 Artificial Clusters”.

https://youtu.be/N6QKyrhNVAc

Unit 33 - Technology for X-Informatics: MapReduce ¶

We describe the basic architecture of MapReduce and a homely example. The discussion of advanced topics includes extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given.

Slides ¶

https://iu.app.box.com/s/hqykdx1bquez7ers3d1j

Lesson 1 - Introduction ¶

This introduction uses an analogy to making fruit punch by slicing and blending fruit to illustrate MapReduce. The formal structure of MapReduce and Iterative MapReduce is presented with parallel data flowing from disks through multiple Map and Reduce phases to be inspected by the user.

https://youtu.be/67qFY64aj7g
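
The map, shuffle and reduce phases can be sketched in plain Python with the classic word count, here counting fruit names in keeping with the analogy above. Everything runs in one process; the function names and input splits are hypothetical, and a real MapReduce runtime would distribute the map and reduce calls across machines.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Two hypothetical input splits, each of which would go to its own mapper.
splits = ["apple banana apple", "banana cherry"]

mapped = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)  # {'apple': 2, 'banana': 2, 'cherry': 1}
```

Each map call depends only on its own split and each reduce call only on its own key, which is what allows both phases to run in parallel.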

Lesson 2 - Advanced Topics I ¶

This is Part 1.

This defines 4 types of MapReduce and the Map Collective model of Qiu. The Iterative MapReduce model from Indiana University called Twister is described and a few performance measurements on Microsoft Azure are presented.

https://youtu.be/lo4movzSyVw

Lesson 3 - Advanced Topics II ¶

This is Part 2 of “Advanced Topics”.

https://youtu.be/wnanWncQBow

Unit 34 - Technology: Kmeans and MapReduce Parallelism ¶

We modify the SciPy K-means code to support a MapReduce execution style and run it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel but here the “parallel” maps run sequentially. We stress that this simple 2 map version can be generalized to scalable parallelism.
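
The idea can be sketched with NumPy as follows. This is not the ParallelKmeans file from the unit: the function names, the data and the deterministic choice of starting centroids are assumptions made for illustration. Each map handles one partition of the points and returns per-centroid partial sums and counts; the reduce adds these and forms the new centroids.

```python
import numpy as np


def kmeans_map(points, centroids):
    """Map: for one data partition, return per-centroid partial sums and counts."""
    k = len(centroids)
    # Squared distance from every point to every centroid, shape (n_points, k).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        members = points[nearest == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts


def kmeans_reduce(partials, old_centroids):
    """Reduce: add the partial sums and counts, then form the new centroids."""
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    new_centroids = old_centroids.copy()
    nonempty = total_counts > 0  # keep the old centroid if a cluster is empty
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids


rng = np.random.default_rng(0)
# Hypothetical data: two well-separated blobs of 100 points each.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])

# Two partitions, one per "parallel" map; here the maps run sequentially.
partitions = [data[0::2], data[1::2]]
centroids = data[[0, 100]].copy()  # assumed deterministic start, one point per blob

for _ in range(10):
    partials = [kmeans_map(part, centroids) for part in partitions]
    centroids = kmeans_reduce(partials, centroids)

print(centroids)
```

Only the small arrays of sums and counts travel from the maps to the reduce, never the points themselves, which is why the same pattern extends from 2 maps to many.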

Slides ¶

https://iu.app.box.com/s/zc9pckhyehn0cog8wy19

Files ¶

ParallelKmeans

Lesson 1 - MapReduce Kmeans in Python I ¶

This is Part 1.

https://youtu.be/2El1oL3gKpQ

Lesson 2 - MapReduce Kmeans in Python II ¶

This is Part 2 of “MapReduce Kmeans in Python”.

https://youtu.be/LLrTWWdE3T0

Unit 35 - Technology: PageRank (Python & Java Track)¶

We use Python to calculate PageRank from a Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit concludes with a calculation of PageRank for general web pages by extracting the secret from Google.

Slides ¶

https://iu.app.box.com/s/gwq1qp0kmwbvilo0kjqq

Files ¶

Lesson 1 - Calculate PageRank from Web Linkage Matrix I ¶

This is Part 1.

We take two simple matrices for 6 and 8 web sites respectively to illustrate the calculation of PageRank.

https://youtu.be/rLWUvvcHrCQ
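
As an illustrative sketch of two formulations of the leading-eigenvector problem, the code below computes PageRank for a hypothetical 6-site link structure, first by power iteration and then with a dense eigensolver. The link structure and the damping factor of 0.85 are assumptions and are not the matrices used in the lesson.

```python
import numpy as np

# Hypothetical link structure: links[i] lists the sites that site i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2, 4], 4: [3, 5], 5: [0]}
n = len(links)

# Column-stochastic matrix: M[j, i] is the probability of moving from site i to site j.
M = np.zeros((n, n))
for i, targets in links.items():
    for j in targets:
        M[j, i] = 1.0 / len(targets)

damping = 0.85  # assumed value of the usual damping factor
G = damping * M + (1.0 - damping) / n * np.ones((n, n))

# Formulation 1: power iteration, repeatedly applying G to a uniform start vector.
rank = np.full(n, 1.0 / n)
for _ in range(200):
    rank = G @ rank

# Formulation 2: leading eigenvector from a dense eigensolver, rescaled to sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(G)
leading = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
leading = leading / leading.sum()

print(np.round(rank, 4))
print(np.round(leading, 4))
```

Both formulations give the same vector up to numerical precision. Power iteration needs only matrix-vector products, which is what makes it practical for matrices far too large for a dense eigensolver.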

Lesson 2 - Calculate PageRank from Web Linkage Matrix II ¶

This is Part 2 of “Calculate PageRank from Web Linkage Matrix”.

https://youtu.be/UzQRukCFQv8

Lesson 3 - Calculate PageRank of a real page ¶

This tiny lesson presents Python code that finds the PageRank that Google calculates for any page on the web.

https://youtu.be/8L_72bRLQVk

Section 14 - Sensors Case Study ¶

We start with the Internet of Things (IoT), giving examples like monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones and self-driving cars and more generally transportation systems. We give examples of robots and drones. We introduce the Industrial Internet of Things (IIoT) and summarize surveys and expectations industry-wide. We give examples from General Electric. Sensor clouds control the many small distributed devices of IoT and IIoT. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes including U-Korea; and finally the smart electric grid.

Unit 36 - Case Study: Sensors ¶

Slides ¶

https://iu.box.com/s/9a5y7p7xvhjqgrc9zjob8gorv3ft4kyq

Lesson 1 - Internet of Things ¶

There are predicted to be 24-50 Billion devices on the Internet by 2020; these are typically some sort of sensor defined as any source or sink of time series data. Sensors include smartphones, webcams, monitors of machine operation, barcodes, surveillance cameras, scientific sensors (especially in earth and environmental science), drones and self driving cars and more generally transportation systems. The lesson gives many examples of distributed sensors, which form a Grid that is controlled by a cloud.

https://youtu.be/fFMvxYW6Yu0

Lesson 2 - Robotics and IOT Expectations ¶

Examples of Robots and Drones.

https://youtu.be/VqXvn0dwqxs

Lesson 3 - Industrial Internet of Things I ¶

This is Part 1.

We summarize surveys and expectations industry-wide.

https://youtu.be/jqQJjtTEsEo

Lesson 4 - Industrial Internet of Things II ¶

This is Part 2 of “Industrial Internet of Things”.

Examples from General Electric.

https://youtu.be/YiIvQRCi3j8

Lesson 5 - Sensor Clouds ¶

We describe the architecture of a Sensor Cloud control environment and give an example of the interface to an older version of it. The performance of the system is measured in terms of processing latency as a function of the number of involved sensors, with each delivering data at a 1.8 Mbps rate.

https://youtu.be/0egT1FsVGrU

Lesson 6 - Earth/Environment/Polar Science data gathered by Sensors ¶

This lesson gives examples of some sensors in the Earth/Environment/Polar Science field. It starts with material from the CReSIS polar remote sensing project and then looks at the NSF Ocean Observing Initiative and NASA’s MODIS or Moderate Resolution Imaging Spectroradiometer instrument on a satellite.

https://youtu.be/CS2gX7axWfI

Lesson 7 - Ubiquitous/Smart Cities ¶

For Ubiquitous/Smart cities we give two examples: Ubiquitous Korea and smart electrical grids.

https://youtu.be/MFFIItQ3SOo

Lesson 8 - U-Korea (U=Ubiquitous)¶

Korea has an interesting positioning where it is first worldwide in broadband access per capita, e-government, scientific literacy and total working hours. However it is far down in measures like quality of life and GDP. U-Korea aims to improve the latter by Pervasive computing, everywhere, anytime i.e. by spreading sensors everywhere. The example of a ‘High-Tech Utopia’ New Songdo is given.

https://youtu.be/wdot23r4YKs

Lesson 9 - Smart Grid ¶

The electrical Smart Grid aims to enhance USA’s aging electrical infrastructure by pervasive deployment of sensors and the integration of their measurement in a cloud or equivalent server infrastructure. A variety of new instruments include smart meters, power monitors, and measures of solar irradiance, wind speed, and temperature. One goal is autonomous local power units where good use is made of waste heat.

https://youtu.be/m3eX8act0GU

Resources ¶

Section 15 - Radar Case Study ¶

Unit 37 - Case Study: Radar ¶

The changing global climate is suspected to have long-term effects on much of the world’s inhabitants. Among the various effects, the rising sea level will directly affect many people living in low-lying coastal regions. While the ocean’s thermal expansion has been the dominant contributor to rises in sea level, the potential contribution of discharges from the polar ice sheets in Greenland and Antarctica may provide a more significant threat due to the unpredictable response to the changing climate. The Radar-Informatics unit provides a glimpse into the processes fueling global climate change and explains what methods are used for ice data acquisition and analysis.

Slides ¶

https://iu.app.box.com/s/njxktkb71e2cbroopsx2

Lesson 1 - Introduction ¶

This lesson motivates radar-informatics by building on previous discussions on why X-applications are growing in data size and why analytics are necessary for acquiring knowledge from large data. The lesson details three mosaics of a changing Greenland ice sheet and provides a concise overview of subsequent lessons by explaining how other remote sensing technologies, such as radar, can be used to sound the polar ice sheets and what we are doing with radar images to extract knowledge to be incorporated into numerical models.

https://youtu.be/LXOncC2AhsI