3.1. Introduction

Note

Some videos may show a different lesson, section, or unit number. Please ignore this. If the content does not correspond to the title, please let us know.

This section gives a technical overview of the course, followed by a broad motivation for the course, which is hosted at [intro1].

The course overview covers its content and structure. It presents an introduction to the general field of Big Data and Analytics. We especially analyze the many different application areas in which Big Data can be applied. As Big Data is typically not used in isolation but is part of a larger Informatics issue for a particular field, we also use the term X-Informatics, where X defines a use case or area of specialization to which Big Data is applied. As such, we organize the class around the Rallying Cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics.

The course is set up as a number of lessons that typically last between 20 minutes and an hour. The lessons are provided either as written documents or as video lectures. They are enhanced by a meeting that takes place either in person in a lecture room for residential students or online for online students.

The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The overview ends with a discussion of the course content at the highest level. The course starts with a motivation summarizing clouds and data science, then continues with units describing applications in areas such as Physics, e-Commerce, Web Search and Text Mining, Health, Sensors, and Remote Sensing. These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications). The course uses Python as its primary programming language. We will introduce the practical use of cloud resources so that you have the opportunity to explore example analytics applications on smaller data sets that you define.

The course motivation starts with striking examples of the data deluge from research, business, and the consumer. The growing number of jobs in data science is highlighted. The lecture describes industry trends in both clouds and big data. Then the cloud computing model, developed at amazing speed by industry, is introduced. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. The lecture covers 3 major X-Informatics areas: Physics, e-Commerce, and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described.

This course discusses the following topics. We may change the order of the topics to allow for maximal flexibility and parallel learning experiences.

Writing Track:

  • Writing a short review article
  • Writing a project or term report

Theory Track:

  • Motivation: Big Data and the Cloud; Centerpieces of the Future Economy
  • Introduction: What is Big Data, Data Analytics
  • Use Cases: Big Data Use Cases Survey
    • Use Case: Physics, Discovery of the Higgs Particle
    • Use Case: e-Commerce and Lifestyle with recommender systems
    • Use Case: Web Search and Text Mining and their technologies
    • Use Case: Sports
    • Use Case: Health
    • Use Case: Sensors
    • Use Case: Radar for Remote Sensing
  • Parallel Computing Overview and familiar examples
  • Cloud Technology for Big Data Applications & Analytics

Practice Track:

  • Python for Big Data Applications and Analytics: NumPy, SciPy, Matplotlib
  • Using FutureGrid for Big Data Applications and Analytics Course
  • Using Chameleon Cloud for Big Data Applications and Analytics Course
  • [optional] Using Plotviz Software for Displaying Point Distributions in 3D
  • Recommender Systems - K-Nearest Neighbors, Clustering and heuristic methods
  • PageRank
  • Kmeans
  • MapReduce
  • Kmeans and MapReduce Parallelism

3.1.1. Course Motivation

We motivate the study of X-Informatics by describing data science and clouds. The lecture starts with striking examples of the data deluge from research, business, and the consumer. The growing number of jobs in data science is highlighted, and industry trends in both clouds and big data are described.

The lecture introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. The lecture covers 3 major X-Informatics areas: Physics, e-Commerce, and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. The lecture closes with comments on data science education and the benefits of using MOOCs.

3.1.1.1. Emerging Technologies

This lesson presents an overview of the talk and some trends in computing, data, and jobs. Gartner’s emerging technology hype cycle shows many areas of Clouds and Big Data. We highlight 6 issues of importance: the economic imperative, the computing model, the research model, opportunities in advancing computing, opportunities in X-Informatics, and data science education.

3.1.1.2. Data Deluge

We give some amazing statistics for total storage; uploaded video and uploaded photos; the social media interactions every minute; aspects of the business big data tidal wave; monitors of aircraft engines; the science research data sizes from particle physics to astronomy and earth science; genes sequenced; and finally the long tail of science. The next slide emphasizes applications using algorithms on clouds. This leads to the rallying cry “Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science” with a catalog of the many values of X: “Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness”.

3.1.1.3. Jobs

Jobs abound in clouds and data science. There are documented shortages in data science and computer science, and the major tech companies advertise for new talent.

  • Video: 9:39: (streamed with optional CC) JOBS:
  • Video: 9:39: (unstreamed without CC) JOBS:
  • Slides (8 pages): JOBS:

3.1.1.5. Digital Disruption of Old Favorites

Not everything goes up. The rise of the Internet has led to declines in some traditional areas, including shopping malls and postal services.

  • Video: 32:54: (unstreamed without CC) Digital Disruption and Transformation

3.1.1.6. Computing Model

Industry adopted clouds, which are attractive for data analytics.

Clouds and Big Data are transformational on a 2-5 year time scale. Amazon AWS is already a lucrative business with almost $4B in revenue. We describe the nature of cloud centers with economies of scale and give examples of the importance of virtualization in server consolidation. Then key characteristics of clouds are reviewed, with high growth expected in Infrastructure, Platform, and Software as a Service.

3.1.1.7. Research Model

4th Paradigm: From Theory to Data-driven Science?

We introduce the 4 paradigms of scientific research, with a focus on the new fourth, data-driven methodology.

3.1.1.8. Data Science Process

We introduce the DIKW (data to information to knowledge to wisdom) paradigm. Data flows through cloud services, is transformed along the way, and emerges as new information that serves as input to other transformations.

3.1.1.9. Physics-Informatics

Looking for the Higgs Particle with the Large Hadron Collider (LHC)

We look at an important particle physics example in which the Large Hadron Collider has observed the Higgs Boson. The lecturer shows this discovery as a bump in a histogram, something that so amazed him 50 years ago that he got a PhD in this field. He left the field partly due to the incredible size of author lists on papers.

3.1.1.10. Recommender Systems

Many important applications involve matching users, web pages, jobs, movies, books, events, etc. These are all optimization problems, with recommender systems being one important way of performing this optimization. We go through the example of Netflix, where everything is a recommendation, and muse about the power of viewing all sorts of things as items in a bag or, more abstractly, some space with funny properties.

3.1.1.11. Web Search and Information Retrieval

This course also looks at Web Search. Here we give an overview of the data analytics for web search and of PageRank as a method of ranking the web pages returned, and we use material from Yahoo on the subtle algorithms for dynamic personalized choice of material for web pages.
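To make the idea of ranking pages concrete, the following is a minimal sketch of PageRank computed by repeated iteration. It is not taken from the lecture material: the function name pagerank, the damping factor of 0.85, the fixed iteration count, and the tiny three-page web are all assumptions chosen for illustration. The sketch also assumes that every linked page appears as a key in the input dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by simple iteration.

    links maps each page to the list of pages it links to.
    Assumption: every link target is also a key of links.
    """
    pages = list(links)
    n = len(pages)
    # Start with the same score for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its score equally among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page web used only for illustration.
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(web).items()):
        print(page, round(score, 3))
```

In this toy web, page c is linked from both a and b, so it ends up with a higher score than page b, which is linked only from a.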

3.1.1.12. Cloud Application in Research

We describe scientific applications and how they map onto clouds, supercomputers, grids, and high-throughput systems. The lecturer likes the cloud use of the Internet of Things and gives examples.

3.1.1.13. Parallel Computing and MapReduce

We define MapReduce and give a homely example from fruit blending.
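The map, shuffle, and reduce steps can be sketched in a few lines of plain Python. This is a single-machine illustration only and does not reproduce the lecture's fruit-blending example: the function names map_phase, shuffle_phase, reduce_phase, and count_fruits, as well as the sample basket, are hypothetical. A real MapReduce system would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(fruits):
    """Map: emit one (key, value) pair per input record."""
    return [(fruit, 1) for fruit in fruits]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def count_fruits(fruits):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(fruits)))


if __name__ == "__main__":
    # Hypothetical input used only for illustration.
    basket = ["apple", "banana", "apple", "orange", "banana", "apple"]
    print(count_fruits(basket))
```

Because each record is mapped independently and each key is reduced independently, both phases can be split across workers, which is the source of MapReduce parallelism.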

3.1.1.14. Data Science Education

We discuss one reason you are taking this course: Data Science as an educational initiative and aspects of its Indiana University implementation. Then general features of online education are discussed, with clear growth spearheaded by MOOCs, where we use this course and others as examples. The lecturer stresses the choice between one class for 100,000 students or 2,000 classes for 50 students and an online library of MOOC lessons. In olden days he suggested a “hermit’s cage virtual university”: gurus in isolated caves putting together exciting curricula outside the traditional university model. Grading and mentoring models and important online tools are discussed. Clouds have MOOCs describing them and MOOCs are stored in clouds; a pleasing symmetry.

3.1.1.15. Conclusions

The conclusions highlight clouds, data-intensive methodology, employment, data science, and MOOCs. Never forget the Big Data ecosystem in one sentence: “Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science”.

3.1.1.16. Resources

3.1.1.17. References

[intro1] Gregor von Laszewski. Cloudmesh.classes. Web Page. URL: https://cloudmesh.github.io/classes/.