i524 Big Data and Open Source Software Projects (2017)

This course studies software used in many commercial activities related to Big Data. The backdrop for course contains more than 370 software subsystems illustrated in Figure 1. We will describe the software architecture represented by this collection and work towards identifying best practices to deploy, access and interface with them. Topics of this class will include:

  1. The cloud computing architecture underlying open source big data software and frameworks and contrast of them to high performance computing
  2. The software architecture with its different layers covering broad functionality and rationale for each layer.
  3. We will go through selected big data stack software from the list of more than 370
  4. We will be identifying how we can create and replicate software environments based on software deployed and used on clouds while using Containers, OpenStack and Ansible playbooks.
  5. Students will choose a number of open source members of the list each and create repeatable deployments as illustrated in class.
  6. The main activity of the course will be building a significant project using multiple subsystems combined with user code and data. Projects will be suggested or students can choose their own. A project report will summarize the work conducted.
  7. Topics taught in this class will be very relevant for industry as you are not only exposed to big data, but you will also be practically exposed to DevOps and collaborative code development tools as part of your homework and project assignment.

Students of this class will need to conduct their project deployments in python using ansible and enabling a software stack that is useful for a big data analysis. While it is not necessary to know either python or ansible to take the class it is important that you have knowledge of a programming language so you can enhance your knowledge on them throughout the class and succeed. You will be expected to have a computer on which you have python 2.7.x installed. You will be using chameleon and possibly our local cloud. Optionally some projects may use docker.

Figure 2 illustrates that you can follow the components of the class in a variety of ways and in parallel. For example, you do not have to wait to start the project or to find out more about any of the subsystems.

digraph class {
   rankdir="LR";
   compound=true;
   node [shape=box,
         style=filled,
         fillcolor=ivory];

   subgraph cluster_prj {
             "Project Proposal" -> "Project Approval" ->
             "Project Update" -> "Final Project";
             label = "Project";
     }

   subgraph cluster_ppr {
             "Paper 1" -> "Paper 2" -> "Paper 3";
             label = "Technology Papers";
     }


   "start" [shape=circle,fillcolor=palegreen];
   "end" [shape=circle,fillcolor=khaki];
   "midterm" [shape=circle,fillcolor=salmon];

   "start" -> "Technology";
   "start" -> "Collaboration";
   "start" -> "Systems";
   "start" -> "Theory";
   "start" -> "Web" -> "Paper 1" [lhead=cluster_ppr];
   "Paper 3" -> "midterm" [ltail=cluster_ppr];
   "midterm" -> "end";
   "Theory" -> "Technology" -> "Project Proposal" [lhead=cluster_prj];
   "Collaboration" -> "Project Proposal" [lhead=cluster_prj];
   "Collaboration" -> "Paper 1" [lhead=cluster_ppr];
   "Collaboration" -> "Web";
   "Systems" -> "Project Proposal" [lhead=cluster_prj];
   "Theory" -> "Web";
   "Theory" -> "Paper 1" [lhead=cluster_ppr];
   "Project Update" -> "midterm" [ltail=cluster_prj];
   "Final Project" -> "end" [ltail=cluster_prj];
}

Figure 2: Components of the Class

Note

You do not have to take I523 in order to take I524.

For previous I523 class participants: While I523 is a beginners class I524 is a more advanced class and we expect that you know python which you hopefully have learned as part of I523 while doing a software project. If not, make sure you learn it before you take this class or consider significant additional time needed to learn it for the class.

Residential students need to enroll early so we avoid the situation like last year where we had many signing up, but did not even show up to the first lecture. I have asked that students from I523 have preference, but I am not sure if we can enforce this. So enroll ASAP. Those that are on the waiting list are recommended to show up in the first class. It is likely that you can join as others drop.

Meeting Times

The classes are published online. Residential students at Indiana University will participate in a discussion taking place at the following time:

  • Monday 09:30am - 10:45am EST, I2 130

For the 100% online students see the office hours.

Online Meetings

For the zoom information please go to

https://iu.instructure.com/courses/1603897/assignments/syllabus

A doodle was used and all students that answered the doodle have times that they specified. We covered 100% the time for the students through the following schedule:

All times are in Eastern Standard Time.

Day of Week Meetings
Monday
8-9am Office Hours (Vibhatha)
9:30-10:45am Residential Lecture
6-7pm Office Hours (Tony)
Tuesday
1-2pm Office Hours (Dimitar)
4-5pm Office Hours (Jerome)
Wednesday 6-7pm Office Hours (Jerome)
Thursday 6-7pm Office Hours (Gregor)
Friday 4-5pm Office Hours (Dimitar)
Saturday 8-9pm Office Hours (Miao)
Sunday
9-10am Office Hours (Vibhatha)
8-9pm Office Hours (Miao)

Who can take the class?

  • Although Undergrads can take this class it will be thaught as graduate class. Make sure you have enough time and fulfill the prerequisites such as knowing a programming language well. You need to have enough time to learn python if you do not know it.
  • You can take I524 without taking I523, but you must be proficient in python. The overlap between I523 and I524 only relates to some introduction lectures and naturally lectures from the systems track such as github, report writing, introduction to python.
  • Online students
  • Residential students

Homework

Grading policies are listed in Table 1.

Table 1: Grading
Percent Description
10% Class participation and contribution to Web pages.
30% Three unique technology papers per student of the 370 systems. Each paper as at least 2 pages per technology without references.
60% Project code and report with at least 6 pages without references. Much shorter reports will be returned without review. Do not artificially inflate contents.
  • Technology papers: Technology papers must be non-overlapping in the entire class. As we have over 370 such technologies we should have enough for the entire class. If you see technologies missing, let us know and we see how to add them. Technology papers could be a survey of multiple technologies or an indepth analysis of a particular technology.
  • Technology paper groups: Groups of up to three students can work also on the technology papers. However the workload is not reduced, you will produce 3 times the number of group members technology papers of unique technologies. However, you can have multiple coauthors for each paper (up to three) that are part of your group. Please do not ask us how many technology papers you need to write if you are in a group. The rule is clearly specified. Example: Your group has 3 members, each of them has to produce 3 unique papers, thus you have to produce 9 unique technology papers for this group. If you have 2 members you have to produce 6, if you work alone you have to produce 3.
  • Technology deployment Homework: Each student will develop as a preparation for the project a deployment of a technology. Points may depend on completeness, effort of the deployment. Technology deployments should as much as possible be non overlapping. In many cases you chose wisely such deployments may line up with your technology papers as you can add a section reporting on your achievement and experience with such deployments.
  • Project groups: Groups of up to three students can work on a project but workload increases with each student and a work break down must be provided. More than three students are not allowed. If you work in a group you will be asked to deploy a larger system or demonstrate deployability on multiple clouds or container frameworks while benchmarking and comparing them. A group project containing 2 or 3 team members should not look like a project done by an individual. Please plan careful and make sure all team members contribute.
  • Frequent checkins: It is important to make frequent and often commits to the github repository as the activities will be monitored and will be integrated into the project grade. Note that paper and project will take a considerable amount of time and doing proper time management is a must for this class. Avoid starting your project late. Procrastination does not pay off.
  • No bonus projects: This class will not have any bonus projects regrading to requests. Instead, you need to focus your time on the papers and the project assignments and homework.
  • Voluntary work: You are welcome to conduct assignments and exercises you find on the class Web page on your own. However, they are not graded or considered for extra credit.
  • Late homework: Any late homework will be receiving a 10% grade reduction. As this is a large class and the assignments are not standard multiple choice questions, grading will take a considerable time. Some homework can not be delivered late as they are related to establish communication with you. Such deadline specific homework will receive 0 points in case they are late. See course calendar. It is the student’s responsibility to upload submissions well ahead of the deadline to avoid last minute problems with network connectivity, browser crashes, cloud issues, etc.
  • Chance for publishing a paper: If however you find that the work you do could lead to a publishable paper, you could work together with the course instructor as coauthors to conduct such an activity. However, this is going to be a significant effort and you need to decide if you like to conduct this. In such cases if the work is sufficient for publication submission, an A+ for the class could be considered. It will be a lot of work. The length of such a paper is typically 10-12 high quality pages including figures and references. We may elect for the final submission to use a different LaTeX style

Prerequisites

We expect you are familiar with:

  • Linux and the Operating system on which you will focus your deployment.
  • Note that Windows as OS will not be sufficient as Ansible is not supported on it. However you can use virtualbox or log onto one of the clouds to get access to an OS that supports ansible. So you can use your Windows computer if it is powerful enough.
  • Python 2.7.x (we will not use python 3 for this class as it is not yet portable with all systems) Although python is considered to be a straight forward language to learn, students that have not done any programming may find it challenging.
  • Familiarity with the Python ecosystem. The best way to install python on a computer is to use virtualenv, and pip (which we will teach you as part of the class).
  • Familiarity with an editor such as emacs, vi, jedit, pyCharm, eclipse, or other that you can use to program in and write your reports.

If you are not familiar with these technologies, we expect that you get to know them before or during class. This may pose additional time commitment.

Open Source Publication of Homework

As this class is about open source technologies, we will be using such technologies to gather the homework submissions. We will not be using CANVAS so we teach you these technologies that are often mandated in industry. CANVAS is not.

As a consequence, all technology papers from all students will be available as a single big technical report. To achieve this all reports must be written in the same format. This will be LaTeX and all references have to be provided a bibtex file while you use jabref. Alternatively, lyx.org can be used, if you prefer to edit latex in what you see is almost what you get format. The use of sharelatex or overleave or lyx.org is allowed.

Piazza

All communication will be done via Piazza. We will not read e-mail send to our university or private e-mails. All instructors are following this rule. Any mail that is not sent via Piazza will be not read and deleted. This is also true for any mail send to the inbox system in CANVAS. We found CANVAS a not scalable solution for our class and will not use CANVAS for reaching out to you. If you need a different mechanism to communicate with us, please ask on Piazza how to do that. Please note that private posts in piazza are shared among all instructors and TAs.

To sign up in piazza please follow this link:

We have created a number of piazza folder to organize the posts int topics. Thes folders are:

help:
Our help folder is just like a ticket system that is monitored by the TA’s. Once you file a question here, a TA will attend to it, and work with you. Once the issue you had is resolved, the TA is marking it as resolved. If you need to reopen a help message, please mark it again as unresolved or post a follow up.
project:
Questions related to projectsis posted here.
logistics:
Question regarding the logistics of the class are posted here. This includes questions about communication, meeting times, and other administrative activities.
papers:
Questions regarding the paper are posted here.
grading:
Questions regarding grades are posted here.
clouds:
Questions regarding cloud resources are posted here.
faq:
Some questions will be transformed to FAQ’s that we post here. Note also that we have an FAQ on the class Web page that you may want to visit. We try to move important FAQ’s from Piazza into the Web page, so it is important that you check both.
meetings: Here we will post times for meetings with TA’s and
instructors that are not yet posted on the Web page as part of the regular meeting times. Class participants are allowed to attend any Zoom meeting that we announce in this folder. For online students we will also determine a time for regular meetings. The TAs are required to hold 10 hours of meeting times upon request with you. Please make use of this.
other:
In case no folder matches for your question use other.

Tips on how to achieve your best

While teaching our classes we noticed the following tips to achieve your best:

  • Listening to the lectures
  • Set aside enough undisturbed time for the class. Switch off facebook, twitter, or other distracting social media systems when focussing on the class.
  • Ask for help. The TAs can schedule custom help office hours on appointment during reasonable times.
  • Do not Procrastinate.
  • Do not take your other classes more serious.
  • Start the project in the first 4 weeks of the class
  • Be aware that this class is not based on a text book and what this implies.
  • Do not overestimate the technical abilities.
  • Do not underestimate the time it takes to do the project.
  • Do not forget to include benchmarks in your project.
  • Unnecessarily struggling with LaTeX as you do not use an example we provide.
  • Trying to do things just on Windows which is typically more difficult than using Linux.
  • Not having a computer that is up to date. Update your memory and have a SSD
  • Ignoring obvious security rules and not integrating ssh form the start into your projects.
  • Not posting passwords into git. For example git does not allow to easily completely delete files that contain secret information such as passwords. It takes significant effort to do that. Make sure you do add in git on individual files and never just a bulk add.
  • Having your colleagues do the work for you
  • Underestimating the time it takes to do deployments
  • Not reading our piazza posts and repeating the same question over and over
  • Use Piazza to communicate and not CANVAS or e-mail.
  • When you receive an e-mail from piazza, reply to it while clicking on the link instead of replying via e-mail directly. This is more reliable.

Submissions

Your papers and projects will be developed on GitHub and submitted using Pull Requests. The process is as follows:

  1. fork the sp17-i524 repository.
  2. clone your fork and commit and push your changes.
  3. submit a pull request to the master branch of the origin repository.

See the repository for details on the individual assignments. As it will periodically be updated, make sure you are familiar with the process of Syncing a fork.

Some things to keep in mind:

  • space on github is limited, so do not add datasets to the repository. Any datasets you use should be publicly hosted and deployed as part of your project ansible deployment scripts.
  • never add ssh private keys to the repository. This results in a security risk, possible point deductions, and lots of time and effort to fix.
  • all work will be licensed under the Apache 2 open source license.
  • all submissions and discussion will be visible to the world.

Selected Project Ideas

Students can select from a number of project ideas. We will improve and add additional ideas throughout the semester. Students can also conduct their own projects. We recommend that you identify a project idea by the end of the first month of the class. Example project descriptions that you may want to take a look at include:

  • Virtual Robot Swarm
  • Virtual Cluster with Docker Swarm
  • Virtual Cluster with Kubernetes
  • Virtual SLURM Cluster
  • Author Disambiguty Problem
  • NIST Big Data Working group examples: Selected and approved use case from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf
  • Selected examples from Fall I523: Some students may have created an example as part of I523. Not all examples created as part of this class qualify for an I524 project. Please contact Gregor von Laszewski via Piazza to discuss suitability of your previous I523 project. If such a project is selected, approved and used it is expected it is significantly enhanced.
  • Cloudmesh Enhancements: A number of projects could center around the enhancements of cloudmesh for the improvement of big data projects using virtual machines and containers. This includes:
    • Development of REST services for cloudmesh while using cloudmesh client
    • Development of benchmarking examples while using cloudmesh client
    • Development of a better Azure interface to additional services
    • Development of a better AWS interface to additional services
    • Development of a Web interface while using django
    • SLURM integration to create virtual clusters on comet
    • Port cloudmesh client to Windows 10
    • Integrate docker into cloudmesh and demonstrate its use
    • Integrate kubernetes into cloudmesh and demonstrate its use
    • Expand the HPC capabilities of cloudmesh

Software Project

For a software project you have the choice of working individually or working in a team of up to three students. You can use the search teammate folder to find and form groups:

The following artifacts are part of the deliverables for a project

Code:
You must deliver the code in github. The code must be compilable and a TA may try to replicate to run your code. You MUST avoid lengthy install descriptions and everything must be installable from the command line. We will check submission. All team members must be responsible for one part of the project.
Project Report:

A report must be produced while using the format discussed in the Report Format section. The following length is required:

  • 4 pages, one student in the project
  • 6 pages, two students in the project
  • 8 pages, three students in the project
Work Breakdown:

The report contains in an appendix a section that is only needed for team projects. Include in the section a short but sufficiently detailed work breakdown documenting what the team has done. Back it up with commit information from github. Such as how many commits and lines of code a team member has contributed. The section does not count towards the overall length of the paper.

In addition, the graders will check the history of checkins to verify each team member has used github to checkin their contributions frequently. E.g. if we find that one of the students has not checked in code or documentation in the same way at other teammates, it will be questioned. An oral exam may be scheduled to verify that the student has contributed to the project. In an oral exam the student must be familiar with all aspects of the project not just the part you contributed.

License:
All projects are developed under an open source license such as Apache 2.0 License. You will be required to add a LICENCE.txt file and if you use other software identify how it can be reused in your project. If your project uses different licenses, please add in a README.md file which packages are used and which license these packages have while adding a licenses file.
Reproducibility:
The reproducibility of your code will be tested twice. It is tetes by another student or team, it is also tested by a TA. A report of the testing team is provided. Your team will also be responsible for executing as many tests as you have team members on other projects. A reproducibility statement should be written with details about functionality, readability, and report quality. This statement does not have to be written in latex but uses RST.
Requirements:
  • Use of cloud resources are mandatory and can be substituted by kubeernetes or docker swarm
  • Deployment must be done with ansible
  • A Makefile or a cmd file as discussed in class is needed to deploy the software, start the program, conduct a parameter study/benchmark
  • Report
  • If project is conducted in a team at least two clouds are to be benchmarked and contrasted 2 team members = 2 clouds, 3 team members = 3 clouds. cloud could also be kubernetes or docker swarm
  • Cloudmesh client is to be used to start the virtual cluster in order to avoid reinventing the wheel
  • Cloudmesh contains deployments for hadoop and spark. If these technologies are used, it has to be shown that if the student(s) elect to write a new ansible script for it that it is better than the once provided by cloudmesh. Proof is to be provided by reproducible benchmarks. If this can not be achieved the student(s) have to write an additional ansible script for a technology listed in class or approved by the professor.
Additional links form another class:

This other class contains a section deployment projects that may inspire you. You can look at suggestions and conduct them, the rules listed under requirements above applies: e.g. deployment must be done in ansible and it must be done on a cloud, kubernetes, or docker swarm. I524 will not focus on analytics. However you still are able to do that, but it still must contain a deployment portion.

Report Format

All reports will be using the format specified in Section Documenting Scientific Research.

There will be NO EXCEPTION to this format. Documents not following this format or not looking professional will be returned without review.

Github repositories

Class content repository: https://github.com/cloudmesh/classes

Class homework repository: https://github.com/cloudmesh/sp17-i524

Code Repositories Deliverables

Code repositories are for code, if you have additional libraries or data that are needed you need to develop a script or use a DevOps framework to install such software. They must not be checked into github. Thus zip files and .class, .o, precompiled python, .exe, core dumps, and other such files are not permissible in the project. If we find such files you will get a 20% deduction in your grade. Each project must be reproducible with a simple script. An example is:

git clone ....
make install
make run
make view

Which would use a simple make file to install, run, and view the results. Naturally you can use ansible or shell scripts. It is not permissible to use GUI based DevOps preinstalled frameworks (such as the one you may have installed in your company or as part of another project). Everything must be installable from the command line.

Learning Outcomes

Students will

  1. gain a broad understanding of Big Data applications and open source technologies supporting them.
  2. have intense programming experience in Python and ansible and DevOps.
  3. use open source technologies to manage code in large groups of individuals.
  4. be able to communicate research in professional scientific reports.

Outcome 1 is supported by a series of lectures around open source technologies for big data.

Outcome 2 is supported by a significant software project that will take up a considerable amount of time to plan and execute.

Outcome 1 and 4 is supported by writing 3 technology papers and a project report that is shared with all students. Students can gain additional insight from reading and reviewing other students contributions.

Outcome 3 is supported by using piazza and github as well as contributing to the class Web page with git pull requests.

Academic Integrity Policy

We take academic integrity very seriously. You are required to abide by the Indiana University policy on academic integrity, as described in the Code of Student Rights, Responsibilities, and Conduct, as well as the Computer Science Statement on Academic Integrity (http://www.soic.indiana.edu/doc/graduate/graduate-forms/Academic-Integrity-Guideline-FINAL-2015.pdf). It is your responsibility to understand these policies. Briefly summarized, the work you submit for course assignments, projects, quizzes, and exams must be your own or that of your group, if group work is permitted. You may use the ideas of others but you must give proper credit. You may discuss assignments with other students but you must acknowledge them in the reference section according to scholarly citation rules. Please also make sure that you know how to not plagiarize text from other sources while reviewing citation rules.

We will respond to acts of plagiarism and academic misconduct according to university policy. Sanctions typically involve a grade of 0 for the assignment in question and/or a grade of F in the course. In addition, University policy requires us to report the incident to the Dean of Students, who may apply additional sanctions, including expulsion from the university.

Students agree that by taking this course, papers and source code submitted to us may be subject to textual similarity review, for example by Turnitin.com. These submissions may be included as source documents in reference databases for the purpose of detecting plagiarism of such papers or codes.

It is not acceptable to use for pay services to conduct your project. Please be aware that we monitor such services and have TAs speaking various languages and know about these services even in different countries. Also do not just translate a report written by someone in a different language and claim it to be your project.

Course Numbers

This course is offered for Graduate (and Undergraduate students with permission) at Indiana University and as an online course. To Register, for University credit please go to:

Please, select the course that is most suitable for your program:

INFO-I 524  BIG DATA SOFTWARE AND PROJECTS (3 CR)    Von Laszewski G
         Above class open to graduates only
         Above class taught online
         Discussion (DIS)
      30672 RSTR     09:30A-10:45A   M      I2 130    Von Laszewski G
         Above class meets with ENGR-E 599
INFO-I 524  BIG DATA SOFTWARE AND PROJECTS (3 CR)
      30673 RSTR     ARR             ARR    ARR       Von Laszewski G
         Above class open to graduates only
         This is a 100% online class taught by IU Bloomington. No
         on-campus class meetings are required. A distance education
         fee may apply; check your campus bursar website for more
         information

 ENGR-E 599  TOPICS IN INTELL SYS ENGINEER (3 CR)
       VT: BIG DATA SOFTWARE AND PROJECTS
          ***** RSTR     ARR             ARR    ARR       Von Laszewski G
             Discussion (DIS)
       VT: BIG DATA SOFTWARE AND PROJECTS
          33924 RSTR     09:30A-10:45A   M      I2 130    Von Laszewski G
             Above class meets with INFO-I 524