The main activity of the course will be building a significant project using multiple subsystems combined with user code and data. Projects will be suggested or students can choose their own. A project report will summarize the work conducted.
The topics taught in this class are highly relevant to industry: you are exposed not only to big data but also, in practice, to DevOps and collaborative code development tools as part of your homework and project assignment.
Each project must be approved by the TAs and the Professor. Approval is an iterative process in which students first provide a 2-page description of the project detailing the project and its execution plan. This description is a snapshot of a draft of the actual report that will be handed in at the end. Thus, you do not have to write phrases such as "we propose" or "in this proposal". You can directly write "in this report …". You can even write the project report directly and leave sections out or put in TBD as placeholder text. It is important that the abstract and the introduction are there, as well as an execution plan (e.g. reminding you what you can realistically do when).
Students can select from a number of project ideas. We will improve and add additional ideas throughout the semester. Students can also conduct their own projects. We recommend that you identify a project idea by the end of the first month of the class. Example project descriptions that you may want to take a look at include:
Example projects can also include the following, but must include a benchmark
Datasets that may inspire a project, and that you could benchmark, are
For NIST Projects:
If you have not yet done an ansible deployment as part of your I523 project, you are allowed to continue that project as part of this class. Please note that I523 did not require you to conduct a deployment and a benchmark. I524 requires you to conduct a deployment with ansible and cloudmesh client, as well as to benchmark the application on a real cloud (e.g. chameleoncloud.org).
Due to the variability of the projects, we do not have a sample report for a successfully conducted project. However, the papers written in class, as well as the homework to develop an ansible deployment, will give you sufficient clarity on how to be successful.
Students of this class will need to conduct their project deployments in Python using ansible, enabling a software stack that is useful for big data analysis. You will be expected to have a computer on which you have python 2.7.x installed. You will be using chameleoncloud.org and possibly our local cloud. Optionally, some projects may use docker. If your project uses docker, you can use Dockerfiles, but you still need to show it running on 3 different computers.
If your project uses neither, you have to make sure that you hand in a software stack deployment, done very well, of some software related to the 300+ software systems. You can pick what you want, but it should not be as simple as installing emacs or R. Examples are a sharded mongodb or cassandra deployment, or a distributed deployment of hadoop (some students asked for this one even though we already had one like it; based on student feedback we allow it). Many others are possible; ask for approval, however.
Some students may elect to choose, as the homework to deploy a technology with ansible, a technology that is actually used as part of the project. This is naturally a very good way of minimizing your work while building and expanding upon the technology homework you elect to conduct. Points may depend on the completeness and effort of the deployment. Technology deployments should overlap as little as possible. In many cases, if you choose wisely, such deployments may line up with your technology papers, as you can add a section reporting on your achievement and experience with such deployments.
Groups of up to three students can work on a project, but the workload increases with each student and a work breakdown must be provided. More than three students are not allowed. If you work in a group, you will be asked to deploy a larger system or demonstrate deployability on multiple clouds or container frameworks while benchmarking and comparing them. A group project with 2 or 3 team members should not look like a project done by an individual. Please plan carefully and make sure all team members contribute.
As we get this question often: No, we will not allow more than three students to participate in a project. Please do not ask.
We monitor your progress in Github and you will get discussion points for this. Thus it is imperative that you check in frequently: make frequent commits to the Github repository, as the activities will be monitored and integrated into the project grade. For example, if you elect to just check in your project at the end of the semester without otherwise using Github, you will miss points.
Note that the paper and the project will take a considerable amount of time, and proper time management is a must for this class. Avoid starting your project late. Procrastination does not pay off. Too often we see a student starting their project in the week before it is due. We can guarantee you this will be problematic. To force you to think about your time management, we require that your report contains a section Project Execution Plan that documents approximately when you will do what.
We will not accept any bonus projects or secondary projects, as we want you to focus on your class project. If you have time to do a second project, we recommend you add or integrate it into your actual project so you can achieve your best. One excellent project is better than two good projects.
If, however, you find that the work you do could lead to a publishable paper, you could work together with the course instructor as coauthors on such an activity. This is going to be a significant effort, and you need to decide whether you want to undertake it. In such cases, if the work is sufficient for publication submission, an A+ for the class could be considered. It will be a lot of work. The length of such a paper is typically 10-12 high-quality pages including figures and references. We may elect to use a different LaTeX style for the final submission.
All project related discussions must be conducted in the piazza folder.
Some students from the class asked for a precise grading scheme. However, based on previous observations in other classes, a truly outstanding project will not really need a grading scheme.
However, since we were asked, we propose the following:
ansible 30%
benchmarking 30%
paper 30%
wow factor 10%
The wow factor is given if one of the three other components of the project is impressively well done.
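The weighting above amounts to a simple weighted sum. The following is a minimal sketch of that arithmetic, assuming each component is scored between 0 and 100; the scoring scale and the function name `project_score` are illustrative assumptions and not part of the official grading.

```python
# Weights taken from the grading scheme above (percent of the project grade).
WEIGHTS = {"ansible": 30, "benchmarking": 30, "paper": 30, "wow factor": 10}


def project_score(scores):
    """Return the weighted project score on a 0-100 scale.

    `scores` maps each component name to a score between 0 and 100.
    The 0-100 scale is an assumption made for this illustration.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError("missing components: {}".format(sorted(missing)))
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0


if __name__ == "__main__":
    example = {"ansible": 90, "benchmarking": 80, "paper": 85, "wow factor": 100}
    print(project_score(example))  # 86.5
```

With these made-up scores, the three main components contribute 27, 24, and 25.5 points and the wow factor contributes 10.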
Be reminded that the benchmark must involve multiple VMs. If you work as a team, the benchmark must include multiple clouds.
The following tips have been issued and especially apply to the project:
The following artifacts are part of the deliverables for a project
A report must be produced while using the format discussed in the Report Format section. The following length is required:
The report contains, in an appendix, a section that is only needed for team projects. Include in this section a short but sufficiently detailed work breakdown documenting what each team member has done. Back it up with commit information from github, such as how many commits and lines of code each team member has contributed. The section does not count towards the overall length of the paper.
In addition, the graders will check the history of checkins to verify that each team member has used Github to check in their contributions frequently. E.g. if we find that one of the students has not checked in code or documentation in the same way as the other teammates, it will be questioned. An oral exam may be scheduled to verify that the student has contributed to the project. In an oral exam, the student must be familiar with all aspects of the project, not just the part they contributed.
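One possible way to gather the commit counts for the work breakdown is sketched below. It relies on the standard `git shortlog -s -n` command; the helper names `parse_shortlog` and `commits_per_author` are hypothetical, and this is not the tool the graders use.

```python
import subprocess


def parse_shortlog(text):
    """Parse `git shortlog -s -n` output into (author, commit_count) pairs."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "<count><TAB><author name>".
        count, author = line.split(None, 1)
        result.append((author, int(count)))
    return result


def commits_per_author(repo_dir="."):
    """Run git in `repo_dir` and return commit counts per author."""
    # HEAD is given explicitly; without a revision, shortlog reads from
    # standard input when it is not attached to a terminal.
    out = subprocess.check_output(
        ["git", "shortlog", "-s", "-n", "HEAD"], cwd=repo_dir
    )
    return parse_shortlog(out.decode("utf-8"))
```

Calling `commits_per_author()` inside a clone of the project repository returns one pair per author, ordered by number of commits. Lines of code per author would need a separate query, for example based on `git log --numstat`.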
All reports will be using the format specified in Section Documenting Scientific Research.
There will be NO EXCEPTION to this format. Documents that do not follow this format or do not look professional will be returned without review. The format is the same format that we use for the technology papers. Some additional information is provided in the technology paper template.
Class homework repository: https://github.com/cloudmesh/sp17-i524
Code repositories are for code. If you need additional libraries or data, you must develop a script or use a DevOps framework to install them; they must not be checked into github. Thus zip files, .class, .o, precompiled python, .exe, core dumps, and other such files are not permissible in the project. If we find such files, you will get a 20% deduction in your grade. Each project must be reproducible with a simple script. An example is:
git clone ....
make install
make run
make view
This would use a simple makefile to install, run, and view the results. Naturally, you can also use ansible or shell scripts. It is not permissible to use GUI-based, preinstalled DevOps frameworks (such as one you may have installed in your company or as part of another project). Everything must be installable from the command line. In many cases, it is better not to use shell scripts but to use the python CMD or, even better, the CMD5 tools as presented in class.
The project is submitted to Github, into your project directory. We will refine this section, but the code must be submitted here. No compiled code or data is accepted in this directory. We expect you to make weekly pull requests.
If you are working in a team, we will set up a “special project directory” for you, so you need to announce teams on Piazza. A post will be made to collect the team information.
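Before making a pull request, it may help to scan your project directory for files that are not permitted. The sketch below is one way to do that; the extension list is an interpretation of the file types named above (with `.pyc` and `.pyo` standing in for "precompiled python"), and the function names are hypothetical. It is not the check the graders run.

```python
import os

# Extensions named in the text above; ".pyc"/".pyo" stand in for
# "precompiled python" (an interpretation, not an official list).
FORBIDDEN_EXTENSIONS = {".zip", ".class", ".o", ".pyc", ".pyo", ".exe"}


def is_forbidden(filename):
    """Return True if `filename` looks like a file that must not be committed."""
    name = os.path.basename(filename)
    ext = os.path.splitext(name)[1].lower()
    # Core dumps are commonly named "core" or "core.<pid>".
    is_core = name == "core" or (name.startswith("core.") and name[5:].isdigit())
    return ext in FORBIDDEN_EXTENSIONS or is_core


def find_forbidden(root="."):
    """Walk `root` and return the sorted paths of all forbidden files."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            if is_forbidden(filename):
                hits.append(os.path.join(dirpath, filename))
    return sorted(hits)
```

Running `find_forbidden(".")` from the top of your project directory returns an empty list when nothing objectionable is found. Data files are not detected by this sketch, since they have no fixed extension.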
A README.rst file needs to be included that contains the following information (please be mindful of the spaces; there is an empty line between each field, and additional fields may need to be added as the project proceeds):
group: no
project_url: url to the project directory
title: Your Project Title in CamelCase
author: Firstname Lastname
HID: your HID
piazza: your piazza id
github: your github id
repository: the link to the report folder
proposal: report-proposal.pdf
proposal_submission: mm/dd/2017 hh:mmam
report: report.pdf
report_submission: mm/dd/2017 hh:mmam
status: short one line non breaking sentence about where you are (updated weekly)
dataset_url: url of the dataset, do not store in repo
deployment: short description of what you deploy
abstract: a copy of the abstract, make sure to use proper
indentation in RST format
Bibtex Entry
------------
@TechReport{Project_ID_or_HID-project,
author = {},
title = {},
institution = {Indiana University},
year = {2017},
type = {Class Project Report},
number = {your HID or project id},
address = {Course I524, Spring 2017},
month = apr,
url = {url of the report.pdf}
}
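Because the proceedings are generated from this file, it can be useful to check that no field is missing before you submit. The following is a minimal sketch of such a check, assuming the `name: value` layout shown above with wrapped values continued on the following lines; the function names are hypothetical and this is not the script used to build the proceedings.

```python
import re

# Field names copied from the README.rst template above.
REQUIRED_FIELDS = [
    "group", "project_url", "title", "author", "HID", "piazza", "github",
    "repository", "proposal", "proposal_submission", "report",
    "report_submission", "status", "dataset_url", "deployment", "abstract",
]

# A field starts with a name, a colon, and then whitespace or end of line,
# so a wrapped line that begins with a URL is not mistaken for a field.
FIELD_PATTERN = re.compile(r"^([A-Za-z_]+):(?:\s+(.*))?$")


def parse_readme_fields(text):
    """Return a dict of the `name: value` fields found before the BibTeX part."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Bibtex Entry":
            break  # everything after this heading is the BibTeX record
        if not line:
            continue
        match = FIELD_PATTERN.match(line)
        if match:
            current = match.group(1)
            fields[current] = (match.group(2) or "").strip()
        elif current is not None:
            # Continuation line, e.g. a wrapped abstract.
            fields[current] = (fields[current] + " " + line).strip()
    return fields


def missing_fields(text):
    """List required fields that are absent or empty."""
    fields = parse_readme_fields(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

For example, `missing_fields(open("README.rst").read())` returns the names of all fields that still need to be filled in.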
You will need to communicate via Piazza with the TAs, who will set up a repository for you. The Github names of all team members need to be listed in that request.
Each author has to go to their HID repository and fill out the README.rst while making sure the values are set as follows:
group: yes
project_url: url to the project directory, that will be assigned to you
After the project directory is created, fill out the README.rst just as you would for a single user, but list all authors in the author field. Use a comma to separate authors.
Please note that we automatically create a proceedings from the README.rst files of all students. If you have not filled out the README.rst, we will not be able to see your submission.