Homework

Assignments

Unless otherwise stated, homework is the same in all sections and classes. Lectures are assigned on Fridays and homework is due on the Friday of the following week. The first week of the semester is the exception: lectures are assigned on Monday (22nd of August) and the first homework is due on Friday. We have therefore not posted explicit due dates, as they follow from the calendar. You are welcome to work ahead, but check back in case the homework has been updated. Any additional due dates will be posted in Canvas; please check Canvas for them.

As you will be doing some discussions, please PREFACE YOUR POSTS with your Full Name.

Homework Submission is done as follows:

  1. All assignments will be posted through Canvas.
  2. You will be provided with a GitLab folder once you register at https://about.gitlab.com/
  3. You will complete your assignments and check your solutions in to your gitlab.com repository (see ../../lesson/prg/gitlab).
  4. You will submit a link to your solution in GitLab to Canvas.

Study groups

Forming study groups to discuss the content of the class is very common and encouraged. However, such groups must not be used to copy homework assignments that are intended for individual submission.

When working in a team, we recommend that you use English as the communication language. This will help those who are not native English speakers.

Week 1

Communication

Resources res1

Survey 1

Please fill out the Survey so that we can better help you with the course.

Video V1

Watch Videos in Section 1: Units 1 and 2 at the Course Page Syllabus

Video V2

Watch Videos in Section 2: Units 3, 4, and 5. Note that these units overlap with Unit 2 of Section 1 (see Syllabus).

Discussion d1

Consider Discussion d1 after Section 1. Please create a new post on the topic “Why is Big Data interesting to me” and also comment on at least 2 other posts.

Paper p1

This assignment may be conducted as a group of at most two students. It is up to you to find another student, or you can write the paper by yourself. There is no need to keep this team for the rest of the semester or for the project assignment; you can build new teams throughout the semester for different homework. Make sure your team members contribute equally.

This assignment requires you to write a paper that is 2 pages in length. Please use the 2 column ACM proceedings format.

  • Conduct the Discussion homework first.
  • Review what plagiarism is and how to avoid it
  • Install jabref and organize your citations with jabref

Write a paper discussing all of the following topics:

  • What is Big Data?
  • Why is Big Data interesting to me? (Summarize and/or contrast positions in the discussion list. This is not just your position. See our note below.)
  • What limitations does Big Data Analytics have?
  • If you work in a team please also discuss different positions if there are any. Make sure the work is shared and no academic honesty policy has been violated.

Please note that you need to analyze the discussion that took place on the discussion list. It is important that you summarize the positions and identify a mechanism to evaluate the students' responses. One option is to augment your discussion with classifications and statistics, which you may include as figures in the paper. Alternatively, you may just highlight selected points raised by the course members.

You will be submitting the paper in gitlab.com as discussed in:

http://bdaafall2016.readthedocs.io/en/latest/gitlab.html

You will be uploading the following files into the paper1 directory:

paper1.tex
sample.bib
paper1.pdf

After you upload the files, please go to Canvas and fill out the form for the paper1 submission. You will have to upload the appropriate links.

Week 2


Video V3

Please watch Section 3 Unit 6. Total length: 2.5 hours (see Syllabus).

Discussion d3

Consider Discussion d3 after Section 3. Please post about the topic “Where are the Big Data Jobs now and in future? Discuss anything you can share – areas that are hot, good online sites etc.” and also comment on at least 2 other posts.

Paper p2

This assignment requires you to write a paper that is two pages in length. Please use the 2 column ACM proceedings format. Write a paper discussing the following topics:

  • What is the role of Big Data in health?
  • Discuss any or all areas from telemedicine, personalized (precision) medicine, personal monitors like Fitbit, privacy issues.

You will be submitting the paper in gitlab.com as discussed in:

http://bdaafall2016.readthedocs.io/en/latest/gitlab.html

You will be uploading the following files into the paper2 directory:

paper2.tex
sample.bib
paper2.pdf

After you upload the files, please go to Canvas and fill out the form for the paper2 submission. You will have to upload the appropriate links.

A video of how to use the web browser to upload the paper is available at:

Video in cc: TBD

References R1

It is important that you know how to cite. Please see the page n-resources for guidelines.

Bonus points: Use d2 to discuss the topic of crowdsourcing in relation to Big Data. Conduct research if needed.

Week 3

Video V4

Please watch Section 4 Units 7-9. Total length: 3.5 hours (see Syllabus).

Discussion d4

Consider Discussion d4 after Section 4. Please post on the topic “Sports and Health Informatics”:

  • Which are the most interesting job areas?
  • Which are likely to make the most progress?
  • Which one would you work in, given similar offers in both fields?
  • Comment on at least 2 other posts.

Paper p3

This assignment requires you to write a paper that is one to two pages in length. Please use the 2 column ACM proceedings format.

This assignment may be conducted as a group of at most two students. It is up to you to find another student, or you can write the paper by yourself. There is no need to keep this team for the rest of the semester or for the project assignment; you can build new teams throughout the semester for different homework. Make sure your team members contribute equally.

Choose one of the alternatives:

Alternative A:

Using what we call Big Data (such as video) and Little Data (such as baseball numerical statistics) in Sports Analytics. Write a paper discussing the following topics:

  • Which offers the most opportunity, and in which sports?
  • How are Big Data and Little Data applied to the 2016 Olympics?

Alternative B (This assignment gives bonus points if done right):

How can Big Data and Little Data be used in wildlife conservation, pets, farming, and other related areas that involve animals? Write a 2 page paper that covers the topic and addresses:

  • Which opportunities are there related to animals?
  • Which opportunities are there for wildlife preservation?
  • What limitations are there?
  • How can Big Data best be used? Give concrete examples.
  • This paper may be longer than two pages if you like; additional pages are allowed.
  • You are allowed to work in a team of up to six. The number of pages is determined by the team members; the minimum is 2 pages. The team must identify who did what.
  • The paper must nevertheless be coherent and consistent.
  • When building teams, the entire team must approve the team members.
  • If a team does not want you to join, you need to accept this. Look for another team or work alone.
  • Use gitlab to share your LaTeX document or use Microsoft OneDrive to write it collaboratively.

Week 4

Video V5

see next section

Development Virtual Machine

To develop code easily without affecting your local machine, we will be using Ubuntu Desktop in a virtual machine running on your computer. Please make sure your hardware supports this. For example, a Chromebook is insufficient.

The detailed description, including 3 videos, is posted at:

Please complete Homework 1, 2 & 3 from that page.

Next you will be using Python in that virtual machine.

Note

You can use your native OS to do the programming assignment. However, if you would like to use any cloud environment, you must also set up the development virtual machine, as we want you to get a feeling for how to use Ubuntu before you go on the cloud.

Programming prg1: Python

Hardware:
Identify a suitable hardware environment that works for you to conduct the assignments. First you must have access to a sufficiently powerful computer. This could be your Laptop or Desktop, or you could get access to machines at IU’s computer labs or virtual machines.
Set up Python:
Next you will need to set up Python on the machine or verify that Python works. We recommend that you use Python 2.7 and NOT Python 3. We recommend that you follow the instructions from python.org and use virtualenv. As an editor, we recommend PyCharm or Emacs.
Canopy and Anaconda:
We had bad experiences with Canopy as well as Anaconda on a Teaching Assistant's machine. Therefore we recommend against using these systems. It is up to you to determine whether these systems work for you. We do recommend that you use python.org and virtualenv. If you have already started using Canopy or Anaconda, you may continue to do so (but we do not recommend it).
Useful software:
Tasks:
  • Learn Python, e.g. go through the python_big_data (and python_intro if you need to) lesson.

  • Use virtualenv and pip to customize your environment.

  • Learn Python pandas <http://pandas.pydata.org/> and do a simple Python application demonstrating:

    • a linechart
    • a barchart, e.g. a histogram

    Find some real meaningful data such as number of people born in a year or some other more interesting data set to demonstrate the various features.

  • Review of Scipy: look at the scipy manual and be aware of what you can do with it in case you choose a project.

Deliverables prg1:

The goal of this assignment is to choose one or two datasets (see datasets), preprocess them to clean them up, and generate a line graph and a histogram plot. Your figures must provide labels for the axes along with units.

Submit your programs in a folder called prg1, which must contain the following:

  • requirements.txt: list of python libraries your programs need as installable by: pip install -r requirements.txt
  • fetchdata.py: a python program that, when run as python fetchdata.py, will produce dataset files in CSV format called data-line.csv and data-hist.csv.
  • linechart.py: a python program that, when run as python linechart.py data-line.csv, will generate a line chart and save it in PNG format to a file called linechart.png.
  • histogram.py: a python program that, when run as python histogram.py data-hist.csv, will generate a histogram plot and save it in PNG format to a file called histogram.png.
  • README.rst: an RST format file which documents the datasets you used, where you fetched them from, and how fetchdata.py cleans them to generate the data-{line,hist}.csv files.
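
To illustrate the kind of cleaning step fetchdata.py is expected to perform, here is a minimal sketch that uses only the Python standard library. The column names (year, births), the sample rows, and the helper names clean_rows and write_csv are assumptions made for this example; the assignment does not prescribe them. Fetching the raw data is left out.

```python
import csv


def clean_rows(raw_rows, x_col, y_col):
    """Keep only rows whose x and y values parse as numbers.

    raw_rows is an iterable of dicts (e.g. from csv.DictReader).
    Rows with missing or non-numeric values are dropped.
    """
    cleaned = []
    for row in raw_rows:
        try:
            x = int(row[x_col])
            y = float(row[y_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip incomplete or malformed rows
        cleaned.append({x_col: x, y_col: y})
    return cleaned


def write_csv(rows, path, fieldnames):
    """Write the cleaned rows to a CSV file with a header line."""
    with open(path, "w") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample rows (not real data); a real fetchdata.py would
# download or read the raw dataset instead.
raw = [
    {"year": "2010", "births": "100.5"},
    {"year": "2011", "births": ""},      # missing value, dropped
    {"year": "2012", "births": "n/a"},   # not a number, dropped
    {"year": "2013", "births": "98.0"},
]
rows = clean_rows(raw, "year", "births")
print(rows)
```

Calling write_csv(rows, "data-line.csv", ["year", "births"]) would then produce the file that linechart.py reads. Whichever cleaning rules you actually apply should be documented in README.rst.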

Warning

Missing items will result in zero points being given.
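
Before submitting, it may help to check the prg1 folder against the list of deliverables. The following is a small sketch of such a check; the function name missing_items is hypothetical, and the file list is copied from the deliverables above.

```python
import os

# File names taken from the prg1 deliverables list.
REQUIRED_FILES = [
    "requirements.txt",
    "fetchdata.py",
    "linechart.py",
    "histogram.py",
    "README.rst",
]


def missing_items(folder):
    """Return the required prg1 files that are absent from folder."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(folder, name))]
```

Running missing_items("prg1") from the top of your repository returns an empty list when every required file is present. It only checks that the files exist, not that the programs run.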

Term Paper and Term Project Report Assignment T1

Please prepare for the selection process for a project or a term paper:

  • Review the guidelines for the project and term paper.
  • Identify if you are likely to do a project or a term paper
  • Build teams and choose your team members wisely. For example, if you have 3 people in the team and only two do the work, you still get graded based on a 3 person team.
  • Decide on a topic and a team. Commit to them by the end of Week 5.
  • For that week the homework also includes making a plan for your term paper and writing a one page summary, which we will approve and comment on. If you are in a team, each student must submit an (identical) plan with a note on teaming. Note that teaming can change in the actual final project.
  • You will be completing this Form throughout the semester, uploading the title, the team members, the location of your proposal in gitlab with a direct URL, a description of the artifacts, and the final project report.

Discussion d5

Create a NEW post to discuss the final project you want to do and to look for team members (if you want to build a team).

Week 5

Video S6

Watch the video in Section 6 (see Syllabus).

Futuresystems

  • Obtain an account on Futuresystems.org and join project FG511. Note that this will take time, so you need to do this ASAP. No late assignments will be accepted. If you are late, this assignment will receive 0 points.

    Which account name should I use?:

    The same name as you use at IU to register. If you have had a previous class and used a different name, please let us know, so we can make a note of it. Please do not apply for two accounts. If your account name is already taken, please use a different one.

ChameleonCloud

  • Obtain an account on https://www.chameleoncloud.org. Fill out the Poll TBD. (This assignment is optional, but we have had good experiences with Chameleon Cloud, so we advise you to get an account. As you are a student, you will not be able to create a project. We will announce in due time the project that you can join to use Chameleon Cloud.)

OpenStack

  • Inform yourself about OpenStack and how to start and stop virtual machines via the command line.
  • Optionally, you can use cloudmesh_client for this (If you use cloudmesh client you will get bonus points).

prg2 (canceled)

Consider the Python code available on the Section 6 Unit 13 Files tab (the third one) as HiggsClassIIUniform.py. This software is also available online (see the link at the end of this section). When run, it should produce results like the file TypicalResultsHW5.docx on the same tab. This code corresponds to 42000 background events and 300 Higgs events. The background is uniformly distributed and the Higgs signal follows a Normal (Gaussian) distribution centered at 126 with a width of 2. Produce 2 more figures (plots) corresponding to experiments with a factor of 10 more or a factor of 10 less data. (Both Higgs and background increase or decrease by the same factor.) Return the two new figures and your code as Homework in github under the folder prg2.

What do you conclude from the figures about the ability to see the Higgs particle with different amounts of data (corresponding to different lengths of time the experiment runs)?

Due date: October 25

Video V6: Review/study Section 7 Units 12-15; total 3 hours 7 minutes. This is the Physics Informatics Section.

https://github.com/cglmoocs/bdaafall2015/tree/master/PythonFiles/Section-4_Physics-Units-9-10-11/Unit-9_The-Elusive-Mr.-Higgs
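
The scaling experiment described above can be sketched with the standard library alone. This is not the HiggsClassIIUniform.py code, only an illustration under stated assumptions: the background range of 110 to 140, the bin width of 2, and the fixed random seed are chosen for this example, and the "width of 2" is interpreted as the standard deviation of the Gaussian.

```python
import random


def simulate_masses(scale=1.0, low=110.0, high=140.0, seed=0):
    """Generate background and Higgs-like events.

    scale multiplies both the 42000 background and the 300 signal
    events. low and high bound the uniform background (an assumption).
    """
    rng = random.Random(seed)
    n_background = int(round(42000 * scale))
    n_signal = int(round(300 * scale))
    background = [rng.uniform(low, high) for _ in range(n_background)]
    # Signal: Gaussian centered at 126, standard deviation 2 (assumed).
    signal = [rng.gauss(126.0, 2.0) for _ in range(n_signal)]
    return background + signal


def histogram(values, low, high, bin_width):
    """Count values in equal-width bins covering [low, high)."""
    n_bins = int(round((high - low) / bin_width))
    counts = [0] * n_bins
    for value in values:
        index = int((value - low) // bin_width)
        if 0 <= index < n_bins:  # ignore values outside the range
            counts[index] += 1
    return counts


# A factor of 10 less data, the original amount, and a factor of 10 more.
for scale in (0.1, 1.0, 10.0):
    events = simulate_masses(scale)
    counts = histogram(events, 110.0, 140.0, 2.0)
    print("scale %s: %d events, bin counts %s" % (scale, len(events), counts))
```

Comparing the counts in the bins around 126 with the neighboring bins at each scale gives a rough feeling for how the visibility of the peak depends on the amount of data.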

Discussion d6

Post on Discussion d6 after Section 7, the “Physics” topic:

  • What you found interesting, remarkable or shocking about the search for Higgs Bosons.
  • Was it worth all that money?
  • Please also comment on at least 2 other posts.

Week 6

Video S7

Watch the videos in section 7 (see Syllabus).

Discussion d7

Post on Discussion d7 on the topic:

  • Which is the most interesting/important of the 51 use cases in Section 7?
  • Why?
  • What is the most interesting/important use case not in the group of 51?
  • Please write one post and comment on at least 2 other posts in the discussions.

Week 7

This week's lecture will be determined at a later time.

Week 8

Video S9

Watch the videos related to Section 9 (see Syllabus).

Discussion d9

Post on Discussion d9:

  • What are the benefits for e-Commerce?
  • What are the limitations for e-Commerce?
  • What are the risks and benefits for the banking industry of using Big Data?

Week 9

Video S10

Watch the videos related to Section 10 (see Syllabus).

Discussion d10

Use Discussion d10 in case you have questions about PRG-GEO.

Programming prg-geo

PRG-GEO can be found here: PRG-GEO: Geolocation

Week 10

Discussion d11

Discuss what you learnt from the videos you watched in S11: Parallel Computing and Clouds under Discussion d11.

Paper p11

Consider any 5 cloud or cloud-like activities from the list of 11 below. Describe the ones you chose and explain in what ways they could be used to generate an X-Informatics for some X. Write a 2 page paper with the paper format from Section paper_format:

Week 11 - Week 13

Project or Term Report

Work on your project

Discussion 11, 12, 13, 14

Discuss what you learnt from the videos you watched in the last 2 weeks of class, Sections 12-15. Choose one of the topics: Web Search and Text Mining, Big Data Technology, Sensors, Radar. Each discussion about a topic is to be conducted in the week it is introduced. Due dates are Fridays.

Week 13 - Dec. 2nd

Continue to work on your Term Paper or Project

The due date for the project is Dec 2nd. It will take a considerable amount of time to grade your projects and term papers. Thus the deadline is mandatory. Late projects and term papers will receive a 10% grade reduction. Furthermore, depending on when the project is handed in, it may not be graded over the Christmas break.

Assignment Guidelines

Getting Access and Systems Support

For some projects you will need access to a cloud. We recommend you evaluate which cloud would be most appropriate for your project. This includes:

  • chameleoncloud.org
  • futuresystems.org
  • AWS (you will be responsible for charges)
  • Azure (you will be responsible for charges)
  • VirtualBox, if you have a powerful computer and would like to prototype
  • other clouds

We intend to make a small number of virtual machines available for use in project FG511 on FutureSystems:

Note

FutureSystems OpenStack cloud is currently being updated and will not be available until September.

Documentation about FutureSystems can be found at OpenStackFutureSystems

Once you have created an account on FutureSystems and are doing a project, you can add yourself to the project to gain access. Systems staff are available only during regular business hours, Mon-Fri 10am - 4pm.

You could also use the cloudmesh client software on Linux and OSX to access multiple clouds easily. A Section will introduce this software.

Report and Paper Format

All reports and paper assignments will be using the ACM proceedings format. The MSWord template can be found here:

A LaTeX version can be found at

however you have to remove the ACM copyright notice in the LaTeX version.

There will be NO EXCEPTION to this format. If you are in a team, you can either use gitlab to develop the LaTeX document collaboratively or use Microsoft OneDrive, which offers collaborative editing features. All bibliographical entries must be put into a bibliography manager such as jabref, endnote, or Mendeley. This will help you follow proper citation styles. You can use either ACM or IEEE reference styles. Your final submission will include the bibliography file as a separate document.

Documents that do not follow the ACM format and are not accompanied by references managed with jabref or endnote or are not spell checked will be returned without review.

Please do not use figures or tables to artificially inflate the length of the report. Make figures readable and provide the original images. Use PDF for figures and not png, gif, or jpeg. This way the figures you produce are scalable and zooming into the paper will be possible.

Report Checklist:

  • [ ] Have you written the report in Word or LaTeX in the specified format?
  • [ ] In case of LaTeX, have you removed the ACM copyright information?
  • [ ] Have you included the report in gitlab?
  • [ ] Have you specified the names and e-mails of all team members in your report (e.g. the username in Canvas)?
  • [ ] Have you included all images in native and PDF format in gitlab in the images folder?
  • [ ] Have you added the bibliography file (such as an endnote or bibtex file, e.g. from jabref) in a directory bib?
  • [ ] Have you submitted an additional page that describes who did what in the project or report?
  • [ ] Have you spellchecked the paper?
  • [ ] Have you made sure you do not plagiarize?

Software Project

Develop a software system with OpenStack available on FutureSystems or Chameleoncloud to support it. Only choose the software option if you are prepared to take on programming tasks.

In case of a software project, we encourage a group project with up to three members. You can use the discussion list for the Software Project to form project teams or just communicate privately with other class members to form a team. The following artifacts are part of the deliverables for a project:

Code:
You must deliver the code in gitlab. The code must be compilable, and a TA may try to run your code to replicate your results. You MUST avoid lengthy install descriptions, and everything must be installable from the command line.
Project Report:

A report must be produced while using the format discussed in the Report Format section. The following length is required:

  • 4 pages, one student in the project
  • 6 pages, two students in the project
  • 8 pages, three students in the project

Reports can be longer, up to 10 pages, if needed. Your high quality scientific report should describe a) what you did, b) the results obtained, and c) software documentation including how to install and run it. If c) is longer than half a page and cannot be reproduced with shell scripts or easy to follow steps, you will get points deducted.

Work Breakdown:
This document is only needed for team projects. It is a one page PDF document describing who did what. It includes pointers to the git history and its statistics, which demonstrate that more than one student has worked on the project.
License:
All projects are developed under an open source license such as the Apache 2.0 License, or similar. You will be required to add a LICENCE.txt file and, if you use other software, identify how it can be reused in your project. If your project uses different licenses, please list in a README.rst file which packages are used and which license each package has.
Code Repository:

Code repositories are for code. If additional libraries are needed, you need to develop a script or use a DevOps framework to install such software. Thus zip files and .class or .o files are not permissible in the project. Each project must be reproducible with a simple script. An example is:

git clone ....
make install
make run
make view

This would use a simple Makefile to install, run, and view the results. Naturally you can also use ansible or shell scripts. It is not permissible to use GUI based DevOps preinstalled frameworks. Everything must be installable from the command line.
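
One way to check a repository for the file types named above before committing is a short scan such as the following sketch. The function name forbidden_files is hypothetical, and the suffix list contains only the three file types mentioned in this section; extend it as needed.

```python
import os

# File types that this section says do not belong in the repository.
FORBIDDEN_SUFFIXES = (".zip", ".class", ".o")


def forbidden_files(repo_root):
    """Return sorted relative paths of files that should not be committed."""
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # do not descend into git metadata
        for name in filenames:
            if name.endswith(FORBIDDEN_SUFFIXES):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, repo_root))
    return sorted(found)
```

Calling forbidden_files(".") from the top of a checkout lists any offending files; an empty list means none were found.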

Datasets that may inspire projects can be found in datasets.

You should also review sampleprojects.

Term Paper

Term Report:
If you choose the term paper, you or your team will pick a topic relevant for the class. You will write a high quality scholarly paper about this topic. This includes scientifically examining technologies and applications.
Content Rules:
Material may be taken from other sources, but it must amount to at most 25% of the paper and must be cited. Figures may be used (citations in the figure caption are required). As usual, proper citations and quotations must be given for such content. The quality should be similar to a publishable paper or technical report. Plagiarism is not allowed.
Proposal:

The topic should be close to what you will propose. Please contact me if you change the topic significantly. Also inform me if you change teaming. These changes are allowed; we just need to know, review, and approve.

You can use the discussion list for the Term Paper to form project teams or just communicate privately with other class members to form a team.

Deliverables:

The following artifacts are part of the deliverables for a term paper. A report must be produced while using the format discussed in the Report Format section. The following length is required:

  • 6 pages, one student in the project
  • 9 pages, two students in the project
  • 12 pages, three students in the project

A gitlab repository will contain the paper you wrote in PDF and in docx or latex. All images will be in an images folder and be clearly marked. All bibtex or endnote files will be included in the repository.

Work Breakdown:
This document is only needed for team projects. It is a one page PDF document describing who did what. The document is called workbreakdown.pdf.

The directory structure thus looks like:

./paper.docx
./paper.pdf
./references.enl
./images/myniftyimage-fig1.pptx
./images/myniftyimage-fig1.pdf
Possible Term Paper Topics:
  • Big Data and Agriculture
  • Big Data and Transportation
  • Big Data and Home Automation
  • Big Data and Internet of Things
  • Big Data and Olympics
  • Big Data and Environment
  • Big Data and Astrophysics
  • Big Data and Deep Learning
  • Big Data and Biology
  • Survey of Big Data Applications (difficult, as it is a lot of work; this is a 3 person project only, and at least 15 pages are required, with three additional pages allowed for references)
  • Big Data and “Suggest your own”
  • Review of Recommender Systems: technology & applications
  • Review of Big Data in Bioinformatics
  • Review of Data visualization including high dimensional data
  • Design of a NoSQL database for a specialized application

Project Proposal

Project and Term Paper Proposal Format

Please submit a one page ACM style 2 column paper in which you include the following information, depending on whether you do a term paper or a project. The title will be preceded by the keyword “PROJECT” or “REPORT”.

A project proposal should contain in the proposal section:

  • The nature of the project and its context
  • The technologies used
  • Any proprietary issues
  • Specific aims you intend to complete
  • A list of intended deliverables (artifacts produced)
Title:
  • REPORT: Your title

or

  • PROJECT: Your title
Authors:

The authors need to be listed in the proposal with full name, e-mail, and gitlab username. If you use futuresystems or chameleoncloud, you will also need to add your futuresystems or chameleoncloud name. Please put the prefix futuresystems: and/or chameleon: in the author field accordingly. Please only include them if you have used the resources. If you do not use the resources for the project or report, there is no need to include them.

Example:

Gregor von Laszewski
laszewski@gmail.com
chameleon: gregor
futuresystems: gvl
Abstract:
Include in your abstract a short summary of the report or project.
Proposal:
Include a section called proposal in which you describe in detail what you will do.
Artifacts:

Include a section Artifacts describing what you will produce and where you will store it.

Examples are:

  • A Survey Paper
  • Code on gitlab
  • Screenshots

Homework upload

A video of how to use the web browser to upload the paper is available at:

Video: https://youtu.be/b3OvgQhTFow

Video in cc: TBD

Naturally, if you know how to use the git command line tool, use that; you will have to master it once you start working on your project or term paper.