Unless otherwise stated, homework is the same in all sections and classes. All lectures are assigned on Fridays and homework is due on the Friday of the following week, except in the first week of the semester, where the lectures are assigned on Monday (22nd of August) and the first homework is due on Friday. Therefore we have not posted explicit due dates, as they are obvious from the calendar. You are welcome to work ahead, but check back in case the homework has been updated. Additional due dates, however, will be posted in Canvas. Please visit Canvas for these due dates.
As you will be doing some discussions, please PREFACE YOUR POSTS with your Full Name.
Homework Submission is done as follows:
Building study groups to discuss the content of the class with each other is very common and encouraged. However, such groups should not be used to copy homework assignments that are intended for individual submission.
When working in a team, we recommend that you use English as the communication language. This will help those who are not native English speakers.
- Enroll in the class at Piazza
- Register at https://about.gitlab.com/
- Register at https://www.chameleoncloud.org
If you do not have a computer on which you can do your assignments, please apply for an account with Chameleon Cloud. You will have to ask to be added to project CH-818144: https://www.chameleoncloud.org/user/projects/33130/
Note: You will only be allowed to use VMs for a duration of 6 hours.
Register at https://portal.futuresystems.org/
Watch the videos in Section 2: Units 3, 4, and 5. Note that these units overlap with Unit 2 of Section 1 (see Syllabus).
Consider Discussion d1 after Section 1. Please create a new post on the topic “Why is Big Data interesting to me” and also comment on at least 2 other posts.
This assignment may be conducted as a group with at most two students. It will be up to you to find another student, or you can just do the paper yourself. There is no need to keep this team for the whole semester or for the project assignment; you can build new teams throughout the semester for different homework. Make sure your team members contribute equally.
This assignment requires you to write a paper that is 2 pages in length. Please use the 2-column ACM proceedings format.
- Conduct the Discussion homework first.
- Review what plagiarism is and how to avoid it
- Install jabref and organize your citations with jabref
Write a paper discussing all of the following topics:
- What is Big Data?
- Why is Big Data interesting to me? (Summarize and/or contrast positions in the discussion list. This is not just your position. See our note below.)
- What limitations does Big Data Analytics have?
- If you work in a team please also discuss different positions if there are any. Make sure the work is shared and no academic honesty policy has been violated.
Please note that a discussion took place on the discussion list that you need to analyze. It is important that you summarize the positions and identify a mechanism to evaluate the students' responses. One option is to augment your discussion with classifications and statistics. You may include them as figures in the paper. Alternatively, you may just highlight selected points raised by the course members.
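One way to produce the classifications and statistics mentioned above is to tag each post with a category and count the tags. The following is only a minimal sketch: the example posts, the category names, and the keyword lists are hypothetical placeholders, and simple keyword matching is a crude stand-in for actually reading the posts.

```python
from collections import Counter

# Hypothetical categories and keywords; replace them with ones that fit
# the real discussion.
CATEGORIES = {
    "career": ["job", "career", "salary"],
    "science": ["research", "science", "discovery"],
    "business": ["business", "marketing", "customer"],
}


def classify(post):
    """Return the sorted list of categories whose keywords occur in the post."""
    text = post.lower()
    return sorted(
        name
        for name, keywords in CATEGORIES.items()
        if any(word in text for word in keywords)
    )


def category_counts(posts):
    """Count how many posts fall into each category ('other' if none match)."""
    counts = Counter()
    for post in posts:
        counts.update(classify(post) or ["other"])
    return counts


# Made-up example posts, for illustration only.
example_posts = [
    "Big Data skills open up many job opportunities.",
    "I am interested in how research teams use large data sets.",
    "It is simply fun to explore data.",
]

if __name__ == "__main__":
    for category, count in sorted(category_counts(example_posts).items()):
        print("{}: {}".format(category, count))
```

The resulting counts could be turned into a small table or a bar chart for the paper. A manual classification of the posts would likely be more reliable than keyword matching.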
You will be submitting the paper in gitlab.com as discussed in:
http://bdaafall2016.readthedocs.io/en/latest/gitlab.html
You will be uploading the following files into the paper1 directory:
paper1.tex sample.bib paper1.pdf
After you upload the files, please go to Canvas and fill out the form for the paper1 submission. You will have to upload the appropriate links.
Please watch Section 3 Unit 6. Total length 2.5 hours (see Syllabus).
Consider Discussion d3 after Section 3. Please post about the topic “Where are the Big Data Jobs now and in future? Discuss anything you can share – areas that are hot, good online sites etc.” and also comment on at least 2 other posts.
This requires you to write a paper that is two pages in length. Please use the 2-column ACM proceedings format. Write a paper discussing the following topics:
- What is the role of Big Data in health?
- Discuss any or all areas from telemedicine, personalized (precision) medicine, personal monitors like Fitbit, privacy issues.
You will be submitting the paper in gitlab.com as discussed in:
http://bdaafall2016.readthedocs.io/en/latest/gitlab.html
You will be uploading the following files into the paper2 directory:
paper2.tex sample.bib paper2.pdf
After you upload the files, please go to Canvas and fill out the form for the paper2 submission. You will have to upload the appropriate links.
A video of how to use the web browser to upload the paper is available at:
Video in cc: TBD
It is important that you know how to cite. Please see the page n-resources for guidelines.
Bonus points: Use d2 to discuss the topic of crowdsourcing in relation to big data. Conduct research if needed.
Consider Discussion d4 after Section 4. Please post on the topic “Sports and Health Informatics”:
- Which are the most interesting job areas?
- Which are likely to make the most progress?
- Which one would you work in given similar offers in both fields?
- Comment on at least 2 other posts.
This requires you to write a paper that is one to two pages in length. Please use the 2-column ACM proceedings format.
This assignment may be conducted as a group with at most two students. It will be up to you to find another student, or you can just do the paper yourself. There is no need to keep this team for the whole semester or for the project assignment; you can build new teams throughout the semester for different homework. Make sure your team members contribute equally.
Choose one of the alternatives:
Alternative A:
Using what we call Big Data (such as video) and Little Data (such as baseball numerical statistics) in Sports Analytics. Write a paper discussing the following topics:
- Which offers the most opportunity, and in which sports?
- How are Big Data and Little Data applied to the 2016 Olympics?
Alternative B (This assignment gives bonus points if done right):
How can big data and little data be used in wildlife conservation, pets, farming, and other related areas that involve animals? Write a 2-page paper that covers the topic and addresses:
- Which opportunities are there related to animals?
- Which opportunities are there for wildlife preservation?
- What limitations are there?
- How can big data be best used? Give concrete examples.
- This paper could be longer than two pages if you like
- You are allowed to work in a team of six. The number of pages is determined by team members while the minimum page number is 2. The team must identify who did what.
- However the paper must be coherent and consistent.
- Additional pages are allowed.
- When building teams the entire team must approve the team members.
- If a team does not want to have you join, you need to accept this. Look for another team or work alone.
- Use gitlab to share your LaTeX document or use Microsoft OneDrive to write it collaboratively.
To develop code easily without affecting your local machine, we will be using Ubuntu desktop in a virtual machine running on your computer. Please make sure your hardware supports this. For example, a Chromebook is insufficient.
The detailed description, including 3 videos, is posted at:
Please complete Homework 1, 2 & 3 from that page.
Next you will be using Python in that virtual machine.
Note
You can use your native OS to do the programming assignment. However, if you would like to use any cloud environment, you must also do the development virtual machine assignment, as we want you to get a feeling for how to use Ubuntu before you go on the cloud.
- Hardware:
- Identify a suitable hardware environment that works for you to conduct the assignments. First you must have access to a sufficiently powerful computer. This could be your Laptop or Desktop, or you could get access to machines at IU’s computer labs or virtual machines.
- Setup Python:
- Next you will need to set up Python on the machine or verify that Python works. We recommend that you use Python 2.7 and NOT Python 3. We recommend that you follow the instructions from python.org and use virtualenv. As an editor we recommend PyCharm or Emacs.
- Canopy and Anaconda:
- We had bad experiences with Canopy as well as Anaconda on a machine of a Teaching Assistant. Therefore we recommend against using these systems. It will be up to you to determine if these systems work for you. We do recommend that you use python.org and virtualenv. If you have already started using Canopy or Anaconda, you can continue to do so (but we do not recommend it).
- Useful software:
- Tasks:
Learn Python, e.g. go through the python_big_data (and python_intro if you need to) lesson.
Use virtualenv and pip to customize your environment.
Learn Python pandas <http://pandas.pydata.org/> and write a simple Python application demonstrating:
- a line chart
- a bar chart, e.g. a histogram
Find some real meaningful data such as number of people born in a year or some other more interesting data set to demonstrate the various features.
Review of SciPy: look at the SciPy manual and be aware of what you can do with it in case you choose a project.
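As a starting point for the pandas task above, the following minimal sketch draws a line chart and a bar chart from a small table. It assumes that pandas and matplotlib are installed in your virtualenv. The numbers are made up for illustration, and the function name `make_charts` and the output file names are hypothetical; use real data for the homework.

```python
import matplotlib

matplotlib.use("Agg")  # render to files, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for illustration only.
births = pd.DataFrame(
    {"year": [2010, 2011, 2012, 2013], "births": [120, 135, 128, 142]}
)


def make_charts(frame, line_file="demo-line.png", bar_file="demo-bar.png"):
    """Save a line chart and a bar chart of births per year as PNG files."""
    for kind, filename in (("line", line_file), ("bar", bar_file)):
        ax = frame.plot(x="year", y="births", kind=kind, legend=False)
        ax.set_xlabel("Year")
        ax.set_ylabel("Births (count)")
        ax.figure.savefig(filename)
        plt.close(ax.figure)  # free the figure before drawing the next one
    return line_file, bar_file


if __name__ == "__main__":
    make_charts(births)
```

Selecting the `Agg` backend is a design choice that lets the script run on a machine without a graphical display, for example a virtual machine accessed over ssh.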
Deliverables prg1:
The goal of this assignment is to choose one or two datasets (see datasets), preprocess them to clean them up, and generate a line graph and a histogram plot. Your figures must provide labels for the axes along with units.
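The cleanup step could look like the following minimal sketch. It uses only the standard library and a made-up raw text with typical problems (a blank line, stray spaces, a missing value). The column names and the helper names `clean_rows` and `write_csv` are hypothetical. The sketch is written for a current Python 3 interpreter, so small changes may be needed under Python 2.7.

```python
import csv
import io

# Made-up raw data, for illustration only.
RAW = """year, births
2010, 120

2011, n/a
2012 , 128
"""


def clean_rows(raw_text):
    """Return (year, births) integer pairs, skipping rows that cannot be parsed."""
    rows = []
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)  # skip the header line
    for record in reader:
        if len(record) != 2:
            continue  # blank or malformed line
        try:
            rows.append((int(record[0].strip()), int(record[1].strip())))
        except ValueError:
            continue  # non-numeric value such as "n/a"
    return rows


def write_csv(rows, filename):
    """Write the cleaned rows, with a header line, to a CSV file."""
    with open(filename, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["year", "births"])
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv(clean_rows(RAW), "data-line.csv")
```

Whether dropping unparsable rows is acceptable depends on the dataset. Whatever rule you choose should be documented together with the data source.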
Submit your programs in a folder called prg1, which must contain the following:
- requirements.txt: list of Python libraries your programs need, as installable by: pip install -r requirements.txt
- fetchdata.py: a Python program that, when run as python fetchdata.py, will produce dataset files in CSV format called data-line.csv and data-hist.csv.
- linechart.py: a Python program that, when run as python linechart.py data-line.csv, will generate a line chart and save it in PNG format to a file called linechart.png.
- histogram.py: a Python program that, when run as python histogram.py data-hist.csv, will generate a histogram plot and save it in PNG format to a file called histogram.png.
- README.rst: an RST format file which documents the datasets you used, where you fetched them from, and how fetchdata.py cleans them to generate the data-{line,hist}.csv files.
Warning
Missing items will result in zero points being given
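Because missing items cost all points, it may help to check the folder before submitting. The following minimal sketch lists which of the required files are absent. The file names are taken from the deliverables above, while the helper name `missing_items` is hypothetical. It only checks that the files exist, not that the programs work.

```python
import os

# File names taken from the deliverables list for prg1.
REQUIRED_FILES = [
    "requirements.txt",
    "fetchdata.py",
    "linechart.py",
    "histogram.py",
    "README.rst",
]


def missing_items(folder="prg1"):
    """Return the required files that are not present in the folder."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]


if __name__ == "__main__":
    missing = missing_items()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required files are present.")
```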
Please prepare for the selection process for a project or a term paper:
- Review the guidelines for the project and term paper.
- Identify if you are likely to do a project or a term paper
- Build teams and choose your team members wisely. For example, if you have 3 people in the team and only two do the work, you still get graded based on a 3 person team.
- Decide on a topic and a team. Commit to it by the end of Week 5.
- For that week the homework also includes making a plan for your term paper and writing a one page summary, which we will approve and give comments on. If you are in a team, each student must submit an (identical) plan with a notation as to teaming. Note that teaming can change in the actual final project.
- You will be completing this Form throughout the semester. In it you will upload the title, the team members, the location of your proposal in gitlab with a direct URL, a description of the artifacts, and the final project report.
Create a NEW post to discuss the final project you want to do and to look for team members (if you want to build a team).
Obtain an account on Futuresystems.org and join project FG511. Note that this will take time and you need to do this ASAP. No late assignments will be accepted. If you are late, this assignment will receive 0 points.
- Which account name should I use?:
The same name as you use at IU to register. If you have had a previous class and used a different name, please let us know, so we can make a note of it. Please do not apply for two accounts. If your account name is already taken, please use a different one.
- Obtain an account on https://www.chameleoncloud.org. Fill out the Poll TBD (This assignment is optional, but we have had good experiences with Chameleon Cloud, so we advise you to get an account. As you are a student you will not be able to create a project. We will announce in due time the project that you can join to use Chameleon Cloud).
- Inform yourself about OpenStack and how to start and stop virtual machines via the command line.
- Optionally, you can use cloudmesh_client for this (If you use cloudmesh client you will get bonus points).
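To get a feeling for starting and stopping virtual machines from the command line, the following minimal sketch wraps the `openstack server start` and `openstack server stop` commands. It assumes that the `openstack` command line client is installed and that your cloud credentials are already loaded into the environment. The helper names and the server name `my-vm` are hypothetical, and the cloud you use may document a different client or workflow.

```python
import subprocess


def server_command(action, server_name):
    """Build the command line that starts or stops an existing server."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["openstack", "server", action, server_name]


def run_server_command(action, server_name):
    """Run the command and return its exit code (0 means success)."""
    return subprocess.call(server_command(action, server_name))


if __name__ == "__main__":
    # Only print the commands here; call run_server_command() to execute them.
    for action in ("stop", "start"):
        print(" ".join(server_command(action, "my-vm")))
```

Building the command as a list rather than as a single string avoids shell quoting problems when a server name contains spaces.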
Consider the Python code available on Section 6 Unit 13 Files tab (the third one) as HiggsClassIIUniform.py. This software is also available. When run, it should produce results like the file TypicalResultsHW5.docx on the same tab. This code corresponds to 42000 background events and 300 Higgs events. The background is uniformly distributed and the Higgs is a Normal (Gaussian) distribution centered at 126 with a width of 2. Produce 2 more figures (plots) corresponding to experiments with a factor of 10 more or a factor of 10 less data (both Higgs and background increase or decrease by the same factor). Return the two new figures and your code as homework in github under the folder prg2.
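The scaling of the data volume can be illustrated with the following minimal sketch. It is not HiggsClassIIUniform.py and does not reproduce its output. It only uses the numbers stated above (42000 background events, 300 Higgs events, a Gaussian centered at 126). The mass window of 110 to 140, the bin width of 2, and the reading of "width" as the standard deviation are assumptions made for this illustration. It uses only the standard library and prints bin counts instead of drawing figures.

```python
import random

HIGGS_MASS = 126.0        # centre of the Gaussian signal, from the text
HIGGS_WIDTH = 2.0         # ASSUMPTION: "width" read as standard deviation
BASE_BACKGROUND = 42000   # background events at scale 1, from the text
BASE_HIGGS = 300          # Higgs events at scale 1, from the text
MASS_RANGE = (110.0, 140.0)  # ASSUMPTION: window of the uniform background
BIN_WIDTH = 2.0              # ASSUMPTION: histogram bin width


def simulate(scale, seed=0):
    """Return simulated masses with both event counts multiplied by scale."""
    rng = random.Random(seed)
    n_background = int(round(BASE_BACKGROUND * scale))
    n_higgs = int(round(BASE_HIGGS * scale))
    low, high = MASS_RANGE
    masses = [rng.uniform(low, high) for _ in range(n_background)]
    masses += [rng.gauss(HIGGS_MASS, HIGGS_WIDTH) for _ in range(n_higgs)]
    return masses


def histogram(masses):
    """Count events per bin across MASS_RANGE; events outside are dropped."""
    low, high = MASS_RANGE
    counts = [0] * int((high - low) / BIN_WIDTH)
    for mass in masses:
        if low <= mass < high:
            counts[int((mass - low) / BIN_WIDTH)] += 1
    return counts


if __name__ == "__main__":
    for scale in (0.1, 1, 10):
        counts = histogram(simulate(scale))
        print("scale {}: {} events, bin counts {}".format(scale, sum(counts), counts))
```

Comparing the bin counts near 126 with the neighbouring bins at each scale gives a first impression of how the visibility of the peak depends on the amount of data.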
What do you conclude from the figures about the ability to see the Higgs particle with different amounts of data (corresponding to different lengths of time the experiment runs)?
Due date: October 25
Video V6: Review/Study Section 7 Units 12-15; total 3 hours 7 minutes. This is the Physics Informatics Section.
Post on Discussion d6 after Section 7, the “Physics” topic:
- What you found interesting, remarkable or shocking about the search for Higgs Bosons.
- Was it worth all that money?
- Please also comment on at least 2 other posts.
Post on Discussion d7 on the topic:
- Which is the most interesting/important of the 51 use cases in Section 7?
- Why?
- What is the most interesting/important use case not in the group of 51?
- Please write one post and comment on at least 2 other posts in the discussions.
Post on Discussion d9:
- What are benefits for e-Commerce?
- What are limitations for e-Commerce?
- What are risks and benefits for the banking industry using big data?
Use Discussion d10 in case you have questions about PRG-GEO
PRG-GEO can be found here: PRG-GEO: Geolocation
Discuss what you learnt from the video you watched in S11: Parallel Computing and Clouds under Discussion d11.
Consider any 5 cloud or cloud-like activities from the list of 11 below. Describe the ones you chose and explain in what ways they could be used to generate an X-Informatics for some X. Write a 2-page paper with the paper format from Section paper_format:
- http://aws.amazon.com/
- http://www.windowsazure.com/en-us/
- https://cloud.google.com/compute/
- https://portal.futuresystems.org/
- http://joyent.com/
- https://pod.penguincomputing.com/
- http://www.rackspace.com/cloud/
- http://www.salesforce.com/cloudcomputing/
- http://earthengine.google.org/
- http://www.openstack.org/
- https://www.docker.com/
Work on your project
Discuss what you learnt from the videos you watched in the last 2 weeks of class, Sections 12-15; choose one of the topics: Web Search and Text Mining, Big Data Technology, Sensors, Radar. Each discussion about a topic is to be conducted in the week it is introduced. Due dates are Fridays.
Continue to work on your Term Paper or Project
The due date for the project is Dec 2nd. It will take a considerable amount of time to grade your projects and term papers. Thus the deadline is mandatory. Late projects and term papers will receive a 10% grade reduction. Furthermore, depending on when the project is handed in, it may not be graded over the Christmas break.
For some projects you will need access to a cloud. We recommend you evaluate which cloud would be most appropriate for your project. This includes:
We intend to make a small number of virtual machines available for use in project FG511 on FutureSystems:
Note
The FutureSystems OpenStack cloud is currently being updated and will not be available until September.
Documentation about FutureSystems can be found at OpenStackFutureSystems
Once you have created an account on FutureSystems and you do a project, you can add yourself to the project so you gain access. Systems staff is available only during regular business hours, Mon-Fri 10am - 4pm.
You could also use the cloudmesh client software on Linux and OSX to access multiple clouds easily. A Section will introduce this software.
All reports and paper assignments will be using the ACM proceedings format. The MSWord template can be found here:
paper-report.docx
A LaTeX version can be found at
however you have to remove the ACM copyright notice in the LaTeX version.
There will be NO EXCEPTION to this format. In case you are in a team, you can either use gitlab while collaboratively developing the LaTeX document or use Microsoft OneDrive, which allows collaborative editing. All bibliographical entries must be put into a bibliography manager such as jabref, endnote, or Mendeley. This will guarantee that you follow proper citation styles. You can use either ACM or IEEE reference styles. Your final submission will include the bibliography file as a separate document.
Documents that do not follow the ACM format and are not accompanied by references managed with jabref or endnote or are not spell checked will be returned without review.
Please do not use figures or tables to artificially inflate the length of the report. Make figures readable and provide the original images. Use PDF for figures and not png, gif, or jpeg. This way the figures you produce are scalable and zooming into the paper will be possible.
Report Checklist:
Develop a software system with OpenStack available on FutureSystems or Chameleoncloud to support it. Only choose the software option if you are prepared to take on programming tasks.
In case of a software project, we encourage a group project with up to three members. You can use the discussion list for the Software Project to form project teams or just communicate privately with other class members to formulate a team. The following artifacts are part of the deliverables for a project
A report must be produced while using the format discussed in the Report Format section. The following length is required:
Reports can be longer, up to 10 pages, if needed. Your high quality scientific report should describe a) what you did, b) the results obtained, and c) software documentation including how to install and run it. If c) is longer than half a page and cannot be reproduced with shell scripts or easy-to-follow steps, you will get points deducted.
Code repositories are for code. If you need additional libraries, you must develop a script or use a DevOps framework to install such software. Thus zip files and .class, .o files are not permissible in the project. Each project must be reproducible with a simple script. An example is:
git clone ....
make install
make run
make view
This would use a simple makefile to install, run, and view the results. Naturally you can also use ansible or shell scripts. It is not permissible to use GUI based DevOps preinstalled frameworks. Everything must be installable from the command line.
Datasets that may inspire projects can be found in datasets.
You should also review sampleprojects.
The topic should be close to what you will propose. Please contact me if you significantly change the topic. Also inform me if you change teaming. These changes are allowed; we just need to know, review, and approve.
You can use the discussion list for the Term Paper to form project teams or just communicate privately with other class members to formulate a team.
The following artifacts are part of the deliverables for a term paper. A report must be produced while using the format discussed in the Report Format section. The following length is required:
A gitlab repository will contain the paper you wrote in PDF and in docx or LaTeX. All images will be in an image folder and be clearly marked. All bibtex or endnote files will be included in the repository.
The directory structure thus looks like:
./paper.docx
./paper.pdf
./refrences.enl
./images/myniftyimage-fig1.pptx
./images/myniftyimage-fig1.pdf
Please submit a one page ACM style 2 column paper in which you include the following information, depending on whether you do a term paper or a project. The title will be preceded by the keyword “PROJECT” or “REPORT”.
A project proposal should contain in the proposal section:
or
The authors need to be listed in the proposal with full name, e-mail, and gitlab username. If you use futuresystems or chameleoncloud, you will also need to add your futuresystems or chameleoncloud name. Please put the prefix futuresystems: and/or chameleon: in the author field accordingly. Please only include these if you have used the resources. If you do not use the resources for the project or report, there is no need to include them.
Example:
Gregor von Laszewski
laszewski@gmail.com
chameleon: gregor
futuresystems: gvl
Include a section Artifacts describing what you will produce and where you will store it.
Examples are:
A video of how to use the web browser to upload the paper is available at:
Video: https://youtu.be/b3OvgQhTFow
Video in cc: TBD
Naturally, if you know how to use the git command-line tool, use that; you will have to master it once you start working on your project or term paper.