Due: January 19th
Welcome to ECE 29595, Introduction to Data Science. In this course, we will cover a wide variety of topics related to data analysis -- histograms, estimation, regression, clustering, classificaiton, etc. -- all explored through writing Python programs to implement these analyses.
This first homework assignment will ask you to set up your environment correctly -- both the Python environment and the submission infrastructure that we will be using.
Goals
In this assignment you will:
- Set up your GitHub account
- Use git to clone the homework 0 account
- Set up your Python environment
- Use git to submit your assignment
Background
Version Control
This class uses git for version control. Git is a distributed version control system. That means there are two repositories: local and remote. When you commit changes, only the local repository is changed. This makes commits fast and independent of network connections. If your computer is damaged, you still lose the local repository.
To change the remote repository, you need to push the changes. If your computer is damaged, you can retrieve the code from the remote repository.
Please read the guide at github about how to use version control.
Please push your code to GitHub often. Not only does that prevent you from losing any code if you accidentally delete anything, it helps us help you debug, by giving us access to your latest code.
Python
We will use Python in this class to implement our various analyses. Python is a scripting language that has extensive library support for data science, in the form of the SciPy module stack: NumPy provides many basic mathematical operations on lists and arrays of data; pandas provides higher-level data structures that facilitate managing data sets; Matplotlib is a library for simple visualizations of data; and IPython (plus Jupyter) provides a way of using Python interactively.
In this class, we will use python 2.7.10 (this is the version of python installed on the ecegrid machines). Many of the modules we will use in this class are already installed on the ecegrid machines, as well, but you can use the pip
package manager to install the latest versions:
python -m pip install --user [package name]
Note that the --user
flag installs the package locally (i.e., in your home directory). This is useful to install packages that might override the default packages on the ecegrid machines. If the package is already installed, you can also upgrade to the latest version:
python -m pip install --user --upgrade [package name]
The package versions that we will use in this class are:
scipy
version 1.0.0numpy
version 1.14.0pandas
version 0.22.0matplotlib
version 2.1.1
These are the latest versions of these libraries. If you are using older versions of these libraries during development, it is your responsibility to ensure that your code works with the above versions.
You may also want to install Jupyter notebook (package name jupyter
) -- this
will allow you to write python scripts interspersed with text and plots (in
later homeworks, we will expect a short writeup of your results in addition to
any code).
Python modules
Python does not have a large number of built-in commands. Instead, Python relies on a wide range of modules to provide additional functionality. These modules can be used in your script by import
ing them (this is like using #include
in C). For example, to import numpy
, you would use the line:
import numpy as np
This tells Python two things: first, that you want to use the numpy
module. Second, whenever you invoke methods of the numpy
module, they will be part of the np
"namespace" -- any functions will be preceded by np
:
input = np.random.randint(0, 10, 100, 'i')
Will invoke the random.randint
method from the numpy
module.
Instructions
We recommend that you take these steps while logged in to an ecegrid machine (from a terminal). If you want to work on your homework on a different machine, you are free to, though you may have to slightly modify some of the instructions below.
If you have not logged in to ecegrid before, you can connect using ThinLinc from any web browser. You may be asked to choose a Gnome or KDE desktop (pick either). From there, you can open a terminal window to execute these commands.
1) Set up your Github account
Create a Github account (if you do not already have one). This is the account you should use to create and submit all of your assignments this semester.
Fill out this Google form to let us link your GitHub username with your Purdue account.
2) Create a git repository for the assignment
Make sure you are logged in to your GitHub account. On Blackboard, click on the HW0 GitHub Classroom link. This link will set up a repository for
homework 0 at https://github.com/ECEDataScience/hw0-<your username here
.
This repository contains starter code for your assignment (in this case,
this README, plus the file hw0.py
).
3) Set up an SSH key with GitHub
Set up a public SSH key in your GitHub account (if you haven't already). To do this, first generate a new ssh key:
> ssh-keygen
Hit enter three times (to accept the default location, then to set and confirm an empty passphrase). This will create two files: ~/.ssh/id_rsa
(your private key) and ~/.ssh/id_rsa.pub
(your public key)
Then print out your public key:
> cat ~/.ssh/id_rsa.pub
And copy it to the clipboard. Then follow steps 2-8 here.
4) Clone the repository to develop your assignment
Cloning a repository creates a local copy. Change your directory to whichever directory you want to create your local copy in, and type:
> git clone git@github.com:/ECEDataScience/hw0-<your username here> hw0
This will create a subdirectory called PA01, where you will work on your code.
In this command: git clone
copies a
repository. git@github.com:/ECEDataScience/hw0-<your username here>
tells git
where the server (remote copy) of your code is. hw0
tells
git to place the code in a local directory named hw0
If you change to directory hw0
and list the contents, you should see the
files you will need for this assignment:
> cd hw0
> ls
You should see README.md
(this file) and hw0.py
.
As you develop your code, you can commit a local version of your changes (just to make sure that you can back up if you break something) by typing:
> git add <file name that you want to commit>
> git commit -m "<describe your changes>"
git add <filename>
tells git to "stage" a file for
committing. Staging files is useful if you want to make changes to
several files at once and treat them as one logical change to your
code. You need to call git add every time you want to commit a file that you have changed.
git commit
tells git to commit a new version of your code including
all the changes you staged with git add
. Note that until you execute
git commit
, none of your changes will have a version associated with
them.
To copy your changes back to Github (to make sure they are saved if your computer crashes, or if you want to continue developing your code from another machine), type
> git push
If you do not push, the teaching staff cannot see your solutions.
5) Write a simple test program to print out module version numbers
In this assignment, all you need to do is write a simple Python script that will print out the module version numbers for the modules we will be using in this course: scipy
, numpy
, pandas
, and matplotlib
.
To do this, you should import the module, then print the __version__
variable. hw0.py
shows how to do this for numpy
:
import numpy as np
print np.__version__
You can run this program from the command line as follows:
> python hw0.py
What you need to submit
You should modify hw0.py
to also print out the version numbers for scipy
, pandas
, and matplotlib
(in that order) after the version number for numpy
. This is the only file you need to create or modify for this assignment.
Don't forget to push the updated version of hw0.py
to GitHub!
Submitting your code
You will use git's "tagging" functionality to submit assignments. Rather than using any submission system, you will use git to tag which version of the code you want to grade. To tag the latest version of the code, type:
> git tag -a <tagname> -m "<describe the tag>"
This will attach a tag with name
> git tag -a submission -m "Submission for hw0"
> git push --tags
This will create a tag named "submission" and push it to the remote server. The grading system will check out whichever version you have tagged "submission" and grade that.
If you want to update your submission (and tell the grading system to ignore any previous submissions) type:
> git tag -a -f submission -m "Submission for hw0"
> git push --tags
This will overwrite any other tag named submission with one for the current commit.
Please be careful about the following rules:
-
For each assignment, you should tag only one version with "submission". It is your responsibility to tag the correct one. You CANNOT request regrading if the grading program retrieves the version that you do not want to submit.
-
After tagging a version "submission", any modifications you make to your program WILL NOT BE GRADED (unless you update the tag, as described above).
-
The grading program starts retrieving soon after the submission deadline of each assignment. If your repository has no version tagged "submission", it is considered that you are late.
-
The time of submission is the time when you push the code to the repository, not the time when the grading program retrieves your code. If you push the code after the deadline, it is late. Even though you push before the grading program starts retrieving your program, it is still considered late.
-
You should push at least fifteen minutes before the deadline. Give yourself some time to accommodate unexpected situations (such as slow networks).
-
You are encouraged to tag partially working programs for submission early. In case anything occurs (for example, your computer is broken), you may receive some points. Please remember to tag improved version as you make progress.
- Do not send your code for grading. The only acceptable way for grading is to tag your repository.