Classifying handwritten digits.
In-Class Kaggle competition for CS4786: Machine Learning for Data Science
Fall 2016
The first in-class Kaggle competition for our class involves a clustering challenge:
You are provided with a data set created from 12000 handwritten digits ('0'-'9'). You are only given information extracted from the images of the handwritten digits; the underlying label indicating which digit each data point represents is not provided to you. Your task in this competition is to cluster/classify (based on very weak supervision) these data points into 10 clusters such that each cluster corresponds to one of the digits '0' to '9'. Seed labels for a few data points (30 of them, 3 for each digit) are provided.
Here is what you are provided with:
- Features describing the images of the handwritten digits: for each handwritten digit, a 113-dimensional feature vector is extracted from the image of that digit. These are provided in the file features.csv, which has 12000 lines (one per handwritten digit) in comma-separated-values format.
- A similarity graph: a graph connecting the 12000 data points, provided as an adjacency matrix in comma-separated-values format in the file adjacency.csv. Two nodes are connected in this graph if the corresponding noisy digit images are similar enough (dissimilarity smaller than a fixed threshold).
- 3 labeled points for each of the 10 classes: to help you identify your 10 clusters with the right digit from '0' to '9', we provide 3 example data points for each digit in the file seed.csv. The file consists of 10 lines, one for each digit from '0' to '9'. Each line has 3 numbers giving the line numbers (indices) of 3 data points belonging to that class. The line numbering starts from 1 (not 0).
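For concreteness, here is a minimal sketch of one way to load the three files in Python (numpy and the exact calls used here are an assumption of convenience, not a requirement of the assignment):

    import numpy as np

    features = np.loadtxt("features.csv", delimiter=",")      # 12000 x 113 feature matrix
    adjacency = np.loadtxt("adjacency.csv", delimiter=",")    # 12000 x 12000 adjacency matrix
    seeds = np.loadtxt("seed.csv", delimiter=",", dtype=int)  # 10 x 3, row i lists examples of digit i

    # seed.csv uses 1-based line numbers, so convert to 0-based row indices.
    seed_idx = seeds.flatten() - 1
    seed_lab = np.repeat(np.arange(10), 3)  # digit label for each seed index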
Task: For each handwritten digit, predict the corresponding label from '0' to '9'. The competition will be hosted on Kaggle In-Class.
Kaggle Link: https://inclass.kaggle.com/c/cs-4786-competition-1
Download the data below as a zip file. When unzipped, you will find three files: adjacency.csv, seed.csv, and features.csv.
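To make the task concrete, here is a minimal baseline sketch (an illustration, not the intended solution): embed the similarity graph spectrally, run k-means with 10 clusters, and name each cluster by majority vote over the seed points it contains. The library (scikit-learn), the number of embedding dimensions, and the random seed are all illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import spectral_embedding

    adjacency = np.loadtxt("adjacency.csv", delimiter=",")
    seeds = np.loadtxt("seed.csv", delimiter=",", dtype=int)
    seed_idx = seeds.flatten() - 1            # 1-based line numbers -> 0-based rows
    seed_lab = np.repeat(np.arange(10), 3)    # row i of seed.csv corresponds to digit i

    # Spectral embedding of the graph (eigenvectors of its normalized Laplacian),
    # followed by k-means into 10 clusters.
    emb = spectral_embedding(adjacency, n_components=10, drop_first=True)
    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

    # Assign a digit to each cluster by majority vote over the seed points inside it.
    cluster_to_digit = {}
    for c in range(10):
        digits = [d for i, d in zip(seed_idx, seed_lab) if clusters[i] == c]
        cluster_to_digit[c] = max(set(digits), key=digits.count) if digits else -1

    predictions = np.array([cluster_to_digit[c] for c in clusters])

A natural refinement, rewarded in the grading guidelines below, is to combine this graph-based embedding with (a dimensionality-reduced version of) the feature vectors before clustering.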
Group size: Groups of 1-4 students.
Due date: The deadline is 11:59 pm, Thursday, 27th October. The due date for the report on CMS will be announced soon and is a couple of days after the competition closes on Kaggle. Submit what you have at least once by an hour before that deadline, even if you haven't quite added all the finishing touches; CMS allows resubmissions up to, but not after, the deadline. If there is an emergency such that you need an extension, contact the professors.
- Footnote: The choice of the number “four” is intended to reflect the idea of allowing collaboration, but requiring that all group members be able to fit “all together at the whiteboard”, and thus all be participating equally at all times. (Admittedly, it will be a tight squeeze around a laptop, but please try.)
Deliverables:
- Early Report: Each group should submit a one-to-two-page preliminary report that includes preliminary thoughts about how you plan to approach the competition. For each individual in the group, also include what that individual has done so far and plans to do for the competition. This report is due on October 4th. All the group members can merge their preliminary reports into a single preliminary_writeup.pdf on CMS. (worth 10% of the competition grade)
- Report: At the end of the competition, each group should submit a 5-15 page writeup that includes visualizations, a clear explanation of methods, etc. See the grading guidelines for details about what is expected from the writeup. (worth 50% of the competition grade)
- Predictions: The competition is held on Kaggle In-Class. You can submit your predictions to Kaggle to compete with your friends. You should also submit your predictions on CMS. (worth 40% of the competition grade)
Collaboration and academic integrity policy
Students may discuss and exchange ideas with students not in their group, but only at the conceptual level.
We distinguish between “merely” violating the rules for a given assignment and violating academic integrity. To violate the latter is to commit fraud by claiming credit for someone else’s work. For this assignment, an example of the former would be getting detailed feedback on your approach from person X, who is not in your group, but stating in your homework that X was the source of that particular answer. You would cross the line into fraud if you did not mention X. The worst-case outcome for the former is a grade penalty; the worst-case outcome for the latter is academic-integrity hearing procedures.
The way to avoid violating academic integrity is to always document any portions of work you submit that are due to or influenced by other sources, even if those sources weren’t permitted by the rules.2
2. Footnote: We make an exception for sources that can be taken for granted in the instructional setting, namely, the course materials. To minimize documentation effort, we also do not expect you to credit the course staff for ideas you get from them, although it’s nice to do so anyway.
Grading Guidelines: (still under construction)
You are allowed to use methods beyond what is covered in class. But if you do, provide a clear comparison with “reasonable methods” covered in class.
Visualization (10%)
- Inclusion of plots/diagrams (5%)
- Explanation of how visuals helped develop the model (5%)
Algorithms (30%)
- Correct use of algorithms (15%)
  - Used principled approach to extract information from similarity graph and provided clear explanation and reasoning (5%)
  - Extracted and used common information from both the features and the similarity graph in a principled fashion and provided clear explanation and reasoning (5%)
  - Used clustering algorithms to cluster datapoints into classes, clearly explained and analyzed the method (5%)
- Explanation of how algorithms helped to develop model (15%)
  - Showed evident understanding of each algorithm used
Model (40%)
- Use of data (30%)
  - Individual testing (10%): tested performance on just features, just graph
  - Combining data (10%): combined data from features and graph to develop model
  - Partial supervision (10%): used seeds to classify points into classes
- Parameters (10%)
  - Evident testing of different parameters (5%)
  - Reasons for choosing certain parameters (5%)
Failed Attempts (20%)
- Explanation (10%): explained how they developed their failed models and why they think those models failed
- Improvement (10%): explained how failed attempts led them to develop their final model
Bonus (at the discretion of the graders):
- Tried new or more methods not necessarily covered in class
- Developed new algorithm or methods, tweaked existing methods to fit the problem better
…