
Can you find the right digit?

In-Class Kaggle competition for CS4786: Machine Learning for Data Science

Fall 2016

You are provided with a data set of 12000 data points, each corresponding to information extracted from an image of a handwritten digit from '0' to '9'. Your task in this competition is to cluster/classify (based on very weak supervision) these data points into 10 clusters such that each cluster corresponds to one of the digits from '0' to '9'. Seed labels in '0' to '9' are provided for a few (30) of the data points. Your goal is to maximize the accuracy of your classification.

The first in-class Kaggle competition for our class involves a clustering challenge: your goal is to cluster data into 10 classes based on very little supervision (only 30 of the 12000 data points are labeled). Here is what you are provided with:

  • The features describing the images of handwritten digits: for each handwritten digit, a 113-dimensional feature vector is extracted to represent the image. There are 12000 images in total. (features.csv)
  • A similarity graph: a graph connecting the 12000 data points is provided in adjacency matrix format. Two nodes are connected in this graph if the corresponding noisy images of the digits are similar enough (dissimilarity smaller than a fixed threshold). (adjacency.csv)
  • 3 labeled points for each of the 10 classes: to help you match your 10 clusters with the right digits from '0' to '9', we provide 3 example data points for each digit in the data set. (seed.csv)
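One way the features and seed labels could be combined is seeded k-means: start each cluster centroid at the mean of that digit's seed points, then alternate assignment and centroid updates while keeping the seeds fixed. The sketch below is only an illustration under stated assumptions, not a prescribed method. The function name `seeded_kmeans` is hypothetical, the data is a small synthetic 2-D example with 3 classes rather than the competition files, and file loading is omitted because the exact layout of the CSV files is not described here.

```python
import random


def seeded_kmeans(points, seeds, n_iter=20):
    """Cluster `points` (list of feature vectors) using `seeds`,
    a dict {point_index: label}. Returns one label per point."""
    labels = sorted(set(seeds.values()))
    dim = len(points[0])

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Initialise each centroid as the mean of that label's seed points.
    centroids = {
        lab: mean([points[i] for i, l in seeds.items() if l == lab])
        for lab in labels
    }

    assignment = [None] * len(points)
    for _ in range(n_iter):
        new_assignment = []
        for i, p in enumerate(points):
            if i in seeds:
                # Seed points always keep their given label.
                new_assignment.append(seeds[i])
            else:
                new_assignment.append(
                    min(labels, key=lambda lab: sqdist(p, centroids[lab]))
                )
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for lab in labels:
            members = [points[i] for i, a in enumerate(assignment) if a == lab]
            if members:
                centroids[lab] = mean(members)
    return assignment


# Synthetic stand-in for the real data: 3 well-separated 2-D blobs.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, truth = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(30):
        points.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
        truth.append(label)

# Two seed points per class (the competition provides three per digit).
seeds = {0: 0, 1: 0, 30: 1, 31: 1, 60: 2, 61: 2}

predicted = seeded_kmeans(points, seeds)
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print("accuracy on synthetic data:", accuracy)
```

Because the centroids start from the seed points, each cluster already carries a digit label, so no separate step is needed to decide which cluster is which digit. On the real 113-dimensional features the clusters are unlikely to be this cleanly separated.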

Task: For each handwritten digit, predict what the corresponding digit from '0' to '9' was.
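The similarity graph offers another route to the prediction task: spread the seed labels along graph edges until every node has a label. The following is a minimal sketch of iterative label propagation, assuming the adjacency matrix is available as a list of 0/1 rows. The function name `propagate_labels` and the 8-node toy graph are made up for illustration, and this is one possible approach rather than the expected solution.

```python
def propagate_labels(adjacency, seeds, n_classes, n_iter=50):
    """adjacency: square 0/1 matrix (list of lists).
    seeds: dict {node_index: class_index}.
    Returns a predicted class index for every node."""
    n = len(adjacency)
    # scores[i][c] is the current belief that node i belongs to class c.
    scores = [[0.0] * n_classes for _ in range(n)]
    for i, c in seeds.items():
        scores[i][c] = 1.0

    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            neighbours = [j for j in range(n) if adjacency[i][j]]
            if i in seeds or not neighbours:
                # Seeds stay clamped; isolated nodes cannot be updated.
                new_scores.append(scores[i])
                continue
            # Replace the node's scores by the average over its neighbours.
            new_scores.append([
                sum(scores[j][c] for j in neighbours) / len(neighbours)
                for c in range(n_classes)
            ])
        scores = new_scores

    # Ties (e.g. nodes unreachable from any seed) fall back to class 0.
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Toy graph: two 4-node cliques (0-3 and 4-7) joined by the edge 3-4.
n_nodes = 8
adjacency = [[0] * n_nodes for _ in range(n_nodes)]
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for a in group:
        for b in group:
            if a != b:
                adjacency[a][b] = 1
adjacency[3][4] = adjacency[4][3] = 1

# One labeled node per class.
graph_seeds = {0: 0, 7: 1}
graph_labels = propagate_labels(adjacency, graph_seeds, n_classes=2)
print(graph_labels)
```

For the full 12000-node graph, a dense pure-Python loop like this would be slow; a sparse matrix representation would be the natural replacement, but the logic stays the same.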

 

Download the data here.

 

 

Due date: The deadline is 11:59 pm, Wednesday 4 October. The due date for the report on CMS will be announced soon and is a couple of days after the competition closes on Kaggle. Submit what you have at least once by an hour before that deadline, even if you haven’t quite added all the finishing touches: CMS allows resubmissions up to, but not after, the deadline. If there is an emergency such that you need an extension, contact the professors.

 

  1. Footnote: The choice of the number “four” is intended to reflect the idea of allowing collaboration, but requiring that all group members be able to fit “all together at the whiteboard”, and thus all be participating equally at all times. (Admittedly, it will be a tight squeeze around a laptop, but please try.)

Collaboration and academic integrity policy 

Students may discuss and exchange ideas with students not in their group, but only at the conceptual level.

We distinguish between “merely” violating the rules for a given assignment and violating academic integrity. To violate the latter is to commit fraud by claiming credit for someone else’s work. For this assignment, an example of the former would be getting detailed feedback on your approach from person X who is not in your group but stating in your homework that X was the source of that particular answer. You would cross the line into fraud if you did not mention X. The worst-case outcome for the former is a grade penalty; the worst-case scenario in the latter is academic-integrity hearing procedures.

The way to avoid violating academic integrity is to always document any portions of work you submit that are due to or influenced by other sources, even if those sources weren’t permitted by the rules (see footnote 2).

2. Footnote: We make an exception for sources that can be taken for granted in the instructional setting, namely, the course materials. To minimize documentation effort, we also do not expect you to credit the course staff for ideas you get from them, although it’s nice to do so anyway.

