This repository has been archived by the owner on Jan 19, 2019. It is now read-only.

Project Data Exploration

Krishna Parashar edited this page Oct 8, 2015 · 1 revision

This assignment involves preliminary data exploration and visualization of your project dataset. It is the first group assignment. You should hand in your write-up using the submission page linked at the end of this page.

You will probably find it most convenient to do your analysis and write-up in an IPython Notebook. You can attach the notebook to your submission, but make sure all the data cells and graphics cells have been evaluated in the file you submit. We won't have access to your data and won't be able to rerun the cells. If you have to run any long calculations, save the results to a file and reload them in the notebook so you can keep editing and quickly rerunning the notebook.
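One way to save the results of a long calculation and reload them is sketched below. This is only a minimal illustration using the standard library's `pickle`; the helper `cached`, the function `slow_summary`, and the file name `summary.pkl` are hypothetical names, not part of the assignment.

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Return the result stored at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []  # records how many times the slow function actually runs

def slow_summary():
    # Stand-in for a long-running calculation (hypothetical example).
    calls.append(1)
    return {"n_rows": 1000, "mean": sum(range(1000)) / 1000}

# A fresh temporary directory keeps the demo from touching real project files.
cache_path = os.path.join(tempfile.mkdtemp(), "summary.pkl")
first = cached(cache_path, slow_summary)   # computes and writes the file
second = cached(cache_path, slow_summary)  # reloads from the file
print(first == second, len(calls))
```

In a real notebook you would point `cache_path` at a file in your project directory and delete that file whenever the underlying calculation changes.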

Please use the following named sections. There will probably be overlap between the early sections and your project proposal. It's fine to reuse text from the proposal, but some projects have made large or small changes to their topic, and we need to keep track of each project's current topic.

Problem Statement and Background (4 points)

A high-level statement of the problem you intend to address, e.g. finding correspondences between neural recordings and DNN layers. Try to translate the high-level into specific questions if you can.

Give background on the problem you are solving: why it is interesting, who is interested, what is known, some references about it, etc.

The Data Source(s) You Intend to Use (2 points)

Describe the data source(s) you will use. If you're not doing one of the recommended projects, make sure you have access to the data you want to use in the quantity and quality you need. Describe how much data you have, where it is stored, and if you will collect more in future.

Data Flaws/Weaknesses and Cleaning (4 points)

Now that you have some data in hand and are exploring it, describe any flaws or challenges in the data, e.g. formatting errors, out-of-range errors, missing fields, noise, etc.

If data is being joined, describe the joining process and any problems with it. If you use fuzzy joins, explain the similarity metric they rely on.
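As an illustration of what "the metric used for a fuzzy join" might look like, the sketch below matches keys from two sources with the similarity ratio from the standard library's `difflib.SequenceMatcher`. The metric, the 0.8 threshold, the function names, and the sample business names are all assumptions chosen for the example; your project may need a different metric (edit distance, token overlap, and so on).

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def fuzzy_join(left_keys, right_keys, threshold=0.8):
    """Pair each left key with its most similar right key, or None if below the threshold."""
    matches = {}
    for left in left_keys:
        best = max(right_keys, key=lambda right: similarity(left, right))
        score = similarity(left, best)
        matches[left] = (best if score >= threshold else None, round(score, 2))
    return matches

# Hypothetical keys from two datasets that name the same businesses differently.
left = ["Blue Bottle Coffee", "Cheese Board Collective", "Unknown Place"]
right = ["blue bottle coffee co.", "The Cheese Board Collective", "Top Dog"]

joined = fuzzy_join(left, right)
for key, (match, score) in joined.items():
    print(f"{key!r} -> {match!r} (score {score})")
```

Reporting the score alongside each match makes it easy to inspect borderline pairs by hand and to justify the threshold in your write-up.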

Explain how you will handle missing or duplicate keys. Describe the tools you used to examine/repair/clean the data.
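A minimal sketch of auditing keys before deciding how to handle them is shown below. It uses plain Python dictionaries so that it stays self-contained; the field name `id`, the sample records, and the "drop missing, keep first duplicate" policy are illustrative assumptions, not a recommended rule for every dataset.

```python
from collections import Counter

# Hypothetical records with one missing key, one empty key, and one duplicate.
records = [
    {"id": "a1", "value": 3.2},
    {"id": "a2", "value": 4.1},
    {"id": None, "value": 2.7},   # missing key
    {"id": "a2", "value": 4.3},   # duplicate key
    {"id": "", "value": 1.9},     # empty string, treated as missing here
]

def audit_keys(rows, key="id"):
    """Return the row indices with a missing key and the keys that occur more than once."""
    missing = [i for i, row in enumerate(rows) if not row.get(key)]
    counts = Counter(row[key] for row in rows if row.get(key))
    duplicates = {k: n for k, n in counts.items() if n > 1}
    return missing, duplicates

def drop_missing_keep_first(rows, key="id"):
    """One possible policy: discard rows without a key, keep the first row per key."""
    seen, cleaned = set(), []
    for row in rows:
        k = row.get(key)
        if not k or k in seen:
            continue
        seen.add(k)
        cleaned.append(row)
    return cleaned

missing_rows, duplicate_keys = audit_keys(records)
cleaned = drop_missing_keep_first(records)
print(missing_rows, duplicate_keys, [row["id"] for row in cleaned])
```

Whatever policy you choose, report how many rows it affected so readers can judge how much of the data was changed.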

Check for statistical anomalies, such as outliers. Show histograms if you encounter any such problems.
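One common way to flag outliers is the interquartile-range (IQR) rule, sketched below with the standard library only. The 1.5 multiplier is a convention rather than a requirement, and the sample values and function names are made up for illustration. The bin counts returned by `histogram_counts` are what a plotting library would draw as a histogram in your notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def histogram_counts(values, bins=5):
    """Count values in equal-width bins spanning the range from min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]

outliers = iqr_outliers(values)
counts = histogram_counts(values)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
print("outliers:", outliers)
```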

Basic Data Characteristics (4 points)

This section and the following ones are highly dependent on the type of data you have. For numerical data, you can use the methods we used in Lab 2 (EDA) and HW 2, i.e. descriptive statistics, histograms, scatter plots, and regression. Don't restrict yourself to those. Use other tools that we talked about (e.g. box-and-whisker plots and the other visualization types from Lecture 2) if they are appropriate.
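For numerical data, descriptive statistics and a simple regression can be sketched as follows. This is a standard-library illustration with made-up numbers, not the code from Lab 2 or HW 2; the function names `describe` and `least_squares` are hypothetical.

```python
import statistics

def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

summary = describe(ys)
slope, intercept = least_squares(xs, ys)
print(summary)
print(f"fit: y = {slope:.2f} * x + {intercept:.2f}")
```

A scatter plot of `xs` against `ys` with the fitted line overlaid is usually the quickest check that a linear fit is reasonable at all.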

For text, you can make basic power-law plots of the word-frequency distribution, although this by itself is unlikely to be very interesting (almost any text dataset will show a power-law roll-off with an exponent near -1). If possible, use natural language techniques to enrich the raw text data, e.g. NER to add place and business names, or a parser to find sentiment words. Measure and visualize the statistics of these special words/phrases.
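The word-distribution check can be sketched as a rank-frequency table plus a straight-line fit in log-log space, where the slope estimates the power-law exponent. The tokenizer regex, the function names, and the synthetic text (constructed so that frequency is exactly 60 / rank) are assumptions for illustration; real text will only follow the law approximately.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, count) pairs for words, with the most frequent word at rank 1."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic text in which the word at rank r appears 60 / r times.
vocab = {"alpha": 60, "beta": 30, "gamma": 20, "delta": 15, "epsilon": 12}
text = " ".join(" ".join([word] * n) for word, n in vocab.items())

pairs = rank_frequency(text)
slope = loglog_slope(pairs)
print(pairs)
print(f"estimated exponent: {slope:.3f}")
```

Plotting the same `(rank, count)` pairs on log-log axes gives the usual power-law plot; the enrichment steps (NER, sentiment words) would then be measured on top of this baseline.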

Image data is more challenging, although if you picked an image project, you probably have some familiarity with image-processing tools (scikit-image, OpenCV or ImageJ). Give some summary characteristics of your images, perhaps dynamic range and estimates of SNR, and perhaps try clustering them into a few image classes.
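As a rough sketch of such summary characteristics, the code below computes the dynamic range and a crude global SNR estimate for a grayscale image stored as a list of pixel rows. Defining SNR as mean intensity divided by the standard deviation of all pixels is only one simple convention and is an assumption here; a project-specific estimate (for example, from a flat background region) may be more appropriate. In practice you would load real images into arrays with one of the tools above rather than use nested lists.

```python
import math
import statistics

def image_summary(image):
    """Summary statistics for a grayscale image given as a list of rows of pixel values."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    mean = statistics.mean(pixels)
    std = statistics.pstdev(pixels)  # population standard deviation over all pixels
    # Crude global SNR in decibels; undefined for a perfectly flat or all-zero image.
    snr_db = 20 * math.log10(mean / std) if std > 0 and mean > 0 else None
    return {
        "min": lo,
        "max": hi,
        "dynamic_range": hi - lo,
        "mean": mean,
        "std": std,
        "snr_db": snr_db,
    }

# Tiny hypothetical 2x2 images.
gradient = [[10, 20], [30, 40]]
flat = [[7, 7], [7, 7]]

gradient_summary = image_summary(gradient)
flat_summary = image_summary(flat)
print(gradient_summary)
print(flat_summary)
```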

Surprises (4 points)

Summarize the most surprising results you found in your exploration of the data. Include graphics as appropriate. Explain why they were surprising - i.e. what you expected to see, and what you actually saw. Remember to form expectations about the data - but prepare to be wrong. If you don't have expectations, you can't be surprised.

Next Steps, any Obstacles (2 points)

Summarize your next steps, which will probably be your formal data analysis. Please mention any obstacles that you have run into or anticipate at this stage.

Submit

Submit to the submission page here.