3.5 Exploratory Data Analysis

Much of the work emLab does is Exploratory Data Analysis (EDA), especially in the early stages of a project, when we are familiarizing ourselves with new datasets and rapidly prototyping data wrangling and modeling approaches. Hadley Wickham’s R for Data Science describes EDA as follows:

EDA is an iterative cycle. You: 1) Generate questions about your data; 2) Search for answers by visualising, transforming, and modelling your data; 3) Use what you learn to refine your questions and/or generate new questions.

EDA is a creative and exploratory process without defined rules. Hadley Wickham’s chapter on EDA is an excellent place to start for ideas, but we encourage creativity and exploration during this phase of a project.

However, as a best practice, we recommend that every EDA include an “Interpretations, questions, and new ideas” section. The researcher doing the EDA is in an excellent position to provide initial interpretations of the data, raise outstanding questions about it, and generate novel insights and potential research questions. This section should contain the following information:

Interpretations about what you found. Does it match our intuition? Is there anything surprising?
Questions about the data. Are there any questions about how to use or interpret the data? Are there any potential problems with the data or analysis?
New insights and research questions. Have you uncovered any new insights about the data or research? Have you discovered any new research questions that might be worth pursuing?

By including a section in each EDA that discusses these aspects, we can build insight generation into our workflows. This information should be in an easy-to-find location on the Team Drive, such as a markdown-reports directory that contains EDA PDFs generated by R Markdown. This way, other members of the project team can read it, digest it, and respond with further ideas.
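
As a starting point, the recommended section could be laid out in an R Markdown file along the following lines. This is only a minimal sketch, not an official emLab template: the title, author, and date placeholders, the "Data overview" heading, and the sub-heading names are assumptions made for illustration. The `output: pdf_document` line uses R Markdown's standard PDF output format, which matches the EDA PDFs mentioned above.

```markdown
---
title: "EDA: <dataset name>"   # placeholder, replace with your own title
author: "<your name>"          # placeholder
date: "<date>"                 # placeholder
output: pdf_document
---

## Data overview

<!-- Hypothetical section: summaries, plots, and tables from the exploration go here. -->

## Interpretations, questions, and new ideas

### Interpretations

- What did you find? Does it match our intuition? Is there anything surprising?

### Questions about the data

- Are there any questions about how to use or interpret the data?
- Are there any potential problems with the data or analysis?

### New insights and research questions

- Have you uncovered any new insights about the data or research?
- Have you discovered any new research questions that might be worth pursuing?
```

Knitting a file like this and saving the resulting PDF in the shared reports directory keeps the interpretation next to the figures it refers to.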