29 Nov 2021
Data Science

The Column Hackathon

A short writeup of my participation in a hackathon, covering the exploratory data analysis and cleaning I performed, the data visualizations, and my interpretation of the data.

Introduction

This project was completed as part of the Data Career Jumpstart Hackathon with data from The Column, a chemical engineering email newsletter. The objective was to help The Column optimize its email distribution and content to cater to its readers.

EDA and Data Cleaning

The data provided consisted of multiple spreadsheets containing email open rates, links, and link click counts, which had to be parsed and combined into one spreadsheet for analysis. There were also many duplicate rows and rows with invalid data that had to be purged before EDA could begin. In total, 563 links were parsed and cleaned up for analysis.
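The combine-and-purge step described above can be sketched with pandas. This is only a rough illustration, not the code used in the hackathon: the column names `link` and `clicks`, the helper `combine_and_clean`, and the rule for what counts as an invalid row are all assumptions, and the two small sheets are made up.

```python
import pandas as pd


def combine_and_clean(frames):
    """Stack per-sheet DataFrames and drop invalid or duplicate rows.

    Assumes (hypothetically) that every sheet has 'link' and 'clicks' columns.
    """
    combined = pd.concat(frames, ignore_index=True)
    # Coerce click counts to numbers; unparseable entries become NaN.
    combined["clicks"] = pd.to_numeric(combined["clicks"], errors="coerce")
    # Treat a row as invalid if it has no link or no usable click count.
    combined = combined.dropna(subset=["link", "clicks"])
    combined = combined[combined["clicks"] >= 0]
    # Remove rows that are exact duplicates of an earlier row.
    combined = combined.drop_duplicates()
    return combined.reset_index(drop=True)


# Made-up sheets standing in for the real spreadsheets.
sheet_a = pd.DataFrame({
    "link": ["https://example.com/a", "https://example.com/b"],
    "clicks": [12, "n/a"],
})
sheet_b = pd.DataFrame({
    "link": ["https://example.com/a", None, "https://example.com/c"],
    "clicks": [12, 3, 7],
})

cleaned = combine_and_clean([sheet_a, sheet_b])
print(cleaned)
```

In practice the frames would come from something like `pd.read_excel` or `pd.read_csv` on each file, and the validity rules would depend on what the bad rows actually looked like.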

I used a package named AutoViz to automatically examine the data and suggest visualizations to help with the EDA. This is a great way to quickly understand what you are working with, but in my experience it also tends to churn out some pretty useless visualizations.

From the graph on the left, we can see that readers are more responsive to the newsletter when it is sent on Monday, as the click rate to external links is much higher. The chart below shows that readers are more engaged with the newsletter around mid-morning.
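For readers who want to reproduce this kind of comparison, here is a minimal sketch of how click rate could be aggregated by send day and send hour. It assumes click rate means total clicks divided by total opens, which the writeup does not define, and the column names `sent_at`, `opens`, and `clicks` as well as the three rows of data are made up for illustration.

```python
import pandas as pd

# Made-up send log; the real data had different columns and many more rows.
sends = pd.DataFrame({
    "sent_at": pd.to_datetime([
        "2021-11-01 10:00",  # a Monday
        "2021-11-03 15:00",  # a Wednesday
        "2021-11-08 10:30",  # a Monday
    ]),
    "opens": [200, 180, 220],
    "clicks": [50, 18, 66],
})


def click_rate_by(frame, key):
    """Return total clicks divided by total opens for each value of `key`."""
    totals = frame.groupby(key)[["clicks", "opens"]].sum()
    return totals["clicks"] / totals["opens"]


# Derive the grouping keys from the send timestamp.
sends["weekday"] = sends["sent_at"].dt.day_name()
sends["hour"] = sends["sent_at"].dt.hour

by_weekday = click_rate_by(sends, "weekday")
by_hour = click_rate_by(sends, "hour")
print(by_weekday)
print(by_hour)
```

Summing clicks and opens before dividing weights each send by its audience size; averaging per-send rates instead would be another reasonable choice.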

Link Scraping and Results

The processed links were then scraped to determine which topics readers interacted with most. Approximately 442k words were analyzed. I created word frequency and pair frequency graphs to visualize the most commonly occurring words and word pairs from the links that readers clicked on in the newsletter. This was assumed to be a measure of the content readers were most interested in.
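The word and pair counting described above can be sketched with the standard library alone. This is an illustration under assumptions rather than the original code: the tiny stopword list, the simple regex tokenizer, and the two sample page texts are all made up, and the scraping step itself is left out.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "its"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def word_and_pair_counts(documents):
    """Count single words and adjacent word pairs across all documents."""
    word_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # Pairs are adjacent tokens after stopword removal, within one document.
        pair_counts.update(zip(tokens, tokens[1:]))
    return word_counts, pair_counts


# Made-up page texts standing in for the scraped link content.
pages = [
    "The chemical company issued a press release.",
    "Press release: chemical plant expansion announced.",
]

words, pairs = word_and_pair_counts(pages)
print(words.most_common(3))
print(pairs.most_common(1))
```

The resulting counters can be passed straight to a bar chart; whether pairs should be formed before or after stopword removal is a judgment call that changes which pairings surface.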

The graphs above show that "chemical" and "company" are the two most common words across all of the links clicked, which makes sense given the audience of this newsletter. Interestingly, "press release" is the most common pairing, indicating that readers are interested in reading the press releases behind the stories in the newsletter.

Finally, just to add a little more character to the project, I created the wordcloud below with text size indicative of word count.


The Certificate

And of course, what would any good Hackathon contest be without a participation certificate?
