Is Candy Corn Really that Hated? A Fun Halloween Data Engineering Project.

Steven Johnson
I have a secret to share with you, Shipyard community... I LOVE candy corn. I love going to stores during Halloween season and grabbing all the candy corn-flavored items. This year, I've secured candy, a milkshake, a candle, and energy drinks that all have that wonderful candy corn flavor.

When I see all the new candy corn items, I get excited and grab one (or maybe a few), and I feel like everyone at the store starts to judge me. Sometimes it seems like all of these items are a trap. Are the stores stocking them just to see who is crazy enough to buy them? Should liking candy corn be as weird as society makes it out to be?

According to 538, the top two candies of the Halloween season are Reese's and Twix. Do people actually like those candies more than candy corn? Before starting my fun Halloween data engineering project, I headed over to Twitter for a quick look.

Well, that seems to back up 538 pretty well. However, two Tweets are hardly enough to gauge the general public's views on these candies, so I decided to take a more in-depth look at the sentiment around all three.

I downloaded 500 Tweets that mention each candy, cleaned them up and tokenized them, then passed them through a sentiment analysis model to see if the Tweets were positive or negative. Let's jump in to see if candy corn deserves the hate it receives.

Gaston from "Beauty and the Beast" loves candy corn. What does sentiment analysis from Twitter/X tell us in this fun data engineering project?

Download Tweets

We have to start with the Twitter API to search for Tweets that mention each specific candy. I'm not going to walk through the process of signing up for a Twitter developer account; however, you'll need to create one here and apply for elevated access (my application was accepted immediately) to follow along.

With those prerequisites out of the way, let's download some Tweets. It's possible to follow Twitter's API documentation for this process, but I found Tweepy much simpler. I created an org Blueprint in Shipyard for this step so I could conceal my API credentials and reuse the code to grab Tweets at a later date if needed. Here's the code I used in my Blueprint:

import tweepy
import pandas as pd
import os

# Credentials and the search term come in as environment variables
# (Blueprint Variables in Shipyard)
api_key = os.environ.get('api_key')
api_key_secret = os.environ.get('api_key_secret')
access_token = os.environ.get('access_token')
access_token_secret = os.environ.get('access_token_secret')
keyword = os.environ.get('keyword')

def grab_tweets(candy):
    # Authenticate with the Twitter API through Tweepy
    auth = tweepy.OAuthHandler(api_key, api_key_secret)
    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth)

    # Search for the candy name, excluding retweets
    keywords = candy + " -filter:retweets"
    limit = 500
    tweets = tweepy.Cursor(api.search_tweets,
                           q=keywords,
                           count=15,
                           tweet_mode='extended').items(limit)

    columns = ['User', 'Tweet', 'Date', 'Location', 'Sentiment']
    data = []

    # Keep only the fields we care about; Sentiment gets filled in later
    for tweet in tweets:
        data.append([tweet.user.screen_name, tweet.full_text,
                     tweet.created_at, tweet.user.location, ''])

    df = pd.DataFrame(data, columns=columns)

    # Write the Tweets out to a CSV named after the candy
    df.to_csv(f'{candy}.csv')

grab_tweets(keyword)

A few things to note for the code above:

  • I treated api_key, api_key_secret, access_token, access_token_secret, and keyword as Blueprint Variables in Shipyard to allow users to input their own values (see the local-testing sketch after this list).
  • You need to install Pandas and Tweepy as Python packages.
  • The limit value is the number of Tweets being pulled. For Tweepy, it's maxed out at 1,000.
  • I placed the Tweets into a data frame and later wrote them to a CSV to make them easier to work with in the following steps.
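
If you want to test the script outside of Shipyard, you can supply the same values yourself before running it. Here's a minimal sketch; the placeholder values and the candycorn keyword are just examples:

# Hypothetical local test: set the environment variables the script expects.
# In Shipyard, these come from Blueprint Variables instead.
import os

os.environ['api_key'] = 'YOUR_API_KEY'
os.environ['api_key_secret'] = 'YOUR_API_KEY_SECRET'
os.environ['access_token'] = 'YOUR_ACCESS_TOKEN'
os.environ['access_token_secret'] = 'YOUR_ACCESS_TOKEN_SECRET'
os.environ['keyword'] = 'candycorn'  # also becomes the output file name, candycorn.csv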

With the Blueprint created, you can now create a new Fleet in Shipyard and use it to create a separate Vessel for each candy.


Now that we have Vessels in place to download Tweets about each candy, we can start training our model and scoring our Tweets.

Preparing for Analysis

Sentiment analysis can be done in many ways with Python. There are pre-trained models such as VADER, but I decided to use the Natural Language Toolkit (NLTK) to train a model on example Tweets. I followed a guide by Shaumik Daityari to train the model and to clean and tokenize my Tweets. To do this, I created four Python functions: Remove Noise, Get All Words, Get Tweets for Model, and Train Model.
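
If you're following along, NLTK's sample Tweets, tagger, and lemmatizer data need to be downloaded once before these functions will run. The final Vessel later in this post includes these same downloads:

import nltk

# One-time downloads used by the functions below
nltk.download('twitter_samples')             # labeled sample Tweets for training
nltk.download('averaged_perceptron_tagger')  # POS tagger behind pos_tag
nltk.download('wordnet')                     # dictionary for the lemmatizer
nltk.download('omw-1.4')                     # WordNet data the lemmatizer relies on
nltk.download('punkt')                       # tokenizer model for word_tokenize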

Remove Noise

def remove_noise(tweet_tokens, stop_words=()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        # Strip links and @mentions
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        # Map the POS tag to the form WordNetLemmatizer expects
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        # Drop empty tokens, punctuation, and stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

The remove_noise function does the cleaning for us. It uses regex to strip mentions and links from our Tweets, lemmatizes each remaining token, and lets us pass in our own list of stop words to remove, since those words don't factor into sentiment.
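
To make that concrete, here's a quick illustration on a made-up, already-tokenized Tweet (the exact output depends on the tagger, but it should look something like this):

# Hypothetical tokens, as a Tweet tokenizer would produce them
tokens = ['@someuser', 'Candy', 'corn', 'is', 'great', '!', 'https://example.com']
print(remove_noise(tokens, STOP_WORDS))
# The mention, link, stop words, and punctuation are gone,
# leaving something like: ['candy', 'corn', 'great']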

Get All Words and Get Tweets for Model

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

The two functions above are helpers that put our training and test data into the format the classifier expects in the next step.
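
For example, get_tweets_for_model turns each cleaned token list into a dictionary of token: True entries, the simple presence-based feature format that NLTK's NaiveBayesClassifier works with:

cleaned = [['candy', 'corn', 'great']]
print(list(get_tweets_for_model(cleaned)))
# -> [{'candy': True, 'corn': True, 'great': True}]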

Train Model

def train_model():
    # Labeled sample Tweets that ship with NLTK: 5,000 positive, 5,000 negative
    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    # Clean every sample Tweet with remove_noise from above
    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, STOP_WORDS))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, STOP_WORDS))

    # get_all_words is handy for exploring word frequencies in the samples
    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    # Convert the token lists into NLTK's feature-dictionary format
    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                        for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                        for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    # 10,000 labeled Tweets in total: a 70/30 train/test split
    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)
    return train_data, test_data, classifier

The function above runs through the previous functions to prepare the sample Tweets, then trains and tests our model. It returns the training data, the testing data, and the classifier that we will use to score our Tweets. I placed all four function definitions in a single Vessel in Shipyard along with the following import statements, and connected the Vessel to the Download Tweets Vessels from earlier.

import re
import string
from nltk import NaiveBayesClassifier
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from nltk.tag import pos_tag
import random

Modeling

We need to add one final vessel to actually run the functions that we just created and score the Tweets we downloaded. Let's jump into the code.

from train_model import *
import re
import string
import random
import nltk
from nltk import NaiveBayesClassifier, classify
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import pandas as pd

# One-time NLTK data downloads
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

# Silence pandas' chained-assignment warning for the scoring loop below
pd.options.mode.chained_assignment = None

# Train the model and report how well it performs on held-out data
train_data, test_data, classifier = train_model()

print("Accuracy is:", classify.accuracy(classifier, test_data))

print(classifier.show_most_informative_features(5))

# Load the Tweets downloaded by the earlier Vessels
candycorn = pd.read_csv('candycorn.csv')
reeses = pd.read_csv('reeses.csv')
twix = pd.read_csv('twix.csv')

candies = [candycorn, reeses, twix]

# Clean, tokenize, and score every Tweet: 1 for Positive, 0 for Negative
for candy in candies:
    for i in candy.index:
        custom_tokens = remove_noise(word_tokenize(candy['Tweet'][i]), STOP_WORDS)
        sentiment = classifier.classify(dict([token, True] for token in custom_tokens))
        if sentiment == 'Positive':
            candy['Sentiment'][i] = 1
        else:
            candy['Sentiment'][i] = 0

# Average the 0/1 scores to get an overall sentiment per candy
candycorn_sentiment = candycorn['Sentiment'].mean()
reeses_sentiment = reeses['Sentiment'].mean()
twix_sentiment = twix['Sentiment'].mean()


print('Sentiment is scored on a scale of 0 to 1: 0 is extremely negative and 1 is extremely positive.')
print(f'The sentiment of Candy Corn is {candycorn_sentiment}.')
print(f'The sentiment of Reeses is {reeses_sentiment}.')
print(f'The sentiment of Twix is {twix_sentiment}.')

Important note: I named the Python script containing the function definitions train_model.py. You'll need to change the first line of the code above to match whatever you named that file.

The code above does three main things: trains and evaluates a model, cleans and scores each Tweet, and provides a conclusion score. Let's break it down part by part:

Train and Provide a Model

train_data, test_data, classifier = train_model()

print("Accuracy is:", classify.accuracy(classifier, test_data))

print(classifier.show_most_informative_features(5))

The code above takes the model that we built from the training Tweets and returns a classifier object. We use the held-out test Tweets to measure the model's accuracy and print the most informative features, so we have a good idea of what is happening behind the scenes.
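
As a quick sanity check, you can also hand the classifier a single made-up Tweet before scoring the real ones. A small sketch (the exact result can vary run to run because of the random shuffle):

custom_tweet = 'I just had some candy corn and it made my day!'
custom_tokens = remove_noise(word_tokenize(custom_tweet), STOP_WORDS)

# Expect 'Positive' here for most training runs
print(classifier.classify(dict([token, True] for token in custom_tokens)))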

Clean and Score Each Tweet

candycorn = pd.read_csv('candycorn.csv')
reeses = pd.read_csv('reeses.csv')
twix = pd.read_csv('twix.csv')

candies = [candycorn, reeses, twix]


for candy in candies:
    for i in candy.index:
        custom_tokens = remove_noise(word_tokenize(candy['Tweet'][i]), STOP_WORDS)
        sentiment = classifier.classify(dict([token, True] for token in custom_tokens))
        if sentiment == 'Positive':
            candy['Sentiment'][i] = 1
        else:
            candy['Sentiment'][i] = 0

This code block takes the CSVs from our initial Vessels and runs each Tweet through a loop that cleans and tokenizes it. Then each Tweet is scored using our model. The classifier returns a string of Positive or Negative; I mapped positive scores to 1s and negative scores to 0s to make computations easier later and wrote the values into the corresponding data frames.
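
One aside: the pd.options.mode.chained_assignment = None line near the top of the Vessel suppresses the warning pandas raises about the candy['Sentiment'][i] indexing pattern. If you'd rather avoid the warning entirely, assigning through .loc is the more idiomatic equivalent:

# Same effect as the if/else above, without the chained-assignment warning
candy.loc[i, 'Sentiment'] = 1 if sentiment == 'Positive' else 0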

Provide a Conclusion Score

candycorn_sentiment = candycorn['Sentiment'].mean()
reeses_sentiment = reeses['Sentiment'].mean()
twix_sentiment = twix['Sentiment'].mean()


print('Sentiment is scored on a scale of 0 to 1: 0 is extremely negative and 1 is extremely positive.')
print(f'The sentiment of Candy Corn is {candycorn_sentiment}.')
print(f'The sentiment of Reeses is {reeses_sentiment}.')
print(f'The sentiment of Twix is {twix_sentiment}.')

Last but not least, we need to provide a conclusion and hopefully give candy corn the respect it deserves. I take the average of the Sentiment column that we created above, where 0 is extremely negative and 1 is extremely positive; since each Tweet is scored as a 1 or a 0, this average is simply the share of Tweets the model classified as positive. With all that code out of the way, my final Shipyard Fleet looks like this:

After saving my changes, I kicked off a run to take a look at our results.

Results

Model Accuracy


In our results, our model has an accuracy of 99.6%, and the run output also lists the most informative features. Now we can take a look at the results for each candy.

Candy Sentiment

These results are quite shocking, to say the least, given the 538 article from the introduction. Candy corn and Reese's have similar scores that lean positive, while Twix leans more toward the negative side based on the 500 Tweets that were downloaded.

To be more definitive, it would be great to look at a much larger sample of Tweets, but I'll take this small win for candy corn. Regardless of what candy you enjoy, our team at Shipyard wishes you a Happy Halloween!

Ready to get started with Shipyard? Sign up for our free Developer plan now.