Weeks two and three at Metis got kind of mathy! Armed with some EDA and data cleaning skills, we jumped into linear regression. Linear regression isn’t always the best model, but it is often a good place to start. It is a powerful tool for making sense of data, and one that is sure to show up again and again.
Business Problem
In linear regression, the goal is to predict the value of some continuous variable for a given observation, or data point. The classic example is predicting a house price given its features.
While developing my project, I was curious about how households choose to spend their money and what drives those decisions. Is there something inherent about cultural norms that could indicate what a household might be likely to spend its money on?
Understanding this dynamic better could be powerfully informative for companies: which international markets are likely to spend more on education, for example, or innovative food products? Such an understanding could help to inform whether or not a company should enter a new country or whether a company in a given country should expand its offerings and make a new subset of products available.
Background
How does one measure cultural norms? There are several models to choose from, but in this case I chose to use Hofstede’s six cultural dimensions because they’re quantified.
An example of a cultural norm is individualism versus collectivism. In a country that leans more towards individualism, people define themselves by their qualities and preferences, but in a country that veers towards collectivism, people define themselves through specific relationships with others.
Data Collection and Methods
I combined data from five different sources for this project:
Household spending metrics:
Cultural metrics:
Other country metrics:
- From the World Bank
- Data on median household income
I collected much of this data through scraping using Selenium and BeautifulSoup and then spent a great deal of time cleaning, converting currencies, and merging my datasets.
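To give a sense of what the converting and merging step involves, here is a minimal sketch using plain Python dictionaries. Everything in it is hypothetical: the country names, currency codes, exchange rates, scores, and the `merge_on_country` helper are made up for illustration, and converting to US dollars is an assumption rather than a detail of my actual pipeline.

```python
# Hypothetical inputs: clothing spend in local currency, exchange rates,
# and one cultural score per country. None of these are real values.
spend_local = {"Country A": (1000.0, "AAA"), "Country B": (500.0, "BBB")}
usd_per_unit = {"AAA": 0.5, "BBB": 2.0}
culture = {
    "Country A": {"long_term_orientation": 70},
    "Country C": {"long_term_orientation": 40},
}


def merge_on_country(spend_local, usd_per_unit, culture):
    """Inner-join the datasets on country, converting spend to a common currency."""
    merged = {}
    # Keep only countries that appear in both datasets.
    for country in spend_local.keys() & culture.keys():
        amount, currency = spend_local[country]
        merged[country] = {
            "clothing_usd": amount * usd_per_unit[currency],
            **culture[country],
        }
    return merged


merged = merge_on_country(spend_local, usd_per_unit, culture)
```

An inner join like this silently drops countries that are missing from either source, which is one reason the final dataset ends up smaller than any of the inputs.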
Models
When I began modeling, I had many targets and features to choose from:
I chose clothing as my target variable because I had relatively few missing data points and, more importantly, because clothing is an interesting metric: it’s a necessity, but it is also something people might spend on for other reasons, such as maintaining appearance, giving gifts, or leisure. This suggests that spending on clothing could vary more between countries (if it is a priority in a given country for a particular reason) than spending on something like (unprepared) food, which is more of a baseline need. By looking at correlations between my target variable and my features, I chose a handful of features to predict spending on clothing.
Looking at p-values to see which features contributed significantly to the model, I decided which to keep and which to discard. Where pair plots indicated that I should, I used feature engineering to transform features (e.g. log, square root, interaction term) to see whether the transformed versions were more predictive. Additionally, I scaled my data (due to high variance) and fit a linear regression model as well as a polynomial model and two regularized models, lasso and ridge.
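As a rough sketch of what those transformations, the scaling, and the idea behind regularization look like, here is a small standard-library example. The numbers are made up, and `standardize` and `ridge_slope` are hypothetical helpers written for this illustration, not the library calls I used in the project. The ridge function covers only the one-feature case, where the penalty `alpha` simply shrinks the least-squares slope toward zero.

```python
import math
import statistics


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def ridge_slope(x, y, alpha):
    """Slope of a one-feature ridge fit with an unpenalized intercept.

    Minimizes sum((y - a - b*x)^2) + alpha * b^2.
    With alpha = 0 this is the ordinary least-squares slope.
    """
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + alpha)


# Made-up illustrative numbers, not values from the project's data.
gdp = [50.0, 200.0, 1200.0, 15000.0]
population = [3.0, 10.0, 60.0, 300.0]
clothing_spend = [1.0, 2.0, 4.0, 9.0]

# Candidate engineered features.
log_gdp = [math.log(v) for v in gdp]
sqrt_population = [math.sqrt(v) for v in population]
gdp_x_population = [g * p for g, p in zip(gdp, population)]

# Scale a feature, then compare the plain and penalized slopes.
scaled_log_gdp = standardize(log_gdp)
ols = ridge_slope(scaled_log_gdp, clothing_spend, alpha=0.0)
shrunk = ridge_slope(scaled_log_gdp, clothing_spend, alpha=5.0)
```

Scaling matters for the regularized models in particular, because the penalty acts on coefficient size, and coefficient size depends on the units of each feature.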
After all of that, the model that best fit my data was a basic linear regression and had two features:
- GDP
- Population
The adjusted R^2 value, which measures the proportion of variance in the target explained by the features (adjusted for the number of features), was 0.86, which is relatively high (closer to 1 is better). However, my root mean squared error was 1.5 billion, which is also high given that the average of my target variable was 2.8 billion. Regularization and my other attempts did not decrease this error; it appears that there was high variance in the data to begin with. Additionally, there are only so many countries in the world, so I had fewer data points to test on than would have been ideal.
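For reference, here is a minimal sketch of how these two metrics are computed, using the standard formulas. The function names are my own for this illustration. The last line puts the error in context by dividing the reported RMSE by the reported mean of the target.

```python
import math


def r_squared(actual, predicted):
    """Proportion of variance in the target explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R^2 for the number of features used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# RMSE relative to the mean of the target, using the figures reported above.
relative_rmse = 1.5e9 / 2.8e9  # roughly 0.54
```

An error of a bit more than half the average target value is why a high adjusted R^2 alone did not make me confident in the model's individual predictions.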
So what does this mean? It makes sense that GDP and population would correspond to how much households spend on clothing. What this basically indicates is that macroeconomics was correct: spending is proportional to how much money people have, and, in this case, more people means more spending.
I wanted to look at another model to get a better sense of what might contribute to one country spending more on clothing than another. For this model, my target variable was spend on clothing divided by total spend, so that each country had a proportion of how much was spent on clothing (relative to other spending) rather than just a sheer dollar amount.
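Here is a small sketch of that target, with made-up countries, categories, and amounts (the `clothing_share` helper is hypothetical too). It shows why the proportion is a different question from the dollar amount: the second country spends more on clothing in absolute terms, but both devote the same share of their spending to it.

```python
def clothing_share(spend_by_category):
    """Clothing spend as a fraction of total spend across all categories."""
    total = sum(spend_by_category.values())
    return spend_by_category["clothing"] / total


# Hypothetical spending figures, not values from the project's data.
spending = {
    "Country A": {"clothing": 5.0, "food": 30.0, "education": 15.0},
    "Country B": {"clothing": 12.0, "food": 60.0, "education": 48.0},
}

shares = {country: clothing_share(cats) for country, cats in spending.items()}
```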
After some modeling and feature engineering, I ended up with a linear regression showing that two cultural dimensions, long-term orientation and masculinity, contributed to predicting proportion of spend on clothing. Long-term orientation was positively correlated, but masculinity was negatively correlated, meaning that as masculinity went down, the proportion of spend on clothing went up. I should note here that I don’t think the word “masculinity” accurately represents the metric in question, so I’ve re-termed it as competition-orientation and its opposite as consensus-orientation for my purposes. For my model, this means that as consensus-orientation went up, so did the proportion of spending on clothing. Here are my descriptions of these cultural dimensions:
Though my root mean squared error for this model was low, so was my adjusted R^2, at 0.15. The intercept term was significant, likely indicating that there is a baseline proportion of spending that households devote to clothing, or, in other words, that there is little variation between countries. That said, these two cultural dimensions did seem to contribute to that variation.
Why these two cultural dimensions contributed to proportion of spend on clothing is not entirely clear. It seems plausible that the focus on quality of life in consensus-oriented cultures could correlate with prioritizing having nice things and, thus, with an increase in spending on clothing. More granular data about the countries in question and more specific spending data could elucidate this connection, as well as a potential link with long-term orientation.
Conclusion
I would advise a clothing company looking to expand to a new market to consider the long-term orientation and consensus-orientation of that market as metrics that might affect their decision. Additionally, I would advise a company already in a market high in long-term orientation and/or consensus-orientation to look into the market for clothing and whether they might want to create new lines or expand existing offerings. In both cases, I would advise the company to consider whether the margins on such a move would be worthwhile given predicted spend values, since those are correlated with GDP and population, which might be more consequential deciding factors.