<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[RITA BIAGIOLI - Blog]]></title><link><![CDATA[http://www.ritabiagioli.com/blog]]></link><description><![CDATA[Blog]]></description><pubDate>Sun, 29 Dec 2024 00:33:12 -0800</pubDate><generator>Weebly</generator><item><title><![CDATA[And now... a video!]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/and-now-a-video]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/and-now-a-video#comments]]></comments><pubDate>Thu, 07 May 2020 17:00:54 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/and-now-a-video</guid><description><![CDATA[&nbsp;Exciting news:&nbsp; you can now listen to me talk about my latest project -- What Products Bring Us Joy? -- on YouTube! Stay tuned for more data science projects to come! I have a few ideas I'm playing with. [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">&nbsp;Exciting news:&nbsp; you can now listen to me talk about my latest project -- What Products Bring Us Joy? -- on YouTube!<br>&#8203;</div><div><div id="488749983225344397" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><iframe width="560" height="315" src="https://www.youtube.com/embed/PJAtPkODagU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div></div><div class="paragraph"><br>Stay tuned for more data science projects to come! 
I have a few ideas I'm playing with.</div>]]></content:encoded></item><item><title><![CDATA[What Products Bring Us Joy?]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/what-products-make-us-happy]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/what-products-make-us-happy#comments]]></comments><pubDate>Mon, 06 Apr 2020 19:47:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/what-products-make-us-happy</guid><description><![CDATA[Consumerism, in general, fascinates me. I'm always curious how people engage with products and the emotional valence behind those interactions. What about an object makes us actually feel something? Why and how is this important?There's a lot of recent research indicating that we get more long-term joy out of experiences than we do from objects. At the same time, Marie Kondo has been wildly popular as of late, and her insinuation is that some objects do in fact spark joy. Not all of them, of cou [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">Consumerism, in general, fascinates me. I'm always curious how people engage with products and the emotional valence behind those interactions. What about an object makes us actually feel something? Why and how is this important?<br /><br />There's a lot of recent research indicating that we get more long-term joy out of experiences than we do from objects. At the same time, Marie Kondo has been wildly popular as of late, and her insinuation is that some objects do in fact spark joy. Not all of them, of course, but there are certainly some objects that we want to keep around. 
Ingrid Fetell Lee has a blog and recently wrote a popular book detailing characteristics of objects that bring us joy, such as color and shape.&nbsp;<br /><br />Given this contrast, how can we better understand what products actually&nbsp;<em>do</em>&nbsp;cause joy rather than (or even, perhaps, in addition to) gather dust?</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-06-at-2-53-16-pm.png?1586203073" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <h2 class="wsite-content-title"><strong>Data and Process</strong></h2>  <div class="paragraph">In order to address these questions, I analyzed 450 thousand Amazon product reviews. 
Tools I used included:<ul><li>PostgreSQL and SQLAlchemy to store and access data</li><li>Tone analysis using the <a href="https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm" target="_blank">NRC Emotion Lexicon</a><ul><li>It was really important to me to use a crowdsourced lexicon since writers of reviews are a similar population and, thus, meanings of words from the two sources would be more likely to align</li></ul></li><li>Feature engineering and principal component analysis (PCA) to extract/condense the most important features</li><li>Clustering (k-means)</li><li>Time series analysis using Facebook Prophet</li></ul><br />Using tone analysis, I was able to label each review with a score for joyfulness as well as for several other emotional metrics.&nbsp;<br /><br />I chose k-means clustering because this method made the most sense of my data. I settled on 8 clusters because this number both kept inertia low and was interpretable.<br />&#8203;<br /></div>  <h2 class="wsite-content-title"><strong>What Categories of Products Bring Happiness?</strong></h2>  <div class="paragraph">I was able to isolate a few categories wherein the reviews had higher joy ratings on average and a few categories that had lower joy ratings on average:&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-06-at-3-21-35-pm.png?1586204631" alt="Picture" style="width:582;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">How might we interpret this? The more joyful categories are actually products that tend to be more experiential while the less joyful categories are more practical or functional. 
This aligns nicely with prevailing research about experiences bringing more long-term joy than possessions.<br />&nbsp;</div>  <h2 class="wsite-content-title"><strong>Relationship Between Star Rating and Joy</strong></h2>  <div class="paragraph"><span>Ok, great. But-- I bet you're wondering about the one to five star rating that reviewers give a product.</span><br /><br /><span>Interestingly enough, there is...</span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-06-at-3-27-58-pm.png?1586204944" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">What this might mean is that joy is actually a separate metric from star rating. It's pretty common to use star rating as a measure of customer satisfaction, and it is, but this analysis would indicate that it is not the only viable or interesting measure.<br /><br />&#8203;So what's the difference between them? Let's look at four categories:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-06-at-3-39-25-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">In this figure, average rating for all data is on the y axis and average joy score for ratings with five stars is on the x axis. (As indicated by the lack of correlation, categories had similar average joy scores regardless of rating).<br /><br />&#8203;First let&rsquo;s start with gift cards&mdash; they&rsquo;re high on both average joy and average rating. 
What I think this means is that people enjoy the experience of giving, the act of it, but also, they know exactly what the product is&mdash; their expectations are met, which I think is what the rating reflects.<br /><br />Office products are relatively high on rating, but low on joy, illustrating that people are getting the function they expect (hence the rating), but they might not be optimizing for the experience.<br /><br />For clothing, people are very joyful, probably because they get to wear something new and experience the item very viscerally, but the ratings are a bit low&mdash; we often order clothing items and they&rsquo;re not really what we expected them to be (or at least that often happens to me!).<br /><br />And finally, software is low on joy and on rating, which means that the software might not be exactly what we expected and, also, we are not enjoying the experience of using it too much.<br /><br />So we&rsquo;ve examined joy by category, but taking category out of the equation, how can we group products agnostic of category and, then, how can we see which products are joyful?<br />&#8203;<br /></div>  <h2 class="wsite-content-title"><strong>What Products Bring Happiness Agnostic of Category?</strong></h2>  <div class="paragraph">In order to investigate this question, I used PCA and clustered products.<br /><br />My algorithm came up with eight clusters which I've labeled:<ul><li><strong><font color="#24678d">Very Joyful!</font></strong><ul><li>High anticipation, joy, positivity, surprise, trust</li></ul></li><li><strong><font color="#24678d">Medium Joyful</font></strong><ul><li>Med anticipation<br />&nbsp; and med joy, high &nbsp;<br />&nbsp; positivity, high trust</li></ul></li><li><strong><font color="#24678d">A Little Joyful</font></strong><ul><li>Med joy, high positivity<br /><br /><br /></li></ul></li><li><strong><font color="#33a27f">Emotionless</font></strong><ul><li>Low on all metrics</li></ul></li><li><strong><font 
color="#33a27f">Wrote a lot</font></strong><ul><li>Low on all metrics, high<br />&nbsp; on review length<br /><br /></li></ul></li><li><strong><font color="#c2743b">Not Thrilled</font></strong><ul><li>Med negativity</li></ul></li><li><strong><font color="#c2743b">Sad</font></strong><ul><li>Med fear, high negativity and sadness</li></ul></li><li><strong><font color="#c2743b">Sad and Disgusted</font></strong><ul><li>Med anger and sadness,&nbsp;high disgust and negativity</li></ul></li></ul><br />As you can see, my clusters are themselves clustered by gradation: the relatively happy reviews, neutral reviews, and unhappy reviews. Also worth noting is that this analysis was done on five-star reviews, so there were less joyful sentiments even about products that were rated highly.<br /><br />In order to delve further into what these clusters might indicate, I wanted to compare a couple of reviews from the sad cluster (leaving aside the sad and disgusted cluster, which is getting at a bit of a different construct) and the very joyful cluster.&nbsp;<br /><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-06-at-3-48-53-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph" style="text-align:left;">How can we summarize these reviews?&nbsp;<br /><br />For the<strong> sad</strong> cluster, these reviews basically say <br /><strong>"This thing did what I wanted it to do and there is nothing special about it."<br /></strong><br />On the other hand, for the <strong>very joyful</strong> cluster, the reviews say <br /><strong>"This thing made me <em>feel</em> and <em>experience </em>in a different way."</strong><br /><br />In the second review for the very joyful cluster, the reviewers are 
discussing a plug, but there is something about the design of it, the experience of using it, and, in this case, I would assume the convenience of it, that is getting them all excited about the product to the point that they would actually recommend it to others.<br /><br />One thing we can see here, beyond the fact that experience of a product seems to be at play, is that said experience is also informed by certain components of product design and the ways in which users engage with a product.<br /> <br /><br /></div>  <h2 class="wsite-content-title"><strong>A Word on Seasonality</strong></h2>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-06-at-3-57-49-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">It appears that there is a certain confluence between giving and customer satisfaction as measured through joy. Preliminary topic modeling on reviews also indicated that the very joyful cluster had more language related to giving.<br /><br />The further implication here is that what a customer actually&nbsp;<em>does</em>&nbsp;with a product matters. It's not just about the function that that product itself has, but about the emotional function the product has for the purchaser.&nbsp;<br />&#8203;<br /></div>  <h2 class="wsite-content-title"><strong>So What? 
Why is This Important?</strong></h2>  <div class="paragraph">Understanding consumer satisfaction using data that were not directly solicited from customers through surveys can inform UX research, can influence product design, and can contribute to crafting a more robust consumer strategy.<br /><br />Consumer satisfaction can have all kinds of positive externalities for word-of-mouth advertising, brand image, and even repurchase behavior.<br /><br />More broadly, the techniques I've illustrated here have business applications beyond this context and are likely to add value to any analytic process.<br />&#8203;<br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-06-at-4-04-13-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>]]></content:encoded></item><item><title><![CDATA[Who is at Fault? Metis Project 4]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/who-is-at-fault-metis-project-4]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/who-is-at-fault-metis-project-4#comments]]></comments><pubDate>Sun, 05 Apr 2020 21:50:14 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/who-is-at-fault-metis-project-4</guid><description><![CDATA[       Finally it had come: the project where we were going to use natural language processing (NLP) and I knew I wanted to do something a bit out of the box. My PhD involved a fair amount of moral psychology work and thinking about what others take offense to cross-culturally. We know some basic things: pretty much every culture is appalled by incest, for example. 
But what else could we find out? Because I also worked (in various capacities) at UChicago's Booth School of Business, I knew that teac [...] ]]></description><content:encoded><![CDATA[<div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/reddit-logo-main-1280x720.jpg?1586123455" alt="Picture" style="width:199;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">Finally it had come: the project where we were going to use natural language processing (NLP) and I knew I wanted to do something a bit out of the box.<br /><br />My PhD involved a fair amount of moral psychology work and thinking about what others take offense to cross-culturally. We know some basic things: pretty much every culture is appalled by incest, for example. But what else could we find out?<br /><br />Because I also worked (in various capacities) at UChicago's Booth School of Business, I knew that teaching soft skills is a serious priority. How can we best get along with others and make sure that group work flows smoothly? Being able to predict when others think we have crossed a line, or better understanding how they might react to our behavior more broadly, could help us to have more fruitful interactions with each other, leading, potentially, to more productivity within business contexts.<br /><br />Where might I find narratives detailing a moment when someone was unsure of whether they were at fault? Where could I find crowdsourced determinations of whether that person was at fault?<br /><br />&#8203;REDDIT!<br /><br />In this project, I sought to answer three questions:<br />&#8203;<br />1. Under what circumstances are people unsure of whether they&rsquo;re at fault?<br />2. How do others respond to those narratives?<br />3. 
Can we predict whom others will find at fault?<br />&#8203;<br /></div>  <h2 class="wsite-content-title"><strong>The Data</strong></h2>  <div class="paragraph">My data were from a subreddit called <a href="https://www.reddit.com/r/AmItheAsshole/" target="_blank">Am I the Asshole?</a> or AITA.&nbsp; This subreddit is often the butt of many a pop culture joke, which gave me all the more reason to take it seriously as a thing to analyze and make sense of.<br />&#8203;<br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-05-at-5-00-09-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">Using the Reddit API, I was able to gather around 800 posts and their comments from AITA. I stored these data in MongoDB.<br />&#8203;<br /></div>  <h2 class="wsite-content-title"><strong>1. When Are We Unsure of Fault?</strong></h2>  <div class="paragraph">To answer my first question, I used topic modeling on all of the posts I had. I ended up using a non-negative matrix factorization (NMF) model with a&nbsp;<span style="color:rgb(60, 64, 67)">term frequency-inverse document frequency (TF-IDF) matrix. 
So what topics came up?</span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-5-07-49-pm.png?1586124495" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">It's worth noting here that family (kid) means that you&rsquo;re the child in the family, while family (adult) means that you&rsquo;re the adult in the family.<br /><br />Which topics were discussed the most?</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-5-10-06-pm.png?1586124721" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">People seem to talk most about work and friends. This makes sense: these are situations where the impression you make likely matters to you. There is a&nbsp;middle level of closeness, as opposed to family who are stuck with you and people you might meet in passing, whom you won&rsquo;t see again. Instead, these middling levels of knowing someone lead to more need for image management, implying a potentially greater likelihood to second-guess one's own behavior.<br />&#8203;<span style="color:black"></span></div>  <h2 class="wsite-content-title"><strong>2. 
How Do Others Respond?</strong></h2>  <div class="paragraph">Though people talked the most about work and friends, the most commented-upon topic was, by far, family where the writer is the adult.&nbsp;Secondarily, people also like to comment on posts about weddings and posts related to family where the writer is the child.&nbsp;<br /><br />What this possibly means is that other people really have opinions on how one should run one&rsquo;s family, but people are somewhat less worried about how their actions will be received in their own families. When we tell narratives about our own families, we might not expect that others are evaluating our behavior, but, in fact, they are.<br /><br />&#8203;Something nice here, though, is that while you might be very worried about how your friends and colleagues perceive you and your actions with them, it's possible that they're not really all that worried about it.<br /><br />Another really nice finding is that the more positive we are, the more positive others are in response:&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-04-05-at-5-20-37-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">I used TextBlob and IBM Watson's Tone Analyzer to get sentiment (positive, negative) and tone (a range of emotions) for each post and comment. What I found is that people's sentiment actually mimics the sentiment of the post they're responding to. There's been a lot of research on how humans mirror each other-- usually in person-- but this is preliminary evidence that mirroring is occurring both in terms of sentiment and via written text.&nbsp;<br /><br />Practically, this is really interesting as a best practice for how we should engage with each other. 
Though this is just a correlation, it is possible that acting positively inspires others to be positive.<br /></div>  <h2 class="wsite-content-title"><strong><br />&#8203;3. Can We Predict Who is at Fault?</strong></h2>  <div class="paragraph">Finally, the question you've all been waiting for! Who is the asshole?! Is it me?!<br /><br />The short answer is, sadly, no: we can't predict who will be at fault given the data we have. All metrics were similar across people who were deemed assholes and people who were not deemed assholes. A classification model also didn't have much explanatory power.&nbsp;<br /><br />If I had to guess, the types of violations, rather than the topics discussed (the people violated), or, perhaps some combination thereof, are what would actually allow us to predict who is deemed at fault. Thus, answering this particular question might require more of a qualitative approach followed by a quantitative one.&nbsp;<br /><br />BUT! I did find one difference between people deemed at fault and people deemed not at fault:&nbsp;<br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-5-30-32-pm.png?1586125866" alt="Picture" style="width:428;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">Among other metrics, the average score indicated that posts where the author was not at fault received more upvotes.&nbsp;<br /><br />What I think this means is that people are upvoting or downvoting to flag as "asshole" or "not the asshole"&nbsp;instead of writing their opinion in the comments, which would then be tallied by the Reddit bot.<br /><br /><br /><span></span>This might be why people think that this subreddit is full of apologists: you&rsquo;re more likely to see upvoted posts at the top (due 
to Reddit's algorithm), and upvoted posts are more likely to be flagged as not at fault.<br /><br /><br /><span></span>So what can we conclude about determining fault overall? Topic, sentiment, and tone (emotion) do not signal whether someone is at fault. I do think there is something giving this signal, but that it likely has to do with violations related to autonomy and obligation to others&mdash; neither of which was picked up in the metrics used.<br /><span></span> <br /></div>]]></content:encoded></item><item><title><![CDATA[Predicting Earthquake Damage: Metis Project 3]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/predicting-earthquake-damage-metis-project-3]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/predicting-earthquake-damage-metis-project-3#comments]]></comments><pubDate>Sun, 05 Apr 2020 19:40:23 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/predicting-earthquake-damage-metis-project-3</guid><description><![CDATA[For our classification project at Metis, I chose to look at data from the 2015 Nepal earthquake. While examining these data is important in general, it was also somewhat personal for me: I was in Kolkata at the time of this earthquake and actually felt it. Weeks later, I traveled to a friend's home (where I have been many times) on the border of Nepal.Using data from DrivenData, I was hoping to differentiate buildings that were slightly damaged from those that were extensively damaged. If damage [...] 
]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:43.790849673203%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td><td class="wsite-multicol-col" style="width:10.457516339869%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td><td class="wsite-multicol-col" style="width:45.751633986928%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -30px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:96.078431372549%; padding:0 30px;"><div class="paragraph">For our classification project at Metis, I chose to look at data from the 2015 Nepal earthquake. While examining these data is important in general, it was also somewhat personal for me: I was in Kolkata at the time of this earthquake and actually felt it. Weeks later, I traveled to a friend's home (where I have been many times) on the border of Nepal.<br><br>Using data from <a href="https://www.drivendata.org/competitions/" target="_blank">DrivenData</a>, I was hoping to differentiate buildings that were slightly damaged from those that were extensively damaged. If damage level could be predicted, it would help Nepal to plan ahead and to mitigate future damage in this earthquake-prone region.<br><br>Here is a before image of a stupa in Nepal (on the left) and an after image (right). 
You can see that the earthquake truly was devastating in many locations.</div><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:50%; padding:0 15px;"><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:49.863760217984%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td><td class="wsite-multicol-col" style="width:50.136239782016%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div></td><td class="wsite-multicol-col" style="width:50%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div></td><td class="wsite-multicol-col" style="width:3.921568627451%; padding:0 30px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div><h2 class="wsite-content-title"></h2><h2 class="wsite-content-title"><strong>Business Problem</strong></h2><div class="paragraph">In this analysis, I wanted to address two questions:&nbsp;<ul><li>What can we do preemptively in order to minimize future earthquake damage?</li><li>What can we prepare to do reactively when an earthquake inevitably occurs?</li></ul><br>The clients I envisioned were an investing company interested in social entrepreneurship and the government of Nepal. 
The investing company would be interested in supporting industries that could fortify buildings, and the government could offer subsidies and incentives for businesses and citizens to build or rebuild using better materials.<br>&#8203;<br></div><h2 class="wsite-content-title"><strong>Model</strong></h2><div class="paragraph">After trying out a variety of classification models (logistic regression, SVM, Naive Bayes, and more), I found that a random forest model was the best at predicting whether a building would be significantly or minimally damaged (as measured by ROC AUC).&nbsp;<br><br>I began with 82 features and, using methods to determine which features were the most important, ended up with a model using 20 features.&nbsp;</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-3-26-59-pm.png?1586118475" alt="Picture" style="width:581;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">My model had an accuracy of 0.88, a recall of 0.93, and an ROC AUC of 0.82. In tuning my model, I wanted to prioritize recall, meaning that I would rather label buildings that would not be damaged as if they would be damaged than vice versa. 
This way, even if a building wouldn't ultimately be significantly damaged in another earthquake, the owners would have prepared as if it would be and would be safe rather than sorry.<br><br>A recall of 0.93 also means that...</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-3-32-18-pm.png?1586118892" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">Finally, I should also note that though I had geographic data for each building, I didn't use it in my model, because I wanted the model to be generalizable to other earthquakes. Though perhaps damage by region would reflect something specific about fault lines in Nepal, my hope is that my findings would be more widely applicable. Thus, the features in this model relate only to the characteristics of the buildings themselves.<br>&#8203;<br></div><h2 class="wsite-content-title"><strong>Insights</strong></h2><div class="paragraph">Through looking at different features, I found that foundation type, ground floor type, building age, and number of floors were particularly important in predicting whether a building would be minimally or significantly damaged. 
Unfortunately, my dataset obfuscated what the specific foundation and ground floor types were, but I was able to discern more specific&nbsp; information about building age and number of floors.<br><br>&#8203;"Ave Pred Proba" below refers to the average predicted likelihood of significant damage for that foundation type or number of floors.&nbsp;&#8203;<span>&#8203;This interactive visual was made with Tableau.</span></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-04-05-at-3-38-49-pm.png?1586119616" alt="Picture" style="width:576;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div><div id="251380019651532289" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class='tableauPlaceholder' id='viz1586119352341' style='position: relative'><noscript><a href='#'><img alt=' ' src='https://public.tableau.com/static/images/Ri/Rita_Project_03_Prob_of_Damage/ProbofDamage/1_rss.png' style='border: none'></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F'><param name='embed_code_version' value='3'><param name='site_root' value=''><param name='name' value='Rita_Project_03_Prob_of_Damage/ProbofDamage'><param name='tabs' value='no'><param name='toolbar' value='yes'><param name='static_image' value='https://public.tableau.com/static/images/Ri/Rita_Project_03_Prob_of_Damage/ProbofDamage/1.png'><param name='animate_transition' value='yes'><param name='display_static_image' value='yes'><param name='display_spinner' value='yes'><param name='display_overlay' value='yes'><param name='display_count' value='yes'></object></div></div></div><h2 class="wsite-content-title"></h2><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 
-15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:33.333333333333%; padding:0 15px;"><h2 class="wsite-content-title"><strong>Recommendations</strong></h2></td><td class="wsite-multicol-col" style="width:33.333333333333%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td><td class="wsite-multicol-col" style="width:33.333333333333%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div><div class="paragraph"><strong>Preemptive Recommendations<br>&#8203;</strong><br>Given what we've found, there's a lot that the investing company and the government of Nepal can do ahead of time to mitigate significant damage in the event of another earthquake. The most impactful thing they could do would be to fix buildings with risky attributes now. Additionally, they could support new builds using less risky materials through subsidies and investments. Further work could be done to determine exactly how risky or not risky a given material might be.<br><br><strong>Reactive Recommendations</strong><br><br>Of course, you can never plan for all of the damage that might occur, and there might still be a fair amount. What the government can do is to know which regions are particularly at risk. Assuming fault lines would be consistent across earthquakes, some regions would be more likely to be hit than others. 
My data detailed 31 regions, which are shown below in terms of likelihood of damage.<br><br>The first figure shows average likelihood of damage overall by region, the second shows the average age of buildings (which we found to be a relevant feature) by region, the third figure shows the average area of the building by region (larger buildings were more likely to incur damage), and the fourth figure shows the average number of floors by region.<br><br>&#8203;This interactive visual was made with Tableau.<br></div><div><div id="485669628479663410" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class='tableauPlaceholder' id='viz1586120255740' style='position: relative'><noscript><a href='#'><img alt=' ' src='https://public.tableau.com/static/images/Ri/Rita_Project_03_Geography/Geography/1_rss.png' style='border: none'></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F'><param name='embed_code_version' value='3'><param name='site_root' value=''><param name='name' value='Rita_Project_03_Geography/Geography'><param name='tabs' value='no'><param name='toolbar' value='yes'><param name='static_image' value='https://public.tableau.com/static/images/Ri/Rita_Project_03_Geography/Geography/1.png'><param name='animate_transition' value='yes'><param name='display_static_image' value='yes'><param name='display_spinner' value='yes'><param name='display_overlay' value='yes'><param name='display_count' value='yes'></object></div></div></div><div class="paragraph"><br><br>&#8203;I hope to be in or near Nepal again soon. 
It's a beautiful place at high risk, and is certainly worth striving to preserve.<br><br>Here is a view from my friend's house:</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/mirik_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div>]]></content:encoded></item><item><title><![CDATA[Can Cultural Dimensions Predict Household Spending? Metis Project 2]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/can-cultural-dimensions-predict-household-spending-metis-project-2]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/can-cultural-dimensions-predict-household-spending-metis-project-2#comments]]></comments><pubDate>Sat, 01 Feb 2020 19:33:24 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/can-cultural-dimensions-predict-household-spending-metis-project-2</guid><description><![CDATA[Weeks two and three at Metis got kind of mathy! Armed with some EDA and data cleaning skills, we jumped into linear regression. While linear regression isn&rsquo;t always the best model, more often it probably is. It is a powerful tool to make sense of data, and one that is sure to show up again and again.&#8203;  &#8203;Business Problem  In linear regression, the goal is to predict the value of some continuous variable for a given observation, or data point. The classic example is predicting a  [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Weeks two and three at Metis got kind of mathy! Armed with some EDA and data cleaning skills, we jumped into linear regression. While linear regression isn&rsquo;t always the best model, more often it probably is. 
It is a powerful tool to make sense of data, and one that is sure to show up again and again.<br /><br />&#8203;</span></span></div>  <h2 class="wsite-content-title">&#8203;<span><span style="color:rgb(0, 0, 0); font-weight:700">Business Problem</span></span></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">In linear regression, the goal is to predict the value of some continuous variable for a given observation, or data point. The classic example is predicting a house price given its features.&nbsp;</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">While developing my project, I was curious about how households choose to spend their money and what drives those decisions. Is there something inherent about cultural norms that could indicate what a household might be likely to spend its money on?</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Understanding this dynamic better could be powerfully informative for companies: which international markets are likely to spend more on education, for example, or innovative food products? Such an understanding could help to inform whether or not a company should enter a new country or whether a company in a given country should expand its offerings and make a new subset of products available.</span></span><br /><br />&#8203;</div>  <h2 class="wsite-content-title">&#8203;<span><span style="color:rgb(0, 0, 0); font-weight:700">Background</span></span></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">How does one measure cultural norms? There are several models to choose from, but in this case I chose to use <a href="https://www.hofstede-insights.com/product/compare-countries/" target="_blank">Hofstede&rsquo;s six cultural dimensions</a> because they&rsquo;re quantified.&nbsp;</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">An example of a cultural norm is individualism versus collectivism. 
In a country that leans more towards individualism, people define themselves by their qualities and preferences, but in a country that veers towards collectivism, people define themselves through specific relationships with others.</span></span><br /><br />&#8203;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/editor/screen-shot-2020-02-01-at-1-49-51-pm.png?1580587296" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><br />&#8203;<br /></div>  <h2 class="wsite-content-title">&#8203;<span><span style="color:rgb(0, 0, 0); font-weight:700">Data Collection and Methods</span></span></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">I combined data from five different sources for this project:</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Household spending metrics:&nbsp;</span></span><ul><li><a href="http://data.un.org/Data.aspx?d=SNA&amp;f=group_code%3A302" target="_blank">From the UN</a></li><li><a href="http://datatopics.worldbank.org/consumption/detail#datasource" target="_blank">From the World Bank</a></li></ul><br /><span><span style="color:rgb(0, 0, 0)">Cultural metrics:</span></span><ul><li><span><span style="color:rgb(0, 0, 0)"><a href="https://www.hofstede-insights.com/product/compare-countries/" target="_blank">From Hofstede</a></span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0)">Other country metrics:</span></span><ul><li>&#8203;<a href="https://data.worldbank.org/indicator" target="_blank">From the World Bank</a>&#8203;</li><li>Data on median household income</li></ul><br /><span><span style="color:rgb(0, 0, 0)">I collected much of this data through scraping using Selenium and BeautifulSoup and then spent a great deal of time cleaning, 
converting currencies, and merging my datasets. </span></span><br /><br />&#8203;</div>  <h2 class="wsite-content-title">&#8203;<span><span style="color:rgb(0, 0, 0); font-weight:700">Models</span></span></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">When I began modeling, I had many targets and features to choose from:<br />&#8203;</span></span><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/slide1.png?1580588442" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">I chose clothing as my target variable because I had relatively few missing data points and, more importantly, clothing is an interesting metric: it&rsquo;s a necessity, but it is also something people might spend on for other reasons, such as maintaining appearance, giving gifts, or leisure. This suggests that there could be more variance between countries in spending on clothing (if it is a priority for a given country for a particular reason) than there would be for something like (unprepared) food, which is more of a baseline need. By looking at correlations between my target variable and my features, I chose a handful of features to predict spending on clothing.&nbsp;</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Looking at p-values to see which features contributed significantly to the model, I figured out which to keep and which to discard. In some cases, where pair plots indicated that I should do so, I used feature engineering to transform features (e.g. log, square root, interaction term) in order to see if those transformations were more predictive. 
Additionally, I scaled my data (due to high variance) and fit a linear regression model as well as a polynomial model and regularization models lasso and ridge. </span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">After all of that, the model that best fit my data was a basic linear regression and had two features:</span></span><br /><br /><ul><li><span><span style="color:rgb(0, 0, 0)">GDP</span></span><br /></li><li> <span><span style="color:rgb(0, 0, 0)">Population</span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0)">The adjusted R^2 value, which measures the proportion of variance explained by the features (and which is adjusted for number of features), was 0.86, which is relatively high (closer to 1 is better), but my root mean squared error was 1.5 billion, which is also high given that the average of my target variable was 2.8 billion. Despite attempts, regularization and other efforts did not decrease this error; it appears that there was high variance in the data to begin with. Additionally, there are only so many countries in the world, so I had fewer data points to test on than would have been ideal.</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">So what does this mean? It makes sense that GDP and population would correspond to how much households spend on clothing. What this basically indicates is that macroeconomics was correct: spending is proportional to how much money people have, and, in this case, more people means more spending.</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">I wanted to look at another model to see if I could get a better look at what might contribute to one country spending more on clothing than another country would spend. 
For this model, my target variable was spend on clothing divided by total spend, so that each country had a proportion of how much was spent on clothing (relative to other spending) rather than just a raw dollar amount.&nbsp;</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">After some modeling and feature engineering, I ended up with a linear regression showing that two cultural dimensions, long-term orientation and masculinity, contributed to predicting proportion of spend on clothing. Long-term orientation was positively correlated, but masculinity was negatively correlated, meaning that as masculinity went down, the proportion of spend on clothing went up. I should note here that I don&rsquo;t think the word &ldquo;masculinity&rdquo; accurately represents the metric in question, so I&rsquo;ve relabeled it as competition-orientation and its opposite as consensus-orientation for my purposes. For my model, this means that as consensus-orientation went up, so did the proportion of spending on clothing. Here are my descriptions of these cultural dimensions:</span></span><br /><span></span><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-02-01-at-2-25-59-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Though my root mean squared error for this model was low, so was my adjusted R^2 at 0.15. 
In this model, the intercept term was significant, likely indicating that there is a baseline proportion of their spending that households devote to clothing, or, in other words, that there is little variation. That said, these two cultural dimensions did seem to contribute to that variation.&nbsp;</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">Why these two cultural dimensions contributed to proportion of spend on clothing is somewhat unclear. It seems plausible that the focus on quality of life in consensus-oriented cultures could correlate with prioritization of having nice things and, thus, with an increase in spending on clothing. More granular data about the countries in question and more specific spending data could elucidate this connection, as well as a potential link with long-term orientation.</span></span><br /><span></span><br />&#8203;</div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700">Conclusion</span></span><br /></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">I would advise a clothing company looking to expand to a new market to consider the long-term orientation and consensus-orientation of that market as metrics that might affect their decision. Additionally, I would advise a company already in a market high in long-term orientation and/or consensus-orientation to look into the market for clothing and whether they might want to create new lines or expand existing offerings. 
In both cases, I would advise that the company take into account whether the net margins of such a move would be beneficial, depending on predicted spend values, given that those are correlated with GDP and population, which might be more consequential deciding factors.</span></span></div>]]></content:encoded></item><item><title><![CDATA[Using MTA Turnstile Data to Strategize Gala Promotion: Metis Project 1]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/using-mta-turnstile-data-to-strategize-gala-promotion-metis-project-1]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/using-mta-turnstile-data-to-strategize-gala-promotion-metis-project-1#comments]]></comments><pubDate>Mon, 13 Jan 2020 02:02:35 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/using-mta-turnstile-data-to-strategize-gala-promotion-metis-project-1</guid><description><![CDATA[My first week at Metis could certainly be described as a whirlwind. Making use of all the angst and excitement that come with using new skills, I eagerly dove into our first project. We used Python, Pandas, Seaborn, and more to perform an exploratory data analysis, always keeping our framing questions in mind.&#8203;  Framing the Project  &#8203;The premise of the project that we were approached by a (fictitious) client: WomenTechWomenYes (WTWY).&nbsp; Our clients provided us with the following  [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span><span style="color:rgb(0, 0, 0)">My first week at Metis could certainly be described as a whirlwind. Making use of all the angst and excitement that come with using new skills, I eagerly dove into our first project. 
We used Python, Pandas, Seaborn, and more to perform an exploratory data analysis, always keeping our framing questions in mind.<br /><br />&#8203;</span></span><br /></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700">Framing the Project</span></span></h2>  <div class="paragraph">&#8203;<span><span style="color:rgb(0, 0, 0)">The premise of the project was that we were approached by a (fictitious) client, WomenTechWomenYes (WTWY). Our client provided us with the following information:<br />&#8203;</span></span><br /></div>  <div class="paragraph"><span><font size="2"><span style="color:rgb(106, 115, 125)">As we mentioned, we are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work, which is a significant portion of our </span><span style="color:rgb(106, 115, 125); font-weight:700">fundraising efforts</span><span style="color:rgb(106, 115, 125)">.</span></font></span><br /><span><font size="2"><span style="color:rgb(106, 115, 125)">WomenTechWomenYes (WTWY) has an annual gala at the </span><span style="color:rgb(106, 115, 125); font-weight:700">beginning of the summer </span><span style="color:rgb(106, 115, 125)">each year. As we are new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals </span><span style="color:rgb(106, 115, 125); font-weight:700">passionate about increasing the participation of women in technology</span><span style="color:rgb(106, 115, 125)">, and to concurrently </span><span style="color:rgb(106, 115, 125); font-weight:700">build awareness and reach</span><span style="color:rgb(106, 115, 125)">.</span></font></span><br /><span><span style="color:rgb(106, 115, 125)"><font size="2">To this end we place street teams at entrances to subway stations. 
The street teams collect email addresses and those who sign up are sent free tickets to our gala.</font></span></span><br /><span><font size="2"><span style="color:rgb(106, 115, 125)">Where we&rsquo;d like to solicit your engagement is to use MTA subway data, which as I&rsquo;m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can </span><span style="color:rgb(106, 115, 125); font-weight:700">gather the most signatures, ideally from those who will attend the gala and contribute to our cause</span><span style="color:rgb(106, 115, 125)">.<br />&#8203;</span></font></span><br /></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">I&rsquo;ve added some emphasis in bold. These phrases helped us to frame our project and to design a pipeline which would best address our client&rsquo;s needs.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)"><strong>Primary Goal:</strong> To effectively place WTWY&rsquo;s street team in order to&hellip;</span></span><ul><li><font color="#2a2a2a">Maximize attendance</font><font color="#000000"> (&ldquo;build awareness and reach,&rdquo; &ldquo;gather the most signatures&rdquo;)</font><br /></li><li><span><span style="color:rgb(0, 0, 0)">Target attendees who will&hellip;</span></span><br /><ol><li><span><span style="color:rgb(0, 0, 0)">Be interested in the mission of WTWY (&ldquo;will attend the gala,&rdquo; &ldquo;passionate about increasing the participation of women in technology&rdquo;)</span></span><br /></li><li><span><span style="color:rgb(0, 0, 0)">Contribute to the cause (&ldquo;fundraising efforts,&rdquo; &ldquo;contribute to our cause&rdquo;)</span></span></li></ol></li></ul><br /><span><span style="color:rgb(0, 0, 0)">So where should we put the street team?!</span></span><br /><br /><br /></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700">Designing our Approach</span></span><br /></h2>  
<div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-01-12-at-8-32-39-pm.png?1578882832" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">First, we had to make a few assumptions:</span></span><br /><br /><ol><li><span><span style="color:rgb(0, 0, 0)">Prioritizing the highest traffic subway stations would maximize the number of emails collected</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">We pulled data from May 2019, assuming that:</span></span><ol><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">These data would be similar to May 2020 and</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">WTWY will be collecting emails a month before the gala date, which we decided would be in June 2020</span></span></li></ol></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">People in the demographics we choose to target are, in fact, people who are interested in WTWY and its mission.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">There are a whole lot of interesting ways to combine entry and exit data. 
Since we chose to prioritize sheer volume rather than patterns of traffic, we assumed that entry counts were a sufficient data source for our purposes.</span></span></li></ol><br /><span><span style="color:rgb(0, 0, 0)">Given the client&rsquo;s goals and these assumptions, we chose to focus on finding the most trafficked stations using </span><a href="http://web.mta.info/developers/turnstile.html"><span style="color:rgb(17, 85, 204)">MTA turnstile data</span></a><span style="color:rgb(0, 0, 0)"> and to look at demographics using data from </span><a href="http://zipatlas.com/us/ny/new-york.htm"><span style="color:rgb(17, 85, 204)">ZipAtlas</span></a><span style="color:rgb(0, 0, 0)">.<br /><br />&#8203;</span></span></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700">Data Process and Pipeline</span></span></h2>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-34-12-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><br />We followed this process for the datasets we used.</span></span><br /><br /><span><a href="http://web.mta.info/developers/turnstile.html"><span style="color:rgb(17, 85, 204)">MTA turnstile data</span></a><span style="color:rgb(0, 0, 0)"> is collected for every turnstile at every NYC subway station every four hours. The data are available online, making it easy for anyone who is interested to check them out (acquire). After we acquired these data, we needed to clean them; this included getting rid of whitespace in column names, dropping entries where the turnstile counter malfunctioned, and, most importantly, making sure we could actually use the entry tallies (transform). 
It turns out that each turnstile reports a cumulative running tally of entries (since&hellip; the beginning of time?), so we needed to create a new column with daily counts, taking the difference between successive readings, in order to use these data in a meaningful way.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">The demographic data came primarily from </span><a href="http://zipatlas.com/us/ny/new-york.htm"><span style="color:rgb(17, 85, 204)">ZipAtlas</span></a><span style="color:rgb(0, 0, 0)">, which provides the following metrics:</span></span><ul><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Percent of females in the labor force by zip code (likely interested in WTWY&rsquo;s values)</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Percent of people who take public transit by zip code (likely to be using the subway)</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Percent of people in professional and scientific jobs by zip code (likely interested in WTWY&rsquo;s values)</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Percent of households with an income over $100k by zip code (able to donate)</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Population (likely to attend, since they live in NYC)</span></span></li></ul><span><span style="color:rgb(0, 0, 0)">We acquired these data from their site. Cleaning these data wasn&rsquo;t too hard-- just eliminating some symbols (e.g. %) and making sure that all of the columns had the correct data types. However, we soon realized that we had all this demographic info by zip code, but did not have zip codes for the subway stations! 
The MTA helpfully provides </span><a href="http://web.mta.info/developers/data/nyct/subway/Stations.csv"><span style="color:rgb(17, 85, 204)">another file</span></a><span style="color:rgb(0, 0, 0)"> with the coordinates of each station, so we were able to determine their zip codes using a package called </span><a href="https://pypi.org/project/uszipcode/"><span style="color:rgb(17, 85, 204)">uszipcode</span></a><span style="color:rgb(0, 0, 0)">. Then we merged the two dataframes together (transform).</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">So now that we had all the data figured out, what did we learn from it?</span></span><br /><br />&#8203;<br /></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700">Exploration and Analysis</span></span></h2>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700"><font size="3">Maximizing Volume of Potential Attendees</font></span></span></h2>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-01-12-at-7-23-24-pm.png?1578888900" alt="Picture" style="width:419;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">First, we found that the stations with the most daily traffic (on average) were 42nd St- Bryant Park, 14th St- Union Square, Times Sq- 42nd St, 34th St- Penn Station, 42nd St- Port Authority, and Canal Street. Ok, great. 
So are these stations more trafficked on any particular day of the week?</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/volume-day-of-week_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">It looks like these stations get more traffic on weekdays, a trend which held for all of the stations.&nbsp;</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">These stations look pretty much the same for entries in the morning and exits in the evening, so it looks like they are probably used heavily by commuters (likely including commuters from outside of NYC).&nbsp;</span></span><br /><span></span></div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/entries-morning_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph" style="text-align:center;">Entries During the Day</div>   					 				</td>				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img 
src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/entries-evening_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph" style="text-align:center;">Exits in the Evening</div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">&#8203;Okay, great. We can tell WTWY to station their crews at these stations during the week. But at what times would they catch the most people?<br />&#8203;</span></span><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-42-50-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><br />Here are entries for each day of the week (0 = Monday and so on). Note: though these lines look continuous, the data were only gathered every four hours, so they&rsquo;re not. Still, this is a great visual way to represent some of what is going on at these stations. 
We can see that Port Authority and Penn Station seem to spike earlier in the day on weekdays (when people head to work) while Bryant Park and Union Square spike later in the day (perhaps as people are headed home).</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Given our data, here is a time schedule we&rsquo;d recommend:</span></span><br />&#8203;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/published/screen-shot-2020-01-12-at-8-43-37-pm.png?1578883458" alt="Picture" style="width:285;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0); font-weight:700"><br />&#8203;Prioritizing Demographics of Potential Attendees</span></span></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">We pulled data for five metrics from ZipAtlas -- females in the labor force, taking public transit, having professional or scientific jobs, household income over $100k, and population -- but is each of these metrics equally important for our analysis?&nbsp;</span></span><br /><span></span><br /><span><span style="color:rgb(0, 0, 0)">To find this out, we plotted histograms of these data by zip code:</span></span><br /><span></span><br />&#8203;</div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img 
src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-45-44-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-45-49-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">&#8203;As you can see, two of the metrics -- household income over $100k and population -- have far more variance than the others. What does this mean practically? Well, if WTWY prioritizes stations in zip codes with relatively high populations of females in the labor force, the populations frequenting those stations are still pretty similar to those at stations in zip codes with relatively low populations of females in the labor force. In other words, the data are pretty homogeneous. However, there&rsquo;s a bigger difference in demographics between a station in a zip code with relatively high rates of household income over $100k and a station that is relatively low on that metric. Put simply: WTWY gets more bang for its buck, and probably hits its target groups more effectively, by focusing on stations in zip codes with relatively high rates of household income over $100k and higher populations.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">A word about why population is important here: just because a station is highly trafficked (as are the ones mentioned above), this doesn&rsquo;t mean it&rsquo;s trafficked by the kind of people you want to target. 
In this case, the most trafficked stations are in areas that 1) have a lot of tourists and 2) have train lines, so a lot of people are commuting in from Connecticut and New Jersey to go to work and entering the subway (as opposed to exiting) at these stations. These people may be less likely to come into the city for a gala. This is an argument to use population as a metric, hopefully targeting people who actually live in the city and might be more likely to attend a gala.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Here are the stations highest on the metrics of household income over $100k and population:<br />&#8203;</span></span><br /></div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-46-32-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50%; padding:0 15px;"> 					 						  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.ritabiagioli.com/uploads/3/7/0/8/37082485/screen-shot-2020-01-12-at-8-46-41-pm_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><br />&#8203;Based on these data, and 
cross-referencing these stations with entry data (by looking at percentile), as well as making sure that these stations represent neighborhoods that have favorable demographics, the following stations are all-around good bets:</span></span><ul><li><span><span style="color:rgb(0, 0, 0)">Grand St</span></span></li><li><font color="#000000">103 St&nbsp;</font></li><li><font color="#000000">72nd St</font></li><li><font color="#000000">96th St</font></li><li><font color="#000000">2nd Ave</font></li><li><font color="#000000">86th St</font></li></ul></div>  <h2 class="wsite-content-title"><strong>Conclusions &amp; Recommendations</strong></h2>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Given our analyses, we can draw the following conclusions and make the following recommendations to WTWY:<br /></span></span><br /><strong><span><span style="color:rgb(0, 0, 0)">Conclusions:</span></span></strong><br /><span><span style="color:rgb(0, 0, 0)"><br />Volume:</span></span><ol><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">The most trafficked stations seem to be in commuter areas.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Travel on the weekdays is relatively comparable; travel on weekends is relatively low.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">The busiest times are when people are coming to and leaving from work.</span></span></li></ol><br /><span><span style="color:rgb(0, 0, 0)">Demographics:</span></span><ol><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">Stationing street teams at subway stations is probably an effective way to target people with household incomes over $100k, who are more likely to donate, and to target people who actually live in NYC, and, therefore, might be more likely to attend.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="color:rgb(0, 0, 0)">There might be better ways to 
target women and tech workers, whose values would likely align with WTWY&rsquo;s values, than stationing street teams at subway stations, given the relative demographic homogeneity across the city. And hey, WTWY, we&rsquo;re happy to analyze some other data to figure out how to reach them! You know how to find us.&nbsp;</span></span></li></ol><br /><strong><span><span style="color:rgb(0, 0, 0)">Recommendations:</span></span></strong><br /><br /><span><span style="color:rgb(0, 0, 0)">Volume:&nbsp;</span></span><br /><span><span style="color:rgb(0, 0, 0)">If WTWY chooses to prioritize getting to the most people, they should put their crews at the six subway stations listed above at the times recommended.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Demographics:</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">If WTWY chooses to forgo reaching quite as many people and instead to reach people who are likely New Yorkers and who likely have the means to donate, they should prioritize the six stations recommended above.</span></span><br /><br />&#8203;</div>  <div class="paragraph"><font size="1">*Icon attributions: people by Wilson Joseph from the Noun Project- Demographic Data by H Alberto Gongora from the Noun Project</font><br /></div>]]></content:encoded></item><item><title><![CDATA[Coming soon!]]></title><link><![CDATA[http://www.ritabiagioli.com/blog/i-really-dont-like-these-buttons]]></link><comments><![CDATA[http://www.ritabiagioli.com/blog/i-really-dont-like-these-buttons#comments]]></comments><pubDate>Sat, 23 Mar 2019 19:17:50 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.ritabiagioli.com/blog/i-really-dont-like-these-buttons</guid><description><![CDATA[As I get the rest of this site together, I'm coming up with ideas for this blog. Check back soon to see what I write! [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph">As I get the rest of this site together, I'm coming up with ideas for this blog. Check back soon to see what I write!</div>]]></content:encoded></item></channel></rss>