GETTING STARTED

A Step-by-Step Guide to Setting Up a Cron Job

Photo by Possessed Photography on Unsplash

Introduction

Have you ever found yourself doing repetitive tasks on a regular basis? For example, you might delete temporary files every week to free up disk space, scrape a site every week for new information, or send recurring “reminder” emails to the same set of people. If so, you might want to set up a cron job, which will run those tasks for you automatically on a schedule you define.

Cron comes from “chron,” the Greek prefix for “time.” …


GETTING STARTED

With No Errors

Photo by Noah Boyer on Unsplash

If you are familiar with the job-hunting process, you have probably noticed that some companies use take-home assignments to determine whether a candidate is the right fit. Since most companies use SQL, they commonly want to see whether you can solve problems with it. However, not every company will provide a dataset to work with. A company might provide only a table schema, leaving you wondering whether your queries will actually run. Importing a dataset into a database can therefore be very helpful.

In this article, I will cover how to install MySQL Workbench and import data into MySQL Workbench step by step. …


Comparison of Supervised and Unsupervised Fraud Detection

Photo by Erik Mclean on Unsplash

Introduction

Since Yelp’s early days, reviews have been one of the most important factors customers rely on to determine the quality and authenticity of a business. A local consumer review survey published last year shows that 90% of consumers used the internet to find a local business in the previous year, and 89% of 35–54-year-olds trust online reviews as much as personal recommendations. Although Yelp’s listings often have hundreds or thousands of reviews, many of those reviews can’t be trusted.

“Fake reviews can be devastating to a brand. Simply put, once shoppers suspect a company of having fake reviews, trust is in question. In an era of misinformation and fake news, brand integrity is essential to building consumer trust, which directly translates to profit.” …


GETTING STARTED

A gentle introduction to imputation of missing values

Photo by Markus Winkler on Unsplash

The biggest challenge for data scientists is probably something that sounds mundane but is very important for any analysis: cleaning dirty data. When you think of dirty data, you are probably thinking about inaccurate or malformed data. But the truth is, missing data is actually the most common form of dirty data. Imagine trying to do a customer segmentation analysis when 50% of the records have no address. It would be hard or impossible to do your analysis, since the results would be biased toward showing no customers in certain areas.

Explore Missing Data

  • How much data is missing? You can run a simple exploratory analysis to look at the frequency of your missing data. If it’s a small percentage, let’s say 5% or less, and the data is missing completely at random, you could consider ignoring and deleting those cases. But keep in mind that it’s always better to analyze all data if possible, and dropping data can introduce bias. Therefore, it’s always better to check the distribution to see where the missing data is coming from. …
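
As a rough illustration of that first exploratory step, the sketch below counts how often each field is missing. It uses only the standard library; the `missing_fraction` helper, the `customers` records, and the choice to treat `None` and empty strings as missing are all assumptions made for this example, not part of the original article.

```python
from collections import Counter

def missing_fraction(rows, columns):
    """Return the share of rows in which each column is missing.

    A value counts as missing if it is None or an empty string
    (an assumption for this sketch; real data may need other rules).
    """
    total = len(rows)
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                missing[col] += 1
    return {col: (missing[col] / total if total else 0.0) for col in columns}

# Hypothetical customer records; the field names are made up for illustration.
customers = [
    {"id": 1, "address": "12 Main St", "age": 34},
    {"id": 2, "address": None, "age": 51},
    {"id": 3, "address": "", "age": None},
    {"id": 4, "address": "9 Oak Ave", "age": 27},
]

print(missing_fraction(customers, ["id", "address", "age"]))
```

A column whose missing share is well above the 5% rule of thumb mentioned above is a signal to look at where the gaps come from before dropping anything.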

PYTHON FOR PROBABILITY AND STATISTICS

Four Types of Sampling Methods all Data Scientists Must Know

Photo by Churrasqueira Martins on Kindpng

Why do we need Sampling?

Sampling is used when we want to draw a conclusion about a population without observing all of it. Population refers to the complete collection of observations we want to study, and a sample is a subset of the target population. Here’s an example. A Gallup poll¹, conducted from July 15 to 31 last year, found that 42% of Americans approve of the way Donald Trump is handling his job as president. The results were based on telephone interviews with a random sample of ~4500 calls (assuming one adult per call, ~4500 adults) to people aged 18 and older living in the U.S. The poll was conducted during a period of controversy over Trump’s social media comments. For this survey, the population is ALL the U.S …
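
To make the population-versus-sample idea concrete, here is a minimal sketch of simple random sampling with the standard library. The synthetic population of 100,000 people, the 42% approval share built into it, and the `estimate_proportion` helper are all assumptions chosen to echo the poll figures above; none of this is real survey data.

```python
import random

def estimate_proportion(population, sample_size, seed=0):
    """Estimate the share of 1s in a population from a simple random sample.

    random.Random(seed) keeps the example reproducible; sample() draws
    without replacement, so no individual is picked twice.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

# Synthetic population: 1 = approves, 0 = does not approve (made-up data).
population = [1] * 42_000 + [0] * 58_000

estimate = estimate_proportion(population, 4_500)
print(f"sample estimate: {estimate:.3f}")
```

With a sample of this size the estimate should land close to the true population share, which is why a few thousand interviews can stand in for a much larger population.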


What I’ve Learned as a First Time Webinar Speaker

Photo by Nycholas Benaia on Unsplash

I recently spoke at a webinar about the Mentorship Effect hosted by Correlation One, a data and analytics training program sponsored by leading employers. Although this was not my first time speaking in front of a crowd, I had never been invited to be a panelist before. The idea of being one of the speakers as a new grad sounded totally intimidating but extremely exciting. I have been attending many Data Science Meetup events, where the speakers share their success stories. Who would have thought that one day I would be invited to share my experience as well! …


Probability and Statistics

The Most Common Discrete Probability Distributions Explained with Examples

Image by Author

Probability Distributions

A probability distribution is a mathematical function that describes the likelihood of obtaining the possible values for an event. A probability distribution may be either discrete or continuous. A discrete distribution is one in which the data can only take on certain values, while a continuous distribution is one in which data can take on any value within a specified range (which may be infinite).

There are a variety of discrete probability distributions. The usage of discrete probability distributions depends on the properties of your data. For example, use the:

  • Binomial distribution to calculate probabilities for a process where only one of two possible outcomes may occur on each trial, such as coin tosses. …
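
For the coin-toss case, the binomial probability of exactly k successes in n independent trials is C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates that formula with the standard library; the function name `binomial_pmf` and the example of 3 heads in 10 fair tosses are chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 tosses of a fair coin: 120 / 1024.
print(binomial_pmf(3, 10, 0.5))  # 0.1171875

# The probabilities over all possible k add up to 1 (up to rounding).
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))
```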

Consulting Case Study 101

The Ultimate Guide to Case Study Interview Preparation

Photo by Clem Onojeghuo on Unsplash

As data analysts or data scientists, we need to know not only probability and statistics, machine learning algorithms, and coding, but most importantly how to use these techniques to solve business problems. Most of the time, you will be given a 30–45 min interview with a single data scientist or a hiring manager in which you’ll answer a multifaceted business problem that’s likely related to the organization’s daily work.

When I first started to prepare for the case study interview, I didn’t know there were different types of case studies. The fastest way to become an expert in case studies is to know the frameworks for solving each kind. A case study interview can help the interviewers evaluate whether a candidate would be a good fit for the position. Sometimes, they might even ask you a question that they actually encountered. …


Oh NLP with Python

How I built a product recommendation system using Python

Photo by Campaign Creators on Unsplash

This blog is a continuation of my previous work¹, in which I talked about how I gathered product reviews and information through web scraping. I will now explain more about how I built the product recommendation system.

The goals of this project were to:

  • Gather product information and review data from BackCountry.com through web scraping using Selenium and BeautifulSoup (Part I)
  • Perform an exploratory data analysis using the ScoreFast™ platform
  • Convert text data into vectors
  • Build a KNN predictive model to find the most similar products
  • Run a sentiment analysis on product reviews
  • Use each review’s sentiment score to predict its review’s…
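
The "convert text into vectors, then find the nearest neighbors" steps in the list above can be sketched in a few lines. This is a toy illustration only, not the pipeline used in the project: it assumes bag-of-words counts as the vectors and cosine similarity as the distance, and the `catalog` entries and helper names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words count vector keyed by lowercase token."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, catalog, k=2):
    """Return the k products whose descriptions are closest to the query product."""
    query_vec = vectorize(catalog[query])
    scores = [
        (name, cosine_similarity(query_vec, vectorize(text)))
        for name, text in catalog.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical product descriptions, made up for this example.
catalog = {
    "trail jacket": "waterproof hiking jacket with hood",
    "rain shell": "lightweight waterproof jacket for rain",
    "camp stove": "compact gas stove for camping",
}

print(most_similar("trail jacket", catalog, k=1))
```

Here the rain shell ranks first because it shares two of five tokens with the trail jacket, while the camp stove shares none. A real system would likely use a weighted representation such as TF-IDF and a library KNN implementation rather than this brute-force loop.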

Web Scraping With Python

How I extract data using web scraping with Python during my Data Science Internship at ScoreData

Image by noshad ahmed from Pixabay

Today, if we think of the most successful and widespread applications of machine learning in business, recommender systems could be one of the first examples people have in mind. Each time you purchase something online, you might see the “products you might also like” section. Recommender systems help users discover items they might like but have not yet found, which could help companies maximize revenue from upselling and cross-selling. As a Data Science Intern at ScoreData, I wanted to take the opportunity to try to build a recommendation model and analyze data on ScoreData’s ML platform (ScoreFast™). Since we didn’t have customers’ purchase history from any e-commerce website, I decided to build a content-based recommendation system using product descriptions and reviews. …

About

👩🏻‍💻 Kessie Zhang

I’m passionate about the possibilities that Data Science can enable. I write about what I’ve learned. Never stop learning because life never stops teaching.❤️
