5 Professional Projects Every Data Scientist Should Know

  1. Introduction
  2. Customer Segmentation
  3. Text Classification
  4. Sentiment Analysis
  5. Time Series Forecasting
  6. Recommendation Systems
  7. Summary
  8. References

The goal of this article is to outline projects that a professional Data Scientist will eventually perform, or should perform. I have taken a lot of bootcamps and educational courses in Data Science. While they have all been useful in some way, I find that some forget to highlight real-world applications of Data Science. It is beneficial to know what to expect as you transition from an educational to a professional Data Scientist. Customer segmentation, text classification, sentiment analysis, time series forecasting, and recommender systems can all help the company you work for tremendously. I will take a deep dive and explain why these five projects come to mind, and hopefully motivate you to employ them where you work.

Customer segmentation is a form of Data Science where an unsupervised clustering technique is employed to develop groups, or segments, of a human population or of observations in data. The goal is to create groups that are well separated from one another, while the members within each group have closely related features. The technical terms for this separation and togetherness are:

Between-groups sum of squares (BGSS)

  • how different the unique groups are from one another

Within-group sum of squares (WGSS)

  • how closely related the unique group features are

K-means clustering. Image by Author [2].

As you can see in the image above, these groups are well separated (high BGSS) and tightly centered (low WGSS). This example is ideal. Think of each cluster as a group that you will target with a specific marketing advertisement: ‘we want to appeal to recent college graduates by marketing our company product as young-professional centered’. Some useful clustering algorithms are:

  • DBSCAN
  • K-means
  • Agglomerative Hierarchical Clustering
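To make BGSS and WGSS concrete, here is a minimal k-means sketch that uses only the Python standard library. The toy 2-D points, the farthest-first starting centroids, and the fixed iteration count are my own assumptions for illustration; in practice you would more likely reach for a clustering library.

```python
# Minimal k-means sketch (standard library only).
# Assumptions: toy 2-D data, farthest-first initialization, fixed iterations.

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def farthest_first_centroids(points, k):
    """Pick k starting centroids that are spread far apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Choose the point whose nearest chosen centroid is farthest away.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, iterations=10):
    """Return (labels, centroids) after alternating assign/update steps."""
    centroids = farthest_first_centroids(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


def wgss(points, labels, centroids):
    """Within-group sum of squares: how tight each group is."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def bgss(points, labels, centroids):
    """Between-groups sum of squares: how far groups sit from the overall mean."""
    overall = mean_point(points)
    return sum(
        labels.count(c) * squared_distance(centroid, overall)
        for c, centroid in enumerate(centroids)
    )


# Hypothetical customers described by two scaled features.
customers = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
labels, centroids = k_means(customers, k=3)
within_ss = wgss(customers, labels, centroids)
between_ss = bgss(customers, labels, centroids)
print(labels, within_ss, between_ss)
```

For well-separated segments like these, BGSS should be much larger than WGSS, and the two add up to the total sum of squares around the overall mean.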

What happens with customer segmentation results?

— finding insights about specific groups

— marketing towards specific groups

— defining groups in the first place

— tracking metrics about certain groups

This type of Data Science project is broadly used, but most useful in the marketing industry.

Feature and target of text classification example. Code by Author [3].

Text classification is under the umbrella of Natural Language Processing (NLP), which covers techniques for ingesting and modeling text data. You can think of this type of project as a way to predict a category label for each observation by using text features (and, optionally, numeric features as well).

Here [4] is a simple example of utilizing both text and numeric features for text classification. Instead of having one word for your text feature, you could have hundreds, and you will then need NLP techniques like part-of-speech tagging, stop word removal, tf-idf, count vectorizing, etc. A common library Data Scientists use in Python is nltk. The goal of these techniques is to clean your text data and build a representation of it that keeps the signal and eliminates noise.
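As a rough illustration of three of the techniques named above (stop word removal, count vectorizing, and tf-idf), here is a small standard-library sketch. The stop-word list and the documents are hypothetical, and the tf-idf formula is just one common variant (term frequency times log of inverse document frequency); libraries differ in the exact weighting they apply.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; NLP libraries ship much fuller ones.
STOP_WORDS = {"the", "is", "and", "was", "a", "of"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_vectorize(documents):
    """Return one Counter of token counts per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def tf_idf(documents):
    """Return one {token: weight} dict per document."""
    counts = count_vectorize(documents)
    n_docs = len(counts)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc_counts in counts for token in doc_counts)
    weights = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        weights.append({
            token: (n / total) * math.log(n_docs / doc_freq[token])
            for token, n in doc_counts.items()
        })
    return weights


documents = [
    "The invoice is overdue and the payment is late",
    "The payment was received",
    "The library archive is open",
]
weights = tf_idf(documents)
print(weights[0])
```

In the first document, "invoice" receives a higher weight than "payment" because "payment" also appears in another document and is therefore less distinctive.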

What happens with text classification results?

— automatic categorization of observations

— scores associated with each category suggested

You can also categorize text documents that would otherwise take hours upon hours to manually read.
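To show what "a score per suggested category" can look like, here is a minimal multinomial Naive Bayes sketch with add-one smoothing, written with the standard library only. Naive Bayes is my choice for illustration, not necessarily what the notebook in [3] uses, and the training sentences and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def category_scores(text, model):
    """Return {label: probability} for a new piece of text."""
    label_counts, word_counts, vocabulary = model
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in text.lower().split() if t in vocabulary]
    n_examples = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        total_words = sum(word_counts[label].values())
        log_p = math.log(n / n_examples)  # prior
        for t in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            log_p += math.log((word_counts[label][t] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = log_p
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {label: v / norm for label, v in unnormalized.items()}


# Hypothetical labeled documents.
training = [
    ("invoice payment overdue", "finance"),
    ("quarterly payment received invoice", "finance"),
    ("archive manuscript catalog", "library"),
    ("rare manuscript archive shelf", "library"),
]
model = train_naive_bayes(training)
scores = category_scores("overdue invoice", model)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

The predicted category is the automatic categorization, and the probabilities are the scores you can surface next to each suggestion.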

This type of project is useful in finance, or in archival work such as that done by historians and librarians.

Sentiment analysis is also under the umbrella of NLP. It is a way to assign sentiment scores to text, or more specifically, polarity (how positive or negative the text is) and subjectivity (how opinionated it is). It is beneficial to use sentiment analysis when you have plenty of text data and want to digest it into levels of good or bad sentiment. If you already have a rating system in place at your company, it may seem redundant, but oftentimes people leave reviews whose text does not match their numerical score. Another benefit of sentiment analysis is that you can flag certain keywords or phrases that you want to highlight in order to make your product better. Aligning keywords with their sentiment lets you aggregate metrics and visualize what your product is lacking and where improvements could be made.
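Here is a small lexicon-based sketch of the idea, including the check for reviews whose text disagrees with the star rating. The word scores, the reviews, and the mismatch thresholds are all hypothetical, and the subjectivity measure (share of tokens that are opinion words) is only a crude stand-in for what a real sentiment library computes.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1].
LEXICON = {
    "great": 0.8, "love": 0.9, "good": 0.6,
    "terrible": -0.9, "broke": -0.6, "slow": -0.4, "bad": -0.7,
}


def sentiment(text):
    """Return (polarity, subjectivity) for a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity = sum(hits) / len(hits)        # average score of opinion words
    subjectivity = len(hits) / len(tokens)  # share of tokens that are opinion words
    return polarity, subjectivity


def rating_mismatch(text, stars):
    """Flag reviews whose text sentiment contradicts the numeric rating."""
    polarity, _ = sentiment(text)
    return (stars >= 4 and polarity < 0) or (stars <= 2 and polarity > 0)


# Hypothetical (review text, star rating) pairs.
reviews = [
    ("Great product, love it", 5),
    ("Terrible quality, it broke in a week", 5),
    ("Good value but slow shipping", 3),
]
flagged = [text for text, stars in reviews if rating_mismatch(text, stars)]
print(flagged)
```

The second review is flagged because its text is clearly negative even though it carries five stars, which is the kind of case a rating system alone would miss.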

What happens with sentiment analysis results?

— product improvements

— sentiment flagging for customer service

This type of project is useful in plenty of industries, especially e-commerce, entertainment, or anywhere that includes text reviews.

Photo by Sonja Langford on Unsplash [5].

Time series forecasting can be applied to several parts of various industry sectors. Most of the time, it is ultimately used to allocate funds or resources for the future. If you have a sales team, they would benefit from your forecast, as would investors, who can see where your company is going (hopefully toward increasing sales). More directly, if you have a forecasted target for a given day, you can decide how many employees to assign to it and where to place them. A popular example would be Amazon, or any similar company where consumers have frequent purchasing behavior and where factories, drivers, and different locations have to be allocated and coordinated.

What happens with time series forecasting results?

— allocation of resources

— awareness of future sales

Some popular approaches to time series forecasting are ARIMA and LSTM models.
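A full ARIMA or LSTM model is beyond a short snippet, so the sketch below fits only the simplest autoregressive piece, an AR(1) model (equivalent to ARIMA(1,0,0) with a constant), by ordinary least squares. The sales series is synthetic and noiseless, generated by me for illustration, so the fit recovers the generating coefficients almost exactly; real data will not be this clean.

```python
def fit_ar1(series):
    """Fit y[t] = c + phi * y[t-1] by ordinary least squares."""
    x = series[:-1]  # previous values
    y = series[1:]   # next values
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    phi = covariance / variance
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, steps):
    """Roll the fitted AR(1) model forward for the given number of steps."""
    c, phi = fit_ar1(series)
    predictions = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        predictions.append(last)
    return predictions


# Synthetic daily sales that follow y[t] = 20 + 0.9 * y[t-1].
sales = [100.0]
for _ in range(11):
    sales.append(20 + 0.9 * sales[-1])

c, phi = fit_ar1(sales)
next_week = forecast(sales, steps=7)
print(c, phi, next_week)
```

The seven forecasted values are the kind of numbers you would hand to the team deciding how many people or how much stock to allocate for the coming week.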

This type of project is useful in plenty of industries as well, but usually in sales or supply management.

Photo by Simon Bak on Unsplash [6].

While you may or may not be designing Netflix’s next recommendation system algorithm, you may find yourself applying similar techniques to several parts of your business. Think of using this type of project to ultimately sell more products to users. As a consumer, if you are buying certain products or groceries and you see some recommended ones at checkout, you may be inclined to quickly buy one of those recommendations. Expand this result to every user and you can make your company millions.

Here are some common ways to approach recommendation systems in Data Science.

Collaborative filtering (e.g., alternating least squares for matrix factorization)

  • finds people who are similar to you and recommends what they like to you

Content-based filtering (e.g., cosine similarity)

  • uses attributes or features of a product you already bought to recommend a similar product in the future
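Both approaches can be sketched in a few lines of standard-library Python. The first half recommends the catalog item whose feature vector has the highest cosine similarity to something already bought. The second half is a deliberately tiny alternating least squares loop that uses a single latent factor per user and item (rank 1), which keeps each update a one-line formula; real systems use more factors and a dedicated library. The catalog, feature values, ratings, and the regularization value are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(item, catalog):
    """Content-based: the other catalog item closest to `item`."""
    return max(
        (other for other in catalog if other != item),
        key=lambda other: cosine_similarity(catalog[item], catalog[other]),
    )


def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Collaborative filtering: rank-1 ALS on observed (user, item) ratings."""
    user_factor = [1.0] * n_users
    item_factor = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors, solve each user's least-squares problem.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_factor[u] = sum(r * item_factor[i] for i, r in seen) / (
                reg + sum(item_factor[i] ** 2 for i, _ in seen)
            )
        # Fix user factors, solve each item's least-squares problem.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_factor[i] = sum(r * user_factor[u] for u, r in seen) / (
                reg + sum(user_factor[u] ** 2 for u, _ in seen)
            )
    return user_factor, item_factor


# Hypothetical item features: [sport, footwear, electronics].
catalog = {
    "running shoes": [1.0, 1.0, 0.0],
    "hiking boots": [0.8, 1.0, 0.0],
    "fitness tracker": [1.0, 0.0, 1.0],
    "headphones": [0.0, 0.0, 1.0],
}
suggestion = most_similar("running shoes", catalog)

# Hypothetical ratings keyed by (user, item); user 1 has not rated item 2.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5, (1, 0): 2.0, (1, 1): 4.0}
user_factor, item_factor = als_rank1(ratings, n_users=2, n_items=3)
predicted_rating = user_factor[1] * item_factor[2]
print(suggestion, predicted_rating)
```

User 1 rates everything about twice as high as user 0, so the model predicts a rating close to 5 for the item user 1 has not seen yet, which is the signal you would use to decide whether to recommend it.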

This type of project is useful in plenty of industries as well, but usually in e-commerce and entertainment.

I hope I gave you some inspiration by highlighting these key projects, which you may already use often or will use as a professional Data Scientist. In education, the focus of Machine Learning is sometimes on obtaining the best accuracy, but the focus of Data Science in the professional sense is to help your company improve its product, help people, and save or make more money.

To summarize, here are five popular professional projects to practice:

  • customer segmentation
  • text classification
  • sentiment analysis
  • time series forecasting
  • recommender systems

I hope you enjoyed my article. Thank you for reading! Please feel free to comment down below and suggest other professional Data Science projects you have encountered so that we can all improve our professional Data Science portfolios.

[1] Photo by freestocks on Unsplash, (2018)

[2] M. Przybyla, k-means visualization, (2020)

[3] M. Przybyla, nlp-example.ipynb, (2020)

[4] M. Przybyla, nlp-example, (2020)

[5] Photo by Sonja Langford on Unsplash, (2014)

[6] Photo by Simon Bak on Unsplash, (2020)