
All the Datasets You Need to Practice Data Science Skills and Make a Great Portfolio | by Rashida Nasrin Sucky | Aug, 2020

Photo by Cullen Smith on Unsplash

Some Interesting Datasets to Upgrade Your Skills and Portfolio

Rashida Nasrin Sucky

The only way to learn data science, data analysis, machine learning, or artificial intelligence is by practicing and doing projects. There is no alternative. But most of the time, when I did a project for my portfolio or practiced a new concept, I had to spend a good amount of time finding a suitable dataset. I decided to write this article to share some of the datasets I found very useful and interesting. That way, you will at least have some datasets on hand to practice with.

Census Dataset

If you want a taste of exploring a big dataset, work with this one. It is very large.

This one is great for Exploratory Data Analysis, Statistical Analysis & Modeling, and Data Visualization practice.

Download this dataset from here.

Airbnb Dataset

I received this dataset as a part of an interview a while ago.

I was asked to do an Exploratory Data Analysis and develop a Machine Learning Model using this dataset.

This dataset has a lot of text data and numerical data. You can use this dataset to practice a lot of different types of projects.

You will see several datasets in this link. But I was asked to download the listings.csv file for my interview.

Cars Dataset

This is a reasonably sized dataset that can be used to practice Regression Models and Exploratory Data Analysis.

This dataset contains these columns: YEAR, Make, Model, Size, (kW), Unnamed: 5, TYPE, CITY (kWh/100 km), HWY (kWh/100 km), COMB (kWh/100 km), CITY (Le/100 km), HWY (Le/100 km), COMB (Le/100 km), (g/km), RATING, (km), TIME (h).

Here is the link for this dataset

Heart Disease Dataset

I found this dataset on Kaggle. Since then, I have used it in many different articles to demonstrate a concept.

These are two examples:

You will find some Exploratory Data Analysis examples and details about the dataset as well. Check out this dataset. I am sure you will use it a lot.

Download this dataset from this link.

NHANES Dataset

An amazing dataset for learners. The column names may look cryptic at first.

But once you get used to them, you can use this one dataset to practice Data Analysis, Visualization, Statistical Modeling, and Machine Learning models (both classification and regression).

Download it from here

People Wiki Dataset

It contains Wikipedia profiles of some famous people.

The dataset contains three columns: URI, name (name of the person), and text (it includes the Wikipedia profile).

A simple but very useful dataset for Natural Language Processing

Please check out this article to see an example of what you can do with this dataset:

Here is the link to this dataset

Amazon Product Review Dataset

This dataset contains millions of reviews of Amazon products.

It has three columns: product name, review, and rating. This is close to real-world data, which makes it very good for Natural Language Processing.

I have a sentiment analysis project and an article where I used this dataset. Please check it out here:

Download this dataset from this link.

Movie Dataset

This is another dataset that is good for Machine Learning and Natural Language Processing.

This one contains the following columns: index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.

I used this dataset for this project:

Here is the link

Housing Price dataset

This is one of the most common datasets for developing Regression Models. You can certainly use it for other purposes as well.

It is mostly used to predict housing prices from the information in the other columns.

This dataset contains these columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zip code, lat, long, sqft_living15, sqft_lot15.
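As a rough illustration of the regression use case, the sketch below fits a straight line of price against sqft_living with NumPy. The five rows are made up so the example runs without the file, and fit_simple_regression is a hypothetical helper name. With the real data you would load the CSV first and should expect a much noisier fit.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real file.
houses = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "price": [200000, 300000, 400000, 500000, 600000],
})


def fit_simple_regression(df, feature, target):
    """Fit target = slope * feature + intercept by ordinary least squares."""
    x = df[feature].to_numpy(dtype=float)
    y = df[target].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


slope, intercept = fit_simple_regression(houses, "sqft_living", "price")

# Predict the price of a hypothetical 1800 sqft house from the fitted line.
predicted_price = slope * 1800 + intercept
print(f"price ~ {slope:.1f} * sqft_living + {intercept:.0f}")
print(f"predicted price for 1800 sqft: {predicted_price:.0f}")
```

A one-feature line is only a starting point. Columns such as bedrooms, grade, or yr_built can be added as extra features once the simple version works.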

Here is the link.

Mushrooms Dataset

I found this dataset in the Applied Data Science With Python Specialization on Coursera.

I used it for Classification problems. It can be used for other purposes as well.

It contains these columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat.

Here is the link to this dataset

Olympic Dataset

This dataset has information on the Olympic results. Each row contains the data of a country.

This dataset will give you a taste of data cleaning to start with.

I learned Python libraries such as NumPy and pandas using this dataset.

Download this dataset from here

Titanic Dataset

Another very popular dataset. I have used it a lot myself, and I have seen many experienced people use it to present a concept.

This dataset contains these columns: PassengerId, Survived, P-class, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.

This dataset is good for Exploratory Data Analysis, Machine Learning Models (especially Classification Models), Statistical Analysis, and Data Visualization practice.
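A first exploratory pass on this kind of data might look like the sketch below. It assumes pandas and uses a few made-up rows with the Sex, Age, and Survived columns so it runs without the file. The values are invented and say nothing about the real passengers.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the list above.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 54.0, 27.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Fill the missing age with the median of the known ages.
median_age = passengers["Age"].median()
passengers["Age"] = passengers["Age"].fillna(median_age)

# Share of survivors within each group (Survived is coded 0/1, so the mean is a rate).
survival_by_sex = passengers.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

The same groupby pattern works for any categorical column, for example Embarked, and is a quick way to spot features worth feeding to a classifier.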

Here is the link to this dataset

Iris Dataset

Another widely used dataset in data science courses.

This one is especially good for learning Classification Models.

It contains these columns: SepalLength, SepalWidth, PetalLength, PetalWidth, Name

Here is the link.

Fraud Dataset

I found this dataset in the Applied Data Science With Python Specialization on Coursera.

We used it for Classification Models.

A credit card fraud detection project looks good in a portfolio.

Download this dataset here.

Canada Immigration Dataset

This dataset provides information about how many immigrants came from which country by year.

A great dataset to practice Exploratory Data Analysis and Data Visualization

I used this dataset in this article:

Here is the link

Facebook Stock Data

It provides Facebook stock performance per day.

The columns in this dataset are Date, Open, High, Low, Close, Adj Close, Volume.

This one can be very useful for Time Series Analysis and Visualization, or other time-series-related problems.
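Two common first steps with daily price data are a rolling average and a daily return. The sketch below shows both with pandas on the Date and Close columns. The prices are placeholder numbers chosen for easy arithmetic, not real Facebook quotes, and the new column names rolling_mean_3 and daily_return are my own.

```python
import pandas as pd

# Placeholder prices for illustration; not real market data.
stock = pd.DataFrame({
    "Date": ["2020-08-03", "2020-08-04", "2020-08-05", "2020-08-06", "2020-08-07"],
    "Close": [10.0, 12.0, 14.0, 16.0, 18.0],
})

# Parse the dates and use them as a sorted index, as most time series tools expect.
stock["Date"] = pd.to_datetime(stock["Date"])
stock = stock.set_index("Date").sort_index()

# 3-day rolling mean; the first two rows are NaN because the window is not full yet.
stock["rolling_mean_3"] = stock["Close"].rolling(window=3).mean()

# Day-over-day percentage change of the closing price.
stock["daily_return"] = stock["Close"].pct_change()

print(stock)
```

With the real file, a longer window such as 20 or 50 days is more typical, and plotting Close next to the rolling mean makes the trend easier to see.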

I used this dataset in this article:

Here is the link

Digits dataset

This dataset contains the pixel values for digits.

This is a commonly used dataset for Multiclass Classification problems.

I got this dataset from Professor Andrew Ng’s Machine Learning course on Coursera.

Download this dataset from this link.

BBC Text Dataset

Another wonderful dataset for Natural Language Processing.

This dataset contains information on different types of news from BBC archives. It’s a big text dataset.

It is popular for Multiclass Classification problems.

The dataset is big but it has only two columns: text and category.
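Before training a classifier on a two-column text dataset like this, it helps to check how many rows each category has and which words dominate a category. The sketch below does that with pandas and the standard library. The four rows and their category labels are invented for illustration, and top_words is a hypothetical helper.

```python
from collections import Counter

import pandas as pd

# Invented rows for illustration; the real file is much larger.
news = pd.DataFrame({
    "category": ["sport", "tech", "sport", "business"],
    "text": [
        "the team won the final match",
        "new phone released with faster chip",
        "coach praises team after match",
        "shares rise as profits grow",
    ],
})

# Rows per category; a strong imbalance here would affect how a model is evaluated.
category_counts = news["category"].value_counts()


def top_words(texts, n=3):
    """Return the n most frequent whitespace-separated words across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)


sport_top = top_words(news.loc[news["category"] == "sport", "text"])
print(category_counts)
print(sport_top)
```

Splitting on whitespace is the crudest possible tokenizer. For a real project you would usually remove stop words and switch to a proper vectorizer before fitting a model.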

Here is the link for this dataset

Cats vs Dogs

Very commonly used to practice Image Classification.

This dataset contains images of cats and dogs.

It is good for computer vision problems.

Here is the link

Malignant vs Benign

Another useful dataset for Computer Vision Problems

This dataset also contains images of two classes, in this case malignant and benign skin lesions.

Good for Image Classification problems

Download this dataset from here

Natural Images Dataset

This dataset contains images of airplanes, cars, cats, dogs, flowers, fruit, motorbike, and person.

You can use it to get more practice with Multiclass Classification.

Here is the link to the dataset

Conclusion

These are all the datasets I wanted to share today. You should find enough datasets, and some project ideas as well, on this page to practice the necessary skills and build a portfolio.
