5 Concrete Real-World Projects to Build Up Your Data Science Portfolio

Uniqueness is key, not fanciness

Isabelle Flückiger
Photo by Clark Tibbs on Unsplash

Do you want to enter the data science world? Congratulations! That’s (still) the right choice.

The market is currently getting tougher, so be mentally prepared for a long hiring journey and many rejections. I assume you have already read that a data science portfolio is crucial and how to build one. Most of the time, you will be doing data crunching and wrangling, not applying fancy models.

One question I am asked over and over is about concrete data sources and project opportunities for building such a portfolio.

Here are five ideas for your data science portfolio, along with a few hints on developing uniqueness.

1. Customer analytics for a local non-profit organization

An essential task for a non-profit organization is to find the right person, in the right place, at the right moment, approached through the right medium, to donate to its charitable activities. When this is optimized, the organization can collect more funds and run more activities.

What makes that project interesting?

First, most non-profit organizations have a lot of data, not necessarily in digitized form and often not of good quality. The main task is building a database, crunching the data, and getting it into a usable form. You learn to structure the whole data mess, which is still up to 80% of a data science job.

Second, you do something good for the local community and show your social responsibility. You interact with people who are not data experts. Both demonstrate soft skills needed for a data science position.

Besides my professional job, I did such projects voluntarily for an organization that helps children in poverty and for an organization that provides home care for the elderly. Having these experiences builds trust in you as a person and opens the door to many other exciting projects.

Finally, non-profit organizations work much like private banking or wealth management: they also have to acquire the right customer, at the right moment, with the right campaign to bring in money. And I can tell you, the data there is no better in quality than a non-profit's. You can directly leverage your experience in other industries.

How to start?

I found the non-profit organizations through my network. There is always somebody among your family, relatives, and friends who is engaged with a non-profit organization. I then arranged a first get-to-know meeting and explained what my skills are and what the value of such analyses is. I gave them examples from Google and Facebook, and I searched for publicly available information about the increase in leads at other non-profit organizations to give them a flavor. I gave them a few days to think about it, and in each case they came back and agreed to do the project. Then I started the whole data crunching work.

When the data is ready to use, you can work through the classical descriptive, predictive, and prescriptive analytics cycle.
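To give a flavor of what that cycle can look like in code, here is a minimal Python sketch on a hypothetical, already-cleaned donor table. The file name and column names (age, channel, last_gift, donated) are made up for illustration; a real non-profit database will look different.

```python
# A minimal sketch of the descriptive, predictive, and prescriptive steps
# on a hypothetical donor table. Column names are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

donors = pd.read_csv("donors.csv")  # hypothetical cleaned export of the donor database

# Descriptive: who gives, through which channel, and how much?
print(donors.groupby("channel")["last_gift"].agg(["count", "mean", "sum"]))

# Predictive: a simple model for the probability of donating to the next campaign.
X = pd.get_dummies(donors[["age", "channel", "last_gift"]], drop_first=True)
y = donors["donated"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Prescriptive: rank people by predicted probability to target the next mailing.
donors["score"] = model.predict_proba(X)[:, 1]
print(donors.sort_values("score", ascending=False).head(10))
```

The point is not the model but the loop: describe what the data says, predict who is likely to give, and turn that into a concrete targeting decision.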

2. CERN

CERN is mainly known for its leading fundamental research in particle physics and for operating the largest particle physics laboratory in the world.

It is less well known that CERN makes most of the data, code, algorithms, and tools it has developed and uses for its research available to the public. It offers sophisticated toolboxes for testing algorithms, provides 1-, 2-, 3- and 4-dimensional images, and much more.

CERN does not call all of that “innovation.” No, these are just “tools” to perform its “real” innovation task: pushing new frontiers in particle physics.

I can only highly recommend investing some time browsing through their web pages and exploring all the data and tools available for data analytics. It is one of their core activities, carried out at a very sophisticated level. I still learn a lot from it today and get many new ideas.

The website is deeply nested, so please do not lose your enthusiasm the first time you browse it!

On the CERN Open Data Portal, you can find two petabytes of particle physics data to start your own analyses.

What makes that project interesting?

When you start a project as a data scientist, you typically only know that there is some data somewhere. First, you have to explore what data is available, where it can be found, whether it contains redundancies, who has knowledge of and access to it, and so on.

When starting with CERN data, the task is the same if you are unfamiliar with all the particle physics experiments. Luckily, I always had ex-CERN scientists in my data science teams, which made it a lot easier to understand.

Second, having “CERN” on your resume is always an advantage, provided that some serious work has been done. Through the physics classes, publications, webinars, and discussions, you can become part of the community. CERN employs about 2,500 people on-site and has approximately 17,500 contributing scientists globally. Many startup founders have a CERN community background.

Last, you work with sparse data, meaning the vital information in the data is rare. Out of thousands or millions of data points, you are looking for only a few patterns to find and identify. Finding such sparse signals is essential in many fields: predictive maintenance, finding the billionaire ready to invest in your fund, or precision medicine.

How to start?

Start by getting familiar with what CERN is doing by browsing its website and Wikipedia. The Open Data Portal has a documentation link where you can find a lot of background information, including links to GitHub and tutorials. There is also a dedicated Data Science node. Look at what the CERN scientists have already done, learn from them, and start analyzing individually selected datasets with your own methods.
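Most of the open datasets are distributed as ROOT files. Here is a minimal sketch, assuming you have downloaded one file from the portal and installed the uproot library; the file name, tree name, and branch names below are placeholders, so list the keys of your own download first and adapt them.

```python
# A minimal sketch of opening a ROOT file from the CERN Open Data Portal.
# "opendata_sample.root", "Events", and the branch names are placeholders.
import uproot  # pip install uproot

events = uproot.open("opendata_sample.root")
print(events.keys())          # which trees does the file contain?

tree = events["Events"]       # "Events" is a common tree name, but not guaranteed
print(tree.keys())            # which branches (variables) are stored?

# Read a couple of branches into arrays and inspect the first entries.
arrays = tree.arrays(["Muon_pt", "Muon_eta"], library="np")
print(arrays["Muon_pt"][:10])
```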

Working with CERN data is not a quick project, but it is a very instructive one. Besides, you learn a lot about a topic at the frontier of physics.

3. Omdena

Omdena calls itself a collaborative AI platform. For each project, it brings together 30–50 people who use data and AI to solve a real problem in the world.

Unlike a Kaggle competition, it is a real end-to-end project with all the usual project struggles. You work in a team with different skills, and with all the interpersonal challenges that brings. And you can have a real impact, as all projects are linked to one of the UN’s 17 Sustainable Development Goals.

A good friend of mine with 20+ years as a data science expert contributes, on average, 20% of his time to projects on Omdena. Even he says that he is always learning a lot of new things.

Omdena needs a wide range of skills and expertise levels in AI, data science, and machine learning. You have to go through an application process, similar to applying for an internship, with the big difference that they are not looking for competitive personalities but for people with team spirit. They do not look only for experts; it is all about the spirit of collaboration.

What makes that project interesting?

You are part of a real-world data science project. There are no sugarcoated missions, data, or outcomes; it “just” has to solve a real issue with a data-driven approach. You become familiar with the whole data science project cycle and experience the different stages and roles.

Next, it is exciting to work side by side with experienced people and to get their mentorship. In just one project, you will learn more than from ten MOOCs and Kaggle competitions combined.

And last but not least, you get a project certificate. Yes, it is yet another certificate besides your Coursera, Udacity, and university credentials, but it attests to your practical experience.

How to start?

Look at the completed, ongoing, and upcoming projects. Become familiar with Omdena’s approach and, if you are interested in participating, follow the guidelines here.

4. International and governmental organizations

Many international and governmental development organizations now work in a data-driven way. The UN, WHO, World Bank, International Finance Corporation, Inter-American Development Bank, and the European Bank for Reconstruction and Development are a few of them. In addition, most governments have task forces responsible for mission-driven data and AI projects and for building an ecosystem around them.

Besides internships, paid or unpaid, most contracts they offer are fixed-term, lasting from a few months to three years.

Further, many data science and AI startups are working with governmental departments.

In the last 12 months, I supported two former team members in finding such projects. One of them, who is half Thai, went to Thailand to work in a big data startup that works with the Thai government.

The other scanned all the job ads, submitted his CV to these international organizations, and contacted people until he finally got a four-month fixed-term contract for a project at one of the development banks abroad.

What makes that project interesting?

These jobs and projects are often abroad. In addition to practical data science experience, you gain experience with a foreign culture and learn how to operate in an international, diplomatic environment. That gives you vital soft skills for advancing up the career ladder.

You can take on responsibility from the beginning. Small teams, interactions with decision-makers, and presentations in front of senior people are part of most projects. You often gain contacts and mentorship from leading experts in the field, as they frequently advise international and governmental organizations.

Finally, the projects are unique and research-related, which leaves room for experimentation. Examples include analyzing road fatalities in a developing country where the government wants to take action to reduce them, or geospatial cause analyses of air pollution where the government wants to put laws in place to limit it. Many socio-economic aspects are integrated into these analytics.

How to start?

The first task is researching the open positions, the ongoing projects, and, importantly, startups working with such organizations.

Positions can be found on UNjobs — not only from the UN but from all the organizations mentioned earlier, as well as from others such as Coursera. Further, search the official homepages for the keyword “data scientist.”

If there is no suitable internship or short-term job, submit your CV anyway. When they have projects, they compare the requirements with the CVs already in their database, and if your profile matches, they will contact you.

Second, look for startups that are working with governments. If the startups have projects linked to the UN Sustainable Development Goals, they most probably work with governments.

Another indication is when they address societal benefits, such as water resourcing, safer communities (e.g., preventing road accidents or violence), equality, fighting diseases like HIV or malaria, or reducing pollution.

Start looking for such a project early. It takes some time and persistence.

But I can highly recommend it. Such an assignment opens many doors during your career, independent of the industry you work in. I was recently able to move to a reputable global think tank as a program lead, a once-in-a-lifetime chance to get such a position. Why did they ask me? Because I have done such projects in the past.

5. The EDGAR database

EDGAR, the abbreviation for Electronic Data Gathering, Analysis, and Retrieval, is a database that contains all submissions by companies and others that are required by law to file forms with the U.S. Securities and Exchange Commission.

It holds a wealth of business-relevant information in the form of figures and text. A quick introduction is provided here.

What makes that project interesting?

First, you learn how to access, download, and extract information from a web database that consists mainly of text. That can be done with Python, and there is already OpenEDGAR, an open-source tool written in Python. But I would also recommend other languages like Perl, which is specially designed for text processing, i.e., extracting the required information from a text file and converting it into a different form, and it is much faster than Python for this. And if you want to work in a bank, there are still many systems built on Perl.
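As a starting point for programmatic access, here is a minimal Python sketch that lists a company’s recent 10-K filings. It assumes the SEC’s public data.sec.gov submissions endpoint and uses Apple’s zero-padded CIK as the example; the SEC asks for a descriptive User-Agent with contact details, and you should verify the JSON field names against the live response.

```python
# A minimal sketch of listing a company's recent 10-K filings from EDGAR.
# Endpoint and field names reflect the SEC's public submissions API as I
# understand it; verify against the live response before relying on it.
import requests

CIK = "0000320193"  # Apple, zero-padded to 10 digits; swap in your own company
headers = {"User-Agent": "Your Name your.email@example.com"}  # the SEC asks for contact details

url = f"https://data.sec.gov/submissions/CIK{CIK}.json"
data = requests.get(url, headers=headers, timeout=30).json()

# The recent filings are stored as parallel lists of equal length.
recent = data["filings"]["recent"]
for form, date, accession in zip(
    recent["form"], recent["filingDate"], recent["accessionNumber"]
):
    if form == "10-K":
        print(date, form, accession)
```

The accession numbers can then be used to locate the individual filing documents in the EDGAR archives and download the text for analysis.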

It is an excellent database for sentiment analysis and for using that analysis to predict company and share price performance. Many filings are written in coded language because companies want to shine without giving too much information to competitors. So, this database is a great learning resource for natural language processing (NLP).

Last, these are great topics for starting your own blog, either about investments or about NLP. Done seriously, it can bring public awareness to your data science work and dramatically increase your chances of landing your dream data science job.

How to start?

Decide on one single company that you want to analyze. Take one that has existed for at least ten years. Start with the goal of predicting whether the company’s shares should be bought or sold.

Familiarize yourself with the different forms in EDGAR. Start with the 10-K, the company’s annual report, and the 8-K, the ‘current report’ where events that shareholders should know about are published.

Do a standard sentiment analysis over the last several years and look at the trends in positive, negative, and net sentiment. Compare the curves with the development of the share price. The statements also contain forward-looking information; analyze it, and this will give you the trend.

Hint: the language in forward-looking statements contains words like “will”, “should”, “may”, “might”, “intend”, and so forth.
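To make the first steps concrete, here is a toy Python sketch, assuming the filing text has already been extracted to a local file; the file name is hypothetical, and the tiny word lists are only placeholders for a proper finance lexicon such as Loughran-McDonald.

```python
# A toy sketch of lexicon-based sentiment and forward-looking-statement
# detection on a filing saved as plain text. The word lists are deliberately
# tiny placeholders for a real finance-specific lexicon.
import re

POSITIVE = {"growth", "improved", "strong", "profitable", "gain"}
NEGATIVE = {"decline", "loss", "impairment", "litigation", "weak"}
FORWARD = re.compile(r"\b(will|should|may|might|intend|expect|anticipate)\b", re.I)

with open("10k_2019.txt") as f:   # hypothetical extracted filing text
    text = f.read()

# Count positive and negative words and report the net sentiment.
words = re.findall(r"[a-z']+", text.lower())
pos = sum(w in POSITIVE for w in words)
neg = sum(w in NEGATIVE for w in words)
print(f"positive: {pos}, negative: {neg}, net: {pos - neg}")

# Flag sentences that look forward-looking, based on the hint above.
sentences = re.split(r"(?<=[.!?])\s+", text)
forward_looking = [s for s in sentences if FORWARD.search(s)]
print(f"{len(forward_looking)} of {len(sentences)} sentences look forward-looking")
```

Repeating this over several years of filings gives you the sentiment curves to compare against the share price.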

Develop it further with more sophisticated NLP and sentiment algorithms, by looking at other companies in the same industry, and by integrating other sources such as news and macroeconomic figures. Compare the results with share prices and financial ratios. There are no limits to these analyses, and they provide rich content for a blog.

Connecting the Dots

I know that it is hard work to build up a cool data science portfolio. But with such a collection, you can make above-average progress in the field, have a lot of fun, and land your dream data science job.

I recommend this not only for newbies in the data science area but also for senior data scientists. It opens up many new paths during your career, not only because of the projects themselves but also through the newly gained network.

These ideas show you the wide range of possibilities and encourage you to think outside the box.

For me and my friends, the learning and the fun are essential. That is our main focus when dedicating time to such projects.

That we also built up an exciting and unique portfolio along the way was just a by-product.