How to be a responsible data guru
It is all well and good to learn the technical skills that you need to become a data scientist. I think that it is also extremely important to learn to think like a data scientist. That means always questioning…basically everything.
Obviously every data science problem will require you to question your methods and the data in different ways, but there are a few things that I think are important to consider whenever embarking on any new data science project. In this story, I will go through those questions and why I think they are important to being a responsible data scientist.
My questions for any new data science project are:
- What is the question you are trying to answer?
- Do you know exactly what you are trying to measure?
- Do you have the right data to answer your question?
- Do you know enough about how your data was collected?
- Are there any ethical considerations?
- Who is going to read your analysis and how much do they understand statistics?
- Do you need to be able to interrogate your methods?
It is extremely important to at least have an idea of what question you are trying to answer before you interrogate any data set.
You don’t want to test multiple hypotheses and just see which ones come out as significant. If you did that, you would run into the multiple hypothesis testing problem. We will go into this in more detail when we do our lessons on statistics, but in brief, it occurs when you consider multiple hypotheses at the same time.
When we talk about a significant result, in general, we are referring to a result that we are fairly confident is different from the ‘control’ because of a real effect rather than random chance. 95% confidence is most commonly used (p<0.05).
That leaves an error rate of 5%, where we label a result as significant when it really is not. The problem with testing multiple hypotheses at the same time is that the likelihood of making this type of error for at least one of the hypotheses increases. Thus, by indiscriminately testing many hypotheses at once, you would be increasing your chance of making a false discovery.
So rather than testing at random and seeing what sticks, it is much better practice to strategically use statistical tests when you have a thought-out and well-researched hypothesis.
- If you had a data set with measurements on 4 different groups and did a t-test between each different combination of them to see if any of them came out to be significant, then you would run into the multiple hypothesis testing problem. You would have an increased likelihood of making an erroneous conclusion.
- It would be much better practice to create a null hypothesis and test that instead.
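To make the 4-group example concrete, the chance of at least one false positive (the family-wise error rate) can be calculated directly. This is a minimal sketch; the Bonferroni adjustment shown is just one of several correction methods:

```python
from itertools import combinations

# Probability of at least one false positive when running n independent
# tests, each at significance level alpha (the family-wise error rate).
def family_wise_error_rate(n_tests, alpha=0.05):
    return 1 - (1 - alpha) ** n_tests

# Four groups give 6 pairwise t-tests.
n_pairs = len(list(combinations(range(4), 2)))

# The chance of at least one spurious "significant" result is ~26%,
# far above the 5% you intended.
fwer = family_wise_error_rate(n_pairs)

# A simple (conservative) fix: Bonferroni-correct the threshold,
# testing each pair at alpha / n_tests instead.
bonferroni_alpha = 0.05 / n_pairs
```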
In addition to avoiding the multiple hypothesis testing problem, having clarity of thought about what question you are trying to answer will keep you from getting side-tracked.
Sometimes there are so many different shiny and interesting insights to be gleaned from a dataset that it can be easy to fall down the wrong rabbit hole. You may end up doing a lot of work to solve interesting problems, but have no answers to your original questions.
That may not be such a big deal if data science is a hobby for you, but it may be much more important if you are trying to work to a deadline, or solve a specific problem for the company you work for.
Once you know what problem you want to solve, you need to know how you are going to solve it. Part of that is deciding what you are trying to measure.
There are often multiple different ways to approach the same question. However, if you choose to measure the wrong effect or variable, then you may not be able to effectively solve your problem. So it is extremely important to thoughtfully consider if what you are trying to measure is the most effective way to answer your question of interest.
Similar to number 1 above, if you choose the wrong effect or variable to measure, then you can spend a lot of time working on an analysis that does not really meet your needs.
- I once needed to create a prediction model where the outcome variable was if an individual had been treated with a specific medical procedure. The data I was using was healthcare claims data where procedures are indicated by codes. Many of the codes were extremely specific so I needed to group a selection of codes together to create my binary outcome variable.
- Whilst I was investigating my procedure of interest, I came across a subset of these procedures which captured my attention. I so wanted to go down the rabbit hole and focus on this subset of the procedure, but that is not what I had been asked to research by the company that I was working for. So I had to table that for a later date and stay on task.
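As a rough sketch of that code-grouping step with pandas (the claims layout, column names, and procedure codes below are all invented for illustration, not real billing codes):

```python
import pandas as pd

# Hypothetical claims data: one row per billed procedure.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "procedure_code": ["A100", "B200", "A101", "C300", "B201"],
})

# The selection of specific codes that together define the
# procedure of interest.
codes_of_interest = {"A100", "A101"}

# Binary outcome per patient: 1 if they ever had one of those codes.
outcome = (
    claims.assign(treated=claims["procedure_code"].isin(codes_of_interest))
    .groupby("patient_id")["treated"]
    .max()
    .astype(int)
)
```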
Deciding which approach to take can be the most important step in any data science project.
It may take a bit of research to find out what is the correct variable to measure for your purpose, but it will be worth it. If you are confident that you have correctly chosen what to measure then you can be much more confident in the results of your analysis.
Further, when you are communicating your findings to others you can do so with the assurance that you have appropriately considered which problem you want to solve and how to measure it correctly.
So once you have a problem to solve and know how to measure the effect you are interested in, you need to have the data to make that measurement. There is no point in having the best, most interesting problem and variable to investigate if you do not have the means to do so.
Having the perfect dataset to answer the exact question and measure the specific effect of interest is very rare.
Not having the right data can be a huge frustration as a data scientist.
Often we have to make do with what we have. Thus we may have to measure something slightly different to what we originally intended. That may be fine, as long as we then temper how we report our results to reflect that adjustment.
It is sometimes possible to use one variable that you do have data for to represent another variable that you cannot gather data on. Yet, if you do this, it is imperative that you report your results appropriately. Make sure that you do not make claims about your results that you do not have the data to support.
- Say you wanted to find out which dog breed is the most popular in Australia, but you only had data from one city, Sydney. You could still do an analysis of popular dog breeds, but you would need to communicate in your results that your data only reflects the Sydney population.
Just because you have the correct data to measure your effect of interest and solve the problem you have in mind, doesn’t mean you can relax 😉
Datasets don’t just spring into being, fully-formed and containing complete data. There are many different ways data can be collected. Many of them involve machines or people, which means that errors can be made. Try to consider or at least acknowledge as many potential error sources as possible when you do any analysis.
It would be impossible for me to describe all the possible ways that data collection can go wrong. Yet there are a few different investigations you can do to get to know your dataset. They can allow you to understand better what its shortcomings may be.
- Do you know where your dataset was collected? Many variables will have regional variation, which should be captured by the dataset. Yet this cannot be done if the location data was not recorded.
- Do you know when your dataset was collected? If there are daily, weekly or seasonal differences in what you are analysing then it is important to know what time period your data is from.
- Who was your data collected by? Was the research paid for by a company that might have a vested interest in the result? Was the person who did the study unbiased?
- Was the dataset complete at the time of collection?
- Was any of the data inferred or imputed? Different methods of data imputation should be recorded with the data set so that you can interrogate if you think that the method used was appropriate for your analysis.
- Were any post-collection modifications made to your dataset prior to you receiving it?
- Are there any other possible biases in the collection methodology? If so, they should be accounted for in your analysis, or at least acknowledged in the presentation of your results.
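Some of these checks can be run quickly once the data is in hand. Here is a minimal sketch with pandas, using an invented toy dataset and made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; the column names and values are assumptions
# made purely for illustration.
df = pd.DataFrame({
    "city": ["Sydney", "Sydney", None, "Perth"],
    "collected_on": pd.to_datetime(
        ["2021-01-05", "2021-06-12", "2021-06-13", None]
    ),
    "value": [1.2, np.nan, 3.4, 2.2],
})

# Where was it collected? Which regions are actually represented?
regions = df["city"].dropna().unique()

# When was it collected? Could seasonality matter over this period?
date_range = (df["collected_on"].min(), df["collected_on"].max())

# How complete is it? Fraction of missing values per column.
missing_share = df.isna().mean()
```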
It is important to get to know the history of your dataset so that you can decide if any of these potential shortcomings will prevent you from finding the answers you seek. Or alternatively if they are data issues that you can work with 😃
This is a very important question that is often overlooked. Are your analysis and the results that you have come to ethical?
You might wonder how an analysis could have ethical considerations. Surely numbers appropriately collected and measured with careful analysis could not be unethical? But the reality is that sometimes things like sexism and racism can creep in without us intending it at all, especially when the world that we are trying to describe using data has things like inequality and prejudice within it.
- Suppose you were building a machine learning algorithm to predict which people will buy a product. Based on your analysis, one of the features that strongly predicts your outcome variable may be race. However, you should decide if it is correct to include race as a feature in your model. Perhaps race is not actually the cause of the difference in your outcome variable; instead, another variable correlated with race may be the causative one. That may or may not matter when you are predicting whether people will buy a product so that a company can market to them. In contrast, if it is something like predicting whether they will pay back a mortgage to decide if they should be given a bank loan, then suddenly it becomes much more important.
- One of the most famous cases of racism in data science was when a facial recognition algorithm labelled two African American men as gorillas. This was not done maliciously by the people designing the software, but occurred because there was not enough diversity in the training dataset.
There are lots of people who have written about ethics in data science so you can educate yourself to make ethical decisions when designing your data science projects. It is important to acknowledge the biases that exist in the world and decide for yourself if the analysis that you are planning would be ethical to complete. Be mindful and do your absolute best to perform analyses in a way that will make the world a better place.
As data scientists, we are stewards of data. We can influence how companies do business, what decisions governments make, which drugs get developed etc. So it is extremely important that we do so responsibly and mindfully.
Who is going to read or be on the receiving end of your analysis? Is it a user of a website or product? A data science team? A marketing team? Business development? A sales team? Each of these groups will have a different level of statistical understanding, so you may have to tailor your analysis to your audience, in particular the methods you choose, how you convey your results, and any caveats that are associated with them.
It is your responsibility as a data scientist to make sure that you accurately convey your findings to their intended audience. You must take into account how much background in statistics your audience has.
One of the most difficult presentations I ever had to give was to an audience who was extremely diverse in their understanding of statistics. There were neuroscientists, neurosurgeons and also people who had suffered brain trauma in the audience. The key was to include enough details to keep the experts happy but not make the other people feel like I was talking down to them or excluding them from the conversation. It is a balance that I continue to work on in conveying my results to various audiences.
It is also common for different people to want to use your data for their own teams’ purposes. You are obligated to make sure they know what the limitations of your analysis are and if they can confidently make the claims they want to.
This can sometimes lead to conflict. Different stakeholders may have different purposes for your analysis, but it is important to stand your ground and make sure that everyone has a clear understanding of what your results mean. Furthermore, make sure they know what conclusions can be drawn.
- I once worked at a company where I did an analysis for the marketing team. It was a pleasure to work with this team because they communicated with me clearly about what claims they wanted to make about our data. There were a couple of times where I had to politely correct something that they wanted to say that I felt our data could not support, but they took this constructive criticism very well.
- In contrast, these types of conversations do not always go quite so smoothly. I was once working on a data product and the sales team wanted to make certain claims about its capabilities and coverage. I had to be quite firm to make sure that we did not over-hype our capabilities. But I think this is probably the natural state for scientists and salespeople to be in: the former being naturally conservative and the latter being much less so 😃
Before you decide what methodology to use in your data science project, it is important to know how much you will need to be able to interrogate the analysis or model you produce.
Sometimes you might need to be able to explain each step you took in great detail. This includes all the variables that were taken into account and why. Other times, accuracy is the name of the game. In that case, it doesn’t really matter what went into the model as long as it predicts with minimum errors.
Knowing this will help you decide on your methodology.
- If you have a prediction problem and want to use a supervised machine learning algorithm to predict a binary outcome variable, you have a few different options. If you need to be able to examine each feature in the model and be able to explain it clearly to someone, then it might be best to use a logistic regression model. Yet, if it wouldn’t matter if your model is more of a black box, then perhaps a random forest model could serve you better.
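That trade-off can be seen directly with scikit-learn. Below is a minimal sketch on a synthetic dataset (the data, seed, and model settings are all assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: a binary outcome driven mainly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Logistic regression: one coefficient per feature, with a sign and
# magnitude you can explain to a stakeholder.
logit = LogisticRegression().fit(X, y)
coefficients = logit.coef_[0]

# Random forest: often stronger on messy data, but more of a black box.
# feature_importances_ gives only a rough ranking, not a direction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
```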
Whilst explainability should not be the primary consideration when choosing a machine learning model for an analysis, it is worth thinking about. There can sometimes be a trade-off between accuracy and explainability, so it is really up to you as a data scientist to decide which direction to go.
So before you get started on any data science project ask yourself:
- What is the question?
- Do you know what you are measuring?
- Do you have the right data to answer your question?
- Do you know how your data was collected?
- Are there any ethical considerations?
- Who is going to read your analysis and do they have a background in statistics?
- Do you need to be able to interrogate your methods?
I hope that these 7 questions have been thought-provoking. They will help make sure that you don't go completely off the rails in any data science analysis you perform. Hopefully, they will save you some time and trips down unnecessary rabbit holes. As you gain experience in data science, I am sure that you will come up with your own questions that you always like to ask yourself before starting a new project. Just remember to always be mindful in all your analyses and to appropriately convey any caveats to your results.