Data scientists often start their academic and even professional journey alone, so they become used to organizing their concepts and code on their local drive by themselves, perhaps in some makeshift manner. But what happens when you work on a team of several data scientists, data engineers, software developers, and even product managers? You will have to collaborate somehow. Where will the code for your data science model be shared and controlled? GitHub is the answer to these questions.
This platform lets teams and cross-functional members of an organization stay on the same version of a codebase using Git, and approve and comment on new code changes that have been requested and documented through a pull request.
Below, I will share and describe the benefits of GitHub on a data science project.
GitHub documents and guides software developers, designers, and project managers through the use of Git, pull requests, issues, wikis, and gists. Setting up a data science project on GitHub is fairly simple and gives your team checks and balances on your files and code. Git is the version control system you use from your terminal to switch branches, record code changes, and push them. Gists are also useful for sharing code snippets, for example, if you do not want to share an entire data science project. Below, I will discuss the benefits of GitHub.
A data science model can work on your local computer, but as soon as you bring others onto the same project, GitHub offers several benefits that help ensure a successful machine learning model is put into place. I will describe the benefits of Git, pull requests, collaboration, and gists in more detail, with examples, ahead.
Version Control (Git): you can run commands that push new versions of your codebase to GitHub. Once a branch has been pushed with Git commands like the following, a pull request can be created from it, and your data science model code can then be monitored and improved. Here are some common, useful Git commands:
- check which branch you are on: git branch
- create a new branch off of your master branch: git branch branch_name (then switch to it with git checkout branch_name)
- pull your master branch so it is up-to-date: git pull
- check which files have been changed or staged: git status
- add your code changes from your branch: git add file_name (or git add . for all changes)
- commit your changes from your branch: git commit -m "Added change"
- push your changes from your branch: git push (the first push of a new branch needs git push -u origin branch_name)
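The commands in this list can be strung together into one end-to-end run. The following is a minimal sketch rather than a definitive workflow: it assumes Git is installed, uses a throwaway local bare repository as a stand-in for a GitHub remote, and the file name model.py, the branch name feature_branch, and the user details are all hypothetical.

```shell
#!/bin/sh
# Sketch only: a temporary local bare repository plays the role of GitHub.
set -eu

workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/remote.git"    # stand-in for the GitHub remote
git init --quiet "$workdir/project"
cd "$workdir/project"
git symbolic-ref HEAD refs/heads/master          # name the first branch master
git config user.name "Example User"              # hypothetical identity
git config user.email "user@example.com"
git remote add origin "$workdir/remote.git"

# Give master an initial commit so there is something to branch from.
echo "print('baseline model')" > model.py
git add model.py
git commit --quiet -m "Initial commit"
git push --quiet -u origin master

# The workflow from the list above.
git pull --quiet                        # make sure master is up-to-date
git branch feature_branch               # create a new branch off of master
git checkout --quiet feature_branch     # git branch alone does not switch to it
echo "print('new feature')" >> model.py
git status --short                      # lists model.py as modified
git add model.py                        # stage the change
git commit --quiet -m "Added change"    # commit it on feature_branch
git push --quiet -u origin feature_branch   # first push sets the upstream
git branch                              # the current branch is marked with *
```

On GitHub itself, the pushed feature_branch is what you would then open a pull request from.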
Pull Requests: this action is an extremely useful part of the GitHub platform. With pull requests, often called “PRs”, you can have a second, third, or even more sets of eyes on your code changes. When you want to add code to an existing master branch, you create your own branch that includes the new code, then open a pull request to merge it. People on your team will view and test it to make sure that your additions are correct. The PR process is not only beneficial for catching mistakes and ensuring people double-check your work, but it also keeps everyone on your team on the same page. When others have to view your changes and approve the new code, they refresh their own knowledge of the model as it expands to more files and systems.
Collaboration: using GitHub brings collaboration among multiple team members, who can include other data scientists, software engineers, data engineers, and product managers. Collaboration is a benefit because the input of others can make your data science model more robust, more efficient, and possibly more accurate. You can include all appropriate people on the data science model, which has a positive impact on the entire project.
Gists: these are useful if you want to share a smaller code snippet with others, or even display code right here on Medium in its appropriate programming language. A gist can be an easy way to show an example of your code. When you designate the programming language, say Python, by using a .py file name, the code is syntax-highlighted; for example, the import code is highlighted in red. Below is a gist to serve as an example:
GitHub is a useful tool for your data science project within your organization. It can house, share, and improve code through collaboration, Git, pull requests, and gists. There are several other benefits of GitHub as well, which are outlined on its website. A data science model needs all of those key components to succeed.
While the focus of data science in academia is not necessarily on GitHub, but rather on the theory, concepts, and code of common machine learning models, this platform should be highlighted more before students enter the workforce and have to immediately start working with others. To sum up, GitHub is beneficial in developing a successful data science model.
To find out more about the Git part of GitHub, see the article below:
I hope you found this article interesting and useful. Thank you for reading!
GitHub, Inc., GitHub main page, (2020)
M. Przybyla, pandas-append.py, (2020)
M. Przybyla, Common Git Commands Every Data Scientist Needs To Know, (2020)