Tools & Tech you should know during your Data Science learning journey
Everybody wants to be a data scientist, and vast learning resources are available on the internet. But every good thing comes with a side effect. I am no longer surprised when I meet data practitioners who call themselves data scientists yet struggle with basic research methodology or probability concepts. That is a topic for another day. Today we will discuss what you should know, and which tools you might need, when you start learning data science or analytics. At the end, you will also find some recommendations for tools to set up your lab.
Anybody who wants to embark on the journey of analytics or data science first needs a working computer with a decent internet connection. The machine can be a notebook or a desktop; even a fancy 2-in-1 tablet works. Did somebody mention an iPad or Android tablet? If you have a Raspberry Pi, you can start with that too. The point is that you do not need a high-end workstation or a gaming rig with crazy computing power and fancy RGB lighting. OK! If you want to play all the new AAA games, you need a latest-generation NVIDIA GTX or RTX GPU, a processor with an endless number of cores(!), lots and lots of memory (RAM) and, of course, blazing-fast storage. Yes, a data scientist also needs those at some point, but not for starting. Let us agree that a working computer with a decent internet connection is good enough to start learning. Oh yes, we need the internet to download all the tools we will use to learn (nearly everything is free or open source).
Whatever system you plan to use for your learning, it is highly recommended that you understand your machine well. Please do not be that person who buys a laptop only because it has the same lighted logo as their phone. It is important to know your horses before the race, especially whether the processor is a 32-bit x86, 64-bit x64, or ARM architecture. If you do not know already, go to your system properties and check. Keep a note of it. It will come in handy later when we download the software required for the learning journey. If you want to learn more about the differences between these platforms, you can find a detailed Wikipedia article on each of them. Next, we need to know how much memory the system has. This information is also available in the system properties. And of course you need to know how much free disk space you have, because we will install software and bring in sample data. So you need to be very familiar with your storage. Thanks to the smartphone manufacturers, even Grandma now knows she needs a half-terabyte phone because she wants to take lots of photos of all the grandchildren with a 48-megapixel ultra-wide phone camera. In summary, you always need to be aware of your machine's processor architecture and speed, total memory, and available storage space to set up the learning lab environment effectively.
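If Python is already installed, the same facts can also be read with a few lines of code. This is only a minimal sketch using the standard library; the function name `machine_summary` is made up for this example. Total RAM is left out on purpose because the standard library has no portable call for it (a third-party package such as psutil is commonly used for that, or just look it up in the system properties).

```python
import os
import platform
import shutil


def machine_summary(path="."):
    """Collect basic hardware facts using only the standard library."""
    usage = shutil.disk_usage(path)  # total, used and free space in bytes
    gib = 1024 ** 3
    return {
        # Typical values: 'x86_64', 'AMD64', 'arm64', 'aarch64'
        "architecture": platform.machine(),
        # Number of logical CPUs visible to the OS (may be None if unknown)
        "logical_cpus": os.cpu_count(),
        "disk_total_gib": round(usage.total / gib, 1),
        "disk_free_gib": round(usage.free / gib, 1),
    }


if __name__ == "__main__":
    for name, value in machine_summary().items():
        print(f"{name}: {value}")
```

The disk figures describe the drive that holds the folder given in `path`, so pass the folder where you plan to keep your tools and sample data.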
Now you know about your machine's major hardware. Let us get familiar with another crucial part before we start debating which tools to use.
The operating system, or OS, comes in many shapes and forms. The world's most popular OS today is Android: roughly 4 out of 10 devices are powered by it. Can we do analytics on Android? Why not? It is not meant for desktop computing like Windows, macOS, or Linux, but it is maturing day by day. With multi-window operation and growing support for coding environments, it is very capable. So, if you own a high-end smartphone or a tablet, you can start your data science journey on that device as well. If you have an iPad, you can do a similar level of coding and analysis on iPadOS. You do not need to wait for a laptop or desktop. Still, the most effective OS in our case would be one of these three: Microsoft Windows, Apple macOS (formerly OS X), or any variant of Linux. In the case of Linux, my favorite is Ubuntu. Once you know which OS you are running, you need to know its variant (32-bit or 64-bit) and its version.
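The OS name, version, and bitness can be checked from Python as well. Here is a small sketch with the standard library; `os_summary` is a hypothetical helper name. One caveat: the pointer-size trick reports whether the Python interpreter itself is a 32-bit or 64-bit build, which is not always the same as the OS (a 32-bit Python can run on a 64-bit OS).

```python
import platform
import struct


def os_summary():
    """Report the operating system and the bitness of this Python build."""
    return {
        # 'Windows', 'Linux', or 'Darwin' (which means macOS)
        "os": platform.system(),
        "release": platform.release(),
        "version": platform.version(),
        # Size of a pointer in bytes times 8: 32 or 64 for this interpreter
        "python_bits": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    for name, value in os_summary().items():
        print(f"{name}: {value}")
```

The bitness matters later when you pick an installer: a 64-bit build of a tool will not run on a 32-bit OS.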
Once you are familiar with all these terms, the next part of the story is what type of user account you are using on the machine. If you share the device with anybody, or it was provided by your office or school, it is especially important to know your account type. Without the right account type or permissions, you might not be able to install or configure all the required tools yourself. It is preferable to have Administrator or Power User privileges when you install and set up the tools. Once everything is up and running, a regular user account will be fine. You will also need elevated privileges whenever you want to update or upgrade the tools.
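A quick way to see whether your current session has elevated privileges is sketched below. The function name `has_admin_rights` is invented for this example, and the check is deliberately simple: on Linux or macOS it only tells you whether the process runs as root, not whether your account is allowed to use sudo.

```python
import ctypes
import os


def has_admin_rights():
    """Return True if the current process runs with elevated privileges."""
    if os.name == "nt":
        # Windows: ask the shell API whether the process has administrator rights.
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    # Linux/macOS: the root user has effective user id 0.
    return os.geteuid() == 0


if __name__ == "__main__":
    print("Elevated privileges:", has_admin_rights())
```

If this prints False on a shared or office machine, you will probably need to ask your administrator before installing system-wide tools.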
Almost every OS comes with a default text editor. Your everyday Windows machine has the classic Notepad application, and macOS comes with TextEdit. Most Linux distributions come with vi or vim pre-installed. These are capable editors for opening files, looking through data, and writing code. As coders depend heavily on their editors, they have also built editors for themselves. Google it on any given day and you will not be surprised to find a new one. Every writer, coder, and developer has a favorite set of tools, just like every carpenter has a favorite hammer and every artist has a preferred set of brushes. Some editors are available across all OSs. Some are paid, and some are free or open source. Atom, Sublime Text, and Notepad++ are a few of the popular text editors.
Using a text editor is very straightforward. To start, just open a new file, write in it, and save it with the desired file extension. In most cases, if you do not give any extension, your editor will save it as a plain .txt file. Once you start using it every day, you will find ways to work with it effectively; a few keyboard shortcuts, for example, will come in very handy. The more you use it, the more familiar you will become with the add-ons, macros, and other features. We will discuss the text editor in detail later, and we will use Notepad++ for all our examples.
The world has changed since the original iPhone. It is no longer good enough for everything to be GUI based; it needs to be touch-optimized too. But in this part of the world that is not yet 100% true. As a starter, you still need some essential command-line interface (CLI) proficiency for file management: creating a new folder or file; copying, moving, or deleting a folder or file; more importantly, finding the current working directory and changing directory; and finally, using the built-in help. For a starter, this will be good enough. Working with the command line may be scary at first, but the more you use it, the less scary it will be. Once you adopt it, you are in control! The CLI awaits your command. We will discuss the most common CLI commands relevant to beginner data professionals.
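The same file-management ideas carry over to code. As a rough sketch, here is each operation done from Python with the closest CLI command noted in a comment; the folder and file names are made up, and everything happens inside a throwaway temporary folder so nothing on your machine is touched.

```python
import os
import shutil
import tempfile
from pathlib import Path


def file_management_demo():
    """Walk through basic file operations inside a throwaway temp folder."""
    start_dir = Path.cwd()  # like `pwd` (Linux/macOS) or `cd` with no arguments (Windows)
    listings = []
    with tempfile.TemporaryDirectory() as tmp:
        lab = Path(tmp) / "lab"
        lab.mkdir()    # like `mkdir lab`
        os.chdir(lab)  # like `cd lab`
        try:
            Path("notes.txt").write_text("hello data\n")  # create a new file
            shutil.copy("notes.txt", "notes_copy.txt")    # like `cp` (Linux/macOS) or `copy` (Windows)
            listings.append(sorted(p.name for p in Path(".").iterdir()))  # like `ls` or `dir`
            Path("notes.txt").unlink()                    # like `rm` (Linux/macOS) or `del` (Windows)
            listings.append(sorted(p.name for p in Path(".").iterdir()))
        finally:
            os.chdir(start_dir)  # step back out before the temp folder is deleted
    return listings


if __name__ == "__main__":
    for listing in file_management_demo():
        print(listing)
```

The first listing shows both files and the second shows only the copy. For help, Python has the built-in `help()` function, for example `help(shutil.copy)`, much as the CLI has `man cp` on Linux or `copy /?` on Windows.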
Once in a while a miracle happens in every field, and Microsoft Excel is a kind of blessing from heaven for all data professionals. It is rare to find a person who has worked with data and never encountered Microsoft Excel. From the beginner to the grandmaster of data analytics, everyone uses this tool, if not regularly then occasionally. Year after year, it just keeps getting better. Data cleaning, visualization, basic statistical modeling, and advanced exploratory data analysis are all possible out of the box. Once you start using add-ins or building your own macros, the possibilities are nearly limitless. You may disagree that Microsoft Excel is a proper analytics tool, but it has all the bells and whistles required to do most of the steps of analytics.
SAS and IBM SPSS were very popular and influential analytics tools back in the day. They are still outstanding and can solve most advanced analytics or predictive modeling problems pretty well. The challenges start when you want to use these tools with the modern, BIG(!), highly scalable ecosystem. Here, KNIME and RapidMiner do an excellent job. You can start learning these tools today, and the learning curve is fairly smooth. But not every business organization can afford to use all of them. The best part is that you can use your favorite Python or R code inside these tools. This list would not be complete without mentioning H2O, Dataiku, DataRobot, Orange, and Alteryx.
So we have Microsoft Excel and one of the GUI-based tools. Do I need to learn to code? Yes and no. If you work in a large team and someone on the team can quickly code whatever the off-the-shelf tools do not offer, you can get help from them. Still, it is better to learn basic coding so that communication is smooth and you understand the challenges they might face in implementing your request. The point is that, nowadays, no tool on the market can solve 100% of your data analytics problems out of the box. You will need some custom function specific to your industry or problem. In those cases, coding is essential.
Now that we agree we need to learn a programming language to navigate the world of analytics, the question is which language. Let us not get into that debate; instead, we can choose either Python or R. Both languages are capable of doing similar things. You can even learn Java or Julia. It does not matter much which language you pick. Once you are familiar with one, switching to another will not take much time, and there are always Google and Stack Overflow to help you. But it is better to start with a language that has a large community. A large community means lots of tutorials, examples, and forums. You can also find lots of sample projects to follow.
Not all data fits into an Excel spreadsheet. Excel is a great tool, no doubt, but you need to store your data somewhere you can organize it and access it quickly. This is a very crowded space, with tons of solutions out there for different types of databases. But if you learn standard SQL by heart, you are good to work with almost any SQL-based solution with a little help from its documentation or Google. Whether it is Microsoft SQL Server, Oracle, Impala, or Google BigQuery, SQL should be your second language as a data professional. The concepts of databases and SQL also help you do data prep in Python or R, even if you do not work on top of a database.
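To get a feel for SQL without installing a database server, you can practice with SQLite, which ships with Python. The sketch below is only an illustration: the `sales` table, its columns, and the sample rows are invented for this example, and SQLite's dialect differs in places from the products named above, although a basic GROUP BY like this one is standard SQL.

```python
import sqlite3


def total_sales_by_region(rows):
    """Load (region, amount) rows into an in-memory table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")  # temporary database that lives only in RAM
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data, purely for illustration
sample = [("North", 120.0), ("South", 80.0), ("North", 30.0)]
print(total_sales_by_region(sample))  # [('North', 150.0), ('South', 80.0)]
```

The GROUP BY idea is the same one you will meet again when you summarize data in Python or R, which is why learning SQL pays off even away from a database.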
You are almost there. Now you know about your hardware and the required software. For efficiency later, you will need some more tools, such as version control and a scheduler; you will need to be familiar with file compression and, of course, with modern cloud-based tools and technology. These can be learned later, once you are used to the lifestyle of a data professional, which is fascinating only for the right people. Finally, here are my recommendations for setting up your lab.
Computer -> CPU: Core i5 or AMD Ryzen 5 class, RAM: 16GB, Storage: 1TB SSD
Operating System -> Ubuntu LTS 64-bit or Windows 10 64-bit with WSL Ubuntu LTS
Text Editor -> Notepad++
GUI Tools -> KNIME & Microsoft Excel
Programming Language -> Python 3 Anaconda Distribution
SQL -> MySQL