
I am often asked by students in our department, “I want to be a Data Scientist - can you draw me a roadmap?” I am always delighted to answer, and answering so many times has let me refine and enrich my response. For a while I have wanted to write this post, not only to avoid giving slightly different answers to the same question over and over, but also to help other Industrial Engineering students who are looking for an answer. This post compiles my thoughts and advice on the subject.

The positive perception that the term Data Scientist has achieved over the past decade is truly admirable. It seems as if the methods and tools contained within it had never existed before, and the emergence of the term “Data Science” has sparked excitement among us all. There is no need to debate that it is a very intelligently chosen name. ‘Data’ is already the apple of the eye of today’s decision-making processes. When ‘Science’ is added to it, it becomes even cooler.

“I am a Data Scientist - I am an expert in data and also do the science behind it - wow, I am really cool!”

We used to see job titles such as statistician, data analyst, business analyst, data miner, applied mathematician, operations researcher, and machine learning engineer; now, we see a single term for all of them: Data Scientist. The methods under the umbrella of Data Science have actually been around for many years, under different names. The primary reason these methods weren’t as “trendy” as they are now was the limited volume of data, limited access to it, and limited computing technology. With growing data volumes and advancing computing technology, theoretical knowledge has become easy to apply in practice. For example, Operations Research, one of the important subfields of Data Science, had its foundations laid in the 1940s, yet it did not achieve the success expected of it until the 2000s. In fact, in the late 1970s, there were even articles arguing that Operations Research had no future:

The following observation of Ackoff (1979), known as the father of Operations Research, is truly interesting:

“Operations Research is dead even though it has yet to be buried. I also think there is little chance for its resurrection because there is so little understanding of the reasons for its demise.”

Yet, today Operations Research is recognised as one of the fastest growing professions by FORBES: Operations Research Analyst - The fastest growing job you have never heard of.

So what has changed in the last 20 years? The answer is actually quite simple: observers at the time could not foresee the growth in data, the ease of access to it, and the advances in programming and computing technology that we experience today. Of course, it is not fair to criticize these past opinions from the comfort of our current conditions. In a period when simplex tableaus were worked out by hand, the problems solvable by Operations Research methods were very limited, and so expectations of its success were low. We cannot expect everyone to have a vision like our esteemed Professor Cahit Arf.

Arf, Cahit, Makine Düşünebilir Mi ve Nasıl Düşünebilir? [Can a Machine Think and How Can It Think?], Atatürk Üniversitesi – Üniversite Çalışmalarını Muhite Yayma ve Halk Eğitimi Yayınları Konferanslar Serisi No: 1, 1959, Erzurum, pp. 91-103

I want to become a Data Scientist, where do I start?


To be honest, this is a difficult question to answer for me, as becoming a Data Scientist is a long and complex process with no definitive starting point, route, or endpoint. In addition, the phrase “Data Scientist” is relatively new and encompasses many different roles, making it difficult to provide concrete instructions on which methods and tools to learn. My advice is twofold:

  • (a) First, you should acquire a strong foundation of analytical knowledge, and then decide the fields of Data Science you would like to specialize in.
  • (b) Secondly, you should learn to communicate with computers, through programming, in order to translate your analytical knowledge into the digital world.

As an Industrial Engineering student, you can gain the essential analytical foundation for becoming a Data Scientist through the natural course of your curriculum; however, you may need to make extra effort to acquire the programming knowledge that will combine this foundation with computational power. There is nothing as foolish as wanting to become a Data Scientist but being afraid of coding. Hence, it is essential to prioritize programming and remove any barriers between you and computational technologies. A Data Scientist can only showcase their skills with the help of computational power. Unlike the analytical side, the starting point, route, and end of acquiring the required programming knowledge are clear, so you can make progress in a short time. Once you can communicate with the computer, your enthusiasm will grow and your progress will speed up. High-quality written, audio, and visual resources, including courses from the best universities and professors in the world, are available for free and are easy to access.

Your coding needs may change depending on which subfield of Data Science you specialize in. Generally speaking, you don’t need to be an expert coder, but you do need to build up your basic programming knowledge, be able to read code written by others, and reach a level where you can understand the resources you look up for help. I have also seen a growing perception among students that coding skills will not be necessary because of the latest AI trends. I’m not an old-fashioned thinker, and I believe this may be possible one day. Even if AI engines one day reach a level where they can write perfect code, I believe we will still need programming knowledge to give the AI code to work with, test the correctness of what it produces, and intervene when needed.

Let’s look at the above two items in more detail now:

(a) Gaining a Fundamental Analytical Knowledge and Choosing a Data Science Field to Specialize In


The core analytical methods that make up the foundation of data science can be grouped into three classes: descriptive, predictive, and prescriptive. I use these three classes to classify data scientists according to their work. In general, we can say that such a classification has already been made. For a better and more detailed classification than mine, you can read Cem Vardar’s article The Types of Data Science. Although the logic behind Cem’s framework is similar to mine, he provides a more specific classification.

  • Descriptive Analytics: These people are experts in summarizing, visualizing, and thus uncovering the performance of a system from data that cannot be interpreted through manual calculations and observations.

  • Predictive Analytics: These people develop models that use existing data to predict future outcomes. Machine learning studies are the pioneers of this field.

  • Prescriptive Analytics: These people develop models that use past data to generate prescriptive results for decision makers. Generating prescriptive results means providing concrete recommendations for decision problems. The dominant field of study in this area is Operations Research. Prescriptive analytics differs from the previous two classes in that it can hand the decision maker a ready-made decision and, for many models, also prove that this decision is optimal. It is gaining popularity and is well suited to the background of Industrial Engineering students.
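To make the three classes a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the weekly demand figures, the profit of 5 per unit sold, the loss of 2 per unsold unit, and the candidate order quantities. The prescriptive step simply tries every candidate, which is a toy stand-in for the optimization models used in real Operations Research work.

```python
from statistics import mean

# Hypothetical data: units of one product sold in each of six weeks.
weeks = [1, 2, 3, 4, 5, 6]
demand = [100, 110, 120, 130, 140, 150]

# Descriptive: summarize what has already happened.
average_demand = mean(demand)

# Predictive: fit a least-squares line to past demand and extrapolate to week 7.
x_bar, y_bar = mean(weeks), mean(demand)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, demand))
    / sum((x - x_bar) ** 2 for x in weeks)
)
intercept = y_bar - slope * x_bar
forecast = intercept + slope * 7


# Prescriptive: recommend how many units to order for week 7.
# Assumed economics: each unit sold earns 5, each unsold unit loses 2.
def profit(order_qty, expected_demand):
    sold = min(order_qty, expected_demand)
    unsold = order_qty - sold
    return 5 * sold - 2 * unsold


# Toy "optimization": evaluate every candidate order quantity and keep the best.
candidates = range(100, 201, 10)
best_order = max(candidates, key=lambda qty: profit(qty, forecast))

print("average demand so far:", average_demand)
print("forecast for week 7:", forecast)
print("recommended order quantity:", best_order)
```

The same data feeds all three steps: the descriptive step reports on the past, the predictive step estimates the future, and the prescriptive step turns that estimate into a recommended decision.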

In my opinion, to be able to claim the title of a Data Scientist, one must acquire adequate knowledge and skills in at least one of these three areas. My suggestion is to select two areas of specialization, with one of them being Descriptive. To put it another way, one can focus on either the pair of Descriptive and Predictive, or Descriptive and Prescriptive. It is important to remember that this is a long and arduous path. There is no definite beginning, route, or end. It is likely to be a journey of learning that will continue throughout one’s life.

The good news is that, as Industrial Engineering students, you are luckier than students in many competing fields in this long and arduous process. In my opinion, the Industrial Engineering curriculum is tailored to produce Data Scientists. Our curricula generally contain courses related to descriptive, predictive, and prescriptive analytics, and the department has a mathematically inclined progression that enhances our analytical aptitude. In many courses, you learn algorithms that will develop your analytical power. Upon graduation, you can read and write algorithms, and you have knowledge and ideas about popular subfields of Data Science such as statistics, data analysis and visualization, operations research, and machine learning. Since you are already familiar with the terminology in these fields, the specialization studies that follow will come more easily to you.

It is my belief that it is quite difficult, if not impossible, to learn the analytical methods of Data Science without enrolling in a dedicated long-term training program. As you will see in the next section, I will be directing you to many free resources related to programming. However, this is not possible for analytical methods. For example, it is quite difficult to find a free course on statistics, simulation, or optimization that is both appropriate for your level and of high quality. Therefore, my advice is to pay attention to your classes during your Industrial Engineering education. Do not dismiss a course as unnecessary; you may underestimate how useful it will be to you in the future. Give your curriculum its due respect and do not disregard the instructors who are teaching you something. Reach out to your professors who work on these topics with project ideas, or tell them that you want to work on their existing projects. Set a goal for yourself to become an expert in two of the above analytical fields. If possible, do many projects, while you are a student or after you graduate, without questioning how necessary or how large they are. If possible, pursue postgraduate training in that field and write a thesis.

I am sure that if you make good use of your four years in the program, you will be a good Data Scientist candidate when you graduate. After that, you can begin your journey from candidate to expert through the applied projects and postgraduate studies you take on.

(b) Computer Programming


Knowledge of computer programming is essential in order to apply the basic analytical methods you will find in your curriculum and to automate your work. You can be sure that, in real-life problems, you would not want to work through simplex tableaus or matrix multiplications by hand the way you do in your Operations Research exams. Even if you really love doing this manually (I haven’t met a student who does yet), or love doing gradient descent iterations by hand for your Artificial Neural Network model, you must recognize how impractical this approach is for larger-scale real-life problems.
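As a small illustration of what the computer takes off your hands, the sketch below runs one hundred gradient descent iterations in a few lines of Python. It is a toy under stated assumptions: the function being minimized, f(x) = (x - 3)^2, the starting point, the learning rate, and the number of steps are all chosen for illustration, and a real neural network would have many parameters instead of one.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Toy objective (an assumption for illustration): f(x) = (x - 3)^2,
# whose derivative is 2 * (x - 3) and whose minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)

print(round(minimum, 4))  # prints 3.0
```

Each pass through the loop is one of the iterations you would otherwise compute by hand; changing `steps` from 100 to 100,000 costs you nothing but a few keystrokes.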

Sadly, Industrial Engineering students don’t have the same luck in computer programming as they do in (a). Although there have been improvements in many departments, we are unfortunately still lagging behind in terms of programming courses. We still have departments that primarily teach languages such as Java, C, and their derivatives, while Python, R, and Julia are overlooked. And what can we say about teaching Matlab? How logical is it to teach students a paid, proprietary language when there are free and better alternatives? Java, C, and similar languages are certainly great in their own right. They can even do things that cannot be done with Python, R, or Julia, but the typical Industrial Engineer’s programming work is rarely that technical. Instead, they usually work on using and manipulating data. For such tasks, the trio of Python, R, and Julia should be preferred because these languages are free, easy to learn, and backed by a vast, high-quality collection of open-source libraries. Therefore, it is my wish that all Industrial Engineering curricula switch to these languages as soon as possible.

The first piece of good news is that it is not difficult to acquire programming skills. In fact, you can access courses for free that are better than what is taught in our schools. Given that free courses from Harvard, MIT, and the University of Michigan are available, it is wise to choose this learning path.

The second piece of good news is that the Python, R, and Julia programming languages are user-friendly and easy to learn. I refer to this trio as the ‘Super PRJ’ trio. As data scientists, we can use these languages to collect, understand, and visualize data, and to apply analytical methods such as machine learning and mathematical optimization. Fortunately, these three languages offer us more than we need, they are all free, and they have large support communities. It is very unlikely that you will fail to find an answer to one of your questions. Out of respect for those who coded with floppy disks in the old days, I don’t want to exaggerate by talking too much about the past, but when we were learning programming in our undergraduate program, all we could do when we got stuck was to open the program’s own help pages and look for a clue. This was very inadequate for specific problems. Nowadays, even if you have a very specific problem, I’m sure you’ll find the answer with a simple Google search using the right keywords.

The third piece of good news is that once you can code in one of Python, R, and Julia, it is easy to switch to the other two. I started programming with R and I have been a fan of R since the first day. Whenever I get a new task, I can’t help but open R first. About three years ago, when I started consulting in the business world, I began coding in Python due to customer requests. To be honest, it was very easy for me to transition between the two languages. I’m sure it will be the same for you. Initiatives like Posit already suggest that these languages will become “one” in the near future. Given the progress in AI engines, I suspect that in the near future we will be able to write code in R and have the AI write the Python version.

Getting Started with Computer Programming

Learning a programming language for the first time can seem strange and difficult. Computers may seem smart, but they don’t always understand what you’re saying. The commands you give them must be in a precise and correct format. If you say something wrong, they return an error message and figuring out the cause of the error can take hours initially. I assure you that the hardest part is the beginning. Once you’ve made a start, your progress will be very rapid. Furthermore, the Python, R and Julia super-trio have huge communities that answer nearly every question you may have. I can’t remember a single question I haven’t been able to find an answer to. So don’t worry, it will be easy to access answers to your questions. Plus, if you have a question you can’t find an answer to, you’re likely on the brink of finding something new and exciting. Lastly, once you’ve become familiar with the Python, R and Julia trio, it can be helpful to learn some web design and programming at a very basic level.
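Here is a minimal sketch of the kind of "precise format" problem beginners run into, written in Python. The mistake is deliberate and chosen for illustration: Python's built-in function is `print`, and because names are case-sensitive, `Print` is treated as an unknown name.

```python
try:
    Print("hello")  # deliberate mistake: capital P, Python only knows print
except NameError as error:
    # Combine the error type and its description, as the interpreter would show them.
    message = f"{type(error).__name__}: {error}"

print(message)
```

The message tells you the type of the error and the name Python could not find. Reading it word by word, and then searching for it if needed, is usually the fastest route to the fix.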

Which language?

As I mentioned above, I believe this question won’t be asked in 10 years’ time. With the emergence of current startups such as Posit and advancements in artificial intelligence engines, I believe that data science programming languages will eventually be seen as one language under the same roof.

Still, if you are someone who cannot accept anything other than a definitive answer, mine would be this: learn Python first and then R. This is not because Python is better than R. In fact, even though I have been forced to work more in Python lately, I always miss R. The reason for this order is that the best programming course I know of that offers its entire content for free is Dr. Chuck’s Python for Everybody course.

Python Beginner's Guide

Without hesitation, I recommend the Python for Everybody course. All of the course material is available for free to everyone.

If you want to follow the course on its own website, go to https://www.py4e.com/.

If you want to follow the course on Coursera, you can take the following courses in succession:

  1. Programming for Everybody (Getting Started with Python)
  2. Python Data Structures
  3. Using Python to Access Web Data
  4. Using Databases with Python

The content on the course website and on Coursera is generally the same. Following it on Coursera may be more organized. The course can be followed for free without getting a certificate. There is also a free way to get a certificate. Although these certificates may not have much of an impact, knowing that there is a certificate at the end of the course can sometimes motivate you better. You can read the details about enrolling in the course and getting a certificate in my post An Introductory Guide to MOOC.

After these basic courses, the following course is also quite good for learning more about Python’s data science libraries: Python Data Analysis.

Since the content of these courses is quite comprehensive, I won’t suggest any additional books. If you can’t attend the course and want to study from a book, you can follow the course book available at https://www.py4e.com/, which gives you all the course content in written form.

R Beginner's Guide

For R, I recommend the Data Science program of Professor Rafael Irizarry from Harvard. It’s a beautiful course designed from start to finish for data work with R: HarvardX Data Science.

Of course there is a lot of detail in this program. I’m writing down the parts I would recommend below. You can access the courses for free with the Audit Track option. For more information about enrolling for free and getting a certificate, you can read my article An Introductory Guide to MOOC.

  1. Data Science: R Basics

  2. Data Science: Visualizations

  3. Data Science: Productivity Tools

I also recommend the following books. All are available for free access:

  1. Introduction to Data Science (Rafael A. Irizarry): This is the book by Professor Irizarry, who teaches the HarvardX course above. His videos in that course follow the book.

  2. Hands-On Programming with R (Garrett Grolemund): If you want to start from the basics, this book is ideal.

  3. R for Data Science (Hadley Wickham, Garrett Grolemund): A book recommended by the R community.

Web Programming Beginner's Guide

Finally, I didn’t mention it above, but I believe a Data Scientist should have some understanding of web programming and design as well. Eventually, when you start pulling data from the web or using application APIs, you will need it. Once you are comfortable with one of Python, R, or Julia I would recommend making the jump. This sequence may help you to keep your energy more focused.

For web design, you should aim for a basic level of knowledge in HTML and CSS, and for web programming, in PHP and SQL. Dr. Chuck’s series, Web Applications For Everybody, is very good, just like his Python course, and provides basic knowledge of HTML and CSS. This should be enough if you are not going to do additional web design work. All the content is available for free here: Web Applications For Everybody.

If you prefer to follow the course on Coursera, you can take the following courses in order:

  1. Building Web Applications in PHP
  2. Introduction to Structured Query Language (SQL)
  3. Building Database Applications in PHP

If you enjoy web design and wish to expand your knowledge, I would recommend the following program:

Web Design for Everybody: Basics of Web Development & Coding Specialization, which includes the following courses:

  1. Introduction to HTML5
  2. Introduction to CSS3
  3. Interactivity with JavaScript
  4. Advanced Styling with Responsive Design
