Introduction to Data Science: Resources Available Online
Data Science is a rapidly developing field, and demand for data scientists is rising steadily. According to LinkedIn, job openings for data scientists have increased by 56% over the past year. More and more people want to start a career in Data Science or plan to use Data Science techniques in their work. An important question arises for anyone following this route: “Where can I start learning Data Science?”
There is no simple answer to this question. Data Science is a complex, multidisciplinary field. It employs techniques and theories from statistics, multivariable calculus, linear algebra, and Machine Learning. Data scientists need good knowledge of these fields, as well as strong programming and data visualization skills. There are many offline and online university programs for those who want to earn a degree in Data Science. In this article, we consider the case of a person who already has enough background in math, statistics, and programming, and focus on online resources specifically for Data Science.
The basic concepts and techniques of Data Science can be learned in different ways, but in general it is better to use a resource that gives a complete picture of the subject, such as a MOOC. E-books are also very useful for understanding the basic concepts. Books usually cover a subject in more depth, but with less breadth, than MOOCs. So, in my opinion, the best way to start is to find a MOOC or e-book that matches your skill level in the prerequisite areas mentioned above. For your reference, we have listed below some MOOC platforms, courses, and e-books that can be helpful for beginners.
MOOCs:
Coursera:
· Machine Learning by Stanford University. This is an extremely popular (2M+ enrolled participants) and highly-rated (4.9) course. It includes multiple case studies and applications.
· Applied Data Science with Python Specialization by the University of Michigan. This program contains 5 courses which are a good introduction to Data Science through the Python programming language.
edX:
· Statistics and Data Science by the Massachusetts Institute of Technology. This advanced, graduate-level program includes probability and statistics courses as part of a well-rounded Data Science curriculum.
· Microsoft Professional Program in Data Science by Microsoft. This program is focused mostly on learning key Data Science tools and widely-used programming languages.
Udemy:
· Python for Data Science and Machine Learning Bootcamp. This course is a good way to learn how to use NumPy, Pandas, Seaborn, Matplotlib, Scikit-Learn, and other tools used in Data Science.
This list of courses is incomplete, and many new courses appear in this field each year. You can also search for courses on other platforms, such as DataCamp, fast.ai, and Udacity. Notably, courses on Coursera and edX are usually better for theory and foundational material. This is not surprising, since they are developed with the participation of leading universities. Udemy courses are generally better for more applied learning material.
E-books:
- Python Data Science Handbook and Python for Data Analysis. These books introduce the core libraries for working with data in Python. It is worth mentioning here that O’Reilly Media has published many wonderful books on Data Science.
- Understanding Machine Learning: From Theory to Algorithms. This book helps you understand machine learning algorithms in more depth.
Explanations of Data Science tools
Both beginners and experienced data scientists regularly need detailed explanations of specific functionality in Data Science tools. These can be found in the documentation of the tools themselves, such as scikit-learn, Python, NumPy, pandas, matplotlib, IPython, and NLTK. And, of course, Stack Overflow is another widely used source for troubleshooting errors and looking for ideas (Cross Validated is also helpful).
Data
Along with techniques and tools, another important part of Data Science is the data itself, in particular datasets that can be used for learning. There are several standard datasets that are mostly clean, small enough to fit into memory and to review in a spreadsheet, and wonderful for demonstrating a new learning technique. Most of the sources listed above use datasets such as the Wine Quality Dataset, the Iris Flowers Dataset, the Banknote Dataset, and some others. The problem is that with these standard datasets you cannot really gauge how well your technique performs when the complexity, size, and noisiness of the data are scaled up. Furthermore, these datasets are extremely boring. So, soon after you start learning, you will want to test your knowledge on real-life datasets. Where can you find such datasets?
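To make the point about noisiness concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a benchmark: the data is synthetic (two classes with a single feature, centred at 0.0 and 2.0), the classifier is a simple nearest-centroid rule chosen for brevity, and the helper names (make_dataset, fit_centroids, predict, accuracy) are hypothetical. It trains and evaluates the same method at three noise levels.

```python
import random
import statistics


def make_dataset(n_per_class, noise, rng):
    """Synthetic data (an assumption for illustration): one feature, two classes."""
    data = []
    for label, centre in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            # Each sample is the class centre plus Gaussian noise.
            data.append((rng.gauss(centre, noise), label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Compute the mean feature value of each class in the training set."""
    return {
        label: statistics.mean(x for x, y in train if y == label)
        for label in (0, 1)
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(noise, n_per_class=500, seed=0):
    """Train on half of the data, then report accuracy on the other half."""
    rng = random.Random(seed)
    data = make_dataset(n_per_class, noise, rng)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, x) == y for x, y in test)
    return correct / len(test)


# Same technique, increasingly noisy data.
results = {noise: accuracy(noise) for noise in (0.2, 1.0, 3.0)}
for noise, acc in results.items():
    print(f"noise={noise}: accuracy={acc:.2f}")
```

In this sketch the classifier looks nearly perfect on the clean data and noticeably worse once the classes overlap. Real-life datasets are often closer to the noisy case, which is why it is worth moving beyond the standard examples.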
There are plenty of data sources available online
Most governments grant free access to some of their data: European government datasets, US government data, Indian government datasets, and many others. It is also possible to find datasets from companies such as Amazon, Google, and Microsoft. One of the best places to find plenty of real-life datasets is Kaggle. This platform has a large community where you can discuss data, and within this community you can find public code with algorithms that solve prediction problems on specific datasets.
Many data scientists believe that the fastest way to learn Data Science is by taking part in competitions. Data competitions are a great way to learn best practices and gather feedback on your work. Data competition platforms like Kaggle, DrivenData, CodaLab, CrowdANALYTICX, etc., are also sources of datasets and of a professional community.
It is also worth noting some useful online platforms where you can find articles about Data Science, code examples, and Data Science news. KDnuggets, Quora, Reddit, Data Science Central, and Dataconomy are good options, and much more material is available on Twitter, Facebook, YouTube, etc.
Conclusion
In conclusion, data scientists hold different job types according to their roles in a data science team: Data Analyst, Data Scientist, Data Engineer, and Machine Learning Engineer. The learning path for each is slightly different, but the starting point described above is common to all of the job types mentioned.