Tableau? PowerBI? QlikView? Looker Studio? Excel? These are all very nice tools. (Yes, even Excel. Put your snobbery aside.)
However, they’re not essential to the data science workflow. Data scientists might use them when they need to make reports, or in conjunction with other tools. This article is not about them.
It is about data visualization tools that have become an integral part of the data science workflow, the ones that will help you get through every data science project unscathed.
1. matplotlib (Python)
matplotlib is one of the most commonly used Python data visualization libraries. It has long been the benchmark that newer Python libraries try to surpass. It is a highly customizable library that lets you modify every detail of your static, interactive, or animated plots, from colors and fonts to plot layout and labels.
In addition, matplotlib is the foundation for many other Python plotting libraries. For example, seaborn is built on matplotlib. With seaborn, you can create more polished visualizations than with plain matplotlib, and with significantly less code. It’s also great for statistical plots, such as box plots, heatmaps, and pair plots, and for working with DataFrames, as it integrates easily with pandas.
Use in Data Science: Exploratory Data Analysis (EDA) and Model Assessment
matplotlib is used early in the data science workflow during EDA to understand data distributions and relationships. It’s also common in visualizing the results of models during the assessment stage.
Use Scenarios:
- Trends over time with line graphs
- Data distributions with histograms
- Multiple series on one graph for comparison
Pros:
- Versatile and customizable
- Great for complex visualizations
Cons:
- Verbose syntax
- Steep learning curve
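As a quick sketch of how matplotlib fits into EDA (the data here is a made-up toy series, not from any real dataset), a trend line and a distribution histogram side by side might look like:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Toy data standing in for a real dataset
rng = np.random.default_rng(42)
days = np.arange(30)
sales = 100 + days * 2 + rng.normal(0, 5, size=30)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Trend over time with a line graph
ax1.plot(days, sales, label="daily sales")
ax1.set_xlabel("day")
ax1.set_ylabel("sales")
ax1.legend()

# Data distribution with a histogram
ax2.hist(sales, bins=10, edgecolor="black")
ax2.set_xlabel("sales")
ax2.set_ylabel("frequency")

fig.tight_layout()
fig.savefig("eda_overview.png")
```

Every element here — axes, labels, bin count, figure size — is set explicitly, which is exactly the verbosity and the customizability the pros and cons above describe.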
2. Plotly (Python)
Plotly can produce static visualizations, too, but it is especially suited to interactive visualizations, where you can zoom, hover, and animate data. This makes Plotly a go-to choice for building dashboards and web-based visualizations. Combined with Dash, Plotly is very popular for web applications.
Use in Data Science: Data Presentation and Interactive Dashboards
Plotly is mainly used at the end of the workflow when you want to make final presentations to stakeholders and allow them to explore data.
Use Scenarios:
- Creating interactive dashboards that allow users to filter and explore data
- Displaying large datasets where zooming into details is required
- Representing geographic data on interactive maps
Pros:
- Creating interactive visualizations requires minimal setup
- Easily integrates with web apps
Cons:
- A steeper learning curve for advanced customization
3. Streamlit (Python)
Streamlit is a Python framework for creating interactive data apps with minimal coding. It’s integrated with many Python libraries, such as pandas, matplotlib, and Plotly. So, all you need to do is write a Python script, and Streamlit will take care of the rest, from the back end to the UI. It easily handles dynamic content, allowing you to combine user input, data visualization, and machine learning in one application or dashboard.
Use in Data Science: Interactive Data Applications and Dashboards
Streamlit can be used in EDA, data cleaning, modeling, and experimentation, but it really shines at the end of the workflow, when you need to create interactive dashboards and data apps to present insights.
Use Scenarios:
- Interactive dashboards
- ML model apps to see model predictions
- Customizable web apps to showcase data analysis results
Pros:
- Quick setup
- Minimal code required
- No front-end development skills required
Cons:
- Limited front-end design customization
- Not well suited for more complex web applications
4. D3.js (JavaScript)
D3.js, or Data-Driven Documents, is a very flexible JavaScript library. It covers everything from simple bar charts to complex and interactive visualizations by allowing you to bind data to the Document Object Model (DOM). With this library, you have complete control over customizing web-based visualizations.
Use in Data Science: Data Presentation and Web Applications
This library is primarily used in the final stages of a data science project when you want to build custom web-based applications or interactive visualizations.
Use Scenarios:
- Creating real-time data visualizations in web applications
- Making interactive infographics and custom visual data reports
- Animating transitions in visualizations to explain data trends better
Pros:
- Ultimate flexibility
- Perfect for web-based interactive visualizations
Cons:
- Steep learning curve
- Verbose, low-level API that requires solid JavaScript and DOM knowledge
5. ggplot2 (R)
ggplot2 is a visualization package for the R programming language based on the ‘grammar of graphics’ approach to creating graphs. This makes creating visualizations very intuitive and allows for high customizability: you can define size, shape, color, bars, lines, points, and more.
Use in Data Science: EDA and Model Assessment
ggplot2 is commonly used to visualize data trends and distributions and create plots for reports and publications.
Use Scenarios:
- Making statistical plots
- Visualizing model performance
- Faceting a plot to compare trends across multiple data subsets
- Visualizing categorical trends
Pros:
- Ease of use due to a declarative approach
- Publication-quality visuals
Cons:
- Limited to the R ecosystem
- Interactivity requires additional packages, such as plotly or shiny
Conclusion
Which and how many tools you choose depends on your professional needs. In most cases, these five will have you covered at every stage of the data science workflow that requires visualizing data. You can create anything with them, from simple static plots to complex, interactive, animated, or web-based visualizations and dashboards.
This gives you many options for nice-looking visualizations that help you gain data insights during EDA and model assessment.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.