Jupyter Scatter: Interactive Exploration of Large-Scale Datasets (2024)


Fritz Lekschas, Ozette Technologies, Seattle, WA, USA
Trevor Manz, Harvard Medical School, Boston, MA, US

Abstract

Jupyter Scatter is a scalable, interactive, and interlinked scatterplot widget for exploring datasets in Jupyter Notebook/Lab, Colab, and VS Code. Its goal is to simplify the visual exploration, analysis, and comparison of large-scale bivariate datasets. Jupyter Scatter can render up to twenty million points, supports fast point selections, integrates with Pandas DataFrame and Matplotlib, uses perceptually-effective default settings, and offers a user-friendly API.

Keywords Python, Jupyter widget, scatterplot, 2D scatter, interactive data visualization, embedding plot, WebGL

1 Summary

Jupyter Scatter is a scalable, interactive, and interlinked scatterplot widget for exploring datasets in Jupyter Notebook/Lab, Colab, and VS Code (Figure 1). Thanks to its WebGL-based rendering engine [1], Jupyter Scatter can render and animate up to several million data points. The widget focuses on data-driven visual encodings and offers perceptually-effective point color and opacity settings by default. For interactive exploration, Jupyter Scatter features two-way zoom and point selections. Furthermore, the widget can compose multiple scatterplots and synchronize their views and selections, which is useful for comparing datasets. Finally, Jupyter Scatter’s API integrates with Pandas DataFrames [2] and Matplotlib [3] and offers functional methods that group properties by type to ease accessibility and readability. Extensive documentation and how-to guides can be found at https://jupyter-scatter.dev, and the code is available at https://github.com/flekschas/jupyter-scatter.

2 Usage Scenario

Jupyter Scatter simplifies the visual exploration, analysis, and comparison of large-scale bivariate datasets. It renders up to twenty million points smoothly, supports fast point selections, integrates with Pandas DataFrame [2], uses perceptually-effective default encodings, and offers a user-friendly API.

In the following, we demonstrate its usage for visualizing the GeoNames dataset [10], which contains data about 120k cities worldwide. For instance, to visualize cities by their longitude/latitude and color-code them by continent (Figure 2, left), we create a Scatter widget as follows.

    import jscatter
    import pandas as pd

    geonames = pd.read_parquet(
        'https://paper.jupyter-scatter.dev/'
        + 'geonames.pq'
    )

    scatter = jscatter.Scatter(
        data=geonames,
        x='Longitude',
        y='Latitude',
        color_by='Continent',
    )
    scatter.show()

Without specifying a color map, Jupyter Scatter uses the categorical colorblind-safe palette from [11] for the Continent column, which has seven unique values. For columns with continuous data, it automatically selects Matplotlib’s [3] Viridis color palette. As shown in Figure 1 and Figure 2 (left), Jupyter Scatter dynamically adjusts the point opacity based on the point density within the field of view. This means points become more opaque when zooming into sparse areas and more transparent when zooming out to an area that contains many points. The dynamic opacity addresses over-plotting issues when zoomed out and visibility issues when zoomed in.
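The idea behind density-based opacity can be illustrated with a small sketch. The heuristic below is an assumption made purely for illustration (opacity inversely proportional to how much of the view the points would cover, clamped to a range). It is not taken from Jupyter Scatter or regl-scatterplot, and the function name and all parameters are hypothetical.

```python
def density_opacity(
    num_points_in_view,
    view_area_px,
    point_area_px=9.0,
    target_coverage=0.5,
    min_opacity=0.05,
    max_opacity=1.0,
):
    """Illustrative heuristic only, not the library's actual formula."""
    if num_points_in_view == 0:
        return max_opacity
    # Fraction of the view that would be covered if no points overlapped
    coverage = num_points_in_view * point_area_px / view_area_px
    # Denser views get lower opacity; clamp to keep points visible
    opacity = target_coverage / coverage
    return max(min_opacity, min(max_opacity, opacity))


view_area = 800 * 600  # hypothetical canvas size in pixels
zoomed_out = density_opacity(1_000_000, view_area)  # many points in view
zoomed_in = density_opacity(500, view_area)  # few points in view
```

With these made-up numbers, the zoomed-out view is clamped to the minimum opacity while the zoomed-in view is fully opaque, which matches the qualitative behavior described above.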

Jupyter Scatter offers many ways to customize the point color, size, and opacity encodings. To simplify configuration, it provides topic-specific methods for setting up the scatterplot, rather than requiring all properties to be set during the instantiation of Scatter. For instance, as shown in Figure 2 (right), the point opacity (0.5), size (asinh-normalized), and color (log-normalized population using Matplotlib’s [3] Magma color palette in reverse order) can be set using the following methods.

    from matplotlib.colors import AsinhNorm, LogNorm

    scatter.opacity(0.5)
    scatter.size(
        by='Population',
        map=(1, 8, 10),
        norm=AsinhNorm(),
    )
    scatter.color(
        by='Population',
        map='magma',
        norm=LogNorm(),
        order='reverse',
    )
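To see why non-linear normalizations such as log and asinh suit heavy-tailed values like city populations, the following sketch rescales a few hypothetical population values to [0, 1] using only the standard library. It mirrors the idea behind the normalizations, not Matplotlib's implementation: AsinhNorm and LogNorm take additional parameters (for example, a linear width) and may yield different numbers.

```python
import math


def rescale(values, transform=lambda v: v):
    """Apply a transform, then min-max rescale the result to [0, 1]."""
    transformed = [transform(v) for v in values]
    lo, hi = min(transformed), max(transformed)
    return [(t - lo) / (hi - lo) for t in transformed]


# Hypothetical populations spanning four orders of magnitude
populations = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]

linear_scaled = rescale(populations)
log_scaled = rescale(populations, math.log10)
asinh_scaled = rescale(populations, math.asinh)

for pop, lin, lg, ash in zip(populations, linear_scaled, log_scaled, asinh_scaled):
    print(f"{pop:>10,d}  linear={lin:.4f}  log={lg:.4f}  asinh={ash:.4f}")
```

Under linear scaling, all but the largest values are squeezed toward zero, whereas the log and asinh transforms spread the values roughly evenly, so differences among smaller cities remain visible in size and color.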

To aid interpretation of individual points and point clusters, Jupyter Scatter includes legends, axis labels, and tooltips. These features are activated and customized via their respective methods.

    scatter.legend(True)
    scatter.axes(True, labels=True)
    scatter.tooltip(
        True,
        properties=['color', 'Latitude', 'Country'],
        preview='Name',
    )

The tooltip can show a point’s data distribution in the context of the whole dataset and include a text-, image-, or audio-based media preview. For instance, the example (Figure 2, right) shows the distribution of the visually encoded color property as well as the Latitude and Country columns. For numerical properties, the distribution is visualized as a bar chart, and for categorical properties, it is visualized as a treemap. As the media preview, we show the city name.

Exploring a scatterplot often involves studying subsets of the points. To select points, one can either long press and lasso-select points interactively in the plot (Figure 3, bottom left) or query-select points (Figure 2, right) as shown below. In this example, we select all cities with a population greater than ten million.

    scatter.selection(
        geonames.query('Population > 10000000').index
    )

The selected cities can be retrieved by calling scatter.selection() without any arguments. It returns the data record indices, which can then be used to get back the underlying data records.

    geonames.iloc[scatter.selection()]
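Since selections are exchanged as record indices, the round trip from a query to indices and back to records can be illustrated with pandas alone. The tiny DataFrame below is made up for illustration and no scatter widget is involved. The sketch assumes a default RangeIndex, so that the labels returned by `.index` coincide with the positions expected by `.iloc`.

```python
import pandas as pd

# Made-up records standing in for the GeoNames data
df = pd.DataFrame(
    {
        "Name": ["City A", "City B", "City C", "City D"],
        "Population": [12_000_000, 700_000, 15_000_000, 650_000],
    }
)

# Query-select: the index of the matching rows is what would be
# passed to scatter.selection(...)
selected = df.query("Population > 10000000").index

# Map the indices back to the underlying records
records = df.iloc[selected]
print(records)
```

If the DataFrame had a non-default index (for example, after filtering or sorting), the labels and positions would no longer match, and `.loc` or an explicit `reset_index()` would be needed instead.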

To automatically register changes to the point selection, one can observe the scatter.widget.selection traitlet. The observability of the selection traitlet (and many other properties of scatter.widget) makes it easy to integrate Jupyter Scatter with other Jupyter Widgets.
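Observing a traitlet follows the standard traitlets pattern, sketched below with a hypothetical stand-in class so that the example runs without a notebook front-end. `SelectionModel` is not part of Jupyter Scatter, and its `selection` trait is declared as a plain list for simplicity; the trait type used by the actual widget may differ.

```python
from traitlets import HasTraits, List


class SelectionModel(HasTraits):
    """Hypothetical stand-in for scatter.widget; mimics only `selection`."""

    selection = List()


observed = []


def on_selection_change(change):
    # `change` carries the trait name plus the old and new values
    observed.append((change["name"], list(change["new"])))


model = SelectionModel()
model.observe(on_selection_change, names="selection")

# Assigning a new value triggers the registered handler
model.selection = [3, 14, 15]
print(observed)
```

With a real scatter instance, the same handler would presumably be registered via `scatter.widget.observe(on_selection_change, names="selection")`.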

For instance, Figure 3 (left) shows a UMAP [12] embedding of the Fashion MNIST dataset [13] where points represent images and the point selection is linked to an image widget that loads the selected images.

    import ipywidgets
    import jscatter

    fashion_mnist = pd.read_parquet(
        'https://paper.jupyter-scatter.dev/'
        + 'fashion-mnist-embeddings.pq'
    )

    # Custom image widget
    images = ImagesWidget()

    scatter = jscatter.Scatter(
        data=fashion_mnist,
        x='umapX',
        y='umapY',
        color_by='class',
        background_color='black',
        axes=False,
    )

    ipywidgets.link(
        (scatter.widget, 'selection'),
        (images, 'images')
    )

    ipywidgets.AppLayout(
        center=scatter.show(),
        right_sidebar=images
    )

Comparing two or more related scatterplots can be useful in various scenarios. For example, with high-dimensional data, it might be necessary to compare different properties of the same data points. Another scenario involves embedding the high-dimensional dataset and comparing different embedding methods. For large-scale datasets, it might be useful to compare different subsets of the same dataset or entirely different datasets. Jupyter Scatter supports these comparisons with synchronized hover, view, and point selections via its compose method.

For instance, there are many ways to embed points into two dimensions, including linear and non-linear methods, and comparing point clusters between different embedding methods can be insightful. In the following, we compose a two-by-two grid of four embeddings of the Fashion MNIST dataset [13] created with PCA [14], UMAP [12], t-SNE [15], and a convolutional autoencoder [16]. As illustrated in Figure 3 (right), the point selection of the four scatterplots is synchronized.

    from jscatter import Scatter, compose

    config = dict(
        data=fashion_mnist,
        color_by='class',
        legend=True,
        axes=False,
        zoom_on_selection=True,
    )

    pca = Scatter(x='pcaX', y='pcaY', **config)
    tsne = Scatter(x='tsneX', y='tsneY', **config)
    umap = Scatter(x='umapX', y='umapY', **config)
    cae = Scatter(x='caeX', y='caeY', **config)

    compose(
        [
            (pca, "PCA"),
            (tsne, "t-SNE"),
            (umap, "UMAP"),
            (cae, "CAE"),
        ],
        sync_selection=True,
        sync_hover=True,
        rows=2,
    )

Note that by setting zoom_on_selection to True and synchronizing selections, selecting points in one scatterplot automatically selects and zooms in on those points in all scatterplots.
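Conceptually, synchronized selections amount to keeping the same trait value in step across several objects. The sketch below shows this with `traitlets.link` on hypothetical stand-in objects. How `compose` implements synchronization internally is not described here, so this illustrates the general mechanism and not Jupyter Scatter's code.

```python
from traitlets import HasTraits, List, link


class View(HasTraits):
    """Hypothetical stand-in for a scatterplot's widget model."""

    selection = List()


pca_view, tsne_view, umap_view = View(), View(), View()

# Chain bidirectional links so a change in any view reaches the others
links = [
    link((pca_view, "selection"), (tsne_view, "selection")),
    link((tsne_view, "selection"), (umap_view, "selection")),
]

# Selecting in the last view propagates back through the chain
umap_view.selection = [1, 2, 3]
print(pca_view.selection, tsne_view.selection)

# Links can be removed again when synchronization is no longer wanted
for l in links:
    l.unlink()
```

After unlinking, further changes to one view no longer affect the others.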

3 Statement of Need

Jupyter Scatter is primarily a tool for data scientists to visually explore and compare bivariate datasets. Its support for two-way point selections and synchronized plots enables interactive exploration and comparison in ways that are either not possible with existing widgets (e.g., multiple linked scatterplots) or require considerable effort to set up (e.g., two-way communication of point selections).

Further, due to its usage of traitlets [17], Jupyter Scatter integrates easily with other widgets, which enables visualization researchers and practitioners to build domain-specific applications on top of Jupyter Scatter. For instance, the Comparative Embedding Visualization widget [18] uses Jupyter Scatter to display four synchronized scatterplots for guided comparison of embedding visualizations. Andrés Colubri’s research group is actively working on a new version of their Single Cell Interactive Viewer which will be based on Jupyter Scatter.

4 Implementation

Jupyter Scatter has two main components: a Python program running in the Jupyter kernel and a front-end program for interactive visualization. The Python program includes a widget and an API layer. The widget defines the view model for drawing scatterplots, while the API layer simplifies defining the view model state, integrating with Pandas DataFrames [2] and Matplotlib [3]. The front-end program is built on top of regl-scatterplot [1], a high-performance rendering library based on WebGL, ensuring efficient GPU-accelerated rendering.

All components are integrated using anywidget [19] to create a cross-platform Jupyter widget compatible with various environments, including Jupyter, JupyterLab, Google Colab, VS Code, and dashboarding frameworks like Shiny for Python, Solara, and Panel. The Python program uses anywidget and ipywidgets [20] to communicate with the front-end, using binary data support to efficiently send in-memory data to the GPU, avoiding the overhead of JSON serialization. This approach enables the transfer of millions of data points from the Python kernel to the front-end with minimal latency. Bidirectional communication ensures the visualization state is shared between the front-end and kernel, allowing updates to scatterplot properties and access to states like selections. Coordination is managed using anywidget APIs, enabling connections to other ipywidgets like sliders, dropdowns, and buttons for custom interactive data exploration widgets.
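The benefit of binary transfer over JSON can be illustrated with the standard library alone. The sketch below packs hypothetical 32-bit float coordinates into a raw byte buffer and compares its size with a JSON encoding of the same values. It does not reproduce Jupyter Scatter's actual wire format and only shows why a text encoding adds overhead.

```python
import json
from array import array

n = 100_000  # hypothetical number of coordinates

# 32-bit floats, similar to what a GPU vertex buffer would hold
xs = array("f", (i / n for i in range(n)))

# Binary payload: a fixed number of bytes per value, no parsing needed
binary_payload = xs.tobytes()

# JSON payload: every value becomes a decimal string
json_payload = json.dumps(xs.tolist()).encode("utf-8")

# The binary buffer can be restored without any text parsing
restored = array("f")
restored.frombytes(binary_payload)

print(f"binary: {len(binary_payload):,} bytes")
print(f"json:   {len(json_payload):,} bytes")
```

Beyond the size difference, the receiving side can hand a binary buffer to the GPU as a typed array directly, whereas JSON must first be parsed into numbers.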

5 Related Work

There are many Python packages for rendering scatterplots in notebook-like environments. General-purpose visualization libraries like Matplotlib [3], Bokeh [21], or Altair [22] offer great customizability but do not scale to millions of points. They also don’t offer bespoke features for exploring scatterplots and require manual configuration.

More bespoke dataset-centric plotting libraries like Seaborn [23] or pyobsplot [24] require less configuration and make it easier to create visually-pleasing scatterplots but they still fall short in terms of scalability.

Plotly combines great customizability with interactivity and can render scatterplots of up to a million points. However, drawing many more points is challenging, and the library focuses more on generality than on dedicated features for scatterplot exploration and comparison. Plotly’s WebGL rendering mode is also bound by the number of WebGL contexts the browser supports (typically between 8 and 16), meaning that it can’t render more than 8 to 16 plots when using the WebGL render mode. Jupyter Scatter does not have this limitation as it uses a single WebGL renderer for all instantiated widgets, which is sufficient as static figures don’t need constant re-rendering and one will only ever interact with a single or a few plots at a time. Being able to render more than 8 to 16 plots can be essential in notebook environments as these are often used for exploratory data analysis.

Datashader [25] specializes in static rendering of large-scale datasets and offers unparalleled scalability that greatly exceeds that of Jupyter Scatter. One can also fine-tune how data is aggregated and rasterized. However, this comes at the cost of limited interactivity. While it’s possible to interactively zoom into a rasterized image produced by Datashader, the image is simply scaled rather than re-rendered for the new field of view. Re-rendering can be important, though, to better identify patterns in subsets of large scatterplots through optimized point size and opacity.

Finally, except for Plotly, none of the tools support interactive point selections, a key feature of Jupyter Scatter. Also, no other library offers direct support for synchronized exploration of multiple scatterplots for comparison.

Acknowledgements

We acknowledge and appreciate contributions from Pablo Garcia-Nieto, Sehi L’Yi, Kurt McKee, and Dan Rosén. We also thank Nezar Abdennur for his feedback on the initial API design.

References

[1] Fritz Lekschas. Regl-Scatterplot: A Scalable Interactive JavaScript-based Scatter Plot Library. Journal of Open Source Software, 8(84):5275, 4 2023. doi: 10.21105/joss.05275. URL https://joss.theoj.org/papers/10.21105/joss.05275.
[2] Wes McKinney. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56–61, 2010. doi: 10.25080/Majora-92bf1922-00a. URL https://doi.org/10.25080/Majora-92bf1922-00a.
[3] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55. URL https://doi.org/10.1109/MCSE.2007.55.
[4] Job Dekker, Frank Alber, Sarah Aufmkolk, Brian J Beliveau, Benoit G Bruneau, Andrew S Belmont, Lacramioara Bintu, Alistair Boettiger, Riccardo Calandrelli, Christine M Disteche, et al. Spatial and temporal organization of the genome: Current state and future aims of the 4d nucleome project. Molecular cell, 2023. doi: 10.1016/j.molcel.2023.06.018. URL https://doi.org/10.1016/j.molcel.2023.06.018.
[5] Peter Kerpedjiev, Nezar Abdennur, Fritz Lekschas, Chuck McCallum, Kasper Dinkla, Hendrik Strobelt, Jacob M Luber, Scott B Ouellette, Alaleh Azhir, Nikhil Kumar, et al. Higlass: web-based visual exploration and analysis of genome interaction maps. Genome biology, 19:1–12, 2018. doi: 10.1186/s13059-018-1486-1. URL https://doi.org/10.1186/s13059-018-1486-1.
[6] Florian Mair, Jami R Erickson, Marie Frutoso, Andrew J Konecny, Evan Greene, Valentin Voillet, Nicholas J Maurice, Anthony Rongvaux, Douglas Dixon, Brittany Barber, et al. Extricating human tumour immune alterations from tissue inflammation. Nature, 605(7911):728–735, 2022. doi: 10.1038/s41586-022-04718-w. URL https://doi.org/10.1038/s41586-022-04718-w.
[7] Evan Greene, Greg Finak, Fritz Lekschas, Malisa Smith, Leonard A D’Amico, Nina Bhardwaj, Candice D Church, Chihiro Morishima, Nirasha Ramchurren, Janis M Taube, Paul T Nghiem, Martin A Cheever, Steven P Fling, and Raphael Gottardo. Data Transformations for Effective Visualization of Single-Cell Embeddings, 2022. URL https://github.com/flekschas-ozette/ismb-biovis-2022.
[8] George Spracklin, Nezar Abdennur, Maxim Imakaev, Neil Chowdhury, Sriharsa Pradhan, Leonid A Mirny, and Job Dekker. Diverse silent chromatin states modulate genome compartmentalization and loop extrusion barriers. Nature structural & molecular biology, 30(1):38–51, 2023. doi: 10.1038/s41594-022-00892-7. URL https://doi.org/10.1038/s41594-022-00892-7.
[9] Rishabh Misra. News category dataset. 2022. doi: 10.48550/arXiv.2209.11429. URL https://arxiv.org/abs/2209.11429.
[10] GeoNames. GeoNames, 2024. URL https://www.geonames.org.
[11] Masataka Okabe and Kei Ito. How to make figures and presentations that are friendly to color blind people, 2002. URL https://jfly.uni-koeln.de/color/.
[12] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. 2018. doi: 10.48550/ARXIV.1802.03426. URL https://arxiv.org/abs/1802.03426.
[13] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017. doi: 10.48550/arXiv.1708.07747. URL https://arxiv.org/abs/1708.07747.
[14] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901. doi: 10.1080/14786440109462720. URL https://doi.org/10.1080/14786440109462720.
[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html.
[16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. 2013. doi: 10.48550/ARXIV.1312.6114. URL https://arxiv.org/abs/1312.6114.
[17] IPython development team. Traitlets: A lightweight Traits like module, 2024. URL https://github.com/ipython/traitlets.
[18] Trevor Manz, Fritz Lekschas, Evan Greene, Greg Finak, and Nils Gehlenborg. A general framework for comparing embedding visualizations across class-label hierarchies. 4 2024. doi: 10.31219/osf.io/puxnf. URL https://osf.io/puxnf.
[19] Trevor Manz. Anywidget: Jupyter widgets made easy, 2024. URL https://github.com/manzt/anywidget.
[20] Jupyter widgets community. ipywidgets: Interactive widgets for the jupyter notebook, 2015. URL https://github.com/jupyter-widgets/ipywidgets.
[21] Bokeh development team. Bokeh: Python library for interactive visualization, 2018. URL https://bokeh.pydata.org/en/latest/.
[22] Jacob VanderPlas, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. Altair: Interactive statistical visualizations for python. Journal of Open Source Software, 3(32):1057, 2018. doi: 10.21105/joss.01057. URL https://doi.org/10.21105/joss.01057.
[23] Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.
[24] Julien Barnier. Observable Plot in Jupyter notebooks and Quarto documents, 2024. URL https://github.com/juba/pyobsplot.
[25] Anaconda developers and community contributors. Datashader: Accurately render even the largest data, 2024. URL https://github.com/holoviz/datashader.