Interactive visual manipulation of large-scale line data

Short Abstract

As datasets grow in size, the demand for efficient rendering and manipulation of such data intensifies. This research paper explores how GPU shaders can significantly speed up the processing of large-scale line data. In addition, various (semi-automatic) interaction tools are explored to cluster line data optimally and to select which lines are in or out of focus.

Visual abstract


Title and Research Question

Interactive visual manipulation of large-scale line data

The term “large-scale” refers to thousands or millions of Parallel Coordinate Plot (PCP) lines; this can be, for example, millions of Reflectance Imaging Spectroscopy (RIS) data lines [PGK*22] or large-scale weather data [DPD*17] [ZSE*15]. I would like to focus on fuzzy clustering [SSLC02] of line groups by guiding expert users to pull certain lines into or out of context. My research questions are as follows:

  • (Main question) How can an expert, using the system developed in this paper, effectively cluster large-scale line data with minimal effort?
    • How can certain lines be selected to be in or out of focus with as few clicks as possible?
    • How can the rendering of large-scale line data be made more efficient?
    • How can a user be helped to cluster lines in a semi-automatic way? And, when an automatic selection is made, how can we make sure it is what the user actually intended to select?

Background

To gain a better understanding of how this research differs from previous work, it is important to understand the current state of the art. This section gives a short review of existing research in three key areas: large-scale high-dimensional data selection, data rendering techniques, and line data clustering methodologies.

Many authors have attempted to solve the problem of clustering (large-scale) point data [BD15, AH23]. This is done using a variety of techniques, such as a Mahalanobis brush [FH21], a fuzzy classifier [Y93, SSLC02], or K-means++ clustering with a lightweight neural network [WWS*23]. This paper will borrow some of these techniques and apply them to 2D line data.
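As an illustration of one such technique, the core idea behind a Mahalanobis brush can be sketched in a few lines: samples whose Mahalanobis distance to the distribution of the brushed region falls below a threshold are selected. This is a minimal sketch of the idea, not the actual method of [FH21]; the function name and threshold value are illustrative.

```python
import numpy as np

def mahalanobis_select(points, brushed, threshold=2.0):
    """Select rows of `points` whose Mahalanobis distance to the
    distribution of the brushed samples is below `threshold`.
    `points` and `brushed` are (n, d) arrays; returns a boolean mask."""
    mu = brushed.mean(axis=0)
    cov = np.cov(brushed, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = points - mu
    # Squared Mahalanobis distance per point: diff^T * cov^-1 * diff
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return d2 < threshold ** 2
```

For line data, the same distance test could be applied to per-line feature vectors (e.g. sampled y-values) rather than 2D points.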

Less research has been done on clustering large-scale line data. One of the few papers written about this is the work by T. Trautner et al. [TB21]. Their paper assumed that the data was pre-clustered and focused only on rendering it. However, their methods do prioritise well-segmented clusters and render lines in a way that makes them stand out. This paper will focus on effective clustering of the line data and will use their rendering technique to show the user the clusters.

When lines are not yet clustered, it can be difficult to tell them apart. A line plot with many groups (>6) is commonly called a “spaghetti plot” [DHH*02]. These plots are often used in meteorology to visualize weather patterns [SZD*10]. A common downside is that many lines visualized in one plot can be hard to read, even for an expert. A straightforward way to reduce this clutter is to display lines using visual channels from T. Munzner [M14], such as color, size, and transparency. These three characteristics are, for instance, used by A. Kamal et al. [KDJ*21]: assuming that the lines originate from a certain mean line with multiple branches of noise, the dataset is encoded using color (gradients) to visualize line uncertainties.

Lines can be aggregated to simplify the plot. One way is to measure how typical or outlying each line is. P. de Micheaux et al. [DMV20] propose a new statistical method called curve depth that helps analyze data represented as curves or trajectories and labels each line as typical or outlying. Aggregating with (iso)contours is another common visualization technique to reduce clutter in spaghetti plots [PWB*09]. F. Ferstl et al. [FKRW16] extend this idea by incorporating conditional probability visualizations to reveal joint occurrences of contours across different regions, offering a more informative and less cluttered alternative to traditional spaghetti plots.

Another statistical method is the use of Curve Density Estimates [LH11]. This method gives a rough sense of how the data is distributed in certain areas. When a million lines are drawn in a plot on an average screen of 1920x1080 pixels, the expected number of lines overlapping each pixel is on the order of 1000. With density methods, users can base their clustering on the density of lines. In addition, this extra context could give semi-automatic clustering a better starting point by selecting some equally dense areas across the x-axis.
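The overlap figure above can be checked, and a crude per-pixel density buffer sketched, with a few lines of NumPy. This is a simplified hit-count accumulation, not the kernel-based estimator of [LH11]; all names and the one-sample-per-column layout are illustrative assumptions.

```python
import numpy as np

def line_density(lines, width, height):
    """Accumulate per-pixel hit counts for polylines sampled at one
    y-value per pixel column -- a crude stand-in for a curve density
    estimate. `lines` is an (n_lines, width) array of y-values in
    [0, height)."""
    # Expected overlap sanity check: 1e6 lines, each crossing ~1920
    # columns, over 1920*1080 pixels:
    #   1e6 * 1920 / (1920 * 1080) ≈ 926 lines per pixel.
    counts = np.zeros((height, width), dtype=np.uint32)
    cols = np.arange(width)
    for ys in lines:
        rows = np.clip(ys.astype(int), 0, height - 1)
        counts[rows, cols] += 1
    return counts
```

On the GPU, the same accumulation could be done with additive blending into a texture instead of a CPU loop.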

To automate the work of clustering, J. Wei et al. [WYCM11] introduce a GPU-optimized Expectation-Maximization algorithm for clustering large-scale point data. Although this method can establish a good seed for reaching a local optimum, expert-led clustering is expected to achieve a better result [BDTSW22].
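To make the Expectation-Maximization idea concrete, the algorithm can be sketched on the CPU for a 1D Gaussian mixture. This is only an illustration of plain EM, not the GPU-parallel variant of [WYCM11]; the quantile-based initialisation and all names are illustrative choices.

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Cluster 1D values by fitting a k-component Gaussian mixture
    with plain Expectation-Maximization. Returns (weights, means,
    standard deviations)."""
    # Initialise means at spread-out quantiles, with a shared broad std.
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: soft responsibility of each component for each point
        # (unnormalised Gaussian density times mixture weight).
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and stddevs from the
        # soft assignments.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

The E-step is embarrassingly parallel over data points, which is what makes the GPU formulation attractive for large-scale data.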

An expert-led tool has been developed by A. Popa et al. [PGK*22]. This tool maps the line data, with the help of t-SNE, to a 2D point cloud. A multi-view approach shows the user the point and line data at the same time. By clustering the point data, the user can see which lines belong to the same cluster. This paper will focus on extending their tool to a higher dimension by selecting lines directly.

Evaluation metric

These evaluation metrics will be based on a user study with a few experts in various fields, to find out whether the application can be used to cluster lines in most general cases. The main evaluation metrics will be:

  1. A user is able to select and group lines out of a large cohort with low false-positive/false-negative (FP/FN) rates.
  2. When a semi-automatic brush is implemented, the steering required from the user should be as minimal as possible. This means the automatic part should score at least some target percentage on some artificial/real datasets when selecting/grouping lines.
  3. As an additional benefit of the project, it should be straightforward to create your own data plot using your own dataset.
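The first metric can be made concrete by scoring a user selection against a known ground-truth group. A minimal sketch, with all names illustrative:

```python
def selection_scores(selected, ground_truth, n_lines):
    """Score a user selection against a ground-truth group.
    `selected` and `ground_truth` are sets of line indices;
    `n_lines` is the total number of lines in the dataset."""
    tp = len(selected & ground_truth)   # correctly selected
    fp = len(selected - ground_truth)   # selected but wrong
    fn = len(ground_truth - selected)   # missed lines
    tn = n_lines - tp - fp - fn         # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall}
```

For fuzzy clusterings, the same idea could be extended with a membership threshold before building the sets.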

Weekly planning

As my thesis will be user-driven, I will spend at least a few hours every week on user feedback. I have planned the weekly schedule fairly conservatively; some weeks could eventually contain +50% of the work listed below. However, due to my user-driven approach, I must remain flexible and adjust my planning based on the feedback I receive. The current planning is broadly as follows:

  • Week 1-3: How to select lines using a basic brush
  • Week 4-6: Research and implement an effective way to cluster lines
  • Week 7-9: Implement a semi-automatic brush to give the user some guidance

Basic user feedback can occur after week 6, while the core of the application can be tested after week 9. The time beyond week 9 will be used to enhance the application further and explore how its core can be extended toward a more standalone technical paper.

  • Q1 - Have the core of the application ready for meaningful user feedback
  • Q2 - Explore possible extensions (as listed below)
  • Q3 - In addition to implementing others’ work, I will delve deeper into the technical details of rendering the lines. This will elevate the paper beyond simply being an application that renders a pre-selected dataset. A potential end goal could be to also render graphs, area plots, or other examples from the D3 library.

Possible extensions

Based on the results of the first nine weeks, I will decide on the next steps. These could be:

  • Going in depth on the rendering of millions of lines in a WebGPU environment
  • Implementing a more advanced brush
  • Use edge bundling [LHT17] to reduce the number of lines being drawn
  • Implement multiple (3+) levels of focus/context, giving the user a better way to see which lines are more or less likely to be added to the current selection. For instance:
    1. Selected lines (Highlight colour + Shadow)
    2. Selected Cluster groups (Faded Colour)
    3. Unselected Cluster groups (Display as Mean+Std similar to A. Popa [PGK*22])
    4. Noise lines (blurred in the background)
    5. All other lines (Density plot)
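The five levels above could be assigned per line by a small lookup. A minimal sketch, where all data structures (the selection set, cluster mapping, and noise set) are illustrative placeholders:

```python
def render_tier(line_id, selected, selected_clusters, cluster_of, noise):
    """Map a line to one of the five focus/context tiers listed above.
    `selected` is the set of directly selected line ids, `cluster_of`
    maps line ids to cluster labels (or is missing for unclustered
    lines), and `noise` is the set of noise-line ids."""
    if line_id in selected:
        return 1  # selected lines: highlight colour + shadow
    if cluster_of.get(line_id) in selected_clusters:
        return 2  # selected cluster groups: faded colour
    if line_id in noise:
        return 4  # noise lines: blurred in the background
    if cluster_of.get(line_id) is not None:
        return 3  # unselected cluster groups: mean + std envelope
    return 5      # all other lines: density plot
```

In a shader-based renderer, this tier would likely be precomputed per line and passed as a vertex attribute to pick the render style.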

During the Seminar

  • (Research) the work by T. Trautner (Line Weaver, Honeycomb Plots, Sunspot Plots)
  • (Implement) a WebGL version of [TB21] in 3D

Weeks ≤0 - Feb 1 - Mar 14

  • (Implement) Set up a prototype rendering 1k/1M lines in WebGPU
  • (Research) Decide on a title and research question
  • (Research) Read the first chapter (basics) of webgpufundamentals
  • (Write) Write a formal research proposal

Week 1 - Mar 17 - 21 (Research week)

Week 2 - Mar 24 - 28 (Going in depth)

  • (Implement) Render the selected lines on top using a depth buffer
  • (Implement) Make both ROW and COLUMN sizes based on data instead of hardcoded
  • (Research) How to best select lines (Mahalanobis brush by C. Fan et al.) or find others

Week 3 - Mar 31 - Apr 4 (Selection week)

  • (Implement) Enable the ability to select multiple lines, allowing users to create, add, remove, or delete the chosen lines.
  • (Research) Read about Structured Brushing [RSM*15] papers
  • (Write) Set up Overleaf file

Week 4 - Apr 7 - 11 (UI/UX week)

  • (Implement) Add the ability to create groups in the data.
  • (Implement) Hide already created groups
  • (Implement) Have the ability to show False-Positive/False-Negative scores (or other evaluation metrics)

Week 5 - Apr 14 - 18 (Focus/Context week)

  • (Implement) Make lines pop out more, similar to the techniques described in the Line Weaver paper, or via linking & brushing
  • (Implement) Merging/Splitting of clusters

Week 6 - Apr 21 - 25

  • (Research) Read on how a simple ML brush can better segment/cluster lines
  • (Research) How to automatically naively cluster lines
  • (Write) Evaluation metric / User study guidelines

Week 7 - Apr 28 - May 2

  • (Implement) Simple ML based brush
  • (Implement) Automatic naive clustering
  • (Research) Read on how a fuzzy classifier can be used to have a line be categorised in multiple clusters

Week 8 - May 6 - 9

  • (Implement) Some fuzzy-clustering brushes
  • (Research) Read on how lines can be drawn more efficiently in a buffer
  • (Write) Background section

Week 9 - May 12 - 16

  • (Implement) All about performance
  • (Write) Introduction

Week 10 - May 19 - 23

  • (Implement) Tidy up the codebase to make it more extendable.
  • (Implement) Sub-sampling
  • (Write) Format/Review the first version of the research paper

Week 11… - May 26

  • (Implement) Allow the user to upload their own data
  • (Implement) Extend the ManiVault application to embed the methods described above
  • (Write) Abstract, Method, Discussion & Conclusion

Bibliography

[AH23] - Efficient Density-peaks Clustering Algorithms on Static and Dynamic Data in Euclidean Space - 2023 Daichi Amagata, Takahiro Hara
[BD15] - Computational feasibility of clustering under clusterability assumptions - 2015 Ben-David, Shai
doi.org/10.48550/arXiv.1501.00437
[BDTSW22] - Human-supervised clustering of multidimensional data using crowdsourcing - 2022 Butyaev, A., Drogaris, C., Tremblay-Savard, O., Waldispühl, J.
doi.org/10.1098/rsos.211189
[CCL*22] - Parallel gravitational clustering based on grid partitioning for large-scale data - 2022 Lei Chen, Fadong Chen, Zhaohua Liu, Mingyang Lv, Tingqin He, Shiwen Zhang
[DHH*02] - Analysis of longitudinal data - 2002 Diggle, Peter, Heagerty, Patrick J, Liang, Kung-Yee, Zeger, Scott
[DMV20] - Depth for Curve Data and Applications - 2020 de Micheaux, Pierre Lafaye, Mozharovskyi, Pavlo, Vimond, Myriam
doi.org/10.1080/01621459.2020.1745815
[DPD*17] - Albero: A Visual Analytics Approach for Probabilistic Weather Forecasting - 2017 Diehl, A., Pelorosso, L., Delrieux, C., Matković, K., Ruiz, J., Gröller, M.E., Bruckner, S.
[FH21] - On sketch-based selections from scatterplots using KDE, compared to Mahalanobis and CNN brushing - 2021 Fan, Chaoran, Hauser, Helwig
[FKRW16] - Visual analysis of spatial variability and global correlations in ensembles of iso‐contours - 2016 Ferstl, F., Kanzler, M., Rautenhaus, M., Westermann, R.
doi.org/10.1111/cgf.12898
[KDJ*21] - Recent advances and challenges in uncertainty visualization: a survey - 2021 Kamal, Aasim, Dhakal, Parashar, Javaid, Ahmad Y., Devabhaktuni, Vijay K., Kaur, Devinder, Zaientz, Jack, Marinier, Robert
[LH11] - Curve Density Estimates - 2011 Lampe, O. Daae, Hauser, H.
[LHT17] - State of the Art in Edge and Trail Bundling Techniques - 2017 Lhuillier, A., Hurter, C., Telea, A.
[M14] - Visualization analysis and design - 2014 Munzner, Tamara
[PGK*22] - Visual Analysis of RIS Data for Endmember Selection - 2022 Andra Popa, Francesca Gabrieli, Thomas Kroes, Anna Krekeler, Matthias Alfeld, Boudewijn Lelieveldt, Elmar Eisemann, Thomas Höllt
doi.org/10.2312/gch.20221233
[PWB*09] - Ensemble-Vis: A Framework for the Statistical Visualization of Ensemble Data - 2009 Potter, Kristin, Wilson, Andrew, Bremer, Peer-Timo, Williams, Dean, Doutriaux, Charles, Pascucci, Valerio, Johnson, Chris R.
doi.org/10.1109/ICDMW.2009.55
[RSM*15] - Towards Quantitative Visual Analytics with Structured Brushing and Linked Statistics - 2015 Radoš, S., Splechtna, R., Matković, K., Đuras, M., Gröller, E., Hauser, H.
doi.org/10.1111/cgf.12901
[SSLC02] - Automated fuzzy clustering of neuronal pathways in diffusion tensor tracking - 2002 Shimony, Joshua S, Snyder, Avi Z, Lori, Nicholas, Conturo, TE
[SZD*10] - Noodles: A Tool for Visualization of Numerical Weather Model Ensemble Uncertainty - 2010 Sanyal, Jibonananda, Zhang, Song, Dyer, Jamie, Mercer, Andrew, Amburn, Philip, Moorhead, Robert
doi.org/10.1109/TVCG.2010.181
[TB21] - Line Weaver: Importance-Driven Order Enhanced Rendering of Dense Line Charts - 2021 Trautner, Thomas, Bruckner, Stefan
[WWS*23] - Clustering-TinyPointNet for Fast Large-scale Point Cloud Semantic Segmentation - 2023 Wang, Zhenglin, Walsh, Kerry, Sabrina, Fariza, Piyathilaka, Lasitha, Lin, Yufeng
doi.org/10.1109/DICTA60407.2023.00039
[WYCM11] - Parallel clustering for visualizing large scientific line data - 2011 Wei, Jishang, Yu, Hongfeng, Chen, Jacqueline H, Ma, Kwan-Liu
[Y93] - A survey of fuzzy clustering - 1993 M.-S. Yang
[ZLZP21] - A New Membership Scaling Fuzzy C-Means Clustering Algorithm - 2021 Zhou, Shuisheng, Li, Dong, Zhang, Zhuan, Ping, Rui
doi.org/10.1109/TFUZZ.2020.3003441
[ZSE*15] - Demand for multi-scale weather data for regional crop modeling - 2015 Gang Zhao, Stefan Siebert, Andreas Enders, Ehsan Eyshi Rezaei, Changqing Yan, Frank Ewert