Master's thesis goal / problem statement

The goal of the thesis is to build a tool that lets any user load a dataset of between 10,000 and 1 million lines and gain a better understanding of it. To make this more concrete:

A user is asked questions about a complex dataset that they cannot answer without interacting with the system. After making as few selections as possible, the user should be able to answer the same questions correctly.

Possible use cases

Experts in the following areas could benefit from clustering large quantities of line data.

  1. Art preservation (RIS pigment identification): This is the use case studied by Popa et al. [PGK*22]. My version would differ, and aims to be more useful, by allowing experts to select clusters directly by interacting with the line data.
  2. Transportation (traffic pattern analysis): Clustering traffic volume data collected from sensors on a specific road or rail segment over a 24-hour period. Clusters can reveal typical daily patterns such as morning and evening commute peaks, off-peak flow, and weekend variations. When tracking trains, many may be out of service, so some vehicles show no traffic during certain time intervals.
  3. Medicine (ECG analysis): Grouping heartbeat intervals (RR intervals) or heartbeat shapes over a fixed time window (e.g., 10 seconds) from ECGs. This case is interesting because the clusters are highly concentrated around the y = 0 line, with many irregular peaks.
  4. Weather prediction (temperature, pressure): Trends found and grouped for specific areas could be combined with small multiples of different weather or climate data to support meaningful predictions.

General user questions

These are some of the questions that any user should be able to answer after interacting with the system.

  1. How many clusters are present in the dataset?
  2. How many lines are present in clusters A, B, and C? (These are predefined clusters that a user is expected to reselect/recreate.)
  3. What is the size of the biggest cluster?
  4. How many lines are part of the background noise, meaning they have a less than 15% chance of belonging to any one cluster?
  5. How many lines are and are not outliers in clusters A, B, and C?
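
To check a user's answers to questions 1 to 4, reference answers could be computed from per-line cluster membership probabilities. The following is a minimal sketch under stated assumptions, not the planned implementation: the function name `summarize`, the input layout (one row of probabilities per line), and the example values are all hypothetical. Only the 15% noise threshold comes from question 4. Question 5 is left out because it depends on an outlier definition that is not fixed yet.

```python
from collections import Counter

NOISE_THRESHOLD = 0.15  # from question 4: below this for every cluster means noise


def summarize(memberships, labels):
    """Compute reference answers for questions 1-4.

    memberships: one row per line, one probability per cluster (hypothetical input).
    labels: cluster names, in the same order as the entries of each row.
    """
    sizes = Counter()
    noise = 0
    for row in memberships:
        # Index of the most likely cluster for this line.
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] < NOISE_THRESHOLD:
            noise += 1  # no cluster reaches the 15% threshold
        else:
            sizes[labels[best]] += 1  # assign the line to its most likely cluster
    biggest = max(sizes, key=sizes.get) if sizes else None
    return {
        "cluster_count": len(sizes),      # counts only clusters with at least one line
        "cluster_sizes": dict(sizes),
        "biggest_cluster": biggest,
        "noise_lines": noise,
    }


# Made-up example: five lines, three clusters. Rows need not sum to 1;
# the remaining probability mass is treated as background noise.
example = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.10, 0.85],
    [0.10, 0.12, 0.08],  # below 15% for every cluster: background noise
]
answers = summarize(example, ["A", "B", "C"])
print(answers)
```

Assigning each line to its single most likely cluster is only one possible convention; a real evaluation might instead compare the user's selection against the predefined clusters A, B, and C directly.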

Expert questions

The selections made by experts are expected to outperform automatic selection techniques. This hypothesis rests on the idea that automatic solutions lack awareness of the context in which the data is presented, so current general-purpose solutions cannot make selections that are meaningful for each specific use case. The following are expert-specific questions for which automatic systems are expected to be inferior.

  1. (RIS pigment identification) - Which lines correspond to each of the five PIGMENT NAME?
  2. (Transportation) - Which ten trains are used the most on average?
  3. (Medicine) - Which heartbeats are irregular?
  4. (Weather prediction) - Which lines correspond to the 10 warmest years in a given area?

Bibliography

[PGK*22] Andra Popa, Francesca Gabrieli, Thomas Kroes, Anna Krekeler, Matthias Alfeld, Boudewijn Lelieveldt, Elmar Eisemann, Thomas Höllt. Visual Analysis of RIS Data for Endmember Selection. 2022.
doi.org/10.2312/gch.20221233