Imputation helps to deal with missing values in lipidomics datasets.
About the author
Olga (Olya) Vvedenskaya
Sci. Communications Officer
Dr. Dr. Olya Vvedenskaya studied medicine, and further obtained her PhD in the field of molecular oncology. She loves to deliver scientific messages in a clear and accessible manner.
• Mass spec data points can be missing due to various reasons
• Missing data can lead to biased interpretations and hinder statistical analyses
• Imputation methods fill in the gaps in lipidomics datasets
Olga (Olya) Vvedenskaya
Sci. Communications Officer
Dr. Dr. Olya Vvedenskaya studied medicine, and further obtained her PhD in the field of molecular oncology. She loves to deliver scientific messages in a clear and accessible manner.
Mass spectrometry (MS) is a powerful analytical technique widely used in life sciences. Analyzing lipids with MS allows scientists to get insights into a large variety of biological and medical questions. However, the reliability of insights drawn from mass spectrometry data can be compromised by missing values, a common challenge in real-world experiments. The data points can be missing because of several reasons. It could be due to the compound not being present in a sample, due to sample preparation (for example, added chemicals for analyte degradation), or due to the detection limit of the instrument. This leads to signals from some compounds being below the limit of quantitation (LOQ).
Missing data in mass spectrometry experiments can lead to biased interpretations, hinder statistical analyses, and limit the scope of biological insights. For example, an experiment output has five lipids measured. CE 18:1;0 was the only lipid measured in all the samples. All other lipids have at least one missing value.
Typical lipidomic dataset with missing values highlighted in red.
Many researchers start their analysis with a principal component analysis (PCA). However, a PCA method does not accept missing data points. Therefore, a researcher should either use another method or try to work around the problem. This can be done by using only completely measured lipids or, due to the high cost of information loss, by imputing the missing data points. The word ‘imputation’ comes from the Latin imputo, which means to reckon, to attribute, to clear up.
Data imputation methods aim to address these challenges by filling in the gaps. They allow for more statistical methods to be applied. It also simplifies the programming effort, and can also often extract more insights from the existing data. There are many methods available today for imputing missing lipid data points and scientists are often faced with the question of which one to choose. The answer to this question is not easy, because there is no single best method. But there is probably a method that works best for the respective data.
An important question to ask is why the data points are missing. Is it completely random? Or is there some factor (tracked or untracked) that influences the probability of a data point being missing? Although these questions are important for the choice of the imputation method, it is not possible to answer them with confidence for every missing data point.
In lipidomics context, for example, there are many data points missing because they are below the limit of detection. Therefore, one can be confident that imputing all the values using the mean or median value for the corresponding lipid would result in poor imputation quality. Also imputing these missing data points consequently with zero cannot be correct, because values below a detection limit are not necessarily zero. Therefore, it seems to be reasonable to impute all missing data points by half the detection limit. However, this is also problematic, because it is obvious, that not all missing data points of the same lipid are given by the same value. It would be unrealistic to impute all missing values of CE 20:1;0 in the example by half the limit of detection. That would mean that 6 out of 9 samples would have the same imputed value for this lipid.
There is a need for a more advanced method. Lipotype scientists investigated many different imputation methods, applied them to lipidomic datasets, and performed simulation studies. They concluded, that the k-nearest neighbor truncation approach described by Shah et al. works best. In brief, k-nearest neighbor truncation, unlike other imputation methods, takes the limit of detection into account. Missing values are imputed by first transforming lipid-wise all values to a common scale, then finding the nearest neighbor (=lipid) for the lipid-containing missing values, imputing the missing values with that of the nearest neighbor, and finally back-transforming the data. Below is the data table of our example after imputing the missing values.
Typical lipidomic dataset with imputed values highlighted in red.
The advantage of a complete dataset is not just the ability to use a broader range of methods and simplify the programming process. The dataset can now be more informative, e.g. statistical tests can be applied at all or with greater confidence. Although the benefits of imputation should be treated with caution, imputation simplifies all further statistical analyses and expands the possibilities when working with lipidomics datasets.
Lipotype Lipidomics technology provides a powerful solution for customers seeking insights into cellular lipid profiles. Lipotype aids in lipid research in various directions, like dermatology or cardiovascular diseases.
Related articles
Lipidomics data analysis: WGCNA
WGCNA may be useful for condensing all the lipid species into a few modules that can be linked to clinical traits.
Lipotype is the leading lipidomics provider for all scientists. The mass spectrometry-based platform can be applied to all biological samples, and is completed with data visualization and statistical analyses. Lipotype translates complex lipidomics data sets into convincing lipidomics results, in as little as two weeks
Share this story
You are currently viewing a placeholder content from OpenStreetMap. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.