The methodology comprises two paths: the first takes satellite images as input, and the second takes satellite-derived tabular data. The first path contains three main steps. The first step imputes missing pixels using a hybrid of random forest and identity GANs, and introduces two novel modifications: the first adds an identity block to the GAN generator to avoid the vanishing gradient problem and mode collapse; the second applies a neutrosophic statistical formulation to the 8-connected pixels surrounding each missing pixel, which helps generate a pixel consistent with its surroundings. The second step uses the SOM to identify noisy regions in the satellite images, and the third step uses the latent diffusion model to remove the noise from the regions identified in the previous step. The second path takes the tabular satellite data as input and involves a single step: imputing the missing tabular data with a diffusion model. After the preprocessing steps in both paths, the methodology extracts the regional data from the first path, combines the outputs of the two paths, and applies feature selection to choose the features used by the modified LSTM to predict solar radiation. Fig. 1 shows the methodology diagram.

The dataset
This study utilizes two datasets: The Solar Irradiation Measurement (SRM) Dataset and the Satellite-based Solar Irradiation (SSR) Dataset. The SRM dataset provides ground-based solar irradiation measurements, while the SSR dataset offers satellite-derived estimates, ensuring a comprehensive approach to solar energy forecasting.
The SRM dataset, compiled by the National Renewable Energy Laboratory (NREL), includes high-precision ground measurements of solar irradiation and related atmospheric variables45. The dataset spans the period from January 1, 2010, to December 31, 2020, covering multiple geographical locations across the United States and Europe. The dataset is recorded at an hourly temporal resolution, which ensures enough granularity for capturing variations in solar irradiation and optimizing interpolation and prediction tasks. It comprises twelve essential features, including Global Horizontal Irradiance (GHI), Direct Normal Irradiance (DNI), Diffuse Horizontal Irradiance (DHI), air temperature, relative humidity, wind speed, wind direction, atmospheric pressure, precipitation, and cloud cover. Given the importance of DHI in solar energy applications, this study focuses primarily on this feature to enhance the accuracy of irradiation prediction models. Table 2 provides a detailed description of the key features included in the SRM dataset, highlighting the meteorological and atmospheric parameters that influence solar irradiation levels. These parameters are essential for ensuring accurate model training and validation.
The SRM dataset is primarily derived from the US SURFRAD network, which includes monitoring stations in locations such as Desert Rock, Nevada; Goodwin Creek, Mississippi; and Penn State, Pennsylvania. To ensure data reliability, the dataset undergoes rigorous quality control procedures, including routine validation against known meteorological standards, checks for sensor malfunctions, and removal of anomalous readings. Additionally, nighttime values are excluded to maintain a precise focus on daytime solar irradiation measurements. The dataset is publicly available through the NREL repository, facilitating its use in further research and model development.
The SSR dataset, compiled by the European Space Agency (ESA), consists of satellite imagery and derived solar irradiation products. This dataset provides extensive spatial coverage across Europe and North Africa and includes both tabular data and satellite images in GeoTIFF and JPEG formats. It spans the period from 2010 to 2022, enabling long-term analysis of solar irradiation trends. The dataset is constructed using multiple satellite sensors, including Sentinel-2, Landsat 8, and MODIS, which offer high-resolution spectral data suitable for solar energy applications46. The SSR dataset contains nine essential features, encompassing surface reflectance, cloud cover percentage, aerosol optical depth, water vapor content, surface elevation, and derived solar irradiation parameters such as GHI, DNI, and DHI. In this study, emphasis is placed on the DHI component, as it plays a crucial role in improving solar irradiation predictions. Table 3 presents a comprehensive overview of the features included in the SSR dataset, detailing the satellite-based parameters utilized in this study.
The selection of these datasets is based on multiple criteria, including spatial resolution, temporal coverage, and data reliability. The SRM dataset offers high temporal accuracy, making it an ideal reference for validating satellite-derived predictions. In contrast, the SSR dataset provides superior spatial granularity, which is critical for capturing localized atmospheric variations. While geostationary satellites offer higher temporal resolution, their spatial coverage is often less detailed compared to Low Earth Orbit (LEO) satellites, such as those utilized in this study. The advantage of LEO-based datasets lies in their ability to provide fine-scale spatial details, though they are limited by longer revisit times, which can affect real-time applications.
By integrating ground-based and satellite-derived datasets, this study ensures greater accuracy and robustness in solar irradiation modeling. The SRM dataset provides high-reliability data across North America and Europe, whereas the SSR dataset extends the applicability of the model to regions with limited ground-based data coverage, such as North Africa. This combination establishes a comprehensive foundation for developing scalable and precise solar irradiation forecasting models. Both datasets are publicly accessible, with the SRM dataset available via the NREL repository and the SSR dataset accessible through the ESA Earth Observation Data Portal.
Pixel imputation using hybrid random forest and GANs with 8-connected pixel analysis
This part of the methodology imputes the missing pixels of the satellite images; missing pixels may occur during image capture or transmission. As shown in Fig. 2, the hybrid model comprises two main parts. The first part is the random forest model, whose output value serves as the input to the generative adversarial interpolation network. The second part is the GAN with an identity block and 8-connected pixel analysis. This part of the methodology introduces a set of novel modifications. The first adds an identity block to the generator; the identity block addresses the mode collapse and vanishing gradient problems and helps the network generate a different pixel at each training step. The second uses the 8-connected pixels to compute the average value of the pixels surrounding the missing one via a neutrosophic statistical analysis: the neutrosophic formulation assigns a value to each of the 8-connected pixels, and the average of those values is then computed. The GAN generates a candidate value for the missing pixel, which is then compared with the average of the 8-connected pixels; the model stops when the generated value is sufficiently close to that average.
The process begins with the application of Random Forest, which is used to predict the missing pixel values based on the available surrounding pixel information. The RF model is particularly effective in capturing the complex dependencies within the data due to its ensemble learning nature. However, Random Forest alone may not fully account for the intricate patterns and variability present in satellite images, which is where the integration of GANs becomes essential as illustrated in Fig. 2.
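A minimal sketch of this RF stage, under our own assumptions: each missing pixel receives an initial estimate predicted from its 8-connected neighborhood, with scikit-learn's RandomForestRegressor standing in for the paper's RF configuration (border pixels and clustered gaps are not handled here).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbor_features(img, r, c):
    """8-connected neighborhood of pixel (r, c) as a feature vector."""
    return np.array([img[r + dr, c + dc] for dr, dc in OFFSETS])

def rf_impute(img, missing_mask, n_trees=100):
    """Train on intact pixels whose neighborhood is complete, then
    predict an initial estimate for each missing pixel."""
    h, w = img.shape
    # Training pixels: interior locations with no missing neighbors.
    train = [(r, c) for r in range(1, h - 1) for c in range(1, w - 1)
             if not missing_mask[r - 1:r + 2, c - 1:c + 2].any()]
    X = np.array([neighbor_features(img, r, c) for r, c in train])
    y = np.array([img[r, c] for r, c in train])
    model = RandomForestRegressor(n_estimators=n_trees).fit(X, y)

    out = img.copy()
    for r, c in zip(*np.where(missing_mask)):
        if 0 < r < h - 1 and 0 < c < w - 1:
            x = neighbor_features(out, r, c).reshape(1, -1)
            out[r, c] = model.predict(x)[0]  # initial estimate for the GAN
    return out
```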

Fig. 2. Block diagram of random forest with identity GANs.
Generative Adversarial Networks (GANs), as depicted in Fig. 3, are then utilized to refine the imputed pixel values generated by the RF model. The GAN architecture comprises two main components: a generator and a discriminator. The generator attempts to create realistic pixel values, while the discriminator evaluates the authenticity of these generated values against the real data. This adversarial process continues until the generator produces pixel values that are indistinguishable from the actual data, thereby enhancing the quality of the imputation.
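To make the identity-block modification concrete, the following minimal PyTorch sketch shows a generator with a residual (identity) path; layer widths, the noise dimension, and the conditioning scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """Residual block: output = activation(F(x) + x). The skip path lets
    gradients flow directly, mitigating vanishing gradients, and keeps
    the mapping close to identity, which helps against mode collapse."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)

class Generator(nn.Module):
    """Maps a noise vector (optionally concatenated with the RF estimate
    and neighborhood statistics) to a candidate pixel value."""
    def __init__(self, noise_dim=16, hidden=64):
        super().__init__()
        self.inp = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU())
        self.block = IdentityBlock(hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, z):
        return self.out(self.block(self.inp(z)))
```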

Fig. 3. Basic framework of a generative adversarial network.
To further refine the imputation, the framework incorporates an 8-connected pixel analysis, a technique that considers the spatial relationships between a pixel and its eight immediate neighbors. This analysis ensures that the imputed pixels are not only accurate in isolation but also consistent with the surrounding pixel structure, preserving the overall integrity of the satellite images. As depicted in Fig. 4, the final step involves the integration of the outputs from both the RF model and the GANs, resulting in a robust imputation model capable of handling the complexities of satellite image data. The combined approach of Random Forest and GANs, supported by 8-connected pixel analysis, provides a significant improvement in the prediction of solar radiation, ensuring that the imputed data is both accurate and reliable.
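As a concrete, hedged illustration of the stopping test described above, the sketch below compares a generated pixel against a weighted 8-connected average; the truth/indeterminacy weighting is a simplified stand-in for the paper's neutrosophic formulation, and the tolerance value is an assumption.

```python
import numpy as np

def weighted_neighbor_mean(neighbors, indeterminacy):
    """Average of the 8 neighbors, down-weighting indeterminate (noisy)
    ones. `indeterminacy` holds values in [0, 1]; 0 = fully reliable."""
    w = 1.0 - np.asarray(indeterminacy, dtype=float)
    return float(np.sum(w * np.asarray(neighbors)) / np.sum(w))

def accept(generated, neighbors, indeterminacy, tol=0.05):
    """Stop the GAN refinement once the generated value is within `tol`
    of the weighted 8-connected average."""
    return abs(generated - weighted_neighbor_mean(neighbors, indeterminacy)) <= tol
```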

Fig. 4. Generative adversarial interpolation network (GAIN) framework.
Noisy regions identification with SOM
The Time Series Growing Self-Organizing Map (TS-GSOM) is a powerful technique that can be effectively employed to identify and remove noise from satellite imagery. The TS-GSOM is an unsupervised neural network model that can learn the underlying patterns and structures within a time series of satellite images. By treating each pixel in the satellite images as a time series, the TS-GSOM can capture the spatial and temporal characteristics of the data, enabling it to distinguish between genuine features and noise. The model begins with a small initial map and adaptively grows its size and complexity as it learns the intricate patterns present in the satellite data. This dynamic growth allows the TS-GSOM to identify regions within the images that exhibit anomalous or noisy behavior, which can then be selectively targeted for further denoising. The unsupervised nature of the TS-GSOM makes it particularly well-suited for satellite imagery, where the sources and characteristics of noise can be highly complex and variable. By leveraging the TS-GSOM’s ability to learn the data’s inherent structure, satellite image analysts can effectively isolate and remove noise, leading to more accurate and reliable interpretation of the underlying land cover, environmental changes, and other important geospatial information.
The TS-GSOM model consists of three key components: the input layer, the growing self-organizing map, and the output layer. The input layer takes in the time series of satellite image pixels, treating each pixel as a multivariate time series. The growing self-organizing map forms the core of the model, starting with a small initial grid of neurons and adaptively expanding its size and complexity as it learns the underlying patterns in the data. The learning process involves competitive learning, where neurons compete to represent the input data, and cooperative learning, where neighboring neurons adjust their weights to capture the spatial and temporal relationships. As the map grows, it forms clusters of neurons that correspond to distinct features and structures within the satellite images, allowing the model to differentiate between genuine image content and noise. Finally, the output layer aggregates the learned representations from the growing self-organizing map, providing a denoised version of the input satellite images by selectively reconstructing the non-noisy regions. The TS-GSOM model operates in an iterative fashion, with each iteration refining the map and improving the noise removal capabilities. By leveraging the self-organizing and adaptive nature of the growing map, the TS-GSOM can effectively identify and suppress the noise in satellite imagery, leading to enhanced image quality and more accurate interpretation of the underlying geospatial information as shown in Pseudocode (3-1).

Pseudocode (3-1): The TS-GSOM model.
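As an illustrative complement to Pseudocode (3-1), the following simplified Python sketch grows a one-dimensional chain of SOM units whenever accumulated quantization error is high; real GSOM implementations also manage a 2-D topology and spread factors, and the learning rate, epochs, and growth threshold here are assumptions.

```python
import numpy as np

def gsom_train(series, growth_threshold=5.0, lr=0.3, epochs=20, seed=0):
    """`series` has shape (n_pixels, T): each pixel is one time series."""
    rng = np.random.default_rng(seed)
    dim = series.shape[1]
    weights = rng.random((2, dim))   # start with a tiny map of two units
    error = np.zeros(2)              # accumulated quantization error per unit

    for _ in range(epochs):
        for x in series:
            d = np.linalg.norm(weights - x, axis=1)
            bmu = int(np.argmin(d))              # best-matching unit (competitive)
            error[bmu] += d[bmu]
            weights[bmu] += lr * (x - weights[bmu])
            # cooperative update of the immediate neighbors on the chain
            for nb in (bmu - 1, bmu + 1):
                if 0 <= nb < len(weights):
                    weights[nb] += 0.5 * lr * (x - weights[nb])
        # growth step: insert a new unit beside the most stressed one
        if error.max() > growth_threshold:
            g = int(np.argmax(error))
            weights = np.insert(weights, g + 1, weights[g] + 0.01, axis=0)
            error = np.insert(error, g + 1, 0.0)
            error[g] = 0.0
    return weights
```

Pixels whose time series map to units with high residual quantization error can then be flagged as candidate noisy regions for the denoising step that follows.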
Noise removal using the latent diffusion model
The use of a latent diffusion model is a promising approach for removing noise from the satellite imagery. Latent diffusion models are a type of generative neural network that can effectively map noisy input images to a clean, noise-free representation in a learned latent space. By training the latent diffusion model on a dataset of high-quality satellite images, the model can learn the underlying patterns and features that characterize clear, unobstructed imagery. Once trained, the model can then iteratively denoise the input, progressively removing unwanted artifacts and distortions while preserving the important structural and spectral information. This allows for the recovery of a clean, high-fidelity representation of the ground features, which is crucial for accurately measuring and monitoring the study regions over time.
The model follows an encoder-decoder design: the encoder maps the noisy image into a latent space that encodes the essential features of the image in a lower-dimensional form while filtering out the unwanted noise. The decoder then takes this clean, noise-free latent representation and generates a reconstructed output image that closely matches the original, high-quality version. The training of the latent diffusion model involves iteratively refining this encoding-decoding process, minimizing the reconstruction error between the model output and the ground truth clean images. This allows the network to learn an effective mapping from the noisy input to the denoised output, enabling it to generalize and denoise new satellite images with high fidelity. The encoder and decoder architecture, along with the diffusion-based training process, are the core components that give the latent diffusion model its powerful denoising capabilities for satellite imagery.
The latent diffusion network operates by progressively transforming a noisy input satellite image into a clean, denoised output. This is achieved through a series of diffusion steps, where the network gradually removes the noise while preserving the underlying image features. The process begins with the encoder, which takes the noisy input image and maps it to a compact latent representation. This latent encoding captures the essential image information in a lower-dimensional form, effectively separating the signal from the noise. The decoder then uses this clean latent representation to generate the denoised output image, restoring the visual quality and details. Crucially, the training of the latent diffusion model involves iteratively adding and removing noise from the input images, learning the inverse mapping that can effectively denoise new samples. By modeling this diffusion process, the network develops a robust understanding of how to remove unwanted artifacts and distortions from the satellite imagery, while maintaining the important structural and spectral characteristics. Through this iterative, diffusion-based approach, the latent diffusion model can produce high-quality, denoised satellite images that are crucial for accurate analysis and interpretation of the target site as shown in Pseudocode (3-2).

Pseudocode (3-2): The latent diffusion model components.
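For concreteness, here is a minimal sketch of the reverse (denoising) loop in latent space, written against the standard DDPM update; `encoder`, `decoder`, and `eps_model` stand for the trained autoencoder and noise-prediction network, whose exact forms the text does not specify, and the noise schedule is an illustrative assumption.

```python
import torch

@torch.no_grad()
def denoise(noisy_image, encoder, decoder, eps_model, T=1000):
    # Linear noise schedule (assumed); alpha_bar is the cumulative product.
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    z = encoder(noisy_image)  # map the noisy image into the latent space
    for t in reversed(range(T)):
        eps = eps_model(z, torch.tensor([t]))          # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (z - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise        # DDPM reverse step
    return decoder(z)  # reconstruct the clean image from the latent code
```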
Missing data imputation using diffusion model
To handle the issue of missing values in the satellite-derived tabular data, we employed a diffusion-based imputation approach. Due to factors such as sensor malfunctions, atmospheric interference, or insufficient ground coverage, some entries in the tabular dataset derived from the satellite imagery were missing. We addressed this by modeling the underlying data distribution using a diffusion probabilistic model. Specifically, we trained a conditional diffusion model that could generate plausible completions for the missing entries based on the observed non-missing features. The diffusion model learned the complex statistical relationships between the different variables in the tabular data by iteratively adding controlled noise and then reversing the process to recover the original data distribution. Once trained, we used the diffusion model to sample likely values for the missing entries, conditioning on the known feature values for each row. This approach allowed us to impute the missing data in a manner that preserved the multivariate structure and higher-order statistics of the original satellite-derived tabular dataset. The diffusion-based imputation provided more accurate and realistic estimates compared to simpler techniques, enabling us to maximize the information content used in the subsequent data analysis.
The diffusion-based imputation approach we employed consisted of several key components. At the core was a conditional diffusion probabilistic model that learned the underlying data distribution of the satellite-derived tabular dataset. This diffusion model was composed of a noise prediction neural network and a Markov Chain Monte Carlo (MCMC) sampling procedure. The noise prediction network took as input the known feature values for a row with missing entries and produced predictions of the noise that would need to be sequentially applied to generate the missing values. The MCMC sampling then iteratively applied this learned noise process in reverse, starting from random initializations, to produce plausible completions for the missing entries that matched the observed data distribution.
The key steps of the missing tabular data imputation are as follows:
1. Train a noise prediction model on the complete data.
2. For each row with missing values:
   (a) Randomly initialize the missing values.
   (b) Perform MCMC sampling to update the missing values:
       (i) Predict the noise to apply using the trained model.
       (ii) Update the missing values with the predicted noise.
       (iii) Accept the new values based on an MCMC criterion.
   (c) Update the original row with the final imputed values.
3. Return the dataset with the imputed missing values.
Pseudocodes (3-3) and (3-4) describe how the missing data are imputed using the diffusion model; a simplified illustrative sketch follows them below.

Pseudocode (3-3): The diffusion-based imputation process.

Pseudocode (3-4): MCMC sampling step.
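As an illustrative complement to Pseudocodes (3-3) and (3-4), the sketch below implements the listed steps under our own assumptions: `predict_noise` stands for the trained noise-prediction model, and the step size and acceptance rule are simplified placeholders for the paper's MCMC criterion.

```python
import numpy as np

def impute_row(row, missing_idx, predict_noise, steps=200, seed=0):
    """Fill the missing entries of one row via propose/accept updates."""
    rng = np.random.default_rng(seed)
    row = row.copy()
    row[missing_idx] = rng.standard_normal(len(missing_idx))  # step 2(a)
    energy = np.inf
    for _ in range(steps):                                    # step 2(b)
        proposal = row.copy()
        eps = predict_noise(proposal)                         # step 2(b)(i)
        # step 2(b)(ii): apply the predicted noise as a denoising update
        proposal[missing_idx] -= 0.1 * eps[missing_idx]
        new_energy = float(np.linalg.norm(eps[missing_idx]))
        # step 2(b)(iii): simplified acceptance - keep the proposal if the
        # predicted residual noise shrinks, with a small exploration chance
        if new_energy < energy or rng.random() < 0.05:
            row, energy = proposal, new_energy
    return row                                                # step 2(c)

def impute_dataset(X, predict_noise):
    out = X.copy()
    for i, row in enumerate(X):                               # step 2
        miss = np.where(np.isnan(row))[0]
        if miss.size:
            out[i] = impute_row(np.nan_to_num(row), miss, predict_noise)
    return out                                                # step 3
```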
Feature extraction
In this study, feature extraction was performed by integrating information from two distinct datasets to construct a comprehensive set of predictors for the LSTM forecasting model. The first dataset consisted of tabular data containing meteorological and solar irradiation variables, while the second dataset provided geographical information derived from satellite imagery, including surface reflectance, cloud cover, and related atmospheric and terrain characteristics. To ensure a seamless integration of these datasets, a spatial join operation was applied, aligning the tabular data with their corresponding geographic coordinates. This process enriched each tabular record with relevant satellite-derived features corresponding to specific locations.
Following data integration, feature engineering techniques were employed to capture complex relationships between the tabular and geospatial attributes. The engineered features incorporated spatial lags, geographic clustering metrics, and multimodal representations of satellite imagery, enhancing the predictive capacity of the model. To refine the feature set, a rigorous selection process was conducted using recursive feature elimination (RFE) and permutation importance. This procedure identified the most informative predictors, which primarily included global horizontal irradiance, direct normal irradiance, and air temperature from the SRM dataset. Additionally, surface reflectance in the visible and near-infrared bands, aerosol optical depth, and cloud cover percentage from the SSR dataset demonstrated significant contributions to the predictive model. These features exhibited strong importance scores and effectively captured the spatial and temporal variations in solar radiation.
To address potential multicollinearity among the selected features, the Variance Inflation Factor (VIF) was computed for each variable. Features with high collinearity, indicated by a VIF exceeding a predefined threshold, were either removed or transformed using Principal Component Analysis (PCA). This approach ensured that the final feature set retained its predictive capacity while minimizing redundancy. By leveraging a diverse set of predictors from both tabular and geospatial sources and applying systematic feature selection and dimensionality reduction techniques, the constructed feature set was optimized to enhance the accuracy and reliability of solar irradiation predictions. Table 4 provides a summary of the selected features and their corresponding importance scores, highlighting their respective contributions to the model’s predictive performance.
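A hedged sketch of this selection pipeline, combining scikit-learn's RFE with a VIF screen from statsmodels; the number of retained features and the VIF threshold of 10 are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor

def select_features(X: pd.DataFrame, y, n_keep=8, vif_threshold=10.0):
    # 1) RFE ranks features by repeatedly dropping the weakest one,
    #    using the random forest's feature importances.
    rfe = RFE(RandomForestRegressor(n_estimators=200),
              n_features_to_select=n_keep)
    rfe.fit(X, y)
    kept = X.columns[rfe.support_].tolist()

    # 2) VIF screen: drop the most collinear feature until all pass.
    while len(kept) > 1:
        vifs = [variance_inflation_factor(X[kept].values, i)
                for i in range(len(kept))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= vif_threshold:
            break
        kept.pop(worst)
    return kept
```

Features that fail the VIF screen could alternatively be projected with PCA rather than dropped, matching the transformation option described above.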
Solar radiation prediction using modified LSTM (LSTM with seasonality and trend handling)
The traditional LSTM model, while effective at capturing complex temporal patterns, may struggle to accurately predict solar radiation due to the strong seasonal and trend components inherent in such time series data. To address this limitation, the Modified LSTM model incorporates dedicated mechanisms to explicitly handle the seasonality and trend present in the input solar radiation data. Specifically, the input time series is first decomposed into its seasonal, trend, and residual components using techniques such as Seasonal-Trend decomposition using Loess (STL). The seasonal and trend components are then fed into separate LSTM sub-networks, allowing the model to learn the unique characteristics of these various data features. The outputs of the seasonal and trend LSTM sub-networks are then combined with the residual component to produce the final solar radiation prediction. This explicit modeling of the underlying drivers of solar radiation helps the Modified LSTM overcome the limitations of the traditional LSTM, resulting in improved forecasting accuracy.
The Modified LSTM model for solar radiation prediction is designed with several key components. It begins with an input layer that processes the solar radiation time series. This input is then passed through a decomposition module that divides the time series into seasonal, trend, and residual components. The seasonal and trend components are each processed by separate LSTM sub-networks, which have their own LSTM units and internal states. These sub-network outputs are then concatenated with the residual component and passed through a final dense layer to generate the solar radiation prediction. This architecture enables the model to capture the unique characteristics of the seasonal, trend, and residual components, resulting in more accurate forecasts than the traditional LSTM approach; the hyperparameters used are detailed in Table 5.
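To illustrate this architecture, the following minimal PyTorch sketch (our own reading, not the paper's exact implementation) decomposes the series with STL from statsmodels and routes the seasonal and trend components through separate LSTM sub-networks; the hidden size, the daily period of 24, and the use of only the last residual value are illustrative assumptions.

```python
import torch
import torch.nn as nn
from statsmodels.tsa.seasonal import STL

class ModifiedLSTM(nn.Module):
    """Seasonal and trend LSTM sub-networks plus a dense head that
    combines their states with the residual component."""
    def __init__(self, hidden=32):
        super().__init__()
        self.seasonal_lstm = nn.LSTM(1, hidden, batch_first=True)
        self.trend_lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden + 1, 1)  # + last residual value

    def forward(self, seasonal, trend, residual_last):
        # seasonal/trend: (batch, window, 1); residual_last: (batch, 1)
        _, (hs, _) = self.seasonal_lstm(seasonal)
        _, (ht, _) = self.trend_lstm(trend)
        feats = torch.cat([hs[-1], ht[-1], residual_last], dim=1)
        return self.head(feats)

def decompose(series, period=24):
    """STL decomposition of an hourly solar radiation series into
    seasonal, trend, and residual components."""
    res = STL(series, period=period).fit()
    return res.seasonal, res.trend, res.resid
```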