Detecting cloud presence in satellite images using the RGB-based CLIP vision-language model

Research output: Contribution to conference › Abstract › peer-review

Abstract

The text medium has begun to play a prominent role in the processing of visual data in recent years, for images [1, 2, 3, 4, 5, 6, 7] as well as videos [8, 9, 10]. Language allows human users to easily adapt computer vision tools to their needs, and so far it has primarily been used for purely creative purposes. Yet vision-language models could also pave the way for many remote sensing applications that can be defined in a zero-shot manner, with little or no training. At the core of many text-based vision solutions stands CLIP, a vision-language model designed to measure the alignment between text and image inputs [1].

In this work, the capability of the CLIP model to recognize cloud-affected satellite images is investigated. The approach is not immediately obvious: the CLIP model operates on RGB images, whereas typical cloud detection solutions for satellite imagery rely on more than the RGB visible bands, such as infrared, and are often sensor-specific. Some past works have explored the potential of an RGB-only cloud detection model [11], but the task is considered significantly more challenging. Furthermore, the CLIP model was trained on the general WebImageText dataset [1], so it is not obvious how well it can perform on a task as specific as the classification of cloud-affected satellite imagery. Here, the capability of the official pre-trained CLIP model (ViT-B/32 backbone) is put to the test. Two important insights are gained: the study estimates the utility of the representations learned by CLIP for cloud-oriented tasks (which could potentially lead to more complex uses such as segmentation or removal), and the model can act as a tool for filtering datasets based on the presence of clouds.

The CLIP model [1] was designed for zero-shot classification of images, where labels can be supplied (and hence specified as text) at inference time. It consists of separate encoders for text and image input with a jointly learned embedding space; a relative measure of alignment between a given text-image pair is obtained by computing the cosine similarity between the two encodings. The manuscript explores four variants of using CLIP for cloud presence detection, shown in Table 1: one fully zero-shot variant based on text prompts (1), and three variants (2)-(4) based on minor fine-tuning of the high-level classifier module (1,000 gradient steps with a batch size of 10, on the training dataset only). In variant (2), a linear probe is attached to the features produced by the image encoder. Variant (3) employs the CoOp approach described in [12]. Finally, the Radar variant (4) applies a linear probe classifier to the image encodings of both the RGB data and a false-color composite of the SAR data (VV, VH, and the mean of the two channels encoded as the 3 input channels). The learned approaches (2)-(4) are further tested for dataset/sensor transferability: the (a) variants use training and testing data from the same sensor, while the (b) variants employ transfer. The text prompts for method (1) were arbitrarily selected as "This is a satellite image with clouds" and "This is a satellite image with clear sky", with no attempt to improve them.
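
To make the mechanism of the fully zero-shot variant (1) concrete, the sketch below scores a single RGB image against the two prompts quoted above by comparing the cosine similarity of its CLIP embedding with each prompt embedding. This is a minimal illustrative sketch, not the authors' code: it assumes the official "clip" Python package, a PyTorch installation, and a hypothetical image file "scene.png".

# Sketch of variant (1): zero-shot cloud presence detection with CLIP (ViT-B/32).
# Assumes the official "clip" package (github.com/openai/CLIP); "scene.png" is a
# hypothetical placeholder for an RGB satellite image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["This is a satellite image with clouds",
           "This is a satellite image with clear sky"]

image = preprocess(Image.open("scene.png")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity = dot product of L2-normalized embeddings.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)

print("cloudy" if similarity[0] > similarity[1] else "clear", similarity.tolist())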
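
Variant (2), the linear probe, can be sketched in the same spirit: the CLIP image encoder is kept frozen, and only a small linear classifier on top of the 512-dimensional ViT-B/32 image features is trained, for 1,000 gradient steps with a batch size of 10 as stated above. The load_batch helper, the optimizer choice, and the learning rate below are assumptions made for illustration only.

# Sketch of variant (2): a linear probe on frozen CLIP image features.
# load_batch() is hypothetical; it should return 10 preprocessed RGB images
# and their binary cloud labels (1 = cloudy, 0 = clear).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # the encoder stays frozen; only the probe is updated

probe = torch.nn.Linear(512, 2).to(device)      # ViT-B/32 features are 512-d
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)  # assumed settings
criterion = torch.nn.CrossEntropyLoss()

for step in range(1000):                        # 1,000 gradient steps
    images, labels = load_batch(batch_size=10)  # hypothetical data loader
    images, labels = images.to(device), labels.to(device)
    with torch.no_grad():
        features = model.encode_image(images).float()
    logits = probe(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()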
Original language: English
Number of pages: 3
Publication status: Published - 21 Jul 2023
Event: International Geoscience and Remote Sensing Symposium - Pasadena, United States
Duration: 16 Jul 2023 - 21 Jul 2023
https://2023.ieeeigarss.org/index.php

Conference

Conference: International Geoscience and Remote Sensing Symposium
Abbreviated title: IGARSS
Country/Territory: United States
City: Pasadena
Period: 16/07/23 - 21/07/23
Internet address: https://2023.ieeeigarss.org/index.php

Keywords

  • detection
  • cloud presence
  • satellite images
  • RGB-based CLIP vision-language model
