
Deep learning based identification and interpretability research of traditional village heritage value elements: a case study in Hubei Province


Materials and methods

Study area

Hubei Province, located in central China at 108° 21′–116° 07′ east longitude and 29° 05′–33° 20′ north latitude, has a profound historical heritage. It hosted the capital of the Chu State during the Warring States period and served as the cultural center of Chu. Throughout its history, from the Qin, Han and Six Dynasties to the Tang, Song, Ming and Qing Dynasties, Hubei's culture has preserved the essence of Chu culture while assimilating the characteristics of other regions. In particular, Hubei is the province through which the Yangtze River flows the longest, positioning it as a convergence point between China's east, west, south and north. In addition, Hubei's geographic location intersects with major ancient transportation routes, including the Sino-Russian Ten Thousand Mile Tea Road, the Ancient Tea and Horse Trade Road, the Ancient Salt Road connecting Sichuan and Hubei, and the Huguang to Sichuan Immigrant Passage. These ancient routes crisscross Hubei, bringing a wealth of historical and cultural elements to its traditional villages.

Hubei is famous for its diverse and culturally rich traditional villages, which exhibit the unique socio-cultural characteristics of "blending northern and southern influences and incorporating elements from east and west". These villages represent typical settlement development in the Yangtze River Basin. As of May 2023, a total of 270 traditional villages in Hubei have been recognized and included in the prestigious list of Chinese traditional villages [37]. Figure 1 shows that the terrain of Hubei is mainly characterized by mountainous regions in the east, west, and north, while the central area consists of low-lying areas and a partially open basin in the south. Traditional villages are mainly concentrated in the hilly and mountainous areas in the western and eastern parts.

Fig. 1 Distribution of traditional villages in Hubei Province

Data collection and processing

In 2022, our research team, supported by the Department of Housing and Urban–Rural Development of Hubei, embarked on the comprehensive "Survey and Archiving of Traditional Villages". From June to July, we meticulously conducted extensive field research on traditional villages in various regions of Hubei. The survey covered more than 700 villages, 270 of which were designated as national-level traditional villages. The images covered in this article are from these 270 traditional villages (Fig. 1). Our research encompassed field photography, questionnaire distribution, and interviews with local villagers, utilizing a range of devices including drones, cameras, and smartphones. However, given that images captured by these devices may possess varying pixel sizes, we conducted preprocessing during data collection. This preprocessing involved data normalization and image resizing to ensure uniform size and feature representation across all images.
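
A minimal sketch of such preprocessing, assuming PyTorch/torchvision; the 224 × 224 target size and the ImageNet normalization statistics are illustrative assumptions, as the paper does not report the exact values:

```python
from PIL import Image
from torchvision import transforms

# Hypothetical preprocessing pipeline: resize field photos from different
# devices to a common size and normalize pixel values. The 224x224 target
# and the ImageNet statistics are assumptions, not values from the paper.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # unify image size
    transforms.ToTensor(),                             # scale pixels to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # channel-wise normalization
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("village_photo.jpg").convert("RGB")
tensor = preprocess(image)                             # shape: (3, 224, 224)
```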

In this study, we focus on traditional villages in Hubei Province as our research subject and collected over 12,000 images from 270 villages through field surveys. During the selection process, we identified 3805 images that represent the characteristics of traditional villages across the various regions of Hubei Province; these images form the foundation of our dataset and provide a solid basis for an in-depth analysis of the characteristics, cultural heritage, and challenges facing traditional villages in Hubei. Through careful organization and analysis of these data, we aim to improve our understanding of the TVHVE, and we apply advanced DL techniques, such as image classification, to automatically identify and classify the relevant features.

Establishing a classification framework for the TVHVE is important because it allows for the systematic identification of criteria and the accurate categorization of these elements. The rational design and application of classification rules not only contributes to an in-depth understanding of village characteristics and facilitates further analysis, but also serves as a foundation for subsequent research and conservation efforts. Figure 2 shows the classification and label settings of TVHVE. The TVHVE are classified into three categories: environmental elements, architectural elements, and cultural elements. Meanwhile, the indicators for TVHVE include 26 detailed elements that meticulously characterize the essence of traditional villages. This scientifically rigorous categorization framework provides a robust tool and methodology for advancing research on TVHVE [38].

Fig. 2 a Framework of traditional village heritage value elements; b Label classification

Compared to studies that rely on internet-sourced image data of traditional villages, the database utilized in this study is built upon a substantial collection of field-photographed images specific to Hubei. This aspect lends a higher degree of reliability and relevance to our research. To ensure the integrity of the image data, researchers meticulously screened and cleaned tens of thousands of traditional village images, adhering to the principles outlined for categorizing the elements of heritage value as described earlier. Furthermore, this screening method took into consideration the regional characteristics specific to traditional villages in Hubei. The image dataset was categorized and extracted based on these characteristics, effectively reducing the duplication rate of similar images and minimizing computational time required for modeling.

During dataset construction, we meticulously selected training and test sets at a 4:1 ratio, ensuring label consistency [22, 32]. We also considered the quantity of data for each sample type, meeting model computational requirements (Table 2). To enhance recognition accuracy, we took the following steps:

1. Diverse image types: on-site photography captured heritage elements from various angles for comprehensive coverage.

2. Varied image backgrounds: the dataset included images with diverse lighting and weather conditions, improving adaptability.

3. Diverse target scenes: within the same classification, targets of varying sizes were included, enhancing scene recognition.

4. Data augmentation: we applied two augmentation techniques during model training, random rotation and random cropping. These techniques increase the diversity of the training data and improve the generalization ability of the model (a minimal sketch of the split and augmentation follows this list).
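
The following sketch illustrates the 4:1 split and the two augmentation techniques, assuming PyTorch/torchvision and an ImageFolder-style directory with one folder per TVHVE label; the rotation range, crop size, and random seed are illustrative assumptions:

```python
import torch
from torchvision import datasets, transforms

# Training-time augmentation: random rotation and random cropping, as listed
# above. The rotation range and crop size are illustrative assumptions.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Assumes images are organized as dataset_root/<label_name>/<image>.jpg,
# one folder per TVHVE label (hypothetical layout).
full_dataset = datasets.ImageFolder("dataset_root", transform=train_transform)

# 4:1 training/test split (80% / 20%), with a fixed seed for reproducibility.
n_train = int(0.8 * len(full_dataset))
n_test = len(full_dataset) - n_train
train_set, test_set = torch.utils.data.random_split(
    full_dataset, [n_train, n_test],
    generator=torch.Generator().manual_seed(42))
# In practice the test subset would use a deterministic transform
# (resize + normalize only), without augmentation.
```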

The overall workflow of this study, as shown in Fig. 3, consists of four main steps: data collection and classification, data processing, model comparison and selection, and analysis of identification results including data interpretation.

Fig. 3 Research flow chart

In the data collection phase, extensive research and photography of traditional villages was conducted to establish a comprehensive sample database, and a detailed classification of the TVHVE was carried out. Next, in the data pre-processing step, the collected data were manually screened and organized according to the 26 categories of TVHVE. This process involved careful data selection and preparation to construct the corresponding dataset. In the model comparison and selection step, the training performance of four CNN models (ResNet18, the 19-layer Visual Geometry Group network (VGG19), ResNet152, and the 121-layer Dense Convolutional Network (DenseNet121)) was examined in order to identify the most appropriate model for TVHVE detection. Finally, in the data and interpretability analysis step, the recognition results of the test set data on the trained models were evaluated. In addition, interpretability analysis techniques such as semantic clustering and Grad-CAM heat maps were used to gain insights and interpret the results in a meaningful way.

Model selection

The CNN is a widely used DL framework specifically designed for image classification tasks. It consists of feature extraction layers, which perform convolutional computations, and multiple hidden layers. A CNN can automatically extract low-level features from the original input and integrate them into high-level features that serve as the basis for target recognition, giving this network framework powerful recognition performance. In the field of image classification, several classical CNN architectures, including ResNet, VGG, and DenseNet, have gained significant popularity and have demonstrated their effectiveness in various image recognition tasks.

The VGG model stacks successive 3 × 3 convolution kernels to achieve the receptive field of larger kernels while increasing network depth, improving feature capture efficiency. In contrast, ResNet introduces skip connections and residual learning to tackle the optimization challenges of deep neural networks and alleviate the vanishing gradient problem. DenseNet enhances performance and reduces parameters through dense connections and feature reuse, maximizing the flow of information between layers. Together, these architectures represent significant advances in deep learning techniques.

In this study, we selected four CNN image recognition models commonly used in the architectural field: ResNet152, VGG19, ResNet18, and DenseNet121. They represent classical and commonly used model architectures in deep learning. ResNet152 and VGG19 are relatively deep networks with a large number of layers and parameters, whereas ResNet18 and DenseNet121 are shallower and have fewer parameters. Comparing models of different depth and complexity allows their trade-offs between performance and resource consumption to be evaluated.
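
A minimal sketch of instantiating the four compared architectures for the 26-class TVHVE task, assuming torchvision (>= 0.13) models with the final classification layer replaced; the use of ImageNet-pretrained weights is an assumption, not something stated in the paper:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 26  # number of TVHVE labels in the classification framework

def build_model(name: str) -> nn.Module:
    """Build one of the four compared CNNs and adapt its classifier head.
    ImageNet-pretrained weights are an illustrative assumption."""
    if name == "resnet18":
        model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "resnet152":
        model = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "vgg19":
        model = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)
    elif name == "densenet121":
        model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"Unknown model: {name}")
    return model
```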

Model training

All model training and testing procedures in this study were performed on a cloud computing platform provided by FEATURIZE [39]. The rented machine was equipped with an Intel Xeon Gold 6142 CPU and a GeForce RTX 3080 GPU with 10.5 GB of available video memory. TensorFlow and PyTorch, which are popular DL programming frameworks, were used for the experiments.

To ensure a fair comparison between different models, the hyperparameters for model training were standardized in the experiments. The Adam optimization algorithm was used as the gradient optimization algorithm for training all models, with a learning rate of 0.001. The number of training iterations was set to 100, and the loss function chosen was Cross Entropy.
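
A minimal training-loop sketch with these hyperparameters (Adam, learning rate 0.001, 100 epochs, cross-entropy loss), assuming PyTorch, the `build_model` helper and `train_set` from the earlier sketches; the batch size and device handling are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_model("resnet18").to(device)           # any of the four compared models

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # batch size assumed

criterion = nn.CrossEntropyLoss()                    # loss function stated above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # optimizer and LR stated above

for epoch in range(100):                             # 100 training iterations
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    print(f"epoch {epoch + 1}: loss {running_loss / len(train_loader.dataset):.4f}")
```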

Evaluation criterion

Accuracy is a commonly used metric to evaluate the correctness of a model. However, when dealing with unbalanced data sets, accuracy alone may not be an appropriate metric to evaluate the results. Therefore, in this study, we used four evaluation metrics: Accuracy, Precision, Recall, and F1 score. Accuracy represents the proportion of correctly predicted samples, both positive and negative, out of all samples. Precision measures the proportion of correctly predicted positive samples out of all samples predicted to be positive. Recall quantifies the proportion of correctly predicted positive samples out of all actual positive samples. The F1 score combines precision and recall, seeking a balance between the two to achieve the optimal trade-off.

By calculating the average of these evaluation metrics, we can select the best performing model based on its overall performance. This approach provides a comprehensive evaluation of model effectiveness and takes into account the impact of sample imbalance.

$$Accuracy=\frac{{T}_{P}+{T}_{N}}{{T}_{P}+{T}_{N}+{F}_{P}+{F}_{N}}$$

(1)

$$Precision=\frac{{T}_{P}}{{T}_{P}+{F}_{P}}$$

(2)

$$Recall=\frac{{T}_{P}}{{T}_{P}+{F}_{N}}$$

(3)

$$F1\ score=2\times \frac{Precision\times Recall}{Precision+Recall}$$

(4)

where TP (True Positive) denotes positive samples that the model classifies as positive, TN (True Negative) denotes negative samples classified as negative, FP (False Positive) denotes negative samples classified as positive, and FN (False Negative) denotes positive samples classified as negative.
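
A minimal sketch of computing these four metrics on the test set with scikit-learn, assuming `y_true` and `y_pred` are the ground-truth and predicted label arrays; macro averaging over the 26 classes is an assumption, since the paper reports averaged metrics without naming the averaging scheme:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: integer label arrays collected over the test set (assumed).
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # averaging scheme assumed

print(f"Accuracy {accuracy:.3f}  Precision {precision:.3f}  "
      f"Recall {recall:.3f}  F1 {f1:.3f}")
```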

AUC and PRC are key metrics used to assess the classification performance of a model. AUC, the area under the ROC curve, summarizes the relationship between the true positive rate and the false positive rate, while PRC, the area under the precision-recall curve, captures the trade-off between precision and recall. Both provide a comprehensive assessment of model performance and are particularly suitable for unbalanced datasets. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing, values close to 0 indicate poor performance, and values close to 1 denote excellent performance; it evaluates a model's overall proficiency independently of threshold choices. PRC also ranges from 0 to 1, with higher values indicating better performance. Unlike AUC, PRC focuses more on the positive classes and is particularly useful for evaluating unbalanced datasets.
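
A minimal sketch of computing multiclass AUC and the area under the precision-recall curve (average precision) with scikit-learn, assuming `y_true` holds integer labels and `y_score` holds per-class softmax probabilities of shape (n_samples, 26); one-vs-rest macro averaging is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

n_classes = 26
y_true_bin = label_binarize(y_true, classes=np.arange(n_classes))

# Area under the ROC curve, one-vs-rest, macro-averaged (averaging assumed).
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")

# Area under the precision-recall curve, approximated by average precision,
# macro-averaged over the 26 classes.
prc = average_precision_score(y_true_bin, y_score, average="macro")

print(f"AUC {auc:.3f}  PRC {prc:.3f}")
```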

Interpretability analysis based on image classification

Image classification interpretability analysis is a valuable approach that uses visualization techniques to gain deeper insights into model classification outcomes. It aids in understanding the relationships between categories, pinpointing misclassifications, and exploring image features. In this study, we employed two common methods, semantic feature visualization and Grad-CAM heat maps, to demystify the inner workings of our deep learning model, shedding light on the "black box" and offering intuitive insight into the similarities and distinctions among elements in the TVHVE classification and recognition task.

Dimensionality reduction visualization of semantic features in image classification involves reducing high-dimensional image feature vectors to a lower-dimensional space (e.g., 2D or 3D) and visualizing them. This method aims to provide visual insights into the clustering, distribution, and distinctions among image data categories. It encompasses four key steps: feature extraction using a trained CNN model to obtain high-dimensional feature vectors, applying a dimensionality reduction algorithm (e.g., t-SNE) to condense these vectors while preserving essential information, using visualization techniques like scatter plots or 3D graphs to display feature distributions, and analyzing the results to interpret classification outcomes, feature representations, category relationships, and potential outliers.
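
A minimal sketch of these four steps, assuming the features are taken from the penultimate layer of the trained ResNet18 from the earlier sketches and that a `test_loader` exists; the layer choice and the t-SNE settings are illustrative assumptions:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 1: extract high-dimensional feature vectors from the penultimate layer.
# Replacing the final FC layer with Identity exposes the 512-d ResNet18 features.
feature_extractor = build_model("resnet18")        # in practice, load the trained weights
feature_extractor.fc = torch.nn.Identity()
feature_extractor.eval().to(device)

features, labels = [], []
with torch.no_grad():
    for images, targets in test_loader:            # test_loader assumed to exist
        feats = feature_extractor(images.to(device))
        features.append(feats.cpu().numpy())
        labels.append(targets.numpy())
features = np.concatenate(features)
labels = np.concatenate(labels)

# Step 2: reduce the feature vectors to 2D with t-SNE (perplexity assumed).
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# Steps 3-4: scatter plot colored by TVHVE label for visual cluster analysis.
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="tab20")
plt.title("t-SNE of CNN semantic features")
plt.show()
```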

In addition, Grad-CAM heat map is an interpretable method used to interpret the prediction results of CNN in image classification tasks. It generates a heat map that visualizes the attention paid by the model to different regions of the image during the classification process. Higher heat values are typically associated with regions that have a strong influence on the predicted categories. Therefore, heat maps help to understand how the model makes classification decisions and serve as a visual and explanatory tool for the model's prediction process. It is important to note that Grad-CAM is an interpretability technique that explains the prediction results of a trained CNN model for a specific image. It does not modify or tune the model itself, but provides an interpretation of the model's predictions.
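
A minimal Grad-CAM sketch using forward and backward hooks on the last convolutional block of the trained ResNet18 (`model`, `tensor`, and `device` come from the earlier sketches); the choice of target layer and the overlay step are illustrative assumptions, and ready-made libraries such as pytorch-grad-cam provide more complete implementations:

```python
import torch
import torch.nn.functional as F

activations, gradients = {}, {}

def forward_hook(module, inputs, output):
    activations["value"] = output.detach()

def backward_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

# Hook the last convolutional block of ResNet18 (target layer is an assumption;
# for VGG19 or DenseNet121 a different layer would be chosen).
target_layer = model.layer4[-1]
target_layer.register_forward_hook(forward_hook)
target_layer.register_full_backward_hook(backward_hook)

model.eval()
image = tensor.unsqueeze(0).to(device)     # preprocessed image from the earlier sketch
logits = model(image)
class_idx = logits.argmax(dim=1).item()    # explain the predicted class

# Backpropagate the predicted-class score to obtain gradients at the target layer.
model.zero_grad()
logits[0, class_idx].backward()

# Weight each feature map by its spatially averaged gradient and combine them.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
cam = F.relu((weights * activations["value"]).sum(dim=1))     # (1, H, W)
cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
# `cam` can now be overlaid on the input image as a heat map.
```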
