Jun-Jie Zhu, Meiqi Yang, Jinyue Jiang, Yiming Bai, Zhiyong Jason Ren
Department of Civil and Environmental Engineering and Andlinger Center for Energy and the Environment, Princeton University
Research Gap
Our discussion (Zhu et al., 2023) on the general use of ChatGPT in environmental research was the Most Read Article of the past year in the flagship environmental journal Environmental Science & Technology, highlighting the strong interest in large language models (LLMs) within the environmental science and engineering (ESE) community. Several perspective papers have identified promising opportunities and benefits of applying LLMs to environmental topics, such as water treatment (Egbemhenghe et al., 2023), water resource management (Ray, 2023), hydrology and earth sciences (Foroumandi et al., 2023), carbon capture (Lei et al., 2023), global warming (Biswas, 2023), life cycle assessment (Preuss et al., 2024), and environmental psychology (Yuan et al., 2024). However, dedicated research on LLMs in ESE remains scarce. An essential question arises: “How proficient are LLMs in addressing expert-level environmental questions?” Question answering (QA) is a core strength of LLMs, so they must demonstrate a high capability in answering domain-specific ESE questions to be considered reliable for specialized environmental applications. To date, no study has evaluated the performance of LLMs on expert-level, environment-specific QA. This study provides the first such analysis in the ESE field using expert-level, domain-specific textbooks. Our main objectives are to understand LLM performance and to identify patterns and insights that can guide further optimization and development.
Domain Data Curation, Model Development, and Evaluation
In this early study, a QA dataset was curated from the open-access textbook Biological Wastewater Treatment: Examples and Exercises (Lopez-Vazquez et al., 2023). A total of 286 long-form QAs (i.e., what/why/how questions without calculations) were selected as evaluation queries. These questions span various aspects of environmental engineering, such as treatment development, nitrogen removal, and process modeling.
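For illustration, each curated QA item could be stored as a simple record like the one below. This is a minimal sketch only: the field names, the example question, and the 0–1 labels are our own assumptions about how such an entry might be structured (long-form question, reference answer, topic label, and two difficulty labels), not the exact schema used in the study.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class QARecord:
    """One long-form QA item curated from a textbook (illustrative schema)."""
    qa_id: int
    topic: str                # e.g., "nitrogen removal", "process modeling"
    question: str             # what/why/how question without calculations
    reference_answer: str     # reference answer text drawn from the book
    knowledge_depth: str      # "simple" | "medium" | "hard"
    logical_complexity: str   # "simple" | "medium" | "hard"

# Hypothetical example entry (not taken from the textbook)
example = QARecord(
    qa_id=1,
    topic="nitrogen removal",
    question="Why is an anoxic zone placed upstream of the aerobic zone "
             "in a pre-denitrification process?",
    reference_answer="...",  # the reference text would be filled from the book
    knowledge_depth="medium",
    logical_complexity="medium",
)

with open("qa_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(asdict(example)) + "\n")
```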
We are continuously compiling QA data and training materials; the early study (partial results are presented in this blog) was conducted using a small-scale dataset of 22 open-access books from the International Water Association (IWA). To convert these books into trainable materials, textual noise was reduced through data cleaning, and lengthy texts were divided into shorter segments, each serving as a sample for GPT to learn wastewater knowledge by predicting the next token. Three fine-tuning datasets (FTDs) were created from 1, 6, or 22 books. For example, the largest training set includes approximately 6,800 samples (2.9 million tokens), and the largest validation set includes about 440 samples. These data were used to develop the corresponding fine-tuned models (FTMs, 3 epochs) based on GPT-3.5. We are expanding the dataset to a moderate scale (estimated at 60 million tokens), containing over 150 books, 10,000 research articles, and other relevant textual data (e.g., Wikipedia and the Tulu V2 Mix dataset) in the environmental domains. Based on the insights and patterns learned from the small-scale study, we expect to achieve better outcomes in the future.
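As a rough sketch of the segmentation step described above, cleaned book text can be split into token-bounded segments and written to a JSONL file in the chat format accepted by OpenAI's GPT-3.5 fine-tuning endpoint. The chunk size, overlap, system prompt, and file names below are illustrative assumptions, not the settings used in this study.

```python
import json
import re
import tiktoken  # tokenizer library for GPT-3.5/GPT-4 encodings

enc = tiktoken.get_encoding("cl100k_base")

def clean_text(raw: str) -> str:
    """Light cleaning: collapse whitespace and strip the result."""
    return re.sub(r"\s+", " ", raw).strip()

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64):
    """Split cleaned text into overlapping token windows (sizes are assumptions)."""
    tokens = enc.encode(text)
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            yield enc.decode(window)

def write_finetune_jsonl(book_texts, out_path="ftd_samples.jsonl"):
    """Write one chat-format training sample per text segment."""
    with open(out_path, "w", encoding="utf-8") as f:
        for raw in book_texts:
            for segment in chunk_by_tokens(clean_text(raw)):
                sample = {
                    "messages": [
                        {"role": "system",
                         "content": "You are a wastewater treatment domain expert."},
                        {"role": "user",
                         "content": "Continue the following textbook passage."},
                        {"role": "assistant", "content": segment},
                    ]
                }
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The resulting JSONL file can then be uploaded to the fine-tuning service; separate files would be prepared for the training and validation splits.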
Three human experts evaluated model performance using two key metrics: relevance (whether key topics are presented) and factuality (whether the content is factually correct), reflecting the core aspects of model generation. To assess performance from different perspectives, we also rated each QA’s difficulty (simple, medium, or hard) along two dimensions: knowledge depth and logical complexity.
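A minimal sketch of how the expert scores might be aggregated is shown below: average the three raters' relevance and factuality scores per QA, then summarize by topic or by difficulty pair. The column names, the 0–1 scale, and the single example row are assumptions for illustration only.

```python
import pandas as pd

# Each row: one expert's scores for one QA (illustrative columns)
scores = pd.DataFrame([
    {"qa_id": 1, "rater": "A", "topic": "nitrogen removal",
     "knowledge_depth": "medium", "logical_complexity": "medium",
     "relevance": 0.5, "factuality": 0.75},
    # ... remaining expert ratings would be appended here
])

# Average the three experts per QA
per_qa = (scores
          .groupby(["qa_id", "topic", "knowledge_depth", "logical_complexity"],
                   as_index=False)[["relevance", "factuality"]]
          .mean())

# Summaries of the kind shown in the figures: by topic and by difficulty pair
by_topic = per_qa.groupby("topic")[["relevance", "factuality"]].mean()
by_difficulty = (per_qa
                 .groupby(["knowledge_depth", "logical_complexity"])
                 [["relevance", "factuality"]].mean())
print(by_topic.round(2))
```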
Results from the Early Study
Performance variability across domain topics
GPT-4 achieved relevance and factuality scores of 0.64 and 0.79, respectively, across the 286 QAs (Figure 1(a)). The model’s performance varies across sub-domains, reflecting its sensitivity to topic complexity and the problem-solving capacity required. Both relevance and factuality rankings indicate that the model struggles with “nitrogen removal” questions (relevance = 0.49, factuality = 0.71). In contrast, it handles “wastewater treatment development” questions more easily, likely because of their straightforward nature. Further analysis suggests that higher performance is associated with less challenging questions (Figure 1(b)). For example, “nitrogen removal” includes 10 medium-medium and 3 hard-hard QAs, whereas “wastewater treatment development” features 16 simple-simple QAs (53% of its sample) and only 4 medium-medium QAs. Interestingly, “process control” has 58% of its QAs at medium-medium or harder levels but still achieves relatively high performance, suggesting that while difficulty level strongly affects performance, it is not the sole determinant.
Figure 1. (a) Performance (relevance and factuality) of GPT-4 in responding to expert-level QA across seven wastewater-domain topics (sample size > 20). The star mark represents GPT-4’s overall performance. (b) Distribution of QA counts across seven domain topics categorized by different difficulty levels (knowledge depth and logical complexity).
Performance degradation across difficulty levels
A tradeoff between relevance and factuality is evident across the three FTMs. FTM1 (fine-tuned with 1 book) maintains higher relevance, whereas FTM3 (fine-tuned with 22 books) exhibits higher factuality, suggesting that increasing the volume of fine-tuning materials may enhance factuality while potentially reducing relevance (Figure 2(a)). Further improvement, such as introducing instruction materials, is needed to mitigate issues related to catastrophic interference. Comparing the models’ performance on subsets of QAs selected by GPT-4’s scores reveals a shift in this tradeoff (a filtering sketch follows Figure 2). For QAs (n = 85) where GPT-4 shows lower relevance and factuality (both ≤ 0.8), all three FTMs show similar relevance (≈ 0.35), but FTM3 excels in factuality (≈ 0.54), likely because its larger fine-tuning dataset better optimizes the model weights for handling more challenging questions.
There is a clear decline in performance across all models from simple to hard subset QAs (Figure 2(b)). However, this decline varies in nature. Specifically, GPT-4 sees a significant drop in relevance (from 0.77 to 0.54) compared to a minor drop in factuality (from 0.86 to 0.84). A similar pattern is observed in FTM3 (relevance change = -0.19, factuality change = -0.03). In contrast, FTM1 experiences substantial decreases in both relevance (-0.22) and factuality (-0.18), highlighting the crucial role of fine-tuning material volume in mitigating factuality degradation.
Figure 2. (a) Degradation in performance (relevance and factuality) of the three FTMs based on GPT-4’s varying levels of performance across all questions, questions where both metrics ≤ 0.9, and questions where both metrics ≤ 0.8. (b) Performance (relevance and factuality) degradation across QAs categorized by difficulty levels from simple-simple (orange color) to medium-hard (green color) based on knowledge depth and logical complexity.
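To make the subsetting and degradation comparisons explicit, the sketch below filters QAs by GPT-4’s scores (both metrics ≤ 0.8) and compares model means on that subset, and it also computes the simple-to-hard score changes discussed above. Here `results` is an assumed per-QA table with one relevance/factuality column pair per model, and the difficulty encoding is likewise an assumption.

```python
import pandas as pd

def compare_on_hard_subset(results: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Mean relevance/factuality of each model on the QAs where GPT-4
    scored at or below `threshold` on both metrics (column names assumed)."""
    subset = results[(results["gpt4_relevance"] <= threshold)
                     & (results["gpt4_factuality"] <= threshold)]
    models = ["gpt4", "ftm1", "ftm3"]
    return pd.DataFrame({
        m: {"relevance": subset[f"{m}_relevance"].mean(),
            "factuality": subset[f"{m}_factuality"].mean()}
        for m in models
    }).T

def simple_to_hard_deltas(results: pd.DataFrame, model: str = "gpt4") -> dict:
    """Change in mean scores from the simple-simple subset to the
    medium-hard subset (difficulty encoding is an assumption)."""
    easy = results[(results["knowledge_depth"] == "simple")
                   & (results["logical_complexity"] == "simple")]
    hard = results[(results["knowledge_depth"] == "medium")
                   & (results["logical_complexity"] == "hard")]
    return {
        "relevance_change": hard[f"{model}_relevance"].mean()
                            - easy[f"{model}_relevance"].mean(),
        "factuality_change": hard[f"{model}_factuality"].mean()
                             - easy[f"{model}_factuality"].mean(),
    }
```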
A QA Case with a Hard Question
This example highlights the distinct responses of FTM1 and FTM3 to a medium-hard question (Figure 3). Although the scenario presented is extreme, the underlying knowledge required parallels that of a real membrane bioreactor (MBR) process. FTM1 failed to provide relevant MBR information, made a factually incorrect statement, and had numerous formatting issues. In contrast, FTM3 showed a better capability in providing an accurate definition and relevant analysis, generally covering more key points and offering specific MBR descriptions. While issues with relevance and factuality remain, the example demonstrates a potential improvement from FTM1 to FTM3, especially for harder domain questions.
Figure 3. A QA example exhibiting varying model responses to a medium-hard question.
Future Work
This early investigation provides crucial insights and lessons for minimizing errors and enhancing accuracy in further research. It establishes a foundational framework for advancing robust, domain-specific LLMs in the ESE field. The study highlights that even advanced LLMs like GPT-4 have room for improvement in providing accurate answers to expert-level environmental questions, particularly those at higher difficulty levels. A comprehensive manuscript has been prepared and submitted for peer review, with updates available on the lead author’s webpage. Our ultimate goal is to create models capable of delivering accurate key points and information-rich responses for designing treatment processes and technological applications in ESE. Addressing the challenge of balancing accuracy and richness in responses is critical for future investigations. We aim to continually curate high-quality domain materials and well-structured instructional data to enhance relevance, mitigate catastrophic interference, and improve factuality.