This paper presents a novel study exploring the capabilities of ChatGPT, a state-of-the-art LLM, in static analysis tasks, namely static bug detection and false-positive warning removal. In our evaluation, we focus on two typical and critical bug types targeted by static bug detection, i.e., Null Dereference and Resource Leak, as our subjects. We employ Infer, a well-established static analyzer, to aid the collection of these two bug types from 10 open-source projects; the resulting dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that ChatGPT can achieve remarkable performance in both tasks. In static bug detection, ChatGPT achieves accuracy and precision of up to 68.37% and 63.76% for detecting Null Dereference bugs and 76.95% and 82.73% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For false-positive warning removal, ChatGPT reaches a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.
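To give a flavor of the detection task the paper evaluates, the sketch below asks ChatGPT whether a small Java snippet contains either of the two studied bug types. It is a minimal sketch assuming the OpenAI Python client; the embedded snippet and prompt wording are simplified illustrations, not the paper's actual prompt design or dataset.

```python
# Minimal sketch of LLM-based static bug detection, assuming the
# OpenAI Python client; the prompt below is a simplified stand-in,
# not the prompt design evaluated in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JAVA_SNIPPET = """
File f = getConfigFile();  // may return null
BufferedReader r = new BufferedReader(new FileReader(f));  // possible Null Dereference
String line = r.readLine();  // 'r' is never closed: possible Resource Leak
"""

prompt = (
    "Does the following Java code contain a Null Dereference or a "
    "Resource Leak bug? Answer YES or NO for each, with one sentence "
    "of justification.\n\n" + JAVA_SNIPPET
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```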
Journal
History-Driven Fuzzing for Deep Learning Libraries
Nima Shiri Harzevili, Mohammad Mahdi Mohajer, Moshi Wei, Hung Viet Pham, and Song Wang
ACM Transactions on Software Engineering and Methodology (TOSEM 2024), Aug 2024
Recently, many Deep Learning (DL) fuzzers have been proposed for API-level testing of DL libraries. However, they either perform unguided input generation (e.g., not considering the relationship between API arguments when generating inputs) or support only a limited set of corner-case test inputs. Furthermore, many developer APIs crucial for library development remain untested: unlike end-user APIs, they are typically not well documented and lack clear usage guidelines, making them a more challenging target for automated testing. To fill this gap, we propose a novel fuzzer named Orion, which combines guided test input generation and corner-case test input generation based on a set of fuzzing heuristic rules constructed from historical data known to trigger critical issues in the underlying implementation of DL APIs. To extract the fuzzing heuristic rules, we first conduct an empirical study of the root causes of 376 vulnerabilities in two of the most popular DL libraries, PyTorch and TensorFlow. We then construct the fuzzing heuristic rules from the root causes of these historical vulnerabilities. Using these rules, Orion generates corner-case test inputs for API-level fuzzing. In addition, we extend the seed collection of existing studies to include test inputs for developer APIs. Our evaluation shows that Orion reports 135 vulnerabilities in the latest releases of TensorFlow and PyTorch, 76 of which were confirmed by the library developers. Among the 76 confirmed vulnerabilities, 69 were previously unknown and 7 have already been fixed; the rest are awaiting further confirmation. For end-user APIs, Orion detected 45.58% and 90% more vulnerabilities in TensorFlow and PyTorch, respectively, than the state-of-the-art conventional fuzzer, DeepRel. Compared to the state-of-the-art LLM-based DL fuzzer, AtlasFuzz, Orion detected 13.63% more vulnerabilities in TensorFlow and 18.42% more in PyTorch. Regarding developer APIs, Orion stands out by detecting 117% more vulnerabilities in TensorFlow and 100% more in PyTorch than the most relevant fuzzer designed for developer APIs, FreeFuzz.
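To illustrate the general idea of heuristic-rule-driven corner-case generation (the spirit of Orion's rules, not the actual rule set extracted in the paper), the sketch below perturbs one argument of a PyTorch API with extreme values of the kind historically linked to crashes, and records Python-level rejections. The specific values and target API are illustrative assumptions.

```python
# Illustrative corner-case fuzzing harness in the spirit of
# history-driven heuristic rules; the extreme values below are
# generic examples, not Orion's actual rules.
import torch

# Argument values of the kind historically linked to crashes in
# DL libraries: zero, negative, and huge dimensions.
CORNER_CASE_DIMS = [0, -1, -2**31, 2**63 - 1]

def fuzz_zeros(dims):
    """Call torch.zeros with each corner-case dimension and log the outcome."""
    for d in dims:
        try:
            t = torch.zeros(d)
            print(f"dim={d}: ok, shape={tuple(t.shape)}")
        except (RuntimeError, OverflowError, ValueError) as e:
            # Python-level exceptions are graceful rejections; a hard
            # process crash here would indicate a potential vulnerability.
            print(f"dim={d}: rejected ({type(e).__name__})")

if __name__ == "__main__":
    fuzz_zeros(CORNER_CASE_DIMS)
```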
Workshop
Fairness Analysis of Machine Learning-Based Code Reviewer Recommendation
Mohammad Mahdi Mohajer, Alvine Boaye Belle, Nima Shiri Harzevili, Junjie Wang, Hadi Hemmati, Song Wang, and Zhen Ming Jiang
5th International Workshop on Algorithmic Bias in Search and Recommendation (Bias@SIGIR2024), Aug 2024
Assessing the Impact of GPT-4 Turbo in Generating Defeaters for Assurance Cases
Kimya Khakzad Shahandashti, Mithila Sivakumar, Mohammad Mahdi Mohajer, Alvine Boaye Belle, Song Wang, and Timothy Lethbridge
In Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (FORGE 2024), Lisbon, Portugal, Aug 2024
Assurance cases (ACs) are structured arguments that support verifying that a system correctly implements its non-functional requirements (e.g., safety, security), thereby helping prevent system failures, which may result in catastrophic outcomes (e.g., loss of lives). ACs support the certification of systems in compliance with industrial standards, e.g., DO-178C and ISO 26262. Identifying defeaters, i.e., arguments that challenge these ACs, is crucial for enhancing ACs’ robustness and confidence. To automatically support that task, we propose a novel approach that explores the potential of GPT-4 Turbo, an advanced Large Language Model (LLM) developed by OpenAI, in identifying defeaters within ACs formalized using the Eliminative Argumentation (EA) notation. Our preliminary evaluation assesses the model’s ability to comprehend and generate arguments in this context, and the results show that GPT-4 Turbo is highly proficient in EA notation and can generate different types of defeaters.
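A hedged sketch of the kind of interaction the study evaluates: prompting GPT-4 Turbo to propose defeaters for a small claim. The claim and prompt wording are illustrative assumptions, not the study's actual Eliminative Argumentation materials.

```python
# Minimal sketch of defeater generation with GPT-4 Turbo via the
# OpenAI Python client; the claim and prompt are illustrative,
# not the study's actual EA-formalized assurance case.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLAIM = "C1: The braking subsystem responds within 50 ms under all load conditions."

prompt = (
    "You are assisting with an assurance case expressed in Eliminative "
    "Argumentation (EA) notation. List three plausible defeaters "
    "(rebutting, undermining, or undercutting) for the claim below.\n\n"
    + CLAIM
)

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```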