This appendix provides a detailed overview of the literature review process and the intermediate results for our study on LLM-based autonomous testing agents.
The data is organized into directories representing different stages of the review process.
Initial search:
- Contents:
- Search queries used for each digital library.
- Excel sheets containing all retrieved studies and decisions on inclusion/exclusion for the initial screening.
- Purpose:
Documents the first step of our review: identifying potential papers using systematic search queries.
Detailed (full-text) screening:
- Contents:
- Excel sheet listing all studies that passed the initial screening (43 studies).
- Decisions on whether each study was included after full-text screening.
- Purpose:
Tracks the detailed evaluation of each candidate paper to determine eligibility for the review.
Backward snowballing:
- Contents:
- Excel sheet listing all studies from the detailed screening (18 studies).
- Papers considered during backward snowballing (i.e., references of included papers), together with the inclusion/exclusion decisions.
- Purpose:
Ensures coverage of additional relevant studies that may not have appeared in the initial search.
Final set of included studies:
- Contents:
- Excel sheet listing all final included studies (21 studies).
- Purpose:
Provides a consolidated record of the studies that form the basis of our review and analysis.
The Excel sheet (Primary_Sources.xlsx) contains the detailed classification of all 21 studies included in our systematic literature review.
It provides a structured view of study metadata, agent characteristics, testing targets, and evaluation details.
This sheet makes our classification process fully transparent, enabling:
- Easy replication of the literature review.
- Filtering studies by agent type, autonomy, oracle, or other dimensions (see the sketch below).
- Reference for future meta-analyses or research synthesis.
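As an illustration of the filtering use case, the following sketch loads the sheet with pandas and selects studies by autonomy level and oracle type. It assumes the workbook's column headers and cell values match the documented names verbatim (e.g., "Level of autonomy", "Fully autonomous") and that pandas with an Excel engine such as openpyxl is available.

```python
import pandas as pd

# Load the classification sheet (column names assumed to match the table below).
studies = pd.read_excel("Primary_Sources.xlsx")

# Studies classified as fully autonomous.
autonomous = studies[studies["Level of autonomy"] == "Fully autonomous"]

# Studies whose oracle includes crash detection; the Oracle column may list
# several values, so a substring match is used here.
crash_oracle = studies[
    studies["Oracle"].str.contains("Simple crash detection", na=False)
]

print(autonomous[["Paper Nr.", "Title", "Year"]])
print(len(crash_oracle), "studies rely on crash detection")
```

The table below describes each column of the sheet.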
| Column | Description |
|---|---|
| Paper Nr. | Sequential number assigned to each paper. |
| Paper Link | Direct link to the paper. |
| Title | Full title of the paper. |
| Year | Year of publication. |
| Publication Venue | Conference, journal, or workshop name. |
| Publication Type | Type of publication (conference, journal, workshop, preprint). |
| Application domain | Domain where AI is applied for testing (e.g., web, mobile, API, embedded). |
| Testing Target | Target system or component under test. |
| Testing focus | Focus of the testing, e.g., functional, usability, or performance. |
| Agent Framework | Whether an agent framework is used and, if so, which one (e.g., AutoGen). |
| Number of LLM Agents | Total number of LLM instances used. |
| Testing framework / automation library | Automation tools used for executing tests. |
| Other tools | Additional tools integrated into the testing process. |
| LLM used | The Large Language Model(s) employed. |
| Fine-tuning done | Whether LLMs were fine-tuned for the study. |
| Agent architecture | One of: Single-agent iterative; Single autonomous agent + auxiliary LLM utilities; Multi-agent collaborative; Multi-agent independent. |
| Agent collaboration | For multi-agent collaborative setups, the collaboration mechanism: message passing; shared memory; orchestrator. |
| Level of autonomy | One of: Fully human-specified goals; Semi-autonomous; Fully autonomous. |
| Oracle | One or more of: Explicit; LLM intrinsic; System specification; Simple crash detection; Human-In-the-Loop; Metric-based. |
| Granularity of Actions | Low-level (click, API call) or high-level (scenario execution, task). |
| State Representation for LLM | One or more of: Complete Structural State; Filtered Context State; Visual State; Symbolic/Abstracted State. |
| Comments | Any additional notes about the study. |
| Number of systems evaluated on | How many systems the study evaluated its approach on. |
| System type | Industrial, open-source, or academic systems. |
| Evaluation Metric | Metrics used to assess testing performance (e.g., coverage, faults found, execution time). |
| Baseline comparison | Whether the study compares results to existing methods or baselines. |
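For meta-analyses or research synthesis, the classification can be summarised per dimension. The sketch below tallies a few of the columns listed above; it again assumes the documented column names, and the ";" delimiter used to split multi-valued cells (such as Oracle) is an assumption that may need adjusting to the actual cell format.

```python
import pandas as pd

studies = pd.read_excel("Primary_Sources.xlsx")

# Single-valued dimensions: a plain value count per column is enough.
for column in ["Publication Type", "Agent architecture", "Level of autonomy"]:
    print(studies[column].value_counts(dropna=False), "\n")

# Multi-valued dimension (e.g., Oracle): split each cell into one row per
# value before counting. The ";" separator is an assumed delimiter.
oracles = (
    studies["Oracle"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)
print(oracles.value_counts())
```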