A Streamlit-based web application for finding grey literature on the web using Google Custom Search API. Greylitsearcher helps researchers discover and collect relevant documents from specific websites with advanced search capabilities.
Greylitsearcher is designed to systematically search for grey literature (reports, white papers, technical documents, etc.) across multiple websites. It supports up to three priority-based search queries per website and can retrieve up to 40 results per site, making it ideal for comprehensive literature reviews and research data collection.
- Multi-Query Search: Up to 3 different search queries with priority levels (1-3)
- Multi-Website Support: Search across multiple websites simultaneously
- Advanced Search Options:
- All these words (AND operator)
- Exact phrase matching
- Any of these words (OR operator)
- Exclude words (NOT operator)
- Rate Limit Handling: Automatic fallback across multiple Google Search API keys
- CSV Export: Download search results as CSV files
- Airtable Integration: Optional integration to send results directly to Airtable via data processor
- Priority-Based Results: Results are tagged with priority levels for easy filtering
- Python 3.8+
- Google Custom Search API keys (at least 1, recommended 3 for rate limit handling)
- Google Custom Search Engine IDs (CX)
- Clone or navigate to the Greylitsearcher directory:

```bash
cd Greylitsearcher
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
Set up Google Custom Search API:
- Go to Google Cloud Console
- Create a new project or select an existing one
- Enable the Custom Search API
- Create API keys (recommended: 3 keys for rate limit handling)
- Create Custom Search Engines at Google Programmable Search
- Note your Search Engine IDs (CX)
- Configure Streamlit secrets:
  Create a `.streamlit/secrets.toml` file in your project root:
```toml
# App Password (required for access)
APP_PASSWORD = "your_secure_password_here"

# Google Custom Search API
GS1_CX = "your_search_engine_id_1"
GS1_KEY = "your_google_api_key_1"
GS2_KEY = "your_google_api_key_2"
GS3_KEY = "your_google_api_key_3"

# Airtable Integration (optional)
AIRTABLE_TOKEN = "your_airtable_personal_access_token"
AIRTABLE_BASE_ID = "your_base_id"
AIRTABLE_TABLE_NAME = "raw_results" # Optional: table name or table ID
```

Note: The `.streamlit` directory should be in your project root, and `secrets.toml` should not be committed to git (add it to `.gitignore`).
Or if using Streamlit Cloud, add these as secrets in your app settings.
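For reference, a minimal sketch of how these secrets can be read inside the app with Streamlit's `st.secrets` API (the fallback list over three keys is illustrative, not necessarily how main.py does it):

```python
import streamlit as st

# Custom Search Engine ID and whichever API keys are configured.
cx = st.secrets["GS1_CX"]
api_keys = [
    st.secrets[name]
    for name in ("GS1_KEY", "GS2_KEY", "GS3_KEY")
    if name in st.secrets  # tolerate fewer than three keys
]
```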
Run the application:

```bash
streamlit run main.py
```

The application will open in your default web browser at http://localhost:8501.
When you first open the app, you'll be prompted to enter a password. This password is set in your `secrets.toml` file as `APP_PASSWORD`. Once authenticated, you can use the app normally. A logout button is available in the sidebar if you need to log out.
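A hypothetical sketch of such a password gate, assuming `st.session_state` is used to remember authentication (the actual main.py may differ):

```python
import streamlit as st

if not st.session_state.get("authenticated", False):
    password = st.text_input("Password", type="password")
    if password == st.secrets["APP_PASSWORD"]:
        st.session_state["authenticated"] = True
        st.rerun()
    st.stop()  # halt rendering until the user authenticates

if st.sidebar.button("Log out"):
    st.session_state["authenticated"] = False
    st.rerun()
```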
- Configure Search Queries:
- Search 1 (Priority 1): Primary search query (expanded by default)
- Search 2 (Priority 2): Secondary search query (optional)
- Search 3 (Priority 3): Tertiary search query (optional)
- Enter Websites:
- Add one website per line in the "Websites to search" text area
- Example:

```
example.com
another-site.org
research-institute.edu
```
- Execute Search:
- Click the "Search" button
- The application will:
- First try Search 1 queries (up to 4 pages = 40 results)
- If fewer than 40 results, try Search 2 queries (up to 8 pages)
- If still fewer than 40 results, try Search 3 queries (up to 10 pages)
- Automatically deduplicate results by URL (see the cascade sketch after this list)
- View and Export Results:
- Results are displayed in an interactive table
- Download results as CSV using the download button
- Each website's results are shown separately
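A simplified sketch of the priority cascade described above. The page limits mirror the description; `fetch_page` and `search_site` are hypothetical names, not the actual main.py implementation:

```python
MAX_RESULTS = 40
PAGE_LIMITS = {1: 4, 2: 8, 3: 10}  # max pages per priority, 10 results per page

def fetch_page(query, site, start):
    """Hypothetical helper: one Custom Search API call restricted to `site`."""
    raise NotImplementedError

def search_site(site, queries):
    """queries maps priority (1-3) to a query string, or None if unused."""
    results, seen_urls = [], set()
    for priority in (1, 2, 3):
        query = queries.get(priority)
        if not query or len(results) >= MAX_RESULTS:
            break  # skip lower priorities once 40 results are collected
        for page in range(PAGE_LIMITS[priority]):
            start = page * 10 + 1  # the API's 'start' index is 1-based
            for item in fetch_page(query, site, start):
                if item["link"] not in seen_urls:  # deduplicate by URL
                    seen_urls.add(item["link"])
                    item["priority"] = priority
                    results.append(item)
            if len(results) >= MAX_RESULTS:
                break
    return results[:MAX_RESULTS]
```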
Each search query supports four types of search terms:
- All these words: Documents must contain all specified terms (AND)
- This exact word or phrase: Documents must contain the exact phrase
- Any of these words: Documents must contain at least one term (OR)
- None of these words: Documents must not contain these terms (NOT)
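These four fields map naturally onto parameters of the Custom Search JSON API. A minimal request sketch follows; whether main.py passes them as separate parameters or folds them into one query string is an assumption, and all values are placeholders:

```python
import requests

params = {
    "key": "your_google_api_key",
    "cx": "your_search_engine_id",
    "q": "climate adaptation report",     # all these words (AND)
    "exactTerms": "grey literature",      # this exact word or phrase
    "orTerms": "policy guidance brief",   # any of these words (OR)
    "excludeTerms": "news blog",          # none of these words (NOT)
    "siteSearch": "example.com",          # restrict results to one website
    "siteSearchFilter": "i",              # 'i' = include only this site
    "num": 10,                            # results per page (API max is 10)
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
items = resp.json().get("items", [])
```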
Results are automatically tagged with priority levels:
- Priority 1: Results from Search 1 queries
- Priority 2: Results from Search 2 queries
- Priority 3: Results from Search 3 queries
This helps you identify which search strategy found each result.
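For example, after exporting a CSV you can filter on the `priority` column with pandas (file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("results.csv")          # CSV exported from the app
primary_hits = df[df["priority"] == 1]   # keep only Search 1 results
```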
Greylitsearcher uses direct Airtable integration to save search results. The integration is built into the app and works automatically once configured.
- Create an Airtable Personal Access Token:
- Go to Airtable Account Settings
- Click "Create new token"
- Give it a name (e.g., "Greylitsearcher")
- Grant the following scopes: `data.records:read` and `data.records:write`
- Grant access to your base
- Copy the token
- Get your Airtable Base ID and Table ID/Name:
- Open your Airtable base
- Go to Airtable API Documentation
- Select your base
- The Base ID is shown at the top (starts with `app...`)
- You can use either:
  - Table Name: The display name of your table (e.g., "raw_results")
  - Table ID: The unique identifier (starts with `tbl...`); more reliable if the table name might change
- Add credentials to Streamlit secrets:
  Create or edit `.streamlit/secrets.toml`:

```toml
AIRTABLE_TOKEN = "your_personal_access_token"
AIRTABLE_BASE_ID = "your_base_id" # e.g., "appHrhJQHkZz4c82U"
AIRTABLE_TABLE_NAME = "raw_results" # Optional: table name or table ID (e.g., "tblb9eEVPpV4Qqo4u")
```

Or if using Streamlit Cloud, add these as secrets in your app settings.
Your Airtable table should have the following fields:
Required Fields (must exist):
- `title` - Single line text
- `link` - URL
Optional Fields (will be populated if they exist):
- `snippet` - Long text
- `source_domain` - Single line text
- `search_query` - Single line text
- `priority` - Number (1, 2, or 3)
- `scraped_at` - Date (date only, no time)
- `status` - Single select (must include "Todo" as an option)
Note: Field names are case-sensitive and must match exactly (lowercase with underscores).
Once configured, the app will automatically show a "Save All Results to Airtable" button after you perform a search. The integration:
- Saves all search results to Airtable
- Checks for duplicates (optional, can be toggled)
- Preserves priority levels (1, 2, 3)
- Tracks search queries
- Shows real-time progress during save
- Provides detailed statistics (created, duplicates, errors)
- Sets status to "Todo" for new records
- Sets scraped_at to current date (YYYY-MM-DD format)
- Handles rate limiting (5 requests/second)
Records are saved with the following structure:
- `title`: Document title from search results
- `link`: Full URL to the document
- `snippet`: Search result snippet/description
- `source_domain`: Extracted domain from URL
- `search_query`: The search query that found this result
- `priority`: Priority level (1, 2, or 3)
- `scraped_at`: Current date in YYYY-MM-DD format
- `status`: Set to "Todo"
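A sketch of how one result could map onto this schema using pyairtable. The actual direct_airtable_integration.py may structure this differently; the token, base ID, and field values are placeholders:

```python
from datetime import date
from pyairtable import Api
from pyairtable.formulas import match

api = Api("your_personal_access_token")
table = api.table("your_base_id", "raw_results")

record = {
    "title": "Example technical report",
    "link": "https://example.com/report.pdf",
    "snippet": "A short description from the search result...",
    "source_domain": "example.com",
    "search_query": "climate adaptation report",
    "priority": 1,
    "scraped_at": date.today().isoformat(),  # YYYY-MM-DD
    "status": "Todo",
}

# Optional duplicate check by URL before creating the record
if not table.all(formula=match({"link": record["link"]})):
    table.create(record)
```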
If you prefer to use the data processor service instead of direct integration, see `airtable_integration.py` for the processor-based approach.
- Free tier: 100 queries per day per API key
- Paid tier: Up to 10,000 queries per day per API key
Greylitsearcher automatically handles rate limits (see the sketch after this list) by:
- Trying the first API key
- If rate limited, trying the second key
- If rate limited, trying the third key
- If all keys are exhausted, displaying an error message
Recommendation: Use 3 API keys to maximize daily query capacity.
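A minimal sketch of that fallback, assuming quota and rate-limit responses arrive as HTTP 403/429 (the exact error handling in main.py is an assumption):

```python
import requests

CSE_URL = "https://www.googleapis.com/customsearch/v1"

def cse_request(params, api_keys):
    """Try each API key in turn; return parsed JSON, or None if all fail."""
    for key in api_keys:
        resp = requests.get(CSE_URL, params={**params, "key": key})
        if resp.status_code in (403, 429):  # quota exhausted: try next key
            continue
        resp.raise_for_status()
        return resp.json()
    return None  # all keys exhausted
```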
- Start with fewer websites to test your setup
- Monitor your API usage in Google Cloud Console
- Consider scheduling searches during off-peak hours
- Use multiple API keys for higher volume searches
Each search result contains:
- `title`: Document title
- `link`: URL to the document
- `snippet`: Brief description/snippet
- `priority`: Priority level (1, 2, or 3)
- Additional Google Search API metadata
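An illustrative result row (all values are made up):

```python
result = {
    "title": "National Adaptation Plan 2023",
    "link": "https://example.com/reports/nap-2023.pdf",
    "snippet": "This report outlines national strategies for...",
    "priority": 1,
}
```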
Rate limit exceeded. Solution:
- Wait 24 hours for daily limits to reset
- Upgrade to paid Google API tier
- Add more API keys
- Reduce the number of websites or search queries
No results returned. Possible causes:
- Search query too specific
- Website doesn't have matching content
- Google Search Engine not indexing the website
- API key or CX incorrect
Solution:
- Try broader search terms
- Verify website is accessible and indexed by Google
- Check API keys and CX values in secrets
Secrets not found. Solution:
- Ensure `.streamlit/secrets.toml` exists in project root
- For Streamlit Cloud, add secrets via app settings
- Restart Streamlit after changing secrets
Password not working. Solution:
- Ensure `APP_PASSWORD` is set in your `secrets.toml` file
- Password is case-sensitive
- If you forget the password, check your `secrets.toml` file
- To log out, use the logout button in the sidebar
- After changing the password in secrets, restart Streamlit
Airtable save fails. Solution:
- Verify Airtable token is valid and not expired
- Check Base ID is correct (starts with `app...`)
- Ensure table name or table ID matches your Airtable table
- Verify token has `data.records:read` and `data.records:write` scopes
- Check token has access to the base
- Verify Airtable API rate limits (5 requests/second)
- Ensure `status` field has "Todo" as a select option
- Ensure `scraped_at` field is a Date field (not Date with time)
Invalid `scraped_at` date. Solution:
- Ensure `scraped_at` field in Airtable is a Date field (not Date with time)
- The field should accept date format: YYYY-MM-DD
"Todo" status option missing. Solution:
- Ensure your `status` field in Airtable has "Todo" as one of the select options
- You cannot create new select options via API without proper permissions
- Add "Todo" as an option in your Airtable table settings
```
Greylitsearcher/
├── main.py                          # Main Streamlit application
├── direct_airtable_integration.py  # Direct Airtable integration (default, built-in)
├── airtable_integration.py         # Processor-based integration (optional alternative)
├── requirements.txt                # Python dependencies
├── .streamlit/
│   └── secrets.toml                # Configuration file (not in git)
└── README.md                       # This file
```
- `streamlit==1.31.0` - Web application framework
- `requests==2.31.0` - HTTP library for API calls
- `pandas` - Data manipulation and CSV export
- `beautifulsoup4==4.12.3` - HTML parsing (for future enhancements)
- `pyairtable>=2.3.0` - Airtable API client for direct integration
Potential improvements:
- Save search configurations for reuse
- Export to other formats (JSON, Excel)
- Content extraction from result URLs
- Scheduled/automated searches
- Search history and saved results
- Advanced filtering options
- Batch processing for large-scale searches
Greylitsearcher will be part of a larger data pipeline system:
Greylitsearcher → Data Processor + LLM Screener → Airtable → Human Review → Vector DB → Chatbot
See ../ARCHITECTURE.md for the complete system architecture.
MIT