Supported Databases
Five Databases for API Access and PDF Availability
Diverga's I-category pipeline integrates with Semantic Scholar, OpenAlex, arXiv, Scopus, and Web of Science to provide automated paper retrieval, with an overall PDF retrieval success rate of 50-60%.
Why These Databases?
Traditional academic databases don't support automated PDF access:
Traditional Databases Lack Automation
PubMed, Scopus, Web of Science, and ERIC require manual downloads
Open Access Focus
Combined 50-60% PDF retrieval success across all fields
REST API Access
All three primary databases (Semantic Scholar, OpenAlex, and arXiv) expose REST APIs with rate limits generous enough for automation
No Subscriptions Required
Free access without institutional credentials
Primary Databases
Details for the three primary databases:
Semantic Scholar
~40% Open Access PDF URLs
AI-powered academic search engine with citation analysis and influential paper detection. Best for computer science, AI, and machine learning research.
API ENDPOINT
api.semanticscholar.org/graph/v1/paper/search
RATE LIMIT
100 requests per 5 minutes
OpenAlex
~50% Open Access
Open scholarly data platform covering 250M+ works across all disciplines. Replaces Microsoft Academic Graph with comprehensive metadata and institution tracking.
API ENDPOINT
api.openalex.org/works
RATE LIMIT
10 requests per second
arXiv
100% PDF Access
Preprint server for physics, mathematics, computer science, and more. Every paper has a freely accessible PDF with standardized URLs.
API ENDPOINT
export.arxiv.org/api/query
RATE LIMIT
3 seconds between requests (required)
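The arXiv API returns its results as an Atom XML feed, and each entry carries a link to the PDF. The sketch below shows one way to pull the PDF URL out of such a feed using only the standard library. The sample feed is hand-written for illustration (the ID is the placeholder used elsewhere on this page, not a real paper), and `parse_arxiv_feed` is a hypothetical helper name, not part of the I-category pipeline.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample in the shape of an arXiv Atom feed (placeholder ID).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2001.00000v1</id>
    <title>Deep Learning Survey</title>
    <link href="http://arxiv.org/abs/2001.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2001.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def parse_arxiv_feed(xml_text):
    """Return a list of {id, title, pdf_url} dicts from an Atom feed string."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        pdf_url = None
        # The PDF link is the <link> element whose title attribute is "pdf".
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "pdf_url": pdf_url,
        })
    return papers


papers = parse_arxiv_feed(SAMPLE_FEED)
print(papers[0]["pdf_url"])
```

Entries without a PDF link would come back with `pdf_url` set to `None`, which a caller can treat as "no PDF available".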
Database Comparison
Quick reference for selecting the right database:
| Database | Open Access | API Key | Rate Limit | Best For |
|---|---|---|---|---|
| Semantic Scholar | ~40% | Optional | 100/5min | CS, AI, ML |
| OpenAlex | ~50% | No | 10/sec | All fields |
| arXiv | 100% | No | 3s delay | Preprints |
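The rate limits in the table can be turned into a minimum spacing between requests. The following is a minimal sketch, assuming the Semantic Scholar allowance is spread evenly over the 5-minute window (a conservative choice, not something the API requires). `MIN_INTERVAL` and `Throttle` are hypothetical names for illustration.

```python
import time

# Minimum seconds between requests, derived from the comparison table.
MIN_INTERVAL = {
    "semantic_scholar": 300 / 100,  # 100 requests per 5 minutes -> 3.0 s
    "openalex": 1 / 10,             # 10 requests per second -> 0.1 s
    "arxiv": 3.0,                   # 3-second delay between requests
}


class Throttle:
    """Blocks so that successive calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One throttle per database; call .wait() before each request.
throttles = {name: Throttle(seconds) for name, seconds in MIN_INTERVAL.items()}
```

The clock and sleep functions are injectable so the spacing logic can be tested without real delays.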
API Integration Examples
How the I-category pipeline integrates with each database:
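As a rough illustration, the three requests described in the subsections that follow can be assembled with the standard library alone. The endpoints and parameters are the ones listed on this page. The function names are hypothetical, and the sketch only builds URLs without sending anything.

```python
from urllib.parse import urlencode


def semantic_scholar_url(query, limit=100):
    """Search URL for the Semantic Scholar Graph API."""
    params = {
        "query": query,
        "fields": "title,abstract,authors,openAccessPdf",
        "limit": limit,
    }
    return "https://api.semanticscholar.org/graph/v1/paper/search?" + urlencode(params)


def openalex_url(search, year_range, mailto):
    """Works URL for OpenAlex, including the mailto parameter."""
    params = {
        "search": search,
        "filter": f"publication_year:{year_range}",
        "mailto": mailto,
    }
    return "https://api.openalex.org/works?" + urlencode(params)


def arxiv_url(phrase, start=0, max_results=100):
    """Query URL for the arXiv API, searching all fields for a quoted phrase."""
    params = {
        "search_query": f'all:"{phrase}"',
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


print(semantic_scholar_url("machine learning education"))
print(openalex_url("chatbot language learning", "2020-2024", "researcher@university.edu"))
print(arxiv_url("deep learning"))
```

`urlencode` percent-encodes commas, colons, quotes, and the `@` sign, so the values can be passed in their readable form.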
Semantic Scholar
REQUEST
GET https://api.semanticscholar.org/graph/v1/paper/search
PARAMETERS
query: "machine learning education"
fields: title,abstract,authors,openAccessPdf
limit: 100
RESPONSE
{
  "data": [
    {
      "paperId": "abc123",
      "title": "Machine Learning in Education",
      "openAccessPdf": {
        "url": "https://arxiv.org/pdf/2001.00000.pdf"
      }
    }
  ]
}
OpenAlex
REQUEST
GET https://api.openalex.org/works
PARAMETERS
search: "chatbot language learning"
filter: publication_year:2020-2024
mailto: researcher@university.edu
RESPONSE
{
  "results": [
    {
      "id": "W123456789",
      "title": "Chatbots for Language Learning",
      "open_access": {
        "oa_url": "https://example.com/paper.pdf"
      }
    }
  ]
}
arXiv
REQUEST
GET http://export.arxiv.org/api/query
PARAMETERS
search_query: all:"deep learning"
start: 0
max_results: 100
RESPONSE (the API returns an Atom XML feed; shown here after parsing to JSON)
{
  "entry": [
    {
      "id": "http://arxiv.org/abs/2001.00000v1",
      "title": "Deep Learning Survey",
      "pdf_url": "https://arxiv.org/pdf/2001.00000.pdf"
    }
  ]
}
PDF Retrieval Workflow
The I-category pipeline implements retry logic and fallback chains:
Fetch Metadata
Query all three databases in parallel
Deduplicate
Remove duplicates by DOI, arXiv ID, and title similarity
Download PDFs
Retry logic with exponential backoff, fallback chains
Validate
Check PDF integrity, file size, readability
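The deduplication step above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the record fields (`doi`, `arxiv_id`, `title`), the 0.9 similarity threshold, and the function names are all assumptions, and the DOIs in the usage example are placeholders.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace before comparing titles."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a, b, threshold=0.9):
    """Compare on DOI first, then arXiv ID, then fuzzy title similarity.

    The first identifier that both records share decides the outcome.
    """
    if a.get("doi") and b.get("doi"):
        return a["doi"].lower() == b["doi"].lower()
    if a.get("arxiv_id") and b.get("arxiv_id"):
        return a["arxiv_id"] == b["arxiv_id"]
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold


def deduplicate(records):
    """Keep the first occurrence of each paper, preserving input order."""
    unique = []
    for rec in records:
        if not any(is_duplicate(rec, kept) for kept in unique):
            unique.append(rec)
    return unique


records = [
    {"title": "Machine Learning in Education", "doi": "10.1000/ABC"},
    {"title": "Machine learning in education.", "doi": "10.1000/abc"},
    {"title": "Deep Learning Survey", "arxiv_id": "2001.00000"},
    {"title": "Deep Learning: A Survey", "arxiv_id": "2001.00000"},
    {"title": "Chatbots for Language Learning"},
    {"title": "Chatbots for language learning"},
]
print(len(deduplicate(records)))
```

The pairwise comparison is quadratic in the number of records, which is fine for a few thousand papers. A larger corpus would likely index by DOI and arXiv ID first and reserve title matching for records that have neither.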
PDF Retrieval Success Rates
Expected outcomes across different research domains:
Best Practices
Optimize your database strategy:
Use All Three Databases
Maximize coverage by querying Semantic Scholar, OpenAlex, and arXiv together
Add Polite Pool Parameters
Include the mailto parameter in OpenAlex requests to join the polite pool, which offers faster and more consistent response times
Respect Rate Limits
Implement exponential backoff for failed requests and keep at least 3 seconds between arXiv requests
Validate PDF URLs
Check HTTP status codes before downloading to avoid broken links
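The backoff, status-check, and validation practices above can be combined in one download routine. The sketch below is an assumption-laden illustration rather than the pipeline's code: `fetch` is a caller-supplied function returning `(status_code, body_bytes)`, and the choice of retryable status codes, the delay schedule, and the 1 KB minimum size are all hypothetical defaults.

```python
import time

RETRYABLE = (429, 500, 502, 503, 504)  # assumed transient statuses


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Delays for successive retries: base, base*factor, ..., capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def looks_like_pdf(data, min_bytes=1024):
    """Cheap integrity check: PDF magic header plus a minimum plausible size."""
    return data.startswith(b"%PDF-") and len(data) >= min_bytes


def download_with_fallback(urls, fetch, retries=3, sleep=time.sleep):
    """Try each candidate URL in order, retrying transient failures with backoff.

    Returns (url, body) for the first valid PDF, or None if every URL fails.
    """
    for url in urls:
        # One immediate attempt, then one attempt after each backoff delay.
        for delay in [0.0] + backoff_delays(retries):
            if delay:
                sleep(delay)
            try:
                status, body = fetch(url)
            except OSError:
                continue  # network error: retry after the next delay
            if status == 200:
                if looks_like_pdf(body):
                    return url, body
                break  # a page that is not a PDF: fall back to the next URL
            if status not in RETRYABLE:
                break  # treated as permanent for this URL
    return None
```

A caller would pass the candidate URLs in order of preference, for example the publisher's open-access link first and a preprint copy second.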
Ready to Automate Your Literature Search?
The I-category pipeline handles database integration, deduplication, and PDF retrieval automatically.