
Supported Databases

Five Databases for API Access and PDF Availability

Diverga's I-category pipeline integrates with Semantic Scholar, OpenAlex, arXiv, Scopus, and Web of Science for automated paper retrieval, with an overall PDF retrieval success rate of 50-60%.

Why These Databases?

Traditional academic databases don't support automated PDF access:

Traditional Databases Lack Automation

PubMed, Scopus, Web of Science, and ERIC require manual downloads

Open Access Focus

Combined 50-60% PDF retrieval success across all fields

REST API Access

Semantic Scholar, OpenAlex, and arXiv all provide rate limits generous enough for automation

No Subscriptions Required

Free access without institutional credentials

Primary Databases

Detailed comparison of the three primary databases (Semantic Scholar, OpenAlex, and arXiv):

Semantic Scholar

~40% Open Access PDF URLs

AI-powered academic search engine with citation analysis and influential paper detection. Best for computer science, AI, and machine learning research.

API ENDPOINT

api.semanticscholar.org/graph/v1/paper/search

RATE LIMIT

100 requests per 5 minutes

Key Fields

openAccessPdf.url - Direct PDF URLs
title, abstract, authors - Metadata
citationCount, influentialCitationCount - Impact
fieldsOfStudy - Topic classification

Pros

No API key required for basic access
Citation network analysis built-in
Influential papers detection
Fast response times

Cons

Lower open access rate (~40%)
CS/ML bias in coverage
Rate limits for bulk requests

OpenAlex

~50% Open Access

Open scholarly data platform covering 250M+ works across all disciplines. Replaces Microsoft Academic Graph with comprehensive metadata and institution tracking.

API ENDPOINT

api.openalex.org/works

RATE LIMIT

10 requests per second

Key Fields

open_access.oa_url - Open access PDFs
authorships, institutions - Author data
concepts - Topic classification
cited_by_count - Citation metrics

Pros

Highest open access rate (~50%)
Broad disciplinary coverage
Rich metadata and affiliations
Polite pool for faster access

Cons

Newer database (quality varies)
Some metadata gaps
Requires mailto for best limits

arXiv

100% PDF Access

Preprint server for physics, mathematics, computer science, and more. Every paper has a freely accessible PDF with standardized URLs.

API ENDPOINT

export.arxiv.org/api/query

RATE LIMIT

3 seconds between requests (required)

Key Fields

entry.id - arXiv identifier
entry.title, entry.summary - Metadata
entry.author - Author information
entry.published - Publication date

Pros

100% PDF availability
Direct PDF URLs (arxiv.org/pdf/{id}.pdf)
Fast preprint access
No API key required

Cons

Preprints only (not peer-reviewed)
Mostly limited to STEM fields
Requires 3-second delay

Database Comparison

Quick reference for selecting the right database:

Database            Open Access    API Key     Rate Limit    Best For
Semantic Scholar    40%            Optional    100/5min      CS, AI, ML
OpenAlex            50%            No          10/sec        All fields
arXiv               100%           No          3s delay      Preprints

API Integration Examples

How the I-category pipeline integrates with each database:

Semantic Scholar

REQUEST

GET https://api.semanticscholar.org/graph/v1/paper/search

PARAMETERS

query: "machine learning education"
fields: title,abstract,authors,openAccessPdf
limit: 100

RESPONSE

{
  "data": [
    {
      "paperId": "abc123",
      "title": "Machine Learning in Education",
      "openAccessPdf": {
        "url": "https://arxiv.org/pdf/2001.00000.pdf"
      }
    }
  ]
}
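
The request and response above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the pipeline's actual code: the helper names `build_search_url` and `extract_pdf_urls` are hypothetical, and `extract_pdf_urls` takes the list of paper records from the response rather than the whole payload.

```python
from urllib.parse import urlencode

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, fields=("title", "abstract", "authors", "openAccessPdf"), limit=100):
    """Return the full GET URL for a paper search (hypothetical helper)."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return f"{S2_SEARCH_URL}?{urlencode(params)}"


def extract_pdf_urls(papers):
    """Map paperId -> PDF URL for records that carry an open access PDF."""
    urls = {}
    for paper in papers:
        # openAccessPdf may be missing or null when no open copy is known.
        pdf = paper.get("openAccessPdf") or {}
        if pdf.get("url"):
            urls[paper["paperId"]] = pdf["url"]
    return urls
```

Records without a usable `openAccessPdf.url` are skipped rather than raising, so they can be handed to a fallback source later.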

OpenAlex

REQUEST

GET https://api.openalex.org/works

PARAMETERS

search: "chatbot language learning"
filter: publication_year:2020-2024
mailto: researcher@university.edu

RESPONSE

{
  "results": [
    {
      "id": "https://openalex.org/W123456789",
      "title": "Chatbots for Language Learning",
      "open_access": {
        "oa_url": "https://example.com/paper.pdf"
      }
    }
  ]
}
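
A comparable sketch for OpenAlex, again assuming only the standard library. The function names `build_works_url` and `extract_oa_urls` are hypothetical; the `search`, `filter`, and `mailto` parameters mirror the example above.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(search, year_range=None, mailto=None):
    """Return the GET URL for a works search (hypothetical helper)."""
    params = {"search": search}
    if year_range:
        start, end = year_range
        params["filter"] = f"publication_year:{start}-{end}"
    if mailto:
        # Including mailto opts the request into the polite pool.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def extract_oa_urls(response):
    """Map work id -> open access URL for results that have one."""
    urls = {}
    for work in response.get("results", []):
        oa = work.get("open_access") or {}
        if oa.get("oa_url"):
            urls[work["id"]] = oa["oa_url"]
    return urls
```

Note that `oa_url` is not guaranteed to point at a PDF file; it may be a landing page, so a later validation step is still needed.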

arXiv

REQUEST

GET http://export.arxiv.org/api/query

PARAMETERS

search_query: all:"deep learning"
start: 0
max_results: 100

RESPONSE (returned as Atom XML, shown here parsed into JSON)

{
  "entry": [
    {
      "id": "http://arxiv.org/abs/2001.00000v1",
      "title": "Deep Learning Survey",
      "pdf_url": "https://arxiv.org/pdf/2001.00000.pdf"
    }
  ]
}
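
Because the arXiv API answers with an Atom XML feed rather than JSON, some parsing is needed to arrive at records like the one above. The sketch below uses the standard library's `xml.etree.ElementTree`; the function name `parse_arxiv_feed` and the output keys are assumptions chosen to match the example. When an entry has no link titled "pdf", the PDF URL is derived from the abstract URL by swapping `/abs/` for `/pdf/`.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the feed


def parse_arxiv_feed(xml_text):
    """Turn an arXiv Atom feed into a list of {id, title, pdf_url} dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="").strip()
        # Titles can wrap across lines in the feed; collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        pdf_url = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        if pdf_url is None and "/abs/" in abs_url:
            # Fallback: derive the PDF URL from the abstract URL.
            pdf_url = abs_url.replace("/abs/", "/pdf/", 1)
        entries.append({"id": abs_url, "title": title, "pdf_url": pdf_url})
    return entries
```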

PDF Retrieval Workflow

The I-category pipeline implements retry logic and fallback chains:

1. Fetch Metadata (10-20 min)
   Query all three databases in parallel

2. Deduplicate (1-2 min)
   Remove duplicates by DOI, arXiv ID, title similarity

3. Download PDFs (20-60 min)
   Retry logic with exponential backoff, fallback chains

4. Validate (5-10 min)
   Check PDF integrity, file size, readability
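
The deduplication step can be illustrated with a short sketch. This is one possible approach under stated assumptions, not the pipeline's actual implementation: records are plain dicts with optional `doi`, `arxiv_id`, and `title` keys (hypothetical field names), title similarity is measured with `difflib.SequenceMatcher`, and the 0.9 threshold is an arbitrary illustrative value.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a, b, title_threshold=0.9):
    """True if two records share a DOI or arXiv ID, or have near-identical titles."""
    for key in ("doi", "arxiv_id"):
        if a.get(key) and b.get(key) and a[key].lower() == b[key].lower():
            return True
    ta = normalize_title(a.get("title") or "")
    tb = normalize_title(b.get("title") or "")
    if not ta or not tb:
        return False
    return SequenceMatcher(None, ta, tb).ratio() >= title_threshold


def deduplicate(records, title_threshold=0.9):
    """Keep the first occurrence of each paper, preserving input order."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, title_threshold) for kept in unique):
            unique.append(record)
    return unique
```

The pairwise comparison is quadratic in the number of records, which is fine for a sketch; a larger corpus would likely index by DOI and arXiv ID first and reserve fuzzy title matching for the remainder.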

PDF Retrieval Success Rates

Expected outcomes across different research domains:

Computer Science: 60-70%
Physics & Mathematics: 70-80%
Biomedical Sciences: 40-50%
Social Sciences: 30-40%
Humanities: 20-30%

Best Practices

Optimize your database strategy:

Use All Three Databases

Maximize coverage by querying Semantic Scholar, OpenAlex, and arXiv together

Add Polite Pool Parameters

Include mailto parameter for OpenAlex to access faster rate limits

Respect Rate Limits

Implement exponential backoff and 3-second delays for arXiv
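
As a rough illustration of both ideas, the sketch below pairs a retry helper that waits exponentially longer after each failure with a limiter that enforces a minimum gap between requests (3 seconds for arXiv). All names here (`backoff_delays`, `call_with_backoff`, `MinIntervalLimiter`) and the default values are hypothetical; the `sleep` and `clock` parameters are injectable only so the behavior can be checked without real waiting.

```python
import time


def backoff_delays(max_retries=4, base_delay=1.0, factor=2.0, max_delay=60.0):
    """Yield the wait before each retry: base, base*factor, ..., capped at max_delay."""
    for attempt in range(max_retries):
        yield min(base_delay * factor ** attempt, max_delay)


def call_with_backoff(func, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on any exception, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, base_delay)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)


class MinIntervalLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create `MinIntervalLimiter(3.0)` once for arXiv, call `wait()` before every request, and wrap the request itself in `call_with_backoff`. A production version would probably retry only on specific errors (timeouts, HTTP 429 or 5xx) rather than on every exception.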

Validate PDF URLs

Check HTTP status codes before downloading to avoid broken links
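
One way to sketch this check, assuming the standard library's `urllib`: send a HEAD request first, download only when the status is 200 and the content type mentions PDF, then confirm the downloaded bytes look like a PDF. The function names and the 1024-byte minimum size are illustrative assumptions, and the content-type test is a heuristic, since some servers reject HEAD requests or mislabel files.

```python
import urllib.request
import urllib.error


def head_check(url, timeout=10):
    """Return (status, content_type) from a HEAD request, or (None, None) on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry a status code.
        return err.code, (err.headers or {}).get("Content-Type", "")
    except OSError:
        # Covers URLError, timeouts, and connection errors.
        return None, None


def should_download(status, content_type):
    """Heuristic: proceed only for a 200 response that claims to be a PDF."""
    return status == 200 and "pdf" in (content_type or "").lower()


def looks_like_pdf(data, min_size=1024):
    """Cheap integrity check: size floor, %PDF- header, and %%EOF marker near the end."""
    return (
        len(data) >= min_size
        and data.startswith(b"%PDF-")
        and b"%%EOF" in data[-1024:]
    )
```

These checks catch broken links and HTML error pages saved with a `.pdf` name; they do not prove the file is readable, which would require opening it with a PDF library.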

Ready to Automate Your Literature Search?

The I-category pipeline handles database integration, deduplication, and PDF retrieval automatically.