Supported Databases

Three Databases for API Access and PDF Availability

Diverga's I-category pipeline integrates with Semantic Scholar, OpenAlex, and arXiv to provide automated paper retrieval, with an overall PDF success rate of 50-60%.

Why These Databases?

Traditional academic databases don't support automated PDF access:

Traditional Databases Lack Automation

PubMed, Scopus, Web of Science, and ERIC require manual downloads

Open Access Focus

Combined 50-60% PDF retrieval success across all fields

REST API Access

All three provide generous rate limits for automation

No Subscriptions Required

Free access without institutional credentials

Primary Databases

Detailed comparison of the three integrated databases:

Semantic Scholar

~40% Open Access PDF URLs

AI-powered academic search engine with citation analysis and influential paper detection. Best for computer science, AI, and machine learning research.

API ENDPOINT

api.semanticscholar.org/graph/v1/paper/search

RATE LIMIT

100 requests per 5 minutes

Key Fields

openAccessPdf.url - Direct PDF URLs

title, abstract, authors - Metadata

citationCount, influentialCitationCount - Impact

fieldsOfStudy - Topic classification

Pros

No API key required for basic access

Citation network analysis built-in

Influential papers detection

Fast response times

Cons

Lower open access rate (~40%)

CS/ML bias in coverage

Rate limits for bulk requests

OpenAlex

~50% Open Access

Open scholarly data platform covering 250M+ works across all disciplines. Replaces Microsoft Academic Graph with comprehensive metadata and institution tracking.

API ENDPOINT

api.openalex.org/works

RATE LIMIT

10 requests per second

Key Fields

open_access.oa_url - Open access PDFs

authorships, institutions - Author data

concepts - Topic classification

cited_by_count - Citation metrics

Pros

Highest open access rate (~50%)

Broad disciplinary coverage

Rich metadata and affiliations

Polite pool for faster access

Cons

Newer database (quality varies)

Some metadata gaps

Requires mailto for best limits

arXiv

100% PDF Access

Preprint server for physics, mathematics, computer science, and more. Every paper has a freely accessible PDF with standardized URLs.

API ENDPOINT

export.arxiv.org/api/query

RATE LIMIT

3 seconds between requests (required)

Key Fields

entry.id - arXiv identifier

entry.title, entry.summary - Metadata

entry.author - Author information

entry.published - Publication date

Pros

100% PDF availability

Direct PDF URLs (arxiv.org/pdf/{id}.pdf)

Fast preprint access

No API key required

Cons

Preprints only (not peer-reviewed)

Limited to STEM fields

Requires 3-second delay

Database Comparison

Quick reference for selecting the right database:

Database	Open Access	API Key	Rate Limit	Best For
Semantic Scholar	40%	Optional	100/5min	CS, AI, ML
OpenAlex	50%	No	10/sec	All fields
arXiv	100%	No	3s delay	Preprints
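
The rate limits in the table can be turned into a minimum spacing between requests. The following is a minimal sketch of that arithmetic, assuming requests are spread evenly across each window; the dictionary keys and the function name are illustrative and are not part of Diverga.

```python
# Rate limits from the comparison table, as (requests, window in seconds).
RATE_LIMITS = {
    "semantic_scholar": (100, 300.0),  # 100 requests per 5 minutes
    "openalex": (10, 1.0),             # 10 requests per second
    "arxiv": (1, 3.0),                 # one request every 3 seconds
}


def min_interval_seconds(requests: int, window_seconds: float) -> float:
    """Evenly spaced delay that stays within `requests` per `window_seconds`."""
    return window_seconds / requests


# Minimum delay, in seconds, between consecutive requests to each database.
INTERVALS = {
    name: min_interval_seconds(*limit) for name, limit in RATE_LIMITS.items()
}
```

Under this even-spacing assumption, Semantic Scholar and arXiv both work out to one request every 3 seconds, while OpenAlex allows one every 0.1 seconds.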

API Integration Examples

How the I-category pipeline integrates with each database:

Semantic Scholar

REQUEST

GET https://api.semanticscholar.org/graph/v1/paper/search

PARAMETERS

query: "machine learning education"
fields: title,abstract,authors,openAccessPdf
limit: 100

RESPONSE

{
  "data": [
    {
      "paperId": "abc123",
      "title": "Machine Learning in Education",
      "openAccessPdf": {
        "url": "https://arxiv.org/pdf/2001.00000.pdf"
      }
    }
  ]
}
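
The request and response above could be handled roughly as follows. This is a sketch rather than the pipeline's actual code: the function names are hypothetical, and the helper assumes results arrive under a top-level `data` key (as the live Graph API returns them) while also accepting `papers`. It only builds the URL and reads a decoded response; sending the request is left to whichever HTTP client you use.

```python
from urllib.parse import urlencode

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_s2_search_url(query: str, limit: int = 100) -> str:
    """Build the search URL with the parameters listed above."""
    params = {
        "query": query,
        "fields": "title,abstract,authors,openAccessPdf",
        "limit": limit,
    }
    return f"{S2_SEARCH_URL}?{urlencode(params)}"


def extract_s2_pdf_urls(payload: dict) -> dict:
    """Map paperId -> open access PDF URL, skipping papers without one."""
    papers = payload.get("data") or payload.get("papers") or []
    urls = {}
    for paper in papers:
        # openAccessPdf may be missing or null when no open copy is known.
        pdf = paper.get("openAccessPdf") or {}
        if pdf.get("url"):
            urls[paper["paperId"]] = pdf["url"]
    return urls
```

Papers without an `openAccessPdf` entry are skipped, which is where the ~40% open access rate shows up in practice.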

OpenAlex

REQUEST

GET https://api.openalex.org/works

PARAMETERS

search: "chatbot language learning"
filter: publication_year:2020-2024
mailto: researcher@university.edu

RESPONSE

{
  "results": [
    {
      "id": "W123456789",
      "title": "Chatbots for Language Learning",
      "open_access": {
        "oa_url": "https://example.com/paper.pdf"
      }
    }
  ]
}
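
A comparable sketch for OpenAlex, using the same parameters as the example above. The function names are hypothetical and the code only constructs the URL and reads a decoded response; it assumes the `results` / `open_access.oa_url` shape shown in the sample.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(search: str, year_from: int, year_to: int, mailto: str) -> str:
    """Build a works query; `mailto` opts the request into the polite pool."""
    params = {
        "search": search,
        "filter": f"publication_year:{year_from}-{year_to}",
        "mailto": mailto,
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def extract_openalex_oa_urls(payload: dict) -> dict:
    """Map work id -> open access URL, skipping works without one."""
    urls = {}
    for work in payload.get("results", []):
        oa_url = (work.get("open_access") or {}).get("oa_url")
        if oa_url:
            urls[work["id"]] = oa_url
    return urls
```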

arXiv

REQUEST

GET http://export.arxiv.org/api/query

PARAMETERS

search_query: all:"deep learning"
start: 0
max_results: 100

RESPONSE (returned as Atom XML; shown here parsed to JSON)

{
  "entry": [
    {
      "id": "http://arxiv.org/abs/2001.00000v1",
      "title": "Deep Learning Survey",
      "pdf_url": "https://arxiv.org/pdf/2001.00000.pdf"
    }
  ]
}
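
The arXiv API returns an Atom XML feed, so a client has to parse it before it resembles the structure above. The sketch below shows one way to do that with the standard library. The function name is hypothetical, and the `pdf_url` is derived from the entry id using the `arxiv.org/pdf/{id}.pdf` pattern mentioned earlier, with the version suffix dropped to match the sample.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_arxiv_feed(atom_xml: str) -> list[dict]:
    """Turn an arXiv Atom feed into a list of {id, title, pdf_url} dicts."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        abs_url = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        # "http://arxiv.org/abs/2001.00000v1" -> "2001.00000"
        arxiv_id = re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1])
        entries.append({
            "id": abs_url,
            "title": " ".join(title.split()),  # collapse line wraps in titles
            "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}.pdf",
        })
    return entries
```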

PDF Retrieval Workflow

The I-category pipeline implements retry logic and fallback chains:

Fetch Metadata

Query all three databases in parallel

10-20 min

Deduplicate

Remove duplicates by DOI, arXiv ID, title similarity

1-2 min

Download PDFs

Retry logic with exponential backoff, fallback chains

20-60 min

Validate

Check PDF integrity, file size, readability

5-10 min
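
The Deduplicate and Validate steps can be sketched as below. This is an illustration under stated assumptions, not the pipeline's implementation: the record keys (`doi`, `arxiv_id`, `title`), the 0.9 similarity threshold, and the 1 KB minimum size are all made up for the example, and title similarity uses `difflib.SequenceMatcher` as a stand-in for whatever measure the pipeline applies.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def deduplicate(records: list[dict], title_threshold: float = 0.9) -> list[dict]:
    """Keep the first record for each DOI, arXiv ID, or near-identical title."""
    seen_dois, seen_arxiv, kept_titles, unique = set(), set(), [], []
    for rec in records:
        doi = (rec.get("doi") or "").lower()
        arxiv_id = rec.get("arxiv_id") or ""
        title = normalize_title(rec.get("title") or "")
        if doi and doi in seen_dois:
            continue
        if arxiv_id and arxiv_id in seen_arxiv:
            continue
        if title and any(
            SequenceMatcher(None, title, kept).ratio() >= title_threshold
            for kept in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        if arxiv_id:
            seen_arxiv.add(arxiv_id)
        if title:
            kept_titles.append(title)
        unique.append(rec)
    return unique


def looks_like_pdf(data: bytes, min_bytes: int = 1024) -> bool:
    """Cheap integrity check: PDF magic bytes plus a minimum file size."""
    return len(data) >= min_bytes and data.startswith(b"%PDF-")
```

Comparing every title against all kept titles is quadratic, which is acceptable for a few thousand records but would need indexing or blocking for much larger sets.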

PDF Retrieval Success Rates

Expected outcomes across different research domains:

Computer Science: 60-70%

Physics & Mathematics: 70-80%

Biomedical Sciences: 40-50%

Social Sciences: 30-40%

Humanities: 20-30%

Best Practices

Optimize your database strategy:

Use All Three Databases

Maximize coverage by querying Semantic Scholar, OpenAlex, and arXiv together

Add Polite Pool Parameters

Include mailto parameter for OpenAlex to access faster rate limits

Respect Rate Limits

Implement exponential backoff and 3-second delays for arXiv

Validate PDF URLs

Check HTTP status codes before downloading to avoid broken links
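
One way to combine exponential backoff with arXiv's 3-second spacing is to double the delay on each retry while never letting it drop below a floor. The sketch below assumes that approach; the function names, the default of four attempts, and the injectable `sleep` argument (there so the logic can be exercised without waiting) are illustrative choices, not Diverga's API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base_delay: float, max_attempts: int, floor: float = 0.0) -> list[float]:
    """Delays before each retry: base, 2*base, 4*base, ..., never below `floor`."""
    return [max(floor, base_delay * 2 ** i) for i in range(max_attempts - 1)]


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    floor: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on any exception with exponentially growing delays."""
    delays = backoff_delays(base_delay, max_attempts, floor)
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise ValueError("max_attempts must be at least 1")
```

For arXiv, passing `floor=3.0` keeps every retry at least 3 seconds apart; for the other two databases the floor can stay at zero. A real client would likely catch only network and HTTP errors rather than every exception.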

Ready to Automate Your Literature Search?

The I-category pipeline handles database integration, deduplication, and PDF retrieval automatically.
