API — Work In Progress¶
🚧 Work In Progress
The REST API is a Work In Progress and will be updated soon.
- The API container exists and basic endpoints are implemented, but the API has not been tested and has not been updated since host management was added to the pipeline.
- It is currently not the recommended way to interact with PBI data.
- For efficient data access and machine learning workflows, use the analysis container with the pbi Python package directly (5–50× faster for bulk operations).
The PBI API provides a REST interface for querying the phage database and retrieving sequences programmatically. It may be useful for lightweight external integrations where only a few records need to be retrieved. Full documentation and host-related endpoints will be added once the API is updated.
Current Status¶
| Feature | Status | Notes |
|---|---|---|
| Database Connection | Implemented | Connects to DuckDB database |
| Health Endpoints | Implemented | /health and /stats |
| SQL Query Endpoint | Implemented | /query with basic safety checks |
| Phage Retrieval | Implemented | Query and ID-based retrieval |
| Protein Retrieval | Implemented | Query and ID-based retrieval |
| FASTA Export | Implemented | Export sequences to FASTA format |
| Host Endpoints | ❌ Not yet added | API not updated for host management |
| Testing | ❌ Not done | API has not been validated |
| Authentication | Planned | No auth currently |
| Rate Limiting | Planned | Not yet implemented |
| Batch Operations | Planned | Bulk data operations |
See Future Steps for planned API enhancements.
Starting the API¶
Docker (recommended)¶
# Build and start the API container
docker compose build api
docker compose up -d api
# API available at http://localhost:8000
Local (development)¶
# From the project root
cd api
uvicorn app:app --host 0.0.0.0 --port 8000
# Or with auto-reload
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
Base URL¶
http://localhost:8000
Interactive Documentation¶
The API provides auto-generated interactive documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
👉 We recommend using the Swagger UI for exploring and testing the API interactively.
Endpoints¶
Health & Status¶
GET /¶
API information and available endpoints.
Response:
GET /health¶
Health check — verifies database connection.
Response:
Status Codes:
- 200: API and database are healthy
- 503: Database connection failed
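These two status codes are enough for a client to wait until the API is ready, for example right after `docker compose up -d api`. The following is a minimal sketch using only the Python standard library. The helper names `probe_health` and `wait_until_healthy`, their parameters, and the retry policy are illustrative assumptions, not part of the API.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url="http://localhost:8000", timeout=5.0):
    """Return the HTTP status code of GET /health, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 503) are raised as HTTPError; keep the code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: the server is not up yet.
        return None


def wait_until_healthy(probe=probe_health, attempts=10, delay=1.0):
    """Poll until the probe returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as an argument keeps the retry loop testable without a running server; calling `wait_until_healthy()` with the defaults polls the local API.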
GET /stats¶
Database statistics.
Response:
{
"phages": 873718,
"proteins": 43088582,
"trna_tmrna": 702607,
"terminators": 6462417,
"anti_crispr": 307329,
"virulent_factors": 41609,
"transmembrane": 4020770
}
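The counts returned by /stats can be combined into simple derived figures, such as the average number of features per phage. The sketch below works on the sample response shown above, embedded as a literal so it runs without a server; the `per_phage` helper is hypothetical.

```python
import json

# Sample /stats body copied from the response shown above.
SAMPLE_STATS = json.loads("""
{
  "phages": 873718,
  "proteins": 43088582,
  "trna_tmrna": 702607,
  "terminators": 6462417,
  "anti_crispr": 307329,
  "virulent_factors": 41609,
  "transmembrane": 4020770
}
""")


def per_phage(stats):
    """Return the average count of each feature type per phage."""
    phages = stats["phages"]
    if phages == 0:
        raise ValueError("stats report zero phages")
    return {
        name: count / phages
        for name, count in stats.items()
        if name != "phages"
    }


averages = per_phage(SAMPLE_STATS)
for name, value in sorted(averages.items()):
    print(f"{name}: {value:.2f} per phage")
```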
Data Querying¶
POST /query¶
Execute a custom SQL query against the database.
⚠️ Warning: Use with caution. The endpoint applies only basic safety checks, and custom SQL queries are not fully sanitized.
Request Body:
Response:
Example:
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "SELECT Source_DB, COUNT(*) as count FROM fact_phages GROUP BY Source_DB"}'
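The same request can be prepared from Python. The sketch below builds the POST /query request with the standard library and adds a client-side check that only a single SELECT statement is sent. The `build_query_request` helper and its guard are assumptions for illustration; the guard is a convenience for the caller and does not replace server-side validation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(sql, base_url=BASE_URL):
    """Build (but do not send) a POST /query request for one SELECT statement."""
    statement = sql.strip().rstrip(";")
    # Client-side guard only: reject anything that is not a single SELECT.
    # Note that it also rejects semicolons inside string literals.
    if not statement.upper().startswith("SELECT") or ";" in statement:
        raise ValueError("only single SELECT statements are allowed")
    body = json.dumps({"query": statement}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "SELECT Source_DB, COUNT(*) as count FROM fact_phages GROUP BY Source_DB"
)
# To send it against a running API:
#     urllib.request.urlopen(request, timeout=30)
```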
POST /phages¶
Retrieve phage sequences and metadata.
Request Body (by query):
Request Body (by IDs):
Response:
{
"phages": [
{
"phage_id": "NC_000866",
"sequence": "ATCG...",
"length": 48502,
"metadata": {...}
}
],
"count": 3
}
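A response of this shape can be reduced to a compact per-phage summary on the client. The sketch below assumes only the fields shown above (`phages`, `phage_id`, `sequence`, `length`); the `summarize_phages` helper and the toy payload with made-up sequences are illustrative, not real database records.

```python
def summarize_phages(payload):
    """Return (phage_id, length, gc_fraction) tuples from a /phages-style payload."""
    rows = []
    for record in payload.get("phages", []):
        sequence = record.get("sequence", "").upper()
        gc = sum(sequence.count(base) for base in "GC")
        gc_fraction = gc / len(sequence) if sequence else 0.0
        rows.append((record["phage_id"], record.get("length"), gc_fraction))
    return rows


# Toy payload with made-up sequences, shaped like the response above.
toy_payload = {
    "phages": [
        {"phage_id": "toy_1", "sequence": "ATGC", "length": 4, "metadata": {}},
        {"phage_id": "toy_2", "sequence": "GGCC", "length": 4, "metadata": {}},
    ],
    "count": 2,
}
rows = summarize_phages(toy_payload)
```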
POST /proteins¶
Retrieve protein sequences and metadata.
Request Body:
FASTA Export¶
POST /phages/fasta¶
Export phage sequences in FASTA format.
Request Body:
Response: (text/plain)
Example:
curl -X POST http://localhost:8000/phages/fasta \
-H "Content-Type: application/json" \
-d '{"query": "SELECT Phage_ID FROM fact_phages WHERE Length > 100000 LIMIT 10"}' \
> large_phages.fasta
POST /proteins/fasta¶
Export protein sequences in FASTA format.
Request Body:
Database Schema Reference¶
For API queries, the following tables are accessible:
Fact Table¶
- fact_phages: Main phage metadata
Dimension Tables¶
- dim_proteins: Protein annotations
- dim_terminators: Transcription terminators
- dim_anti_crispr: Anti-CRISPR proteins
- dim_virulent_factors: Virulence factors
- dim_transmembrane_proteins: Transmembrane predictions
- dim_trna_tmrna: tRNA/tmRNA features
- dim_crispr_array: CRISPR arrays
- dim_antimicrobial_resistance_genes: AMR genes
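Because table names cannot be passed as SQL parameters, a client that builds queries dynamically may want to check names against the list above before interpolating them. A minimal sketch follows; the `ACCESSIBLE_TABLES` set and the `count_query` helper are hypothetical and live on the client side only.

```python
# Table names taken from the schema reference above.
ACCESSIBLE_TABLES = frozenset({
    "fact_phages",
    "dim_proteins",
    "dim_terminators",
    "dim_anti_crispr",
    "dim_virulent_factors",
    "dim_transmembrane_proteins",
    "dim_trna_tmrna",
    "dim_crispr_array",
    "dim_antimicrobial_resistance_genes",
})


def count_query(table):
    """Return a row-count SQL statement for a known table name."""
    if table not in ACCESSIBLE_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return f"SELECT COUNT(*) AS n FROM {table}"


sql = count_query("dim_proteins")
```

The resulting string can be sent as the `query` field of a POST /query request.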
Host data not yet exposed
Host-phage link files (phage_host_links.csv, host_fasta_mapping.json) are not yet accessible via the API. They will be added once the API is updated for host management. In the meantime, use the pbi Python package in the analysis container.
See the Database Overview for detailed schema information including host-phage link files.
Query Examples¶
Get Phages by Host¶
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT Phage_ID, Host, Length FROM fact_phages WHERE Host LIKE '\''%Staphylococcus%'\'' LIMIT 10"
}'
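The `'\''` sequences in the curl command exist only to get single quotes through the shell. From Python the JSON quoting is handled by `json.dumps`, so only the SQL string literal needs care. The sketch below builds the same request body; the `host_query_payload` helper is hypothetical, and its escaping is basic (it doubles single quotes but does not escape the LIKE wildcards `%` and `_`).

```python
import json


def host_query_payload(host_fragment, limit=10):
    """Return the JSON body for a /query request filtering phages by host name."""
    # Double single quotes so the fragment stays inside the SQL string literal.
    escaped = host_fragment.replace("'", "''")
    sql = (
        "SELECT Phage_ID, Host, Length FROM fact_phages "
        f"WHERE Host LIKE '%{escaped}%' LIMIT {int(limit)}"
    )
    return json.dumps({"query": sql})


payload = host_query_payload("Staphylococcus")
```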
Get Large Phages with Many Proteins¶
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT f.Phage_ID, f.Length, COUNT(p.Protein_ID) as protein_count FROM fact_phages f JOIN dim_proteins p ON f.Phage_ID = p.Phage_ID WHERE f.Length > 200000 GROUP BY f.Phage_ID, f.Length HAVING COUNT(p.Protein_ID) > 200 LIMIT 20"
}'
Export Specific Phages to FASTA¶
curl -X POST http://localhost:8000/phages/fasta \
-H "Content-Type: application/json" \
-d '{
"phage_ids": ["NC_000866", "NC_001895"]
}' > my_phages.fasta
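The export can also be driven from Python and the FASTA text parsed in memory instead of being written to a file. The sketch below uses the `phage_ids` and `query` body fields from the curl examples in this document; the helpers `build_fasta_request` and `parse_fasta` are illustrative, and the parser assumes that the record ID is the first whitespace-delimited token of each header line.

```python
import json
import urllib.request


def build_fasta_request(phage_ids=None, query=None, base_url="http://localhost:8000"):
    """Build a POST /phages/fasta request from either IDs or a SQL query."""
    if (phage_ids is None) == (query is None):
        raise ValueError("pass exactly one of phage_ids or query")
    body = {"phage_ids": list(phage_ids)} if phage_ids is not None else {"query": query}
    return urllib.request.Request(
        f"{base_url}/phages/fasta",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


request = build_fasta_request(phage_ids=["NC_000866", "NC_001895"])
# Against a running API:
#     with urllib.request.urlopen(request, timeout=60) as resp:
#         sequences = parse_fasta(resp.read().decode("utf-8"))
```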
Using Python¶
import requests
# Health check
response = requests.get('http://localhost:8000/health')
print(response.json())
# Get statistics
response = requests.get('http://localhost:8000/stats')
print(response.json())
# Query phages
response = requests.post(
'http://localhost:8000/query',
json={"query": "SELECT * FROM fact_phages LIMIT 5"}
)
print(response.json())
Running in Development Mode¶
# With auto-reload
cd api
uvicorn app:app --reload --host 0.0.0.0 --port 8000
# With custom database path
DATABASE_PATH=/path/to/database.duckdb uvicorn app:app --reload
Environment Variables¶
# Database path
export DATABASE_PATH=/data/processed/databases/phage_database_optimized.duckdb
# Phage FASTA path
export PHAGE_FASTA=/data/processed/sequences/all_phages.fasta
# Protein FASTA path
export PROTEIN_FASTA=/data/processed/sequences/all_proteins.fasta
# API port
export PORT=8000
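For scripts that need the same configuration, the variables can be read into a small settings object. This is a sketch under assumptions: how the API itself reads these variables is not documented here, the `Settings` class and `load_settings` function are hypothetical, and the fallback values simply mirror the example exports above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_path: str
    phage_fasta: str
    protein_fasta: str
    port: int


def load_settings(env=None):
    """Read the documented variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        database_path=env.get(
            "DATABASE_PATH",
            "/data/processed/databases/phage_database_optimized.duckdb",
        ),
        phage_fasta=env.get(
            "PHAGE_FASTA", "/data/processed/sequences/all_phages.fasta"
        ),
        protein_fasta=env.get(
            "PROTEIN_FASTA", "/data/processed/sequences/all_proteins.fasta"
        ),
        # Environment values are strings, so the port is converted explicitly.
        port=int(env.get("PORT", "8000")),
    )
```

Accepting a mapping instead of always reading `os.environ` makes the function easy to exercise with test values.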
Error Handling¶
The API returns standard HTTP status codes:
- 200: Success
- 400: Bad Request (invalid query, missing parameters)
- 404: Not Found (resource doesn't exist)
- 500: Internal Server Error (database error, unexpected issue)
- 503: Service Unavailable (database connection failed)
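A client can turn the status codes above into a small decision table. The sketch below is one possible mapping; the action names and the choice to retry only on 503 are assumptions, not behavior prescribed by the API.

```python
RETRYABLE = {503}           # database connection failed; may recover
CLIENT_ERRORS = {400, 404}  # fix the request instead of retrying


def classify_status(status):
    """Map a status code from the list above to a client-side action."""
    if status == 200:
        return "ok"
    if status in CLIENT_ERRORS:
        return "fix_request"
    if status in RETRYABLE:
        return "retry_later"
    if status == 500:
        return "report"
    return "unknown"
```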
Error Response Format:
Known Limitations¶
- No Authentication: API is open — not suitable for public deployment
- No Rate Limiting: Can be overwhelmed by many requests
- Query Safety: Custom SQL queries not fully sanitized
- No Pagination: Large result sets may cause timeouts
- No Host Endpoints: API not updated for host management yet
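Until pagination is available, a client can page through large results itself by appending `ORDER BY ... LIMIT ... OFFSET ...` to its own queries. In the sketch below, `run_query` is a hypothetical callable that sends one SQL string to POST /query and returns a list of rows. It is passed in because the response format of /query is not specified in this document.

```python
def fetch_all(run_query, base_sql, order_by, page_size=1000):
    """Collect all rows by requesting fixed-size pages until one comes back short."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    rows = []
    offset = 0
    while True:
        # A stable ORDER BY keeps pages from overlapping or skipping rows.
        sql = f"{base_sql} ORDER BY {order_by} LIMIT {page_size} OFFSET {offset}"
        page = run_query(sql)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size
```

For example, `fetch_all(run_query, "SELECT Phage_ID FROM fact_phages", "Phage_ID", page_size=5000)` issues one request per 5000 rows and stops at the first short page.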
Support¶
For API issues or feature requests:
Note: This API is a Work In Progress. Features and endpoints will change, including the addition of host-related endpoints. Always check the latest documentation at /docs once the API is updated.