API Engine
The Crawlab API Engine provides a RESTful interface for managing and interacting with the Crawlab platform's components. Built on top of the Gin web framework, it follows a structured design with controllers, middlewares, and utilities that facilitate clean, maintainable, and extensible API endpoints.
Architecture
The API Engine follows a layered architecture with clear separation of concerns:
Key Components
- HTTP Server Layer:
  - Handles HTTP connections, request routing, and middleware integration
  - Implemented in the Api struct within the apps package
  - Manages lifecycle events (initialization, start/stop)
- Middleware Layer:
  - Processes requests before they reach controllers
  - Handles cross-cutting concerns like authentication, CORS, and logging
  - Defined in the middlewares package
- Router Layer:
  - Maps HTTP routes to controller handlers
  - Organizes endpoints by authentication requirements
  - Implemented in the controllers/router.go file
- Controller Layer:
  - Implements API endpoint handlers and business logic
  - Uses generic Base Controller for common operations
  - Specialized controllers for specific resources (Spider, Task, etc.)
- Service Layer:
  - Provides data access and domain logic operations
  - Abstracts database operations
  - Implements business rules
- Database Layer:
  - MongoDB storage for persistent data
  - File system for spider code and related assets
Request Processing Flow
The following sequence diagram illustrates how requests flow through the system:
Processing Steps
- Request Reception: The server receives an incoming HTTP request
- Middleware Processing:
  - CORS headers are applied
  - Authentication tokens are validated
  - User information is attached to request context
- Controller Execution:
  - Parameters are extracted and validated
  - Business logic is executed through service layer
  - Appropriate response is prepared
- Response Generation:
  - Standardized JSON format is used
  - Success/error status is determined
  - Data payload is formatted consistently
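The processing steps above can be sketched with the Go standard library alone. This is a minimal illustration, not Crawlab's implementation: it uses net/http in place of Gin so it runs without dependencies, and the names cors, auth, whoAmI, pipeline, probe, and the fixed token "demo-token" are all hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type ctxKey string

const userKey ctxKey = "user"

// writeJSON sends body as JSON with the given status code.
func writeJSON(w http.ResponseWriter, code int, body map[string]any) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(body)
}

// cors applies a permissive CORS header (illustrative policy only).
func cors(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		next.ServeHTTP(w, r)
	})
}

// auth checks the bearer token and attaches the user to the request context.
// Comparing against a fixed token stands in for real token validation.
func auth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer demo-token" {
			writeJSON(w, http.StatusUnauthorized, map[string]any{
				"status": "ok", "message": "error", "error": "unauthorized",
			})
			return
		}
		ctx := context.WithValue(r.Context(), userKey, "admin")
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// whoAmI plays the controller role: it reads the user and builds the response.
func whoAmI(w http.ResponseWriter, r *http.Request) {
	user, _ := r.Context().Value(userKey).(string)
	writeJSON(w, http.StatusOK, map[string]any{
		"status": "ok", "message": "success", "data": user,
	})
}

// pipeline wires the steps in order: CORS, then auth, then the controller.
func pipeline() http.Handler {
	return cors(auth(http.HandlerFunc(whoAmI)))
}

// probe sends one in-memory request through the pipeline.
func probe(authHeader string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/me", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	pipeline().ServeHTTP(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	fmt.Println(probe("Bearer demo-token"))
	fmt.Println(probe(""))
}
```

Because each middleware wraps the next handler, a request rejected by auth never reaches the controller, which matches the ordering described in the steps.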
Controller System
Crawlab uses a generic controller system that provides standardized CRUD operations for all resources:
Base Controller
The BaseController is a generic type that handles common operations:
type BaseController[T any] struct {
    modelSvc *service.ModelService[T] // Data service
    actions  []Action                 // Custom actions
}
It provides implementations for:
- GetById: Retrieve a specific resource by ID
- GetList: Retrieve a paginated list of resources
- Post: Create a new resource
- PutById: Update a specific resource
- PatchList: Batch update resources
- DeleteById: Delete a specific resource
- DeleteList: Batch delete resources
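To show why a generic controller removes per-resource boilerplate, here is a small sketch of the idea. It is an assumption-laden illustration: the Store interface and memStore type are hypothetical in-memory stand-ins for the model service, newBaseController is not the real constructor, and the methods take plain arguments rather than an HTTP context.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrNotFound = errors.New("not found")

// Store is a hypothetical stand-in for the model service.
type Store[T any] interface {
	GetById(id string) (T, error)
	Insert(id string, item T)
	DeleteById(id string) error
}

// memStore keeps items in a map so the sketch needs no database.
type memStore[T any] struct{ items map[string]T }

func newMemStore[T any]() *memStore[T] {
	return &memStore[T]{items: map[string]T{}}
}

func (s *memStore[T]) GetById(id string) (T, error) {
	item, ok := s.items[id]
	if !ok {
		var zero T
		return zero, ErrNotFound
	}
	return item, nil
}

func (s *memStore[T]) Insert(id string, item T) { s.items[id] = item }

func (s *memStore[T]) DeleteById(id string) error {
	if _, ok := s.items[id]; !ok {
		return ErrNotFound
	}
	delete(s.items, id)
	return nil
}

// BaseController implements the shared operations once for every resource type.
type BaseController[T any] struct{ store Store[T] }

func newBaseController[T any](store Store[T]) *BaseController[T] {
	return &BaseController[T]{store: store}
}

func (c *BaseController[T]) GetById(id string) (T, error) { return c.store.GetById(id) }
func (c *BaseController[T]) Post(id string, item T)       { c.store.Insert(id, item) }
func (c *BaseController[T]) DeleteById(id string) error   { return c.store.DeleteById(id) }

// Spider and Task are simplified example resources.
type Spider struct{ Name string }
type Task struct{ Status string }

func main() {
	spiders := newBaseController[Spider](newMemStore[Spider]())
	spiders.Post("s1", Spider{Name: "news"})
	s, err := spiders.GetById("s1")
	fmt.Println(s.Name, err)
	fmt.Println(spiders.DeleteById("missing"))
}
```

The same controller code serves Spider and Task without duplication, and the compiler rejects attempts to store the wrong type.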
Controller Hierarchy
Specialized controllers extend the base controller with resource-specific operations.
Authentication System
Authentication is a critical component of the API engine, providing security for protected resources.
Authentication Flow
Implementation Details
The authentication system supports:
- Bearer Token Authentication:
  - JWT-style tokens in Authorization header
  - Tokens carry user identity and claims
  - Implementation in middlewares.AuthorizationMiddleware()
- Sync Authentication:
  - Special authentication for node synchronization
  - Uses an application auth key
  - Implementation in middlewares.SyncAuthorizationMiddleware()
- Role-Based Access:
  - Users have assigned roles (admin, regular user)
  - Permissions are determined by role
  - Resource access is controlled by permissions
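The split between authentication (who is calling) and role-based authorization (what they may do) can be sketched as two small functions. Everything here is hypothetical: the sessions map replaces real token verification, the role names "admin" and "user" are simplified, and the "Bearer " prefix is assumed from the scheme name above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var (
	ErrUnauthorized = errors.New("unauthorized")
	ErrForbidden    = errors.New("forbidden")
)

type User struct {
	Name string
	Role string // "admin" or "user" in this sketch
}

// sessions maps opaque demo tokens to users. A real deployment would verify
// a signed token instead of looking it up in a map.
var sessions = map[string]User{
	"admin-token": {Name: "alice", Role: "admin"},
	"user-token":  {Name: "bob", Role: "user"},
}

// bearerToken extracts the token from an "Authorization: Bearer <token>" value.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	token := strings.TrimSpace(header[len(prefix):])
	return token, token != ""
}

// authenticate resolves the header to a user or reports ErrUnauthorized.
func authenticate(header string) (User, error) {
	token, ok := bearerToken(header)
	if !ok {
		return User{}, ErrUnauthorized
	}
	user, found := sessions[token]
	if !found {
		return User{}, ErrUnauthorized
	}
	return user, nil
}

// authorize rejects non-admin users for admin-only operations.
func authorize(u User, adminOnly bool) error {
	if adminOnly && u.Role != "admin" {
		return ErrForbidden
	}
	return nil
}

func main() {
	u, err := authenticate("Bearer user-token")
	fmt.Println(u.Name, err, authorize(u, true))
}
```

Keeping the two errors distinct lets the caller map a missing or invalid token to 401 and an insufficient role to 403.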
Router System and API Structure
The router system organizes endpoints by authentication requirement and resource type:
Router Groups
type RouterGroups struct {
    AuthGroup      *gin.RouterGroup // Requires authentication
    SyncAuthGroup  *gin.RouterGroup // Special sync authentication
    AnonymousGroup *gin.RouterGroup // No authentication required
}
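The grouping idea, where every route in a group shares one middleware chain, can be sketched without Gin. This is a simplified stand-in using net/http: the group type, requireAuthHeader, and the status helper are hypothetical, and the auth check only tests that a header is present.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// group registers routes that share one middleware chain.
type group struct {
	mux         *http.ServeMux
	middlewares []middleware
}

// handle wraps h with the group's middlewares, outermost first.
func (g group) handle(path string, h http.HandlerFunc) {
	var handler http.Handler = h
	for i := len(g.middlewares) - 1; i >= 0; i-- {
		handler = g.middlewares[i](handler)
	}
	g.mux.Handle(path, handler)
}

// requireAuthHeader rejects requests that carry no Authorization header.
func requireAuthHeader(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func ok(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "ok") }

// newRouter places /login in the anonymous group and /spiders in the auth group.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	anonymousGroup := group{mux: mux}
	authGroup := group{mux: mux, middlewares: []middleware{requireAuthHeader}}
	anonymousGroup.handle("/login", ok)
	authGroup.handle("/spiders", ok)
	return mux
}

// status sends an in-memory GET request and returns the response code.
func status(path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	newRouter().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(status("/login", ""), status("/spiders", ""), status("/spiders", "Bearer x"))
}
```

Choosing the group at registration time means the authentication requirement is visible where the route is declared rather than inside each handler.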
API Resource Categories
Crawlab's API is organized into several logical resource categories:
- Authentication:
  - POST /login: User authentication
  - GET /me: Current user information
- Users & Projects:
  - GET/POST/PUT/DELETE /users: User management
  - GET/POST/PUT/DELETE /projects: Project organization
- Spiders:
  - GET/POST/PUT/DELETE /spiders: Spider management
  - GET /spiders/:id/files: Spider file listing
  - POST /spiders/:id/run: Run a spider
- Tasks:
  - GET/POST /tasks: Task management
  - GET /tasks/:id/logs: Task log retrieval
  - POST /tasks/:id/cancel: Cancel a running task
- Schedules:
  - GET/POST/PUT/DELETE /schedules: Schedule management
  - POST /schedules/:id/enable: Enable a schedule
- Nodes:
  - GET /nodes: Node listing
  - POST /nodes/:id/enable: Enable a node
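As a client-side illustration of the endpoint layout, the sketch below builds (but does not send) a request for POST /spiders/:id/run. The helper newRunRequest, the base URL, the spider ID, and the token are hypothetical; the "Bearer " prefix is assumed, and a real call may also need a JSON body, which is not assumed here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newRunRequest builds an authenticated request for POST /spiders/:id/run.
func newRunRequest(baseURL, spiderID, token string) (*http.Request, error) {
	endpoint := baseURL + "/spiders/" + url.PathEscape(spiderID) + "/run"
	req, err := http.NewRequest(http.MethodPost, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := newRunRequest("http://localhost:8000", "abc123", "demo-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the request to an http.Client would send it; escaping the ID keeps unusual characters from changing the path structure.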
Response Handling
The API uses standardized response formats for consistency:
Response Structure
type Response struct {
    Status  string      `json:"status"`  // "ok" for all responses
    Message string      `json:"message"` // "success" or "error"
    Data    interface{} `json:"data"`    // Payload for success
    Error   string      `json:"error"`   // Error message
}

type ListResponse struct {
    Status  string      `json:"status"`
    Message string      `json:"message"`
    Total   int         `json:"total"` // Total count for pagination
    Data    interface{} `json:"data"`  // Array of items
    Error   string      `json:"error"`
}
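To make the wire format concrete, the following sketch marshals both structures with encoding/json. The struct definitions are copied from above; the encode helper and the sample payloads are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Response struct {
	Status  string      `json:"status"`
	Message string      `json:"message"`
	Data    interface{} `json:"data"`
	Error   string      `json:"error"`
}

type ListResponse struct {
	Status  string      `json:"status"`
	Message string      `json:"message"`
	Total   int         `json:"total"`
	Data    interface{} `json:"data"`
	Error   string      `json:"error"`
}

// encode returns the JSON form of v, or an empty string if marshaling fails.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	single := Response{Status: "ok", Message: "success", Data: map[string]string{"name": "news"}}
	list := ListResponse{Status: "ok", Message: "success", Total: 2, Data: []string{"a", "b"}}
	fmt.Println(encode(single))
	// {"status":"ok","message":"success","data":{"name":"news"},"error":""}
	fmt.Println(encode(list))
	// {"status":"ok","message":"success","total":2,"data":["a","b"],"error":""}
}
```

Since none of the tags use omitempty, every field appears in every response, so clients can rely on a fixed set of keys.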
Response Generation
Helper functions ensure consistent response formatting:
- HandleSuccess: General success response
- HandleSuccessWithData: Success with data payload
- HandleSuccessWithListData: Success with paginated list
- HandleErrorBadRequest: 400 Bad Request response
- HandleErrorUnauthorized: 401 Unauthorized response
- HandleErrorForbidden: 403 Forbidden response
- HandleErrorNotFound: 404 Not Found response
- HandleErrorInternalServerError: 500 Internal Server Error response
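A rough idea of how such helpers keep responses uniform is sketched below for two of them. The signatures are assumptions: these lowercase versions write to a plain http.ResponseWriter, whereas helpers in a Gin application would normally receive the Gin context. The respond and demo functions are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type Response struct {
	Status  string      `json:"status"`
	Message string      `json:"message"`
	Data    interface{} `json:"data"`
	Error   string      `json:"error"`
}

// respond is the single place where status code and JSON body are written.
func respond(w http.ResponseWriter, code int, body Response) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(body)
}

// handleSuccessWithData sketches a 200 response carrying a payload.
func handleSuccessWithData(w http.ResponseWriter, data interface{}) {
	respond(w, http.StatusOK, Response{Status: "ok", Message: "success", Data: data})
}

// handleErrorNotFound sketches a 404 response carrying an error message.
func handleErrorNotFound(w http.ResponseWriter, err error) {
	respond(w, http.StatusNotFound, Response{Status: "ok", Message: "error", Error: err.Error()})
}

// demo runs one helper against an in-memory recorder.
func demo(found bool) (int, string) {
	rec := httptest.NewRecorder()
	if found {
		handleSuccessWithData(rec, "spider-1")
	} else {
		handleErrorNotFound(rec, errors.New("spider not found"))
	}
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	fmt.Println(demo(true))
	fmt.Println(demo(false))
}
```

Routing every response through one function is what guarantees that handlers cannot drift apart in format.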
Error Handling
The API implements a comprehensive error handling strategy:
- Validation Errors:
  - Return 400 Bad Request with validation details
  - Example: Missing required fields, invalid format
- Authentication Errors:
  - Return 401 Unauthorized with error message
  - Example: Invalid or expired token
- Permission Errors:
  - Return 403 Forbidden with error message
  - Example: Insufficient privileges for operation
- Not Found Errors:
  - Return 404 Not Found with error message
  - Example: Resource with specified ID doesn't exist
- Server Errors:
  - Return 500 Internal Server Error with details
  - Debug information only in development mode
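One way to implement the mapping above is with sentinel errors and errors.Is, so that wrapped errors still resolve to the right status code. This is a sketch of the general technique, not Crawlab's code; the sentinel names, statusFor, and notFound are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel errors, one per category in the list above.
var (
	ErrValidation   = errors.New("validation failed")
	ErrUnauthorized = errors.New("unauthorized")
	ErrForbidden    = errors.New("forbidden")
	ErrNotFound     = errors.New("not found")
)

// statusFor maps an error to its HTTP status; unknown errors become 500.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrValidation):
		return http.StatusBadRequest
	case errors.Is(err, ErrUnauthorized):
		return http.StatusUnauthorized
	case errors.Is(err, ErrForbidden):
		return http.StatusForbidden
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// notFound wraps ErrNotFound with context about the missing resource.
func notFound(id string) error {
	return fmt.Errorf("spider %q: %w", id, ErrNotFound)
}

func main() {
	fmt.Println(statusFor(notFound("abc")), statusFor(errors.New("disk full")))
	// 404 500
}
```

Wrapping with %w lets lower layers add detail while the controller still classifies the error by category.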
Extensibility
The API engine is designed for extensibility:
- Generic Controllers:
  - Easy creation of new resource endpoints
  - Type-safe operations with Go generics
- Custom Actions:
  - Addition of non-standard operations to controllers
  - Registered via the actions parameter:

    RegisterController(groups.AuthGroup, "/spiders", NewController[models.Spider]([]Action{
        {
            Method:      http.MethodGet,
            Path:        "/:id/files",
            HandlerFunc: GetSpiderFiles,
        },
        // more custom actions...
    }...))

- Middleware Registration:
  - Custom request processing
  - Cross-cutting concerns
Server Configuration
The HTTP server is configured with sensible defaults and can be customized:
- Address: Configurable host and port (default: 0.0.0.0:8000)
- TLS: Optional TLS configuration for HTTPS
- Timeouts:
  - Read timeout: 30 seconds
  - Write timeout: 30 seconds
  - Idle timeout: 120 seconds
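Expressed with the standard library's http.Server, these settings might look like the sketch below. The newServer helper is hypothetical, and how Crawlab actually reads its address and TLS settings from configuration is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer applies the timeouts listed above to a standard http.Server.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      handler,
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 30 * time.Second,
		IdleTimeout:  120 * time.Second,
	}
}

func main() {
	srv := newServer("0.0.0.0:8000", http.NewServeMux())
	fmt.Println(srv.Addr, srv.ReadTimeout, srv.WriteTimeout, srv.IdleTimeout)
	// srv.ListenAndServe() would start serving plain HTTP.
	// srv.ListenAndServeTLS(certFile, keyFile) would serve HTTPS instead.
}
```

Read and write timeouts bound how long a single slow client can hold a connection, while the longer idle timeout lets keep-alive connections be reused.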
Performance Considerations
The API engine includes several optimizations:
- Request Pagination:
  - Limits large result sets
  - Configurable page size and number
- Database Query Optimization:
  - Efficient MongoDB query building
  - Index utilization
- Context Cancellation:
  - Proper handling of client disconnects
  - Resource cleanup
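Two of these ideas, pagination bounds and context cancellation, are sketched below. The sketch slices an in-memory list, whereas a real query would push skip and limit down to the database. The default page size of 10, the cap of 100, and all function names are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

const (
	defaultPageSize = 10  // assumed default, not taken from Crawlab
	maxPageSize     = 100 // assumed cap, not taken from Crawlab
)

// paginate returns the items on the given 1-based page, clamping bad input.
func paginate[T any](items []T, page, size int) []T {
	if page < 1 {
		page = 1
	}
	if size < 1 {
		size = defaultPageSize
	}
	if size > maxPageSize {
		size = maxPageSize
	}
	start := (page - 1) * size
	if start > len(items) {
		start = len(items)
	}
	end := start + size
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

// sumUntilCanceled stops work as soon as the context is canceled,
// for example because the client disconnected.
func sumUntilCanceled(ctx context.Context, values []int) (int, error) {
	sum := 0
	for _, v := range values {
		if err := ctx.Err(); err != nil {
			return sum, err
		}
		sum += v
	}
	return sum, nil
}

// canceledContext returns a context that has already been canceled.
func canceledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	fmt.Println(paginate([]int{1, 2, 3, 4, 5}, 2, 2))
	_, err := sumUntilCanceled(canceledContext(), []int{1, 2, 3})
	fmt.Println(err)
}
```

In an HTTP handler the context would come from the request, so long-running work is abandoned once the client goes away.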
Security Considerations
Security is a primary concern for the API engine:
- Token-based Authentication:
  - JWT or similar token validation
  - Stateless design
- Role-based Access Control:
  - Permission verification for operations
  - Principle of least privilege
- Input Validation:
  - Strict request validation
  - Prevention of injection attacks
- Error Information Limitation:
  - Detailed errors only in development mode
  - Generic error messages in production
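The last point, limiting error detail outside development, can be illustrated in a few lines. This is only a sketch: publicMessage, the devMode flag, and the sample error are hypothetical, and how Crawlab detects development mode is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// errSample is a made-up internal error containing sensitive detail.
var errSample = errors.New("mongo: connection refused at 10.0.0.5:27017")

// publicMessage returns the full error text only in development mode.
func publicMessage(err error, devMode bool) string {
	if devMode {
		return err.Error()
	}
	return "internal server error"
}

func main() {
	fmt.Println(publicMessage(errSample, true))
	fmt.Println(publicMessage(errSample, false))
}
```

The full error would still be written to server logs; only the text returned to the client is reduced.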
Conclusion
The Crawlab API Engine provides a robust, consistent, and extensible interface for interacting with the Crawlab platform. Its well-structured design facilitates maintenance and evolution, while its security features ensure that resources are properly protected. The standardized response format and error handling provide a consistent experience for API consumers.