
Alibaba Cloud Full Stack (7): SLS, CloudMonitor, and Observability
Build full-stack observability: SLS for log collection and querying, CloudMonitor for metrics and alerts, ARMS for distributed tracing. Set up a complete monitoring stack for a production web application.
The worst production outage I ever caused took three hours to diagnose. A Node.js service was returning 502s intermittently — maybe 5% of requests — and I had nothing. No centralized logs (each ECS instance had its own /var/log/ and I was SSH-ing into them one at a time). No metrics dashboards (I was running top and df -h in terminals). No tracing (I was adding console.log timestamps to try to figure out which downstream call was hanging). Three hours later, I found the issue: a connection pool to RDS was exhausting under load because a forgotten cron job was holding connections open. The fix was two lines of code. The diagnosis took three hours of misery because I had zero observability.
The lesson was simple and costly: observability isn’t something you set up after your app is stable. You set it up before deploying to production. Ideally, before you even write the application code, because the observability stack shapes how you structure your logging, propagate request IDs, and instrument your dependencies. Set it up last, and you’ll have to retrofit everything. Set it up first, and everything will fit naturally.
This article covers the full observability stack on Alibaba Cloud: SLS for logs, CloudMonitor for metrics, and ARMS for traces. By the end you will have a working monitoring setup for the production web application we have been building throughout this series. The ECS instances come from Part 2, the network from Part 3. For the Terraform approach to provisioning these monitoring resources, see Terraform Part 7: Observability and Cost Control.
The Three Pillars of Observability#
The industry has converged on three signals that together provide a complete picture of what your system is doing:

Logs tell you what happened. A log line says “at 14:32:07, user abc123 requested /api/orders, which returned a 500 because the database connection timed out after 30 seconds.” Logs are discrete events, timestamped and structured. They are the forensic evidence you examine after something goes wrong.
Metrics tell you what is happening right now. A metric says “the P99 latency of /api/orders is currently 2.3 seconds, CPU utilization across the app tier is 78%, and the RDS connection pool is 90% exhausted.” Metrics are numerical time series. They are the vital signs you watch on a dashboard to spot problems before users report them.
Traces tell you why it happened. A trace says “this specific request spent 15ms in the API gateway, 200ms in the order service, 1800ms waiting for a database query, and 50ms serializing the response.” Traces follow a single request as it traverses multiple services. They are the X-ray that reveals which component in a distributed system is the bottleneck.
You need all three. Metrics tell you something is wrong (e.g., error rate spiked). Logs tell you what is wrong (e.g., database timeout errors). Traces tell you why it’s wrong (e.g., one specific query on the orders table is doing a full table scan because an index was dropped).
On Alibaba Cloud, the mapping is straightforward:
| Pillar | Alibaba Cloud Service | AWS Equivalent | What It Does |
|---|---|---|---|
| Logs | SLS (Simple Log Service) | CloudWatch Logs + OpenSearch | Log collection, indexing, querying, analytics |
| Metrics | CloudMonitor | CloudWatch Metrics | Infrastructure and custom metrics, alerting |
| Traces | ARMS (Application Real-Time Monitoring) | X-Ray + CloudWatch APM | APM, distributed tracing, service topology |
These three services integrate with each other. CloudMonitor can trigger alerts based on SLS query results. ARMS traces link to SLS log entries. SLS dashboards can pull CloudMonitor metric data. The integration isn’t as seamless as Datadog’s unified platform, but it covers 90% of what you need without third-party tooling.
SLS: Simple Log Service#
SLS is the backbone of observability on Alibaba Cloud. Despite the name, it’s not simple—it’s a fully-featured log analytics platform that combines collection, storage, indexing, querying, visualization, and alerting in one service. Think of it as AWS CloudWatch Logs and Elasticsearch combined with a SQL query engine on top.

Core Concepts#
SLS organizes everything into two levels:
Project — A top-level container, usually one per environment or application. A project is region-specific. All the logstores, dashboards, and alerts within a project share the same billing account and access control.
Logstore — A table of log data within a project. Each logstore has its own schema, retention period, and indexing configuration. You typically create one logstore per log source: one for nginx access logs, one for application logs, one for system logs.
| |
Create a project and logstores via the CLI:
| |
The shardCount determines write throughput. Each shard handles 5 MB/s of writes and 10 MB/s of reads, so two shards give you 10 MB/s of write capacity. With autoSplit enabled, SLS automatically adds shards when write pressure exceeds the threshold, up to maxSplitShard.
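To pick a starting shardCount, the per-shard limits above can be turned into a quick calculation. This is only a sizing sketch: the function name and the example throughput figures are mine, and the limits are the 5 MB/s write and 10 MB/s read numbers quoted in this section.

```python
import math

# Per-shard limits quoted in the text above.
WRITE_MB_PER_SHARD = 5
READ_MB_PER_SHARD = 10


def shards_needed(peak_write_mb_s: float, peak_read_mb_s: float = 0.0) -> int:
    """Return the smallest shard count covering both peak write and peak read."""
    for_write = math.ceil(peak_write_mb_s / WRITE_MB_PER_SHARD)
    for_read = math.ceil(peak_read_mb_s / READ_MB_PER_SHARD)
    # A logstore always has at least one shard.
    return max(1, for_write, for_read)


print(shards_needed(8))       # 2
print(shards_needed(12, 45))  # 5
```

Size for peak traffic rather than the daily average, and treat autoSplit as a safety net rather than the plan.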
SLS vs AWS: What Is Different#
If you are coming from AWS, the mapping is worth clarifying because SLS is not a 1:1 CloudWatch Logs equivalent:
| Capability | SLS | AWS |
|---|---|---|
| Log collection agent | Logtail (SLS-native) | CloudWatch Agent |
| Full-text search | Built-in, sub-second latency | CloudWatch Logs Insights (slower) |
| SQL analytics | Full SQL syntax on log data | CloudWatch Logs Insights (limited SQL) |
| Dashboards | Built into SLS | CloudWatch Dashboards (separate) |
| Long-term storage | Built-in tiered storage | Export to S3 + Athena |
| Schema-on-read | Yes, with indexing | Partially (Insights) |
| Real-time streaming | Built-in consumer groups | Kinesis Data Streams (separate) |
The biggest difference: SLS combines log storage, search, and analytics in one service. On AWS, you would use CloudWatch Logs for collection, possibly export to S3, set up Elasticsearch (OpenSearch) for search, and use Athena for SQL analytics. SLS does all of that in one place. The tradeoff is vendor lock-in: SLS query syntax is not standard across clouds.
Log Query Syntax#
SLS supports three query modes, and understanding them saves a lot of frustration.
Full-text search — Just type a keyword. SLS searches across all indexed fields.
| |
This returns every log line containing the word “ERROR” anywhere.
Key-value search — Use field names with operators for precise filtering.
| |
This returns log entries where the HTTP status code is 500 or above AND the request method is POST. The colon (:) is a contains operator; >= is numeric comparison.
SQL analytics — Append a pipe | after a search expression and write standard SQL.
| |
This finds all 5xx errors, then groups them by minute to show the error count and number of unique affected users over time. The __time__ field is the built-in log timestamp. The approx_distinct function is a HyperLogLog approximation — fast and memory-efficient for high-cardinality fields.
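If the two halves of a search-plus-SQL query feel abstract, it can help to see the same logic in plain Python. The sketch below reproduces the per-minute aggregation over a handful of made-up records. The field names status and user_id are assumptions for illustration, and the distinct count here is exact, whereas approx_distinct is an estimate.

```python
from collections import defaultdict

# Hypothetical parsed log records; field names are assumptions.
logs = [
    {"__time__": 1700000000, "status": 502, "user_id": "u1"},
    {"__time__": 1700000010, "status": 500, "user_id": "u2"},
    {"__time__": 1700000015, "status": 200, "user_id": "u3"},
    {"__time__": 1700000070, "status": 503, "user_id": "u1"},
    {"__time__": 1700000075, "status": 503, "user_id": "u1"},
]


def errors_per_minute(records):
    """Group 5xx records by minute; return {minute_start: (error_count, unique_users)}."""
    buckets = defaultdict(lambda: {"count": 0, "users": set()})
    for rec in records:
        # Search half of the query: keep only 5xx responses.
        if rec["status"] < 500:
            continue
        # SQL half: truncate the Unix timestamp to the start of its minute.
        minute = rec["__time__"] - rec["__time__"] % 60
        buckets[minute]["count"] += 1
        buckets[minute]["users"].add(rec["user_id"])
    return {m: (b["count"], len(b["users"])) for m, b in sorted(buckets.items())}


print(errors_per_minute(logs))
```

The left side of the pipe decides which rows survive; the right side only ever sees those rows. Getting the filter right first keeps the SQL cheap.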

Here are real queries I use daily:
| |
Enabling Indexes#
SLS does not index fields by default. Before you can use key-value queries or SQL analytics, you need to create an index configuration. Without indexes, only full-text search on raw log content works, and even that requires a full-text index.
| |
The line section enables full-text indexing with the specified token delimiters. The keys section defines field-level indexes. Setting doc_value: true enables SQL analytics on that field. Every indexed field costs storage, so only index the fields you actually query.
Cost note: Indexing roughly doubles your storage cost. For high-volume logs where you only need full-text search, skip per-field indexing and rely on the line index. For access logs where you run SQL dashboards, per-field indexing is worth the cost.
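One way to keep index definitions reviewable is to generate them from a short field list. The sketch below builds a dictionary with the line and keys sections and the doc_value flag described above. The remaining key names (token, caseSensitive, type) and the delimiter list are my assumptions about the index schema, so check them against the SLS index reference before sending anything to the API.

```python
import json

# Token delimiters are an assumption; tune them to your log format.
DEFAULT_TOKENS = [",", " ", '"', ";", "=", "(", ")", "[", "]", "{", "}", "?", "&", "/", ":", "\n", "\t"]


def build_index_config(fields, full_text=True):
    """Build an index definition.

    fields maps field name -> (type, needs_sql); needs_sql controls doc_value.
    """
    config = {"keys": {}}
    if full_text:
        config["line"] = {"token": DEFAULT_TOKENS, "caseSensitive": False}
    for name, (field_type, needs_sql) in fields.items():
        key = {"type": field_type, "doc_value": needs_sql}
        if field_type == "text":
            # Text fields are tokenized; numeric fields are not.
            key["token"] = DEFAULT_TOKENS
            key["caseSensitive"] = False
        config["keys"][name] = key
    return config


config = build_index_config({
    "status": ("long", True),         # used in SQL dashboards
    "request_time": ("double", True),  # used for percentile queries
    "request_uri": ("text", False),    # searched, but never aggregated
})
print(json.dumps(config, indent=2))
```

Keeping needs_sql explicit per field makes the cost decision visible in code review instead of buried in a console setting.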
Setting Up Logtail#
Logtail is SLS’s log collection agent. It runs on your ECS instances, watches log files, parses them according to your configuration, and ships them to SLS. It is lightweight (typically 50-100 MB RAM, <1% CPU), reliable (handles network interruptions with local buffering), and tightly integrated with SLS.

Installation#
On an ECS instance in the same region, installation is one command:
| |
The install script detects whether you are on a VPC internal network or the public internet and configures the endpoint accordingly. VPC-internal communication is free — there are no data transfer charges for log shipping within the same region.
After installation, create a machine group in SLS to identify which instances should receive which log collection configs:
| |
For auto-scaling groups where IPs change, use user-defined identity instead of IP-based identification. Create a file /etc/ilogtail/user_defined_id on each instance containing a group identifier like prod-app-servers, and set machineIdentifyType to userdefined.
Collecting Nginx Access Logs#
The most common collection setup is parsing nginx access logs with a custom format. First, configure nginx to write structured logs:
| |
Then create a Logtail collection config that parses this format:
| |
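Regex-based collection fails silently when the pattern and the log format drift apart, so I like to test the pattern against a sample line before it goes anywhere near a collection config. The sketch below assumes nginx's standard combined format with $request_time appended; that format, the sample line, and the field names are assumptions, so adapt the pattern to whatever log_format you actually configured.

```python
import re

# Assumed format: combined log format followed by $request_time.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request_uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'(?P<request_time>[\d.]+)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    fields["request_time"] = float(fields["request_time"])
    return fields


sample = (
    '10.0.1.15 - - [12/Mar/2025:14:32:07 +0800] "POST /api/orders HTTP/1.1" '
    '500 312 "-" "curl/8.4.0" 30.004'
)
print(parse_line(sample))
```

Run a few hundred real lines through the pattern and count the None results. Anything above zero is a line that would arrive in SLS unparsed.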
Apply the config via CLI:
| |
Within a minute, logs start flowing. You can verify in the SLS console or via CLI:
| |

Collecting Application Logs (JSON Format)#
For application logs, I strongly recommend JSON format. It eliminates the regex-parsing fragility and makes field indexing automatic.
Configure your application to emit JSON logs. Here is a Node.js example with pino:
| |
This produces log lines like:
| |
The Logtail config for JSON logs is much simpler — no regex needed:
| |
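The same one-JSON-object-per-line idea carries over to services that are not written in Node.js. Here is a minimal Python sketch using only the standard logging module. The output keys (time, level, msg) and the extra fields convention are my choices for illustration, not a format SLS requires.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "time": int(record.created * 1000),  # epoch milliseconds
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["stack"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def get_logger(name="app"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


get_logger().info("order created", extra={"fields": {"request_id": "req-123", "order_id": 42}})
```

Every key in the payload becomes a field you can index, so keep names stable across services. Renaming request_id to requestId in one service quietly breaks every query that joins on it.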
Collecting System Logs#
For syslog, journald, and system-level events, Logtail has built-in support:
| |
Building Dashboards#
A dashboard that nobody looks at is worse than useless — it gives false confidence. The key is building dashboards around the questions you actually ask during incidents, not the metrics that look impressive.

The Five Essential Panels#
Every production web application needs exactly these panels on the primary dashboard:
| Panel | SLS Query | What It Tells You |
|---|---|---|
| QPS trend | `* \| SELECT date_trunc('minute', __time__) as t, count(*)/60.0 as qps GROUP BY t ORDER BY t` | Traffic pattern: is a spike causing the problem, or did traffic drop (upstream failure)? |
| Error rate | `* \| SELECT date_trunc('minute', __time__) as t, round(count_if(status>=500)*100.0/count(*),2) as err_pct GROUP BY t ORDER BY t` | Is the error rate elevated? Anything above 0.1% deserves investigation. |
| P99 latency | `* \| SELECT date_trunc('minute', __time__) as t, approx_percentile(request_time, 0.99) as p99 GROUP BY t ORDER BY t` | Is the service getting slower? P99 catches tail latency that averages hide. |
| Top endpoints | `* \| SELECT request_uri, count(*) as cnt, approx_percentile(request_time, 0.50) as p50 GROUP BY request_uri ORDER BY cnt DESC LIMIT 10` | Where is traffic going? Which endpoints are slow? |
| Status code distribution | `* \| SELECT status, count(*) as cnt GROUP BY status ORDER BY cnt DESC` | Are you seeing unusual 4xx/5xx patterns? |
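To see why the P99 panel earns its place next to the average, here is a small pure-Python sketch of the error-rate and percentile calculations. It uses the nearest-rank method, which is exact; approx_percentile in SLS is an approximation, so expect small differences. The sample latencies are invented.

```python
import math


def error_rate_pct(statuses):
    """Percentage of responses with status >= 500, rounded to two decimals."""
    if not statuses:
        return 0.0
    errors = sum(1 for s in statuses if s >= 500)
    return round(errors * 100.0 / len(statuses), 2)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (seconds).
latencies = [0.05] * 98 + [3.0, 4.0]
print(sum(latencies) / len(latencies))  # the mean looks healthy
print(percentile(latencies, 0.50))      # so does the median
print(percentile(latencies, 0.99))      # the tail does not
```

Two requests out of a hundred are taking seconds, and only the P99 line shows it. That is the panel to look at when users complain and the averages look fine.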

Creating a Dashboard#
SLS dashboards are defined as JSON. Here is a stripped-down but functional ops dashboard:
| |
Create it via CLI:
| |
Practical tip: Start with the SLS console’s visual editor to build charts interactively, then export the JSON definition for version control. Editing dashboard JSON by hand is tedious. The console’s query explorer lets you test SLS queries with instant feedback before committing them to a dashboard panel.
CloudMonitor: Infrastructure Metrics and Alerting#
While SLS handles logs, CloudMonitor handles metrics — the numerical time series that track the health of your infrastructure. CloudMonitor is automatically enabled for all Alibaba Cloud resources. The moment you create an ECS instance, RDS database, or SLB load balancer, CloudMonitor starts collecting basic metrics.
Built-in Metrics#
CloudMonitor collects these metrics out of the box for every ECS instance:
| Metric | Description | Collection Interval |
|---|---|---|
| CPUUtilization | CPU usage percentage | 60 seconds |
| MemoryUsedPercent | Memory usage percentage | 60 seconds |
| DiskReadBPS / DiskWriteBPS | Disk I/O throughput | 60 seconds |
| DiskReadIOPS / DiskWriteIOPS | Disk I/O operations | 60 seconds |
| InternetInRate / InternetOutRate | Network throughput | 60 seconds |
| IntranetInRate / IntranetOutRate | VPC internal network throughput | 60 seconds |
| disk_usage_percent | Disk space used (requires agent) | 60 seconds |
| load_5m | 5-minute load average (requires agent) | 60 seconds |
The first six come from the hypervisor and require no agent. The last two require the CloudMonitor agent installed on the instance. Install it alongside Logtail:
| |
For other services, CloudMonitor provides metrics without any agent:
| Service | Key Metrics |
|---|---|
| RDS | CPU, memory, connections, IOPS, disk usage, slow queries per second |
| SLB | Active connections, new connections, QPS, healthy host count, latency |
| OSS | Request count, bandwidth, availability, first-byte latency |
| Redis (Tair) | CPU, memory usage, connections, QPS, hit rate, evictions |
| NAT Gateway | Active connections, bandwidth, packet rate |
Custom Metrics#
For application-level metrics that CloudMonitor does not collect automatically, push custom metrics via the API:
| |
In application code, batch custom metrics and push them on a schedule (every 60 seconds) rather than per-request:
| |
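The batching pattern itself is independent of any SDK. The sketch below accumulates per-request observations in memory and flushes one aggregate per metric on a timer. The push callable is a hypothetical stand-in for whatever actually sends the points to CloudMonitor, and the shape of the aggregate dictionary is my own choice rather than an API format.

```python
import threading
import time
from collections import defaultdict


class MetricBatcher:
    """Accumulate per-request observations and flush one aggregate per metric."""

    def __init__(self, push, interval_seconds=60):
        self._push = push  # hypothetical callable that sends a list of points
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._values = defaultdict(list)

    def observe(self, name, value):
        """Record one observation; cheap enough to call on every request."""
        with self._lock:
            self._values[name].append(value)

    def flush(self):
        """Swap out the buffer, aggregate it, and push the result."""
        with self._lock:
            snapshot, self._values = self._values, defaultdict(list)
        points = [
            {
                "metric": name,
                "count": len(vals),
                "sum": sum(vals),
                "max": max(vals),
                "avg": sum(vals) / len(vals),
            }
            for name, vals in snapshot.items()
        ]
        if points:
            self._push(points)
        return points

    def run_forever(self):
        """Flush on a fixed schedule; run this in a daemon thread."""
        while True:
            time.sleep(self._interval)
            self.flush()


batcher = MetricBatcher(push=print)
batcher.observe("order_latency_ms", 120)
batcher.observe("order_latency_ms", 80)
batcher.flush()
```

In a service you would start the loop once at boot with threading.Thread(target=batcher.run_forever, daemon=True).start() and call observe from request handlers. The network call then happens once a minute instead of once per request.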
Event Monitoring#
CloudMonitor also tracks system events — things that happen to your resources outside of normal metric collection. ECS instance restarts, disk errors, scheduled maintenance, security alerts. These are discrete events, not continuous time series.
Key events to watch:
| Event | What It Means | Recommended Action |
|---|---|---|
| Instance:SystemFailure.Reboot | Alibaba Cloud rebooted your instance due to host failure | Check if your app recovered cleanly |
| Disk:Stalled | Disk I/O stalled, likely a storage backend issue | Monitor for data corruption |
| Instance:PerformanceLimited | Burstable instance (t-series) exhausted its CPU credits | Upgrade instance type or switch to non-burstable |
| SecurityGroup:AuthorizeFailed | A connection was blocked by security group rules | Verify if this is expected or a misconfiguration |
Subscribe to events to receive notifications:
| |
Alert Configuration#
Alerts are the bridge between observability and action. The right alert wakes you up at 3 AM when the error rate spikes. The wrong alert wakes you up at 3 AM because CPU briefly hit 81% during a scheduled backup and went back down 30 seconds later. Getting alert thresholds right is an art, but the following rules of thumb have served me well.

Alert Design Principles#
- Alert on symptoms, not causes. Alert on “error rate > 1%” not “CPU > 80%.” High CPU is only a problem if it causes user-visible impact. Error rate is the user-visible impact itself.
- Use sustained thresholds. Never alert on a single data point. Require the condition to persist for 3-5 minutes to filter out transient spikes.
- Have exactly three severity levels. Critical (pages someone now), Warning (needs investigation within hours), Info (logged for review). More than three and nobody knows what each level means.
- Mute during known maintenance. Nothing destroys alert trust faster than alerts firing during a deployment you announced in advance.

Setting Up Alert Rules#
Here are the four alerts every production system needs:
1. High Error Rate (Critical)
| |
The total > 100 condition prevents false alerts during low-traffic periods. If only 3 requests came in and 1 failed, that is a 33% error rate: alarming numerically, meaningless practically.
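The guard is easy to express as code. In this sketch the 100-request floor comes from the rule above and the 1% threshold from the design principles; the function name and signature are mine.

```python
def should_alert(total, errors, min_requests=100, threshold_pct=1.0):
    """Fire only when there is enough traffic for the error rate to mean something."""
    if total <= min_requests:
        return False
    return errors * 100.0 / total > threshold_pct


print(should_alert(3, 1))      # False: 33% of almost nothing
print(should_alert(1000, 20))  # True: 2% of real traffic
print(should_alert(1000, 5))   # False: 0.5% is under the threshold
```

Tune min_requests to your quietest realistic traffic window. Set it too high and a real outage at 4 AM never reaches the threshold.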
2. High CPU Sustained (Warning)
| |
Times: 5 means the condition must be true for 5 consecutive evaluation periods (5 minutes at 60-second intervals). A brief CPU spike from a burst of traffic will not trigger this.
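The consecutive-periods rule is simple to model: keep the last N evaluations and fire only when all of them breached the threshold. This is a sketch of the idea, not CloudMonitor's implementation. The class name is mine, and the 80% threshold is taken from the CPU example in the design principles.

```python
from collections import deque


class SustainedThreshold:
    """Fire only after `times` consecutive samples exceed `threshold`."""

    def __init__(self, threshold, times):
        self.threshold = threshold
        self._recent = deque(maxlen=times)

    def update(self, value):
        """Record one evaluation period; return True if the alert should fire."""
        self._recent.append(value > self.threshold)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)


rule = SustainedThreshold(threshold=80, times=5)
samples = [85, 90, 70, 81, 82, 83, 84, 85]
print([rule.update(v) for v in samples])
```

The dip to 70 resets the streak, so the alert fires only on the last sample, after five breaches in a row. A single backup-induced spike never gets that far.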
3. Disk Space Low (Warning)
| |
This has two escalation levels: warn at 80% disk usage (sustained), critical at 90% (immediate). Disks filling up is the most preventable and most common cause of outages I have seen.
4. Slow Database Queries (Warning)
| |
Contact Groups and Notification Channels#
CloudMonitor routes alert notifications through contact groups. Create a group and add notification channels:
| |
Supported notification channels:
| Channel | Use Case |
|---|---|
| Email | Non-urgent warnings, daily summaries |
| DingTalk webhook | Team-visible alerts, incident coordination |
| SMS | Critical alerts that need immediate attention |
| Phone call | Production-down severity (use sparingly) |
| Webhook (HTTP) | Integration with PagerDuty, Slack, custom systems |

Mute periods: For scheduled maintenance windows, set a mute period on the alert rule to suppress notifications. This is better than disabling the alert entirely because the alert still fires and records the event — you just do not get woken up for something you already know about.
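The distinction between muting and disabling is worth seeing in code: the alert is always recorded, and only the notification is suppressed. The sketch below is a local model of that behavior, not CloudMonitor's API. The function names, the history list, and the notify callable are all hypothetical.

```python
from datetime import datetime, time


def in_mute_window(now, start, end):
    """True if now's time of day falls in [start, end); handles windows crossing midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def handle_alert(alert, now, mute_start, mute_end, history, notify):
    """Always record the alert; notify only outside the mute window."""
    history.append((now, alert))
    if in_mute_window(now, mute_start, mute_end):
        return False
    notify(alert)
    return True


history = []
window = (time(2, 0), time(4, 0))  # planned maintenance, 02:00 to 04:00
handle_alert("cpu_high", datetime(2025, 1, 1, 2, 30), *window, history, print)
handle_alert("cpu_high", datetime(2025, 1, 1, 5, 0), *window, history, print)
print(len(history))  # both alerts are on record; only the second was sent
```

Because the history keeps muted alerts, you can still review what fired during the maintenance window and confirm it matched what you expected.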
ARMS: Application Real-Time Monitoring#
ARMS completes the observability picture by providing the third pillar: traces. While SLS tells you what happened and CloudMonitor tells you the system-level impact, ARMS tells you exactly where in your application the problem occurs.

What ARMS Does#
ARMS is an APM (Application Performance Monitoring) platform that provides:
- Distributed tracing — Follow a request across services, databases, caches, and message queues. See exactly where time is spent.
- Service topology — Auto-discovered map of how your services communicate. See dependencies, call volumes, and error rates at a glance.
- Exception diagnostics — Automatic capture and aggregation of exceptions with stack traces, frequency, and affected users.
- Slow transaction analysis — Drill into specific slow requests to see the full call chain, including database queries and external API calls.
ARMS supports automatic instrumentation for:
| Language | Agent Type | What Gets Instrumented |
|---|---|---|
| Java | ByteBuddy agent | Spring, Dubbo, gRPC, JDBC, Redis, HTTP clients |
| Node.js | npm package | Express, Koa, MySQL, Redis, HTTP, gRPC |
| Python | pip package | Django, Flask, SQLAlchemy, Redis, requests |
| Go | SDK | net/http, gRPC, database/sql, go-redis |
| PHP | Extension | Laravel, ThinkPHP, MySQLi, cURL |
“Automatic instrumentation” means you do not need to modify your application code. The agent intercepts framework-level calls and generates trace spans automatically. You add the agent to your startup command and traces appear.
Installing the ARMS Agent (Node.js)#
For the Node.js application we have been running on our ECS instances:
| |
Add the agent require at the very top of your application entry point, before any other imports:
| |
For Java applications, it is even simpler — just add a JVM flag:
| |
Reading Traces#
Once the agent is running, ARMS starts generating traces for every incoming request. Each trace consists of spans — one span per operation (HTTP call, database query, cache lookup). The spans form a tree that shows the complete request lifecycle.
A typical trace for an API request looks like this:
| |
From this trace you can see that the payment service’s call to Alipay takes 89ms; that is an external dependency you cannot optimize. The database INSERT takes 67ms, which is worth investigating if that number is normally lower. The total of 234ms is acceptable for a checkout flow, but if it were 2340ms, you would know exactly which span to look at.
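Reading a trace by eye works for a dozen spans; beyond that, the question "which span is actually slow" is a small tree computation. The sketch below models a trace as nested spans and finds the one with the largest self time, meaning its duration minus that of its children. The 234ms total, the 89ms Alipay call, and the 67ms INSERT come from the trace above. The other span names and durations are invented to make the tree add up, and the calculation assumes child spans run sequentially.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)

    def self_time(self):
        """Time spent in this span itself, excluding its (sequential) children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_by_self_time(root):
    """Walk the span tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if span.self_time() > best.self_time():
            best = span
        stack.extend(span.children)
    return best


# Hypothetical tree; only the 234, 89, and 67 ms figures come from the article.
trace = Span("POST /api/checkout", 234, [
    Span("auth check", 12),
    Span("payment-service charge", 110, [Span("HTTP alipay", 89)]),
    Span("db INSERT orders", 67),
    Span("cache update", 8),
])

worst = slowest_by_self_time(trace)
print(worst.name, worst.self_time())
```

Self time matters because a parent span is always at least as long as its children. Sorting by raw duration would just point at the root every time.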
Linking Traces to Logs#
The real power comes from linking ARMS traces to SLS log entries. When a trace shows that a specific database query was slow, you want to see the corresponding application log to understand the context — what user triggered it, what parameters were passed, what the query plan was.
Enable trace-log correlation by including the trace ID in your log output:
| |
Now in SLS, you can search for all logs associated with a specific trace:
| |
And in ARMS, each trace span links back to the corresponding SLS log entries. This bidirectional link is what makes debugging production issues fast.
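The mechanics of getting a trace ID into every log line are the same in any language: hold the ID in request-scoped context and have the logger read it. Here is a minimal Python sketch using contextvars and a logging filter. In a real service the ID would come from the tracing agent or an incoming header; the handle_request function and the random fallback ID here are stand-ins for illustration.

```python
import contextvars
import logging
import sys
import uuid

# Request-scoped storage; "-" marks log lines emitted outside any request.
current_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name="app", stream=None):
    """Create a logger whose output includes trace_id on every line."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Bind a trace ID for the duration of one request, then restore the previous value."""
    token = current_trace_id.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("processing order")
    finally:
        current_trace_id.reset(token)


handle_request(build_logger(), "4bf92f3577b34da6a3ce929d0e0e4736")
```

Because the ID lives in a context variable, code deep in the call stack logs it without anyone passing it down as an argument, and it stays correct across async tasks handling different requests.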

Solution: Full-Stack Observability Setup#
Let me bring everything together into a complete setup sequence. This assumes you have ECS instances running behind an SLB load balancer with an RDS database — the architecture from the previous articles in this series.
Step 1: Install Agents on All ECS Instances#
Create a cloud-init script or Ansible playbook that installs both agents on every app server:
| |
Step 2: Configure Log Collection#
Apply Logtail configs for all log sources:
| |
Step 3: Set Up CloudMonitor Alerts#
| |
Step 4: Configure SLS Dashboard and Alerts#
| |
Step 5: Verify Everything Works#
| |
The Complete Architecture#
After completing all five steps, your observability stack looks like this:
| |
Costs#
Observability is not free, and costs can sneak up on you. Here is a realistic cost estimate for a small production setup (2 ECS instances, moderate traffic):

| Component | Free Tier | Typical Monthly Cost |
|---|---|---|
| SLS ingestion | 500 MB/day | 50-200 CNY (depends on log volume) |
| SLS storage | Included in ingestion | Included |
| SLS indexing | Included | Roughly 2x storage cost |
| CloudMonitor | Basic metrics free | 0 for built-in; 10-50 CNY for custom metrics |
| ARMS | 15-day free trial | 100-500 CNY (depends on trace volume) |
Cost optimization tips:
- Set appropriate retention periods. Access logs rarely need more than 30 days, system logs can be kept for 7 days, and slow query logs are worth keeping for 90 days. Reducing retention from 90 to 30 days cuts stored volume, and with it storage cost, by about two-thirds.
- Index only fields you query. Every indexed field doubles storage for that field. If you never query http_user_agent in SQL, do not create a field index for it.
- Use sampling for ARMS. In high-traffic applications, trace 10% of requests instead of 100%. You still catch anomalies, but at 1/10 the cost.
- Aggregate before storing. For metrics you only need at 5-minute granularity, aggregate in your application and push the aggregate rather than pushing per-request data points.
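For the sampling tip, the property that matters is consistency: every service must make the same keep-or-drop decision for a given trace, or you end up with traces missing half their spans. Hashing the trace ID gives you that without coordination. This sketch illustrates the idea only; ARMS has its own sampling settings, and the function name and hashing scheme here are my assumptions.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically keep roughly sample_rate of traces, keyed on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


ids = [f"trace-{i}" for i in range(10000)]
kept = sum(keep_trace(t) for t in ids)
print(f"kept {kept} of {len(ids)} traces")
```

A common refinement is to always keep traces that contain an error, whatever the sampling decision says, since those are the ones you will go looking for.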
Summary#
Set up observability before you deploy your application, not after. The cost of instrumenting retroactively — restructuring logs, adding trace propagation, rebuilding dashboards — is always higher than doing it from the start. Install Logtail, CloudMonitor agent, and ARMS agent as part of your instance provisioning script.
The three pillars are complementary, not redundant. Metrics tell you something is wrong (error rate spike on the dashboard). Logs tell you what is wrong (database timeout in the application log). Traces tell you why it is wrong (one specific query path takes 3 seconds because of a missing index). You need all three to debug production issues efficiently.
SLS is the Swiss Army knife. It handles log collection, search, SQL analytics, dashboards, and alerting in one service. Learn the query syntax: the search | SQL pattern, with full-text search on the left and analytics on the right. The five essential dashboard panels (QPS, error rate, P99 latency, top endpoints, status distribution) cover 80% of incident triage.
Alert on symptoms, not causes. “Error rate > 1% for 5 minutes” is a better alert than “CPU > 80%.” Always require sustained thresholds (3-5 consecutive data points) to avoid alert fatigue from transient spikes. Set up mute periods for planned maintenance.
Start with the minimum viable monitoring stack. Logtail for nginx and application logs, CloudMonitor for ECS/RDS/SLB built-in metrics, four alert rules (error rate, CPU, disk, slow database queries), one ops dashboard. You can add ARMS tracing, custom metrics, and advanced dashboards incrementally as your application grows. Perfect observability on day one is not the goal; having something that pages you when the site is down is.
In the next article, we tackle containers with ACK and SAE — and you will be glad you set up observability first, because debugging a misbehaving Kubernetes cluster without centralized logging is a special kind of pain.