Network Observability & SRE · Lab-first · Mentored · Portfolio-backed

Network Observability & SRE Internship

Turn your OS and DBMS courses into the NOC-to-SRE path — instrument a live network, build the dashboards and alerts that run it, and handle a real incident under pressure.

8 modules · 21 labs · 3 formats · Credit-mappable

Overview

What this internship makes you able to do.

Every network, every datacenter and every AI training cluster in India is watched by somebody — and today most of that watching is a NOC engineer staring at a screen of red and green icons, escalating tickets they don't understand. The industry is replacing that model with observability and Site Reliability Engineering: metrics, logs and traces collected continuously, stored in time-series databases, queried with PromQL, and turned into SLOs, error budgets and automated alerts that page the right engineer with the right context. This internship takes the Operating Systems and DBMS theory you already studied — processes, schedulers, I/O, indexing, query planning — and turns it into that job: you learn why a time-series database is shaped the way it is, why an exporter reads /proc, and why a badly written query can take down the monitoring itself.

You will not watch dashboards someone else built. From Week 1 you run your own observability stack against a live containerlab network: SNMP and streaming telemetry (gNMI over gRPC) feeding Telegraf and Prometheus, metrics landing in InfluxDB, logs flowing through Loki and the ELK stack, everything rendered in Grafana dashboards you design and defend. You write PromQL and Flux queries, build recording and alerting rules in Prometheus and route the resulting alerts through Alertmanager, define SLIs and SLOs with real error-budget math, and then do the part almost no fresher has ever done: take an on-call shift in a graded incident drill. You triage a failure you've never seen, follow and improve a runbook, communicate under pressure, and write a blameless postmortem afterwards. That incident-response muscle, applied to AI-scale infrastructure where a single idle GPU hour is measurable money, is what this program is built to develop.

The internship is built for the Indian academic calendar and the AICTE/NEP internship mandate. Take it as a 4-week winter sprint, an 8-week summer internship, or a 6-month final-semester capstone that maps to your project/internship credits. Every track ends the same way: a graded capstone with a live incident drill, a portfolio of dashboards, alert rules and postmortems a hiring manager can actually open, an RKR completion certificate, and — for the strongest interns — a direct bridge into the RKR Certified DataCenter Engineer (RCDE) ladder and the hiring pipeline behind it.

Built on your syllabus

The courses this internship extends.

You've already studied these. Here's how each one becomes a deployable skill.

Operating Systems (CSE · IT)
AICTE model OS / Anna Univ CS3451 / VTU 21CS44

Processes, scheduling, memory and I/O stop being exam answers: you read them live from /proc via node_exporter, explain a load-average spike from scheduler theory, and diagnose why an exporter is eating CPU.

Database Management Systems (CSE · IT · AI&ML)
AICTE model DBMS / Anna Univ CS3492 / JNTU DBMS

Indexing, write paths and query optimisation become the internals of InfluxDB and Prometheus's TSDB — you learn why time-series engines abandon B-trees for LSM-style storage, and how cardinality kills queries.

Computer Networks (CSE · IT · ECE)
AICTE model CN / Anna Univ CS3591 / VTU 21CS52

SNMP, interface counters and protocol state you memorised become the raw signal: you poll MIBs, subscribe to gNMI streams, and turn OIDs and YANG paths into dashboards that show a network breathing.

Software Engineering (CSE · IT)
AICTE model SE / Anna Univ CS3452

SDLC, reviews and quality metrics become SRE practice — SLIs/SLOs as measurable quality contracts, runbooks as living documentation, and blameless postmortems as the review process for production failures.

Choose your format

Matched to the Indian academic calendar.

Winter Internship
4 weeks
20 hrs / week · Virtual — live evening mentoring + 24×7 cloud lab

Credit: Fits a 2–4 week AICTE winter/vacation internship; certificate + logbook for internal credit

Best for: Pre-final year students wanting a fast, intense first exposure to the monitoring stack

Summer Internship
8 weeks
25 hrs / week · Hybrid — live mentoring, cloud lab, weekly reviews

Credit: Maps to the standard 6–8 week AICTE summer internship required between 3rd and 4th year

Best for: The core track — 3rd-year students building an SRE-grade placement portfolio

Semester Capstone Internship
24 weeks
18 hrs / week · Hybrid — sustained project work with a dedicated mentor

Credit: Maps to the NEP 2020 full-semester / final-year internship-project credits (often 12–20 credits)

Best for: Final-semester students doing internship-in-lieu-of-project

The curriculum

8 modules. 21 labs. Week by week.

This is the full plan for the 8-week track (the winter and semester formats compress or extend the same arc). Every week ends in a deliverable your mentor reviews.

Week 1

The NOC problem & observability fundamentals

Understand why ping-and-pray monitoring fails, then anchor the three pillars — metrics, logs, traces — on a live Linux host and a small containerlab network you will instrument all program long.

You'll do
  • Cloud-lab onboarding; bring up the reference topology (3 routers, 2 hosts) in containerlab and baseline it manually with ping, ss, vmstat and journalctl
  • Map the three pillars to real signals: read /proc and /sys by hand, tail syslog, and trace a request with tcpdump to see where each pillar comes from
  • Run a 'dark NOC' exercise: diagnose an injected link failure using only raw CLI output, and log how long it takes — your before-picture for the whole internship
Deliverable: Observability gap report: what the manual diagnosis missed, how long it took, and which signals a proper stack would have surfaced in seconds
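Reading /proc by hand, as in the second exercise above, can be sketched in a few lines of Python. This is a minimal illustration, not the program's lab code: the function name parse_loadavg and the sample line are assumptions, and only the standard five-field layout of /proc/loadavg on Linux is relied on.

```python
from pathlib import Path


def parse_loadavg(text: str) -> dict:
    """Parse the single line found in /proc/loadavg into named fields.

    Layout: load1 load5 load15 runnable/total last_pid
    """
    load1, load5, load15, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load1": float(load1),
        "load5": float(load5),
        "load15": float(load15),
        "runnable": int(runnable),      # scheduling entities currently runnable
        "total_tasks": int(total),      # scheduling entities that exist
        "last_pid": int(last_pid),      # most recently created PID
    }


def read_loadavg() -> dict:
    """Read the live file on Linux, or fall back to a made-up sample line."""
    path = Path("/proc/loadavg")
    sample = "0.42 0.30 0.25 2/345 6789\n"  # hypothetical values for non-Linux hosts
    return parse_loadavg(path.read_text() if path.exists() else sample)
```

An exporter such as node_exporter does essentially this on every scrape, which is why the same numbers later appear as metrics on a dashboard.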
Week 2

Collection: SNMP & streaming telemetry (gNMI/gRPC)

Get data off the network both the legacy way and the modern way — poll SNMP MIBs, then subscribe to gNMI streams over gRPC and compare what each can and cannot see.

You'll do
  • Poll interface counters and system MIBs with snmpwalk/snmpget; decode OIDs against the MIB tree and script a poller in Python
  • Subscribe to interface and CPU telemetry with gNMIc (sample and on-change modes) against the lab routers; inspect the YANG paths behind each subscription
  • Run a side-by-side comparison: 30s SNMP polling vs 1s gNMI streaming during a traffic burst — measure what the poller missed
Deliverable: Telemetry comparison notebook: SNMP vs gNMI captures of the same event, with a one-page recommendation of when to use each
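The side-by-side comparison above comes down to simple counter arithmetic, sketched here under stated assumptions: the traffic figures (a 1 Mbit/s baseline with a 3-second burst at 100 Mbit/s) are invented for illustration, and the wrap handling assumes at most one counter wrap and no device reboot between two samples.

```python
COUNTER32_MODULUS = 2**32  # e.g. ifInOctets
COUNTER64_MODULUS = 2**64  # e.g. ifHCInOctets


def counter_delta(prev: int, curr: int, modulus: int = COUNTER64_MODULUS) -> int:
    """Difference between two counter samples, tolerating a single wrap."""
    return (curr - prev) % modulus


def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             modulus: int = COUNTER64_MODULUS) -> float:
    """Average bits per second between two octet-counter samples."""
    return counter_delta(prev_octets, curr_octets, modulus) * 8 / interval_s


def simulate_burst() -> tuple:
    """Return (peak rate seen at 1 s sampling, rate seen by one 30 s poll)."""
    per_second_bytes = [125_000] * 30          # 1 Mbit/s baseline
    for second in (10, 11, 12):                # 3-second burst at 100 Mbit/s
        per_second_bytes[second] = 12_500_000

    counter = [0]                              # cumulative octet counter
    for sent in per_second_bytes:
        counter.append(counter[-1] + sent)

    streamed = [rate_bps(counter[i], counter[i + 1], 1) for i in range(30)]
    polled = rate_bps(counter[0], counter[30], 30)
    return max(streamed), polled
```

With these invented numbers the 1-second samples show a 100 Mbit/s peak, while the single 30-second poll reports an average of 10.9 Mbit/s: the burst is still in the counter, but its shape is gone.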
Week 3

Time-series storage & the TIG stack

Stand up Telegraf → InfluxDB → Grafana end to end, and connect your DBMS theory to why time-series databases are built the way they are.

You'll do
  • Deploy the TIG stack with Docker Compose; configure Telegraf inputs for SNMP, gNMI and host metrics, and verify points landing in InfluxDB
  • Study the write path: measurements, tags vs fields, series cardinality and retention policies — then deliberately create a cardinality explosion and watch memory climb
  • Write Flux queries for rates, percentiles and group-bys; build your first Grafana dashboard (interface utilisation, errors, host health) from them
Deliverable: Running TIG pipeline + a schema-design note explaining your tag/field choices and retention policy in DBMS terms
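The cardinality explosion in the second exercise can be estimated before it happens. The sketch below is an upper bound only, assuming every combination of tag values actually occurs; the tag names and counts are hypothetical, and the function name series_cardinality is not part of any InfluxDB API.

```python
from math import prod


def series_cardinality(distinct_values_per_tag: dict) -> int:
    """Upper bound on the series count for one measurement.

    Each unique combination of tag values is a separate series, so the
    bound is the product of the number of distinct values of each tag.
    """
    return prod(distinct_values_per_tag.values())


# Hypothetical schema for the 3-router lab: a sensible tag set...
sane = {"device": 3, "interface": 8}

# ...and the same schema after an unbounded identifier is stored as a tag.
exploded = {"device": 3, "interface": 8, "flow_id": 50_000}
```

Here series_cardinality(sane) is 24 and series_cardinality(exploded) is 1,200,000, which is the usual argument for keeping unbounded identifiers in fields rather than tags.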
Week 4

Prometheus, exporters, PromQL & alerting

The industry-standard pull model: exporters, scrape configs, PromQL, recording rules and your first real alerts routed through Alertmanager.

You'll do
  • Deploy Prometheus with node_exporter, snmp_exporter and blackbox_exporter; write scrape configs and relabelling rules; then write a small custom exporter in Python for a lab-specific metric
  • Master core PromQL: rate() vs irate(), histogram_quantile() for latency, aggregation by label; convert your Week-3 dashboard panels to PromQL and compare
  • Write alerting rules with for-durations and severity labels; configure Alertmanager grouping, inhibition and a webhook receiver; fire a real alert by saturating a lab link
Deliverable: Prometheus config repo (scrape configs + recording rules + 6 alerting rules) with an alert-routing diagram and proof of a delivered page
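A small custom exporter like the one in the first exercise could look roughly like the sketch below. It uses only the Python standard library and hand-renders the Prometheus text exposition format; in practice the prometheus_client library is the more common choice. The metric name lab_link_up, the link names, the port and the hardcoded read_link_state() are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_link_state() -> dict:
    """Hypothetical data source; a real exporter would query the lab here."""
    return {"r1-r2": True, "r2-r3": False}


def render_metrics(link_up: dict) -> str:
    """Render one gauge per link in the Prometheus text exposition format."""
    lines = [
        "# HELP lab_link_up Whether a lab link is up (1) or down (0).",
        "# TYPE lab_link_up gauge",
    ]
    for link, up in sorted(link_up.items()):
        lines.append(f'lab_link_up{{link="{link}"}} {1 if up else 0}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_link_state()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9877) -> None:
    """Blocking call: serve /metrics until interrupted (port is arbitrary)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a scrape target is enough for Prometheus to collect the gauge; an alerting rule on lab_link_up == 0 then follows the same pattern as any other rule.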
Week 5

Log pipelines: ELK, Loki & correlation

Metrics tell you something is wrong; logs tell you why. Build both major log pipelines, parse real network syslog, and correlate a metric spike to its log evidence.

You'll do
  • Ship router and host syslog into Elasticsearch via Logstash; write grok patterns to parse BGP/OSPF adjacency messages into structured fields and explore them in Kibana
  • Deploy Loki + Promtail as the lightweight alternative; write LogQL queries and label strategies, and compare index-everything (ELK) vs index-labels-only (Loki) costs
  • Correlation drill: given a latency spike on a dashboard, find the causal log line within 5 minutes using linked Grafana panels (metrics → logs in one click)
Deliverable: Dual log pipeline (ELK + Loki) with parsed network logs and a documented correlation walkthrough from symptom to root-cause line
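What a grok pattern does in the first exercise can be illustrated with a plain regular expression with named groups. This is a sketch, not the Logstash configuration used in the lab: the sample line follows the Cisco-style %OSPF-5-ADJCHG format, which is an assumption, since the lab routers may log adjacency changes differently.

```python
import re
from typing import Optional

# Named groups play the role of grok field names.
ADJCHG = re.compile(
    r"%OSPF-(?P<severity>\d)-ADJCHG: Process (?P<process>\d+), "
    r"Nbr (?P<neighbor>\d+\.\d+\.\d+\.\d+) on (?P<interface>\S+) "
    r"from (?P<from_state>\w+) to (?P<to_state>\w+)"
)


def parse_adjacency(line: str) -> Optional[dict]:
    """Return structured fields for an OSPF adjacency change, else None."""
    match = ADJCHG.search(line)
    return match.groupdict() if match else None


# Hypothetical syslog line used only to exercise the pattern.
SAMPLE = (
    "Jan 15 14:03:27 r2 %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on "
    "GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: Dead timer expired"
)
```

Once the neighbor, interface and state transition are separate fields, the correlation drill becomes a filter on to_state and interface within the time range of the latency spike instead of a free-text search.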
Week 6

Dashboards that decide, SLIs/SLOs & error budgets

Move from 'we have graphs' to 'we have a service-level contract': design dashboards for decisions, define SLIs, set SLOs, and compute error budgets with real math.

You'll do
  • Redesign your dashboards around the RED/USE methods; build a NOC overview, a per-device drill-down, and an on-call triage view with meaningful thresholds and drill-down links
  • Define SLIs for the lab network (packet loss, p99 latency, interface availability); set SLOs and implement multi-window multi-burn-rate alerts in PromQL
  • Run a month-simulation: replay recorded outages against your SLOs, compute error-budget burn, and decide — with evidence — whether the simulated team may ship changes or must freeze
Deliverable: SLO specification document (SLIs, targets, burn-rate alert rules) plus the three-tier Grafana dashboard suite implementing it
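The error-budget math behind this week can be sketched as follows. The 30-day window, the 99.9% target and the 14.4 burn-rate threshold are illustrative assumptions (14.4 is a commonly cited value because a burn rate of 14.4 sustained for 1 hour consumes 14.4 × 1 / 720 = 2% of a 30-day budget); the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the whole budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (window_days * 24)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes. An error ratio of 2% in both windows gives a burn rate of about 20 and pages; the same 2% in the long window with a recovered short window does not.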
Week 7

Incident response, on-call, runbooks & reliability at AI scale

The human side of SRE: structured incident roles, runbooks that work at 3 a.m., blameless postmortems — and why all of this is amplified when the infrastructure is a GPU cluster where idle time is money.

You'll do
  • Write runbooks for your top 5 alerts (trigger, triage steps, escalation, rollback); peer-test them — another intern must resolve your alert using only your runbook
  • Take a graded 2-hour on-call shift: mentors inject faults (link flap, exporter crash, disk-full on InfluxDB, silent gNMI session drop) via chaos scripts; you triage, communicate on a timeline, and resolve
  • Write a blameless postmortem for your worst incident; then study an AI-fabric case: how observability of RDMA/GPU-network health maps onto the same stack, and what an SLO means when a training job's idle hour is quantifiable cost
Deliverable: On-call packet: 5 tested runbooks, your incident timeline, and a blameless postmortem reviewed against the RKR postmortem rubric
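The incident timeline in the on-call packet can be reduced to the two numbers a drill is usually judged on. The sketch below assumes a timeline with three ISO 8601 timestamps under hypothetical keys, and measures time-to-resolve from fault injection; some teams measure it from detection instead.

```python
from datetime import datetime


def incident_metrics(timeline: dict) -> dict:
    """Compute time-to-detect and time-to-resolve in minutes.

    Expects ISO 8601 strings under the keys 'fault_injected',
    'detected' and 'resolved' (key names are an assumption).
    """
    t = {key: datetime.fromisoformat(value) for key, value in timeline.items()}
    return {
        "time_to_detect_min":
            (t["detected"] - t["fault_injected"]).total_seconds() / 60,
        "time_to_resolve_min":
            (t["resolved"] - t["fault_injected"]).total_seconds() / 60,
    }


# Hypothetical drill timeline.
DRILL = {
    "fault_injected": "2025-01-15T14:00:00",
    "detected": "2025-01-15T14:03:30",
    "resolved": "2025-01-15T14:27:00",
}
```

For this invented timeline the result is 3.5 minutes to detect and 27 minutes to resolve, figures that belong at the top of the postmortem next to the narrative.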
Week 8

Capstone: instrument, operate, survive, defend

A fresh, unseen network brief: build the full observability stack for it, define its SLOs, then survive a live graded incident drill and defend every design decision to a mentor panel.

You'll do
  • Instrument the capstone topology end to end: SNMP + gNMI collection, Prometheus + Loki storage, dashboards, SLO-based alerting and runbooks — from a written reliability brief
  • Sit the graded incident drill: two compound faults injected without warning during your assigned on-call window; triage, page, resolve and document in real time
  • Present and defend the stack, the SLO choices and the postmortem to a mentor panel — including one 'why' question per architectural decision
Deliverable: Capstone observability stack + SLO spec + incident-drill postmortem + recorded defence
Tools & tech you'll use
  • Prometheus · Alertmanager · PromQL
  • Grafana (dashboards, Loki, alerting)
  • InfluxDB · Telegraf (TIG stack)
  • Elasticsearch · Logstash · Kibana
  • gNMI / gNMIc · gRPC streaming telemetry
  • SNMP (net-snmp, snmp_exporter)
  • Python 3 (exporters, webhooks, chaos scripts)
  • containerlab network pods · Docker Compose

The capstone

Full-Stack Observability for an Unseen Network

You receive a written reliability brief for a network you have never touched: topology, traffic profile, availability targets and an on-call requirement. You must instrument it end to end, define and implement its SLOs, and write its runbooks. Then you keep it alive through a live incident drill in which mentors inject compound faults during your assigned on-call window.

Dual collection layer: SNMP polling plus gNMI streaming telemetry, feeding Prometheus and InfluxDB
Log pipeline (Loki or ELK) with parsed, structured network logs correlated to metrics in Grafana
SLI/SLO specification with multi-window burn-rate alerting through Alertmanager, justified in the design doc
Three-tier dashboard suite (overview, drill-down, on-call triage) built on RED/USE principles
Runbooks for every paging alert, peer-tested before the drill
A survived incident drill: real-time triage timeline, resolution evidence and a blameless postmortem
How it's graded: Graded against a published rubric on stack correctness, SLO design quality, drill performance (time-to-detect, time-to-resolve, communication) and the live defence. A pass earns the RKR Network Observability & SRE certificate; a distinction earns a fast-track referral into the RCDE certification track and the RKR hiring pipeline.

Measurable outcomes

Walk out able to do this — on record.

Deploy and operate a complete observability stack — Telegraf, InfluxDB, Prometheus, Alertmanager, Grafana, Loki/ELK — from configuration files you wrote yourself

Collect network state via both SNMP and gNMI/gRPC streaming telemetry, and justify when each is the right tool

Write production-grade PromQL and LogQL: rates, percentiles, recording rules and multi-window burn-rate alerts

Define SLIs and SLOs for a network service and compute and act on error-budget burn

Correlate a metric anomaly to its causal log evidence within minutes using linked dashboards

Run a structured on-call shift: triage from a runbook, communicate on an incident timeline, and write a blameless postmortem

What you keep

Your portfolio artifacts.

Observability-stack repo (GitHub)

Every config you wrote — Telegraf, Prometheus, Alertmanager, Logstash, Promtail, Docker Compose — version-controlled and reproducible with one command.

Grafana dashboard suite + SLO specification

Exported NOC, drill-down and on-call dashboards with the SLI/SLO document and burn-rate alert rules behind them — the artifact that shows you think in reliability, not graphs.

Runbook & postmortem portfolio

Five peer-tested runbooks and blameless postmortems from your graded incidents — proof you can operate, not just build.

Custom Python exporter & chaos toolkit

A working Prometheus exporter and the fault-injection scripts you built, showing you can code against the monitoring stack itself.

RKR completion certificate

Verifiable certificate stating the graded outcome and hours — mappable to your AICTE/NEP internship credit.

Mentorship
  • Assigned mentor who has carried a production pager, not a content narrator
  • Weekly live review of your dashboards, alert rules and runbooks
  • Async help channel with 1-business-day response on blockers
  • Mock on-call debrief and interview-prep session: how to walk a panel through an incident like an SRE
Evaluation & certificate

Continuous assessment on weekly deliverables (60%) plus a graded capstone with a live incident drill and defence (40%). Every intern receives a verifiable RKR completion certificate with the graded outcome and logged hours, formatted for AICTE/NEP internship-credit submission. Distinction-grade interns receive a letter of recommendation and priority access to the RKR hiring pipeline and the RCDE certification bridge.

Career plan

Where this internship takes you.

This internship is engineered to skip the ticket-routing NOC-L1 detour and land a monitoring/SRE-track role directly. As Indian datacenters and GCCs scale for AI, the scarce skill is not watching dashboards — it is building the observability and reliability practice behind them, and companies consistently report monitoring and incident-response roles among their hardest to fill. Strong interns bridge straight into the RKR Certified DataCenter Engineer (RCDE) track, where the same stack is applied to AI-fabric and GPU-cluster reliability.

Roles unlocked
  • NOC Engineer (observability-track, L1/L2)
  • Monitoring / Tools Engineer
  • Junior Site Reliability Engineer
  • Associate Datacenter Operations Engineer
  • Network Operations Analyst (GCC)
Entry band (post)
Rs 4–8 LPA entry on the monitoring/SRE track, with a credible 8–15 LPA step within 2–3 years as SRE skills compound
Stipend
Merit stipend during the internship for distinction-track interns; performance-based project stipend on the semester capstone

Conversion: Distinction-grade interns are referred into the RKR hiring-partner pipeline — GCCs and datacenter operators building 24×7 reliability teams — and fast-tracked into the RCDE certification that unlocks the AI-infrastructure premium.

Rung 1 · 0–1 yr · NOC / Monitoring Engineer · Rs 4–6.5 LPA
Rung 2 · 1–3 yrs · Observability / Tools Engineer · Rs 7–14 LPA
Rung 3 · 3–6 yrs · Site Reliability Engineer · Rs 14–26 LPA
Rung 4 · 6+ yrs · Senior SRE / Reliability Architect · Rs 24–45 LPA
Demand signal

73% of datacenter monitoring and incident-response roles are reported hard to fill, even as India's DC capacity scales from 1.7 GW toward 5–6.5 GW with ~100,000 datacenter jobs projected by 2030, while entry-level generic IT roles shrank 20–25% due to automation (EY, 2025). The reliability skills survive; the ticket-routing jobs don't.

8 modules. 21 labs. One credit-mappable certificate.

Build it on real gear, defend a capstone, and walk into placements with proof.