AI Datacenter Infrastructure · Flagship track · Lab-first · Mentored · Portfolio-backed

AI Datacenter Infrastructure Internship

Build the spine-leaf fabric India's AI datacenters actually run on — eBGP underlay, EVPN-VXLAN overlay, and lossless RoCEv2 for GPU clusters — on real NX-OS and Junos.

8 modules · 24 labs · 3 formats · Credit-mappable

Overview

What this internship makes you able to do.

India is building AI datacenters faster than it is producing engineers who can wire them. Some $60–70bn of datacenter investment has been announced for the next five years, capacity is scaling from roughly 1.7 GW toward 5–6.5 GW by 2030, and 38,000+ GPUs are being stood up under the IndiaAI Mission alone. Yet the fabric that connects those GPUs is a spine-leaf Clos network that no Indian B.Tech syllabus teaches. Your Computer Networks course stopped at OSPF and a campus LAN. This internship starts where it stopped and takes you all the way to the network design pattern behind every hyperscale and GPU-cloud build: the routed Clos fabric.

The curriculum follows the order a real fabric build follows. You begin with datacenter facilities and topology fundamentals: power, cooling, rack rows, and why the three-tier campus design collapses under east-west AI traffic. Then you build a spine-leaf fabric and do the oversubscription and ECMP math yourself. You bring up an eBGP IP-fabric underlay on the design RFC 7938 documents, with one ASN per leaf, and run it over unnumbered peering on the fabric links. On top of that underlay you deploy the multitenant overlay, VXLAN with a BGP-EVPN control plane, the same way a GPU-cloud provider isolates customers. Then the part almost nobody in India can do: lossless Ethernet for AI workloads. You configure PFC, ECN and DCQCN, run RoCEv2 traffic across the fabric, deliberately induce congestion, and read the pause frames and CNPs in Wireshark. You finish by designing a rail-optimized GPU-pod fabric, understanding how AllReduce collectives stress it, and automating and instrumenting the whole thing with Python and streaming telemetry. Everything is dual-NOS (Cisco NX-OS on Nexus 9000v and Juniper Junos on vJunos) inside containerlab/EVE-NG cloud pods you can open from a hostel laptop.

This is the RKR flagship, built for the Indian academic calendar and the AICTE/NEP internship mandate: a 4-week winter sprint, the core 8-week summer internship, or a full 24-week semester capstone that maps to your final-year project/internship credits. Every track ends with a graded, defended capstone — a working GPU-pod fabric you built, broke and repaired on camera — plus a version-controlled configuration portfolio, an RKR completion certificate formatted for credit submission, and for distinction-grade interns a direct bridge into the RKR Certified DataCenter Professional (RCDP) track and the hiring pipeline behind India's datacenter build-out.

Built on your syllabus

The courses this internship extends.

You've already studied these. Here's how each one becomes a deployable skill.

Computer Networks (CSE · IT · ECE · AI&ML)
AICTE model CN / Anna Univ CS3591 / VTU 21CS52 / JNTU CN

You learned routing tables, BGP path selection and Ethernet as exam answers. Here BGP becomes the underlay of a 20-switch Clos fabric you bring up yourself, and Ethernet becomes a lossless transport you tune for GPU traffic.

Cloud Computing (CSE · IT)
Anna Univ CCS335 / VTU 21CS72 / JNTU cloud elective

Virtualisation and multitenancy stop being diagrams — you build the actual EVPN-VXLAN overlay that lets one physical fabric carry isolated tenants, the mechanism under every cloud you studied.

Computer Organization & Architecture (CSE · IT · ECE)
AICTE model COA / Anna Univ CS3351

Buses, memory hierarchy and DMA map directly onto RDMA — you see why GPU-to-GPU transfers bypass the CPU, and why the network must be lossless for RoCEv2 to work at all.

High Performance Computing / Parallel Computing (CSE · AI&ML)
VTU 21CS734 / Anna Univ elective / JNTU HPC

Collective operations you studied as MPI theory — AllReduce, AllGather — become the traffic pattern you engineer the fabric for, including rail-optimized designs that keep a training job's synchronisation off congested links.

Choose your format

Matched to the Indian academic calendar.

Winter Internship
4 weeks
20 hrs / week · Virtual — live evening mentoring + 24×7 cloud lab

Credit: Fits a 2–4 week AICTE winter/vacation internship; certificate + logbook for internal credit

Best for: Students with routing basics wanting an intense first pass over Clos + eBGP underlay

Summer Internship
8 weeks
25 hrs / week · Hybrid — live mentoring, cloud lab, weekly design reviews

Credit: Maps to the standard 6–8 week AICTE summer internship required between 3rd and 4th year

Best for: The core flagship track — the full underlay-to-GPU-pod arc with the defended capstone

Semester Capstone Internship
24 weeks
18 hrs / week · Hybrid — sustained project work with a dedicated mentor

Credit: Maps to the NEP 2020 full-semester / final-year internship-project credits (often 12–20 credits)

Best for: Final-semester students building a datacenter-fabric capstone in lieu of a final-year project

The curriculum

8 modules. 24 labs. Week by week.

This is the full plan for the 8-week track (the winter and semester formats compress or extend the same arc). Every week ends in a deliverable your mentor reviews.

Week 1

Inside the AI datacenter: facilities, racks & why Clos won

Ground the physical reality — power, cooling, rack elevations, cabling plant — then trace why east-west AI traffic killed the three-tier design and made the folded-Clos spine-leaf the only answer.

You'll do
  • Map a reference AI-pod rack elevation: leaf placement (ToR vs EoR), power/cooling budget per rack, and the 400G/800G cabling plan between leaf and spine
  • Cloud-lab onboarding: bring up a 2-spine/4-leaf topology in containerlab with Nexus 9000v and vJunos nodes; navigate both CLIs and save/verify config
  • Compare traffic patterns in Wireshark: a classic north-south client-server flow vs an east-west storage replication flow across the fabric
Deliverable: A one-page fabric siting document: rack elevation, port/cabling matrix, and a written argument for spine-leaf over three-tier for the given AI workload
Week 2

Clos fabric design: oversubscription math & ECMP

Do the arithmetic a fabric architect does daily — bisection bandwidth, oversubscription ratios, spine/leaf port budgets — then prove ECMP actually spreads flows with your own hash-polarisation experiment.

You'll do
  • Calculate leaf uplink requirements for 1:1, 2:1 and 3:1 oversubscription against a 32×400G leaf; size the spine count and validate against a written GPU-pod brief
  • Configure static ECMP across 4 equal-cost paths; generate parallel iPerf3 flows and verify per-flow distribution with interface counters on both NX-OS and Junos
  • Break ECMP on purpose: create a flow-hash polarisation scenario, observe the hot link, and fix it by changing hash inputs (L4 ports / symmetric hashing)
Deliverable: A fabric sizing worksheet (bisection bandwidth + oversubscription calc) plus a lab report proving balanced vs polarised ECMP with counter evidence
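The sizing arithmetic from this week can be sketched in a few lines. This is a minimal illustration, not the course worksheet: it assumes every port on the 32×400G leaf runs at the same speed, and the function name `size_leaf` and its return keys are hypothetical.

```python
import math

PORT_SPEED_GBPS = 400  # from the 32x400G leaf in the Week 2 brief


def size_leaf(total_ports: int, target_ratio: float) -> dict:
    """Split equal-speed leaf ports into downlinks and uplinks.

    target_ratio is downlink bandwidth divided by uplink bandwidth,
    so 3.0 means 3:1 oversubscription.
    """
    if total_ports < 2 or target_ratio <= 0:
        raise ValueError("need at least 2 ports and a positive ratio")
    # downlinks / uplinks <= ratio  implies  uplinks >= total / (ratio + 1).
    # Round up so the achieved ratio is never worse than the target.
    uplinks = math.ceil(total_ports / (target_ratio + 1))
    downlinks = total_ports - uplinks
    return {
        "downlinks": downlinks,
        "uplinks": uplinks,
        "achieved_ratio": downlinks / uplinks,
        "uplink_gbps": uplinks * PORT_SPEED_GBPS,
    }


if __name__ == "__main__":
    for ratio in (1, 2, 3):
        print(f"{ratio}:1 ->", size_leaf(32, ratio))
```

With 32 ports, 1:1 and 3:1 divide evenly (16/16 and 24/8), while 2:1 does not: the sketch rounds up to 11 uplinks and 21 downlinks, roughly 1.91:1. If each uplink goes to a different spine, the uplink count is also the spine count, which is one way to start the spine sizing.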
Week 3

The eBGP IP-fabric underlay (RFC 7938)

Build the routed underlay the way hyperscalers documented it: eBGP as the only routing protocol, one private ASN per leaf, fast fallover, and a loopback reachability model that the overlay will ride on.

You'll do
  • Assign the ASN plan (private 32-bit ASNs, one per leaf, shared spine ASN) and bring up eBGP on all fabric links using interface/unnumbered peering on NX-OS and Junos
  • Advertise loopbacks only; verify ECMP paths installed via BGP multipath (maximum-paths / multipath multiple-as) and confirm any-leaf-to-any-leaf loopback reachability
  • Fail a spine and a leaf uplink under a running traffic stream; measure reconvergence with timestamped probes and tune BFD + fast-external-fallover to cut it down
Deliverable: A working 2-spine/4-leaf eBGP underlay with an ASN/addressing design doc and a measured before/after reconvergence report
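As a starting point for the ASN design doc, the plan above (shared spine ASN, one private 32-bit ASN per leaf) might be generated like this. It is a sketch under assumptions: the node names (`spine1`, `leaf1`) and the choice to start at the bottom of the private range are illustrative, while the range itself is the 32-bit private-use block reserved by RFC 6996.

```python
# 32-bit private-use ASN range (RFC 6996).
PRIVATE_ASN_32_FIRST = 4_200_000_000
PRIVATE_ASN_32_LAST = 4_294_967_294


def build_asn_plan(spines: int, leaves: int,
                   base: int = PRIVATE_ASN_32_FIRST) -> dict[str, int]:
    """Return {node_name: asn}: all spines share `base`, leaf N gets base + N."""
    if spines < 1 or leaves < 1:
        raise ValueError("need at least one spine and one leaf")
    if base < PRIVATE_ASN_32_FIRST or base + leaves > PRIVATE_ASN_32_LAST:
        raise ValueError("plan does not fit in the private 32-bit ASN range")
    plan = {f"spine{i}": base for i in range(1, spines + 1)}
    plan.update({f"leaf{i}": base + i for i in range(1, leaves + 1)})
    return plan


def ebgp_session_count(spines: int, leaves: int) -> int:
    """Every leaf peers with every spine, once per link (one link per pair assumed)."""
    return spines * leaves


if __name__ == "__main__":
    for node, asn in build_asn_plan(spines=2, leaves=4).items():
        print(f"{node}: AS{asn}")
    print("expected eBGP sessions:", ebgp_session_count(2, 4))
```

For the 2-spine/4-leaf lab this gives eight expected eBGP sessions, a number worth keeping for the validation checks later in the track.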
Week 4

Overlay & multitenancy: VXLAN with BGP-EVPN

Put tenants on the fabric. VXLAN gives the encapsulation, BGP-EVPN gives the control plane — you configure both, then read the route types on the wire until they stop being magic.

You'll do
  • Configure VTEPs, VNIs and an EVPN address family over the underlay; build L2VNIs for two isolated tenants and an L3VNI with anycast distributed gateways on every leaf
  • Capture and dissect EVPN Route Type 2 (MAC/IP) and Type 5 (IP prefix) advertisements; correlate them with the VXLAN-encapsulated frames in Wireshark
  • Prove tenant isolation: demonstrate that tenant A cannot reach tenant B, then deliberately misconfigure a VNI mapping, diagnose the symptom from the EVPN table, and repair it
Deliverable: A multitenant EVPN-VXLAN overlay (2 tenants, anycast gateway) with an annotated capture set mapping every EVPN route type to observed forwarding behaviour
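When you correlate EVPN routes with captured frames, it helps to know exactly where the VNI sits. The sketch below packs and parses the 8-byte VXLAN header as laid out in RFC 7348; the helper names and the example VNI 10100 are hypothetical, chosen only for illustration.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned VXLAN destination port
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    if len(header) < 8:
        raise ValueError("VXLAN header is 8 bytes")
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return vni_word >> 8


if __name__ == "__main__":
    hdr = pack_vxlan_header(10100)  # hypothetical tenant-A L2VNI
    print(hdr.hex(), "->", unpack_vni(hdr))
```

The same byte offsets are what Wireshark decodes for you; reading them by hand once makes a VNI mismatch much easier to spot in a capture.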
Week 5

Lossless Ethernet for AI: RoCEv2, PFC, ECN & DCQCN

The week that separates a datacenter engineer from a campus engineer. RDMA over Converged Ethernet cannot tolerate drops — so you build the lossless machinery (PFC per priority, ECN marking, DCQCN reaction) and then attack it.

You'll do
  • Classify RoCEv2 traffic (UDP/4791) into a dedicated priority; configure PFC on that priority end-to-end and WRED/ECN thresholds on fabric queues on both NX-OS and Junos
  • Drive many-to-one incast congestion at a leaf port with parallel senders; capture PFC pause frames and ECN-marked packets / CNPs in Wireshark and explain the DCQCN feedback loop from the trace
  • Induce and then mitigate the pathologies: demonstrate PFC head-of-line blocking and a pause-storm scenario, then apply a PFC watchdog and tuned ECN thresholds to contain it
Deliverable: A lossless-fabric configuration set plus a congestion lab report: annotated captures of pause frames, ECN marks and CNPs, with measured throughput before and after DCQCN tuning
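The feedback loop you trace in the captures can be approximated in code. This is a simplified sketch, assuming the commonly published DCQCN description: the switch marks ECN with a probability that ramps linearly between two queue thresholds, and the sender cuts its rate by alpha/2 on each CNP. Rate recovery, timers and byte counters are deliberately left out, and all threshold values and function names are hypothetical.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """WRED-style ECN marking: 0 below kmin, linear ramp to pmax, 1 at or above kmax."""
    if not (0 <= kmin_kb < kmax_kb and 0 <= pmax <= 1):
        raise ValueError("need 0 <= kmin < kmax and 0 <= pmax <= 1")
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def on_cnp(rate_gbps: float, alpha: float, g: float = 1 / 256) -> tuple[float, float]:
    """Sender reaction to one CNP: cut the rate by alpha/2, then raise alpha."""
    new_rate = rate_gbps * (1 - alpha / 2)
    new_alpha = (1 - g) * alpha + g
    return new_rate, new_alpha


def on_quiet_period(alpha: float, g: float = 1 / 256) -> float:
    """No CNP seen for one alpha-update interval: decay alpha."""
    return (1 - g) * alpha


if __name__ == "__main__":
    # Hypothetical thresholds: kmin 150 KB, kmax 3000 KB, pmax 10%.
    for depth in (100, 1575, 3000):
        print(depth, "KB ->", ecn_mark_probability(depth, 150, 3000, 0.10))
    rate, alpha = 100.0, 1.0
    for _ in range(3):
        rate, alpha = on_cnp(rate, alpha)
        print(f"rate={rate:.2f} Gbps alpha={alpha:.4f}")
```

Lowering the minimum threshold makes marking start earlier, which is the lever you tune in the lab to keep PFC as a last resort rather than the first response.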
Week 6

GPU-cluster fabrics: rail-optimized design & collective traffic

Design for the actual customer of the fabric — a distributed training job. Understand how AllReduce collectives load the network, why GPU NICs get their own rails, and how a rail-optimized topology differs from a generic Clos.

You'll do
  • Model a ring and a tree AllReduce across an 8-GPU-per-node pod: compute the data volume per step for a given gradient size and identify which fabric links each phase stresses
  • Design a rail-optimized fabric for a 4-node GPU pod (8 NICs per node, one rail per GPU index) and build a scaled-down version in the lab; contrast its path diversity against the plain Clos from Week 3
  • Emulate collective traffic with synchronized many-to-many iPerf3 meshes across the rails; verify the lossless config from Week 5 holds, and document tail-latency impact when one rail degrades
Deliverable: A GPU-pod fabric design document — rail map, NIC-to-leaf port plan, oversubscription and failure-domain analysis — with lab evidence of collective-pattern traffic surviving a rail failure
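The ring AllReduce volume calculation from the first lab can be sketched as follows. It assumes the standard ring algorithm (a reduce-scatter phase followed by an all-gather phase, each of N-1 steps, each step moving one 1/N chunk to the next rank); the function name and return keys are hypothetical.

```python
def ring_allreduce_traffic(ranks: int, gradient_bytes: int) -> dict:
    """Per-rank traffic for a ring AllReduce over `ranks` GPUs."""
    if ranks < 2:
        raise ValueError("a ring needs at least 2 ranks")
    chunk = gradient_bytes / ranks   # each step moves one 1/N chunk
    steps = 2 * (ranks - 1)          # (N-1) reduce-scatter + (N-1) all-gather
    return {
        "steps": steps,
        "bytes_per_step": chunk,
        "bytes_sent_per_rank": steps * chunk,
    }


if __name__ == "__main__":
    one_gb = 1_000_000_000
    print("8 ranks :", ring_allreduce_traffic(8, one_gb))
    print("32 ranks:", ring_allreduce_traffic(32, one_gb))
```

For a 1 GB gradient, 8 ranks send 1.75 GB each over 14 steps and 32 ranks (a 4-node pod with 8 GPUs per node) send about 1.94 GB each over 62 steps. The per-rank volume approaches twice the gradient size as the ring grows, so the step count and the slowest link in the ring, not total bytes, are what a degraded rail hurts.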
Week 7

Fabric automation & streaming telemetry

Nobody configures a 64-leaf fabric by hand. Template the entire underlay/overlay as code, deploy it idempotently, and instrument the fabric with streaming telemetry that catches congestion before a training job stalls.

You'll do
  • Build Jinja2 templates for the eBGP underlay and EVPN overlay driven by a single YAML fabric-definition file; deploy to all nodes with Nornir/Netmiko and re-run to prove idempotence
  • Subscribe to gNMI/OpenConfig paths for interface counters, queue depths and PFC pause counters; stream them into a simple Python collector and plot a replayed congestion event from Week 5
  • Write a pre/post-change validation suite: BGP session count, EVPN route count, and ECMP path checks that gate every automated deployment, all version-controlled in Git
Deliverable: A fabric-as-code repo on GitHub: YAML intent + Jinja2 templates + deployment and validation pipeline + a telemetry collector with a captured congestion trace
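The pre/post-change gate can be prototyped without touching a device. The sketch below assumes you have already collected the three counts named above into a snapshot; the `FabricSnapshot` fields, the `gate_change` function and the sample prefixes are all hypothetical, and how the snapshot is gathered (Nornir, Netmiko, gNMI) is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FabricSnapshot:
    bgp_established: int                 # sessions in Established state
    evpn_routes: int                     # total EVPN routes in the table
    ecmp_paths: dict[str, int] = field(default_factory=dict)  # prefix -> next-hop count


def gate_change(pre: FabricSnapshot, post: FabricSnapshot,
                expected_bgp_sessions: int,
                max_evpn_route_drop: int = 0) -> list[str]:
    """Return a list of failures; an empty list means the change may proceed."""
    failures = []
    if post.bgp_established != expected_bgp_sessions:
        failures.append(
            f"BGP: {post.bgp_established} established, expected {expected_bgp_sessions}")
    dropped = pre.evpn_routes - post.evpn_routes
    if dropped > max_evpn_route_drop:
        failures.append(f"EVPN: lost {dropped} routes")
    for prefix, before in pre.ecmp_paths.items():
        after = post.ecmp_paths.get(prefix, 0)
        if after < before:
            failures.append(f"ECMP: {prefix} went from {before} to {after} paths")
    return failures


if __name__ == "__main__":
    pre = FabricSnapshot(8, 40, {"10.0.0.1/32": 2, "10.0.0.2/32": 2})
    post = FabricSnapshot(8, 40, {"10.0.0.1/32": 2, "10.0.0.2/32": 1})
    for problem in gate_change(pre, post, expected_bgp_sessions=8):
        print("FAIL:", problem)
```

Returning a list of failures rather than raising on the first one keeps the report complete, which is useful when the gate output goes into the change log in Git.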
Week 8

Capstone: build, break & defend a GPU-pod fabric

A fresh written brief, a clean lab, and a mentor panel. Design and deploy a complete GPU-pod fabric — underlay, overlay, lossless transport, automation — then survive a live fault-injection drill and defend every decision.

You'll do
  • Design and deploy the full fabric from the brief using your Week-7 pipeline: eBGP underlay, EVPN-VXLAN tenants, RoCEv2 lossless class, rail-aware port plan
  • Structured break/fix drill: mentors inject faults (BGP misconfiguration, VNI mismatch, disabled PFC on one hop, polarised ECMP) under running traffic; diagnose from telemetry and captures against the clock
  • Present the design document and verification evidence to a mentor panel and defend the oversubscription, ASN, ECN-threshold and rail-design decisions live
Deliverable: The graded capstone: a working GPU-pod fabric + engineer-grade design document + fault-drill logbook + recorded defence
Tools & tech you'll use
  • Cisco NX-OS (Nexus 9000v)
  • Juniper Junos (vJunos-switch)
  • FRRouting (host-side BGP / route servers)
  • containerlab / EVE-NG cloud pods
  • Wireshark (VXLAN, RoCEv2, PFC/CNP dissection)
  • Python 3 · Netmiko · Nornir · Jinja2
  • gNMI / OpenConfig streaming telemetry
  • iPerf3 / traffic generators for congestion drills
  • Git & GitHub (fabric-as-code)

The capstone

GPU-Pod Fabric: Build, Break & Defend

You are handed the brief a GPU-cloud provider would give a fabric engineer: a 4-node GPU pod, two paying tenants, a RoCEv2 training workload that cannot tolerate loss, and an uptime requirement. Design the fabric, deploy it entirely from code, prove it lossless under incast congestion, survive a live fault-injection drill, and defend every design decision to a panel.

An eBGP IP-fabric underlay per RFC 7938 with a documented ASN/addressing plan and measured sub-second reconvergence (BFD-tuned)
An EVPN-VXLAN overlay with two isolated tenants, anycast distributed gateways, and capture evidence of Route Type 2/5 behaviour
A lossless RoCEv2 traffic class — PFC + ECN/DCQCN — proven under a many-to-one incast drill with annotated pause-frame and CNP captures
A rail-aware port plan and oversubscription analysis defended against the pod's AllReduce traffic pattern
Full deployment from the fabric-as-code pipeline with a passing pre/post-change validation suite and streaming-telemetry evidence
A live break/fix drill: at least three injected faults diagnosed to root cause from telemetry and captures, under time pressure
How it's graded: Against a published rubric covering design correctness, losslessness under attack, automation quality, fault-drill performance and the live defence. A pass earns the RKR AI Datacenter Infrastructure certificate; a distinction earns a fast-track referral into the RCDP certification and the RKR datacenter hiring pipeline.

Measurable outcomes

Walk out able to do this — on record.

Size a spine-leaf Clos fabric from a workload brief — bisection bandwidth, oversubscription ratio, spine/leaf port budget — and defend the numbers

Deploy and troubleshoot an eBGP IP-fabric underlay (RFC 7938 pattern) with per-leaf ASNs, BGP multipath ECMP and BFD-tuned sub-second reconvergence on NX-OS and Junos

Build a multitenant EVPN-VXLAN overlay with anycast gateways and diagnose faults from EVPN Route Type 2/5 state

Configure and validate lossless Ethernet for RoCEv2 — PFC, ECN and DCQCN — and diagnose incast congestion, pause storms and head-of-line blocking from packet captures

Design a rail-optimized GPU-pod fabric and reason about AllReduce collective traffic, rail failures and tail latency

Deploy an entire fabric from code (YAML intent + Jinja2 + Nornir) and monitor it with gNMI streaming telemetry, with validation gates on every change

What you keep

Your portfolio artifacts.

Fabric-as-code portfolio (GitHub)

The complete underlay and overlay as YAML intent + Jinja2 templates + deployment pipeline — the exact artifact GPU-cloud and datacenter teams hire against.

Congestion & lossless-Ethernet lab dossier

Annotated Wireshark captures of PFC pause frames, ECN marks and CNPs with before/after DCQCN tuning results — proof you can engineer RoCEv2, not just spell it.

GPU-pod fabric design document

An engineer-grade design for a rail-optimized GPU cluster fabric: oversubscription math, ASN plan, rail map, failure-domain analysis and telemetry plan.

Break/fix drill logbook

Timestamped diagnoses of injected fabric faults — BGP, EVPN, PFC and ECMP failures — traced from telemetry and captures to root cause and repair.

RKR completion certificate

Verifiable certificate stating the graded outcome and hours — mappable to your AICTE/NEP internship credit.

Mentorship
  • Assigned mentor who has built and operated production datacenter fabrics, not a content narrator
  • Weekly live design review of your fabric configs, captures and telemetry evidence
  • Async help channel with 1-business-day response on blockers
  • Interview-prep session: whiteboarding a Clos design and defending oversubscription and lossless-Ethernet decisions the way a hiring panel will ask
Evaluation & certificate

Continuous assessment on weekly deliverables (60%) plus a graded, defended capstone with a live fault-injection drill (40%). Every intern receives a verifiable RKR completion certificate with the graded outcome and logged hours, formatted for AICTE/NEP internship-credit submission. Distinction-grade interns receive a letter of recommendation, a fast-track into the RKR Certified DataCenter Professional (RCDP) track, and priority access to the RKR hiring pipeline serving India's datacenter build-out.

Career plan

Where this internship takes you.

This is the flagship because it targets the single hottest infrastructure gap in India: engineers who can build and operate the fabrics inside the AI datacenters now being funded at unprecedented scale. Roughly 100,000 datacenter jobs are projected by 2030 as Indian capacity scales from ~1.7 GW toward 5–6.5 GW, and 73% of DC monitoring and incident-response roles are already hard to fill. A graded, defended GPU-pod fabric capstone is direct evidence for exactly those roles — and the bridge into the RCDP certification puts you on the specialist track that commands the AI-infrastructure premium.

Roles unlocked
  • Datacenter Network Engineer (Associate)
  • DC Operations / Fabric NOC Engineer
  • Cloud Infrastructure Engineer (network)
  • GPU-Cloud / AI-Infrastructure Support Engineer
  • Network Automation Engineer (DC)
Entry band (post-internship)
Rs 4.5–8 LPA entry for datacenter-fabric roles, with a credible 10–18 LPA step within 2–3 years on the AI-infrastructure specialist track
Stipend
Merit stipend during the internship for distinction-track interns; performance-based project stipend on the semester capstone

Conversion: Distinction-grade interns are referred into RKR's datacenter hiring-partner pipeline — operators and GCCs staffing the new AI-capacity build-outs — and fast-tracked into the paid RCDP certification that anchors the specialist salary premium.

Rung 1 · 0–1 yr · DC NOC / Fabric Operations Engineer · Rs 4.5–7 LPA
Rung 2 · 1–3 yrs · Datacenter Network Engineer · Rs 8–15 LPA
Rung 3 · 3–6 yrs · AI-Fabric / Network Automation Engineer · Rs 15–30 LPA
Rung 4 · 6+ yrs · Datacenter Fabric Architect · Rs 28–55 LPA
Demand signal

India has announced $60–70bn of datacenter investment over five years and >$250bn of AI-infrastructure commitments (India AI Impact Summit 2026), with ~100,000 datacenter jobs projected by 2030 as capacity scales from ~1.7 GW to 5–6.5 GW. 73% of DC monitoring and incident-response roles are hard to fill today, and niche AI-infrastructure specialists command a 1.7x salary premium — while generic entry-level IT roles have contracted 20–25% under automation (EY, 2025).

8 modules. 24 labs. One credit-mappable certificate.

Build it on real NX-OS and Junos, defend a capstone, and walk into placements with proof.