Ryan
Danny
How we Migrated RCSB.org
at the San Diego Supercomputer Center
to Kubernetes:
lessons learned
A little bit about me
by Igor Khokhriakov
aka Ingvord
for Kubernetes Community Day
SCaLE 20x
March, 9-12'23
Pasadena, CA
Feb, 1st'23
~3.5M requests daily
Well-seasoned CEO-Minded Software Developer/Architect
with 15+ yrs of experience;
Invited to SDSC RCSB PDB group to design and implement K8s platform
and distributed reactive backend for their search API;
Previously worked as Scientific Software Developer/Architect
at DESY, Hamburg, Germany
and ESRF, Grenoble, France;
with a main focus on designing SCADA and DCS systems
as well as high-level metadata acquisition systems;
and even before that was a Full Stack Developer
for a web based trading/analytics system;
Research Collaboratory for Structural Bioinformatics PDB aka rcsb.org
Lesson 1:
careful design requires a lot (like A LOT)
of learning/researching
Motivation:
- Retire legacy in-house solutions;
- Move towards true CI/CD
- Easily setup development environments
- Perform experiments with new features
Lesson 2:
choose your provider carefully
if you can
Lesson 3:
monitoring/controlling k8s resources
Lesson 4:
troubleshooting issues -> learn kubectl
Lesson 5:
choosing the right tools ain't an easy process
but rewarding
Lesson 6:
try with some non-critical/non-continuous tasks
Lesson 7:
self-hosted GitHub/GitLab runners are great
Lesson 8:
be prepared to invest time/resources
into internal learning
Challenge 1:
Multicluster VS Cluster federation VS Single cluster
Challenge 3:
Coastal distributed Docker registry
Challenge 4:
Preserve existing log infrastructure
Challenge 5:
Ingress routing
Many thanks to
Conclusion
Challenge 2:
Storage
like CI/CD
We ended up testing our K8s cluster with our CI/CD
Otherwise what works today may stop working tomorrow
* developers must know the basics
* architects must leverage new design principles
* devops must be pretty advanced
* using k8s implies architectural changes
* using k8s implies understanding of its capabilities
reuse open-source helm charts
allows quick learning in Helm; setting up secrets etc
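A minimal sketch of what reusing a community chart can look like. The helper name `install_chart` and its arguments are hypothetical, and installing Harbor this way is an assumption used for illustration (the slides only list Harbor as one of the tools); the `helm` subcommands and flags are standard Helm 3.

```shell
# Hypothetical helper: install a community chart into its own namespace.
# Usage: install_chart <repo-name> <repo-url> <chart> <release> <namespace>
install_chart() {
  # Register the chart repository and refresh the local index.
  helm repo add "$1" "$2"
  helm repo update

  # "upgrade --install" is idempotent: it installs on the first run and
  # upgrades afterwards. --create-namespace avoids a separate kubectl step.
  helm upgrade --install "$4" "$1/$3" --namespace "$5" --create-namespace
}

# Example with Harbor's public chart repository (assumed target):
#   install_chart harbor https://helm.goharbor.io harbor harbor harbor
```

Reading the chart's `values.yaml` before overriding anything is where most of the quick Helm learning happens.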
unlimited possibilities in configuring/customizing runners
Harbor
Ceph
Skaffold: dev + devops
Helm over Kustomize (IDEA support)
Harbor also tests storage, etc.
IDEA and Octant
Elastic
Main two categories to choose from:
on-premises VS cloud provider
Estimating costs properly ain't an easy process:
salaries + delays VS e.g. AWS
Another important aspect:
k8s differs from anything else we used before
even experienced teams may have issues
In our case we are bound to SDSC,
which basically means an on-premises deployment, with its pros and cons
K8s on bare metal took much longer
Calico setup broke internal networking: the GitHub webhook could not communicate with deployments
kubectl is your friend
know namespaces
get events
get logs
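Those three habits can be sketched as one small helper. The function name `triage_namespace` and its arguments are hypothetical; only the `kubectl` subcommands and flags are standard.

```shell
# Hypothetical helper: a first look at a misbehaving namespace.
# Usage: triage_namespace <namespace> [pod]
triage_namespace() {
  ns="$1"
  pod="${2:-}"

  # Know namespaces: confirm the one you think you are in actually exists.
  kubectl get namespaces

  # Get events: sorted so the most recent ones are printed last.
  kubectl get events --namespace "$ns" --sort-by=.lastTimestamp

  # Get logs: only when a pod name was given.
  if [ -n "$pod" ]; then
    kubectl logs --namespace "$ns" "$pod" --tail=100
  fi
}

# Example (namespace and pod names are placeholders):
#   triage_namespace my-namespace my-pod-0
```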
Multicluster
Ongoing effort
Have dedicated Elastic cluster
sub-domain aka host based routing e.g. arches.k8s.rcsb.org
alas, path-based routing ain't supported by our system, even though it would not require external DNS
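A minimal sketch of host-based routing for one sub-domain. The helper name `host_ingress`, the Service name and the port are assumptions made for illustration; `kubectl create ingress` with `--rule` and a client-side dry run is standard kubectl and only prints the manifest.

```shell
# Hypothetical helper: print an Ingress manifest that routes one sub-domain
# to one Service. Nothing is sent to the cluster (client-side dry run).
# Usage: host_ingress <name> <host> <service> <port>
host_ingress() {
  # "<host>/*" matches every path under that host (prefix match).
  kubectl create ingress "$1" \
    --rule="$2/*=$3:$4" \
    --dry-run=client -o yaml
}

# Example (Service "arches" on port 80 is an assumption):
#   host_ingress arches arches.k8s.rcsb.org arches 80 | kubectl apply -f -
```

Each new sub-domain still needs a DNS record pointing at the ingress controller, which is the external DNS dependency mentioned above.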
Using carefully chosen 3rd party tools does make our transition less painful for sure.
Even though there were some complications we are very happy
with all the new possibilities wide open to us
Jeremy Henry; Henry Chao;
Jose Duarte
SDSC team
Things to consider:
- Migration process
A/B switch VS smooth migration
- Monitoring
- CI/CD
- Development
Required knowledge base
Tools/utilities
New mindset
- Infrastructure:
Storage
NFS -> CEPHFS*
- Technical challenges
- Choosing 3rd party tools/utilities/infrastructure
Concept maps help greatly!
* not affiliated with any of them
Or how we learned the hard way that
"deployments do not create pods, replicasets do"
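A minimal sketch of how to follow that ownership chain when a Deployment reports no pods. The helper name `trace_rollout` and its arguments are hypothetical; the `kubectl` subcommands are standard.

```shell
# Hypothetical helper: walk Deployment -> ReplicaSet -> Pod in one namespace.
# Usage: trace_rollout <namespace> <deployment>
trace_rollout() {
  ns="$1"
  deploy="$2"

  # The Deployment only tracks desired state and rollout progress.
  kubectl rollout status --namespace "$ns" "deployment/$deploy" --timeout=30s

  # The ReplicaSet is what actually creates pods, so pod-creation failures
  # are reported in the ReplicaSet's events, not the Deployment's.
  kubectl get replicasets --namespace "$ns" -o wide
  kubectl describe replicasets --namespace "$ns"

  # Finally, list the pods that were (or were not) created.
  kubectl get pods --namespace "$ns" -o wide
}
```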
Yet another annoying thing
"too many open files"
Prepare an image with your favorite tools on board and do:
IDEA:
service tab -> k8s -> configuration -> open in new tab -> switch namespaces -> follow log -> show yaml -> console -> port forwarding -> iterate through resources -> switch context
Octant:
cluster overview -> namespaces -> nodes -> applications -> peak java -> namespace overview -> terminal -> log
$ kubectl run -it networktest --image=my-favorite --restart=Never --rm -- /bin/ash
unlimited configuration possibilities!!!
VS
RCSB PDB Team
SDSC Team
Jose M. Duarte
Henry Chao
Jeremy Henry
Alyssa
Colby
Gavin
Thanks for listening!
And now to the fun part...
Questions and Answers,
Comments...
My contacts:
ingvord.ru
ikhokhryakov
ingvord
igor.khokhriakov@rcsb.org
ikhokhriakov@ucsd.edu
ingvord.mail@gmail.com
~20 services;
~80 instances per coast;
~160 running production service instances in total;
~1126 vCPUs; 9.95703125 TB of memory.
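A quick arithmetic check of the figures above. The two-coast count follows from "per coast" and the total; reading the memory figure as 10196 GiB expressed in TiB is an assumption, based only on the fact that 9.95703125 x 1024 is exactly 10196. The helper name `capacity_summary` is hypothetical.

```shell
# Hypothetical helper: recompute the totals quoted on the slide.
capacity_summary() {
  instances_per_coast=80
  coasts=2
  echo "total instances: $((instances_per_coast * coasts))"

  # Assumption: the memory figure is 10196 GiB divided by 1024.
  awk 'BEGIN { printf "memory: %d GiB = %.2f TiB\n", 10196, 10196 / 1024 }'
}

capacity_summary
# prints:
#   total instances: 160
#   memory: 10196 GiB = 9.96 TiB
```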
Communication between services internally seems to be blocked.
The new cluster looks like it's using a networking library called calico to handle network security,
and there might be some ports through that library they'll need to open for our deployments to work.
Those deployments are for integrating with Vault for secrets management and generating valid TLS certs,
so they're definitely needed before we can get other things deployed.
[Diagram, borrowed from cert-manager/docs*: cert-manager takes Issuers (letsencrypt-staging, letsencrypt-prod, hashicorp-vault, venafi-as-a-service, venafi-tpp) and Certificates (foo.bar.com, Issuer: venafi-tpp; example.com and www.example.com, Issuer: letsencrypt-prod) and stores the resulting signed keypairs in Kubernetes Secrets]
* not affiliated with any of them
Makes it possible for a complete end-to-end automation!
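A minimal sketch of the kind of issuer that makes this automation possible, assuming an ACME (Let's Encrypt staging) issuer with an HTTP-01 solver and an ingress class named `nginx`. The solver type, the ingress class and the helper name `staging_issuer_manifest` are assumptions, not details from the talk; the manifest fields follow the cert-manager `cert-manager.io/v1` API.

```shell
# Hypothetical helper: print a ClusterIssuer manifest for Let's Encrypt staging.
# Usage: staging_issuer_manifest <contact-email>
staging_issuer_manifest() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint: untrusted certificates, but generous rate limits.
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: $1
    privateKeySecretRef:
      # Secret that stores the ACME account key.
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx  # assumption: nginx ingress controller
EOF
}

# Example (email is a placeholder):
#   staging_issuer_manifest admin@example.org | kubectl apply -f -
```

Switching to production would mean a second issuer named `letsencrypt-prod` pointing at the production ACME endpoint.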
* RCSB Protein Data Bank: Efficient Searching and Simultaneous Access to One Million Computed Structure Models Alongside the PDB Structures Enabled by Architectural Advances
Journal of Molecular Biology · Feb 2, 2023
Igor Khokhriakov aka Ingvord
The RCSB PDB research-focused RCSB.org web portal serves more than 6M unique users annually across academic, government, industry, and public domains.