1-Intro
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-t1"
  namespace: "yaww-ai-platform"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 2
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_dir=/mnt/models/distilbert/distilbert-base-uncased-finetuned-sst-2-english"
        - "--model_name=bert"
      storageUri: "pvc://global-shared/models"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
        requests:
          cpu: "6"
          memory: 24Gi

Send a test request to the predictor:

curl -X POST "http://10.40.0.110:8081/v1/models/bert:predict" \
-H "Content-Type: application/json" \
-H "Host: bert-t1-predictor.yaww-ai-platform.svc.cluster.local" \
-d '{"inputs": "I love using KServe!"}'
{"predictions":[1]}%ไฝฟ็จ hey ๅๆต.
Load-test it with hey:

hey -z 30s -c 5 -m POST \
  -H "Content-Type: application/json" \
  -H "Host: bert-t1-predictor.yaww-ai-platform.svc.cluster.local" \
  -d '{"inputs": "I love using KServe!"}' \
  "http://10.40.0.110:8081/v1/models/bert:predict"
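With scaleMetric: concurrency and scaleTarget: 1, hey's 5 concurrent connections push the desired replica count toward 5 (observed concurrency ÷ target), which maxReplicas caps at 2. One way to watch the scale-out while hey runs (assuming kubectl access; the serving.kserve.io/inferenceservice label is the one KServe attaches to predictor pods):

kubectl get pods -n yaww-ai-platform -l serving.kserve.io/inferenceservice=bert-t1 -w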
2-Scaling Metrics

The scaling metrics below are the recommended ones; scaling on gpu-resource and other custom metrics is also supported. The default is concurrency. Possible values:
- concurrency: in-flight requests per replica; the autoscaler adds replicas when observed concurrency exceeds scaleTarget.
- rps: requests per second per replica. Size the target from expected load: at 50 rps with 0.5 s latency, Little's law gives 50 × 0.5 = 25 requests in flight at steady state (see the sketch below).
- cpu / memory: resource-based scaling is also accepted; independent of that, resources.limits caps each replica, as in the manifest above.
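A minimal sketch of switching this predictor to rps-based scaling (the model block stays as in the manifest above; the per-replica target of 25 and maxReplicas: 4 are assumptions derived from the Little's-law estimate, not measured capacity):

spec:
  predictor:
    minReplicas: 0
    maxReplicas: 4        # assumed headroom above the expected 50 rps
    scaleMetric: rps
    scaleTarget: 25       # assumed per-replica throughput; 50 rps total => 2 replicas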