
1-Intro

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-t1"
  namespace: "yaww-ai-platform"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  predictor:
    minReplicas: 0            # allow scale-to-zero when the service is idle
    maxReplicas: 2
    scaleTarget: 1            # target value of scaleMetric per replica
    scaleMetric: concurrency  # autoscale on concurrent in-flight requests
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_dir=/mnt/models/distilbert/distilbert-base-uncased-finetuned-sst-2-english"
        - "--model_name=bert"
      storageUri: "pvc://global-shared/models"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
        requests:
          cpu: "6"
          memory: 24Gi
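
Assuming the manifest above is saved as bert-t1.yaml (an arbitrary filename), it can be applied and checked with:

kubectl apply -f bert-t1.yaml
kubectl get inferenceservice bert-t1 -n yaww-ai-platform  # wait for READY=True

Once the service is ready, send a prediction request through the gateway: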
curl -X POST "http://10.40.0.110:8081/v1/models/bert:predict" \
    -H "Content-Type: application/json" \
    -H "Host: bert-t1-predictor.yaww-ai-platform.svc.cluster.local" \
    -d '{"inputs": "I love using KServe!"}'
 
{"predictions":[1]}%

Load test with hey.

hey -z 30s -c 5 -m POST \
    -H "Content-Type: application/json" \
    -H "Host: bert-t1-predictor.yaww-ai-platform.svc.cluster.local" \
    -d '{"inputs": "I love using KServe!"}' \
    "http://10.40.0.110:8081/v1/models/bert:predict"
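
While hey is running, you can watch the autoscaler spin up extra predictor pods and, after the load stops, scale back down toward zero. A minimal check, assuming KServe's standard serving.kserve.io/inferenceservice pod label:

kubectl get pods -n yaww-ai-platform \
    -l serving.kserve.io/inferenceservice=bert-t1 -w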

2-Scaling Metrics

ไธŠ้ข็š„ๆ‰ฉๅฎนๆŒ‡ๆ ‡ๆ˜ฏๆฏ”่พƒๆŽจ่็š„, ๅŒๆ—ถ่ฟ˜ๆ”ฏๆŒๅŸบไบŽ gpu-resource ็ญ‰็ญ‰ . ้ป˜่ฎคๅฐฑๆ˜ฏ concurrency. ๅฏ้€‰็š„ๅ€ผ:

  • concurrency : ๅนถ่กŒๅบฆ๏ผŒๆŒ‰็…งๅนถๅ‘
  • rps : request per second, ๅฆ‚ๆžœไฝ ๆœŸๅพ… 50 ็š„ rps ๏ผŒ0.5s ็š„ latency๏ผŒ็›ดๆŽฅๆ‹‰100ไธช, ๅฝ“็„ถๅฏไปฅ่ฎพ็ฝฎไธŠ้™
  • cpu
  • memory