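A Brief Look at Object Detection and Inference Acceleration

Introduction to Object Detection

YOLO is currently one of the most accurate object detection algorithms (for image segmentation, the ViT models next door are worth a look and are more efficient). Thanks to its fairly novel design it suits the needs of most users (it is object detection, after all). The rest of this post explains how YOLO works and how to use it.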

R-CNN

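As the ancestor of object detection networks, R-CNN opened the way for deep-learning-based object detection with its novel approach.

Since the task is object detection, we can write down the following approach:

List all candidate targets
Run a convolutional neural network on each candidate to get a feature vector
Decide the class from that feature vector

R-CNN follows essentially this pipeline. The candidates are listed with the Selective Search algorithm, which relies on the following similarity measures: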

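Color similarity

Build a color histogram for each region, normalize it with the L1 norm to get a color vector, and sum the bin-wise minimum of the two regions' histograms (histogram intersection).

$s_{color}=\sum_{k=1}^n \min(c_i^k,c_j^k)$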
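The formula above can be tried on toy data. The following is a minimal sketch, not the Selective Search implementation: the 4-bin histograms are made-up pixel counts, and the helper names `l1_normalize` and `histogram_intersection` are hypothetical. The same function applies to the texture histograms as well.

```python
def l1_normalize(hist):
    """Scale a histogram so that its bins sum to 1 (L1 normalization)."""
    total = sum(hist)
    if total == 0:
        return [0.0 for _ in hist]
    return [v / total for v in hist]


def histogram_intersection(hist_i, hist_j):
    """Sum of the bin-wise minima of two L1-normalized histograms."""
    a = l1_normalize(hist_i)
    b = l1_normalize(hist_j)
    return sum(min(x, y) for x, y in zip(a, b))


# Toy 4-bin histograms (hypothetical pixel counts per color bin)
region_i = [10, 20, 30, 40]
region_j = [40, 30, 20, 10]
s_color = histogram_intersection(region_i, region_j)
print(s_color)
```

Identical histograms give a similarity of 1, and histograms with no bin in common give 0.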

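Texture similarity

Texture similarity is based on the derivatives between neighboring pixels. The derivatives are collected into a histogram, normalized with the L1 norm, and compared the same way as the color histograms.

$s_{texture}=\sum_{k=1}^n \min(t_i^k,t_j^k)$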

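Size similarity

Size similarity makes small regions merge first. If regions were merged on color and texture alone, a merged region would tend to keep swallowing its neighbors, so the multi-scale behavior would apply only to that local area instead of the whole image. Giving small regions more weight ensures that merging happens at multiple scales everywhere in the image.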

$s_{size}=1-\frac{size(r_i)+size(r_j)}{size(im)}$

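Fill similarity

Fill similarity measures how well two adjacent regions fit into each other. $Box_{ij}$ is the bounding rectangle that contains both regions $r_i$ and $r_j$.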

$s_{fill}=1-\frac{size(Box_{ij})-size(r_i)-size(r_j)}{size(im)}$

The final similarity is the sum of all the values above.

$s=s_{color}+s_{texture}+s_{size}+s_{fill}$
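To make the size and fill terms concrete, here is a small worked example. It is a sketch under assumptions: regions are treated as axis-aligned boxes `(x1, y1, x2, y2)` that are completely filled, the image is 100 x 100, and the color and texture values are simply made-up inputs.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return (box[2] - box[0]) * (box[3] - box[1])


def enclosing_box(box_i, box_j):
    """Tight bounding box around both boxes (Box_ij in the formula)."""
    return (min(box_i[0], box_j[0]), min(box_i[1], box_j[1]),
            max(box_i[2], box_j[2]), max(box_i[3], box_j[3]))


def size_similarity(size_i, size_j, size_im):
    return 1.0 - (size_i + size_j) / size_im


def fill_similarity(size_i, size_j, size_box, size_im):
    return 1.0 - (size_box - size_i - size_j) / size_im


size_im = 100 * 100                       # hypothetical image size
r_i, r_j = (0, 0, 10, 10), (10, 0, 20, 10)  # two touching regions
size_i, size_j = box_area(r_i), box_area(r_j)
size_box = box_area(enclosing_box(r_i, r_j))

s_size = size_similarity(size_i, size_j, size_im)
s_fill = fill_similarity(size_i, size_j, size_box, size_im)
s_color, s_texture = 0.6, 0.5             # made-up values for illustration
s = s_color + s_texture + s_size + s_fill
print(s_size, s_fill, s)
```

The two regions are small and tile their bounding box with no gap, so both `s_size` and `s_fill` come out close to 1, which is exactly the kind of pair that gets merged early.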

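This gives us a large number of candidate regions, although their precision is fairly low.

Because there are too many regions and their precision is low, we need to get rid of the ones that overlap too much. The measure used here is the intersection over union (IoU).

$IoU=\frac{\text{area of intersection}}{\text{area of union}}$

We want the regions we keep to overlap as little as possible, so regions whose IoU is large enough are merged.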
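As a quick illustration of the IoU formula, here is a minimal sketch for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so that disjoint boxes have no intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
print(iou((0, 0, 1, 1), (5, 5, 6, 6)))  # disjoint boxes
```

Note that the union is the sum of both areas minus the intersection, not the area of the rectangle that encloses both boxes.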

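NMS (Non-Maximum Suppression)

This step is the heart of object detection. It handles the borderline case of boxes whose IoU is acceptable but which still duplicate each other:

Sort all boxes by score and pick the highest score together with its box.
Go through the remaining boxes. If a box's overlap (IoU) with the current top-scoring box is above a threshold (a value around 0.5 is common), delete that box. (Why delete it? Above the threshold, the two boxes are assumed to contain the same object, for example the same dog, and we only need to keep one box for it.)
Pick the highest-scoring box among those not yet processed and repeat the process.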
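The three steps above can be sketched as follows. This is a minimal, class-agnostic version for illustration; the box format `(x1, y1, x2, y2)`, the function names, and the sample boxes are assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    # Step 1: sort box indices by score, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Step 2: drop every remaining box that overlaps the best one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step 3: the loop continues with the best of what is left
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and is kept.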

SPP-Net

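This network improves on the convolution stage of R-CNN. R-CNN runs the CNN region by region, whereas SPP-Net runs the convolution once over the whole image.

The drawbacks are obvious:

It needs a large training set to train the CNN
Training takes a long time and has many stages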

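YOLO

The figure below shows results on the COCO test set. The accuracy is very high, and objects in overlapping regions are still detected, which is where YOLO's strength lies.

The figure below shows the ONNX structure graph:

Locally it looks no different from a traditional CNN. It also adds a Split step along with the C2F and C3 units, i.e. convolution over half of the channels and repeated convolution. First, let's go through the units YOLO has used: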

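CSP

One branch applies a convolution directly, the other applies residual blocks first and then a convolution. The two results are fused by concatenation and then normalized, which gives the final forward-pass output.

C2F

This unit uses a Bottleneck structure, which contains CBL layers (convolution, batch normalization, activation) and adds its output to its input.

C2F chains a CBL layer with a finite number of Bottlenecks in series to produce the final output.

SPP

A CBL layer is followed by several pooling layers in series. All of their outputs are concatenated with the input, and the result goes through another CBL layer.

Next, let's use YOLOv8 through the ultralytics package!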
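The pooling-and-concatenation part of this unit can be sketched in plain Python. This is only an illustration under assumptions: the CBL layers are left out, the feature map is a single-channel 2D list, and the kernel size of 3 and the three pooling stages are values picked for this toy example rather than taken from the model.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling that keeps the input size (borders are clipped)."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_serial(fmap, k=3, stages=3):
    """Chain the pooling layers and stack the input with every stage's output."""
    channels = [fmap]
    current = fmap
    for _ in range(stages):
        current = max_pool_same(current, k)  # each stage pools the previous output
        channels.append(current)
    return channels  # "concatenation" along the channel axis


# A 5x5 map with a single activation in the center
fmap = [[0] * 5 for _ in range(5)]
fmap[2][2] = 1
out = spp_serial(fmap)
print(len(out))
for row in out[1]:
    print(row)
```

Each pooling stage widens the area influenced by the center activation, so the stacked channels describe the same location at several receptive-field sizes.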

import cv2
from ultralytics import YOLO
from cv2 import getTickCount, getTickFrequency
import torch
# Run on CUDA when it is available, otherwise on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the YOLOv8 model
model = YOLO("./yolov8n.pt")

# Open the camera; index 0 selects the default camera
cap = cv2.VideoCapture(0)

while cap.isOpened():
    loop_start = getTickCount()
    success, frame = cap.read()  # read one frame from the camera

    if not success:
        break  # stop when no frame could be read

    results = model.predict(source=frame, device=device)  # run detection on the current frame
    annotated_frame = results[0].plot()  # draw the detections onto the frame

    # Put your own display code here
    loop_time = getTickCount() - loop_start
    total_time = loop_time / getTickFrequency()
    FPS = 1 / total_time
    # Draw the FPS text in the top-left corner of the image
    fps_text = f"FPS: {FPS:.2f}"
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 1
    font_thickness = 2
    text_color = (0, 0, 255)  # red (BGR)
    text_position = (10, 30)  # top-left corner

    cv2.putText(annotated_frame, fps_text, text_position, font, font_scale, text_color, font_thickness)
    cv2.imshow('img', annotated_frame)
    # Press 'q' to leave the loop
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()  # release the camera
cv2.destroyAllWindows()  # close the OpenCV windows

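Try it and see the results for yourself.

Next we want to speed it up. A torch model can be exported to the general-purpose ONNX format, and the YOLO tooling can produce an ONNX file as well.

Using ONNX with TensorRT

First convert the ONNX model into a serialized TensorRT engine (saved here as a .trt file), then load and use it directly. Let's implement the accelerated YOLO in C++!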

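#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cuda_runtime_api.h>
#include "NvInfer.h"
#include "preprocessing.hpp"
//#include "logging.h"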

using namespace nvinfer1;
using namespace std;

const int model_width = 640;
const int model_height = 640;

class MyLogger : public nvinfer1::ILogger
{
    public:
    explicit MyLogger(nvinfer1::ILogger::Severity severity =nvinfer1::ILogger::Severity::kWARNING) : severity_(severity) {}

    void log(nvinfer1::ILogger::Severity severity, const char *msg) noexcept override
    {
        if (severity <= severity_) {
            std::cerr << msg << std::endl;
        }
    }
    nvinfer1::ILogger::Severity severity_;
};

int main()
{
// 1. Image preprocessing
    string image_path = R"(D:\C++\YoloMe\cmake-build-debug\res.jpg)";
    cv::Mat input_image = cv::imread(image_path);
    if (input_image.empty()) {
        cerr << "Failed to read image: " << image_path << endl;
        return -1;
    }

    float* input_blob = new float[model_height * model_width * 3];
    cv::Mat resize_image;
    // Scale factor that fits the image into the model input while keeping the aspect ratio
    const float _ratio = std::min(model_width / (input_image.cols * 1.0f),
                            model_height / (input_image.rows * 1.0f));
    // Size of the image after proportional scaling
    const int border_width = input_image.cols * _ratio;
    const int border_height = input_image.rows * _ratio;
    // Offsets of the scaled image inside the padded model input
    const int x_offset = (model_width - border_width) / 2;
    const int y_offset = (model_height - border_height) / 2;

    // Resize the input image into resize_image
    cv::resize(input_image, resize_image, cv::Size(border_width, border_height));
    // Pad with gray borders; the bottom/right borders absorb any odd remainder
    // so that the result is exactly model_width x model_height
    cv::copyMakeBorder(resize_image, resize_image, y_offset, model_height - border_height - y_offset,
                        x_offset, model_width - border_width - x_offset,
                        cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
    // Convert to RGB
    cv::cvtColor(resize_image, resize_image, cv::COLOR_BGR2RGB);

    // Normalize to [0, 1] and reorder from HWC to CHW
    const int channels = resize_image.channels();
    const int width = resize_image.cols;
    const int height = resize_image.rows;
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < height; h++) {
            for (int w = 0; w < width; w++) {
                // at<cv::Vec3b>(h, w) returns the color values of the pixel at row h, column w
                input_blob[c * width * height + h * width + w] =
                    resize_image.at<cv::Vec3b>(h, w)[c] / 255.0f;
            }
        }
    }

// 2. Deserialize the model
    MyLogger logger;
    // Read the serialized engine
    const std::string engine_file_path = R"(D:\C++\YoloMe\cmake-build-debug\yolov8n.trt)";  // put the path of your own trt file here (an absolute path is required)
    std::stringstream engine_file_stream;
    std::ifstream ifs(engine_file_path, std::ios::binary);  // the engine is binary data, so open it in binary mode
    if (!ifs) {
        cerr << "Failed to open engine file: " << engine_file_path << endl;
        delete[] input_blob;
        return -1;
    }
    engine_file_stream << ifs.rdbuf();  // copy the whole file into engine_file_stream
    ifs.close();

    engine_file_stream.seekg(0, std::ios::end); // move to the end of the stream to measure its length
    const int model_size = engine_file_stream.tellg();  // total length of the stream
    engine_file_stream.seekg(0, std::ios::beg);
    void *model_mem = malloc(model_size);               // allocate a buffer of the same length
    engine_file_stream.read(static_cast<char *>(model_mem), model_size);    // read the content into model_mem

    nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(model_mem, model_size);

    free(model_mem);

// 3. Inference (getBindingDimensions and enqueueV2 belong to the TensorRT 8.x binding API)
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();

    void *buffers[2];
    // Get the model input dimensions and allocate GPU memory
    nvinfer1::Dims input_dim = engine->getBindingDimensions(0);
    int input_size = 1;
    for (int j = 0; j < input_dim.nbDims; ++j) {
        if(input_dim.d[j] < 0)
            input_size *= -input_dim.d[j];
        else
            input_size *= input_dim.d[j];
    }
    cudaMalloc(&buffers[0], input_size * sizeof(float));

    // Get the model output dimensions and allocate GPU memory
    nvinfer1::Dims output_dim = engine->getBindingDimensions(1);

    int output_size = 1;
    for (int j = 0; j < output_dim.nbDims; ++j) {
        if(output_dim.d[j] < 0)
            output_size *= -output_dim.d[j];
        else
            output_size *= output_dim.d[j];
    }
    cudaMalloc(&buffers[1], output_size * sizeof(float));

    // Allocate CPU memory for the model output
    float *output_buffer = new float[output_size];
    // Feed the data in
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Copy the input data to the GPU
    cudaMemcpyAsync(buffers[0], input_blob, input_size * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    // Run inference
    if(context->enqueueV2(buffers, stream, nullptr))
    {
        cout << "enqueueV2 inference succeeded" << endl;
    }
    else{
        cout << "enqueueV2 inference failed" << endl;
        return -1;
    }
    // Copy the output data back to the CPU
    cudaMemcpyAsync(output_buffer, buffers[1], output_size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    // Release the GPU resources and the TensorRT objects
    cudaStreamDestroy(stream);
    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
    delete context;
    delete engine;
    delete runtime;
    delete[] input_blob;

// 4. Decode output_buffer into objs; xywh are the center coordinates plus width and height
    float *ptr = output_buffer;     // 1x84x8400  =  705600
    vector<vector<float>> temp(84, vector<float>(8400));
    vector<vector<float>> outVec(8400, vector<float>(84));
    // Copy the flat buffer into an 84 x 8400 matrix
    for(int i = 0; i < 705600; i++)
    {
        temp[i/8400][i%8400] = *ptr;
        ptr++;
    }
    // Transpose to 8400 x 84 so that each row describes one candidate box
    for(int i = 0; i < 84; i++)
    {
        for(int j = 0; j < 8400; j++)
        {
            outVec[j][i] = temp[i][j];
        }
    }
    std::vector<Object> objs;
    for (int i = 0; i < 8400; ++i)
    {
        // Columns 4..83 hold the 80 class scores. The 84-value layout has no separate
        // objectness entry, so the best class score is used directly as the confidence.
        // std::max_element returns an iterator to the largest element in the range.
        const auto best = std::max_element(outVec[i].begin() + 4, outVec[i].end());
        const float confidence = *best;
        if (confidence >= 0.45f)
        {
            const int label = best - (outVec[i].begin() + 4);
            const float bx = outVec[i][0];
            const float by = outVec[i][1];
            const float bw = outVec[i][2];
            const float bh = outVec[i][3];
            Object obj;
            // Map the box back to the original image size: subtract the offsets and
            // convert the center coordinates xy into top-left coordinates xy
            obj.box.x = (bx - bw * 0.5f - x_offset) / _ratio;
            obj.box.y = (by - bh * 0.5f - y_offset) / _ratio;
            obj.box.width = bw / _ratio;
            obj.box.height = bh / _ratio;
            obj.label = label;
            obj.confidence = confidence;
            objs.push_back(std::move(obj));
        }
    }  // i loop

// 5. NMS (non-maximum suppression)
    vector<Object> output;
    hardNMS(objs, output, 0.6, 10);

// 6. Draw the boxes
    vector<Object>::iterator it = output.begin();
    while(it != output.end()){
        cv::Point topLeft(it->box.x, it->box.y);
        cv::Point bottomRight(it->box.x + it->box.width, it->box.y + it->box.height);
        cv::rectangle(input_image, topLeft, bottomRight, cv::Scalar(0, 0, 255), 2);
        std::stringstream buff;
        buff.precision(2);  // override the default precision: keep 2 decimals for the confidence
        buff.setf(std::ios::fixed);
        buff << it->confidence;
        string text =names[it->label] + " " + buff.str();
        cv::putText(input_image, text, topLeft, 0, 1, cv::Scalar(0, 255, 0), 2);
        it++;
    }
    cv::imwrite("detected.jpg", input_image);

    return 0;
}

preprocessing.hpp

#ifndef YOLOME_PREPROCESSING_HPP
#define YOLOME_PREPROCESSING_HPP

#include <algorithm>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>
#include <list>
using namespace std;

// Class names, using the COCO dataset as an example
string names[] = {"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
                  "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
                  "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
                  "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
                  "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
                  "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
                  "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
                  "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
                  "hair drier", "toothbrush"};

struct BOX
{
    float x;
    float y;
    float width;
    float height;
};

struct Object
{
    BOX box;    // top-left corner plus width and height
    int label;
    float confidence;  // the final score of the box, used for filtering and for ranking in NMS
};

bool cmp(Object &obj1, Object &obj2){
    return obj1.confidence > obj2.confidence;
}

float iou_of(const Object &obj1, const Object &obj2)
{
    float x1_lu = obj1.box.x;
    float y1_lu = obj1.box.y;
    float x1_rb = x1_lu + obj1.box.width;
    float y1_rb = y1_lu + obj1.box.height;
    float x2_lu = obj2.box.x;
    float y2_lu = obj2.box.y;
    float x2_rb = x2_lu + obj2.box.width;
    float y2_rb = y2_lu + obj2.box.height;
    // Top-left corner of the intersection: i_x1, i_y1
    float i_x1 = std::max(x1_lu, x2_lu);
    float i_y1 = std::max(y1_lu, y2_lu);
    // Bottom-right corner of the intersection: i_x2, i_y2
    float i_x2 = std::min(x1_rb, x2_rb);
    float i_y2 = std::min(y1_rb, y2_rb);
    // Width and height of the intersection, clamped to 0 when the boxes do not overlap
    float i_w = std::max(0.0f, i_x2 - i_x1);
    float i_h = std::max(0.0f, i_y2 - i_y1);
    float inter_area = i_w * i_h;
    // Area of the union: the sum of both areas minus the intersection
    // (the enclosing rectangle would overestimate it)
    float union_area = obj1.box.width * obj1.box.height
                     + obj2.box.width * obj2.box.height - inter_area;
    if (union_area <= 0.0f)
        return 0.0f;

    return inter_area / union_area;
}

std::vector<int> hardNMS(std::vector<Object> &input, std::vector<Object> &output, float iou_threshold, unsigned int topk)
{  // Boxes are compared regardless of their label (class-agnostic NMS)
    const unsigned int box_num = input.size();
    std::vector<int> merged(box_num, 0);
    std::vector<int> indices;

    if (input.empty())
        return indices;
    std::vector<Object> res;
    // First sort the boxes by confidence, highest first
    std::sort(input.begin(), input.end(),
              [](const Object &a, const Object &b)
              { return a.confidence > b.confidence; });   // [] introduces a C++ lambda

    unsigned int count = 0;
    for (unsigned int i = 0; i < box_num; ++i)
    {   // Visit the boxes in order of confidence
        if (merged[i])
            continue;
        // Skip boxes that have already been suppressed
        Object buf;
        buf = input[i];
        merged[i] = 1; // mark the current box as processed

        // Later boxes have lower confidence, so only the boxes after the current one need checking
        for (unsigned int j = i + 1; j < box_num; ++j)
        {
            if (merged[j])
                continue;

            float iou = static_cast<float>(iou_of(input[j], input[i]));
            // Compare the IoU with the threshold
            if (iou > iou_threshold)
            { // Above the threshold the boxes count as overlapping, so suppress box j
                merged[j] = 1;
            }
        }
        indices.push_back(i);
        res.push_back(buf); // keep the box with the highest confidence

        // keep top k
        // Return only the first k boxes; this targets dense outputs, and input has already been filtered by confidence
        count += 1;
        if (count >= topk)
            break;
    }
    output.swap(res);

    return indices;
}

float sigmoid(float x)
{
    return 1.0 / (exp(-x) + 1.0);
}

#endif //YOLOME_PREPROCESSING_HPP
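The letterbox arithmetic used by the C++ program above (scale by `_ratio`, pad by `x_offset` and `y_offset`, then undo both when decoding boxes) is easy to get wrong, so here is a small Python check of the same math. The 1280 x 720 image size and the sample box are made-up values, and the function names are hypothetical.

```python
def letterbox_params(img_w, img_h, model_w=640, model_h=640):
    """Scale factor and padding offsets for fitting an image into the model input."""
    ratio = min(model_w / img_w, model_h / img_h)
    new_w = int(img_w * ratio)  # truncated like the C++ int conversion
    new_h = int(img_h * ratio)
    x_offset = (model_w - new_w) // 2
    y_offset = (model_h - new_h) // 2
    return ratio, x_offset, y_offset


def to_original(bx, by, bw, bh, ratio, x_offset, y_offset):
    """Map a center-format box in model coordinates to a top-left box in image coordinates."""
    x = (bx - bw * 0.5 - x_offset) / ratio
    y = (by - bh * 0.5 - y_offset) / ratio
    return x, y, bw / ratio, bh / ratio


ratio, x_offset, y_offset = letterbox_params(1280, 720)
box = to_original(320, 320, 100, 50, ratio, x_offset, y_offset)
print(ratio, x_offset, y_offset)
print(box)
```

For a 1280 x 720 frame the image is halved to 640 x 360 and padded by 140 pixels at the top, so a box centered at (320, 320) in model coordinates lands at the top-left corner (540, 310) with size 200 x 100 in the original frame.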