[COMMA: An AI-Based Learning Assistant for People with Visual and Hearing Impairments] Generating Alternative Text with OpenAI's GPT-4o

2024. 11. 21. 19:47

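COMMA provides an alternative-text generation feature that converts lecture materials into text for visually impaired users. It analyzes the visual elements of the materials and turns them into text, giving users a learning environment that suits them.

In this tutorial, we use Dart and Flutter to build COMMA's alternative-text generation feature: upload a JPG or PDF file, and the alternative text is generated automatically and printed.

※ Alternative text: a standard web-accessibility technique for visually impaired users; a description or phrase that spells out information presented as an image so that they can understand it.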

0. Development environment

All code in this tutorial was written with Flutter 3.24.3 and Dart 3.5.3, and should work with later releases that ship Dart 3.5 or newer (Flutter 3.24 and up).

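1. Issuing an API key for OpenAI

1. Create an OpenAI account
- Go to the OpenAI website: [https://platform.openai.com/]

2. Go to the API key page
- After logging in, open the "API Keys" page: [https://platform.openai.com/account/api-keys]

3. Create a new API key
- Click the Create new secret key button to create a new API key

The secret key is shown only once when it is created and cannot be viewed again, so save it somewhere safe (for example, a notes file) right away!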

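2. AI and libraries used

[ AI ]

Alternative text is generated with 'GPT-4o: natural-language information extraction and generation engine'.

The natural-language information extraction engine scans the characters in the uploaded JPG or PDF file and converts them into machine-readable text.

The generation engine converts visual content in the uploaded JPG or PDF file, such as pictures, graphs, and tables, into descriptive text that can be understood by listening alone.

Both behaviors come from prompting the gpt-4o model so that its output is accurate and follows the required format.

The prompt is covered in detail below under [ alt.dart code ] - callChatGPT4APIForAlternativeText().

[ Libraries ]

The libraries used in this tutorial can be installed by running the following commands in a terminal.

flutter pub add dart_openai
flutter pub add file_picker

dart_openai : a library for interacting with the OpenAI API
file_picker : a library for picking files of various formats, such as PDFs and images, from the user's device

The code shown in this post calls the OpenAI endpoint through http, and the full source at the end also imports pdf_render and firebase_storage (the _pickFile() excerpt additionally references pdfx); add those packages the same way.

👉 Lecture materials are uploaded through file_picker, and the alternative text is generated with the OpenAI API and dart_openai.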

3. File structure

lib/
├── env/
│   └── env.dart
└── alt.dart

env.dart : stores the OpenAI API key

alt.dart : uploads lecture materials (images, PDFs, etc.) and uses gpt-4o to generate alternative text for them

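4. Code walkthrough

[ env.dart code ]

Replace 'api key' with the API key you issued earlier.

final class Env {
  // Paste your OpenAI API key here, and keep this file out of version control.
  // The class is public (Env, not _Env) so that alt.dart can read Env.apiKey.
  static const String apiKey = 'api key';
}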
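If you would rather not type the key into a source file at all, one option is to pass it in at build time. The sketch below is only an illustration under that assumption: the variable name OPENAI_API_KEY and the helper requireApiKey are made up for this example and are not part of the COMMA code. Note that the value is still compiled into the app, so this keeps the key out of the repository rather than making it secret.

```dart
// Hypothetical alternative to hard-coding the key in env.dart.
// Supply the value at build time, for example:
//   flutter run --dart-define=OPENAI_API_KEY=<your key>
// When the define is missing, the constant is an empty string.
const String openAiApiKey = String.fromEnvironment('OPENAI_API_KEY');

/// Returns [key] unchanged, or throws if it is blank so that a missing key
/// fails early instead of surfacing later as an authentication error.
String requireApiKey(String key) {
  if (key.trim().isEmpty) {
    throw StateError('OPENAI_API_KEY is not set; pass it with --dart-define.');
  }
  return key;
}
```

With this in place, a call site could use requireApiKey(openAiApiKey) wherever Env.apiKey is used in the tutorial.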

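[ alt.dart code ]

This is the most complex code in the tutorial, so before analyzing the functions one by one, let's look at the overall code flow.

Code flow

1. When a JPG or PDF file is uploaded, the code determines which of the two formats it is.

2. For a JPG file) The image is passed to gpt-4o to generate the alternative text.

For a PDF file) The PDF is split page by page, and each page is converted into an image.

► The images are passed to gpt-4o to generate the alternative text.

3. The alternative text that gpt-4o generates from the prompt is printed.

👉 Why the JPG and PDF cases differ

The gpt-4o request used in this tutorial accepts only image files!

A PDF therefore has to be split into individual page images on our side before it is handed over, which is why the process is divided into two cases.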
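The branching described above can be sketched roughly as follows. This is a simplified illustration, not the COMMA implementation: MaterialKind, detectKind, and toImagePages are names invented for this example, and the pdfToImages parameter stands in for a converter such as the convertPdfToImages function discussed later.

```dart
import 'dart:typed_data';

/// Kinds of lecture material the flow distinguishes (illustrative names).
enum MaterialKind { image, pdf, unsupported }

/// Classifies a file by its extension, ignoring letter case.
MaterialKind detectKind(String fileName) {
  final name = fileName.toLowerCase();
  if (name.endsWith('.pdf')) return MaterialKind.pdf;
  if (name.endsWith('.jpg') ||
      name.endsWith('.jpeg') ||
      name.endsWith('.png')) {
    return MaterialKind.image;
  }
  return MaterialKind.unsupported;
}

/// Returns the images to send to the model: the file itself for an image,
/// or one image per page for a PDF.
Future<List<Uint8List>> toImagePages(
  String fileName,
  Uint8List bytes,
  Future<List<Uint8List>> Function(Uint8List pdfBytes) pdfToImages,
) async {
  switch (detectKind(fileName)) {
    case MaterialKind.image:
      return [bytes];
    case MaterialKind.pdf:
      return pdfToImages(bytes);
    case MaterialKind.unsupported:
      throw UnsupportedError('Unsupported file type: $fileName');
  }
}
```

Either branch ends with a list of images, so the step that calls gpt-4o does not need to know which kind of file was uploaded.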

_pickFile()

Uploads a file and determines whether the uploaded file is a PDF or an image.

It checks the file's format and sets _isPDF to true for a PDF and to false for an image.

This prepares PDF files to be converted into images in a later step.

Future<void> _pickFile() async {
    FilePickerResult? result = await FilePicker.platform.pickFiles();

    if (result != null) {
      Uint8List? fileBytes = result.files.first.bytes;
      String fileName = result.files.first.name;

      // When the picker did not load the bytes, read them from the file path.
      if (fileBytes == null) {
        String? filePath = result.files.first.path;
        if (filePath != null) {
          File file = File(filePath);
          fileBytes = await file.readAsBytes();
        } else {
          return;
        }
      }
      try {
        // Compare extensions case-insensitively so "SLIDES.PDF" also matches.
        final lowerName = fileName.toLowerCase();
        String mimeType = 'application/octet-stream';
        if (lowerName.endsWith('.pdf')) {
          mimeType = 'application/pdf';
          _isPDF = true;
        } else if (lowerName.endsWith('.png')) {
          mimeType = 'image/png';
          _isPDF = false;
        } else if (lowerName.endsWith('.jpg') || lowerName.endsWith('.jpeg')) {
          mimeType = 'image/jpeg';
          _isPDF = false;
        }

        setState(() {
          _selectedFileName = fileName;
          _isMaterialEmbedded = true;
          _isIconVisible = false;
          _fileBytes = fileBytes;

          // Open the PDF for in-app preview.
          if (_isPDF) {
            _pdfController = pdfx.PdfController(
              document: pdfx.PdfDocument.openData(fileBytes!),
            );
          }
        });

        print('File selected: $fileName ($mimeType)');
      } catch (e) {
        print('File selection failed: $e');
      }
    }
  }
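Checking the extension trusts whatever the file is named. As a hedged alternative (a different technique from the one _pickFile() uses), the first few bytes of the file can be inspected instead, since PDF, JPEG, and PNG files each begin with a well-known signature. The function name sniffMimeType is invented for this sketch.

```dart
import 'dart:typed_data';

/// True when [bytes] begins with every value in [signature].
bool _startsWith(Uint8List bytes, List<int> signature) {
  if (bytes.length < signature.length) return false;
  for (var i = 0; i < signature.length; i++) {
    if (bytes[i] != signature[i]) return false;
  }
  return true;
}

/// Guesses a MIME type from the leading bytes, or returns null if the
/// signature is not one of the three formats this tutorial handles.
String? sniffMimeType(Uint8List bytes) {
  // "%PDF"
  if (_startsWith(bytes, [0x25, 0x50, 0x44, 0x46])) return 'application/pdf';
  // JPEG start-of-image marker
  if (_startsWith(bytes, [0xFF, 0xD8, 0xFF])) return 'image/jpeg';
  // "\x89PNG"
  if (_startsWith(bytes, [0x89, 0x50, 0x4E, 0x47])) return 'image/png';
  return null;
}
```

A null result could be treated as an unsupported file, which also covers the case where a file has a .pdf name but different contents.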

handlePdfUpload()

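Handles PDF files.

When the uploaded file is a PDF, this function calls convertPdfToImages internally to turn the PDF into one image per page. It then uploads those images to Firebase Storage, as in the full source at the end of this post, and returns their download URLs as the imageUrls list.

Future<List<String>> handlePdfUpload(Uint8List pdfBytes, int userKey) async {
    try {
      // Convert the PDF into one image per page.
      print('Starting PDF to image conversion...');
      List<Uint8List> images = await convertPdfToImages(pdfBytes);
      print(
          'PDF to image conversion completed. Number of images: ${images.length}');

      // Upload each page image to Firebase Storage and collect its URL.
      // The bytes are PNG-encoded, so the object name uses .png.
      List<String> imageUrls = [];
      for (int i = 0; i < images.length; i++) {
        final storageRef = FirebaseStorage.instance
            .ref()
            .child('uploads/$userKey/pdf_handle/page_$i.png');
        final taskSnapshot = await storageRef.putData(images[i]);
        imageUrls.add(await taskSnapshot.ref.getDownloadURL());
      }

      // Return the list of image URLs.
      return imageUrls;
    } catch (e) {
      print('Error: $e');
      return [];
    }
  }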
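Uploading to storage is one way to give the model something to fetch. Another option, shown here only as a sketch, is to embed the image bytes directly in the request as a base64 data URL, which the image_url field of the Chat Completions API also accepts. The helper name toDataUrl is an assumption made for this example.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Encodes image bytes as a data URL of the form
/// "data:<mimeType>;base64,<payload>".
String toDataUrl(Uint8List imageBytes, {String mimeType = 'image/png'}) {
  return 'data:$mimeType;base64,${base64Encode(imageBytes)}';
}
```

The trade-off is request size: base64 makes the payload about a third larger than the raw bytes, so for long PDFs the storage-URL approach used in this tutorial may be the better fit.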

convertPdfToImages()

Called from handlePdfUpload, this function converts the uploaded PDF file into images.

It walks through the PDF one page at a time, renders each page, and encodes it as PNG image bytes.

  Future<List<Uint8List>> convertPdfToImages(Uint8List pdfBytes) async {
    final document = await pdfr.PdfDocument.openData(pdfBytes);
    final pageCount = document.pageCount;
    List<Uint8List> images = [];

    for (int i = 0; i < pageCount; i++) {
      final page = await document.getPage(i + 1);
      final pageImage = await page.render(
        width: page.width.toInt(),
        height: page.height.toInt(),
        x: 0,
        y: 0,
      );

      final image = await pageImage.createImageIfNotAvailable();
      final imageData = await image.toByteData(format: ui.ImageByteFormat.png);
      if (imageData != null) {
        images.add(imageData.buffer.asUint8List());
      }
    }
    return images;
  }

callChatGPT4APIForAlternativeText()

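Takes a list of image URLs and generates alternative text for each image.

It uses the OpenAI GPT model, and the prompt is defined in the promptForAlternativeText constant.

Prompt summary

1. The prompt is designed to produce alternative text suited to visually impaired users, and text in the materials is transcribed without modification.

2. Visual materials such as graphs and tables are described in detail from top to bottom and from left to right.

3. The generated alternative text is written so that users can understand the learning material by listening to it.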

…

Future<String> callChatGPT4APIForAlternativeText(
      List<String> imageUrls, int userKey, String lectureFileName) async {
    const String apiKey = Env.apiKey;
    final Uri apiUrl = Uri.parse('https://api.openai.com/v1/chat/completions');

     const String promptForAlternativeText = '''
    Please convert the content of the following lecture materials into text so that visually impaired individuals can recognize it using a screen reader. 
    Write all the text that is in the lecture materials as IT IS, without any additional description or modification.
    If there is a picture in the lecture material, please generate an alternative text that describes the picture.
    Visually impaired individuals should be able to understand where and what letters or pictures are located in the lecture materials through this text.
    Please write all descriptions in Korean.
    Conditions: 
    1. Write the text included in the lecture materials without any modifications. 
    2. Write as clearly and concisely as possible.
    3. When creating alternative text for images, do not indicate the position of the image. Instead, describe the image from top to bottom.
    4. Determine the type of visual content (table, diagram, graph, or other) and specify the format as [표], [그림], [그래프], etc., followed by the descriptive text.
      After the description, mark the end with "[표 끝]","[그림 끝]", "[그래프 끝]".
    5. For each slide, format the text as follows: "이 페이지의 주제는 ~~~입니다."
    6. Write all text in the slides as continuous prose without special characters that are hard to read aloud. This includes excluding emoticons, emojis, and other symbols that are difficult to read aloud.
    7. Write numbers in words to ensure smooth reading. For example, "12번" should be written as "열두번" and "23번째" as "스물세 번째".
    8. For mathematical formulas and symbols, write them out in text form so that they can be read aloud properly by a screen reader. This includes symbols like sigma, square root, alpha, beta, etc.
    9. If mathematical symbols appear, convert them into text form based on your judgment, ensuring that the symbols are not written as they are but transformed into readable text.
    10. When generating alternative text for images, tables, or graphs, ensure that the description provides enough detail for visually impaired individuals to fully understand the content. Include details such as the structure, data values, trends, and key information to help them grasp the meaning of the table or graph as clearly as possible.
    11. For tables, graphs, or diagrams, specify the format as [표], [그림], [그래프], etc., followed by the descriptive text. Ensure that the description is detailed enough so that the visually impaired can understand the content as if they were seeing the table or graph themselves. Use words to explain key insights, trends, or important data points in graphs or tables.
   After the description, mark the end with "[표 끝]", "[그림 끝]", "[그래프 끝]".
  ''';
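The excerpt above does not show how the prompt and an image URL are combined into the messages value that is sent to the API. A minimal sketch of one way to do it is below. The role/content layout with an image_url part follows the Chat Completions format for image input; the function name buildRequestBody and its parameters are invented for this example.

```dart
import 'dart:convert';

/// Builds the request body for one image. The system message carries the
/// prompt, and the user message carries the image as an `image_url` part
/// so that the model receives the image rather than the URL as plain text.
Map<String, dynamic> buildRequestBody({
  required String prompt,
  required String imageUrl,
  String model = 'gpt-4o',
  int maxTokens = 1000,
}) {
  return {
    'model': model,
    'max_tokens': maxTokens,
    'messages': [
      {'role': 'system', 'content': prompt},
      {
        'role': 'user',
        'content': [
          {
            'type': 'image_url',
            'image_url': {'url': imageUrl},
          },
        ],
      },
    ],
  };
}

/// Serializes the body the same way the tutorial does before posting it.
String encodeRequestBody(Map<String, dynamic> body) => jsonEncode(body);
```

Calling buildRequestBody once per page image keeps each response tied to a single slide.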

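The request uses the GPT-4o model with the maximum number of tokens set to 1000; max_tokens caps the length of each generated response.

After testing several token limits early on, we judged 1000 to be the best value given the quality and level of detail of the alternative text.

The model and token limit can be chosen with their respective limits and cost in mind, and adjusted to fit the use case.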

var data = {"model": "gpt-4o", "messages": messages, "max_tokens": 1000};

        var apiResponse = await http.post(
          apiUrl,
          headers: {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer $apiKey',
          },
          body: jsonEncode(data),
        );
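Once the response comes back, the generated text sits under choices[0].message.content, and finish_reason tells you whether the reply was cut off by the max_tokens limit (the value is "length" in that case). The sketch below parses both; AltTextResult and parseAltText are names made up for this example. As in the full source at the end, decoding the body with utf8.decode(response.bodyBytes) before parsing helps keep Korean text intact.

```dart
import 'dart:convert';

/// Parsed result of one Chat Completions response (illustrative type).
class AltTextResult {
  AltTextResult(this.text, {required this.truncated});

  /// The generated alternative text.
  final String text;

  /// True when the model stopped because it hit the max_tokens limit.
  final bool truncated;
}

/// Extracts the first choice's text and whether it was cut off.
AltTextResult parseAltText(String responseBody) {
  final decoded = jsonDecode(responseBody) as Map<String, dynamic>;
  final choice = (decoded['choices'] as List).first as Map<String, dynamic>;
  final message = choice['message'] as Map<String, dynamic>;
  return AltTextResult(
    (message['content'] as String?) ?? '',
    truncated: choice['finish_reason'] == 'length',
  );
}
```

If truncated is true for dense slides, that is a sign the 1000-token limit may be too low for that material.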

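5. Final result and source code

[ Final result ]

Now let's actually run it!

1. Run the app with the "flutter run" command.

2. Upload a file. (The file uploaded here is a PDF.)

3. Result

► Alternative text for [대기와 해수의 순환] (Circulation of the Atmosphere and Seawater)

The text in the image is recognized as is, without distortion.

The graph description also follows the prompted format, with markers such as [그림] and [그림 끝].

► Alternative text for [대기의 순환] (Circulation of the Atmosphere)

Visual image material is rendered as text that can be understood by listening alone.

The picture description was also generated in left-to-right order, as prompted.

[ Source code ]

For readers who want to try out what was covered in this post, the full code is attached below.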

import 'dart:convert'; // jsonEncode, jsonDecode, utf8
import 'dart:io'; // File
import 'dart:typed_data';
import 'dart:ui' as ui;

import 'package:file_picker/file_picker.dart';
import 'package:firebase_storage/firebase_storage.dart';
import 'package:flutter/material.dart';
import 'package:http/http.dart' as http;
import 'package:pdf_render/pdf_render.dart' as pdfr;

class LearningPreparation extends StatefulWidget {
  const LearningPreparation({super.key});

  @override
  _LearningPreparationState createState() => _LearningPreparationState();
}

class _LearningPreparationState extends State<LearningPreparation> {
  String? _selectedFileName;
  Uint8List? _fileBytes;

  Future<void> _pickFile() async {
    FilePickerResult? result = await FilePicker.platform.pickFiles();

    if (result == null) {
      print("User cancelled the picker request");
      ScaffoldMessenger.of(context).showSnackBar(
        SnackBar(content: Text("파일 선택을 취소하였습니다.")),
      );
      return;
    }

    Uint8List? fileBytes = result.files.first.bytes;
    String fileName = result.files.first.name;

    if (fileBytes == null) {
      String? filePath = result.files.first.path;
      if (filePath != null) {
        File file = File(filePath);
        fileBytes = await file.readAsBytes();
      } else {
        return;
      }
    }

    setState(() {
      _selectedFileName = fileName;
      _fileBytes = fileBytes;
    });

    print("File selected: $_selectedFileName");
  }

  Future<List<Uint8List>> convertPdfToImages(Uint8List pdfBytes) async {
    final document = await pdfr.PdfDocument.openData(pdfBytes);
    final pageCount = document.pageCount;
    List<Uint8List> images = [];

    for (int i = 0; i < pageCount; i++) {
      final page = await document.getPage(i + 1);
      final pageImage = await page.render(
        width: page.width.toInt(),
        height: page.height.toInt(),
        x: 0,
        y: 0,
      );

      final image = await pageImage.createImageIfNotAvailable();
      final imageData = await image.toByteData(format: ui.ImageByteFormat.png);
      if (imageData != null) {
        images.add(imageData.buffer.asUint8List());
      }
    }
    return images;
  }

  Future<List<String>> handlePdfUpload(Uint8List pdfBytes, int userKey) async {
    try {
      List<Uint8List> images = await convertPdfToImages(pdfBytes);
      print('PDF to image conversion completed. Number of images: ${images.length}');

      List<String> downloadUrls = [];
      for (int i = 0; i < images.length; i++) {
        final storageRef = FirebaseStorage.instance.ref().child(
            'uploads/$userKey/pdf_handle/page_$i.png'); // bytes are PNG-encoded
        final uploadTask = storageRef.putData(images[i]);
        final taskSnapshot = await uploadTask;
        final downloadUrl = await taskSnapshot.ref.getDownloadURL();
        downloadUrls.add(downloadUrl);
      }

      print('Image upload to Firebase completed. Number of URLs: ${downloadUrls.length}');
      return downloadUrls;
    } catch (e) {
      print('Error: $e');
      return [];
    }
  }

  Future<String> callChatGPT4APIForAlternativeText(
      List<String> imageUrls, String apiKey) async {
    final Uri apiUrl = Uri.parse('https://api.openai.com/v1/chat/completions');
    final String promptForAlternativeText = '''
    Please convert the content of the following lecture materials into text for visually impaired individuals. Include all text and images in a readable format. 
    ''';

    try {
      List<String> allResponses = [];

      for (int i = 0; i < imageUrls.length; i++) {
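        final data = {
          "model": "gpt-4o",
          "messages": [
            {"role": "system", "content": promptForAlternativeText},
            {
              "role": "user",
              // Send the URL as an image part so the model receives the
              // image itself rather than the URL as plain text.
              "content": [
                {"type": "image_url", "image_url": {"url": imageUrls[i]}}
              ]
            }
          ],
          "max_tokens": 1000,
        };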

        final response = await http.post(
          apiUrl,
          headers: {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer $apiKey',
          },
          body: jsonEncode(data),
        );

        if (response.statusCode == 200) {
          final responseBody = utf8.decode(response.bodyBytes);
          final decodedResponse = jsonDecode(responseBody);
          final gptResponse = decodedResponse['choices'][0]['message']['content'];
          allResponses.add(gptResponse);
        } else {
          print('Error calling ChatGPT-4 API: ${response.statusCode}');
        }
      }

      return allResponses.join('\n');
    } catch (e) {
      print('Error: $e');
      return 'Error: $e';
    }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Learning Preparation')),
      body: Center(
        child: ElevatedButton(
          onPressed: _pickFile,
          child: const Text('Pick a File'),
        ),
      ),
    );
  }
}

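6. Wrapping up

In this post, we looked at how to use the GPT-4o model to convert lecture materials (JPG and PDF files) into alternative text.

COMMA is a service developed to help students with disabilities attend classes and study smoothly at school.

Until now, such students had to rely on human assistants to make use of learning materials; COMMA aims to solve this with technology.

In particular, the automatic alternative-text generation implemented today is expected to ease the learning difficulties of visually impaired students.

Finally, by adjusting the prompt, the feature built here can be extended to produce alternative text in a variety of formats and conventions!

Thank you for reading this long post 😊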

ABOUT ME

oneaney oneaney