Florence-2: Advancing Multiple Vision Tasks with a Single VLM Model


Loading Florence-2 model and a sample image

After installing and importing the necessary libraries (as demonstrated in the accompanying Colab notebook), we begin by loading the Florence-2 model, processor and the input image of a camera:
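For reference, the code snippets in this post assume roughly the following setup; the exact installs and imports (and any additional dependencies) are in the notebook:

# Approximate setup (see the Colab notebook for exact versions and dependencies):
# pip install transformers torch pillow einops timm

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

img_path = 'camera.jpg'  # hypothetical path; point this at your own image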

#Load model:
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

#Load image:
image = Image.open(img_path)

Auxiliary Functions

In this tutorial, we will use several auxiliary functions. The most important is the core run_example function, which generates a response from the Florence-2 model.

The run_example function combines the task prompt with any additional text input (if provided) into a single prompt. Using the processor, it generates text and image embeddings that serve as inputs to the model. The magic happens during the model.generate step, where the model’s response is generated. Here’s a breakdown of some key parameters:

  • max_new_tokens=1024: Sets the maximum length of the output, allowing for detailed responses.
  • do_sample=False: Ensures a deterministic response.
  • num_beams=3: Uses beam search that keeps the 3 highest-scoring candidate sequences (beams) at each step, exploring multiple potential sequences to find the best overall output.
  • early_stopping=False: Ensures beam search continues until all beams reach the maximum length or an end-of-sequence token is generated.

Lastly, the model’s output is decoded and post-processed with processor.batch_decode and processor.post_process_generation to produce the final text response, which is returned by the run_example function.

def run_example(image, task_prompt, text_input=''):
    prompt = task_prompt + text_input

    inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda', torch.float16)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
        early_stopping=False,
    )

    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer

Additionally, we utilize auxiliary functions to visualize the results (draw_bbox, draw_ocr_bboxes and draw_polygons) and to handle the conversion between bounding box formats (convert_bbox_to_florence-2 and convert_florence-2_to_bbox). These can be explored in the attached Colab notebook.
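As background, Florence-2 represents coordinates as '<loc_x>' tokens quantized to a 0-999 range relative to the image width and height. A minimal sketch of the two conversion helpers, assuming this normalization scheme (a simplified stand-in for the notebook's implementation, with underscored names since hyphens are not valid in Python identifiers):

import re

def convert_bbox_to_florence2(bbox, image_width, image_height):
    # Convert a pixel-space (x1, y1, x2, y2) box into Florence-2 '<loc_..>' tokens.
    # Assumes coordinates are quantized to the 0-999 range relative to image size.
    x1, y1, x2, y2 = bbox
    locs = [
        int(x1 / image_width * 999),
        int(y1 / image_height * 999),
        int(x2 / image_width * 999),
        int(y2 / image_height * 999),
    ]
    return ''.join(f'<loc_{v}>' for v in locs)

def convert_florence2_to_bbox(loc_str, image_width, image_height):
    # Convert a '<loc_x1><loc_y1><loc_x2><loc_y2>' string back to pixel coordinates.
    x1, y1, x2, y2 = [int(v) for v in re.findall(r'<loc_(\d+)>', loc_str)]
    return (x1 / 999 * image_width, y1 / 999 * image_height,
            x2 / 999 * image_width, y2 / 999 * image_height)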

Florence-2 can perform a variety of visual tasks. Let’s explore some of its capabilities, starting with image captioning.

1. Caption Generation Related Tasks:

1.1 Generate Captions

Florence-2 can generate image captions at various levels of detail, using the '<CAPTION>', '<DETAILED_CAPTION>' or '<MORE_DETAILED_CAPTION>' task prompts.

print (run_example(image, task_prompt='<CAPTION>'))
# Output: 'A black camera sitting on top of a wooden table.'

print (run_example(image, task_prompt='<DETAILED_CAPTION>'))
# Output: 'The image shows a black Kodak V35 35mm film camera sitting on top of a wooden table with a blurred background.'

print (run_example(image, task_prompt='<MORE_DETAILED_CAPTION>'))
# Output: 'The image is a close-up of a Kodak VR35 digital camera. The camera is black in color and has the Kodak logo on the top left corner. The body of the camera is made of wood and has a textured grip for easy handling. The lens is in the center of the body and is surrounded by a gold-colored ring. On the top right corner, there is a small LCD screen and a flash. The background is blurred, but it appears to be a wooded area with trees and greenery.'

The model accurately describes the image and its surroundings. It even identifies the camera’s brand and model, demonstrating its OCR ability. However, the '<MORE_DETAILED_CAPTION>' output contains minor inconsistencies, which is expected from a zero-shot model.

1.2 Generate Caption for a Given Bounding Box

Florence-2 can generate captions for specific regions of an image defined by bounding boxes. For this, it takes the bounding box location as input. You can extract the category with '<REGION_TO_CATEGORY>' or a description with '<REGION_TO_DESCRIPTION>'.

For your convenience, I added a widget to the Colab notebook that enables you to draw a bounding box on the image, and code to convert it to Florence-2 format.

task_prompt = '<REGION_TO_CATEGORY>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera lens'
task_prompt = '<REGION_TO_DESCRIPTION>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera'

In this case, the '<REGION_TO_CATEGORY>' prompt identified the lens, while the '<REGION_TO_DESCRIPTION>' prompt returned a less specific description. However, this performance may vary with different images.

2. Object Detection Related Tasks:

2.1 Generate Bounding Boxes and Text for Objects

Florence-2 can identify densely packed regions in the image and provide their bounding box coordinates along with related labels or captions. To extract bounding boxes with labels, use the '<OD>' task prompt:

results = run_example(image, task_prompt='<OD>')
draw_bbox(image, results['<OD>'])

To extract bounding boxes with captions, use the '<DENSE_REGION_CAPTION>' task prompt:

results = run_example(image, task_prompt='<DENSE_REGION_CAPTION>')
draw_bbox(image, results['<DENSE_REGION_CAPTION>'])
The image on the left shows the results of the '<OD>' task prompt, while the image on the right demonstrates '<DENSE_REGION_CAPTION>'.
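For both tasks, the post-processed result is a dictionary holding 'bboxes' (in pixel coordinates) and 'labels'. A simplified draw_bbox sketch built on that assumption, using matplotlib as a stand-in for the notebook's version:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_bbox(image, data):
    # Simplified sketch: assumes data has the form
    # {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}.
    fig, ax = plt.subplots()
    ax.imshow(image)
    for bbox, label in zip(data['bboxes'], data['labels']):
        x1, y1, x2, y2 = bbox
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                 linewidth=2, edgecolor='red', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1, label, color='white', bbox=dict(facecolor='red', alpha=0.6))
    ax.axis('off')
    plt.show()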

2.2 Text Grounded Object Detection

Florence-2 can also perform text-grounded object detection. By providing specific object names or descriptions as input, Florence-2 detects bounding boxes around the specified objects.

task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input="lens. camera. table. logo. flash.")
draw_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])
CAPTION_TO_PHRASE_GROUNDING task with the text input: "lens. camera. table. logo. flash."

3. Segmentation Related Tasks:

Florence-2 can also generate segmentation polygons grounded by text ('<REFERRING_EXPRESSION_SEGMENTATION>') or by bounding boxes ('<REGION_TO_SEGMENTATION>'):

results = run_example(image, task_prompt='<REFERRING_EXPRESSION_SEGMENTATION>', text_input="camera")
draw_polygons(image, results['<REFERRING_EXPRESSION_SEGMENTATION>'])

results = run_example(image, task_prompt='<REGION_TO_SEGMENTATION>', text_input="<loc_345><loc_417><loc_648><loc_845>")
draw_polygons(image, results['<REGION_TO_SEGMENTATION>'])

The image on the left shows the results of the REFERRING_EXPRESSION_SEGMENTATION task with 'camera' as the text input. The image on the right demonstrates the REGION_TO_SEGMENTATION task with a bounding box around the lens provided as input.
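The segmentation output is a dictionary with 'polygons' and 'labels' entries, where each polygon is a flat list of x, y coordinates. A simplified draw_polygons sketch built on that assumption (not the exact notebook implementation):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon as MplPolygon

def draw_polygons(image, data):
    # Simplified sketch: assumes data has the form
    # {'polygons': [[[x1, y1, x2, y2, ...], ...], ...], 'labels': [...]}.
    fig, ax = plt.subplots()
    ax.imshow(image)
    for polygons, label in zip(data['polygons'], data['labels']):
        for polygon in polygons:
            points = np.array(polygon).reshape(-1, 2)  # flat list -> (N, 2) points
            ax.add_patch(MplPolygon(points, closed=True,
                                    facecolor='red', edgecolor='red', alpha=0.4))
            ax.text(points[0, 0], points[0, 1], label, color='white',
                    bbox=dict(facecolor='red', alpha=0.6))
    ax.axis('off')
    plt.show()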

4. OCR Related Tasks:

Florence-2 demonstrates strong OCR capabilities. It can extract text from an image with the '<OCR>' task prompt, and extract both the text and its location with '<OCR_WITH_REGION>':

results = run_example(image, task_prompt='<OCR>')

results = run_example(image, task_prompt='<OCR_WITH_REGION>')
draw_ocr_bboxes(image, results['<OCR_WITH_REGION>'])
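The '<OCR_WITH_REGION>' result follows the same pattern, with 'quad_boxes' (four corner points per detected text region) and 'labels'. A compact draw_ocr_bboxes sketch under that assumption, again a simplified stand-in for the notebook's version:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon as MplPolygon

def draw_ocr_bboxes(image, data):
    # Simplified sketch: assumes data has the form
    # {'quad_boxes': [[x1, y1, ..., x4, y4], ...], 'labels': [...]}.
    fig, ax = plt.subplots()
    ax.imshow(image)
    for quad, label in zip(data['quad_boxes'], data['labels']):
        points = np.array(quad).reshape(-1, 2)  # 4 corner points of the text region
        ax.add_patch(MplPolygon(points, closed=True, fill=False,
                                edgecolor='lime', linewidth=2))
        ax.text(points[0, 0], points[0, 1], label, color='black',
                bbox=dict(facecolor='lime', alpha=0.6))
    ax.axis('off')
    plt.show()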

Florence-2 is a versatile Vision-Language Model (VLM), capable of handling multiple vision tasks within a single model. Its zero-shot capabilities are impressive across diverse tasks such as image captioning, object detection, segmentation and OCR. While Florence-2 performs well out-of-the-box, additional fine-tuning can further adapt the model to new tasks or improve its performance on unique, custom datasets.
