Loading Florence-2 model and a sample image
After installing the necessary libraries (as demonstrated in the accompanying Colab notebook), we begin by importing them and loading the Florence-2 model, the processor, and the input image of a camera.
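As a reference point, here is a minimal sketch of the imports the code below assumes; the exact installation commands, and any additional dependencies the Florence-2 repository requires, are handled in the Colab notebook.
# Assumed imports for the snippets in this tutorial (sketch):
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor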
# Load model:
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image (img_path points to the sample camera image):
image = Image.open(img_path)
Auxiliary Functions
In this tutorial, we will use several auxiliary functions. The most important is the run_example core function, which generates a response from the Florence-2 model.
The run_example function combines the task prompt with any additional text input (if provided) into a single prompt. Using the processor, it tokenizes the prompt and preprocesses the image into the tensors that serve as inputs to the model. The magic happens during the model.generate step, where the model’s response is generated. Here’s a breakdown of some key parameters:
- max_new_tokens=1024: Sets the maximum length of the output, allowing for detailed responses.
- do_sample=False: Ensures a deterministic response.
- num_beams=3: Implements beam search with the top 3 most likely tokens at each step, exploring multiple potential sequences to find the best overall output.
- early_stopping=False: Ensures beam search continues until all beams reach the maximum length or an end-of-sequence token is generated.
Lastly, the model’s output is decoded and post-processed with processor.batch_decode and processor.post_process_generation to produce the final text response, which is returned by the run_example function.
def run_example(image, task_prompt, text_input=''):
    prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda', torch.float16)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
        early_stopping=False,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer
Additionally, we utilize auxiliary functions to visualize the results (draw_bbox, draw_ocr_bboxes and draw_polygons) and to handle the conversion between bounding box formats (convert_bbox_to_florence-2 and convert_florence-2_to_bbox). These can be explored in the attached Colab notebook.
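The visualization helpers are implemented in the notebook; as an illustration only, here is a minimal sketch of what draw_bbox could look like, assuming matplotlib is available and that the post-processed detection result exposes 'bboxes' (pixel-coordinate [x1, y1, x2, y2] boxes) and 'labels' keys, as '<OD>'-style outputs typically do.
# Sketch of a draw_bbox helper (assumption, not the notebook's exact code):
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_bbox(image, prediction):
    # prediction is assumed to hold 'bboxes' ([x1, y1, x2, y2] in pixels) and 'labels'
    fig, ax = plt.subplots()
    ax.imshow(image)
    for bbox, label in zip(prediction['bboxes'], prediction['labels']):
        x1, y1, x2, y2 = bbox
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1, label, color='white', backgroundcolor='red', fontsize=8)
    ax.axis('off')
    plt.show()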
Florence-2 can perform a variety of visual tasks. Let’s explore some of its capabilities, starting with image captioning.
1. Caption Generation Related Tasks:
1.1 Generate Captions
Florence-2 can generate image captions at various levels of detail, using the '<CAPTION>', '<DETAILED_CAPTION>' or '<MORE_DETAILED_CAPTION>' task prompts.
print(run_example(image, task_prompt='<CAPTION>'))
# Output: 'A black camera sitting on top of a wooden table.'

print(run_example(image, task_prompt='<DETAILED_CAPTION>'))
# Output: 'The image shows a black Kodak V35 35mm film camera sitting on top of a wooden table with a blurred background.'
print(run_example(image, task_prompt='<MORE_DETAILED_CAPTION>'))
# Output: 'The image is a close-up of a Kodak VR35 digital camera. The camera is black in color and has the Kodak logo on the top left corner. The body of the camera is made of wood and has a textured grip for easy handling. The lens is in the center of the body and is surrounded by a gold-colored ring. On the top right corner, there is a small LCD screen and a flash. The background is blurred, but it appears to be a wooded area with trees and greenery.'
The model accurately describes the image and its surroundings. It even identifies the camera’s brand and model, demonstrating its OCR ability. However, the '<MORE_DETAILED_CAPTION>' output contains minor inconsistencies, such as calling the film camera "digital" and describing its body as made of wood, which is expected from a zero-shot model.
1.2 Generate Caption for a Given Bounding Box
Florence-2 can generate captions for specific regions of an image defined by bounding boxes. For this, it takes the bounding box location as input. You can extract the category with '<REGION_TO_CATEGORY>' or a description with '<REGION_TO_DESCRIPTION>'.
For your convenience, I added a widget to the Colab notebook that enables you to draw a bounding box on the image, along with code to convert it to Florence-2 format (a sketch of such a conversion is shown below).
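As a rough sketch of that conversion (an assumption about what the notebook's convert_bbox_to_florence-2 helper does, not its exact code): Florence-2 encodes box coordinates as location tokens normalized to a 0–999 grid, so a pixel-space box can be mapped approximately as follows. The function name and the example box below are hypothetical.
# Sketch: pixel bbox -> Florence-2 location-token string (assumption):
def bbox_to_florence_str(bbox, image_width, image_height):
    # bbox is (x1, y1, x2, y2) in pixel coordinates; Florence-2 expects
    # coordinates quantized to a 0-999 grid as <loc_..> tokens.
    x1, y1, x2, y2 = bbox
    coords = [
        int(x1 / image_width * 999),
        int(y1 / image_height * 999),
        int(x2 / image_width * 999),
        int(y2 / image_height * 999),
    ]
    return ''.join(f'<loc_{c}>' for c in coords)

# Hypothetical usage:
# box_str = bbox_to_florence_str((250, 310, 490, 620), image.width, image.height)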
task_prompt = '<REGION_TO_CATEGORY>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera lens'
task_prompt = '<REGION_TO_DESCRIPTION>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera'
In this case, the '<REGION_TO_CATEGORY>' prompt identified the lens, while '<REGION_TO_DESCRIPTION>' was less specific. However, this performance may vary with different images.
2. Object Detection Related Tasks:
2.1 Generate Bounding Boxes and Text for Objects
Florence-2 can identify densely packed regions in the image and provide their bounding box coordinates along with related labels or captions. To extract bounding boxes with labels, use the '<OD>' task prompt:
results = run_example(image, task_prompt='<OD>')
draw_bbox(image, results['<OD>'])
To extract bounding boxes with captions, use the '<DENSE_REGION_CAPTION>' task prompt:
results = run_example(image, task_prompt='<DENSE_REGION_CAPTION>')
draw_bbox(image, results['<DENSE_REGION_CAPTION>'])
2.2 Text Grounded Object Detection
Florence-2 can also perform text-grounded object detection. By providing specific object names or descriptions as input, Florence-2 detects bounding boxes around the specified objects.
task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input="lens. camera. table. logo. flash.")
draw_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])
3. Segmentation Related Tasks:
Florence-2 can also generate segmentation polygons grounded by text ('<REFERRING_EXPRESSION_SEGMENTATION>') or by bounding boxes ('<REGION_TO_SEGMENTATION>'):
results = run_example(image, task_prompt='<REFERRING_EXPRESSION_SEGMENTATION>', text_input="camera")
draw_polygons(image, results['<REFERRING_EXPRESSION_SEGMENTATION>'])
results = run_example(image, task_prompt='<REGION_TO_SEGMENTATION>', text_input="<loc_345><loc_417><loc_648><loc_845>")
draw_polygons(image, results['<REGION_TO_SEGMENTATION>'])
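Similarly, here is a minimal sketch of a draw_polygons helper, assuming the post-processed segmentation result exposes 'polygons' (per instance, one or more flat [x1, y1, x2, y2, ...] coordinate lists) and 'labels' keys; the notebook's implementation may differ, for example by rendering filled masks.
# Sketch of a draw_polygons helper (assumption, not the notebook's exact code):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

def draw_polygons(image, prediction):
    # prediction is assumed to hold 'polygons' and 'labels'
    fig, ax = plt.subplots()
    ax.imshow(image)
    for polygons, label in zip(prediction['polygons'], prediction['labels']):
        for polygon in polygons:
            pts = np.array(polygon).reshape(-1, 2)  # flat list -> (N, 2) points
            ax.add_patch(Polygon(pts, closed=True, alpha=0.4, edgecolor='r'))
            ax.text(pts[0, 0], pts[0, 1], label, color='white', backgroundcolor='red', fontsize=8)
    ax.axis('off')
    plt.show()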
4. OCR Related Tasks:
Florence-2 demonstrates strong OCR capabilities. It can extract text from an image with the '<OCR>' task prompt, and extract both text and its location with '<OCR_WITH_REGION>':
task_prompt = '<OCR_WITH_REGION>'
results = run_example(image, task_prompt)
draw_ocr_bboxes(image, results['<OCR_WITH_REGION>'])
Florence-2 is a versatile Vision-Language Model (VLM), capable of handling multiple vision tasks within a single model. Its zero-shot capabilities are impressive across diverse tasks such as image captioning, object detection, segmentation and OCR. While Florence-2 performs well out-of-the-box, additional fine-tuning can further adapt the model to new tasks or improve its performance on unique, custom datasets.