
Microsoft introduces Florence-2, a unified model for tackling a wide range of vision tasks

Today, Microsoft's Azure AI team presented a brand-new vision foundation model called Florence-2 on Hugging Face.

The model is available under a permissive MIT license and can handle a wide range of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes – 232M and 771M parameters – and already excels at tasks such as captioning, object detection, visual grounding, and segmentation, performing on par with or better than many large vision models on the market.

While the model's performance remains to be tested in the field, the work is expected to give companies a single, unified approach to handling different types of machine vision applications, saving them from investing in separate, task-specific vision models that cannot move beyond their core function without extensive fine-tuning.

What makes Florence-2 unique?

Today, large language models (LLMs) are at the heart of business operations. A single model can produce summaries, write marketing copy, and in many cases even handle customer support. This level of adaptability across domains and tasks is astonishing. But the success has also led researchers to ask: Can vision models, which have been largely task-specific until now, do the same?

At their core, vision tasks are more complex than text-based natural language processing (NLP). They require comprehensive perceptual capabilities. To achieve a universal representation of diverse vision tasks, a model essentially must be able to understand spatial data at different scales, from image-level concepts such as object position, to fine-grained pixel details, to semantic information such as high-level captions and detailed descriptions.

When Microsoft attempted to solve this problem, it encountered two major obstacles: the scarcity of comprehensively annotated visual datasets and the absence of a unified pre-training framework with a single network architecture that integrates the ability to understand both spatial hierarchy and semantic granularity.

To address this, the company first used specialized models to generate a visual dataset called FLD-5B, containing a total of 5.4 billion annotations on 126 million images and covering details from high-level descriptions down to specific regions and objects. It then used this data to train Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) that integrates an image encoder with a multimodal encoder-decoder. This enables the model to handle various vision tasks without requiring task-specific architecture changes.

“All annotations in the FLD-5B dataset are uniformly standardized into textual outputs, enabling a unified multi-task learning approach with consistent optimization using the same loss function as the objective,” the researchers wrote in the paper detailing the model. “The result is a versatile vision foundation model that can perform a variety of tasks… all within a single model driven by a unified set of parameters. Task activation is achieved via text prompts, consistent with the approach of large language models.”
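To make that prompt-driven design concrete, here is a minimal sketch of how Florence-2 is typically invoked through the Hugging Face transformers library. It follows the pattern published on the model card; the exact repository name, task tokens, and `post_process_generation` helper should be treated as assumptions that may change, and the image URL is a placeholder.

```python
# Minimal sketch of Florence-2's prompt-based task activation,
# assuming the API shown on the Hugging Face model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # or "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this URL is a hypothetical placeholder.
url = "https://example.com/street_scene.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task token selects the behavior: "<OD>" runs object detection,
# while "<CAPTION>" would produce a caption from the same weights.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The processor parses the model's text output back into structured
# results (bounding boxes and labels for <OD>, plain text for <CAPTION>).
result = processor.post_process_generation(
    raw_text, task=prompt, image_size=(image.width, image.height)
)
print(result)
```

Note that everything, including detection boxes, comes back as generated text that the processor parses afterward; that is what lets one set of weights serve every task.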

Performance better than larger models

When prompted with images and text, Florence-2 handles a variety of tasks, including object detection, captioning, visual grounding, and visual question answering. And, importantly, it does so at a quality equal to or better than that of many larger models.
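In practice, each of those tasks maps to its own prompt token. The tokens below are among those documented on the model card; the `run_task` helper is hypothetical shorthand for the generate-and-decode steps in the earlier sketch.

```python
# Switching tasks is just switching prompt tokens (run_task is a
# hypothetical wrapper around the generate/decode steps shown above).
caption = run_task(image, "<CAPTION>")               # image captioning
boxes   = run_task(image, "<OD>")                    # object detection
regions = run_task(image, "<DENSE_REGION_CAPTION>")  # region labeling
# Visual grounding appends the phrase to locate after the task token.
ground  = run_task(image, "<CAPTION_TO_PHRASE_GROUNDING>" + "a green car")
```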

For example, in a zero-shot captioning test on the COCO dataset, both the 232M and 771M versions of Florence-2 outperformed DeepMind's 80B-parameter visual language model Flamingo, with scores of 133 and 135.6, respectively. They even did better than Microsoft's own Kosmos-2 model, which specializes in visual grounding.

After fine-tuning with public, human-annotated data, Florence-2 was able to compete closely with several larger, specialized models on tasks such as visual question answering, despite its compact size.

“The pre-trained Florence-2 backbone improves performance on downstream tasks, such as COCO object detection and instance segmentation and ADE20K semantic segmentation, outperforming both supervised and self-supervised models,” the researchers noted. “Compared with models pre-trained on ImageNet, ours improves training efficiency by 4x and achieves significant improvements of 6.9, 5.5, and 5.9 points on the COCO and ADE20K datasets, respectively.”

Both pre-trained and fine-tuned versions of Florence-2, at 232M and 771M parameters, are now available on Hugging Face under a permissive MIT license that permits unrestricted distribution and modification for commercial or personal use.

It will be interesting to see how developers use it to eliminate the need for separate vision models for different tasks. Small, task-agnostic models can not only save developers the effort of working with multiple models but also significantly reduce compute costs.
