March 20, 2023

AI Industry Insights

ConvNeXt: A Transformer-Inspired CNN Architecture

The transformer architecture is one of the most important design innovations of the last ten years of deep learning. The Facebook AI Research group ran an extensive series of ablation studies to determine which transformer design choices contribute most to its performance, then applied many of them to an existing convolutional neural network (specifically ResNet). In this blog post, we will look at those changes and their effect on traditional convolutional networks.

It’s important to note that in addition to changes that specifically emulated the transformer architecture, the team that designed ConvNeXt also used a novel combination of data augmentation techniques, regularization schemes, and stage compute ratios to boost performance beyond that of vision transformers.

“Patchify” Stem

In computer vision, the stem of a network typically refers to the first layer that input images will be passed through, and almost always involves downsampling the image and increasing feature map count (i.e. number of filter channels). In traditional ResNet architectures, the stem consists of a 7x7 convolution with a stride of 2, followed by max pooling.

It’s important to note that when the kernel size of a convolution is greater than its stride, the windows covered by neighboring kernel positions overlap. As the kernel slides over the input, each position shares some input pixels with its neighbors, so relationships that span adjacent windows are captured directly.

In vision transformers and ConvNeXt, a more straightforward “patchify” stem is used instead. In ConvNeXt, this is a 4x4 convolution with a stride of 4, followed by layer normalization. Because the kernel size and stride are equal, the convolution windows do not overlap: as the kernel moves over the input, no pixels are shared between positions, so each output location sees its own distinct patch. It’s worth noting, however, that information from these patches is still combined in the later layers of the network.
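To make the difference concrete, here is a minimal Python sketch that lists which input positions each kernel placement covers along one dimension. The helper names (conv_windows, windows_overlap) and the 16-pixel input are illustrative assumptions, and padding is ignored for simplicity; this is not code from the ConvNeXt paper.

```python
def conv_windows(input_size, kernel_size, stride):
    """Return the (start, end) input range covered at each kernel position.

    One spatial dimension only, no padding (a simplifying assumption).
    The end index is exclusive.
    """
    n_out = (input_size - kernel_size) // stride + 1
    return [(i * stride, i * stride + kernel_size) for i in range(n_out)]


def windows_overlap(windows):
    """True if any window shares input positions with the next one."""
    return any(
        prev_end > next_start
        for (_, prev_end), (next_start, _) in zip(windows, windows[1:])
    )


# ResNet-style stem: kernel 7, stride 2 (kernel larger than stride).
resnet_style = conv_windows(input_size=16, kernel_size=7, stride=2)
# Patchify stem: kernel 4, stride 4 (kernel equal to stride).
patchify = conv_windows(input_size=16, kernel_size=4, stride=4)

print(resnet_style, windows_overlap(resnet_style))
print(patchify, windows_overlap(patchify))
```

With the 7/2 setting, consecutive windows share five positions; with the 4/4 setting, each window starts exactly where the previous one ends.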

Interestingly, converting to a “patchify” stem resulted in only a small change: a 0.1% increase in model accuracy (79.4% to 79.5% in the paper).

Inverted Bottlenecks

In a vanilla residual block (ResNet block), the inputs pass through multiple convolutional layers, where the number of feature maps is reduced upon entry and restored upon exit. In parallel, a skip connection allows the original input to be added directly to the output of the final convolutional layer. The skip connection was a breakthrough: it allowed neural networks to become much deeper while maintaining training stability, making it possible to learn increasingly complex and abstract representations.

A design feature found in transformer blocks is the inverted bottleneck: the hidden layer of the MLP block is four times wider than its input. Expressed with convolutions, an Inverted Bottleneck block is a specialized type of residual block that begins with a depthwise convolution, followed by a 1x1 (pointwise) convolution that expands the feature map count by a factor of 4. A second 1x1 convolution then reduces the feature map count back to what it was upon entering the block, allowing a skip connection to be made at the end.

ConvNeXt uses essentially this block design, adding layer normalization after the depthwise convolution and a GeLU activation between the two 1x1 convolutions. Because a depthwise convolution applies one filter per channel instead of mixing all channels, the block stays computationally inexpensive despite its wide hidden layer.
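As a rough illustration of why this layout is inexpensive, the sketch below counts the convolution weights and biases in one such block and compares them with a single dense convolution of the same kernel size. The function names are hypothetical, and the 7x7 depthwise kernel and 96 channels are assumed values chosen for the example rather than figures from this post; normalization parameters are left out.

```python
def inverted_bottleneck_params(channels, kernel_size=7, expansion=4):
    """Count conv weights + biases in a depthwise -> expand -> project block.

    kernel_size and expansion are illustrative assumptions.
    """
    hidden = channels * expansion
    # Depthwise: one kernel_size x kernel_size filter per channel, plus bias.
    depthwise = channels * kernel_size * kernel_size + channels
    # First 1x1 convolution widens channels -> hidden.
    expand = channels * hidden + hidden
    # Second 1x1 convolution narrows hidden -> channels for the skip connection.
    project = hidden * channels + channels
    return {
        "depthwise": depthwise,
        "expand": expand,
        "project": project,
        "total": depthwise + expand + project,
    }


def dense_conv_params(channels, kernel_size=7):
    """Count weights + biases of one ordinary conv that mixes all channels."""
    return channels * channels * kernel_size * kernel_size + channels


params = inverted_bottleneck_params(channels=96)
dense = dense_conv_params(channels=96)
print(params)
print("single dense conv:", dense)
```

Under these assumptions the whole three-layer block needs fewer parameters than one dense convolution with the same kernel size, and almost all of them sit in the two 1x1 layers.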

Using an Inverted Bottleneck block inspired by transformers resulted in a 0.1% - 0.7% increase in model performance, depending on the size of the model.

Fewer & Different Activation Functions

Activation functions are mathematical functions that introduce nonlinearity and are commonly applied to the output of a node or layer in a neural network. They are an essential component of any deep learning model, as they allow complex relationships to be found within the input data. The most widely used activation function is the Rectified Linear Unit, commonly known as ReLU, which is employed in both ResNet and the original transformer architecture published in 2017.

ReLU is used heavily in conventional ResNets, where it is typically inserted after every convolution. Conversely, transformers use relatively few activation functions, since the self-attention mechanism itself serves as a form of nonlinearity. ConvNeXt mirrors transformers in this respect by using only a single activation function per block, placed between the two 1x1 convolutions.

Some of the more modern, sophisticated transformer models, such as GPT-2 and BERT, use a different activation function known as the Gaussian Error Linear Unit, commonly referred to as GeLU. GeLU is a smoother variant of ReLU that has been shown to perform slightly better in certain architectures (such as transformers). In ConvNeXt, every instance of ReLU was replaced by GeLU.
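For readers who want to see how the two activations differ, here is a small standard-library sketch. It uses the exact, erf-based definition of GeLU; the function names are illustrative, and deep learning frameworks may use an approximation instead.

```python
import math


def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x):
    """Gaussian Error Linear Unit: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

The two agree at zero and grow closer for large positive inputs, but GeLU is smooth and lets small negative values through instead of clipping them to exactly zero.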

Replacing ReLU with GeLU and decreasing the number of activation functions resulted in a 0.7% increase in model performance.

Fewer & Different Normalization Layers

Transformers tend to have fewer normalization layers than ResNet, which applies Batch Normalization after every convolution. ConvNeXt imitates transformers by removing most normalization layers, leaving them in place only after the depthwise convolution in an inverted bottleneck block, in the classification head, and before each downsampling layer.

Additionally, transformers use Layer Normalization rather than Batch Normalization. Layer Normalization is a simpler process that normalizes across the features of an individual input instead of across the entire batch. ConvNeXt likewise replaces every instance of Batch Normalization with Layer Normalization.
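The difference in normalization axis can be sketched in a few lines of plain Python. This is a simplified illustration only: it omits the learnable scale and shift parameters and the running statistics that real Batch Normalization layers keep, and the function names and sample values are made up for the example.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]


def batch_norm(batch):
    """Normalize each feature (column) across all samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


sample = [1.0, 2.0, 3.0]
batch_a = [sample, [10.0, 20.0, 60.0]]
batch_b = [sample, [-5.0, 0.0, 5.0]]

# Layer Normalization: the first sample's output ignores the rest of the batch.
print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])
# Batch Normalization: the first sample's output changes with its batch-mates.
print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```

Because Layer Normalization depends only on the sample itself, its result does not change with batch size or batch composition.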

Decreasing the number of normalization layers and replacing Batch Normalization with Layer Normalization each resulted in a 0.1% increase in model performance, for a combined gain of 0.2%.

Separate Downsampling Layers

ResNet performs spatial downsampling inside the first residual block of each stage, using a 3x3 convolution with a stride of 2. Notably, popular vision transformer architectures (such as the Swin Transformer) place a separate downsampling layer between stages rather than downsampling within the blocks themselves. ConvNeXt replicates this, inserting between each stage of blocks a downsampling layer that applies Layer Normalization followed by a 2x2 convolution with a stride of 2.
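A short sketch shows how the feature map shrinks when a 4x4, stride-4 stem is followed by 2x2, stride-2 downsampling layers between stages. The 224-pixel input and the four-stage layout are assumptions made for this example, and the function names are hypothetical.

```python
def conv_output_size(size, kernel_size, stride):
    """Spatial output size of an unpadded convolution along one dimension."""
    return (size - kernel_size) // stride + 1


def stage_resolutions(input_size=224, num_stages=4):
    """Feature map side length entering each stage.

    Assumes a 4x4/stride-4 stem, then a 2x2/stride-2 downsampling layer
    before every later stage. input_size and num_stages are illustrative.
    """
    size = conv_output_size(input_size, kernel_size=4, stride=4)
    sizes = [size]
    for _ in range(num_stages - 1):
        size = conv_output_size(size, kernel_size=2, stride=2)
        sizes.append(size)
    return sizes


print(stage_resolutions())
```

Each downsampling layer halves the side length, so the blocks inside a stage never change the resolution themselves.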

Introducing separate downsampling layers resulted in a 0.5% increase in model performance.

Conclusion

In the paper A ConvNet for the 2020s, researchers applied transformer-inspired changes one by one to a ResNet-50, resulting in a new model architecture dubbed ConvNeXt. This new model achieved a top-1 accuracy of 82.0% on the ImageNet classification task, outperforming the comparably sized Swin Transformer. ConvNeXt can outperform vision transformers while retaining the straightforward, fully-convolutional architecture of ResNets for both training and testing, enabling simple deployment.

ConvNeXt is a high-performance computer vision architecture that can be applied to a wide array of applications. While there are many distinctive features not discussed in this post that make ConvNeXt so popular, these transformer-inspired modifications allowed ConvNeXt to outperform ResNet and comparable vision transformers at a similar parameter count and computational cost.
