JS monorepos in prod 5: merging Git repositories and preserving commit history

At Adaltas, we maintain several open-source Node.js projects organized as Git monorepos and published on NPM. We shared our experience of working with Lerna monorepos in a series of earlier articles.

It is now the turn of our popular open-source Node CSV project to be migrated to a monorepo. This article walks you through the available approaches, techniques, and tools used to migrate multiple Node.js projects hosted on GitHub into a Lerna monorepo. At the end, we provide the bash script we used to migrate the Node CSV project. This script can be applied to a different project with only minor modifications.

Requirements for migration

The Node CSV project combines 4 NPM packages for working with CSV files in Node.js, wrapped by the umbrella csv package. Each NPM package has a rich commit history, and we wanted to keep as much information as possible from the old repositories. These are our requirements for the migration:

  • preserve commit history with maximum information (such as tags, their messages, and merge commits)
  • improve commit messages to follow the Conventional Commits specification
  • preserve GitHub issues

Monorepo structure

We have 5 NPM packages to migrate to the Lerna monorepo: csv, csv-generate, csv-parse, csv-stringify, and stream-transform.

We want to achieve a directory structure that looks like this:

packages/
  csv/
  csv-generate/
  csv-parse/
  csv-stringify/
  stream-transform/
lerna.json
package.json

Choosing Git log strategy

When migrating repositories into a monorepo, you merge their commit logs. The 3 strategies we considered are illustrated in the image below.


Git log strategies

  • Single branch
    It provides a straightforward log containing only the commits on the default (master) branches of all packages. The logs are joined sequentially by making the latest commit of the previous package the parent of the first commit of the next package. This strategy breaks the chronological ordering of the log.
  • Multiple branches with a common parent
    This makes the log easier to read by keeping the branches of the different repositories apart. A new parent commit is added to the first commit of every branch. In the end, all the branches are merged into the default branch.
  • Multiple branches with different parents
    This strategy doesn’t rewrite the first commits of old repositories. It requires minimal intervention into commit history and seems logically more correct because initially, the repositories didn’t have a common parent.
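
To make the shape of such a log concrete, here is a minimal sketch of the 2nd strategy built with plain Git in a throwaway directory. The package names, file contents, and commit messages are placeholders chosen for illustration; they are not part of the Node CSV migration.

```shell
#!/bin/bash
set -e

# Throwaway identity so commits work on any machine (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master

# The common parent shared by every package branch.
git commit -q --allow-empty -m "chore: initial commit"
root=$(git rev-parse HEAD)

for package in package-1 package-2; do
  # Each package gets its own branch starting from the common parent.
  git checkout -q -b "$package" "$root"
  mkdir -p "packages/$package"
  echo "$package" > "packages/$package/README.md"
  git add .
  git commit -q -m "chore($package): add readme"
  # Merge back with --no-ff so the branch stays visible in the graph.
  git checkout -q master
  git merge -q --no-ff "$package" -m "chore($package): merge branch '$package'"
done

git log --graph --oneline

# Root commits are the commits without any parent.
root_count=$(git rev-list --max-parents=0 HEAD | wc -l | tr -d ' ')
merge_count=$(git rev-list --merges HEAD | wc -l | tr -d ' ')
echo "roots: $root_count, merges: $merge_count"
```

Counting root commits with git rev-list --max-parents=0 is a quick way to tell the strategies apart: the first two produce a single root, while the 3rd keeps one root per original repository.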

Merging commit logs

Lerna has a built-in mechanism for gathering existing standalone NPM packages into a monorepo while preserving their commit history. The lerna import command imports a package from an external repository into packages/. The sequence of commands is simple: initialize the Git and Lerna repositories, make the first commit, and then start importing packages from locally cloned Git repositories. Basic usage instructions are available in the Lerna documentation.

Using lerna import, you can only follow the 1st or the 2nd Git log strategy described above. For the 2nd one, you need to create a separate branch for each imported repository like this:


git checkout -b package-1
lerna import /path/to/package-1

git checkout master

git checkout -b package-2
lerna import /path/to/package-2

lerna import is an easy way to migrate repositories to a Lerna monorepo. However, it flattens the commit history and collapses merge commits, and it doesn’t migrate tags and their messages. These limitations conflicted with our requirement to keep as much information as possible from the existing repositories, so we had to use a different tool.

The native git merge command can merge unrelated histories with the --allow-unrelated-histories option. It preserves the full commit history of the merged branch along with its tags. This approach yields the 3rd Git log strategy.

Merging the commit history of an external repository into the current one using --allow-unrelated-histories is as simple as running 2 commands:


git remote add -f <external-repo-name> <external-repo-path>

git merge --allow-unrelated-histories <external-repo-name>/<branch-name>
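
As a rough illustration of these 2 commands, the sketch below builds two unrelated throwaway repositories and merges one into the other. The repository names, file names, and the v1.0.0 tag are assumptions made up for the demo.

```shell
#!/bin/bash
set -e

# Throwaway identity so commits work on any machine (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Two independent repositories standing in for the monorepo and an old package.
for name in monorepo external; do
  git init -q "$work/$name"
  git -C "$work/$name" symbolic-ref HEAD refs/heads/master
  echo "$name" > "$work/$name/$name.txt"
  git -C "$work/$name" add .
  git -C "$work/$name" commit -q -m "chore: first commit of $name"
done
git -C "$work/external" tag -a v1.0.0 -m "first release"

cd "$work/monorepo"
# -f fetches the new remote right away; tags on fetched commits come along.
git remote add -f external "$work/external"
git merge -q --allow-unrelated-histories external/master \
  -m "chore: merge branch 'master' of external"

# Both original first commits remain root commits of the merged history.
root_count=$(git rev-list --max-parents=0 HEAD | wc -l | tr -d ' ')
tag_list=$(git tag --list)
echo "roots: $root_count, tags: $tag_list"
```

The merged history keeps one root commit per original repository, which is what distinguishes the 3rd strategy from the first two.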

Rewriting commit messages

To bring more order and transparency to the combined commit log, we prefix all commit messages with their package names. Additionally, we make them compatible with the Conventional Commits specification, which we follow in our latest projects. This specification standardizes commit messages, making them more readable and easier to automate.

To implement this, we need to rewrite all commit messages by prefixing them with a string like chore(<package-name>): .

We chose the chore type only to stay compatible with the specification; we didn’t want to write complex regular expressions to support it fully.

There are 2 tools to rewrite commit messages: git filter-branch and git filter-repo.

Following the Git recommendation, we chose git filter-repo. After installing the tool by following its installation instructions, the command to rewrite the commit messages of the current repository is:

git filter-repo --message-callback 'return b"chore(<package-name>): " + message'

For more examples of rewriting repository history with git filter-repo, refer to its documentation.
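
After a rewrite, it can be worth checking that every commit subject actually received the prefix. The sketch below is one possible check using plain git log and awk. The count_unprefixed helper is hypothetical, and the demo repository is created by hand instead of being rewritten with git filter-repo.

```shell
#!/bin/bash
set -e

# Throwaway identity so commits work on any machine (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: count commits whose subject does not start with a prefix.
count_unprefixed() {
  local repo=$1 prefix=$2
  # index() returns 1 only when the subject begins with the prefix.
  git -C "$repo" log --format=%s |
    awk -v p="$prefix" 'index($0, p) != 1 { n++ } END { print n + 0 }'
}

# Demo repository with one prefixed and one unprefixed commit.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m "chore(csv-parse): first commit"
git -C "$repo" commit -q --allow-empty -m "Fix a typo"

unprefixed=$(count_unprefixed "$repo" "chore(csv-parse): ")
echo "commits without prefix: $unprefixed"
```

On a repository that was fully rewritten, the helper should report 0.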

Transferring GitHub issues

After migrating the repositories and publishing the new monorepo to GitHub, we want to transfer the existing GitHub issues from the old repositories. Issues can be transferred from one repository to another using the GitHub interface. GitHub’s guide on transferring issues describes the steps.

Unfortunately, at the time of this writing, the GitHub interface offers no way to transfer issues in bulk. Issues must be transferred one by one. But this can give you an excuse to “forget” to transfer annoying pending issues created by the project community;)

What about GitHub pull requests? They cannot be transferred, and we have to live with that loss. The good news is that links to issues written in comments and in linked pull requests keep working thanks to redirects.
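
If clicking through every issue is too tedious, the one-by-one transfer can be scripted. The sketch below assumes the GitHub CLI (gh) is installed and authenticated, which is not something the migration itself relies on. The transfer_open_issues helper and the limit of 1000 issues are hypothetical choices.

```shell
#!/bin/bash
set -e

# Hypothetical helper: transfer every open issue of one repository to another.
# Assumes the GitHub CLI (`gh`) is installed and authenticated.
transfer_open_issues() {
  local source_repo=$1 target_repo=$2 number
  # List the open issue numbers, one per line.
  gh issue list --repo "$source_repo" --state open --limit 1000 \
    --json number --jq '.[].number' |
  while read -r number; do
    # Issues are still transferred one at a time, only without the clicks.
    gh issue transfer "$number" "$target_repo" --repo "$source_repo" < /dev/null
  done
}

# Usage (not executed here):
#   transfer_open_issues adaltas/node-csv-parse adaltas/node-csv
```

Treat this as a starting point and try it on a test repository first, since a transfer is not easily undone.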

Migration script

The migration bash script leverages the chosen approaches and tools described above. It generates the ./node-csv directory containing the Node CSV project files reorganized as a Lerna monorepo.

#!/bin/bash
set -e

# 1. Configure
REPOS=(
  https://github.com/adaltas/node-csv
  https://github.com/adaltas/node-csv-generate
  https://github.com/adaltas/node-csv-parse
  https://github.com/adaltas/node-csv-stringify
  https://github.com/adaltas/node-stream-transform
)
OUTPUT_DIR=node-csv
PACKAGES_DIR=packages
# Fall back to /tmp when TMPDIR is not set (common on Linux)
TMPDIR=${TMPDIR:-/tmp}

# 2. Initialize a new repository
rm -rf "$OUTPUT_DIR" && mkdir "$OUTPUT_DIR" && cd "$OUTPUT_DIR"
git init .
git remote add origin "${REPOS[0]}"

# 3. Migrate repositories
for repo in "${REPOS[@]}"; do
  # 3.1. Get package name: last URL segment without the node- prefix
  package=${repo##*/}
  package=${package#node-}
  # 3.2. Rewrite commit messages via a temporary repository
  rm -rf "$TMPDIR/$package" && mkdir "$TMPDIR/$package" && git clone "$repo" "$TMPDIR/$package"
  git filter-repo \
    --source "$TMPDIR/$package" \
    --target "$TMPDIR/$package" \
    --message-callback "return b'chore($package): ' + message"
  # 3.3. Merge the repository into monorepo
  git remote add -f "$package" "$TMPDIR/$package"
  git merge --allow-unrelated-histories "$package/master" -m "chore($package): merge branch 'master' of $repo"
  # 3.4. Move repository files to the packages folder
  mkdir -p "$PACKAGES_DIR/$package"
  find . -mindepth 1 -maxdepth 1 ! -name .git ! -name "$PACKAGES_DIR" \
    -exec mv {} "$PACKAGES_DIR/$package" \;
  git add .
  git commit -m "chore($package): move all package files to $PACKAGES_DIR/$package"
  # 3.5. Create a new branch
  git branch "init/$package" "$package/master"
done

# 4. Cleanup and remove outdated files
rm -f "$PACKAGES_DIR"/*/CONTRIBUTING.md
rm -f "$PACKAGES_DIR"/*/CODE_OF_CONDUCT.md
rm -rf "$PACKAGES_DIR"/*/.github
git add .
git commit -m "chore: remove outdated packages files"

To run this script, simply create an executable file, for example with the name migrate.sh, paste the script’s content inside it, and run it with the command:

chmod u+x ./migrate.sh
./migrate.sh

Note! Don’t forget to install git-filter-repo before running the script.
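
A small pre-flight check can catch a missing tool before any repository is touched. This sketch uses a hypothetical missing_commands helper and assumes that git-filter-repo is installed as an executable on the PATH, which is how it is commonly set up.

```shell
#!/bin/bash

# Hypothetical helper: print every required command that is not on the PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" > /dev/null 2>&1 || echo "$cmd"
  done
}

missing=$(missing_commands git git-filter-repo)
if [ -n "$missing" ]; then
  # Report the problem; a real script would probably exit here.
  echo "Please install before running migrate.sh:" $missing >&2
fi
```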

Notes for each step of the script:

  • 1. Configure
    Configuration variables define the list of repositories to be migrated, the destination directory of the new Lerna monorepo, and the folder for packages inside it. You can modify these variables to reuse this script for your project.
  • 2. Initialize a new repository
    We initialize a new repository. The first repository is also registered as the remote origin repository.
  • 3. Migrate repositories
    • 3.1. Get package name
      It extracts the package names from the repository links. In our case, the repositories are prefixed with node-, which we don’t want to keep.
    • 3.2. Rewrite commit messages via a temporary repository
      To prefix the commits of each package using the pattern chore(<package-name>): , we need to rewrite each repository separately. We do this in a copy of the repository cloned locally to a temporary folder.
    • 3.3. Merge the repository into monorepo
      First, we add the locally cloned repository as a remote of the monorepo. Then, we merge its commit history, specifying a merge commit message.
    • 3.4. Move repository files to the packages folder
      After merging, the files of the merged repository appear in the monorepo root directory. Following the structure we want to achieve, we move those files to the packages directory and commit the change.
    • 3.5. Create a new branch
      The commit history is now associated with our monorepo through a remote repository. The history will be lost if the original repository is erased. To store the history in the monorepo, we create a branch that tracks the remote repository, and we prefix its name with init/.
  • 4. Cleanup and remove outdated files
    For the sake of illustration, we clean up some package files made obsolete by the migration. Some of those files should instead be moved to the repository root directory.
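
Step 3.1 is the part most likely to need adapting for another project. As an illustration, the name extraction can be written with plain shell parameter expansion. The package_name helper below is hypothetical, and handling of a trailing slash or a .git suffix is an extra assumption on top of what the migration needs.

```shell
#!/bin/bash

# Hypothetical helper: derive the package name from a repository URL.
package_name() {
  local url=${1%/}        # drop a trailing slash, if any
  local name=${url##*/}   # keep the last path segment
  name=${name%.git}       # drop an optional .git suffix
  echo "${name#node-}"    # drop the node- prefix used by the old repositories
}

package_name https://github.com/adaltas/node-csv-parse
package_name https://github.com/adaltas/node-stream-transform
```

With the URLs above, the helper prints csv-parse and stream-transform.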

Further steps

The Git repository is now ready and, as such, qualifies as a monorepo. To make it usable, additional files must be created, such as a root package.json file, the lerna.json configuration file if using Lerna, and a README file. Refer to the first article of our series to apply the necessary changes and initialize your monorepo with Lerna.
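
As a rough starting point, the root files could be created as sketched below. The package name node-csv-monorepo, the workspaces entry, and the independent versioning mode are assumptions for illustration. The sketch writes into a temporary directory standing in for the monorepo root.

```shell
#!/bin/bash
set -e

# Assumption: a temporary directory stands in for the monorepo root.
root=$(mktemp -d)
cd "$root"
mkdir -p packages

# Root manifest: private, so the root itself is never published.
cat > package.json <<'EOF'
{
  "name": "node-csv-monorepo",
  "private": true,
  "workspaces": ["packages/*"]
}
EOF

# Lerna configuration: where the packages live and how versions are managed.
cat > lerna.json <<'EOF'
{
  "packages": ["packages/*"],
  "version": "independent"
}
EOF

printf '# Node CSV\n' > README.md
ls
```

Whether to version packages independently or together is a project decision, so adjust the version field to your needs.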

Conclusion

Migrating existing open-source projects requires you to be tidy and meticulous, because a small mistake can break the work of your users. All the steps must be carefully analyzed and well tested. In this article, we covered the work required to migrate multiple Node.js projects to a Lerna monorepo. We considered different approaches, techniques, and available tools to automate the migration, using our Node CSV open-source project as an example.
