Contributing: Decrease the size of the source code needed to be downloaded #29008

Closed
gziolo opened this issue Feb 15, 2021 · 26 comments
Labels
[Type] Enhancement A suggestion for improvement. [Type] Project Management Meta-issues related to project management of Gutenberg

Comments

@gziolo
Member

gziolo commented Feb 15, 2021

Similar to #26993.

What problem does this address?

It takes ages to finish:

git clone [email protected]:WordPress/gutenberg.git

At the moment the size of the repository is over 2GB!!!!

Screen Shot 2021-02-11 at 15 17 11

If you add to the mix that you need to run on every brand new repository:

npm install

It adds another 1GB of data that needs to be downloaded as reported in #26993.

What is your proposed solution?

It makes me think that maybe the gh-pages branch is one of the reasons why the size of the repository has grown so much. We replace the content of gh-pages with the new build of Storybook on every commit to the main branch.

I don't know how this sort of issue is usually solved in git-based repositories, but the comment from WordPress Slack (link requires registration at https://make.wordpress.org/chat/) authored by @ocean90 should be a good start:

https://wordpress.slack.com/archives/C5UNMSU4R/p1609864617204200?thread_ts=1609770083.149700&cid=C5UNMSU4R

Yes by creating a new orphan branch from gh-pages. You have to add the files there and the gh-pages branch needs to be deleted. Then rename the new branch to gh-pages which finally gets force-pushed.
This post documents the steps

The link included:
https://medium.com/@sangeethkumar.tvm.kpm/cleaning-up-a-git-repo-for-reducing-the-repository-size-d11fa496ba48

This article contains some techniques that could help with gh-pages where we don't care about history at all. There are also several interesting references to other similar articles that try to address similar issues.
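The orphan-branch steps quoted above can be tried out safely in a throwaway sandbox. The following is only a sketch: the temporary branch name `latest_branch`, the commit messages, and the local bare repository standing in for GitHub are all assumptions for illustration, not what the project actually ran.

```shell
#!/bin/sh
# Sketch: replace gh-pages with a single orphan commit, inside a sandbox.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q --bare "$sandbox/origin.git"   # stand-in for the GitHub remote
git init -q "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/gh-pages  # start directly on gh-pages
git remote add origin "$sandbox/origin.git"

# Simulate three Storybook deployments, each a new commit on gh-pages.
for build in 1 2 3; do
  echo "bundle for build $build" > bundle.js
  git add -A
  git commit -q -m "Deploy build $build"
done
git push -q origin gh-pages

# The quoted steps: orphan branch, add files, delete the old branch,
# rename the new one, force-push.
git checkout -q --orphan latest_branch     # hypothetical temporary name
git add -A
git commit -q -m "Fresh gh-pages snapshot"
git branch -D gh-pages >/dev/null
git branch -m gh-pages
git push -q -f origin gh-pages

remote_commits=$(git --git-dir="$sandbox/origin.git" rev-list --count gh-pages)
echo "commits on origin gh-pages: $remote_commits"
```

After the force-push the branch has one commit, but the old objects stay on the remote until it garbage-collects them, so the size reduction may not be visible immediately.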

@gziolo gziolo added [Type] Project Management Meta-issues related to project management of Gutenberg [Type] Enhancement A suggestion for improvement. labels Feb 15, 2021
@iandunn
Member

iandunn commented Mar 4, 2021

This command lists files that are larger than 1MB (requires brew install coreutils on OS X):

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | awk '$2 >= 2^20' | sort --numeric-sort --key=2 | gcut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Results are at https://pastebin.com/hQaYcHwE

Those include files in HEAD, though. I wasn't able to filter them out, but it should be possible.
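One possible way to filter out the files still present in HEAD is to collect the blob IDs of HEAD's tree first and skip them. This is a sketch under assumptions: the sandbox repository and the file names `removed-later.bin` and `still-here.bin` are made up, and the output is raw bytes rather than the formatted sizes of the command above.

```shell
#!/bin/sh
# Sketch: list blobs >= 1 MiB that exist in history but not in HEAD.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q "$sandbox/repo"
cd "$sandbox/repo"

# Two large files with different sizes (so they are different blobs);
# one of them is deleted again in a later commit.
dd if=/dev/zero of=removed-later.bin bs=1048576 count=2 2>/dev/null
dd if=/dev/zero of=still-here.bin bs=1048576 count=3 2>/dev/null
git add -A
git commit -q -m "Add two large files"
git rm -q removed-later.bin
git commit -q -m "Remove one of them"

min_bytes=1048576   # 1 MiB, the same threshold as the command above

# Blob IDs reachable from HEAD's tree (third column of ls-tree output).
git ls-tree -r HEAD | awk '{ print $3 }' | sort -u > "$sandbox/head_blobs.txt"

stale_blobs=$(
  git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk -v min="$min_bytes" '
    NR == FNR { in_head[$1] = 1; next }   # first input: blobs still in HEAD
    $1 == "blob" && $3 >= min && !($2 in in_head) { print $3, $4 }
  ' "$sandbox/head_blobs.txt" -
)
echo "$stale_blobs"
```

The `NR == FNR` idiom assumes HEAD contains at least one file; with an empty tree the first input would be empty and the filter would misfire.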

I suspect there's a long tail of files < 1MB that could also be removed for a significant boost.

Removing files from the history would break the hashes, but might be worth it in this case.

@gziolo
Member Author

gziolo commented Mar 5, 2021

@iandunn, that's very helpful. Thank you for doing a more in-depth investigation.

Results are at https://pastebin.com/hQaYcHwE

@hypest, I see a lot of mobile-related files with the highest impact that don't look like source code. Would it be possible to remove some of them from the git history so we could make the Gutenberg repository faster to download?

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

@hypest
Contributor

hypest commented Mar 8, 2021

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

Aha, I'm not familiar with what that file is but I'm sure there are savings to be made along those lines. @ceyhun , you think you can take a look when you get some chance, possibly after HACK week of March 2021? Thanks!

@iandunn
Member

iandunn commented Mar 8, 2021

git clone https://github.com/WordPress/gutenberg.git --depth 10 might also be interesting. That clones the repo with just the last 10 commits. It only took ~5 seconds to download on my machine.

I'm guessing someone could use that, then send a PR, and it wouldn't have any problems. I haven't tested that, though. The downside is that people would have to intentionally do it, since it's not the default behavior. Scripts and docs could be updated, though.
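A shallow clone and its later conversion to a full clone can be sketched locally. The 15-commit upstream repository below is invented for the demonstration; the `file://` URL is used because Git ignores `--depth` when cloning from a plain local path.

```shell
#!/bin/sh
# Sketch: shallow clone with --depth, then fetch the rest with --unshallow.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q "$sandbox/upstream"
cd "$sandbox/upstream"
for n in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
  echo "change $n" > file.txt
  git add file.txt
  git commit -q -m "Commit $n"
done

cd "$sandbox"
# Only the last 10 commits are transferred.
git clone -q --depth 10 "file://$sandbox/upstream" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)

# If the full history is needed later, it can still be fetched.
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)

echo "shallow: $shallow_count commits, after --unshallow: $full_count"
```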

If we do remove stuff from history, I'd recommend getting rid of everything all at once if possible. Changing the history will break lots of stuff, so we'd probably only want to do it once every few years, at most.

@windhamdavid

I tested the --depth. Faster but doesn't do much for size. 2.2GB is 2much ✂️ Might consider something like SVN subtree using a filter-branch -f --prune-empty --subdirectory-filter and split off the docs (gh-pages) and tests. cc: #26993 (comment)
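For reference, a minimal sketch of what splitting a subdirectory into its own history with `--subdirectory-filter` could look like. The `docs` and `lib` directories and the sandbox repository are hypothetical, and the rewrite is done in a clone so the original is left untouched.

```shell
#!/bin/sh
# Sketch: extract one subdirectory into a repository of its own.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the filter-branch warning pause

sandbox=$(mktemp -d)
git init -q "$sandbox/monorepo"
cd "$sandbox/monorepo"
mkdir docs lib
echo "# Handbook" > docs/README.md
echo "code" > lib/index.js
git add -A
git commit -q -m "Add docs and lib"
echo "more" >> docs/README.md
git commit -q -am "Update docs"
echo "more code" >> lib/index.js
git commit -q -am "Update lib"

# Rewrite a clone, not the original repository.
git clone -q "$sandbox/monorepo" "$sandbox/docs-only"
cd "$sandbox/docs-only"
git filter-branch -f --prune-empty --subdirectory-filter docs >/dev/null 2>&1

split_files=$(git ls-tree -r --name-only HEAD)
split_commits=$(git rev-list --count HEAD)
echo "files: $split_files, commits: $split_commits"
```

Only the commits that touched `docs` survive, and its contents become the root of the new history.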

@ceyhun
Member

ceyhun commented Mar 18, 2021

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

Aha, I'm not familiar with what that file is but I'm sure there are savings to be made along those lines. @ceyhun , you think you can take a look when you get some chance, possibly after HACK week of March 2021? Thanks!

@gziolo I found the commits for this file using the git log --all --full-history -- diffcheck.txt command. This seems to be the latest one, which deletes it: 397e645. The file looks like the output of a git diff. I think we can safely erase this.
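A small sketch of that lookup in a sandbox repository (the commit messages are invented). Adding `--diff-filter=D` narrows the result to the commits that deleted the path.

```shell
#!/bin/sh
# Sketch: find the commits that touched, and that deleted, a removed file.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q "$sandbox/repo"
cd "$sandbox/repo"

echo "diff output" > diffcheck.txt
git add diffcheck.txt
git commit -q -m "Add diffcheck.txt by accident"
echo "readme" > README.md
git add README.md
git commit -q -m "Add README"
git rm -q diffcheck.txt
git commit -q -m "Remove diffcheck.txt"

# Every commit on any ref that touched the path, even though it is gone now.
touching=$(git log --all --full-history --format='%s' -- diffcheck.txt)

# Only the commits that deleted it.
deleting=$(git log --all --full-history --diff-filter=D --format='%s' -- diffcheck.txt)

echo "touched by:"
echo "$touching"
echo "deleted by: $deleting"
```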

@hypest I also checked the pastebin and saw a lot of mobile bundles and binaries. I think it's also fine to erase these ones as we're not using them and they can also be regenerated if needed:

bundle/android/App.js
bundle/android/App.js.map
bundle/ios/App.js
bundle/ios/App.js.map
test/native/gutenberg-mobile-demo-app.apk
ios/Gutenberg.app.zip

@gziolo
Member Author

gziolo commented Mar 19, 2021

@ceyhun, this is a great finding. How can we perform this cleanup? Is it something you could do yourself?

@hypest
Contributor

hypest commented Mar 19, 2021

@hypest I also checked the pastebin and saw a lot of mobile bundles and binaries. I think it's also fine to erase these ones as we're not using them and they can also be regenerated if needed

+1 for removing the Android ones, the APK and the app.zip, no questions asked.

For the iOS ones though, it will probably make trying out/debugging older WPiOS versions harder, right? Recreating them will probably be cumbersome. I actually don't like that we had to commit the JS bundles at all so, if you feel confident about iOS debugging without having the bundles readymade @ceyhun then I'm +1.

@ceyhun
Member

ceyhun commented Mar 23, 2021

@ceyhun, this is a great finding. How can we perform this cleanup? Is it something you could do yourself?

@gziolo I'm not really sure what git magic is needed for this to happen 🪄 I also do not consider myself a git magician 😃 So any help would be appreciated.

For the iOS ones though, it will probably make trying out/debugging older WPiOS versions harder, right? Recreating them will probably be cumbersome. I actually don't like that we had to commit the JS bundles at all so, if you feel confident about iOS debugging without having the bundles readymade @ceyhun then I'm +1.

@hypest I think WPiOS was always using the bundles on gutenberg-mobile repo and seems like that one goes back as far as 2018, so maybe that's enough?

@jonathanbossenger
Contributor

@gziolo @ceyhun based on this SO answer, git filter-branch should allow you to completely remove those files from the repo history https://stackoverflow.com/questions/43762338/how-to-remove-file-from-git-history#43762489
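As a rough illustration of the `git filter-branch` approach, here is a sketch that removes one file from every commit in a sandbox repository. The repository contents are made up; on a real repository this rewrites all commit hashes, and Git's own documentation now recommends the separate `git filter-repo` tool for this kind of rewrite.

```shell
#!/bin/sh
# Sketch: erase a file from the whole history with filter-branch.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the filter-branch warning pause

sandbox=$(mktemp -d)
git init -q "$sandbox/repo"
cd "$sandbox/repo"

echo "keep me" > README.md
dd if=/dev/zero of=diffcheck.txt bs=1024 count=64 2>/dev/null
git add -A
git commit -q -m "Add README and a stray diff dump"
git rm -q diffcheck.txt
git commit -q -m "Remove diffcheck.txt"

# Rewrite every commit on every ref, dropping the file from each tree.
# Commits that become empty are pruned.
git filter-branch --force --prune-empty \
  --index-filter 'git rm -q --cached --ignore-unmatch diffcheck.txt' \
  -- --all >/dev/null 2>&1

# filter-branch keeps backups under refs/original; drop them and expire
# the reflogs so the old objects become unreachable and can be pruned.
git for-each-ref --format='%(refname)' refs/original |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

remaining=$(git rev-list --objects --all | grep -c 'diffcheck.txt' || true)
commits=$(git rev-list --count HEAD)
echo "objects still named diffcheck.txt: $remaining, commits left: $commits"
```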

@gziolo
Member Author

gziolo commented Mar 24, 2021

We discussed options on WordPress Slack in the #meta channel (link requires registration at https://make.wordpress.org/chat/):
https://wordpress.slack.com/archives/C02QB8GMM/p1616519854024400

@dd32 shared the following:

Playing with git rev-list --disk-usage shows that the gh-pages branch is 10x the next:

26.67MB refs/tags/@wordpress/[email protected]
26.67MB refs/tags/@wordpress/[email protected]
112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node
213.09MB refs/tags/rnmobile/monorepo-commit-history
2053.26MB refs/remotes/origin/gh-pages
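A per-ref listing like the one above could be produced along these lines. This is a guess at the approach, not the exact command that was used: it assumes Git 2.31 or newer (where `git rev-list --disk-usage` was added), reports bytes rather than MB, and runs against an invented sandbox repository with a branch named `big-branch`.

```shell
#!/bin/sh
# Sketch: on-disk size of everything reachable from each ref, smallest first.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q "$sandbox/repo"
cd "$sandbox/repo"

echo "small" > a.txt
git add a.txt
git commit -q -m "Small commit on the default branch"

git checkout -q -b big-branch
# Random data does not compress, so this branch is clearly heavier on disk.
dd if=/dev/urandom of=blob.bin bs=1024 count=512 2>/dev/null
git add blob.bin
git commit -q -m "Add a large binary"

if git rev-list --disk-usage HEAD >/dev/null 2>&1; then
  report=$(
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
      printf '%s %s\n' "$(git rev-list --disk-usage --objects "$ref")" "$ref"
    done | sort -n
  )
else
  report="unsupported"   # this Git predates rev-list --disk-usage
fi
echo "$report"
```

Because refs share objects, the per-ref numbers overlap and do not add up to the repository size.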

I started with the first step and rewrote the history of gh-pages branch:
https://github.com/WordPress/gutenberg/commits/gh-pages

It looks like it mostly generates new bundle files for the Storybook instance available at https://wordpress.github.io/gutenberg/.

Can you check if we can remove the mobile branches listed completely?

@hypest
Contributor

hypest commented Mar 24, 2021

@hypest I think WPiOS was always using the bundles on gutenberg-mobile repo and seems like that one goes back as far as 2018, so maybe that's enough?

Oh, right @ceyhun. I don't think WPiOS was ever using the bundle directly from Gutenberg's repo, only from gutenberg-mobile. I see what you mean now so yeah, no need for the native mobile (RN) bundle inside Gutenberg's repo 👍.

@ceyhun
Member

ceyhun commented Mar 24, 2021

Can you check if we can remove the mobile branches listed completely?

@gziolo I went ahead and deleted the following mobile branches:

112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node
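For anyone repeating this on another repository, deleting branches on the remote can be sketched as follows. The bare repository stands in for GitHub and only two of the branch names from the list are recreated; this is an illustration, not a record of the commands that were run.

```shell
#!/bin/sh
# Sketch: delete several branches on a remote in one push.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q --bare "$sandbox/origin.git"   # stand-in for the GitHub remote
git init -q "$sandbox/work"
cd "$sandbox/work"
git remote add origin "$sandbox/origin.git"

echo "x" > file.txt
git add file.txt
git commit -q -m "Initial commit"
git branch rnmobile/import-mobile-lint
git branch rnmobile/import-mobile-fix-ci
git push -q origin --all

# Remove the branches from the remote, then drop stale tracking refs locally.
git push -q origin --delete rnmobile/import-mobile-lint rnmobile/import-mobile-fix-ci
git fetch -q --prune origin

remaining_branches=$(
  git --git-dir="$sandbox/origin.git" for-each-ref --format='%(refname:short)' refs/heads |
  wc -l | tr -d ' '
)
echo "branches left on origin: $remaining_branches"
```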

But I'm not sure about deleting this tag: rnmobile/monorepo-commit-history. We kept it so we can view gutenberg-mobile git history from before monorepo merge. I suppose it's also the tag/branch where most of the large bundle/android/App.js and bundle/ios/App.js files live in. It would be nice if we can rewrite history in that branch to not include the bundle files, but I'm not sure how it can be done and I can imagine that it could be a complex task.

Also on second thought, I think we can use gutenberg-mobile to view git history before monorepo as well. It would be harder to search and find a specific file from gutenberg repo in gutenberg-mobile back again just for its history, but I think it's possible. I also don't remember using rnmobile/monorepo-commit-history tag before to check the history of a file, and I think after monorepo I modified many files from RN Bridge, RN Aztec code and E2E tests which were in gutenberg-mobile before monorepo. Any thoughts @hypest?

@hypest
Contributor

hypest commented Mar 26, 2021

Also on second thought, I think we can use gutenberg-mobile to view git history before monorepo as well. It would be harder to search and find a specific file from gutenberg repo in gutenberg-mobile back again just for its history, but I think it's possible. I also don't remember using rnmobile/monorepo-commit-history tag before to check the history of a file, and I think after monorepo I modified many files from RN Bridge, RN Aztec code and E2E tests which were in gutenberg-mobile before monorepo. Any thoughts @hypest?

Good point Ceyhun. The commit history is indeed available in gutenberg-mobile's repo, but I think it's quite hard to connect the dots as that repo has also moved on. All in all, I'd prefer if we keep the rnmobile/monorepo-commit-history for some more time. Anecdotally, I did use that branch a couple of weeks ago while trying to understand the code history of how selection messages get triggered on the Aztec wrapper on Android (to fix an important regression).

@gziolo
Member Author

gziolo commented Mar 26, 2021

It looks like the changes applied so far had an impressive impact on the repository size:
Screen Shot 2021-03-26 at 12 34 25

Do you think we can further decrease the size or is it fine to close this issue for now?

@ceyhun
Member

ceyhun commented Mar 29, 2021

Do you think we can further decrease the size or is it fine to close this issue for now?

We're thinking of keeping a fork of gutenberg just for the rnmobile/monorepo-commit-history tag and maybe we can delete it here then. It would be worth keeping this open a little while longer while we figure this out.

Thanks @mchowning for coming up with that idea!

@gziolo
Member Author

gziolo commented Mar 30, 2021

We can wait a few more weeks, no worries. The smaller size of the download necessary to clone the repository is worth it 😄

Thank you for all the help so far 🙇🏻

@ceyhun
Member

ceyhun commented Apr 16, 2021

@gziolo just created a fork wordpress-mobile/gutenberg-rnmobile-monorepo-commit-history to keep the history and deleted the rnmobile/monorepo-commit-history tag. Seems like this lowered the size even more:
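The general idea of preserving a tag elsewhere before deleting it can be sketched with two local bare repositories. Note the assumption: the real cleanup used a full GitHub fork, whereas this sketch only pushes the tag to a second remote named `archive`, which is a hypothetical stand-in.

```shell
#!/bin/sh
# Sketch: copy a tag to an archive remote, then delete it from origin.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q --bare "$sandbox/origin.git"    # stand-in for the main repository
git init -q --bare "$sandbox/archive.git"   # stand-in for the archive fork
git init -q "$sandbox/work"
cd "$sandbox/work"
git remote add origin "$sandbox/origin.git"
git remote add archive "$sandbox/archive.git"

echo "history" > file.txt
git add file.txt
git commit -q -m "Pre-monorepo history"
git tag rnmobile/monorepo-commit-history
git push -q origin HEAD refs/tags/rnmobile/monorepo-commit-history

# 1. Preserve the tag (and the history it points to) in the archive.
git push -q archive refs/tags/rnmobile/monorepo-commit-history

# 2. Delete it from origin and from the local clone.
git push -q origin --delete refs/tags/rnmobile/monorepo-commit-history
git tag -d rnmobile/monorepo-commit-history >/dev/null

origin_tags=$(git --git-dir="$sandbox/origin.git" tag -l | wc -l | tr -d ' ')
archive_tags=$(git --git-dir="$sandbox/archive.git" tag -l)
echo "tags on origin: $origin_tags, tags in archive: $archive_tags"
```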

gutenberg-clone

@gziolo
Member Author

gziolo commented Apr 17, 2021

This is great. The only remaining task would be to improve the GitHub workflow that uses gh-pages to update Storybook to always recreate the branch from scratch to ignore its history.

@gziolo
Member Author

gziolo commented Apr 22, 2021

@ockham, how much work would it be to run something like this on the gh-pages branch in a GitHub workflow:

git checkout --orphan latest_branch
git add -A
git commit -am "Initial commit message" # Committing the changes
git branch -D master # Deleting master branch
git branch -m master # Renaming branch as master
git push -f origin master # Pushes to master branch
git gc --aggressive --prune=all # Remove the old files

I don't remember exactly what I used before, but it was similar and it removed all git history for gh-pages. Ideally we would run it every time we update Storybook. The alternative would be to use another repository.

@ockham
Contributor

ockham commented Apr 22, 2021

@ockham, how much work would it be to run something like this on the gh-pages branch in a GitHub workflow:

git checkout --orphan latest_branch
git add -A
git commit -am "Initial commit message" # Committing the changes
git branch -D master # Deleting master branch
git branch -m master # Renaming branch as master
git push -f origin master # Pushes to master branch
git gc --aggressive --prune=all # Remove the old files

I don't remember exactly what I used before, but it was similar and it removed all git history for gh-pages. Ideally we would run it every time we update Storybook.

Looks like it shouldn't be too much work; basically, any workflow that uses @actions/checkout automatically gets a GH token that enables it to perform git operations. Would we want to add that to the .github/workflows/storybook-pages.yml workflow?

For me, the bigger question seems to be if we really want to routinely rewrite the history of our gh-pages branch 🤔 Which brings us to your alternative suggestion...

The alternative would be to use another repository.

Wouldn't that maybe make more sense? If we've identified that:

  • the gh-pages branch is too big
  • it is only ever generated by a script

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

@gziolo
Member Author

gziolo commented Apr 22, 2021

Would we want to add that to the .github/workflows/storybook-pages.yml workflow?

Yes.

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

@ockham
Contributor

ockham commented Apr 22, 2021

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

I'm leaning towards the latter, TBH. Seems fairly straight-forward. The main questions are probably if creating a new wordpress.github.io repo (at org level) will collide with the existing wordpress.github.io/gutenberg/ pages (at GB repo level, created from the gh-pages branch); and if we'll be able to retain the /gutenberg path somehow 🙂

@ockham
Contributor

ockham commented Apr 22, 2021

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

I'm leaning towards the latter, TBH. Seems fairly straight-forward. The main questions are probably if creating a new wordpress.github.io repo (at org level) will collide with the existing wordpress.github.io/gutenberg/ pages (at GB repo level, created from the gh-pages branch); and if we'll be able to retain the /gutenberg path somehow 🙂

Looks like we might even be able to continue using the same GH action we're using now: It supports both deploying to a different repo, and to a subdir (not entirely sure if those can be combined). For the different repo, we need a personal access token -- rather than GITHUB_TOKEN -- but we can simply use one for the @gutenbergplugin user account.


Oh, I just noticed that if we wanna go ahead with pruning the history of the gh-pages branch instead, the GH action might support that as well OOTB.

@gziolo
Member Author

gziolo commented Apr 23, 2021

I see https://github.com/peaceiris/actions-gh-pages#%EF%B8%8F-force-orphan-force_orphan. This is exactly what we want and it makes it so much easier to approach this way. I will merge directly to trunk and see if it works. Great discovery @ockham!

@gziolo
Member Author

gziolo commented Apr 23, 2021

It worked with d4bef28:

Screen Shot 2021-04-23 at 08 33 28

We are now at 200-ish MB, which is 10% of the initial size:

Screen Shot 2021-04-23 at 08 59 42

Many thanks to everyone involved.

@gziolo gziolo closed this as completed Apr 23, 2021
7 participants