* mirror: Add job timeout to mirror configurations (PROJQUAY-7249)
Previous global job timeout of 5 minutes was inadequate for big images. The timeout should now be configurable in much the same way as sync is. Minimum job length is 300 seconds/5 minutes.
The PR is still work in progress.
* Fix init db, remove reference to user data in logs
* Fix tests, change repo mirror configuration
* Fix tests, make mirroring cancellable through UI
* Add cancel mirror test, change HTML document to reflect mirror timeout
* Flake8 doesn't like when '==' is used with 'None'
* Fix mirror registry tests
* Add new cypress data to fix cypress tests
* Added ability to define upload chunk size to RADOS driver, small changes to repo mirror HTML page
* Fix database migration to follow HEAD
* Upload new database data for Cypress tests
* Make skopeo_timeout_interval mandatory on API calls
---------
Co-authored-by: Ivan Bazulic <ibazulic@redhat.com>
* storage: Enable multipart upload for Google Cloud Storage (PROJQUAY-6862)
This PR removes the `_stream_write_internal` function override that caused excessive memory consumption and defaults to the old one which chunks uploads. Server assembly is still not suppored by GCS, so we have to assemble everything locally. However, GCS does support the copy function, so a reupload is not needed.
~~~
REPOSITORY TAG IMAGE ID CREATED SIZE
registry.fedoraproject.org/fedora latest ecd9f7ee77f4 2 days ago 165 MB
quay.skynet/ibazulic/big-mirror-test size138gb 8e6ba9ff13c0 3 days ago 148 GB
quay.skynet/quay-mirror/big-mirror-test size138gb 8e6ba9ff13c0 3 days ago 148 GB
quay.skynet/ibazulic/mfs-image-test latest ab14f2230dd9 7 days ago 5.96 GB
quay.skynet/ibazulic/azure-storage-big-file-test latest ede194b926e0 7 days ago 16.1 GB
quay.skynet/ibazulic/minio/minio latest 76ed5b96833a 6 weeks ago 532 B
Getting image source signatures
Copying blob 9d9c3d76c421 done |
Copying blob fce7cf3b093c skipped: already exists
Copying config 8e6ba9ff13 done |
Writing manifest to image destination
~~~
For uploading extremely big layers, 5 MiB as the default chunk size is not enough. The PR also enables support for user-defined chunk sizes via `minimum_chunk_size_mb` and `maximum_chunk_size_mb` which default to 5 Mib and 100 MiB respectively.
* Remove maximum_chunk_size_mb as it's not needed
When deploying Quay in a Secure AWS environment, we can't use IAM Access Keys or Secrets since these credentials are often blocked for multiple reasons (credentials are long-lived, can be shared / stolen, etc.). So the preferred deployment method is to use an alternative method, like the Web Identity Token files that are automatically created in a Kubernetes cluster that has a federation link with IAM using the OIDC provider federation.
The current code of Quay force the use of an IAM account that is then used to assume another role that has S3 access to store the image files. The current pull request removes the need to use that IAM account and allows to directly assume the correct role using Web Identity Tokens while retaining compatibility with the old method of using IAM credentials.
The code relies on the automatic detection of the correct configurations using environment variables where possible. The code has been tested on an OpenShift cluster deployed using manual mode with AWS STS.
fixing the error seen with signature_v2/v4 patch #3041 when using STSS3Storage. The STSS3Storage Class is using the connect_kwargs dictionary to initialze the S3Storage Class where all other use that dict for the connection parameters which is misleading and I did not catch that when submitting the patch for the signature v2/v4
Pre-signed URL's are only on the S3Storage Class configured vor s3v4 (hard coded). This adds the attribute signature_version to all StorageClass definitions to be configured individually. The behavior when not set defaults back to v2 for all StorageClasses except S3Storage which defaults to s3v4.
* storage: use managed copy for single chunk uploads (PROJQUAY-7328)
We do a multi-part copy from the staging location to the
final blob location in 5GB chunks sequentially. For large
layers this is extremely slow. Use managed `copy` to
move the blob to the final location faster
This adds the optimization in CloudFlare where if a request is from the primary region then instead of redirecting to the CDN, we return the S3 URL to save egress cost
* storage: Increase GCP timeout (PROJQUAY-6819)
Currently, Boto GCP timeout is set to 60 seconds which causes a problem in pushing big layers. This will increase boto timeout to 10 minutes to be more aligned with our other S3 engines. Result:
~~~
root@cyberdyne:~# time { docker push quay.skynet/ibazulic/gcp-test; }
Using default tag: latest
The push refers to repository [quay.skynet/ibazulic/gcp-test]
4335316598de: Pushed
d101c9453715: Pushed
latest: digest: sha256:c6ffbd16c2ef43496ff13c130e31be84ceccdb5408e4f0d3b0f06ae94d378ff9 size: 744
real 7m9.881s
user 0m0.204s
sys 0m0.077s
root@cyberdyne:~#
~~~
* Fix isort sorting
* Made `boto_timeout` configurable, defaults to 60
* Made `boto_timeout` configurable, fix isort issues
* Remove reference to `self.boto_timeout`
* storage: Fix big layer uploads for Ceph/RADOS driver (PROJQUAY-6586)
Current uploads of large images usually fail on Ceph/RADOS compatible implementations (including Noobaa) because during the last assembly, copy is done all at once. For large layers, this takes a long while and Boto times out. With this patch, we limit the size of the used chunk to 32 MB so the final copy is done in parts of up to 32 MB each. The size can be overridden by specifying the parameter `maximum_chunk_size_mb` in the driver settings. For backwards compatibility, an additional parameter was added: if `server_side_assembly: true` then we force server side assembly and the final blob push in chunks, if `server_side_assembly: false` we fall back to default client side assembly (we increase the boto timeout in this case to still support large layer upload):
~~~
DISTRIBUTED_STORAGE_CONFIG:
default:
- RadosGWStorage
- ...
maximum_chunk_size_mb: 100
server_side_assembly: true
~~~
* Fix formatting
* Added backward compatiblity switch and increased boto timeout
* Changed name of variable in config
* Small fixes to if statements
* storage: make cloudfront_distribution_org_overrides optional (PROJQUAY-5788)
This is causing issues with config editor where it
configure CloudFront provider because of the required
override param
When completing a chunked upload, if the chunk list is empty do not attempt to assemble anything.
Using oras to copy an artifact from an outside registry to quay results in a 5XX error. This is because at some point the upload chunk list is empty and attempting to complete the chunked upload causes an exception. Not trying to write to storage if there are no chunks allows the copy operation to successfully complete.
* chore: Add server side assembly of chunked metadata for RADOSGW driver (PROJQUAY-0000)
RadosGW did not support multipart copying from keys so we needed to do a local join and reupload of the whole blob. This creates issues for blobs which are fairly big.
Since the issue was fixed in 2015. on the Rados side, we no longer need this part of legacy code.
See [here](https://github.com/ceph/ceph/pull/5139) for more information.
* Fixed linting with black
This optimization ensures that we return the direct S3 URL for
CloudFront storage only for requests from the same region. This
ensures we don't get charged for cross-region traffic to S3
Currently the CI breaks due to a dependency of black, `click`, breaking with it's latest release with `ImportError: cannot import name '_unicodefun' from 'click'`. Since black does not pin it's version of click it pulls in the latest version containing the breaking change and fails the CI check. This updates black with the patch. [See the original issue here.](https://github.com/psf/black/issues/2964) The rest of the changes are format updates introduced with the latest version of black.
Boto3 behaves unexpectedly when the resource client is not set to use
the correct region. Boto3 can't seem to correctly set the
X-Amz-Credential header when generating presigned urls if the region
name is not explicitly set, and will always fall back to us-east-1.
To reproduce this:
- Create a bucket in a different region from us-east-1 (e.g
eu-north-1)
- Create a boto3 client/resource without specifying the region
- Generate a presigned url
This seems to be a DNS issue with AWS that only happens shortly after
a bucket has been created, and resolves itself eventually.
Ref:
- https://github.com/boto/boto3/issues/2989
- https://stackoverflow.com/questions/56517156/s3-presigned-url-works-90-minutes-after-bucket-creation
To workaround this, one can specify the bucket endpoint, either
explicitly via endpoint_url, or by setting s3_region, which will be
used to generate the bucket's virtual address.
Currently blobs leftover in the uploads directory during cancelled uploads do not get cleaned up since they are no longer tracked. This change cleans up the uploads storage directory directly.
The way it was formatted (with the top-level parentheses), the extra
comma was causing the connection parameters to be passed as a 1-tuple
instead of a string.
Explicitly call abort on a mpu if no bytes were written.
Noobaa and Rados will not clean the artifacts, resulting in empty
files being stuck in an "uploading" state.
Migrate from using boto2 to boto3. Changes include:
- Removes explicit bucket addressing style: Boto3 will initially try virtual-style addressing first then fallback to path-style addressing (https://github.com/boto/boto3/blob/develop/docs/source/guide/configuration.rst)
- GCS workarounds to use boto3:
- Handles CORS config
- Update signed url access key parameter name
- Uses ListBucket V1 API
- On client-side chunks join, copy using non-multipart api: Use copy_from instead of copy when joining chunks client-side. This is because copy assumes multipart upload should be used which GCS and Rados are not compatible with (S3's version. They have their own parallel upload api)
- Update RDS healthcheck to use boto3
* Convert all Python2 to Python3 syntax.
* Removes oauth2lib dependency
* Replace mockredis with fakeredis
* byte/str conversions
* Removes nonexisting __nonzero__ in Python3
* Python3 Dockerfile and related
* [PROJQUAY-98] Replace resumablehashlib with rehash
* PROJQUAY-123 - replace gpgme with python3-gpg
* [PROJQUAY-135] Fix unhashable class error
* Update external dependencies for Python 3
- Move github.com/app-registry/appr to github.com/quay/appr
- github.com/coderanger/supervisor-stdout
- github.com/DevTable/container-cloud-config
- Update to latest mockldap with changes applied from coreos/mockldap
- Update dependencies in requirements.txt and requirements-dev.txt
* Default FLOAT_REPR function to str in json encoder and removes keyword assignment
True, False, and str were not keywords in Python2...
* [PROJQUAY-165] Replace package `bencode` with `bencode.py`
- Bencode is not compatible with Python 3.x and is no longer
maintained. Bencode.py appears to be a drop-in replacement/fork
that is compatible with Python 3.
* Make sure monkey.patch is called before anything else (
* Removes anunidecode dependency and replaces it with text_unidecode
* Base64 encode/decode pickle dumps/loads when storing value in DB
Base64 encodes/decodes the serialized values when storing them in the
DB. Also make sure to return a Python3 string instead of a Bytes when
coercing for db, otherwise, Postgres' TEXT field will convert it into
a hex representation when storing the value.
* Implement __hash__ on Digest class
In Python 3, if a class defines __eq__() but not __hash__(), its
instances will not be usable as items in hashable collections (e.g sets).
* Remove basestring check
* Fix expected message in credentials tests
* Fix usage of Cryptography.Fernet for Python3 (#219)
- Specifically, this addresses the issue where Byte<->String
conversions weren't being applied correctly.
* Fix utils
- tar+stream layer format utils
- filelike util
* Fix storage tests
* Fix endpoint tests
* Fix workers tests
* Fix docker's empty layer bytes
* Fix registry tests
* Appr
* Enable CI for Python 3.6
* Skip buildman tests
Skip buildman tests while it's being rewritten to allow ci to pass.
* Install swig for CI
* Update expected exception type in redis validation test
* Fix gpg signing calls
Fix gpg calls for updated gpg wrapper, and add signing tests.
* Convert / to // for Python3 integer division
* WIP: Update buildman to use asyncio instead of trollius.
This dependency is considered deprecated/abandoned and was only
used as an implementation/backport of asyncio on Python 2.x
This is a work in progress, and is included in the PR just to get the
rest of the tests passing. The builder is actually being rewritten.
* Target Python 3.8
* Removes unused files
- Removes unused files that were added accidentally while rebasing
- Small fixes/cleanup
- TODO tasks comments
* Add TODO to verify rehash backward compat with resumablehashlib
* Revert "[PROJQUAY-135] Fix unhashable class error" and implements __hash__ instead.
This reverts commit 735e38e3c1d072bf50ea864bc7e119a55d3a8976.
Instead, defines __hash__ for encryped fields class, using the parent
field's implementation.
* Remove some unused files ad imports
Co-authored-by: Kenny Lee Sin Cheong <kenny.lee@redhat.com>
Co-authored-by: Tom McKay <thomasmckay@redhat.com>
This change replaces the metricqueue library with a native Prometheus
client implementation with the intention to aggregated results with the
Prometheus PushGateway.
This change also adds instrumentation for greenlet context switches.