Skip to content

Conversation

@lourencovales
Copy link
Contributor

This adds a resource to manage DNS on Cloudflare. Making use of the official Golang API, we add functionality to create, update, edit, delete records. A function to purge unmanaged records is also available. Since the REST API does not provide us with an event stream, we make use of polling to check for changes on Cloudflare.

I've tested each functionality with an mcl of this type, changing the values to test the different functionalities:

cloudflare:dns "public-website" {
    zone => "-domain name-",
    record_name => "www",
    type => "A",
    content => "192.0.4.1",
    state => "absent",
    purge => false,  
    ttl => 300,
    apitoken => "-token-",

    Meta:poll => 70, 
}

And everything seems to work. I've also tested changing two resources at once, and everything worked as well. Tests are also passing locally, so I'm submitting this for review.

@purpleidea
Copy link
Owner

Thanks for the patch. A brief design review with some broad questions:

  • How should the user use purge? Suppose you wrote a small snippet of mcl to do so. Does that look logical, sensible, etc?

  • Is there a way I could build a struct (basically adding a resource as mcl) where the combination of params passed is illogical? For example, if one specified doesn't get used for example.

  • Did you look into my comment about the DNS resolving as part of the Check phase to reduce on API polling? What's the mechanism/algorithm that you suggest to determine when to check via API and when to check via regular DNS?

  • How would you test this. Even presuming that writing extensive tests is out of scope for this PR, is there a way to write some tests? If we did write extensive tests later on, would this design be incompatible with that?

Thanks!

@lourencovales
Copy link
Contributor Author

How should the user use purge? Suppose you wrote a small snippet of mcl to do so. Does that look logical, sensible, etc?

The user needs to set the purge parameter to true, in any of the resources that are managing the given zone they want to purge. This would do it:

cloudflare:dns "public-website" {
    zone => "-domain name-",
    record_name => "www",
    type => "A",
    content => "192.0.4.1",
    state => "absent",
    purge => true,  
    ttl => 300,
    apitoken => "-token-",

    Meta:poll => 70, 
}

This will, in practice, wipe all the records, given that the only record that it's managed by the resource has state absent and there's no other records. If there were other records in the same zone with state exists, they'd be kept.

Is there a way I could build a struct (basically adding a resource as mcl) where the combination of params passed is illogical? For example, if one specified doesn't get used for example.

I don't understand this question, sorry. I've tested this with invalid params, and the Validate() checks are working. I don't know what you mean by "doesn't get used".

Did you look into my comment about the DNS resolving as part of the Check phase to reduce on API polling? What's the mechanism/algorithm that you suggest to determine when to check via API and when to check via regular DNS?

I saw it, but the rate limiting that CF imposes is pretty lax: 1200 requests over 5 minutes (when why I wrote ~4req/s in a comment of my code, which is just an average). I've also made the polling rate check to always be bigger than 60s. So I don't think this is something that 's worth doing, at least not unless someone is managing some hundreds of records.

How would you test this. Even presuming that writing extensive tests is out of scope for this PR, is there a way to write some tests? If we did write extensive tests later on, would this design be incompatible with that?

I tested this by running .mcl code like the one I pasted in the OP, and verifying that the changes were being applied correctly on CF's dashboard. It's of course possible to write tests, but we always need to provide an API key for auth, so that sort of precludes any automated testing that could be shared afaik.

Thanks!

My pleasure.

@purpleidea
Copy link
Owner

This will, in practice, wipe all the records

For example, let's say you only wanted to remove all unmanaged records.

unless someone is managing some hundreds of records

Well, I have over a 100 DNS records in my personal cloudflare account, and it's just small personal stuff mostly. But 100 shouldn't be seen as outrageous. I think it's elegant to have this nice fallback to regular DNS, that is if we have an algorithm to do that.

@purpleidea
Copy link
Owner

we always need to provide an API key for auth

Have you heard of "mock"-ing?

@lourencovales
Copy link
Contributor Author

For example, let's say you only wanted to remove all unmanaged records.

But this is the already the behavior of the purge function. Purge only removes unmanaged resources.

Well, I have over a 100 DNS records in my personal cloudflare account, and it's just small personal stuff mostly. But 100 shouldn't be seen as outrageous. I think it's elegant to have this nice fallback to regular DNS, that is if we have an algorithm to do that.

I was digging a bit more into this, and I realized something: if we're thinking in the case of managing hundreds of records, our limit might not be the 1200req/5min that CF has, but rather the 200req/s per IP (cf. this). Given that each request spends 2 API tokens at the moment per resource managed, managing over 100 records might already trip some limits.

Now, this might not be a problem at first, given that the GET requests will get a 429 and retry. This behavior can be configured at the library level when we're registering a new client (WithRequestTimeou and WithMaxRetries).

While I was reading about it, I also found about WithMiddleware , and this might be used in conjunction with the stdlib rate package to implement a simple rate limiting system.

I think that with these three options (setting those two options, and implementing the middleware) would allow us to be able to manage thousands of records, and always be inside the limits that CF imposes.

Have you heard of "mock"-ing?

Can't say I have!

@lourencovales
Copy link
Contributor Author

I pushed some small changes, namely I added the log functionality you suggested.

I also had a look at how to mock tests, and I think I understand now: basically we're feeding the test some fake data (as if it would come from Cloudflare). But as far as I can tell, this would require some light refactoring of the current implementation, as I'd need to implement an interface which I could then use for mocked tests. Does this make sense?

Also, how do you want to go about the rate-limiting part?

@lourencovales
Copy link
Contributor Author

lourencovales commented Oct 31, 2025

After what we talked about over Matrix, I've made the requested changes - this now checks against DNS records using the net package first, instead of directly going to the Cloudflare API to check the current state.

Still, there's a kink here that I'm not sure how to deal with: there's three fields that we can't query with the DNS request - TTL, Proxied, and Comment, as these are internal to Cloudflare. They will always require that we fetch the info from the Cloudflare API, and if any of these fields are set, there's no savings in terms of API calls. Still, we need to check them otherwise we might introduce drift between the desired state and what's on Cloudflare, and that's why I opted for the current implementation of always checking. Let me know what you think.

@purpleidea
Copy link
Owner

Sweet!

They will always require that we fetch the info from the Cloudflare API

Alright, now we're talking. In matrix I mentioned:

I think you will learn some fun things in (A) the DNS part...

Feels like the prophecy came true =D Was it fun to write the resolver code? Is there a chance that different casing or similar looking data would say "mismatch" even though the data is meant to be the same? (Cloudflare API might preserve case, for example, does resolver code and intermediate DNS servers do the same? -- I don't actually know the answer offhand, I don't remember enough DNS spec, but an easy question to ask.)

So what should we do about the missing values? Here's one possible approach:

CheckApply (the first time it runs) checks via API to CF. After it succeeds once with "values match, aka cloudflareCheckApply() -> true" then subsequent checks fall back to regular DNS.

Remember, mgmt's default job is not meant to be a "guardian against a sysadmin messing with values", but in this case, we can add a flag (pick a better name?) resolver => false or something to disable the classic DNS resolution at all times.

What do you think?

Some review comments:

  1. If I did Priority => whatever, and state => absent, would Validate() fail? Should it? Are there other examples like that?

  2. I'm still not convinced about purge the way it is. For example, imagine the code:

# when this node is added, it doesn't know there's a "purge" elsewhere.
cloudflare:dns "foo" {
    record_name => "example.com",
    type => "AAAA",
    content => "...",
}

cloudflare:dns "bar" {
    record_name => "example.com",
    type => "AAAA",
    content => "...",
    purge => true,
}

vs

cloudflare:dns "foo" {
    record_name => "example.com",
    type => "AAAA",
    content => "...",
}

cloudflare:dns "bar" {
    record_name => "example.com",
    type => "AAAA",
    content => "...",
}

# in this scenario if anyone anywhere sets something different than this exactly, then it's a runtime error...
cloudflare:dns:purge "example.com" {
    purge => true,
}

So if one person wanted to guarantee never any purges, all you have to do is:

# in this scenario if anyone anywhere sets something different than this exactly, then it's a runtime error...
cloudflare:dns:purge "example.com" {
    purge => false,
}

And you'd never be able to cause a purge.

  1. Shouldn't the resource Name mean something? Why or why not?

  2. PS: Please add an example in examples/lang/

LMK if that all makes sense! Thank you for your work.

This adds a resource to manage DNS on Cloudflare. Making use of the
official Golang API, we add functionality to create, update, edit,
delete records. A function to purge unmanaged records is also available.
Since the REST API does not provide us with an event stream, we make use
of polling to check for changes on Cloudflare.
@lourencovales
Copy link
Contributor Author

Alright, I pushed an improved version of the code. Changes of notes:

  • I changed the name of the resource to cloudflare_dnsmanager to make it more descriptive.
  • I added the examples as requested
  • I changed the logic of the resolution to better approximate what we talked about
  • I also ended up adding a caching layer that did end up reducing API calls by quite a fair bit (more than 50%, but it depends on exactly what we're doing), and to solve the DNS propagation issues I had mentioned earlier
  • I took care of all the corner cases I could find, especially surrounding the TXT records as Cloudflare expects that they are PUT/PATCH with quotes, but it returns them without quotes - this was a big headache to get properly

Was it fun to write the resolver code?

Once I understood how to write one of the lookup functions, everything fell into place, as all the other ones were pretty similar. I wouldn't say "fun" is the right word, but it was a good exercise for sure!

Is there a chance that different casing or similar looking data would say "mismatch" even though the data is meant to be the same?

Yes, and I already accounted for all the corner cases I could find.

If I did Priority => whatever, and state => absent, would Validate() fail? Should it? Are there other examples like that?

It wouldn't fail. It shouldn't fail. I don't follow the question.

And you'd never be able to cause a purge.

This is not what happens. I just tried it in my most recent commit - as long as there's a single purge => true in the mcl, all unmanaged records will be deleted from Cloudflare. This has no bearing on managed resources, as those are handled by the state parameter, which is required to be on all managed resources.

@purpleidea
Copy link
Owner

Looking good, I think just the purge discussion needs to be handled how I discussed unless I missed a reason why it's better the original way?

@lourencovales
Copy link
Contributor Author

Looking good, I think just the purge discussion needs to be handled how I discussed unless I missed a reason why it's better the original way?

Sorry, could you point me to where there's a spec of how you want the purge function to work?

@purpleidea
Copy link
Owner

Sorry, could you point me to where there's a spec of how you want the purge function to work?

purge param, not function, but:

#824 (comment)

@lourencovales
Copy link
Contributor Author

#824 (comment)

I answered this at the end of this comment. So it's not clear to me where you think it needs to change/improve.

@purpleidea purpleidea force-pushed the master branch 2 times, most recently from 3221a93 to 4ad2b9a Compare December 12, 2025 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants