This seems to be a recurring theme, so I thought it'd be worth looking into it for a bit. My own investigation followed from some weirdness that I'd discovered while setting up my blog, which makes the purpose of this post twofold: on one hand I'm recounting some of my own adventures in MP-WP-land; on the other I'm explaining how to approach the thing in a sane manner, assuming there's a problem to solve there.
In particular, the problem I stumbled upon sounded like this:
So, to explain the scenario in more detail: I'm inputting "ampersand-lt-semicolon b ampersand-gt-semicolon this is a quoted bold text ampersand-lt-semicolon slash b ampersand-gt-semicolon", saving, previewing, which gives same as your description, i.e. "left quote b right quote this is a quoted bold text left quote slash b right quote".
However, upon saving, the text in the box is also converted to "left quote b right quote this is a quoted bold text left quote slash b right quote", so if I save and preview again, this is going to yield "this is a quoted bold text" in bold.
This one's a pretty weird one, isn't it? Also, it hasn't got anything to do with previewing things, the whole thing can be reproduced on my setup using the following steps:
- create a new draft post; then
- input (minimally) the text "derpitude &lt;" (sans quotes) in the content field; then finally
- press "save draft".
After the third step, the observed behaviour is that "derpitude &lt;" is magically transformed into "derpitude <", while the expected behaviour is that no transformation should in fact occur. There are a few observations that we can proceed from, so let's take them one by one.
First off, the scenario has two distinct parts, as per the workings of web browsers and the HTTP protocol. In the first part, the user adds some content and then presses a form submit button -- behind the scenes this translates into an HTTP request, which may be examined using e.g. the tools provided in your browser of choice, or by examining the page elements and trying to reproduce the same request using curl. The latter approach is more labour-intensive, so e.g. Firefox's "developer tools" thingie proves to be very useful here. Anyway, once the (in our case POST) request is sent to the server, the server processes it and sends some response.
Which brings us to the second part of the scenario, in which the server has finished processing, so it cooks up a response and sends it back to the client, which processes the response and e.g. renders some shit on the screen. Things may get more complicated if you have JavaScript enabled in your browser, which raises the question of whether the issue can be reproduced on some setup where JS is disabled. Keep this in mind, lest you end up also debugging a bunch of JS along with PHP.
Now the question becomes: where in this "type text; send POST; process; make response; send response; render" pipeline does the transformation occur? We can easily discard the first two steps: for the first, we have JS disabled, so we're confident that no text manipulation occurs while we type stuff; while for the second, we can look at the request content in the browser as we press the button, and... well, as far as I've seen, it looked exactly as expected, i.e. "derpitude &lt;".
The second observation that we can proceed from is that the post editor being part of Wordpress' admin interface, we expect that the processing and response cooking code can be found somewhere in the wp-admin directory of our MP-WP installation. I wouldn't know, to be honest: in fact, I'll readily confess I have very little experience with MP-WP, so I'm stuck digging. Digging where, though? We need to start somewhere, don't we?
Looking at our POST request, we notice that the destination is wp-admin/post.php. The code here is a sausage of conditional switches, so it'll probably take a while to figure out what goes on in there. One intuition coming out of nowhere is that the HTTP POST request sets an "action" variable to the value "editpost", and that there's also a variable named $action in this code -- yes, we're doing a lot of guesswork and quasi-arbitrary inspection; annoying, but what can I do. The easiest way to test that this code path is exercised is to simply add a
echo 'is this printed?'; exit();
statement to our code and rerun the scenario. Noticing that the code behaves as expected, we delete our print and we start reading the thing line by line. It's pretty short, so let's quote it here:
    $post_ID = (int) $_POST['post_ID'];
    check_admin_referer('update-post_' . $post_ID);
    $post_ID = edit_post();
    redirect_post($post_ID);
    exit();
    break;
So the post ID is taken from the client input; then the referer is checked; then the edit_post function is called. This is interesting, but this "edit post" is nowhere in sight, so we're going to go to the MP-WP installation directory and run[1]:
$ grep -rw edit_post | grep function
which outputs exactly one line, which looks exactly like the definition of our function, in wp-admin/includes/post.php. Anyway, looking at the definition of this edit_post, we notice that it does some processing, most of which is a bunch of unknown, then it calls wp_update_post, which is a wrapper over wp_insert_post, which does some sanitization and updates the database with the new content. Speaking of which, let's take a look at how the database looks:
mysql> select post_content from wp_posts where ID = 246;
+-----------------------------------------------------+
| post_content                                        |
+-----------------------------------------------------+
| derpitude &lt;                                      |
+-----------------------------------------------------+
1 row in set (0.00 sec)
This means that the content we've inputted reaches the database as we gave it, so the "post content" side of our scenario works the way we want. This is the good part; the bad part is that we've wasted about half an hour digging into a bunch of code that doesn't reveal anything about our problem. Finally, we've only dug into about half of the code and we don't yet know a thing about how the display side of this loop works.
There isn't a single way to look at this either. We could start from one end, i.e. finding out where the post editor printing code lies and what it does, or from the other, that is, looking at the generated page sources and trying to reverse from there. Unfortunately there is no such thing as a "best" approach here, you'll just have to pick one and work through it as long as the approach looks reasonable, letting past experience guide you.
As far as we're concerned, we'll start where we left off, in our "editpost" snippet above. We observe that edit_post returns a post ID, which is passed as parameter to redirect_post, which sends a redirect to the client, causing the latter to do (this time) a GET with the "post" parameter set to the post ID and the "action" parameter set to "edit". So now we're back to that switch-case sausage, more precisely at the "edit" case; which does a lot of (on the surface) incomprehensible shit, then the following:
    include('edit-form-advanced.php');
i.e. it calls the script that literally prints the page content. So, are you going to read six hundred lines of that?
Of course, we're not interested in all the shit, but we do want to see how our input field is displayed. So we show the page source in the browser, search for "derpitude" (why else do you think I put that there?) and find a div id="postdiv" etc. So we... actually, wait, before anything else, do you notice how our "derpitude" input is displayed there? It's precisely with an "&lt;", only the browser interprets that as a "<" and converts it accordingly when sending the POST, which confirms that the bug is on the display side: the page source should have really contained "&amp;lt;" for the browser to show what we intended it to.
Anyway, we go back to edit-form-advanced.php, search for "postdiv" and we find that it's exactly before a call to a function called the_editor, which after some grepping can be found in wp-includes/general-template.php. There's not much to see here except some content being displayed, and if we debug-print the content here (using the "echo" method above), we'll see that it looks precisely the same as in the page sources.
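The grepping used throughout can be tried out on a toy tree before pointing it at a real installation. What follows is a minimal sketch, not MP-WP itself: the directory layout loosely mimics the one above, but the file contents (including the reedit_post function) are made up purely to show what -w buys us.

```shell
#!/bin/sh
# Toy tree standing in for an MP-WP checkout; file contents are invented.
demo=$(mktemp -d)
mkdir -p "$demo/wp-admin/includes" "$demo/wp-includes"
printf '%s\n' '<?php' 'function edit_post() { return 1; }' \
    > "$demo/wp-admin/includes/post.php"
printf '%s\n' '<?php' 'function reedit_post() { return 2; }' '$id = edit_post();' \
    > "$demo/wp-includes/other.php"

# -r recurses into the tree; -w matches whole words only, so "reedit_post"
# is skipped. The second grep keeps definitions and drops call sites.
hits=$(cd "$demo" && grep -rw edit_post . | grep function)
echo "$hits"

rm -rf "$demo"
```

Without -w the reedit_post definition would also survive the second grep, which is exactly the kind of noise we want to avoid when hunting for a single definition.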
So this is where we stop and think... maybe the displayer code should contain a call to PHP's htmlentities somewhere? At least that's what I'm thinking, that the database should contain the format as given by the user and the displayer should do all the escaping[2]. But really, is there a contract between this the_editor and its callers regarding expected input? Because the function is called from no less than three places (page, comment and post editor) and there's nothing in the description to give us a hint of what goes where.
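To make the round trip concrete, here is a rough sketch of it using sed. Both functions are hypothetical stand-ins: browser_decode only approximates what a browser does when it fills a text box from page source (it knows three entities and nothing else), and escape_for_display plays the role that something like htmlentities would play on the PHP side. Neither is taken from the MP-WP code.

```shell
#!/bin/sh
# Roughly what the browser does when filling a textarea from page source.
# "&amp;" is decoded last, so "&amp;lt;" yields "&lt;" rather than "<".
browser_decode() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Hypothetical display-side fix: escape the stored text before printing it.
# "&" is escaped first, so the entities added afterwards are left alone.
escape_for_display() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

stored='derpitude &lt;'   # what the database holds

unescaped=$(printf '%s' "$stored" | browser_decode)
escaped=$(printf '%s' "$stored" | escape_for_display | browser_decode)

printf 'without escaping, the editor shows: %s\n' "$unescaped"
printf 'with escaping, the editor shows:    %s\n' "$escaped"
```

The first line printed reproduces the bug (the entity collapses into a bare bracket), while the second leaves the text box holding what the user typed, which is the behaviour we were expecting in the first place.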
More generally: it's a well known fact that the Wordpress code has been written by idiots, since it lacks a proper spec of how to use those functions; but at the same time, and weirdly enough, it's very well structured and hacking it is mostly a matter of grepping for function F that does thing X, laying some prints in there and observing its behaviour under user input. The process is rather fast too, there's usually no kernel rebuilding involved -- meaning that yes, you're encouraged to fuck with things and fuck them up, assuming you don't let the fuckups reach production stage.
[1] The -r in grep stands for "recursive". The -w stands for "word", i.e. this instructs our grep to match "edit_post" but not e.g. "reedit_post". There are probably other ways to skin this cat, e.g. installing an indexer for C-like languages, but I'll be damned if I'm going to bother with this.
[2] I'm not going to beat that "HTML is a piece of shit" horse again. I'm sure we understand each other.
This doesn't manage to sound like a bug at all to me; if you want the browser to show something else, write something else, don't expect it to figure it out for you.
Well, notice how my original «should have really contained "&amp;lt;"» got converted to «should have really contained "&lt;"» when you copy-pasted it? If you look at the page source, it does contain the value you pasted, but the browser transforms it, because that's what it's supposed to do, as per the HTML crapitude.
Let's look at this again: the user (by which I mean "I", because no one else seems to have stumbled upon this bad juju) a. types "ampersand lt semicolon" into the editor's text box; b. presses "save draft", upon which c. the browser sends "ampersand lt semicolon" via POST, which d. gets saved ad-litteram in the database by the server, which also e. retrieves it again and sends it back to the browser in a response; then the browser f. takes the response and renders it, and more specifically it g. pre-populates the input field with the content, where all HTML entities are transformed back to "what they mean"; so in particular h. "ampersand lt semicolon" is transformed into "left angular bracket".
> write something else
Now let's say I have a 2K-word post where a subset of these angular brackets are HTML tags, while another subset are supposed to be displayed as actual brackets, so I'm manually converting the "actual brackets" to HTML entities when passing through the content. If you're suggesting that I should do this conversion each time after I press "save draft", then I'm telling you I've done this when I imported the posts and it's utterly fucking insane.
Don't get me wrong, I dislike the in-browser post editor as it is and I have a bullet on my to-do list to write a CLI post editor for MP-WP, but I'd rather have it behave sanely while I'm using it... and the fact that this isn't reproducible on other installs only adds to the weird.